Himeno optimization through CPU parallelism
Learn how to use Codee to parallelize Himeno, a fluid analysis simulation code, on CPU.
This guide is part of the NERSC + Codee Training Series 2024. Code available for download at the previous link.
Getting started
First, navigate to the source code for Himeno:
cd codee-demos/Fortran/Himeno
Next, load the latest Codee version available on Perlmutter:
module load codee/2024.3.1
Walkthrough
1. Run the screening report
To explore which Open Catalog recommendations apply to Himeno, run Codee's screening report. Pass --target-arch cpu to include multithreaded CPU checks in the analysis:
codee screening --target-arch cpu -- gfortran himeno.f90 -O3
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gfortran himeno.f90 -O3
[1/1] himeno.f90 ... Done
SCREENING REPORT
---Number of files---
Total | C C++ Fortran
----- | - --- -------
    1 | 0   0       1
Lines of code  Analysis time  # checks  Profiling
-------------  -------------  --------  ---------
214            234 ms         23        n/a
Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analysis time : time required to analyze the target
# checks : total actionable items (opportunities, recommendations, defects and remarks) detected
Profiling : estimation of overall execution time required by this target
RANKING OF CHECKERS
Checker Priority AutoFix # Title
------- -------- ------- - ------------------------------------------------------------------------------------------------
PWR068  P27 (L1)         6 Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
PWR054  P12 (L1) x       1 Consider applying vectorization to scalar reduction loop
PWR063  P12 (L1)         1 Avoid using legacy Fortran constructs
PWR051  P6 (L2)  x       1 Consider applying multithreading parallelism to scalar reduction loop
PWR069  P3 (L3)          6 Use the keyword only to explicitly state what to import from a module
PWR001  P3 (L3)          5 Declare global variables as function parameters
RMK001  P3 (L3)  x       1 Loop nesting that might benefit from hybrid parallelization using multithreading and SIMD
PWR035  P2 (L3)          2 Avoid non-consecutive array access to improve performance
SUGGESTIONS
Use 'roi' to get a return of investment estimation report:
codee roi --target-arch cpu -- gfortran himeno.f90 -O3
Use 'checks' to find out details about the detected checks:
codee checks --target-arch cpu -- gfortran himeno.f90 -O3
1 file, 7 functions, 5 loops successfully analyzed (23 checkers) and 0 non-analyzed files in 235 ms
2. Run the checks report
Follow the suggestions to generate Codee's checks report, which helps identify all places in the code where each check is applicable:
codee checks --target-arch cpu -- gfortran himeno.f90 -O3
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gfortran himeno.f90 -O3
[1/1] himeno.f90 ... Done
CHECKS REPORT
himeno.f90:136:1 [PWR068] (level: L1): Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:164:1 [PWR068] (level: L1): Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:223:1 [PWR068] (level: L1): Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:255:1 [PWR068] (level: L1): Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:275:1 [PWR068] (level: L1): Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:325:1 [PWR068] (level: L1): Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:295:12 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
himeno.f90 [PWR063] (level: L1): Avoid using legacy Fortran constructs
himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
himeno.f90:136:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:164:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:223:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:255:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:275:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:67:1 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
himeno.f90:136:1 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
himeno.f90:164:1 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
himeno.f90:223:1 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
himeno.f90:255:1 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
himeno.f90:275:1 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
himeno.f90:295:12 [RMK001] (level: L3): Loop nesting that might benefit from hybrid parallelization using multithreading and SIMD
himeno.f90:293:6 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:294:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
SUGGESTIONS
Use --verbose to get more details, e.g:
codee checks --verbose --target-arch cpu -- gfortran himeno.f90 -O3
Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR068 --target-arch cpu -- gfortran himeno.f90 -O3
1 file, 7 functions, 5 loops successfully analyzed (23 checkers) and 0 non-analyzed files in 232 ms
The --verbose option produces a more detailed checks report with additional information about each suggestion, including links to the Open Catalog and the precise location in the source code. This extra detail can be overwhelming when many checkers are reported; to prevent this, use the --check-id flag to filter the output.
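When deciding which checker to pass to --check-id, it can help to count how many findings each checker produced. The following is a minimal sketch, not a Codee feature: tally_checks is a hypothetical helper name, and it assumes each finding line carries its checker ID in square brackets, as in the report above.

```shell
# Hypothetical helper: count findings per checker ID from a checks report on stdin.
tally_checks() {
  awk 'match($0, /\[[A-Z]+[0-9]+\]/) {
         id = substr($0, RSTART + 1, RLENGTH - 2)   # strip the surrounding brackets
         count[id]++
       }
       END { for (id in count) print count[id], id }' |
    sort -k1,1nr -k2,2
}

# Three lines copied from the report above stand in for the full output.
tally_checks <<'EOF'
himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
himeno.f90:293:6 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:294:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
EOF
```

For the three sample lines this prints a count of 2 for PWR035 followed by a count of 1 for PWR051. In practice, the output of the codee checks command shown earlier could be piped into tally_checks instead of the sample text.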
As an example, let's focus on the checker PWR051, related to parallelizing a loop with multithreading:
codee checks --verbose --target-arch cpu --check-id PWR051 -- gfortran himeno.f90 -O3
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gfortran himeno.f90 -O3
[1/1] himeno.f90 ... Done
CHECKS REPORT
himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR051
AutoFix (choose one option):
* Using OpenMP 'for' with built-in reduction (recommended):
codee rewrite --multi omp-for --in-place himeno.f90:293:6 -- gfortran himeno.f90 -O3
* Using OpenMP 'taskwait':
codee rewrite --multi omp-taskwait --in-place himeno.f90:293:6 -- gfortran himeno.f90 -O3
* Using OpenMP 'taskloop':
codee rewrite --multi omp-taskloop --in-place himeno.f90:293:6 -- gfortran himeno.f90 -O3
1 file, 7 functions, 5 loops successfully analyzed (1 checker) and 0 non-analyzed files in 186 ms
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code with OpenMP. Copy the suggested Codee invocation and replace the --in-place argument with -o to write the modified code to a new file:
codee rewrite --multi omp-for -o himeno_codee.f90 himeno.f90:293:6 -- gfortran himeno.f90 -O3
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gfortran himeno.f90 -O3
Results for file '/global/homes/u/user/codee-demos/Fortran/Himeno/himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using multi-threading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
[INFO] himeno.f90:293:6 Parallel forall: variable 'wrk2'
[INFO] himeno.f90:293:6 Available parallelization strategies for variable 'gosa'
[INFO] himeno.f90:293:6 #1 OpenMP scalar reduction (* implemented)
[INFO] himeno.f90:293:6 #2 OpenMP atomic access
[INFO] himeno.f90:293:6 #3 OpenMP explicit privatization
[INFO] himeno.f90:293:6 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] himeno.f90:293:6 Parallel region defined by OpenMP directive 'parallel'
Successfully created himeno_codee.f90
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
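To inspect what the rewrite changed, one option is to list the OpenMP directive lines in the generated file. This is a minimal sketch under two assumptions: the generated file is free-form Fortran, where OpenMP directives begin with the !$omp sentinel, and list_omp_directives is a hypothetical helper name rather than a Codee command.

```shell
# Hypothetical helper: print each OpenMP directive line of a free-form Fortran
# file, prefixed with its line number. Matching is case-insensitive.
list_omp_directives() {
  grep -n -i '^[[:space:]]*!\$omp' "$1"
}

# Only inspect the rewritten file if it exists in the current directory.
if [ -f himeno_codee.f90 ]; then
  list_omp_directives himeno_codee.f90
fi
```

Running diff -u himeno.f90 himeno_codee.f90 is an alternative when the surrounding context of each change is also of interest.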
4. Execution
Compile the original source code of Himeno (himeno.f90) and the optimized version (himeno_codee.f90) to compare their performance. For instance, using the gfortran compiler (the optimized build needs -fopenmp to enable the OpenMP directives):
gfortran himeno.f90 -o himeno -O3 && \
gfortran himeno_codee.f90 -o himeno_codee -O3 -fopenmp
Then run the original executable (himeno) and the optimized one (himeno_codee), choosing the L input dataset size. The first output below is from himeno and the second is from himeno_codee:
mimax= 513 mjmax= 257 mkmax= 257
imax= 512 jmax= 256 kmax= 256
Time measurement accuracy : .10000E-02
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 7715.22363 time(s): 0.43500000000000000 4.88281250E-04
Now, start the actual measurement process.
The loop will be excuted in 413 times.
This will take about one minute.
Wait for a while.
Loop executed for 413 times
Gosa : 4.88281250E-04
MFLOPS: 7730.45605 time(s): 59.767000000000003
Score based on Pentium III 600MHz : 93.3179169
mimax= 513 mjmax= 257 mkmax= 257
imax= 512 jmax= 256 kmax= 256
Time measurement accuracy : .10000E-02
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 17036.1523 time(s): 0.19700000000000001 8.46543233E-04
Now, start the actual measurement process.
The loop will be excuted in 913 times.
This will take about one minute.
Wait for a while.
Loop executed for 913 times
Gosa : 6.00490777E-04
MFLOPS: 17406.5215 time(s): 58.678000000000004
Score based on Pentium III 600MHz : 210.122192
The performance has increased from 7730 MFLOPS to 17406 MFLOPS, representing a 2.25X speedup.
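The speedup is the ratio of the two MFLOPS figures reported by the runs above. As a quick check, the arithmetic can be sketched in the shell (the variable names are illustrative):

```shell
# MFLOPS values copied from the final measurement of each run above.
baseline_mflops=7730.45605
optimized_mflops=17406.5215

# Speedup = optimized throughput / baseline throughput, rounded to two decimals.
speedup=$(awk -v b="$baseline_mflops" -v o="$optimized_mflops" \
  'BEGIN { printf "%.2f", o / b }')
echo "Speedup: ${speedup}X"
```

This prints "Speedup: 2.25X", matching the figure quoted above.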