Himeno optimization through CPU parallelism

Goal

Learn how to use Codee to parallelize Himeno, a fluid analysis simulation code, on CPU.

Getting ready

For this demonstration, we will use the Fortran implementation of the Himeno benchmark, a Poisson equation solver. Start by cloning the repository and navigating to the source code:

git clone https://github.com/codee-com/codee-demos.git && \
cd codee-demos/Fortran/Himeno

Walkthrough

1. Run the screening report

To see which Open Catalog recommendations apply to Himeno, run Codee's screening report. Pass --target-arch cpu to include multithreaded CPU checks in the analysis:

Codee command
codee screening --target-arch cpu -- gfortran himeno.f90 -O3
Codee output
Date: 2025-11-06 Codee version: 2025.4 License type: Professional
Compiler invocation: gfortran himeno.f90 -O3

[1/1] himeno.f90 ... Done

SCREENING REPORT

------Number of files------
Total | C C++ Fortran Other
----- | - --- ------- -----
1     | 0 0   1       0

RANKING OF QUALITY CHECKERS

Checker Category                      Priority AutoFixes #  Title
------- ----------------------------- -------- --------- -- ---------------------------------------------------------------------------------------
PWR063  correctness, modern, security P12 (L1)           1  Avoid using legacy Fortran constructs
PWR069  correctness, modern, security P9 (L2)  6         6  Use the keyword only to explicitly state what to import from a module
PWR007  correctness, modern, security P9 (L2)            5  Disable the implicit declaration of variables and procedures
PWR068  correctness, modern, security P9 (L2)            5  Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
PWR001  correctness, modern, security P1 (L3)            5  Pass global variables as function arguments
------- ----------------------------- -------- --------- -- ---------------------------------------------------------------------------------------
Total                                          6         22

RANKING OF OPTIMIZATION CHECKERS

Checker Category Priority AutoFixes # Title
------- -------- -------- --------- - -----------------------------------------------------------------------------------------
PWR054  vector   P12 (L1) 1         1 Consider applying vectorization to scalar reduction loop
PWR051  multi    P6 (L2)  1         1 Consider applying multithreading parallelism to scalar reduction loop
RMK001  multi    P3 (L3)  1         1 Loop nesting that might benefit from hybrid parallelization using multithreading and SIMD
PWR035  memory   P2 (L3)            2 Avoid non-consecutive array access to improve performance
------- -------- -------- --------- - -----------------------------------------------------------------------------------------
Total                     3         5

SUGGESTIONS

Use 'checks' to find out details about the detected checks:
codee checks --target-arch cpu -- gfortran himeno.f90 -O3

1 target file, 7 functions, 5 loops, 214 LOCs successfully analyzed (27 checkers) and 0 non-analyzed files in 247 ms

2. Run the checks report

Follow the suggestions to generate Codee's checks report, which helps identify all places in the code where each check is applicable:

Codee command
codee checks --target-arch cpu -- gfortran himeno.f90 -O3
Codee output
Date: 2025-11-06 Codee version: 2025.4 License type: Professional
Compiler invocation: gfortran himeno.f90 -O3

[1/1] himeno.f90 ... Done

CHECKS REPORT

himeno.f90:295:12 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
himeno.f90 [PWR063] (level: L1): Avoid using legacy Fortran constructs
himeno.f90:67:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:136:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:164:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:223:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:255:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:275:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:44:1 [PWR007] (level: L2): Disable the implicit declaration of variables and procedures
himeno.f90:48:1 [PWR007] (level: L2): Disable the implicit declaration of variables and procedures
himeno.f90:52:1 [PWR007] (level: L2): Disable the implicit declaration of variables and procedures
himeno.f90:56:1 [PWR007] (level: L2): Disable the implicit declaration of variables and procedures
himeno.f90:60:1 [PWR007] (level: L2): Disable the implicit declaration of variables and procedures
himeno.f90:80:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:83:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:84:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:99:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:154:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
himeno.f90:295:12 [RMK001] (level: L3): Loop nesting that might benefit from hybrid parallelization using multithreading and SIMD
himeno.f90:293:6 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:294:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:136:1 [PWR001] (level: L3): Pass global variables as function arguments
himeno.f90:164:1 [PWR001] (level: L3): Pass global variables as function arguments
himeno.f90:223:1 [PWR001] (level: L3): Pass global variables as function arguments
himeno.f90:255:1 [PWR001] (level: L3): Pass global variables as function arguments
himeno.f90:275:1 [PWR001] (level: L3): Pass global variables as function arguments

SUGGESTIONS

Use --check-id and --verbose to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR063 --verbose --target-arch cpu -- gfortran himeno.f90 -O3

1 target file, 7 functions, 5 loops, 214 LOCs successfully analyzed (27 checkers) and 0 non-analyzed files in 242 ms
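
The checks report lists one line per occurrence, so the number of lines per checker should match the "#" column of the screening report. The following is a minimal sketch for tallying them, assuming the report text is piped in on standard input. The helper name count_checks is hypothetical and is not part of Codee; the sample lines are copied from the report above.

```shell
#!/bin/sh
# count_checks (hypothetical helper, not a Codee command):
# read checks-report text on stdin and print "<checker ID> <occurrences>",
# sorted by checker ID.
count_checks() {
  awk '$2 ~ /^\[/ {
         id = substr($2, 2, length($2) - 2)   # strip the surrounding brackets
         count[id]++
       }
       END { for (id in count) print id, count[id] }' | sort
}

# Usage with a few lines copied from the checks report.
count_checks <<'EOF'
himeno.f90:295:12 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
himeno.f90:293:6 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:294:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
EOF
```

For these four sample lines the sketch prints "PWR035 2", "PWR051 1" and "PWR054 1", one per line. Lines whose second field is not a bracketed checker ID (the report header, the SUGGESTIONS section) are ignored.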

We can also request the detailed output of the checks report (option --verbose) to obtain more information about each suggestion, including links to the Open Catalog and the precise location in the source code. This additional information can be overwhelming when many checkers are reported, so use the --check-id flag to restrict the output to the checkers of interest.

As an example, let's focus on the checker PWR051, related to parallelizing a loop with multithreading:

Codee command
codee checks --verbose --target-arch cpu --check-id PWR051 -- gfortran himeno.f90 -O3
Codee output
Date: 2025-11-06 Codee version: 2025.4 License type: Professional
Compiler invocation: gfortran himeno.f90 -O3

[1/1] himeno.f90 ... Done

CHECKS REPORT

himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation:
https://open-catalog.codee.com/Checks/PWR051
AutoFix (choose one option):
* Using OpenMP 'for' with built-in reduction (recommended):
codee rewrite --check-id pwr051 --variant omp-for --in-place himeno.f90:293:6 -- gfortran himeno.f90 -O3
* Using OpenMP 'taskwait':
codee rewrite --check-id pwr051 --variant omp-taskwait --in-place himeno.f90:293:6 -- gfortran himeno.f90 -O3
* Using OpenMP 'taskloop':
codee rewrite --check-id pwr051 --variant omp-taskloop --in-place himeno.f90:293:6 -- gfortran himeno.f90 -O3

1 target file, 7 functions, 5 loops, 214 LOCs successfully analyzed (1 checker) and 0 non-analyzed files in 172 ms

3. Autofix

Let's use Codee's autofix capabilities to automatically optimize the code with OpenMP. Copy-paste the suggested Codee invocation, and replace the --in-place argument with -o to create a new file with the modification:

Codee command
codee rewrite --check-id pwr051 --variant omp-for -o himeno_codee.f90 himeno.f90:293:6 -- gfortran himeno.f90 -O3
Codee output
Date: 2025-11-06 Codee version: 2025.4 License type: Professional
Compiler invocation: gfortran himeno.f90 -O3

[1/1] himeno.f90 ... Done

Results for file 'himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using multi-threading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
[INFO] himeno.f90:293:6 Parallel forall: variable 'wrk2'
[INFO] himeno.f90:293:6 Available parallelization strategies for variable 'gosa'
[INFO] himeno.f90:293:6 #1 OpenMP scalar reduction (* implemented)
[INFO] himeno.f90:293:6 #2 OpenMP atomic access
[INFO] himeno.f90:293:6 #3 OpenMP explicit privatization
[INFO] himeno.f90:293:6 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] himeno.f90:293:6 Parallel region defined by OpenMP directive 'parallel'

Successfully created himeno_codee.f90

Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities

4. Execution

Compile the original source code of Himeno (himeno.f90) and the optimized version (himeno_codee.f90) to compare their performance. For instance, using the gfortran compiler:

Compiler commands
gfortran himeno.f90 -o himeno -O3 && \
gfortran himeno_codee.f90 -o himeno_codee -O3 -fopenmp

Then run the original executable (himeno) and the optimized one (himeno_codee), choosing the L input dataset size for both:

./himeno
  mimax=         513  mjmax=         257  mkmax=         257
imax= 512 jmax= 256 kmax= 256
Time measurement accuracy : .10000E-02
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 7715.22363 time(s): 0.43500000000000000 4.88281250E-04
Now, start the actual measurement process.
The loop will be excuted in 413 times.
This will take about one minute.
Wait for a while.
Loop executed for 413 times
Gosa : 4.88281250E-04
MFLOPS: 7730.45605 time(s): 59.767000000000003
Score based on Pentium III 600MHz : 93.3179169
./himeno_codee
  mimax=         513  mjmax=         257  mkmax=         257
imax= 512 jmax= 256 kmax= 256
Time measurement accuracy : .10000E-02
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 17036.1523 time(s): 0.19700000000000001 8.46543233E-04
Now, start the actual measurement process.
The loop will be excuted in 913 times.
This will take about one minute.
Wait for a while.
Loop executed for 913 times
Gosa : 6.00490777E-04
MFLOPS: 17406.5215 time(s): 58.678000000000004
Score based on Pentium III 600MHz : 210.122192

The performance has increased from 7730 MFLOPS to 17406 MFLOPS, representing a 2.25X speedup.
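
The speedup figure can be reproduced from the benchmark output. The following is a minimal sketch, assuming the text of each run is available on standard input. The helper names final_mflops and speedup are hypothetical, and the sample lines are the MFLOPS lines copied from the two runs above.

```shell
#!/bin/sh
# final_mflops (hypothetical helper): print the value of the last
# "MFLOPS:" line read on stdin, i.e. the actual measurement rather
# than the rehearsal.
final_mflops() {
  awk '$1 == "MFLOPS:" { value = $2 } END { print value }'
}

# speedup BASE OPT (hypothetical helper): print OPT / BASE with two decimals.
speedup() {
  awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", opt / base }'
}

# MFLOPS lines copied from the original and the optimized run.
base=$(printf '%s\n' \
  'MFLOPS: 7715.22363 time(s): 0.43500000000000000 4.88281250E-04' \
  'MFLOPS: 7730.45605 time(s): 59.767000000000003' | final_mflops)
opt=$(printf '%s\n' \
  'MFLOPS: 17036.1523 time(s): 0.19700000000000001 8.46543233E-04' \
  'MFLOPS: 17406.5215 time(s): 58.678000000000004' | final_mflops)

speedup "$base" "$opt"
```

With the figures above, this prints 2.25, since 17406.5215 / 7730.45605 is approximately 2.2517.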