
Himeno (Fortran)

Acknowledgment

We gratefully acknowledge the NERSC team for providing access to their platform, which enabled us to conduct our experiments with the Cray Compiler Environment (CCE).

Goal

Learn how to use Codee to optimize Himeno, a fluid analysis simulation code, by parallelizing its computations on the CPU with OpenMP.

Prerequisites

Ensure you have the following:

  • Codee installed
  • The Cray Compiler Environment (CCE)
  • The codee-demos repository cloned on your machine

To clone the necessary repository, just execute the following in your terminal:

git clone https://github.com/codee-com/codee-demos.git

Getting started

First, navigate to the source code for Himeno:

cd codee-demos/Fortran/Himeno

Walkthrough

1. Explore the source code

For the optimization, we will focus on the following triple-nested do loop inside the main iteration loop of himeno.f90:

  do loop=1,nn
    gosa= 0.0
    do k=2,kmax-1
      do j=2,jmax-1
        do i=2,imax-1
          s0=a(I,J,K,1)*p(I+1,J,K) &
            +a(I,J,K,2)*p(I,J+1,K) &
            +a(I,J,K,3)*p(I,J,K+1) &
            +b(I,J,K,1)*(p(I+1,J+1,K)-p(I+1,J-1,K) &
                        -p(I-1,J+1,K)+p(I-1,J-1,K)) &
            +b(I,J,K,2)*(p(I,J+1,K+1)-p(I,J-1,K+1) &
                        -p(I,J+1,K-1)+p(I,J-1,K-1)) &
            +b(I,J,K,3)*(p(I+1,J,K+1)-p(I-1,J,K+1) &
                        -p(I+1,J,K-1)+p(I-1,J,K-1)) &
            +c(I,J,K,1)*p(I-1,J,K) &
            +c(I,J,K,2)*p(I,J-1,K) &
            +c(I,J,K,3)*p(I,J,K-1)+wrk1(I,J,K)
          ss=(s0*a(I,J,K,4)-p(I,J,K))*bnd(I,J,K)
          GOSA=GOSA+SS*SS
          wrk2(I,J,K)=p(I,J,K)+OMEGA *SS
        enddo
      enddo
    enddo
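
Two properties of this loop nest make it a good parallelization target: every iteration writes a distinct element of wrk2, and the only value carried across iterations is the scalar gosa, which is accumulated with the associative + operator, i.e. a scalar sum reduction. The minimal, self-contained sketch below (not part of Himeno; the stencil is a simplified stand-in) illustrates that same pattern, which Codee will detect and parallelize in the next steps:

program reduction_pattern
  ! Minimal illustration of the scalar-reduction pattern found in the Himeno kernel
  implicit none
  integer, parameter :: n = 32
  real :: p(n,n,n), wrk2(n,n,n), gosa, ss
  integer :: i, j, k

  call random_number(p)

  gosa = 0.0
  do k = 2, n-1
    do j = 2, n-1
      do i = 2, n-1
        ! Stand-in for the Himeno stencil: a value computed only from read-only data
        ss = p(i+1,j,k) + p(i,j+1,k) + p(i,j,k+1) - 3.0*p(i,j,k)
        gosa = gosa + ss*ss          ! scalar sum reduction on 'gosa'
        wrk2(i,j,k) = p(i,j,k) + ss  ! each iteration writes a distinct element
      end do
    end do
  end do

  print *, 'gosa =', gosa
end program reduction_pattern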

2. Run the checks report

Note

It is recommended to run the screening report first to obtain a ranking of the checkers, which can help you decide which checks to address first.

To explore how Codee can help speed up this loop by parallelizing it, use --target-arch to include CPU-related checks in the analysis:

Codee command
codee checks --target-arch cpu -- ftn himeno.f90 -O3
Codee output
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: ftn himeno.f90 -O3

[1/1] himeno.f90 ... Done

CHECKS REPORT

himeno.f90 [RMK015] (level: L1): Tune compiler optimization flags to increase the speed of the code
himeno.f90:295:12 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
himeno.f90 [PWR063] (level: L1): Avoid using legacy Fortran constructs
himeno.f90:80:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:83:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:84:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:99:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:154:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:67:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:136:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:164:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:223:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:255:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:275:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
himeno.f90:295:12 [RMK001] (level: L3): Loop nesting that might benefit from hybrid parallelization using multithreading and SIMD
himeno.f90:293:6 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:294:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:136:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:164:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:223:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:255:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:275:1 [PWR001] (level: L3): Declare global variables as function parameters

SUGGESTIONS

Use --verbose to get more details, e.g:
codee checks --verbose --target-arch cpu -- ftn himeno.f90 -O3

Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id RMK015 --target-arch cpu -- ftn himeno.f90 -O3

1 file, 7 functions, 5 loops, 214 LOCs successfully analyzed (23 checkers) and 0 non-analyzed files in 1007 ms

We can also generate the detailed checks report (option --verbose) to obtain more information about each suggestion. This detailed output includes links to the Open Catalog, along with the precise location of each issue in the source code. All this additional information can be overwhelming when many checkers are reported; to keep it manageable, use the --check-id flag to filter the output.

As an example, let's focus on the checker PWR051, related to parallelizing a loop with multithreading:

Codee command
codee checks --verbose --target-arch cpu --check-id PWR051 -- ftn himeno.f90 -O3
Codee output
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: ftn himeno.f90 -O3

[1/1] himeno.f90 ... Done

CHECKS REPORT

himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR051
AutoFix (choose one option):
* Using OpenMP 'for' with built-in reduction (recommended):
codee rewrite --check-id pwr051 --variant omp-for --in-place himeno.f90:293:6 -- ftn himeno.f90 -O3
* Using OpenMP 'taskwait':
codee rewrite --check-id pwr051 --variant omp-taskwait --in-place himeno.f90:293:6 -- ftn himeno.f90 -O3
* Using OpenMP 'taskloop':
codee rewrite --check-id pwr051 --variant omp-taskloop --in-place himeno.f90:293:6 -- ftn himeno.f90 -O3

1 file, 7 functions, 5 loops, 214 LOCs successfully analyzed (1 checker) and 0 non-analyzed files in 385 ms

3. Autofix

Let's use Codee's autofix capabilities to automatically optimize the code with OpenMP. Copy-paste the suggested Codee invocation, and replace the --in-place argument with -o to create a new file with the modification:

Codee command
codee rewrite --check-id pwr051 --variant omp-for himeno.f90:293:6 -o himeno_codee.f90 -- ftn himeno.f90 -O3
Codee output
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: ftn himeno.f90 -O3

[1/1] himeno.f90 ... Done

Results for file 'himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using multi-threading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
[INFO] himeno.f90:293:6 Parallel forall: variable 'wrk2'
[INFO] himeno.f90:293:6 Available parallelization strategies for variable 'gosa'
[INFO] himeno.f90:293:6 #1 OpenMP scalar reduction (* implemented)
[INFO] himeno.f90:293:6 #2 OpenMP atomic access
[INFO] himeno.f90:293:6 #3 OpenMP explicit privatization
[INFO] himeno.f90:293:6 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] himeno.f90:293:6 Parallel region defined by OpenMP directive 'parallel'

Successfully created himeno_codee.f90

Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
diff -C 2 himeno.f90 himeno_codee.f90
*** himeno.f90  2024-11-06 06:58:49.054808000 -0800
--- himeno_codee.f90 2025-04-16 01:05:36.947038000 -0700
***************
*** 291,294 ****
--- 291,298 ----
    do loop=1,nn
      gosa= 0.0
+ ! Codee: Loop modified by Codee (2025-04-16 01:05:36)
+ ! Codee: Technique applied: multithreading with 'omp-for' pragmas
+ !$omp parallel default(none) shared(a, b, bnd, c, gosa, imax, jmax, kmax, p, wrk1, wrk2) private(i, j, k, s0, ss)
+ !$omp do private(i, j, s0, ss) reduction(+: gosa) schedule(auto)
      do k=2,kmax-1
        do j=2,jmax-1
***************
*** 312,315 ****
--- 316,320 ----
          enddo
        enddo
+ !$omp end parallel
  !
      p(2:imax-1,2:jmax-1,2:kmax-1)= &
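
Reading the diff together with the kernel from step 1: the whole k/j/i nest is enclosed in an OpenMP parallel region, the !$omp do directive distributes the iterations of the k loop across threads, reduction(+: gosa) gives each thread a private partial sum of gosa that is combined when the loop finishes, and i, j, s0 and ss are privatized so the threads do not interfere with each other. A rough reconstruction of the rewritten loop (the body is unchanged from step 1; the exact formatting generated by Codee may differ) looks like this:

  do loop=1,nn
    gosa= 0.0
    !$omp parallel default(none) shared(a, b, bnd, c, gosa, imax, jmax, kmax, p, wrk1, wrk2) private(i, j, k, s0, ss)
    !$omp do private(i, j, s0, ss) reduction(+: gosa) schedule(auto)
    do k=2,kmax-1
      do j=2,jmax-1
        do i=2,imax-1
          ! ... same stencil body as in step 1 (s0, ss, gosa, wrk2) ...
        enddo
      enddo
    enddo
    !$omp end parallel
    ! ... the update of p and the rest of the outer iteration follow ...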

4. Execution

Compile the original source code of Himeno (himeno.f90) and the optimized version (himeno_codee.f90) to compare their performance. For instance, using the Cray ftn compiler:

Compiler commands
ftn himeno.f90 -o himeno -O3 && \
ftn himeno_codee.f90 -o himeno_codee -O3 -fopenmp

To run them, let's request access to an interactive node:

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu
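
Since himeno_codee is an OpenMP binary, you may also want to set the number of threads explicitly before running it; the appropriate value depends on the cores available on the allocated node (64 below is just an example):

export OMP_NUM_THREADS=64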

Then run the original executable (himeno) and the optimized one (himeno_codee), choosing the L input dataset size:

./himeno
  mimax= 513  mjmax= 257  mkmax= 257
imax= 512 jmax= 256 kmax= 256
Time measurement accuracy : .66991E-05
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 14399.752 time(s): 0.23306804935889708, 8.573809755E-4
Now, start the actual measurement process.
The loop will be excuted in 772 times.
This will take about one minute.
Wait for a while.
Loop executed for 772 times
Gosa : 6.248097052E-4
MFLOPS: 14177.5322 time(s): 60.916247973525195
Score based on Pentium III 600MHz : 171.143555
./himeno_codee
  mimax=         513  mjmax=         257  mkmax=         257
imax= 512 jmax= 256 kmax= 256
Time measurement accuracy : .10000E-02
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 17036.1523 time(s): 0.19700000000000001 8.46543233E-04
Now, start the actual measurement process.
The loop will be excuted in 913 times.
This will take about one minute.
Wait for a while.
Loop executed for 913 times
Gosa : 6.00490777E-04
MFLOPS: 17406.5215 time(s): 58.678000000000004
Score based on Pentium III 600MHz : 210.122192

5. Results

The performance has increased from 14177 MFLOPS to 17406 MFLOPS, representing a 1.23x speedup.