Himeno (Fortran)
Learn how to use Codee to optimize Himeno, a fluid analysis simulation code, by parallelizing computations on CPU with OpenMP.
This guide was made using an Azure HBv4 machine and the AMD compilers. These steps have also been tested with GNU and Intel compilers, so you can follow along by replacing the corresponding compilation flags.
Optimal performance is typically achieved using 48, 96, or 144 threads. For a closer look at the architecture of the machine used, see the official Microsoft page, HBv4-series virtual machine overview.
Prerequisites
Ensure you have:
- Access to an Azure machine and Codee installed on it.
- AMD flang compiler.
- The codee-demos repository on your machine.
To clone the repository, run the following in your terminal:
git clone https://github.com/codee-com/codee-demos.git
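Before continuing, it can help to confirm that the tools listed in the prerequisites are on your PATH. The sketch below is one way to do that. The count_missing helper is hypothetical and not part of Codee, and it assumes the AMD compiler is invoked as flang, as in the commands used throughout this guide.

```shell
# Hypothetical helper (not part of Codee): report which of the given
# commands are not on PATH and print how many are missing.
count_missing() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$missing"
}

# Tools this guide relies on; adjust the compiler name for GNU or Intel.
count_missing git codee flang
```

A printed count of 0 means all three commands were found.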
Getting started
First, navigate to the source code for Himeno:
cd codee-demos/Fortran/Himeno
Walkthrough
1. Explore the source code
For the optimization, we will focus on this triple-nested do loop within himeno.f90:
  do loop=1,nn
     gosa= 0.0
     do k=2,kmax-1
        do j=2,jmax-1
           do i=2,imax-1
              s0=a(I,J,K,1)*p(I+1,J,K) &
                   +a(I,J,K,2)*p(I,J+1,K) &
                   +a(I,J,K,3)*p(I,J,K+1) &
                   +b(I,J,K,1)*(p(I+1,J+1,K)-p(I+1,J-1,K) &
                               -p(I-1,J+1,K)+p(I-1,J-1,K)) &
                   +b(I,J,K,2)*(p(I,J+1,K+1)-p(I,J-1,K+1) &
                               -p(I,J+1,K-1)+p(I,J-1,K-1)) &
                   +b(I,J,K,3)*(p(I+1,J,K+1)-p(I-1,J,K+1) &
                               -p(I+1,J,K-1)+p(I-1,J,K-1)) &
                   +c(I,J,K,1)*p(I-1,J,K) &
                   +c(I,J,K,2)*p(I,J-1,K) &
                   +c(I,J,K,3)*p(I,J,K-1)+wrk1(I,J,K)
              ss=(s0*a(I,J,K,4)-p(I,J,K))*bnd(I,J,K)
              GOSA=GOSA+SS*SS
              wrk2(I,J,K)=p(I,J,K)+OMEGA *SS
           enddo
        enddo
     enddo
2. Run the checks report
It is recommended to run the screening report first to obtain a ranking of the checkers, which can help you decide which one to implement first.
To explore how Codee can help speed up this loop by parallelizing it,
use --target-arch to include CPU-related checks in the analysis:
codee checks --target-arch cpu -- flang himeno.f90 -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: flang himeno.f90 -O3
[1/1] himeno.f90 ... Done
CHECKS REPORT
himeno.f90:295:12 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
himeno.f90:80:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:83:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:84:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:99:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:154:3 [PWR068] (level: L2): Encapsulate procedures within modules to avoid the risks of calling implicit interfaces
himeno.f90:67:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:136:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:164:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:223:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:255:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:275:1 [PWR069] (level: L2): Use the keyword only to explicitly state what to import from a module
himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
himeno.f90:295:12 [RMK001] (level: L3): Loop nesting that might benefit from hybrid parallelization using multithreading and SIMD
himeno.f90:293:6 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:294:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
himeno.f90:136:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:164:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:223:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:255:1 [PWR001] (level: L3): Declare global variables as function parameters
himeno.f90:275:1 [PWR001] (level: L3): Declare global variables as function parameters
SUGGESTIONS
  Use --verbose to get more details, e.g:
        codee checks --verbose --target-arch cpu -- flang himeno.f90 -O3
  Use --check-id to focus on specific subsets of checkers, e.g.:
        codee checks --check-id PWR054 --target-arch cpu -- flang himeno.f90 -O3
1 file, 7 functions, 5 loops, 213 LOCs successfully analyzed (21 checkers) and 0 non-analyzed files in 220 ms
We can also run the detailed output of the checks report (option --verbose)
to obtain more information about each suggestion. This detailed output includes
links to the Open Catalog, along with the precise location in the source code.
All this additional information can be overwhelming when many checkers are
reported. To prevent this, use the --check-id flag to filter the output.
As an example, let's focus on the checker PWR051, related to parallelizing a loop with multithreading:
codee checks --verbose --target-arch cpu --check-id PWR051 -- flang himeno.f90 -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: flang himeno.f90 -O3
[1/1] himeno.f90 ... Done
CHECKS REPORT
himeno.f90:293:6 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
  Suggestion: Use 'rewrite' to automatically optimize the code
  Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR051
  AutoFix (choose one option):
    * Using OpenMP 'for' with built-in reduction (recommended):
        codee rewrite --check-id pwr051 --variant omp-for --in-place himeno.f90:293:6 -- flang himeno.f90 -O3
    * Using OpenMP 'taskwait':
        codee rewrite --check-id pwr051 --variant omp-taskwait --in-place himeno.f90:293:6 -- flang himeno.f90 -O3
    * Using OpenMP 'taskloop':
        codee rewrite --check-id pwr051 --variant omp-taskloop --in-place himeno.f90:293:6 -- flang himeno.f90 -O3
1 file, 7 functions, 5 loops, 213 LOCs successfully analyzed (1 checker) and 0 non-analyzed files in 171 ms
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code with
OpenMP. Copy-paste the suggested Codee invocation, and replace the --in-place
argument with -o to create a new file with the modification:
codee rewrite --check-id pwr051 --variant omp-for -o himeno_codee.f90 himeno.f90:293:6 -- flang himeno.f90 -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: flang himeno.f90 -O3
[1/1] himeno.f90 ... Done
Results for file '/home/codee/codee-demos/Fortran/Himeno/himeno.f90':
  Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using multi-threading]:
      [INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
      [INFO] himeno.f90:293:6 Parallel forall: variable 'wrk2'
      [INFO] himeno.f90:293:6 Available parallelization strategies for variable 'gosa'
      [INFO] himeno.f90:293:6   #1 OpenMP scalar reduction (* implemented)
      [INFO] himeno.f90:293:6   #2 OpenMP atomic access
      [INFO] himeno.f90:293:6   #3 OpenMP explicit privatization
      [INFO] himeno.f90:293:6 Loop parallelized with multithreading using OpenMP directive 'for'
      [INFO] himeno.f90:293:6 Parallel region defined by OpenMP directive 'parallel'
Successfully created himeno_codee.f90
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
*** himeno.f90  2025-03-31 17:03:33.476802469 +0200
--- himeno_codee.f90    2025-04-15 11:18:37.632008756 +0200
***************
*** 291,294 ****
--- 291,298 ----
    do loop=1,nn
       gosa= 0.0
+      ! Codee: Loop modified by Codee (2025-04-15 11:18:37)
+      ! Codee: Technique applied: multithreading with 'omp-for' pragmas
+      !$omp parallel default(none) shared(a, b, bnd, c, gosa, imax, jmax, kmax, p, wrk1, wrk2) private(i, j, k, s0, ss)
+      !$omp do private(i, j, s0, ss) reduction(+: gosa) schedule(auto)
       do k=2,kmax-1
          do j=2,jmax-1
***************
*** 312,315 ****
--- 316,320 ----
          enddo
       enddo
+      !$omp end parallel
  !
       p(2:imax-1,2:jmax-1,2:kmax-1)= &
4. Execution
Compile the original source code of Himeno (himeno.f90) and the optimized
version (himeno_codee.f90) to compare their performance. For instance, using
the AMD flang compiler:
flang himeno.f90 -o himeno -O3 && \
    flang himeno_codee.f90 -o himeno_codee -O3 -fopenmp
Then run the original executable (himeno) and the optimized one
(himeno_codee), choosing the L input dataset size and using 48 threads.
The number of threads was chosen based on experimentation; for more
details about the architecture of the machine, see
HBv4-series virtual machine overview.
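The run described above could be scripted roughly as follows. This is a sketch under two assumptions: that the benchmark reads the grid size from standard input, and that the thread count is set through the standard OpenMP environment variable OMP_NUM_THREADS. The run_himeno helper is a hypothetical name introduced here for illustration.

```shell
# Hypothetical helper: run a Himeno binary with a given grid size and
# thread count. Assumes the program reads the grid size from stdin.
run_himeno() {
    bin=$1
    size=$2
    threads=$3
    if [ ! -x "$bin" ]; then
        echo "skipping $bin: executable not found" >&2
        return 0
    fi
    # OMP_NUM_THREADS is set only for this one command.
    echo "$size" | OMP_NUM_THREADS="$threads" "$bin"
}

# Original build: compiled without -fopenmp, so the thread count has no effect.
run_himeno ./himeno L 1
# Optimized build: 48 threads, as used in this guide.
run_himeno ./himeno_codee L 48
```

To compare other thread counts mentioned in this guide, such as 96 or 144, change the last argument of the second call.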
  mimax=          513  mjmax=          257  mkmax=          257
  imax=          512  jmax=          256  kmax=          256
  Time measurement accuracy : .10000E-05
  Start rehearsal measurement process.
  Measure the performance in 3 times.
   MFLOPS:    5369.632       time(s):   0.6250190000000000        4.8828125E-04
 Now, start the actual measurement process.
 The loop will be excuted in          287  times.
 This will take about one minute.
 Wait for a while.
  Loop executed for           287  times
  Gosa :   4.8828125E-04
  MFLOPS:    5397.830       time(s):    59.48113799999999     
  Score based on Pentium III 600MHz :    65.15971     
  mimax=          513  mjmax=          257  mkmax=          257
  imax=          512  jmax=          256  kmax=          256
  Time measurement accuracy : .10000E-05
  Start rehearsal measurement process.
  Measure the performance in 3 times.
   MFLOPS:    19493.87       time(s):   0.1721630000000000        8.5657468E-04
 Now, start the actual measurement process.
 The loop will be excuted in         1045  times.
 This will take about one minute.
 Wait for a while.
  Loop executed for          1045  times
  Gosa :   5.8882288E-04
  MFLOPS:    30169.47       time(s):    38.74941800000000     
  Score based on Pentium III 600MHz :    364.1896    
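Each run prints two MFLOPS lines, one for the rehearsal and one for the actual measurement; the second is the one to compare. A minimal sketch for pulling that value out of saved output, assuming awk is available (last_mflops is a hypothetical helper name):

```shell
# Hypothetical helper: print the MFLOPS value of the final measurement,
# i.e. the last line of the benchmark output that contains "MFLOPS:".
last_mflops() {
    awk '/MFLOPS:/ { value = $2 } END { print value }'
}

# Sample lines copied from the output of the original executable.
last_mflops <<'EOF'
   MFLOPS:    5369.632       time(s):   0.6250190000000000        4.8828125E-04
  MFLOPS:    5397.830       time(s):    59.48113799999999
EOF
```

With the sample lines above, the helper prints 5397.830.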
5. Results
Across 10 executions, the optimized version obtained 30169 MFLOPS, while the original obtained 5397 MFLOPS, representing a ~5.6x speedup.
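The speedup figure is the ratio of the two MFLOPS values. A short sketch of that arithmetic, using the values reported by the runs in the Execution step (the variable names are illustrative):

```shell
# MFLOPS reported by the final measurement of each executable.
orig_mflops=5397.830
opt_mflops=30169.47

# Speedup = optimized throughput / original throughput.
speedup=$(LC_ALL=C awk -v o="$orig_mflops" -v n="$opt_mflops" \
    'BEGIN { printf "%.2f", n / o }')
echo "Speedup: ${speedup}x"
```

This prints a speedup of 5.59x, which rounds to the ~5.6x quoted above.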