
Himeno optimization through GPU parallelism on NVIDIA/Cray Compilers

Goal

Learn how to use Codee to parallelize Himeno, a fluid analysis simulation code, on GPU.

info

This guide is part of the NERSC + Codee Training Series 2024. The code is available for download at the previous link.

Getting started

First, navigate to the source code for Himeno:

cd codee-demos/Fortran/Himeno

Next, load the latest Codee version available on Perlmutter:

module load codee/2024.3.1

Walkthrough

1. Run the screening report

To explore the recommendations of the Open Catalog that are applicable to Himeno, let's run Codee's screening report; use --target-arch to include GPU offloading checks in the analysis:

Codee command
codee screening --target-arch gpu -- nvfortran himeno.f90 -Ofast
Codee output
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: nvfortran himeno.f90 -Ofast

[1/1] himeno.f90 ... Done

SCREENING REPORT

---Number of files---
Total | C C++ Fortran
----- | - --- -------
1 | 0 0 1

Lines of code Analysis time # checks Profiling
------------- ------------- -------- ---------
214 231 ms 22 n/a

Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analysis time : time required to analyze the target
# checks : total actionable items (opportunities, recommendations, defects and remarks) detected
Profiling : estimation of overall execution time required by this target

RANKING OF CHECKERS

Checker Priority AutoFix # Title
------- -------- ------- - ------------------------------------------------------------------------------------------------
PWR068 P27 (L1) 6 Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
PWR054 P12 (L1) x 1 Consider applying vectorization to scalar reduction loop
PWR063 P12 (L1) 1 Avoid using legacy Fortran constructs
PWR056 P4 (L3) x 1 Consider applying offloading parallelism to scalar reduction loop
PWR069 P3 (L3) 6 Use the keyword only to explicitly state what to import from a module
PWR001 P3 (L3) 5 Declare global variables as function parameters
PWR035 P2 (L3) 2 Avoid non-consecutive array access to improve performance

SUGGESTIONS

Use 'roi' to get a return of investment estimation report:
codee roi --target-arch gpu -- nvfortran himeno.f90 -Ofast

Use 'checks' to find out details about the detected checks:
codee checks --target-arch gpu -- nvfortran himeno.f90 -Ofast

1 file, 7 functions, 5 loops successfully analyzed (22 checkers) and 0 non-analyzed files in 232 ms

2. Run the checks report

In the "RANKING OF CHECKERS" of the screening report we can see one occurrence of the PWR056, which is related to offloading to GPU and it has an AutoFix available. Use the --check-id flag to get only results for the given offloading checker (PWR056), and the --verbose flag to see all the available AutoFix options.

Codee command
codee checks --target-arch gpu --verbose --check-id PWR056 -- nvfortran himeno.f90 -Ofast
Codee output
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: nvfortran himeno.f90 -Ofast

[1/1] himeno.f90 ... Done

CHECKS REPORT

himeno.f90:293:6 [PWR056] (level: L3): Consider applying offloading parallelism to scalar reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR056
AutoFix (choose one option):
* Using OpenMP (recommended):
codee rewrite --offload omp-teams --in-place himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast
* Using OpenACC:
codee rewrite --offload acc --in-place himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast
* Using OpenMP and OpenACC combined:
codee rewrite --offload omp-teams,acc --in-place himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast

1 file, 7 functions, 5 loops successfully analyzed (1 checker) and 0 non-analyzed files in 183 ms

3. Autofix

Codee offers offloading AutoFixes with OpenMP and OpenACC. Let's pick the OpenMP option.

We can copy and paste the suggested Codee invocation to perform the offloading, replacing the --in-place flag with -o so that the modified code is written to a new file instead of overwriting the original.
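As a reference for what the OpenMP AutoFix does conceptually, here is a hand-written sketch of the reduction loop from the earlier example offloaded with a combined target teams construct. The clauses and variable names are illustrative; the exact directives Codee generates for Himeno appear in the diffs later in this guide.

! Hand-written sketch (drop-in replacement for the loop nest in the earlier example):
! the combined construct offloads the whole nest to the GPU and performs the
! reduction on 'gosa' there.
!$omp target teams distribute parallel do private(i, j, ss) reduction(+: gosa) &
!$omp   map(to: p, bnd) map(tofrom: gosa)
do k = 2, kmax - 1
  do j = 2, jmax - 1
    do i = 2, imax - 1
      ss = p(i, j, k) * bnd(i, j, k)
      gosa = gosa + ss * ss
    end do
  end do
end do
!$omp end target teams distribute parallel do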

Compiler Driven Mode

In Compiler Driven Mode, Codee generates pragmas specific to the target compiler, using a combination of clauses intended to deliver the best performance with that compiler.

Add the --compiler-driven-mode flag to enable this behavior for the target compiler (nvfortran in this case):

Codee command
module load PrgEnv-nvidia && \
codee rewrite --offload omp-teams -o himeno_nvfort_comp_driven.f90 \
himeno.f90:293:6 --compiler-driven-mode -- nvfortran himeno.f90 -Ofast
Codee output
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: nvfortran himeno.f90 -Ofast

codee: warning: Target compiler not fully supported and optimization flags can not be detected. The loops' vectorization statuses will not be reported correctly.

[Fortran] target compiler 'nvfortran', 23.9.0
Full version name: nvfortran 23.9-0 64-bit target on x86-64 Linux -tp znver3

Results for file '/global/homes/u/user/codee-demos/Fortran/Himeno/himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using offloading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
[INFO] himeno.f90:293:6 Parallel forall: variable 'wrk2'
[INFO] himeno.f90:293:6 Available parallelization strategies for variable 'gosa'
[INFO] himeno.f90:293:6 #1 OpenMP scalar reduction (* implemented)
[INFO] himeno.f90:293:6 #2 OpenMP atomic access
[INFO] himeno.f90:293:6 #3 OpenMP explicit privatization
[INFO] himeno.f90:293:6 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'
Fine-tuning suggestions for better performance [using offloading]:
[TODO] Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses
Documentation: https://github.com/codee-com/open-catalog/tree/main/Glossary/Offloading-data-transfers.md

Successfully created himeno_nvfort_comp_driven.f90

Minimum software stack requirements: OpenMP version 5.0 with offloading capabilities
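The fine-tuning suggestion about data transfers refers to spelling out array sections in the map clauses, so that the host-device transfers are bounded to the data the kernel actually needs. A hand-written sketch of the syntax on the reduction loop used above (in this toy case the ranges cover the whole arrays; in a code like Himeno the point is to map only the used portions of possibly larger allocations):

! Illustrative only: explicit array ranges in the map clauses document and bound
! the host-to-device transfers instead of relying on whole-array mappings.
!$omp target teams distribute parallel do private(i, j, ss) reduction(+: gosa) &
!$omp   map(to: p(1:imax, 1:jmax, 1:kmax), bnd(1:imax, 1:jmax, 1:kmax)) map(tofrom: gosa)
do k = 2, kmax - 1
  do j = 2, jmax - 1
    do i = 2, imax - 1
      ss = p(i, j, k) * bnd(i, j, k)
      gosa = gosa + ss * ss
    end do
  end do
end do
!$omp end target teams distribute parallel do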

Let's try to generate specific pragmas for the Cray Fortran Compiler, just for comparison:

Codee command
module load PrgEnv-cray && \
codee rewrite --offload omp-teams -o himeno_cray_comp_driven.f90 \
himeno.f90:293:6 --compiler-driven-mode -- ftn himeno.f90 -Ofast
Codee output
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: ftn himeno.f90 -Ofast

codee: warning: Target compiler not fully supported and optimization flags can not be detected. The loops' vectorization statuses will not be reported correctly.

[Fortran] target compiler 'ftn', 17.0.0
Full version name: Cray Fortran : Version 17.0.0 Mon Sep 04, 2024 04:05:37

Results for file '/global/homes/u/user/codee-demos/Fortran/Himeno/himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using offloading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
<...>
[INFO] himeno.f90:293:6 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'
Fine-tuning suggestions for better performance [using offloading]:
[TODO] Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses
Documentation: https://github.com/codee-com/open-catalog/tree/main/Glossary/Offloading-data-transfers.md

Successfully created himeno_cray_comp_driven.f90

Minimum software stack requirements: OpenMP version 4.0 with offloading capabilities

As we can see, the pragma generated for crayftn relies on the prescriptive distribute simd constructs, whereas the pragma for nvfortran uses the descriptive loop construct (introduced in OpenMP 5.0, which is why Codee reports a higher minimum OpenMP version for the nvfortran variant). Both directives are re-wrapped at clause boundaries right after the diff for readability:

diff himeno_cray_comp_driven.f90 himeno_nvfort_comp_driven.f90
297,298c297,298
< !$omp target teams distribute simd shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) map(to: kmax, jmax, imax, p, a, bnd, b, c, &
< !$omp wrk1) private(i, j, s0, ss) reduction(+: gosa) map(tofrom: gosa) map(from: wrk2)
---
> !$omp target teams loop shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) map(to: kmax, jmax, imax, p, a, bnd, b, c, wrk1) priva&
> !$omp te(i, j, s0, ss) reduction(+: gosa) map(tofrom: gosa) map(from: wrk2)
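
For readability, the two generated directives from the diff, re-wrapped at clause boundaries:

! himeno_cray_comp_driven.f90 (Cray Fortran): prescriptive distribute + simd
!$omp target teams distribute simd shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) &
!$omp   map(to: kmax, jmax, imax, p, a, bnd, b, c, wrk1) private(i, j, s0, ss) &
!$omp   reduction(+: gosa) map(tofrom: gosa) map(from: wrk2)

! himeno_nvfort_comp_driven.f90 (nvfortran): descriptive loop construct
!$omp target teams loop shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) &
!$omp   map(to: kmax, jmax, imax, p, a, bnd, b, c, wrk1) private(i, j, s0, ss) &
!$omp   reduction(+: gosa) map(tofrom: gosa) map(from: wrk2)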

Compiler Agnostic Mode

Remove the --compiler-driven-mode flag to generate the generic offloading pragmas that Codee offers by default, and compare them with the ones generated specifically for nvfortran in the previous step:

Codee command
codee rewrite --offload omp-teams -o himeno_nvfort_comp_agnostic.f90 himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast
Codee output
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: nvfortran himeno.f90 -Ofast

Results for file '/global/homes/u/user/codee-demos/Fortran/Himeno/himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using offloading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
[INFO] himeno.f90:293:6 Parallel forall: variable 'wrk2'
[INFO] himeno.f90:293:6 Available parallelization strategies for variable 'gosa'
[INFO] himeno.f90:293:6 #1 OpenMP scalar reduction (* implemented)
[INFO] himeno.f90:293:6 #2 OpenMP atomic access
[INFO] himeno.f90:293:6 #3 OpenMP explicit privatization
[INFO] himeno.f90:293:6 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'
Fine-tuning suggestions for better performance [using offloading]:
[TODO] Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses
Documentation: https://github.com/codee-com/open-catalog/tree/main/Glossary/Offloading-data-transfers.md

Successfully created himeno_nvfort_comp_agnostic.f90

Minimum software stack requirements: OpenMP version 4.0 with offloading capabilities

In this case, the compiler-driven pragma for nvfortran uses target teams loop, while the compiler-agnostic version falls back to the more generic target teams distribute parallel do simd; the agnostic directive is re-wrapped at clause boundaries after the diff for readability:

diff himeno_nvfort_comp_driven.f90 himeno_nvfort_comp_agnostic.f90
<...>
297,298c297,298
< !$omp target teams loop shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) map(to: kmax, jmax, imax, p, a, bnd, b, c, wrk1) priva&
< !$omp te(i, j, s0, ss) reduction(+: gosa) map(tofrom: gosa) map(from: wrk2)
---
> !$omp target teams distribute parallel do simd shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) map(to: kmax, jmax, imax, p, a,&
> !$omp bnd, b, c, wrk1) private(i, j, s0, ss) reduction(+: gosa) map(tofrom: gosa) map(from: wrk2) schedule(static)
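
The compiler-agnostic directive, re-wrapped at clause boundaries (the compiler-driven loop variant is shown re-wrapped in the previous step):

! himeno_nvfort_comp_agnostic.f90: generic, prescriptive combined construct
!$omp target teams distribute parallel do simd shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) &
!$omp   map(to: kmax, jmax, imax, p, a, bnd, b, c, wrk1) private(i, j, s0, ss) &
!$omp   reduction(+: gosa) map(tofrom: gosa) map(from: wrk2) schedule(static)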

4. Execution

Finally, compile and run both the original and the optimized codes to assess the speedup. The following scripts (a Slurm batch script and the run script it launches) can be used as a reference: create launch.sh and Himeno.sh, add execution permissions to the latter, and submit the job with sbatch launch.sh:

chmod u+x Himeno.sh
launch.sh
#!/bin/bash

#SBATCH --account=ntrain6
#SBATCH --job-name=codee_himeno_gpu

#SBATCH --constraint=gpu
#SBATCH --qos=regular
#SBATCH --reservation=codee_day1
#SBATCH --time=0:05:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
srun Himeno.sh
Himeno.sh
#!/bin/bash

module load PrgEnv-nvidia
rm -f himeno himeno_nvfort_comp_driven

GRID_SIZE="XL"

nvfortran himeno.f90 -Ofast -o himeno
echo "$GRID_SIZE" | ./himeno

nvfortran himeno_nvfort_comp_driven.f90 -o himeno_nvfort_comp_driven -Ofast -mp -target=gpu -Minfo=mp
echo "$GRID_SIZE" | ./himeno_nvfort_comp_driven