Himeno optimization through GPU parallelism on NVIDIA/Cray Compilers
Learn how to use Codee to parallelize Himeno, a fluid analysis simulation code, on the GPU.
This guide is part of the NERSC + Codee Training Series 2024. The code is available for download at the previous link.
Getting started
First, navigate to the source code for Himeno:
cd codee-demos/Fortran/Himeno
Next, load the latest Codee version available on Perlmutter:
module load codee/2024.3.1
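Optionally, confirm that the module loaded correctly before continuing:
module list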
Walkthrough
1. Run the screening report
To explore the recommendations of the Open Catalog that are applicable to Himeno, let's run Codee's screening report; use --target-arch to include GPU offloading checks in the analysis:
codee screening --target-arch gpu -- nvfortran himeno.f90 -Ofast
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: nvfortran himeno.f90 -Ofast
[1/1] himeno.f90 ... Done
SCREENING REPORT
---Number of files---
Total | C C++ Fortran
----- | - --- -------
1 | 0 0 1
Lines of code Analysis time # checks Profiling
------------- ------------- -------- ---------
214 231 ms 22 n/a
Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analysis time : time required to analyze the target
# checks : total actionable items (opportunities, recommendations, defects and remarks) detected
Profiling : estimation of overall execution time required by this target
RANKING OF CHECKERS
Checker Priority AutoFix # Title
------- -------- ------- - ------------------------------------------------------------------------------------------------
PWR068 P27 (L1) 6 Encapsulate external procedures within modules to avoid the risks of calling implicit interfaces
PWR054 P12 (L1) x 1 Consider applying vectorization to scalar reduction loop
PWR063 P12 (L1) 1 Avoid using legacy Fortran constructs
PWR056 P4 (L3) x 1 Consider applying offloading parallelism to scalar reduction loop
PWR069 P3 (L3) 6 Use the keyword only to explicitly state what to import from a module
PWR001 P3 (L3) 5 Declare global variables as function parameters
PWR035 P2 (L3) 2 Avoid non-consecutive array access to improve performance
SUGGESTIONS
Use 'roi' to get a return of investment estimation report:
codee roi --target-arch gpu -- nvfortran himeno.f90 -Ofast
Use 'checks' to find out details about the detected checks:
codee checks --target-arch gpu -- nvfortran himeno.f90 -Ofast
1 file, 7 functions, 5 loops successfully analyzed (22 checkers) and 0 non-analyzed files in 232 ms
2. Run the checks report
In the "RANKING OF CHECKERS" section of the screening report we can see one occurrence of PWR056, which is related to GPU offloading and has an AutoFix available. Use the --check-id flag to restrict the results to that offloading checker (PWR056), and the --verbose flag to see all the available AutoFix options.
codee checks --target-arch gpu --verbose --check-id PWR056 -- nvfortran himeno.f90 -Ofast
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: nvfortran himeno.f90 -Ofast
[1/1] himeno.f90 ... Done
CHECKS REPORT
himeno.f90:293:6 [PWR056] (level: L3): Consider applying offloading parallelism to scalar reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR056
AutoFix (choose one option):
* Using OpenMP (recommended):
codee rewrite --offload omp-teams --in-place himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast
* Using OpenACC:
codee rewrite --offload acc --in-place himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast
* Using OpenMP and OpenACC combined:
codee rewrite --offload omp-teams,acc --in-place himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast
1 file, 7 functions, 5 loops successfully analyzed (1 checker) and 0 non-analyzed files in 183 ms
3. Autofix
Codee offers offloading AutoFixes with OpenMP and OpenACC. Let's pick the OpenMP option. We can copy and paste the suggested Codee invocation to perform the offloading, replacing the --in-place flag with -o so that the modified code is written to a new file.
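For example, to write the OpenMP version to a new file instead of modifying himeno.f90 in place (the output file name here is just an example):
codee rewrite --offload omp-teams -o himeno_omp.f90 himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast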
Compiler Driven Mode
In Compiler Driven Mode, Codee generates pragmas tailored to the target compiler, using the combination of clauses expected to deliver the best performance with that compiler. Add the --compiler-driven-mode flag to enable this behavior for the target compiler (nvfortran in this case):
module load PrgEnv-nvidia && \
codee rewrite --offload omp-teams -o himeno_nvfort_comp_driven.f90 \
himeno.f90:293:6 --compiler-driven-mode -- nvfortran himeno.f90 -Ofast
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: nvfortran himeno.f90 -Ofast
codee: warning: Target compiler not fully supported and optimization flags can not be detected. The loops' vectorization statuses will not be reported correctly.
[Fortran] target compiler 'nvfortran', 23.9.0
Full version name: nvfortran 23.9-0 64-bit target on x86-64 Linux -tp znver3
Results for file '/global/homes/u/user/codee-demos/Fortran/Himeno/himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using offloading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
[INFO] himeno.f90:293:6 Parallel forall: variable 'wrk2'
[INFO] himeno.f90:293:6 Available parallelization strategies for variable 'gosa'
[INFO] himeno.f90:293:6 #1 OpenMP scalar reduction (* implemented)
[INFO] himeno.f90:293:6 #2 OpenMP atomic access
[INFO] himeno.f90:293:6 #3 OpenMP explicit privatization
[INFO] himeno.f90:293:6 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'
Fine-tuning suggestions for better performance [using offloading]:
[TODO] Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses
Documentation: https://github.com/codee-com/open-catalog/tree/main/Glossary/Offloading-data-transfers.md
Successfully created himeno_nvfort_comp_driven.f90
Minimum software stack requirements: OpenMP version 5.0 with offloading capabilities
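The [TODO] suggestion about data transfers refers to mapping only the array sections that the loop actually reads or writes, rather than whole arrays. Below is a minimal, self-contained sketch of the idea; it is not Himeno code, and the array names and bounds are purely illustrative:
! Minimal sketch of explicit array sections in map clauses
! (illustrative only; not taken from himeno.f90)
program map_sections_sketch
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n), gosa
  integer :: i

  a = 1.0
  b = 0.0
  gosa = 0.0

  ! Transfer only the ranges the loop needs: a is read in full,
  ! b(2:n-1) is written, and gosa is a reduction variable.
  !$omp target teams distribute parallel do reduction(+: gosa) &
  !$omp   map(to: a(1:n)) map(from: b(2:n-1)) map(tofrom: gosa)
  do i = 2, n - 1
     b(i) = 0.5 * (a(i-1) + a(i+1))
     gosa = gosa + b(i)
  end do

  print *, 'gosa = ', gosa
end program map_sections_sketch
When compiled with OpenMP offloading enabled (for example, the -mp -target=gpu flags used later in Himeno.sh), only the listed sections are transferred between host and device.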
Let's try to generate specific pragmas for the Cray Fortran Compiler, just for comparison:
module load PrgEnv-cray && \
codee rewrite --offload omp-teams -o himeno_cray_comp_driven.f90 \
himeno.f90:293:6 --compiler-driven-mode -- ftn himeno.f90 -Ofast
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: ftn himeno.f90 -Ofast
codee: warning: Target compiler not fully supported and optimization flags can not be detected. The loops' vectorization statuses will not be reported correctly.
[Fortran] target compiler 'ftn', 17.0.0
Full version name: Cray Fortran : Version 17.0.0 Mon Sep 04, 2024 04:05:37
Results for file '/global/homes/u/user/codee-demos/Fortran/Himeno/himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using offloading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
<...>
[INFO] himeno.f90:293:6 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'
Fine-tuning suggestions for better performance [using offloading]:
[TODO] Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses
Documentation: https://github.com/codee-com/open-catalog/tree/main/Glossary/Offloading-data-transfers.md
Successfully created himeno_cray_comp_driven.f90
Minimum software stack requirements: OpenMP version 4.0 with offloading capabilities
As we can see, the pragma generated for crayftn relies on the distribute simd construct, while the pragma for nvfortran uses the loop construct (hence the OpenMP 5.0 requirement reported above for nvfortran).
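The two generated files can be compared with diff; a possible invocation (argument order assumed to match the < and > sides of the output below):
diff himeno_cray_comp_driven.f90 himeno_nvfort_comp_driven.f90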
297,298c297,298
< !$omp target teams distribute simd shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) map(to: kmax, jmax, imax, p, a, bnd, b, c, &
< !$omp wrk1) private(i, j, s0, ss) reduction(+: gosa) map(tofrom: gosa) map(from: wrk2)
---
> !$omp target teams loop shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) map(to: kmax, jmax, imax, p, a, bnd, b, c, wrk1) priva&
> !$omp te(i, j, s0, ss) reduction(+: gosa) map(tofrom: gosa) map(from: wrk2)
Compiler Agnostic Mode
Remove the --compiler-driven-mode flag to generate the generic offloading pragmas that Codee offers, so we can compare them with the ones generated specifically for nvfortran in the previous step:
codee rewrite --offload omp-teams -o himeno_nvfort_comp_agnostic.f90 himeno.f90:293:6 -- nvfortran himeno.f90 -Ofast
Date: 2024-09-04 Codee version: 2024.3.1 License type: Full
Compiler invocation: nvfortran himeno.f90 -Ofast
Results for file '/global/homes/u/user/codee-demos/Fortran/Himeno/himeno.f90':
Successfully applied AutoFix to the loop at 'himeno.f90:jacobi:293:6' [using offloading]:
[INFO] himeno.f90:293:6 Parallel scalar reduction pattern identified for variable 'gosa' with associative, commutative operator '+'
[INFO] himeno.f90:293:6 Parallel forall: variable 'wrk2'
[INFO] himeno.f90:293:6 Available parallelization strategies for variable 'gosa'
[INFO] himeno.f90:293:6 #1 OpenMP scalar reduction (* implemented)
[INFO] himeno.f90:293:6 #2 OpenMP atomic access
[INFO] himeno.f90:293:6 #3 OpenMP explicit privatization
[INFO] himeno.f90:293:6 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'
Fine-tuning suggestions for better performance [using offloading]:
[TODO] Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses
Documentation: https://github.com/codee-com/open-catalog/tree/main/Glossary/Offloading-data-transfers.md
Successfully created himeno_nvfort_comp_agnostic.f90
Minimum software stack requirements: OpenMP version 4.0 with offloading capabilities
In this case, the compiler-driven pragma for nvfortran uses target teams loop, whereas the compiler-agnostic version falls back to the more generic target teams distribute parallel do.
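Again, diff can be used to compare the generated files; a possible invocation (argument order assumed to match the output below):
diff himeno_nvfort_comp_driven.f90 himeno_nvfort_comp_agnostic.f90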
<...>
297,298c297,298
< !$omp target teams loop shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) map(to: kmax, jmax, imax, p, a, bnd, b, c, wrk1) priva&
< !$omp te(i, j, s0, ss) reduction(+: gosa) map(tofrom: gosa) map(from: wrk2)
---
> !$omp target teams distribute parallel do simd shared(a, b, bnd, c, imax, jmax, kmax, p, wrk1) map(to: kmax, jmax, imax, p, a,&
> !$omp bnd, b, c, wrk1) private(i, j, s0, ss) reduction(+: gosa) map(tofrom: gosa) map(from: wrk2) schedule(static)
4. Execution
Finally, compile and run both the original and the optimized codes to assess the speed improvement. The following scripts can be used as a reference; create launch.sh and Himeno.sh, and add execution permissions to the latter:
chmod u+x Himeno.sh
launch.sh:
#!/bin/bash
#SBATCH --account=ntrain6
#SBATCH --job-name=codee_himeno_gpu
#SBATCH --constraint=gpu
#SBATCH --qos=regular
#SBATCH --reservation=codee_day1
#SBATCH --time=0:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1
export SLURM_CPU_BIND="cores"
srun Himeno.sh
Himeno.sh:
#!/bin/bash
module load PrgEnv-nvidia
rm -f himeno himeno_nvfort_comp_driven
GRID_SIZE="XL"
nvfortran himeno.f90 -Ofast -o himeno
echo "$GRID_SIZE" | ./himeno
nvfortran himeno_nvfort_comp_driven.f90 -o himeno_nvfort_comp_driven -Ofast -mp -target=gpu -Minfo=mp
echo "$GRID_SIZE" | ./himeno_nvfort_comp_driven