NUCCOR parallelization on CPU/GPU at Perlmutter
This guide walks you through using Codee to optimize NUCCOR by applying CPU multithreading and by offloading computations to the GPU. NUCCOR is a nuclear physics code used to calculate the properties of atomic nuclei and their reactions.
This guide is part of the NERSC + Codee Training Series 2024. Code available for download at the previous link.
Getting started
First, navigate to the source code for NUCCOR:
cd codee-demos/Fortran/NUCCOR
The NUCCOR code within this repository is just the kernel of NUCCOR (mtc.F90),
whose module is used by a benchmark program (mtc_main.F90). The particularity
of this code is that the MTC kernel is located in two identical files, mtc.F90
and mtc_openmp.F90, and both are used by the benchmark program. The idea is to
use Codee to optimize only mtc_openmp.F90, leaving mtc.F90 unchanged. This way
the benchmark can use the execution time of the original version (mtc.F90) as
a single-threaded baseline.
Load the latest Codee version available on Perlmutter:
module load codee/2024.3.1
Walkthrough
1. Generate the compile_commands.json
The project comes with a Makefile, so we can leverage the tool bear (version 3
or later) to generate the compile_commands.json file required by Codee:
/global/cfs/cdirs/m4232pub/tools/bin/bear -- make
It's as simple as prepending bear -- to the make invocation. This command will
produce a compile_commands.json file with all the compiler invocations needed
to build the source files.
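Before moving on, it can help to confirm which source files the compilation database actually recorded. The following is a minimal sketch, not part of the original guide: the helper name list_db_files is hypothetical, and it assumes the pretty-printed JSON layout in which each entry carries a "file" key on its own line.

```shell
# List the source files recorded in a compilation database.
# Assumption: each entry has a "file" key and the JSON is pretty-printed,
# so that at most one "file" key appears per line.
list_db_files() {
    db="$1"
    if [ ! -f "$db" ]; then
        echo "no compilation database at $db (run 'bear -- make' first)" >&2
        return 1
    fi
    # Print the value of each "file" key, one path per line.
    sed -n 's/.*"file"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$db"
}

# Usage: run from the directory where bear wrote the database.
list_db_files compile_commands.json || true
```

If the list is empty or files are missing, a likely cause is that make had nothing to rebuild; running make clean before bear -- make usually forces every compiler invocation to be captured.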
2. Run the screening report
To explore the recommendations of the Open Catalog that are applicable to
NUCCOR, run Codee's screening report. Use --target-arch to include CPU
parallelization checkers in the analysis, and --compile-commands to point to
the compilation database we just generated with bear.
We are going to optimize mtc_openmp.F90 (a clone of mtc.F90), leaving mtc.F90
unchanged on purpose to serve as a baseline. Therefore, we pass mtc_openmp.F90
to Codee as an argument, so that it only reports checkers for that file.
Otherwise Codee would report every checker it finds for all the files listed
in the compile_commands.json file.
Note: The way bear interacts with Perlmutter's filesystems causes the
compilation database to list project files under /global/u2 instead of
/global/home. This will prevent Codee from locating the source files when
using file filters. To resolve this, we will use the realpath command to
adjust the filters.
codee screening --target-arch cpu $(realpath mtc_openmp.F90) --compile-commands compile_commands.json
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full
[1/1] mtc_openmp.F90 ... Done
SCREENING REPORT
---Number of files---
Total | C C++ Fortran
----- | - --- -------
1 | 0 0 1
Lines of code Analysis time # checks Profiling
------------- ------------- -------- ---------
96 52 ms 17 n/a
Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analysis time : time required to analyze the target
# checks : total actionable items (opportunities, recommendations, defects and remarks) detected
Profiling : estimation of overall execution time required by this target
RANKING OF CHECKERS
Checker Priority AutoFix # Title
------- -------- ------- - ----------------------------------------------------------------------------------------------------------------------
PWR003 P18 (L1) 3 Explicitly declare pure functions
PWR053 P12 (L1) x 1 Consider applying vectorization to forall loop
PWR054 P12 (L1) x 1 Consider applying vectorization to scalar reduction loop
PWR050 P6 (L2) x 1 Consider applying multithreading parallelism to forall loop
PWR051 P6 (L2) x 1 Consider applying multithreading parallelism to scalar reduction loop
PWR069 P3 (L3) 3 Use the keyword only to explicitly state what to import from a module
PWR035 P2 (L3) 6 Avoid non-consecutive array access to improve performance
RMK010 P0 (L3) 1 The vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body
SUGGESTIONS
Use 'roi' to get a return of investment estimation report:
codee roi --target-arch cpu mtc_openmp.F90 --compile-commands compile_commands.json
Use 'checks' to find out details about the detected checks:
codee checks --target-arch cpu mtc_openmp.F90 --compile-commands compile_commands.json
1 file, 6 functions, 16 loops successfully analyzed (17 checkers) and 0 non-analyzed files in 94 ms
Codee reports 17 checkers, two of which are CPU multithreading opportunities (PWR050 and PWR051) with AutoFixes available.
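The path mismatch described in the note earlier in this step can be reproduced in a scratch directory. The sketch below assumes the discrepancy comes from a symbolic link (here a hypothetical "home" directory that links to "u2"); it shows that the path typed through the link differs from the one realpath returns, which is why the file filter is wrapped in $(realpath ...).

```shell
# Scratch layout: "home" is a symlink to "u2". This mimics, as an assumption,
# the relationship between /global/home and /global/u2 on Perlmutter.
start_dir=$PWD
scratch=$(mktemp -d)
mkdir -p "$scratch/u2/project"
ln -s "$scratch/u2" "$scratch/home"
touch "$scratch/u2/project/mtc_openmp.F90"

# Enter the project through the symlink, as a user would through /global/home.
cd "$scratch/home/project"
logical="$(pwd -L)/mtc_openmp.F90"      # path as typed, symlink kept
resolved="$(realpath mtc_openmp.F90)"   # path with every symlink resolved
echo "logical:  $logical"
echo "resolved: $resolved"

# Return to the starting directory and remove the scratch layout.
cd "$start_dir"
rm -rf "$scratch"
```

The two printed paths name the same file, but only the resolved one matches the form that a tool recording physical paths would write into the compilation database.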
3. Run the checks report
Let's list the 17 checkers reported, using the codee checks report:
codee checks --target-arch cpu $(realpath mtc_openmp.F90) --compile-commands compile_commands.json
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full
[1/1] mtc_openmp.F90 ... Done
CHECKS REPORT
mtc_openmp.F90:59:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:77:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:111:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:69:21 [PWR053] (level: L1): Consider applying vectorization to forall loop
mtc_openmp.F90:47:25 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
mtc_openmp.F90:86:9 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
mtc_openmp.F90:47:25 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
mtc_openmp.F90:30:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:59:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:77:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:66:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:86:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:87:13 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:88:17 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:89:21 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:91:25 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:93:33 [RMK010] (level: L3): The vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body
SUGGESTIONS
Use --verbose to get more details, e.g:
codee checks --verbose --target-arch cpu mtc_openmp.F90 --compile-commands compile_commands.json
Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR003 --target-arch cpu mtc_openmp.F90 --compile-commands compile_commands.json
1 file, 6 functions, 16 loops successfully analyzed (17 checkers) and 0 non-analyzed files in 118 ms
We can also run the detailed output of the checks report (option --verbose)
to obtain more information about each checker. This detailed output includes
links to the Open Catalog, along with the precise location in the source code.
However, the additional information that the verbose mode brings can be
overwhelming when many checkers are reported. To prevent this, use the
--check-id flag to filter the output.
Let's focus on the checker PWR050, which is related to parallelizing loops with multithreading:
codee checks --target-arch cpu $(realpath mtc_openmp.F90) --compile-commands compile_commands.json --check-id PWR050 --verbose
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full
[1/1] mtc_openmp.F90 ... Done
CHECKS REPORT
/global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR050
AutoFix (choose one option):
* Using OpenMP 'for' (recommended):
codee rewrite --multi omp-for --in-place /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 --compile-commands compile_commands.json
* Using OpenMP 'taskwait':
codee rewrite --multi omp-taskwait --in-place /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 --compile-commands compile_commands.json
* Using OpenMP 'taskloop':
codee rewrite --multi omp-taskloop --in-place /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 --compile-commands compile_commands.json
1 file, 6 functions, 16 loops successfully analyzed (1 checker) and 0 non-analyzed files in 82 ms
4. Autofix
Apply the multithreading AutoFix for PWR050, choosing the first (recommended)
rewriting option ("Using OpenMP for"):
codee rewrite --multi omp-for --in-place $(realpath mtc_openmp.F90):86:9 --compile-commands compile_commands.json
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full
Results for file '/global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90':
Successfully applied AutoFix to the loop at '/global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:contract_simple:86:9' [using multi-threading]:
[INFO] /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 Parallel forall: variable 'dst'
[INFO] /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 Parallel region defined by OpenMP directive 'parallel'
Successfully updated /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
Let's see what changes codee rewrite has applied to the code:
diff --git a/Fortran/NUCCOR/mtc_openmp.F90 b/Fortran/NUCCOR/mtc_openmp.F90
index b7aba0c..c5e27e1 100644
--- a/Fortran/NUCCOR/mtc_openmp.F90
+++ b/Fortran/NUCCOR/mtc_openmp.F90
@@ -83,6 +88,10 @@ contains
integer :: nh, np, i, j, m, a, b, e, f
real(real64) :: temp
+ ! Codee: Loop modified by Codee (2024-09-06 04:05:29)
+ ! Codee: Technique applied: multithreading with 'omp-for' pragmas
+ !$omp parallel default(none) shared(dst, op, src) private(a, b, e, f, i, j, m, temp)
+ !$omp do private(a, b, e, f, i, m, temp) schedule(auto)
do j = 1, size(dst, 4)
do i = 1, size(dst, 3)
do b = 1, size(dst, 2)
@@ -100,6 +109,7 @@ contains
end do
end do
end do
+ !$omp end parallel
end subroutine contract_simple
subroutine cleanup(this)
5. Execution
Finally, compile and run both the original and the optimized codes to assess
the speed improvement. The following Slurm batch script can be used as a
reference. Create launch.sh (the batch script, listed first below) and
Nuccor.sh (the script it runs through srun, listed second), and add execution
permissions to the latter:
chmod u+x Nuccor.sh
#!/bin/bash
#SBATCH --account=ntrain6
#SBATCH --job-name=codee_nuccor_cpu
#SBATCH --qos=regular
#SBATCH --reservation=codee_day2
#SBATCH --time=0:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
export SLURM_CPU_BIND="cores"
srun Nuccor.sh
#!/bin/bash
module load PrgEnv-gnu
make clean
make
./mtc.x 30 70 10 0.1 yes contract_simple
Results
nh: 30
np: 40
nab: 16
nc: 4
nij: 9
nk: 3
Allocated cmap: T
Allocated kmap: T
Memory usage for simple contract: 12.900 Gb
Time spent in simple contraction reference: 65.704842659994029
Time spent in contraction Openmp version: 2.7219030110281892
Test simple contraction: OK
The original version of MTC contract_simple ran in 65.7 seconds, while the multithreaded version took just 2.7 seconds, which represents a speedup of about 24x.
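The speedup figure follows directly from the two timings printed by the benchmark. As a small worked check, the sketch below divides the reference time by the OpenMP time; the variable names are illustrative, and the values are copied from the output above.

```shell
# Timings in seconds, copied from the benchmark output.
ref=65.704842659994029   # "simple contraction reference" (mtc.F90)
omp=2.7219030110281892   # "contraction Openmp version" (mtc_openmp.F90)

# Speedup = reference time / optimized time, rounded to one decimal place.
# LC_ALL=C keeps the decimal separator a dot regardless of the locale.
speedup=$(LC_ALL=C awk -v r="$ref" -v o="$omp" 'BEGIN { printf "%.1f", r / o }')
echo "speedup: ${speedup}x"   # prints "speedup: 24.1x"
```

Timings vary between runs and nodes, so expect the exact ratio to differ when you repeat the experiment.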
6. Explore GPU offloading AutoFixes of Codee
You are free to explore the other parallelization techniques that Codee
supports. For example, try generating pragmas for GPU offloading. Just replace
--target-arch cpu with --target-arch gpu and you will get checkers for
offloading. Note that you can also get both sets of checkers with
--target-arch cpu,gpu.
Remember that you can use the Codee Compiler Driven Mode (enabled with the
--compiler-driven-mode flag) to generate offloading pragmas optimized
specifically for your target compiler. Codee generates specific offloading
pragmas for the following compilers: crayftn and nvfortran, while it will
generate generic offloading pragmas for the rest (or whenever the
--compiler-driven-mode flag is not used).
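As a starting point for that exploration, the screening command from step 2 can be wrapped so that only the target architecture changes. This is a sketch under stated assumptions: the helper name screen_for is hypothetical, the command line is the one used earlier in this guide with the --target-arch value swapped, and the guard simply skips the run when codee is not on the PATH.

```shell
# Run the screening report for a given target architecture.
# Accepted values in this guide: cpu, gpu, or cpu,gpu.
screen_for() {
    arch="$1"
    codee screening --target-arch "$arch" "$(realpath mtc_openmp.F90)" \
        --compile-commands compile_commands.json
}

if command -v codee >/dev/null 2>&1; then
    screen_for gpu        # offloading checkers only
    screen_for cpu,gpu    # CPU and offloading checkers together
else
    echo "codee not found; load it first (module load codee/2024.3.1)"
fi
```

The reported checker IDs and AutoFix commands will differ from the CPU run, so use the codee checks report with --verbose, as in step 3, to see the rewrite options offered for each offloading opportunity.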