NUCCOR parallelization on CPU/GPU at Perlmutter

Goal

This guide walks you through using Codee to optimize NUCCOR by applying CPU multithreading and by offloading computations to the GPU. NUCCOR is a nuclear physics code used to calculate the properties of atomic nuclei and their reactions.

info

This guide is part of the NERSC + Codee Training Series 2024. Code available for download at the previous link.

Getting started

First, navigate to the source code for NUCCOR:

cd codee-demos/Fortran/NUCCOR

Important note

The NUCCOR code in this repository is only the kernel of NUCCOR (mtc.F90), whose module is used by a benchmark program (mtc_main.F90). The MTC kernel is present in two identical files, mtc.F90 and mtc_openmp.F90, and the benchmark program uses both. The idea is to use Codee to optimize only mtc_openmp.F90, leaving mtc.F90 unchanged. This way the benchmark can use the execution time of the original version (mtc.F90) as a single-threaded baseline.

Load the latest Codee version available on Perlmutter:

module load codee/2024.3.1

Walkthrough

1. Generate the compile_commands.json

The project comes with a Makefile, so we can leverage the tool bear (version 3 or later) to generate the compile_commands.json file required by Codee:

/global/cfs/cdirs/m4232pub/tools/bin/bear -- make

It's as simple as prepending bear -- to the make invocation. This command will produce a compile_commands.json file with all the compiler invocations needed to build the source files.
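
Before moving on, it may be worth confirming that the Fortran sources actually appear in the database. The sketch below is illustrative only: it writes a small stand-in compile_commands.json to a scratch directory (the directory, compiler name and arguments are assumptions, not the real output of bear on Perlmutter) and lists the "file" field of each entry. Against the real database, only the final grep/sed pipeline would be needed.

```shell
# Stand-in database in a scratch directory; real entries written by bear will differ.
workdir=$(mktemp -d)
cat > "$workdir/compile_commands.json" <<'EOF'
[
  {
    "directory": "/tmp/NUCCOR",
    "arguments": ["gfortran", "-c", "mtc.F90"],
    "file": "mtc.F90"
  },
  {
    "directory": "/tmp/NUCCOR",
    "arguments": ["gfortran", "-c", "mtc_openmp.F90"],
    "file": "mtc_openmp.F90"
  },
  {
    "directory": "/tmp/NUCCOR",
    "arguments": ["gfortran", "-c", "mtc_main.F90"],
    "file": "mtc_main.F90"
  }
]
EOF

# List the source file recorded in each entry (one "file" key per compiler invocation).
listed_files=$(grep -o '"file": *"[^"]*"' "$workdir/compile_commands.json" \
    | sed 's/^"file": *"//; s/"$//')
echo "$listed_files"
```

If mtc_openmp.F90 is missing from the list, Codee has no compiler invocation to analyze it with, so the build should be rerun under bear from a clean state.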

2. Run the screening report

To explore the recommendations of the Open Catalog that are applicable to NUCCOR, run Codee's screening report. Use --target-arch cpu to include CPU parallelization checkers in the analysis, and --compile-commands to point to the compilation database we just generated with bear.

Reminder

We are going to optimize mtc_openmp.F90 (clone of mtc.F90), leaving mtc.F90 unchanged on purpose, to serve as a baseline. Therefore, we will pass mtc_openmp.F90 to Codee as argument, so it only reports checkers for it. Otherwise Codee will report every checker it finds for all the files defined in the compile_commands.json file.

Note: The way bear interacts with Perlmutter's filesystems causes the compilation database to list project files under /global/u2 instead of /global/home. This will prevent Codee from locating the source files when using file filters. To resolve this, we will use the realpath command to adjust the filters.
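
To see why realpath helps, the following sketch reproduces the situation in a scratch directory. The directory names u2 and home only mimic the /global/u2 and /global/home layout described in the note; they are assumptions for illustration and nothing here touches the real filesystem layout.

```shell
# Scratch area; resolve it first so the comparison below is not affected by the temp dir itself.
base=$(cd "$(mktemp -d)" && pwd -P)

# "u2" plays the role of the physical location, "home" is a symlink pointing at it.
mkdir "$base/u2"
ln -s "$base/u2" "$base/home"
touch "$base/u2/mtc_openmp.F90"

# The path as a user would type it goes through the symlink...
typed_path="$base/home/mtc_openmp.F90"
# ...while realpath resolves the symlink and returns the physical path.
resolved_path=$(realpath "$typed_path")

echo "typed:    $typed_path"
echo "resolved: $resolved_path"
```

The resolved path is the form that a tool recording physical paths would store, which is why passing $(realpath mtc_openmp.F90) makes the file filter match the entries in the compilation database.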

Codee command
codee screening --target-arch cpu $(realpath mtc_openmp.F90) --compile-commands compile_commands.json

Codee output
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full

[1/1] mtc_openmp.F90 ... Done

SCREENING REPORT

---Number of files---
Total | C C++ Fortran
----- | - --- -------
1     | 0 0   1

Lines of code Analysis time # checks Profiling
------------- ------------- -------- ---------
96            52 ms         17       n/a

Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analysis time : time required to analyze the target
# checks : total actionable items (opportunities, recommendations, defects and remarks) detected
Profiling : estimation of overall execution time required by this target

RANKING OF CHECKERS

Checker Priority AutoFix # Title
------- -------- ------- - ----------------------------------------------------------------------------------------------------------------------
PWR003  P18 (L1)         3 Explicitly declare pure functions
PWR053  P12 (L1)  x      1 Consider applying vectorization to forall loop
PWR054  P12 (L1)  x      1 Consider applying vectorization to scalar reduction loop
PWR050  P6 (L2)   x      1 Consider applying multithreading parallelism to forall loop
PWR051  P6 (L2)   x      1 Consider applying multithreading parallelism to scalar reduction loop
PWR069  P3 (L3)          3 Use the keyword only to explicitly state what to import from a module
PWR035  P2 (L3)          6 Avoid non-consecutive array access to improve performance
RMK010  P0 (L3)          1 The vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body

SUGGESTIONS

  Use 'roi' to get a return of investment estimation report:
        codee roi --target-arch cpu mtc_openmp.F90 --compile-commands compile_commands.json

  Use 'checks' to find out details about the detected checks:
        codee checks --target-arch cpu mtc_openmp.F90 --compile-commands compile_commands.json

1 file, 6 functions, 16 loops successfully analyzed (17 checkers) and 0 non-analyzed files in 94 ms

17 checkers are reported, two of which are CPU multithreading opportunities (PWR050 and PWR051) with AutoFixes available.

3. Run the checks report

Let's list the 17 checkers reported, using the codee checks report:

Codee command
codee checks --target-arch cpu $(realpath mtc_openmp.F90) --compile-commands compile_commands.json

Codee output
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full

[1/1] mtc_openmp.F90 ... Done

CHECKS REPORT

mtc_openmp.F90:59:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:77:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:111:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:69:21 [PWR053] (level: L1): Consider applying vectorization to forall loop
mtc_openmp.F90:47:25 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
mtc_openmp.F90:86:9 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
mtc_openmp.F90:47:25 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
mtc_openmp.F90:30:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:59:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:77:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:66:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:86:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:87:13 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:88:17 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:89:21 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:91:25 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:93:33 [RMK010] (level: L3): The vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body

SUGGESTIONS

  Use --verbose to get more details, e.g:
        codee checks --verbose --target-arch cpu mtc_openmp.F90 --compile-commands compile_commands.json

  Use --check-id to focus on specific subsets of checkers, e.g.:
        codee checks --check-id PWR003 --target-arch cpu mtc_openmp.F90 --compile-commands compile_commands.json

1 file, 6 functions, 16 loops successfully analyzed (17 checkers) and 0 non-analyzed files in 118 ms

We can also run the detailed output of the checks report (option --verbose) to obtain more information about each checker. This detailed output includes links to the Open Catalog, along with the precise location in the source code. However, the additional information that the verbose mode brings can be overwhelming when many checkers are reported. To prevent this, use the --check-id flag to filter the output.
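
When deciding which IDs to pass to --check-id, a per-checker tally of the report can help. The sketch below uses only standard shell tools, not a Codee feature: it stores the 17 report lines shown above in a temporary file (in practice you could redirect the output of codee checks to a file instead) and counts the occurrences of each checker ID.

```shell
# Copy of the report lines from the CHECKS REPORT above, saved to a scratch file.
report=$(mktemp)
cat > "$report" <<'EOF'
mtc_openmp.F90:59:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:77:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:111:5 [PWR003] (level: L1): Explicitly declare pure functions
mtc_openmp.F90:69:21 [PWR053] (level: L1): Consider applying vectorization to forall loop
mtc_openmp.F90:47:25 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
mtc_openmp.F90:86:9 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
mtc_openmp.F90:47:25 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
mtc_openmp.F90:30:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:59:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:77:5 [PWR069] (level: L3): Use the keyword only to explicitly state what to import from a module
mtc_openmp.F90:66:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:86:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:87:13 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:88:17 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:89:21 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:91:25 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:93:33 [RMK010] (level: L3): The vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body
EOF

# Pull out the bracketed checker ID from each line and count how often each one appears.
summary=$(grep -o '\[[A-Z]*[0-9]*\]' "$report" | tr -d '[]' | sort | uniq -c | awk '{print $2, $1}')
echo "$summary"
```

The counts match the "#" column of the screening report's ranking table (for example, 6 for PWR035), and they add up to the 17 checks reported.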

Let's focus on the checker PWR050, which is related to parallelizing loops with multithreading:

Codee command
codee checks --target-arch cpu $(realpath mtc_openmp.F90) --compile-commands compile_commands.json --check-id PWR050 --verbose

Codee output
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full

[1/1] mtc_openmp.F90 ... Done

CHECKS REPORT

/global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
  Suggestion: Use 'rewrite' to automatically optimize the code
  Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR050
  AutoFix (choose one option):
      * Using OpenMP 'for' (recommended):
        codee rewrite --multi omp-for --in-place /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 --compile-commands compile_commands.json
      * Using OpenMP 'taskwait':
        codee rewrite --multi omp-taskwait --in-place /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 --compile-commands compile_commands.json
      * Using OpenMP 'taskloop':
        codee rewrite --multi omp-taskloop --in-place /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 --compile-commands compile_commands.json

1 file, 6 functions, 16 loops successfully analyzed (1 checker) and 0 non-analyzed files in 82 ms

4. Autofix

Apply the multithreading AutoFixes, choosing the first (recommended) rewriting option ("Using OpenMP for"):

Codee command
codee rewrite --multi omp-for --in-place $(realpath mtc_openmp.F90):86:9 --compile-commands compile_commands.json

Codee output
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full

Results for file '/global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90':
  Successfully applied AutoFix to the loop at '/global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:contract_simple:86:9' [using multi-threading]:
      [INFO] /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 Parallel forall: variable 'dst'
      [INFO] /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 Loop parallelized with multithreading using OpenMP directive 'for'
      [INFO] /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90:86:9 Parallel region defined by OpenMP directive 'parallel'

Successfully updated /global/u2/u/user/codee-demos/Fortran/NUCCOR/mtc_openmp.F90

Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities

Let's see what changes codee rewrite has applied to the code:

git diff mtc_openmp.F90
diff --git a/Fortran/NUCCOR/mtc_openmp.F90 b/Fortran/NUCCOR/mtc_openmp.F90
index b7aba0c..c5e27e1 100644
--- a/Fortran/NUCCOR/mtc_openmp.F90
+++ b/Fortran/NUCCOR/mtc_openmp.F90
@@ -83,6 +88,10 @@ contains
         integer :: nh, np, i, j, m, a, b, e, f
         real(real64) :: temp
 
+        ! Codee: Loop modified by Codee (2024-09-06 04:05:29)
+        ! Codee: Technique applied: multithreading with 'omp-for' pragmas
+        !$omp parallel default(none) shared(dst, op, src) private(a, b, e, f, i, j, m, temp)
+        !$omp do private(a, b, e, f, i, m, temp) schedule(auto)
         do j = 1, size(dst, 4)
             do i = 1, size(dst, 3)
                 do b = 1, size(dst, 2)
@@ -100,6 +109,7 @@ contains
                 end do
             end do
         end do
+        !$omp end parallel
     end subroutine contract_simple
 
     subroutine cleanup(this)

5. Execution

Finally, compile and run both the original and the optimized code to assess the speed improvement. The following Slurm batch script can be used as a reference; create launch.sh and Nuccor.sh, and add execution permissions to the latter:

chmod u+x Nuccor.sh

launch.sh
#!/bin/bash

#SBATCH --account=ntrain6
#SBATCH --job-name=codee_nuccor_cpu

#SBATCH --qos=regular
#SBATCH --reservation=codee_day2
#SBATCH --time=0:05:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32

export SLURM_CPU_BIND="cores"
srun Nuccor.sh

Nuccor.sh
#!/bin/bash

module load PrgEnv-gnu
make clean
make

./mtc.x 30 70 10 0.1 yes contract_simple
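
The scripts above do not set the OpenMP thread count explicitly and rely on the runtime default. If you prefer to pin it to the Slurm allocation, one common pattern (an assumption on our part, not part of the original training scripts) is to add lines like the following to Nuccor.sh before the benchmark is launched:

```shell
# SLURM_CPUS_PER_TASK is set by Slurm inside a job that requested --cpus-per-task;
# fall back to a single thread when the script is run outside of a job.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Standard OpenMP affinity settings: one place per core, threads kept close together.
export OMP_PLACES=cores
export OMP_PROC_BIND=close

echo "OpenMP threads: $OMP_NUM_THREADS"
```

Whether these affinity settings improve on the defaults depends on the node and the kernel, so it is worth timing the benchmark with and without them.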

Results

Benchmark output
 nh:           30
 np:           40
 nab:           16
 nc:            4
 nij:            9
 nk:            3
 Allocated cmap:  T
 Allocated kmap:  T
Memory usage for simple contract:     12.900 Gb
 Time spent in simple contraction reference:    65.704842659994029     
 Time spent in contraction Openmp version:    2.7219030110281892     
 Test simple contraction: OK

The original version of MTC contract_simple ran in 65.7 seconds, while the multithreaded version took just 2.7 seconds, which represents a speedup of about 24x.
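
The speedup figure can be reproduced from the two timings in the benchmark output. The sketch below also derives a parallel efficiency, which rests on an assumption: that all 32 CPUs requested with --cpus-per-task in launch.sh were used as OpenMP threads.

```shell
# Timings copied from the benchmark output above, in seconds.
t_ref=65.704842659994029
t_omp=2.7219030110281892
# Assumed thread count: the --cpus-per-task value from launch.sh.
threads=32

# Speedup = reference time / parallel time.
speedup=$(awk -v a="$t_ref" -v b="$t_omp" 'BEGIN { printf "%.1f", a / b }')
# Parallel efficiency = speedup / threads, expressed as a percentage.
efficiency=$(awk -v a="$t_ref" -v b="$t_omp" -v n="$threads" 'BEGIN { printf "%.0f", 100 * a / (b * n) }')

echo "speedup: ${speedup}x"        # prints "speedup: 24.1x"
echo "efficiency: ${efficiency}%"  # prints "efficiency: 75%"
```

An efficiency below 100% is expected in practice; the remaining gap may come from memory bandwidth limits, load imbalance, or threading overhead, none of which this single measurement can tell apart.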

6. Explore GPU offloading AutoFixes of Codee

You are free to explore the other parallelization techniques Codee supports. For example, try generating pragmas for GPU offloading: replace --target-arch cpu with --target-arch gpu and you will get checkers for offloading. You can also get checkers for both targets at once with --target-arch cpu,gpu.
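
As a quick reference, the sketch below prints the screening command for each of the three --target-arch values mentioned here. It only prints the commands so they can be reviewed and pasted; it does not invoke Codee, and the rest of each command line is the same as in the CPU walkthrough above.

```shell
src=mtc_openmp.F90

# Build one screening command per target; \$(...) keeps realpath unexpanded in the printed text.
gpu_commands=$(
    for arch in cpu gpu cpu,gpu; do
        echo "codee screening --target-arch $arch \$(realpath $src) --compile-commands compile_commands.json"
    done
)
echo "$gpu_commands"
```

The same substitution applies to the codee checks commands used earlier in this guide.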

Remember that you can use the Codee Compiler Driven Mode (enabled with the --compiler-driven-mode flag) to generate offloading pragmas optimized specifically for your target compiler. Codee generates specific offloading pragmas for the following compilers: crayftn and nvfortran, while it will generate generic offloading pragmas for the rest (or whenever the --compiler-driven-mode flag is not used).
