NUCCOR (Fortran)

Acknowledgment

We gratefully acknowledge the NERSC team for providing access to their platform, which enabled us to conduct our experiments with the Cray Compiler Environment (CCE).

Goal

Walk you through the usage of Codee to optimize NUCCOR, a nuclear physics code used to calculate the properties of atomic nuclei and their reactions, by parallelizing computations on the CPU with OpenMP.

Prerequisites

Ensure you have installed:

  • Codee.
  • The Cray Compiler Environment (CCE).
  • The codee-demos repository on your machine.

To clone the necessary repository, just execute the following in your terminal:

git clone https://github.com/codee-com/codee-demos.git

Getting started

First, navigate to the source code for NUCCOR:

cd codee-demos/Fortran/NUCCOR

Important note

The NUCCOR code within this repository is just the kernel of NUCCOR (mtc.F90), whose module is used by a benchmark program (mtc_main.F90). A peculiarity of this code is that the MTC kernel lives in two identical files, mtc.F90 and mtc_openmp.F90, and both are used by the benchmark program. The idea is to use Codee to optimize only the mtc_openmp.F90 file, leaving mtc.F90 unchanged. This way, the benchmark can use the execution time of the original version (mtc.F90) as a single-threaded baseline.
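
To make the comparison concrete, the following minimal sketch shows how such a driver can time the two kernels and report each wall-clock time separately. The program below uses hypothetical placeholders instead of the real module interfaces; see mtc_main.F90 for the actual driver.

program benchmark_sketch
    implicit none
    integer(8) :: t0, t1, rate
    real(8)    :: t_reference, t_openmp

    call system_clock(count_rate=rate)

    ! Time the kernel from mtc.F90 (left untouched): the single-threaded baseline
    call system_clock(t0)
    ! ... call contract_simple from the mtc.F90 module here ...
    call system_clock(t1)
    t_reference = real(t1 - t0, 8) / real(rate, 8)

    ! Time the kernel from mtc_openmp.F90: the file Codee will parallelize
    call system_clock(t0)
    ! ... call contract_simple from the mtc_openmp.F90 module here ...
    call system_clock(t1)
    t_openmp = real(t1 - t0, 8) / real(rate, 8)

    print *, 'Time spent in simple contraction reference:', t_reference
    print *, 'Time spent in contraction Openmp version:', t_openmp
end program benchmark_sketch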

Walkthrough

1. Explore the source code

For the optimization, we will focus on this nested loop within mtc_openmp.F90, which performs a tensor contraction:

do j = 1, size(dst, 4)
    do i = 1, size(dst, 3)
        do b = 1, size(dst, 2)
            do a = 1, size(dst, 1)
                temp = 0.0d0
                do m = 1, size(op, 1)
                    do f = 1, size(op, 3)
                        do e = 1, size(op, 4)
                            temp = temp + op(m, a, e, f)*src(e, f, b, i, j, m)
                        end do
                    end do
                end do
                dst(a, b, i, j) = 0.5d0*temp
            end do
        end do
    end do
end do
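
In index notation, this loop nest computes the contraction

    dst(a, b, i, j) = 0.5 * Σ_{m,e,f} op(m, a, e, f) * src(e, f, b, i, j, m)

where the sum runs over the m, e, and f indices and temp accumulates the partial sums for each output element.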

2. Generate the compile_commands.json

The project comes with a Makefile, so we can leverage the tool bear (which comes bundled with Codee) to generate the compile_commands.json file required by Codee. First, we need to modify the Makefile so that it uses the Cray Fortran compiler through the ftn compiler wrapper:

sed -i 's/^FC = gfortran/FC = ftn/' Makefile

Now compile the code using bear to capture the compilation commands:

bear -- make

It's as simple as prepending bear -- to the make invocation. This command will produce a compile_commands.json file with all the compiler invocations needed to build the source files.
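
For reference, each entry in compile_commands.json records the working directory, the exact compiler command, and the source file it applies to. An entry for mtc_openmp.F90 might look roughly like this (the path and flags below are illustrative; the actual values depend on your checkout location and the Makefile):

[
  {
    "directory": "/path/to/codee-demos/Fortran/NUCCOR",
    "command": "ftn -c -o mtc_openmp.o mtc_openmp.F90",
    "file": "mtc_openmp.F90"
  }
]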

3. Run the checks report

Note

It is recommended to run the screening report first to obtain a ranking of the checkers, which can help you decide which one to implement first.

Note: The way bear interacts with Perlmutter's filesystems causes the compilation database to list project files under /global/u2 instead of /global/home. This will prevent Codee from locating the source files when using file filters. To resolve this, we will use the realpath command to adjust the filters.

To explore how Codee can help speed up this loop by parallelizing it, use --target-arch to include CPU-related checks in the analysis:

Codee command
codee checks --target-arch cpu $(realpath mtc_openmp.F90) -p compile_commands.json
Codee output
Configuration file 'compile_commands.json' successfully parsed.
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Performing Fortran module dependency analysis... Done

[Dep] mtc_patch.f90 ... Done
[1/1] mtc_openmp.F90 ... Done

CHECKS REPORT

mtc_openmp.F90 [RMK015] (level: L1): Tune compiler optimization flags to increase the speed of the code
mtc_openmp.F90:69:21 [PWR053] (level: L1): Consider applying vectorization to forall loop
mtc_openmp.F90:47:25 [PWR054] (level: L1): Consider applying vectorization to scalar reduction loop
mtc_openmp.F90:59:5 [PWR003] (level: L2): Explicitly declare pure functions
mtc_openmp.F90:77:5 [PWR003] (level: L2): Explicitly declare pure functions
mtc_openmp.F90:111:5 [PWR003] (level: L2): Explicitly declare pure functions
mtc_openmp.F90:86:9 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
mtc_openmp.F90:47:25 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
mtc_openmp.F90:66:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:86:9 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:87:13 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:88:17 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:89:21 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:91:25 [PWR035] (level: L3): Avoid non-consecutive array access to improve performance
mtc_openmp.F90:93:33 [RMK010] (level: L4): Strided memory accesses in the loop body may prevent vectorization

SUGGESTIONS

Use --verbose to get more details, e.g:
codee checks --verbose --target-arch cpu mtc_openmp.F90 -p compile_commands.json

Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id RMK015 --target-arch cpu mtc_openmp.F90 -p compile_commands.json

1 file, 6 functions, 16 loops, 96 LOCs successfully analyzed (15 checkers) and 0 non-analyzed files in 591 ms

We can also run the checks report with detailed output (option --verbose) to obtain more information about each checker. This detailed output includes links to the Open Catalog, along with the precise location in the source code. However, the additional information that verbose mode brings can be overwhelming when many checkers are reported. To keep the output manageable, use the --check-id flag to filter it.

Let's focus on the checker PWR050, which is related to parallelizing loops with multithreading:

Codee command
codee checks --target-arch cpu $(realpath mtc_openmp.F90) -p compile_commands.json --check-id PWR050 --verbose
Codee output
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Performing Fortran module dependency analysis... Done

[Dep] mtc_patch.f90 ... Done
[1/1] mtc_openmp.F90 ... Done

CHECKS REPORT

mtc_openmp.F90:86:9 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR050
AutoFix (choose one option):
* Using OpenMP 'for' (recommended):
codee rewrite --check-id pwr050 --variant omp-for --in-place mtc_openmp.F90:86:9 -p compile_commands.json
* Using Fortran native 'do concurrent':
codee rewrite --check-id pwr050 --variant native --in-place mtc_openmp.F90:86:9 -p compile_commands.json
* Using OpenMP 'taskwait':
codee rewrite --check-id pwr050 --variant omp-taskwait --in-place mtc_openmp.F90:86:9 -p compile_commands.json
* Using OpenMP 'taskloop':
codee rewrite --check-id pwr050 --variant omp-taskloop --in-place mtc_openmp.F90:86:9 -p compile_commands.json

1 file, 6 functions, 16 loops, 96 LOCs successfully analyzed (1 checker) and 0 non-analyzed files in 499 ms

4. Autofix

Apply the multithreading AutoFix, choosing the first (recommended) rewriting option, "Using OpenMP 'for'":

Codee command
codee rewrite --check-id pwr050 --variant omp-for --in-place $(realpath mtc_openmp.F90):86:9 -p compile_commands.json
Codee output
Note: the compilation database entries will be analyzed in the order necessary to meet module dependencies between Fortran source files.
Configuration file 'compile_commands.json' successfully parsed.
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Performing Fortran module dependency analysis... Done

[Dep] mtc_patch.f90 ... Done
[1/1] mtc_openmp.F90 ... Done

Results for file '=mtc_openmp.F90':
Successfully applied AutoFix to the loop at 'mtc_openmp.F90:contract_simple:86:9' [using multi-threading]:
[INFO] mtc_openmp.F90:86:9 Parallel forall: variable 'dst'
[INFO] mtc_openmp.F90:86:9 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] mtc_openmp.F90:86:9 Parallel region defined by OpenMP directive 'parallel'

Successfully updated mtc_openmp.F90

Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities

Let's see what changes codee rewrite has applied to the code:

git diff mtc_openmp.F90
diff --git a/Fortran/NUCCOR/mtc_openmp.F90 b/Fortran/NUCCOR/mtc_openmp.F90
index b7aba0c..f3c5ef6 100644
--- a/Fortran/NUCCOR/mtc_openmp.F90
+++ b/Fortran/NUCCOR/mtc_openmp.F90
@@ -83,6 +83,10 @@ contains
         integer :: nh, np, i, j, m, a, b, e, f
         real(real64) :: temp
 
+        ! Codee: Loop modified by Codee (2025-04-16 01:53:27)
+        ! Codee: Technique applied: multithreading with 'omp-for' pragmas
+        !$omp parallel default(none) shared(dst, op, src) private(a, b, e, f, i, j, m, temp)
+        !$omp do private(a, b, e, f, i, m, temp) schedule(auto)
         do j = 1, size(dst, 4)
             do i = 1, size(dst, 3)
                 do b = 1, size(dst, 2)
@@ -100,6 +104,7 @@ contains
                 end do
             end do
         end do
+        !$omp end parallel
     end subroutine contract_simple
 
     subroutine cleanup(this)
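
A quick note on the inserted directives: default(none) forces every variable to be given an explicit data-sharing attribute, shared(dst, op, src) lets all threads work on the same arrays, private(...) gives each thread its own copy of the loop indices and the temp accumulator, and schedule(auto) leaves the distribution of iterations to the compiler and runtime. The following minimal, self-contained example (not part of NUCCOR) shows the same omp-for pattern applied to a simple loop:

program omp_for_pattern
    implicit none
    integer, parameter :: n = 1000000
    real(8), allocatable :: x(:)
    integer :: i

    allocate(x(n))

    ! Same structure as the Codee rewrite: an explicit parallel region plus a
    ! worksharing 'do' loop with explicit data-sharing clauses.
    !$omp parallel default(none) shared(x) private(i)
    !$omp do schedule(auto)
    do i = 1, n
        x(i) = sqrt(real(i, 8))
    end do
    !$omp end do
    !$omp end parallel

    print *, 'x(n) =', x(n)
end program omp_for_pattern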

5. Execution

Finally, recompile the code and execute it to compare the performance of the original version against the one optimized by Codee.

make clean
make

To run the new version, let's request access to an interactive node:

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu

Once inside the interactive node, execute the code by running:

./mtc.x 30 70 10 0.1 yes contract_simple
Benchmark output
 nh:           30
 np:           40
 nab:          16
 nc:            4
 nij:           9
 nk:            3
 Allocated cmap:  T
 Allocated kmap:  T
 Memory usage for simple contract:   12.900 Gb
 Time spent in simple contraction reference:   60.713642340968363
 Time spent in contraction Openmp version:   3.1684048759052530
 Test simple contraction:  OK

6. Results

The original version of the MTC contract_simple kernel ran in 60.7 seconds, while the multithreaded version took just 3.2 seconds, a speedup of roughly 19x (60.7 / 3.2 ≈ 19).