COULOMB optimization through CPU parallelism

Goal

This guide walks you through using Codee to optimize Coulomb, a code that computes electrical currents in 2D planes, by parallelizing its computations on the CPU.

This guide is part of the NERSC + Codee Training Series 2024. Code available for download at the previous link.

Getting started

First, navigate to the source code for COULOMB:

cd codee-demos/C/COULOMB

Next, load the latest Codee version available on Perlmutter:

module load codee/2024.3.1

Walkthrough

1. Explore the source code

The computation is handled by a triple-nested for loop within coulomb.c:

for(int i=0; i<rows; i++) {
    for(int j=0; j<cols; j++) {
        double mat_ij = 0;
        for(int k = 0; k < size; k+=4) {
            double dx = vec[k+0] - (scaleX * j + x0);
            double dy = vec[k+1] - (scaleY * i + y0);
            double dz = vec[k+2] - (z0);
            double charge = 1e-9 * vec[k+3];
            double dist = sqrt(dx*dx + dy*dy + dz*dz);
            mat_ij += charge / dist;
        }
        mat[j + i * cols] = mat_ij / (4 * PI * e0);
    }
}

2. Run the checks report

To explore how Codee can help speed up this loop by parallelizing it on the CPU, use --target-arch to include CPU-related checks in the analysis:

Codee command
codee checks --verbose --target-arch cpu coulomb.c:coulomb -- gcc coulomb.c -Ofast -lm

Codee output
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc coulomb.c -Ofast -lm

[1/1] coulomb.c ... Done

CHECKS REPORT

<...>

coulomb.c:26:2 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
  Suggestion: Use 'rewrite' to automatically optimize the code
  Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR050
  AutoFix (choose one option):
      * Using OpenMP 'for' (recommended):
        codee rewrite --multi omp-for --in-place coulomb.c:26:2 -- gcc coulomb.c -Ofast -lm
      * Using OpenMP 'taskwait':
        codee rewrite --multi omp-taskwait --in-place coulomb.c:26:2 -- gcc coulomb.c -Ofast -lm
      * Using OpenMP 'taskloop':
        codee rewrite --multi omp-taskloop --in-place coulomb.c:26:2 -- gcc coulomb.c -Ofast -lm

<...>

SUGGESTIONS

  Use --check-id to focus on specific subsets of checkers, e.g.:
        codee checks --check-id PWR045 --verbose --target-arch cpu coulomb.c:coulomb -- gcc coulomb.c -Ofast -lm

1 file, 1 function, 3 loops successfully analyzed (6 checkers) and 0 non-analyzed files in 54 ms

Codee offers three OpenMP-based AutoFix options for parallelizing the loop and recommends the 'for' variant (omp-for).

3. Autofix

OpenMP

Let's use Codee's autofix capabilities to automatically optimize the code, following the recommended option.

We can copy and paste the suggested Codee invocation to perform the parallelization; just replace the --in-place argument with -o to create a new file with the modified code:

Codee command
codee rewrite --multi omp-for -o coulomb_omp.c coulomb.c:26:2 -- gcc coulomb.c -Ofast -lm

Codee output
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc coulomb.c -Ofast -lm

Results for file '/global/homes/u/user/codee-demos/C/COULOMB/coulomb.c':
  Successfully applied AutoFix to the loop at 'coulomb.c:coulomb:26:2' [using multi-threading]:
      [INFO] coulomb.c:26:2 Parallel forall: variable 'mat'
      [INFO] coulomb.c:26:2 Loop parallelized with multithreading using OpenMP directive 'for'
      [INFO] coulomb.c:26:2 Parallel region defined by OpenMP directive 'parallel'
      [INFO] coulomb.c:26:2 Make sure there is no aliasing among variables: vec, mat

Successfully created coulomb_omp.c

Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
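
The aliasing note in the output above asks you to confirm that vec and mat do not refer to overlapping memory, since the parallel loop reads one and writes the other. A minimal sketch of such a check, assuming you know the byte size of each buffer at the call site; buffers_overlap is a hypothetical helper and is not part of Codee or the demo:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the byte ranges [a, a + a_bytes) and
 * [b, b + b_bytes) share at least one byte, and 0 otherwise. */
int buffers_overlap(const void *a, size_t a_bytes,
                    const void *b, size_t b_bytes) {
    assert(a != NULL && b != NULL);
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;
    return a0 < b0 + b_bytes && b0 < a0 + a_bytes;
}
```

A check like this could run once before the parallel region, for example inside an assert, so it adds no cost to the loop itself.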

diff -C 2 coulomb.c coulomb_omp.c
        double scaleY = (y1 - y0) / rows;
  
+       // Codee: Loop modified by Codee (2024-09-05 04:27:26)
+       // Codee: Technique applied: multithreading with 'omp-for' pragmas
+       #pragma omp parallel default(none) shared(PI, cols, e0, mat, rows, scaleX, scaleY, size, vec, x0, y0, z0)
+       {
+       #pragma omp for schedule(auto)
        for(int i=0; i<rows; i++) {
                for(int j=0; j<cols; j++) {
***************
*** 38,41 ****
--- 43,47 ----
                }
        }
+       } // end parallel
  }
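
The same pattern as in the diff above, a parallel region with default(none) wrapping an omp for over the outer loop, can be tried on a smaller example. The sketch below is illustrative only: fill_serial, fill_parallel, and same_contents are hypothetical names, and the loop body is a stand-in for the Coulomb computation. Each iteration writes a distinct element of mat, which is what makes the loop safe to split across threads.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reference: each (i, j) pair writes exactly one element of mat. */
void fill_serial(double *mat, int rows, int cols) {
    assert(mat != NULL);
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            mat[j + i * cols] = 0.5 * i + 2.0 * j;
        }
    }
}

/* Same loop with the directives from the diff. Rows are distributed across
 * threads; i and j are declared inside the loops, so they are private. */
void fill_parallel(double *mat, int rows, int cols) {
    assert(mat != NULL);
    #pragma omp parallel default(none) shared(mat, rows, cols)
    {
        #pragma omp for schedule(auto)
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                mat[j + i * cols] = 0.5 * i + 2.0 * j;
            }
        }
    } // end parallel
}

/* Returns 1 if the first n elements of a and b are identical. */
int same_contents(const double *a, const double *b, size_t n) {
    for (size_t k = 0; k < n; k++) {
        if (a[k] != b[k]) {
            return 0;
        }
    }
    return 1;
}
```

Compiled with -fopenmp, fill_parallel runs multithreaded; without that flag the compiler ignores the pragmas and the function runs serially. Both functions should produce the same matrix in either case.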

4. Execution

Finally, compile and run both the original and the optimized code to assess the speed improvements. The following SLURM scripts can be used as reference; create launch.sh and COULOMB.sh, and add execution permissions to the latter:

chmod u+x COULOMB.sh

launch.sh
#!/bin/bash

#SBATCH --account=ntrain6
#SBATCH --job-name=codee_c_coulomb

#SBATCH --constraint=cpu
#SBATCH --qos=regular
#SBATCH --reservation=codee_day1
#SBATCH --time=0:05:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
srun COULOMB.sh

COULOMB.sh
#!/bin/bash

rm -f coulomb coulomb_omp 

gcc Vector.c Matrix2D.c coulomb.c -Ofast -lm -o coulomb
./coulomb 600

gcc Vector.c Matrix2D.c coulomb_omp.c -Ofast -fopenmp -lm -o coulomb_omp
./coulomb_omp 600

The optimized version ran in 1.90 seconds, while the original took 27.68 seconds, which represents a speedup of 14.57x.
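
The speedup figure is the ratio of the two run times. A minimal sketch of that arithmetic follows; the function names are hypothetical. The efficiency helper divides the speedup by a thread count that the caller must supply, because the number of OpenMP threads actually used in the run is not stated in this guide.

```c
#include <assert.h>

/* Speedup: how many times faster the optimized run is than the original. */
double speedup(double original_seconds, double optimized_seconds) {
    assert(original_seconds > 0.0 && optimized_seconds > 0.0);
    return original_seconds / optimized_seconds;
}

/* Parallel efficiency: fraction of ideal linear scaling achieved.
 * The thread count is an input; it is not reported in the guide. */
double parallel_efficiency(double speedup_factor, int threads) {
    assert(threads > 0);
    return speedup_factor / threads;
}
```

Calling speedup(27.68, 1.90) gives roughly 14.57, matching the figure above.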
