ATMUX parallelization on CPU at Perlmutter
This guide walks you through using Codee to optimize a sparse matrix-vector multiplication code by parallelizing its computation on the CPU.
This guide is part of the NERSC + Codee Training Series 2024. The code is available for download at the previous link.
Getting started
First, navigate to the source code for ATMUX:
cd codee-demos/C/ATMUX
Next, load the latest Codee version available on Perlmutter:
module load codee/2024.3.1
Walkthrough
1. Explore the source code
The computation is handled by a doubly nested for loop within atmux.c:
// y = A^T x
for (int i = 0; i < n; i++) {
for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
y[col_ind[k]] = y[col_ind[k]] + x[i] * val[k];
}
}
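For context, the matrix is stored in Compressed Row Storage (CRS/CSR) format, as handled by lib/CRSMatrix.c: row_ptr marks where each row's entries begin, col_ind holds the column index of each stored value, and val holds the values themselves. Because several rows can have entries in the same column, different iterations of the outer loop may update the same element of y; this is the sparse reduction pattern that Codee flags below. The following standalone sketch, using a small hypothetical matrix hard-coded for illustration (it is not part of atmux.c), shows the same y = A^T x kernel on CRS data:
#include <stdio.h>

int main(void) {
    // Hypothetical 3x3 matrix in CRS format (illustration only):
    //     [ 1 0 2 ]
    // A = [ 0 3 0 ]
    //     [ 4 0 5 ]
    int n = 3;
    int row_ptr[] = {0, 2, 3, 5};          // start of each row in val/col_ind
    int col_ind[] = {0, 2, 1, 0, 2};       // column of each stored value
    double val[]  = {1.0, 2.0, 3.0, 4.0, 5.0};
    double x[] = {1.0, 1.0, 1.0};
    double y[] = {0.0, 0.0, 0.0};

    // y = A^T x: row i of A contributes x[i] * val[k] to y[col_ind[k]]
    for (int i = 0; i < n; i++) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            y[col_ind[k]] = y[col_ind[k]] + x[i] * val[k];
        }
    }

    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   // expected: [5, 3, 7]
    return 0;
}
Here rows 0 and 2 both store entries in columns 0 and 2, so iterations i = 0 and i = 2 write to the same elements of y. This is harmless when the loop runs serially, but it is exactly the conflict that must be protected once the outer loop is parallelized.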
2. Run the checks report
To explore how Codee can help speed up this loop by parallelizing its execution on the CPU, use --target-arch to include multithreading-related checks in the analysis (everything after the -- separator is the compiler invocation Codee uses to build and analyze the file):
codee checks --verbose --target-arch cpu atmux.c:atmux -- gcc atmux.c -lm -Ofast -I lib/
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc atmux.c -lm -Ofast -I lib/
[1/1] atmux.c ... Done
CHECKS REPORT
<...>
atmux.c:22:5 [PWR052] (level: L2): Consider applying multithreading parallelism to sparse reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR052
AutoFix (choose one option):
* Using OpenMP 'for' with atomic protection (recommended):
codee rewrite --multi omp-for --in-place atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
* Using OpenMP 'for' with explicit privatization:
codee rewrite --multi omp-for --in-place --explicit-privatization y atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
* Using OpenMP 'taskwait':
codee rewrite --multi omp-taskwait --in-place atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
* Using OpenMP 'taskloop':
codee rewrite --multi omp-taskloop --in-place atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
<...>
1 file, 1 function, 3 loops successfully analyzed (8 checkers) and 0 non-analyzed files in 47 ms
Codee suggests various options to optimize the loop by automatically generating OpenMP directives.
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code. We will create two new files, one using the atomic protection strategy and another using explicit privatization, to compare their performance.
Atomic protection
We can copy-paste the suggested Codee invocation to generate the OpenMP parallelization based on atomic protection; replace the --in-place argument with -o to create a new file with the modification:
codee rewrite --multi omp-for -o atmux_atomic.c atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc atmux.c -lm -Ofast -I lib/
Results for file '/global/homes/u/user/codee-demos/C/ATMUX/atmux.c':
Successfully applied AutoFix to the loop at 'atmux.c:atmux:22:5' [using multi-threading]:
[INFO] atmux.c:22:5 Parallel sparse reduction pattern identified for variable 'y' with associative, commutative operator '+'
[INFO] atmux.c:22:5 Available parallelization strategies for variable 'y'
[INFO] atmux.c:22:5 #1 OpenMP atomic access (* implemented)
[INFO] atmux.c:22:5 #2 OpenMP explicit privatization
[INFO] atmux.c:22:5 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] atmux.c:22:5 Parallel region defined by OpenMP directive 'parallel'
[INFO] atmux.c:22:5 Make sure there is no aliasing among variables: row_ptr, col_ind
[INFO] atmux.c:22:5 Make sure there is no aliasing among variables: x, val, y
Successfully created atmux_atomic.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
Because different iterations of the outer loop can update the same element of y, each update must be protected. Note the presence of the #pragma omp atomic update directive in the updated code:
// y = A^T x
+ // Codee: Loop modified by Codee (2024-09-09 04:31:03)
+ // Codee: Technique applied: multithreading with 'omp-for' pragmas
+ #pragma omp parallel default(none) shared(col_ind, n, row_ptr, val, x, y)
+ {
+ #pragma omp for schedule(auto)
for (int i = 0; i < n; i++) {
for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
+ #pragma omp atomic update
y[col_ind[k]] = y[col_ind[k]] + x[i] * val[k];
}
}
+ } // end parallel
Explicit privatization
Just like before, copy-paste the suggested Codee invocation, replacing the --in-place argument with -o:
codee rewrite --multi omp-for -o atmux_explicit.c --explicit-privatization y atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc atmux.c -lm -Ofast -I lib/
Results for file '/global/homes/u/user/codee-demos/C/ATMUX/atmux.c':
Successfully applied AutoFix to the loop at 'atmux.c:atmux:22:5' [using multi-threading]:
[INFO] atmux.c:22:5 Parallel sparse reduction pattern identified for variable 'y' with associative, commutative operator '+'
[INFO] atmux.c:22:5 Available parallelization strategies for variable 'y'
[INFO] atmux.c:22:5 #1 OpenMP explicit privatization (* implemented)
[INFO] atmux.c:22:5 #2 OpenMP atomic access
[INFO] atmux.c:22:5 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] atmux.c:22:5 Please enter access range for variable 'y'
[INFO] atmux.c:22:5 Critical section guarantees correctness of the reduction operation
[INFO] atmux.c:22:5 Parallel region defined by OpenMP directive 'parallel'
[INFO] atmux.c:22:5 Make sure there is no aliasing among variables: row_ptr, col_ind
[INFO] atmux.c:22:5 Make sure there is no aliasing among variables: x, val, y
Successfully created atmux_explicit.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
In this scenario, Codee generates a code template that the programmer must fill with additional information:
// y = A^T x
+ // Codee: Loop modified by Codee (2024-09-09 04:31:53)
+ // Codee: Technique applied: multithreading with 'omp-for' pragmas
+ #pragma omp parallel default(none) shared(col_ind, n, row_ptr, val, x, y)
+ {
+ unsigned int y_length = /* y start */ + /* y length */;
+ double *y_private = (double *) malloc(sizeof(double) * y_length);
+ for (int i = /* y start */; i < y_length; ++i) {
+ y_private[i] = 0;
+ }
+ #pragma omp for schedule(auto)
for (int i = 0; i < n; i++) {
for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
! y_private[col_ind[k]] = y_private[col_ind[k]] + x[i] * val[k];
}
}
+ #pragma omp critical
+ for(int i = /* y start */; i < y_length; ++i) {
+ y[i] += y_private[i];
+ }
+ free(y_private);
+ } // end parallel
}
Replace /* y start */ and /* y length */ with the appropriate values, 0 and n respectively, to complete the code.
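For reference, with those values substituted, the parallel region would look as follows (this is simply the Codee-generated template from above with the placeholders filled in; make sure stdlib.h is included so that malloc and free are declared):
#pragma omp parallel default(none) shared(col_ind, n, row_ptr, val, x, y)
{
    // Each thread accumulates into its own private copy of y
    unsigned int y_length = 0 + n;
    double *y_private = (double *) malloc(sizeof(double) * y_length);
    for (int i = 0; i < y_length; ++i) {
        y_private[i] = 0;
    }
    #pragma omp for schedule(auto)
    for (int i = 0; i < n; i++) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            y_private[col_ind[k]] = y_private[col_ind[k]] + x[i] * val[k];
        }
    }
    // Merge each private copy into the shared y, one thread at a time
    #pragma omp critical
    for (int i = 0; i < y_length; ++i) {
        y[i] += y_private[i];
    }
    free(y_private);
} // end parallel
Each thread accumulates into its own y_private array, and the copies are merged into y inside a critical section, so no atomic operation is needed in the innermost loop.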
4. Execution
Finally, compile and run both the original and the optimized codes to assess the speed improvements. The following SLURM scripts can be used as a reference; create launch.sh and ATMUX.sh, and add execution permissions to the latter:
chmod u+x ATMUX.sh
#!/bin/bash
#SBATCH --account=ntrain6
#SBATCH --job-name=codee_c_atmux_cpu
#SBATCH --constraint=cpu
#SBATCH --qos=shared
#SBATCH --reservation=codee_day2
#SBATCH --time=0:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
export SLURM_CPU_BIND="cores"
export OMP_NUM_THREADS=32
srun ATMUX.sh
#!/bin/bash
module load PrgEnv-gnu
rm -f atmux atmux_atomic atmux_explicit
gcc atmux.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -Ofast -o atmux
./atmux 17000
gcc atmux_atomic.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -Ofast -fopenmp -o atmux_atomic
./atmux_atomic 17000
gcc atmux_explicit.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -Ofast -fopenmp -o atmux_explicit
./atmux_explicit 17000
After submitting the job (for example, with sbatch launch.sh), compare the reported times.
The atomic version ran in 0.79 seconds, while the original took 0.85 seconds, a speedup of 1.08x.
The explicit-privatization version ran in 0.28 seconds, while the original took 0.85 seconds, a speedup of 2.07x. The privatized version avoids per-element atomic updates, paying only one critical-section merge per thread, which is why it scales noticeably better than the atomic version.