ATMUX parallelization on CPU at Perlmutter
This guide walks you through using Codee to optimize a sparse matrix-vector multiplication code by parallelizing its computation on the CPU.
This guide is part of the NERSC + Codee Training Series 2024. The code is available for download at the previous link.
Getting started
First, navigate to the source code for ATMUX:
cd codee-demos/C/ATMUX
Next, load the latest Codee version available on Perlmutter:
module load codee/2024.3.1
Walkthrough
1. Explore the source code
The computation is handled by a doubly nested for loop within atmux.c:
// y = A^T x
for (int i = 0; i < n; i++) {
for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
y[col_ind[k]] = y[col_ind[k]] + x[i] * val[k];
}
}
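For context, the matrix is stored in Compressed Row Storage (CRS/CSR) format, as handled by lib/CRSMatrix.c: row_ptr marks where each row's entries begin, col_ind holds the column index of each stored value, and val holds the values themselves. Because several rows can have entries in the same column, different iterations of the outer loop may update the same element of y; this is the sparse reduction pattern that Codee flags below. The following standalone sketch, using a small hypothetical matrix hard-coded for illustration (it is not part of atmux.c), shows the same y = A^T x kernel on CRS data:
#include <stdio.h>

int main(void) {
    // Hypothetical 3x3 matrix in CRS format (illustration only):
    //     [ 1 0 2 ]
    // A = [ 0 3 0 ]
    //     [ 4 0 5 ]
    int n = 3;
    int row_ptr[] = {0, 2, 3, 5};          // start of each row in val/col_ind
    int col_ind[] = {0, 2, 1, 0, 2};       // column of each stored value
    double val[]  = {1.0, 2.0, 3.0, 4.0, 5.0};
    double x[] = {1.0, 1.0, 1.0};
    double y[] = {0.0, 0.0, 0.0};

    // y = A^T x: row i of A contributes x[i] * val[k] to y[col_ind[k]]
    for (int i = 0; i < n; i++) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            y[col_ind[k]] = y[col_ind[k]] + x[i] * val[k];
        }
    }

    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   // expected: [5, 3, 7]
    return 0;
}
Here rows 0 and 2 both store entries in columns 0 and 2, so iterations i = 0 and i = 2 write to the same elements of y. This is harmless when the loop runs serially, but it is exactly the conflict that must be protected once the outer loop is parallelized.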
2. Run the checks report
To explore how Codee can help speed up this loop by parallelizing its execution on the CPU, use --target-arch to include multithreading-related checks in the analysis (everything after the -- separator is the compiler invocation Codee uses to build and analyze the file):
codee checks --verbose --target-arch cpu atmux.c:atmux -- gcc atmux.c -lm -Ofast -I lib/
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc atmux.c -lm -Ofast -I lib/
[1/1] atmux.c ... Done
CHECKS REPORT
<...>
atmux.c:22:5 [PWR052] (level: L2): Consider applying multithreading parallelism to sparse reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR052
AutoFix (choose one option):
* Using OpenMP 'for' with atomic protection (recommended):
codee rewrite --multi omp-for --in-place atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
* Using OpenMP 'for' with explicit privatization:
codee rewrite --multi omp-for --in-place --explicit-privatization y atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
* Using OpenMP 'taskwait':
codee rewrite --multi omp-taskwait --in-place atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
* Using OpenMP 'taskloop':
codee rewrite --multi omp-taskloop --in-place atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
<...>
1 file, 1 function, 3 loops successfully analyzed (8 checkers) and 0 non-analyzed files in 47 ms
Codee suggests various options to optimize the loop by automatically generating OpenMP directives.
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code. We will create two new files, one using the atomic protection strategy and another using explicit privatization, to compare their performance.
Atomic protection
We can copy-paste the suggested Codee invocation to generate the OpenMP parallelization based on atomic protection; replace the --in-place argument with -o to create a new file with the modification:
codee rewrite --multi omp-for -o atmux_atomic.c atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc atmux.c -lm -Ofast -I lib/
Results for file '/global/homes/u/user/codee-demos/C/ATMUX/atmux.c':
Successfully applied AutoFix to the loop at 'atmux.c:atmux:22:5' [using multi-threading]:
[INFO] atmux.c:22:5 Parallel sparse reduction pattern identified for variable 'y' with associative, commutative operator '+'
[INFO] atmux.c:22:5 Available parallelization strategies for variable 'y'
[INFO] atmux.c:22:5 #1 OpenMP atomic access (* implemented)
[INFO] atmux.c:22:5 #2 OpenMP explicit privatization
[INFO] atmux.c:22:5 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] atmux.c:22:5 Parallel region defined by OpenMP directive 'parallel'
[INFO] atmux.c:22:5 Make sure there is no aliasing among variables: row_ptr, col_ind
[INFO] atmux.c:22:5 Make sure there is no aliasing among variables: x, val, y
Successfully created atmux_atomic.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
Because different iterations of the outer loop can update the same element of y, each update must be protected. Note the presence of the #pragma omp atomic update directive in the updated code:
// y = A^T x
+ // Codee: Loop modified by Codee (2024-09-09 04:31:03)
+ // Codee: Technique applied: multithreading with 'omp-for' pragmas
+ #pragma omp parallel default(none) shared(col_ind, n, row_ptr, val, x, y)
+ {
+ #pragma omp for schedule(auto)
for (int i = 0; i < n; i++) {
for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
+ #pragma omp atomic update
y[col_ind[k]] = y[col_ind[k]] + x[i] * val[k];
}
}
+ } // end parallel
Explicit privatization
Just like before, copy-paste the suggested Codee invocation, replacing the --in-place argument with -o:
codee rewrite --multi omp-for -o atmux_explicit.c --explicit-privatization y atmux.c:22:5 -- gcc atmux.c -lm -Ofast -I lib/
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc atmux.c -lm -Ofast -I lib/
Results for file '/global/homes/u/user/codee-demos/C/ATMUX/atmux.c':
Successfully applied AutoFix to the loop at 'atmux.c:atmux:22:5' [using multi-threading]:
[INFO] atmux.c:22:5 Parallel sparse reduction pattern identified for variable 'y' with associative, commutative operator '+'
[INFO] atmux.c:22:5 Available parallelization strategies for variable 'y'
[INFO] atmux.c:22:5 #1 OpenMP explicit privatization (* implemented)
[INFO] atmux.c:22:5 #2 OpenMP atomic access
[INFO] atmux.c:22:5 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] atmux.c:22:5 Please enter access range for variable 'y'
[INFO] atmux.c:22:5 Critical section guarantees correctness of the reduction operation
[INFO] atmux.c:22:5 Parallel region defined by OpenMP directive 'parallel'
[INFO] atmux.c:22:5 Make sure there is no aliasing among variables: row_ptr, col_ind
[INFO] atmux.c:22:5 Make sure there is no aliasing among variables: x, val, y
Successfully created atmux_explicit.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
In this scenario, Codee generates a code template that the programmer must fill with additional information:
// y = A^T x
+ // Codee: Loop modified by Codee (2024-09-09 04:31:53)
+ // Codee: Technique applied: multithreading with 'omp-for' pragmas
+ #pragma omp parallel default(none) shared(col_ind, n, row_ptr, val, x, y)
+ {
+ unsigned int y_length = /* y start */ + /* y length */;
+ double *y_private = (double *) malloc(sizeof(double) * y_length);
+ for (int i = /* y start */; i < y_length; ++i) {
+ y_private[i] = 0;
+ }
+ #pragma omp for schedule(auto)
for (int i = 0; i < n; i++) {
for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
! y_private[col_ind[k]] = y_private[col_ind[k]] + x[i] * val[k];
}
}
+ #pragma omp critical
+ for(int i = /* y start */; i < y_length; ++i) {
+ y[i] += y_private[i];
+ }
+ free(y_private);
+ } // end parallel
}
Replace /* y start */ and /* y length */ with the appropriate values, 0 and n respectively, to complete the code.
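For reference, with those values substituted, the parallel region would look as follows (this is simply the Codee-generated template from above with the placeholders filled in; make sure stdlib.h is included so that malloc and free are declared):
#pragma omp parallel default(none) shared(col_ind, n, row_ptr, val, x, y)
{
    // Each thread accumulates into its own private copy of y
    unsigned int y_length = 0 + n;
    double *y_private = (double *) malloc(sizeof(double) * y_length);
    for (int i = 0; i < y_length; ++i) {
        y_private[i] = 0;
    }
    #pragma omp for schedule(auto)
    for (int i = 0; i < n; i++) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            y_private[col_ind[k]] = y_private[col_ind[k]] + x[i] * val[k];
        }
    }
    // Merge each private copy into the shared y, one thread at a time
    #pragma omp critical
    for (int i = 0; i < y_length; ++i) {
        y[i] += y_private[i];
    }
    free(y_private);
} // end parallel
Each thread accumulates into its own y_private array, and the copies are merged into y inside a critical section, so no atomic operation is needed in the innermost loop.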
4. Execution
Finally, compile and run both the original and the optimized codes to assess the speed improvements. The following SLURM scripts can be used as a reference; create launch.sh and ATMUX.sh, and add execution permissions to the latter:
chmod u+x ATMUX.sh
#!/bin/bash
#SBATCH --account=ntrain6
#SBATCH --job-name=codee_c_atmux_cpu
#SBATCH --constraint=cpu
#SBATCH --qos=shared
#SBATCH --reservation=codee_day2
#SBATCH --time=0:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
export SLURM_CPU_BIND="cores"
export OMP_NUM_THREADS=32
srun ATMUX.sh
#!/bin/bash
module load PrgEnv-gnu
rm -f atmux atmux_atomic atmux_explicit
gcc atmux.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -Ofast -o atmux
./atmux 17000
gcc atmux_atomic.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -Ofast -fopenmp -o atmux_atomic
./atmux_atomic 17000
gcc atmux_explicit.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -Ofast -fopenmp -o atmux_explicit
./atmux_explicit 17000
After submitting the job (for example, with sbatch launch.sh), compare the reported times.
The atomic version ran in 0.79 seconds, while the original took 0.85 seconds, a speedup of 1.08x.
The explicit-privatization version ran in 0.28 seconds, while the original took 0.85 seconds, a speedup of 2.07x. The privatized version avoids per-element atomic updates, paying only one critical-section merge per thread, which is why it scales noticeably better than the atomic version.