ATMUX (C)

Goal

Walk you through the usage of Codee to optimize ATMUX, a sparse matrix-vector multiplication code, by parallelizing computations on CPU with OpenMP.

info

This guide was made using an Azure HBv4 machine and the AMD compilers. These steps have also been tested with GNU and Intel compilers, so you can follow allong by replacing the corresponding compilation flags.

Optimal performance is typically achieved using 48, 96, or 144 threads. To take a deeper look into de architecture of the machine used please visit the official Microsoft webpage, HBv4-series virtual machine overview

Prerequisites

Ensure you have:

Access to an Azure machine and Codee installed on it.
AMD clang compiler.
The codee-demos repository on your machine.

To clone the necessary repository just execute the following on your terminal:

git clone https://github.com/codee-com/codee-demos.git

Getting started

First, navigate to the source code for ATMUX:

cd codee-demos/C/ATMUX

Walkthrough

1. Explore the source code

The computation is handled by a double-nested for loop within atmux.c:

// y = A^T x
for (int i = 0; i < n; i++) {
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
        y[col_ind[k]] = y[col_ind[k]] + x[i] * val[k];
    }
}

2. Run the checks report

Note

It is recommended to run the screening report first to obtain a ranking of the checkers, which can help you decide which one to implement first.

To explore how Codee can help speed up this loop by parallelizing it, use --target-arch to include CPU-related checks in the analysis:

Codee command
codee checks --verbose --target-arch cpu atmux.c:atmux -- clang atmux.c -lm -O3 -Ilib/

Codee output
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang atmux.c -lm -O3 -Ilib/

[1/1] atmux.c ... Done

CHECKS REPORT

<...>

atmux.c:22:5 [PWR052] (level: L2): Consider applying multithreading parallelism to sparse reduction loop
  Suggestion: Use 'rewrite' to automatically optimize the code
  Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR052
  AutoFix (choose one option):
    * Using OpenMP 'for' with atomic protection (recommended):
        codee rewrite --check-id pwr052 --variant omp-for --in-place atmux.c:22:5 -- clang atmux.c -lm -O3 -Ilib/
    * Using OpenMP 'for' with explicit privatization:
        codee rewrite --check-id pwr052 --variant omp-for --in-place --explicit-privatization y atmux.c:22:5 -- clang atmux.c -lm -O3 -Ilib/
    * Using OpenMP 'taskwait':
        codee rewrite --check-id pwr052 --variant omp-taskwait --in-place atmux.c:22:5 -- clang atmux.c -lm -O3 -Ilib/
    * Using OpenMP 'taskloop':
        codee rewrite --check-id pwr052 --variant omp-taskloop --in-place atmux.c:22:5 -- clang atmux.c -lm -O3 -Ilib/

<...>
SUGGESTIONS

  Use --check-id to focus on specific subsets of checkers, e.g.:
        codee checks --check-id PWR060 --verbose --target-arch cpu atmux.c:atmux -- clang atmux.c -lm -O3 -Ilib/

1 file, 1 function, 3 loops, 72 LOCs successfully analyzed (8 checkers) and 0 non-analyzed files in 881 ms

Codee suggests various options to optimize the loop by automatically generating OpenMP directives.

3. Autofix

Let's use Codee's autofix capabilities to automatically optimize the code. We will create two new files: one using the atomic protection strategy, and another using the explicit privatization, to compare their performance.

Atomic protection

We can copy-paste the suggested Codee invocation to generate the OpenMP parallelization based on the atomic protection; replace the --in-place argument with -o to create a new file with the modification:

Codee command
codee rewrite --check-id pwr052 --variant omp-for -o atmux_atomic.c atmux.c:22:5 -- clang atmux.c -lm -O3 -Ilib/

Codee output
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang atmux.c -lm -O3 -Ilib/

[1/1] atmux.c ... Done
4.9at 'atmux.c:atmux:22:5' [using multi-threading]:
      [INFO] atmux.c:22:5 Parallel sparse reduction pattern identified for variable 'y' with associative, commutative operator '+'
      [INFO] atmux.c:22:5 Available parallelization strategies for variable 'y'
      [INFO] atmux.c:22:5   #1 OpenMP atomic access (* implemented)
      [INFO] atmux.c:22:5   #2 OpenMP explicit privatization
      [INFO] atmux.c:22:5 Loop parallelized with multithreading using OpenMP directive 'for'
      [INFO] atmux.c:22:5 Parallel region defined by OpenMP directive 'parallel'
      [INFO] atmux.c:22:5 Make sure there is no aliasing among variables: row_ptr, col_ind
      [INFO] atmux.c:22:5 Make sure there is no aliasing among variables: x, val, y

Successfully created atmux_atomic.c

Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities

Note the presence of the #pragma omp atomic update clause in the updated code:

diff -C 2 atmux.c atmux_atomic.c
      // y = A^T x
+     // Codee: Loop modified by Codee (2025-03-25 08:57:05)
+     // Codee: Technique applied: multithreading with 'omp-for' pragmas
+     #pragma omp parallel default(none) shared(col_ind, n, row_ptr, val, x, y)
+     {
+     #pragma omp for schedule(auto)
      for (int i = 0; i < n; i++) {
          for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
+             #pragma omp atomic update
              y[col_ind[k]] = y[col_ind[k]] + x[i] * val[k];
          }
      }
+     } // end parallel

Explicit privatization

Just like before, copy-paste the suggested Codee invocation, replacing the --in-place argument with -o:

Codee command
codee rewrite --check-id pwr052 --variant omp-for -o atmux_explicit.c --explicit-privatization y atmux.c:22:5 -- clang atmux.c -lm -O3 -Ilib/

Codee output
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang atmux.c -lm -O3 -Ilib/

[1/1] atmux.c ... Done

Results for file '/home/codee/codee-demos/C/ATMUX/atmux.c':
  Successfully applied AutoFix to the loop at 'atmux.c:atmux:22:5' [using multi-threading]:
      [INFO] atmux.c:22:5 Parallel sparse reduction pattern identified for variable 'y' with associative, commutative operator '+'
      [INFO] atmux.c:22:5 Available parallelization strategies for variable 'y'
      [INFO] atmux.c:22:5   #1 OpenMP explicit privatization (* implemented)
      [INFO] atmux.c:22:5   #2 OpenMP atomic access
      [INFO] atmux.c:22:5 Loop parallelized with multithreading using OpenMP directive 'for'
      [INFO] atmux.c:22:5 Please enter access range for variable 'y'
      [INFO] atmux.c:22:5 Critical section guarantees correctness of the reduction operation
      [INFO] atmux.c:22:5 Parallel region defined by OpenMP directive 'parallel'
      [INFO] atmux.c:22:5 Make sure there is no aliasing among variables: row_ptr, col_ind
      [INFO] atmux.c:22:5 Make sure there is no aliasing among variables: x, val, y

Successfully created atmux_explicit.c

Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities

In this scenario, Codee generates a code template that the programmer must fill with additional information:

diff -C 1 atmux.c atmux_explicit.c
      // y = A^T x
+     // Codee: Loop modified by Codee (2025-03-25 08:59:17)
+     // Codee: Technique applied: multithreading with 'omp-for' pragmas
+     #pragma omp parallel default(none) shared(col_ind, n, row_ptr, val, x, y)
+     {
+     unsigned int y_length = /* y start */ + /* y length */;
+     double *y_private = (double *) malloc(sizeof(double) * y_length);
+     for (int i = /* y start */; i < y_length; ++i) {
+       y_private[i] = 0;
+     }
+     #pragma omp for schedule(auto)
      for (int i = 0; i < n; i++) {
          for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
!             y_private[col_ind[k]] = y_private[col_ind[k]] + x[i] * val[k];
          }
      }
+     #pragma omp critical
+     for(int i = /* y start */; i < y_length; ++i) {
+       y[i] += y_private[i];
+     }
+     free(y_private);
+     } // end parallel

Replace /* y start */ and /* y length */ with the appropriate values, 0 and n respectively, to complete the code. This can be easily done using the following sed command:

sed -i 's|/\* y start \*/|0|g; s|/\* y length \*/|n|g' atmux_explicit.c

4. Execution

Compile the original source code of Atmux (atmux.c), the Codee atomic version (atmux_atomic.c) and the Codee explicit (atmux_explicit.c) to compare their performance. For instance, using the AMD clang compiler:

Compiler commands
clang atmux.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -O3 -o atmux && \
clang atmux_atomic.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -O3 -fopenmp -o atmux_atomic && \
clang atmux_explicit.c lib/Matrix2D.c lib/Vector.c lib/CRSMatrix.c -I lib/ -lm -O3 -fopenmp -o atmux_explicit

And run the original executable (atmux), the Codee atomix (atmux_atomic) ant the Codee explicit (atmux_explicit) versions, choosing a problem size of 17000 using 48 threads. The election of the number of threads was made based on experimentation, you can see more details about the architecture of the machine on HBv4-series virtual machine overview

./atmux 17000
- Input parameters
size    = 17000
- Executing test...
time (s)= 0.635771
size    = 17000
sparsity= 0.66
chksum  = 244110871193
iters   = 10

./atmux_atomic 17000
- Input parameters
size    = 17000
- Executing test...
time (s)= 0.600767
size    = 17000
sparsity= 0.66
chksum  = 244110871193
iters   = 10

./atmux_explicit 17000
- Input parameters
size    = 17000
- Executing test...
time (s)= 0.121478
size    = 17000
sparsity= 0.66
chksum  = 244110871193
iters   = 10

5. Results

Across 10 executions, the atomic version ran on 0.062 ± 0.5 seconds, while the original took 0.09 ± 0.08 seconds, representing a ~1.47x speedup.

Across 10 executions, the explicit version ran on 0.06 ± 0.07 seconds, while the original took 0.09 ± 0.08 seconds, representing a ~1.5x speedup.

Prerequisites​

Getting started​

Walkthrough​

1. Explore the source code​

2. Run the checks report​

3. Autofix​

Atomic protection​

Explicit privatization​

4. Execution​

5. Results​