MATMUL offloading to GPU at Perlmutter

Goal

This guide walks you through using Codee to optimize a matrix multiplication code by offloading the computation to a GPU.

info

This guide is part of the NERSC + Codee Training Series 2024. The code is available for download at the previous link.

Getting started

First, navigate to the source code for MATMUL:

cd codee-demos/C/MATMUL

Next, load the latest Codee version available on Perlmutter:

module load codee/2024.3.1
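If that exact version is not installed, you can first list the Codee modules available on the system (standard module command; the versions shown will vary):

module avail codee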

Walkthrough

1. Explore the source code

The computation is handled by a triple-nested for loop within main.c:

// Accumulation
for (size_t i = 0; i < m; i++) {
    for (size_t j = 0; j < n; j++) {
        for (size_t k = 0; k < p; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
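Note the access pattern of the innermost k loop: with C's row-major layout, B[k][j] jumps a full row of n elements on every iteration. A stride annotation of the three accesses (commentary only, not part of the source):

// Strides per iteration of the innermost k loop (row-major arrays):
//   A[i][k] -> stride 1   (contiguous, cache-friendly)
//   B[k][j] -> stride n   (a new cache line touched on every iteration)
//   C[i][j] -> invariant  (can be kept in a register)

This is the locality problem that Codee flags in the next step.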

2. Run the checks report

To explore how Codee can help speed up this loop by offloading it to a GPU, use --target-arch to include GPU-related checks in the analysis:

Codee command
codee checks --verbose --target-arch gpu main.c:matmul -- gcc main.c -I include -Ofast
Codee output
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc main.c -I include -Ofast

[1/1] main.c ... Done

CHECKS REPORT

main.c:16:9 [PWR039] (level: L1): Consider loop interchange to improve the locality of reference and enable vectorization
  Loops to interchange:
   16:        for (size_t j = 0; j < n; j++) {
   17:            for (size_t k = 0; k < p; k++) {
  Suggestion: Interchange inner and outer loops in the loop nest to improve performance
  Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR039
  AutoFix:
    codee rewrite --memory loop-interchange --in-place main.c:16:9 -- gcc main.c -I include -Ofast

<...>

main.c:15:5 [PWR055] (level: L3): Consider applying offloading parallelism to forall loop
  Suggestion: Use 'rewrite' to automatically optimize the code
  Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR055
  AutoFix (choose one option):
    * Using OpenMP (recommended):
      codee rewrite --offload omp-teams --in-place main.c:15:5 -- gcc main.c -I include -Ofast
    * Using OpenACC:
      codee rewrite --offload acc --in-place main.c:15:5 -- gcc main.c -I include -Ofast
    * Using OpenMP and OpenACC combined:
      codee rewrite --offload omp-teams,acc --in-place main.c:15:5 -- gcc main.c -I include -Ofast

<...>

1 file, 1 function, 5 loops successfully analyzed (7 checkers) and 0 non-analyzed files in 50 ms

Codee suggests several ways to optimize the loop, including automatic code generation for a loop interchange that improves memory accesses, and for offloading with OpenMP or OpenACC directives.

3. Autofix

Let's use Codee's autofix capabilities to automatically optimize the code. We will apply the loop interchange optimization first, and then create two new variants: one using OpenMP and another using OpenACC, to compare their performance.

Loop interchange

We can copy-paste the suggested Codee invocation to perform the loop interchange; replace the --in-place argument with -o to create a new file with the modification:

Codee command
codee rewrite --memory loop-interchange -o main_li.c main.c:16:9 -- gcc main.c -I include -Ofast
Codee output
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc main.c -I include -Ofast

Results for file '/global/homes/u/user/codee-demos/C/MATMUL/main.c':
Successfully applied AutoFix to the loop at 'main.c:16:9' [using loop interchange]:
[INFO] Loops interchanged:
- main.c:16:9
- main.c:17:13

Successfully created main_li.c

Let's confirm that the loop interchange has been correctly applied:

diff -C 2 main.c main_li.c
*** 14,19 ****
      // Accumulation
      for (size_t i = 0; i < m; i++) {
!         for (size_t j = 0; j < n; j++) {
!             for (size_t k = 0; k < p; k++) {
                  C[i][j] += A[i][k] * B[k][j];
              }
--- 14,21 ----
      // Accumulation
      for (size_t i = 0; i < m; i++) {
!         // Codee: Loop modified by Codee (2024-09-09 04:59:29)
!         // Codee: Technique applied: loop interchange
!         for (size_t k = 0; k < p; k++) {
!             for (size_t j = 0; j < n; j++) {
                  C[i][j] += A[i][k] * B[k][j];
              }
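After the interchange, the innermost loop runs over j, so C[i][j] and B[k][j] are accessed with unit stride while A[i][k] stays fixed, which restores locality and enables vectorization. An annotated view of the rewritten nest:

for (size_t i = 0; i < m; i++) {
    for (size_t k = 0; k < p; k++) {      // A[i][k] is invariant across the j loop
        for (size_t j = 0; j < n; j++) {  // C[i][j] and B[k][j]: stride-1 accesses
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}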

OpenMP

For convenience, we can run the checks report again on main_li.c to make Codee generate updated autofix commands:

Codee command
codee checks --verbose --target-arch gpu main_li.c:matmul -- gcc main_li.c -I include -Ofast
Codee output
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc main_li.c -I include -Ofast

[1/1] main_li.c ... Done

CHECKS REPORT

<...>

main_li.c:15:5 [PWR055] (level: L3): Consider applying offloading parallelism to forall loop
  Suggestion: Use 'rewrite' to automatically optimize the code
  Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR055
  AutoFix (choose one option):
    * Using OpenMP (recommended):
      codee rewrite --offload omp-teams --in-place main_li.c:15:5 -- gcc main_li.c -I include -Ofast
    * Using OpenACC:
      codee rewrite --offload acc --in-place main_li.c:15:5 -- gcc main_li.c -I include -Ofast
    * Using OpenMP and OpenACC combined:
      codee rewrite --offload omp-teams,acc --in-place main_li.c:15:5 -- gcc main_li.c -I include -Ofast

<...>

1 file, 1 function, 5 loops successfully analyzed (7 checkers) and 0 non-analyzed files in 44 ms

Let's start with the OpenMP offloading; replace the --in-place argument with -o to create a new file with the modification:

Codee command
codee rewrite --offload omp-teams -o main_li_omp.c main_li.c:15:5 -- gcc main_li.c -I include -Ofast
Codee output
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc main_li.c -I include -Ofast

Results for file '/global/homes/u/user/codee-demos/C/MATMUL/main_li.c':
Successfully applied AutoFix to the loop at 'main_li.c:matmul:15:5' [using offloading]:
[INFO] main_li.c:15:5 Parallel forall: variable 'C'
[INFO] main_li.c:15:5 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'

Successfully created main_li_omp.c

Minimum software stack requirements: OpenMP version 4.5 with offloading capabilities

By default, the generated OpenMP code offloads the computation with target teams, manages host-device data transfers with map clauses, and splits the workload using schedule(static):

diff -C 5 main_li.c main_li_omp.c
              C[i][j] = 0;
          }
      }

      // Accumulation
+     // Codee: Loop modified by Codee (2024-09-09 05:00:27)
+     // Codee: Technique applied: offloading with 'omp-teams' pragmas
+     // Codee: Inserted offloading preamble: begin
+     #pragma omp target enter data map(to: A[0:m])
+     for(int i0 = 0; i0 < m; ++i0) {
+         #pragma omp target enter data map(to: A[i0][0:p])
+     }
+     #pragma omp target enter data map(to: B[0:p])
+     for(int i0 = 0; i0 < p; ++i0) {
+         #pragma omp target enter data map(to: B[i0][0:n])
+     }
+     #pragma omp target enter data map(to: C[0:m])
+     for(int i0 = 0; i0 < m; ++i0) {
+         #pragma omp target enter data map(to: C[i0][0:n])
+     }
+     // Codee: Inserted offloading preamble: end
+     // Codee: Offloaded loop: begin
+     #pragma omp target teams distribute parallel for simd shared(A, B, m, n, p) map(to: m, n, p) schedule(static)
      for (size_t i = 0; i < m; i++) {
          // Codee: Loop modified by Codee (2024-09-09 04:59:29)
          // Codee: Technique applied: loop interchange
          for (size_t k = 0; k < p; k++) {
              for (size_t j = 0; j < n; j++) {
                  C[i][j] += A[i][k] * B[k][j];
              }
          }
      }
+     // Codee: Offloaded loop: end
+     // Codee: Inserted offloading postamble: begin
+     for(int i0 = 0; i0 < m; ++i0) {
+         #pragma omp target exit data map(from: C[i0][0:n])
+     }
+     #pragma omp target exit data map(from: C[0:m])
+     // Codee: Inserted offloading postamble: end
  }

  int main(int argc, char *argv[]) {
      int param_iters = 1;
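The same pattern is easier to read on a flat 1D array. Here is a minimal, self-contained sketch of OpenMP target offloading with explicit map clauses (illustrative function and variable names, not taken from the generated file):

#include <stddef.h>

// y[i] = a * x[i], offloaded to the device.
void scale(const double *x, double *y, size_t n, double a) {
    // map(to:) copies inputs host -> device; map(from:) copies results back.
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n], n, a) map(from: y[0:n]) schedule(static)
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i];
    }
}

Codee's generated preamble and postamble above achieve the same effect with target enter/exit data, which is needed here because A, B, and C are arrays of row pointers rather than contiguous 2D blocks, so each row must be mapped individually.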

OpenACC

Just like before, copy-paste the suggested Codee invocation, replacing the --in-place argument with -o:

Codee command
codee rewrite --offload acc -o main_li_acc.c main_li.c:15:5 -- gcc main_li.c -I include -Ofast
Codee output
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc main_li.c -I include -Ofast

Results for file '/global/homes/u/user/codee-demos/C/MATMUL/main_li.c':
Successfully applied AutoFix to the loop at 'main_li.c:matmul:15:5' [using offloading]:
[INFO] main_li.c:15:5 Parallel forall: variable 'C'
[INFO] main_li.c:15:5 Parallel region defined by OpenACC directive 'parallel'
[INFO] main_li.c:15:5 Loop parallelized with OpenACC directive 'loop'
[INFO] main_li.c:15:5 Data region for host-device data transfers defined by OpenACC directive 'data'
[INFO] main_li.c:15:5 Make sure there is no aliasing among variables: A, B, C

Successfully created main_li_acc.c

Minimum software stack requirements: OpenACC version 2.0 with offloading capabilities

By default, the generated OpenACC code offloads the computation with the parallel and loop directives, and manages host-device data transfers with a data region using copyin and copy clauses:

diff -C 4 main_li.c main_li_acc.c
      // Accumulation
+     // Codee: Loop modified by Codee (2024-09-09 05:01:40)
+     // Codee: Technique applied: offloading with 'acc' pragmas
+     #pragma acc data copyin(A[0:m][0:p], B[0:p][0:n], m, n, p) copy(C[0:m][0:n])
+     #pragma acc parallel
+     #pragma acc loop
      for (size_t i = 0; i < m; i++) {
          // Codee: Loop modified by Codee (2024-09-09 04:59:29)
          // Codee: Technique applied: loop interchange
          for (size_t k = 0; k < p; k++) {
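For comparison, the equivalent OpenACC idiom on a 1D array (again a minimal sketch with illustrative names):

#include <stddef.h>

// y[i] = a * x[i], offloaded with OpenACC.
void scale(const double *restrict x, double *restrict y, size_t n, double a) {
    // copyin transfers inputs to the device; copyout returns the results.
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++) {
            y[i] = a * x[i];
        }
    }
}

Note that, unlike the OpenMP variant, Codee asks you to verify there is no aliasing among A, B, and C, since the OpenACC rewrite relies on that assumption.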

4. Execution

Finally, compile and run both the original and the optimized codes to assess the speedup. The following SLURM scripts can be used as a reference; create launch.sh and MATMUL.sh, and add execution permissions to the latter:

chmod u+x MATMUL.sh
launch.sh
#!/bin/bash

#SBATCH --account=ntrain6
#SBATCH --job-name=codee_c_matmul

#SBATCH --constraint=gpu
#SBATCH --qos=shared
#SBATCH --reservation=codee_day1
#SBATCH --time=0:05:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
srun MATMUL.sh
MATMUL.sh
#!/bin/bash

module load PrgEnv-nvidia
rm -f matmul matmul_li_omp matmul_li_acc

nvc clock.c matrix.c main.c -I include/ -Ofast -o matmul
./matmul 3000

nvc clock.c matrix.c main_li_omp.c -I include/ -Ofast -mp -target=gpu -Minfo=mp -o matmul_li_omp
./matmul_li_omp 3000

nvc clock.c matrix.c main_li_acc.c -I include/ -Ofast -acc -target=gpu -Minfo=acc -o matmul_li_acc
./matmul_li_acc 3000
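Submit the job from the same directory (standard Slurm submission; the qos and reservation in launch.sh are specific to the training event):

sbatch launch.sh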

The OpenMP version ran in 4.77 seconds, while the original took 38.99 seconds, a speedup of 8.18x.

The OpenACC version ran in 6.17 seconds, while the original took 38.99 seconds, a speedup of 6.32x.