PI offloading to GPU on Perlmutter
This guide walks you through using Codee to optimize a code that estimates the value of pi by offloading its computation to the GPU.
This guide is part of the NERSC + Codee Training Series 2024. The code is available for download at the previous link.
Getting started
First, navigate to the source code for PI:
cd codee-demos/C/PI
Next, load the latest Codee version available on Perlmutter:
module load codee/2024.3.1
Walkthrough
1. Explore the source code
The computation is handled by a single for loop within pi.c:
double sum = 0.0;
for (unsigned long i = 0; i < N; i++) {
double x = (i + 0.5) / N;
sum += sqrt(1 - x * x);
}
2. Run the checks report
To explore how Codee can help speed up this loop by offloading it to a GPU, use --target-arch to include GPU-related checks in the analysis:
codee checks --verbose --target-arch gpu pi.c:main -- gcc pi.c -lm -Ofast
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc pi.c -lm -Ofast
[1/1] pi.c ... Done
CHECKS REPORT
<...>
pi.c:31:5 [PWR056] (level: L3): Consider applying offloading parallelism to scalar reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR056
AutoFix (choose one option):
* Using OpenMP (recommended):
codee rewrite --offload omp-teams --in-place pi.c:31:5 -- gcc pi.c -lm -Ofast
* Using OpenACC:
codee rewrite --offload acc --in-place pi.c:31:5 -- gcc pi.c -lm -Ofast
* Using OpenMP and OpenACC combined:
codee rewrite --offload omp-teams,acc --in-place pi.c:31:5 -- gcc pi.c -lm -Ofast
<...>
1 file, 1 function, 1 loop successfully analyzed (3 checkers) and 0 non-analyzed files in 40 ms
Codee suggests various options to optimize the loop, including automatic code generation for offloading using OpenMP or OpenACC directives.
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code. We will create two new files: one using OpenMP and another using OpenACC, to compare their performance.
OpenMP
We can copy-paste the suggested Codee invocation to generate the OpenMP offloading; replace the --in-place argument with -o to create a new file with the modification:
codee rewrite --offload omp-teams -o pi_omp.c pi.c:31:5 -- gcc pi.c -lm -Ofast
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc pi.c -lm -Ofast
Results for file '/global/homes/u/user/codee-demos/C/PI/pi.c':
Successfully applied AutoFix to the loop at 'pi.c:main:31:5' [using offloading]:
[INFO] pi.c:31:5 Parallel scalar reduction pattern identified for variable 'sum' with associative, commutative operator '+'
[INFO] pi.c:31:5 Available parallelization strategies for variable 'sum'
[INFO] pi.c:31:5 #1 OpenMP scalar reduction (* implemented)
[INFO] pi.c:31:5 #2 OpenMP atomic access
[INFO] pi.c:31:5 #3 OpenMP explicit privatization
[INFO] pi.c:31:5 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'
Successfully created pi_omp.c
Minimum software stack requirements: OpenMP version 4.0 with offloading capabilities
By default, the generated OpenMP code offloads the computation using target teams, manages data transfers with map, and splits the workload using schedule(static):
double sum = 0.0;
+ // Codee: Loop modified by Codee (2024-09-05 04:42:51)
+ // Codee: Technique applied: offloading with 'omp-teams' pragmas
+ // Codee: Offloaded loop: begin
+ #pragma omp target teams distribute parallel for simd shared(N) map(to: N) reduction(+: sum) map(tofrom: sum) schedule(static)
for (unsigned long i = 0; i < N; i++) {
double x = (i + 0.5) / N;
sum += sqrt(1 - x * x);
}
+ // Codee: Offloaded loop: end
out_result = 4.0 / N * sum;
OpenACC
Just like before, copy-paste the suggested Codee invocation, replacing the --in-place argument with -o:
codee rewrite --offload acc -o pi_acc.c pi.c:31:5 -- gcc pi.c -lm -Ofast
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc pi.c -lm -Ofast
Results for file '/global/homes/u/user/codee-demos/C/PI/pi.c':
Successfully applied AutoFix to the loop at 'pi.c:main:31:5' [using offloading]:
[INFO] pi.c:31:5 Parallel scalar reduction pattern identified for variable 'sum' with associative, commutative operator '+'
[INFO] pi.c:31:5 Available parallelization strategies for variable 'sum'
[INFO] pi.c:31:5 #1 OpenACC scalar reduction (* implemented)
[INFO] pi.c:31:5 #2 OpenACC atomic access
[INFO] pi.c:31:5 Parallel region defined by OpenACC directive 'parallel'
[INFO] pi.c:31:5 Loop parallelized with OpenACC directive 'loop'
[INFO] pi.c:31:5 Data region for host-device data transfers defined by OpenACC directive 'data'
Successfully created pi_acc.c
Minimum software stack requirements: OpenACC version 2.0 with offloading capabilities
By default, the generated OpenACC code offloads the computation using parallel and manages data transfers with data copy:
double sum = 0.0;
+ // Codee: Loop modified by Codee (2024-09-05 03:24:31)
+ // Codee: Technique applied: offloading with 'acc' pragmas
+ #pragma acc data copyin(N) copy(sum)
+ #pragma acc parallel
+ #pragma acc loop reduction(+: sum)
for (unsigned long i = 0; i < N; i++) {
    double x = (i + 0.5) / N;
    sum += sqrt(1 - x * x);
}
out_result = 4.0 / N * sum;
4. Execution
Finally, compile and run both the original and the optimized codes to assess
the speed improvements. The following SLURM scripts can be used as a reference; create launch.sh and PI.sh, and add execution permissions to the latter:
chmod u+x PI.sh
#!/bin/bash
#SBATCH --account=ntrain6
#SBATCH --job-name=codee_c_pi
#SBATCH --constraint=gpu
#SBATCH --qos=shared
#SBATCH --reservation=codee_day1
#SBATCH --time=0:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1
export SLURM_CPU_BIND="cores"
srun PI.sh
#!/bin/bash
module load PrgEnv-nvidia
rm -f pi pi_omp pi_acc
nvc pi.c -lm -Ofast -o pi
./pi 900000000
nvc pi_omp.c -lm -Ofast -mp -target=gpu -Minfo=mp -o pi_omp
./pi_omp 900000000
nvc pi_acc.c -lm -Ofast -acc -target=gpu -Minfo=acc -o pi_acc
./pi_acc 900000000
The OpenMP version ran in 0.43 seconds, while the original took 0.59 seconds, a speedup of 1.37x.
The OpenACC version ran in 0.34 seconds, while the original took 0.59 seconds, a speedup of 1.74x.