PI offloading to GPU on Perlmutter
This guide walks you through using Codee to optimize a code that estimates the value of pi by offloading its computation to the GPU.
This guide is part of the NERSC + Codee Training Series 2024. The code is available for download at the previous link.
Getting started
First, navigate to the source code for PI:
cd codee-demos/C/PI
Next, load the latest Codee version available on Perlmutter:
module load codee/2024.3.1
Walkthrough
1. Explore the source code
The computation is handled by a single for loop within pi.c:
double sum = 0.0;
for (unsigned long i = 0; i < N; i++) {
double x = (i + 0.5) / N;
sum += sqrt(1 - x * x);
}
2. Run the checks report
To explore how Codee can help speed up this loop by offloading it to a GPU, use --target-arch to include GPU-related checks in the analysis:
codee checks --verbose --target-arch gpu pi.c:main -- gcc pi.c -lm -Ofast
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc pi.c -lm -Ofast
[1/1] pi.c ... Done
CHECKS REPORT
<...>
pi.c:31:5 [PWR056] (level: L3): Consider applying offloading parallelism to scalar reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR056
AutoFix (choose one option):
* Using OpenMP (recommended):
codee rewrite --offload omp-teams --in-place pi.c:31:5 -- gcc pi.c -lm -Ofast
* Using OpenACC:
codee rewrite --offload acc --in-place pi.c:31:5 -- gcc pi.c -lm -Ofast
* Using OpenMP and OpenACC combined:
codee rewrite --offload omp-teams,acc --in-place pi.c:31:5 -- gcc pi.c -lm -Ofast
<...>
1 file, 1 function, 1 loop successfully analyzed (3 checkers) and 0 non-analyzed files in 40 ms
Codee suggests various options to optimize the loop, including automatic code generation for offloading using OpenMP or OpenACC directives.
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code. We will create two new files: one using OpenMP and another using OpenACC, to compare their performance.
OpenMP
We can copy-paste the suggested Codee invocation to generate the OpenMP offloading; replace the --in-place argument with -o to create a new file with the modification:
codee rewrite --offload omp-teams -o pi_omp.c pi.c:31:5 -- gcc pi.c -lm -Ofast
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc pi.c -lm -Ofast
Results for file '/global/homes/u/user/codee-demos/C/PI/pi.c':
Successfully applied AutoFix to the loop at 'pi.c:main:31:5' [using offloading]:
[INFO] pi.c:31:5 Parallel scalar reduction pattern identified for variable 'sum' with associative, commutative operator '+'
[INFO] pi.c:31:5 Available parallelization strategies for variable 'sum'
[INFO] pi.c:31:5 #1 OpenMP scalar reduction (* implemented)
[INFO] pi.c:31:5 #2 OpenMP atomic access
[INFO] pi.c:31:5 #3 OpenMP explicit privatization
[INFO] pi.c:31:5 Loop parallelized with teams using OpenMP directive 'target teams distribute parallel for'
Successfully created pi_omp.c
Minimum software stack requirements: OpenMP version 4.0 with offloading capabilities
By default, the generated OpenMP code offloads the computation using target teams, manages data transfers with map, and splits the workload using schedule(static):
double sum = 0.0;
+ // Codee: Loop modified by Codee (2024-09-05 04:42:51)
+ // Codee: Technique applied: offloading with 'omp-teams' pragmas
+ // Codee: Offloaded loop: begin
+ #pragma omp target teams distribute parallel for simd shared(N) map(to: N) reduction(+: sum) map(tofrom: sum) schedule(static)
for (unsigned long i = 0; i < N; i++) {
double x = (i + 0.5) / N;
sum += sqrt(1 - x * x);
}
+ // Codee: Offloaded loop: end
out_result = 4.0 / N * sum;
OpenACC
Just like before, copy-paste the suggested Codee invocation, replacing the --in-place argument with -o:
codee rewrite --offload acc -o pi_acc.c pi.c:31:5 -- gcc pi.c -lm -Ofast
Date: 2024-09-05 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc pi.c -lm -Ofast
Results for file '/global/homes/u/user/codee-demos/C/PI/pi.c':
Successfully applied AutoFix to the loop at 'pi.c:main:31:5' [using offloading]:
[INFO] pi.c:31:5 Parallel scalar reduction pattern identified for variable 'sum' with associative, commutative operator '+'
[INFO] pi.c:31:5 Available parallelization strategies for variable 'sum'
[INFO] pi.c:31:5 #1 OpenACC scalar reduction (* implemented)
[INFO] pi.c:31:5 #2 OpenACC atomic access
[INFO] pi.c:31:5 Parallel region defined by OpenACC directive 'parallel'
[INFO] pi.c:31:5 Loop parallelized with OpenACC directive 'loop'
[INFO] pi.c:31:5 Data region for host-device data transfers defined by OpenACC directive 'data'
Successfully created pi_acc.c
Minimum software stack requirements: OpenACC version 2.0 with offloading capabilities
By default, the generated OpenACC code offloads the computation using parallel and manages data transfers with data copy:
double sum = 0.0;
+ // Codee: Loop modified by Codee (2024-09-05 03:24:31)
+ // Codee: Technique applied: offloading with 'acc' pragmas
+ #pragma acc data copyin(N) copy(sum)
+ #pragma acc parallel
+ #pragma acc loop reduction(+: sum)
for (unsigned long i = 0; i < N; i++) {
    double x = (i + 0.5) / N;
    sum += sqrt(1 - x * x);
}
out_result = 4.0 / N * sum;
4. Execution
Finally, compile and run both the original and the optimized codes to assess
the speed improvements. The following SLURM scripts can be used as a reference; create launch.sh and PI.sh, and add execution permissions to the latter:
chmod u+x PI.sh
#!/bin/bash
#SBATCH --account=ntrain6
#SBATCH --job-name=codee_c_pi
#SBATCH --constraint=gpu
#SBATCH --qos=shared
#SBATCH --reservation=codee_day1
#SBATCH --time=0:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1
export SLURM_CPU_BIND="cores"
srun PI.sh
#!/bin/bash
module load PrgEnv-nvidia
rm -f pi pi_omp pi_acc
nvc pi.c -lm -Ofast -o pi
./pi 900000000
nvc pi_omp.c -lm -Ofast -mp -target=gpu -Minfo=mp -o pi_omp
./pi_omp 900000000
nvc pi_acc.c -lm -Ofast -acc -target=gpu -Minfo=acc -o pi_acc
./pi_acc 900000000
The OpenMP version ran in 0.43 seconds, while the original took 0.59 seconds, a speedup of 1.37x.
The OpenACC version ran in 0.34 seconds, while the original took 0.59 seconds, a speedup of 1.74x.