
LULESHmk offloading to GPU at Perlmutter

Goal

Walk you through using Codee to optimize a Lagrangian hydrodynamics simulation code by offloading computations to the GPU.

info

This guide is part of the NERSC + Codee Training Series 2024. Code available for download at the previous link.

Getting started

First, navigate to the source code for LULESHmk:

cd codee-demos/C/LULESHmk/src

Next, load the latest Codee version available on Perlmutter:

module load codee/2024.3.1

Walkthrough

1. Explore the source code

The main computation is handled by the function CalcFBHourglassForceForElems():

void CalcFBHourglassForceForElems(Index_t numElem, Index_t *domain_m_nodelist, Real_t *domain_m_fx, Real_t *domain_m_fy,
                                  Real_t *domain_m_fz) {
    /*************************************************
     *
     * FUNCTION: Calculates the Flanagan-Belytschko anti-hourglass
     * force.
     *
     *************************************************/
    Real_t gamma[4][8];

    gamma[0][0] = (1.);
    gamma[0][1] = (1.);
    gamma[0][2] = (-1.);
    gamma[0][3] = (-1.);
    gamma[0][4] = (-1.);
    gamma[0][5] = (-1.);
    gamma[0][6] = (1.);
    gamma[0][7] = (1.);
    gamma[1][0] = (1.);
    gamma[1][1] = (-1.);
    gamma[1][2] = (-1.);
    gamma[1][3] = (1.);
    gamma[1][4] = (-1.);
    gamma[1][5] = (1.);
    gamma[1][6] = (1.);
    gamma[1][7] = (-1.);
    gamma[2][0] = (1.);
    gamma[2][1] = (-1.);
    gamma[2][2] = (1.);
    gamma[2][3] = (-1.);
    gamma[2][4] = (1.);
    gamma[2][5] = (-1.);
    gamma[2][6] = (1.);
    gamma[2][7] = (-1.);
    gamma[3][0] = (-1.);
    gamma[3][1] = (1.);
    gamma[3][2] = (-1.);
    gamma[3][3] = (1.);
    gamma[3][4] = (1.);
    gamma[3][5] = (-1.);
    gamma[3][6] = (1.);
    gamma[3][7] = (-1.);

    /*************************************************/
    /* compute the hourglass modes */

    for (Index_t i2 = 0; i2 < numElem; ++i2) {
        Real_t hgfx[8], hgfy[8], hgfz[8];

        CalcElemFBHourglassForce(i2, gamma, hgfx, hgfy, hgfz);

        // With the threaded version, we write into local arrays per elem
        // so we don't have to worry about race conditions
        Index_t n0si2 = domain_m_nodelist[(8) * i2 + 0];
        Index_t n1si2 = domain_m_nodelist[(8) * i2 + 1];
        Index_t n2si2 = domain_m_nodelist[(8) * i2 + 2];
        Index_t n3si2 = domain_m_nodelist[(8) * i2 + 3];
        Index_t n4si2 = domain_m_nodelist[(8) * i2 + 4];
        Index_t n5si2 = domain_m_nodelist[(8) * i2 + 5];
        Index_t n6si2 = domain_m_nodelist[(8) * i2 + 6];
        Index_t n7si2 = domain_m_nodelist[(8) * i2 + 7];

        domain_m_fx[n0si2] += hgfx[0];
        domain_m_fy[n0si2] += hgfy[0];
        domain_m_fz[n0si2] += hgfz[0];

        domain_m_fx[n1si2] += hgfx[1];
        domain_m_fy[n1si2] += hgfy[1];
        domain_m_fz[n1si2] += hgfz[1];

        domain_m_fx[n2si2] += hgfx[2];
        domain_m_fy[n2si2] += hgfy[2];
        domain_m_fz[n2si2] += hgfz[2];

        domain_m_fx[n3si2] += hgfx[3];
        domain_m_fy[n3si2] += hgfy[3];
        domain_m_fz[n3si2] += hgfz[3];

        domain_m_fx[n4si2] += hgfx[4];
        domain_m_fy[n4si2] += hgfy[4];
        domain_m_fz[n4si2] += hgfz[4];

        domain_m_fx[n5si2] += hgfx[5];
        domain_m_fy[n5si2] += hgfy[5];
        domain_m_fz[n5si2] += hgfz[5];

        domain_m_fx[n6si2] += hgfx[6];
        domain_m_fy[n6si2] += hgfy[6];
        domain_m_fz[n6si2] += hgfz[6];

        domain_m_fx[n7si2] += hgfx[7];
        domain_m_fy[n7si2] += hgfy[7];
        domain_m_fz[n7si2] += hgfz[7];
    }
}
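
Note the scatter updates at the end of the loop: neighboring elements share nodes, so different iterations of the element loop can increment the same entry of domain_m_fx, domain_m_fy, and domain_m_fz. This is what makes the loop a sparse reduction rather than an embarrassingly parallel loop. The following standalone sketch (illustrative only, not part of LULESHmk) reproduces the pattern with two hexahedral elements that share a face and a dummy unit force:

/* Standalone sketch (illustrative only, not part of LULESHmk): two hexahedral
 * elements that share the face formed by nodes 4-7. The scatter updates hit
 * those shared nodes from different iterations of the element loop. */
#include <stdio.h>

typedef int    Index_t;
typedef double Real_t;

int main(void) {
    Index_t nodelist[2 * 8] = {
        0, 1, 2, 3, 4, 5,  6,  7,  /* element 0 */
        4, 5, 6, 7, 8, 9, 10, 11   /* element 1 reuses nodes 4-7 */
    };
    Real_t fx[12] = {0};

    for (Index_t i2 = 0; i2 < 2; ++i2)       /* element loop */
        for (int n = 0; n < 8; ++n)          /* scatter a dummy unit force */
            fx[nodelist[8 * i2 + n]] += 1.0; /* fx[4..7] is written twice */

    for (int n = 0; n < 12; ++n)
        printf("fx[%d] = %.0f\n", n, fx[n]); /* fx[4..7] == 2, the rest == 1 */
    return 0;
}

Executed in parallel without protection, the updates to the shared nodes 4-7 would race; this is precisely the pattern Codee detects and handles in the next steps.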

2. Run the checks report

To explore how Codee can help speed up this function by offloading part of it to a GPU, use the --target-arch gpu flag to include GPU-related checks in the analysis:

Codee command
codee checks --verbose --target-arch gpu luleshmk.c:CalcFBHourglassForceForElems -- gcc luleshmk.c -lm -Ofast
Codee output
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc luleshmk.c -lm -Ofast

[1/1] luleshmk.c ... Done

CHECKS REPORT

<...>

luleshmk.c:190:5 [PWR057] (level: L3): Consider applying offloading parallelism to sparse reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR057
AutoFix (choose one option):
* Using OpenMP (recommended):
codee rewrite --offload omp-teams --in-place luleshmk.c:190:5 -- gcc luleshmk.c -lm -Ofast
* Using OpenACC:
codee rewrite --offload acc --in-place luleshmk.c:190:5 -- gcc luleshmk.c -lm -Ofast
* Using OpenMP and OpenACC combined:
codee rewrite --offload omp-teams,acc --in-place luleshmk.c:190:5 -- gcc luleshmk.c -lm -Ofast

SUGGESTIONS

Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR021 --verbose --target-arch gpu luleshmk.c:CalcFBHourglassForceForElems -- gcc luleshmk.c -lm -Ofast

1 file, 1 function, 1 loop successfully analyzed (3 checkers) and 0 non-analyzed files in 77 ms

Codee suggests various options to optimize the loop within the function, including automatic code generation for offloading using OpenMP or OpenACC directives.
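
The rest of this walkthrough applies the OpenACC option. For reference, the 'omp-teams' alternative corresponds to OpenMP target-offloading directives roughly like the following self-contained sketch, which applies them to the two-element example shown earlier (hand-written for illustration; it is not Codee's actual output):

/* Hand-written sketch of OpenMP target offloading for a sparse reduction;
 * not generated by Codee. The element loop is distributed across GPU teams
 * and threads, and each scatter update is protected with an atomic. */
#include <stdio.h>

typedef int    Index_t;
typedef double Real_t;

#define NELEM 2
#define NNODE 12

int main(void) {
    Index_t nodelist[NELEM * 8] = {
        0, 1, 2, 3, 4, 5,  6,  7,
        4, 5, 6, 7, 8, 9, 10, 11
    };
    Real_t fx[NNODE] = {0};

    /* Map the index list to the device and the force array in and out. */
    #pragma omp target teams distribute parallel for \
            map(to: nodelist[0:NELEM * 8]) map(tofrom: fx[0:NNODE])
    for (Index_t i2 = 0; i2 < NELEM; ++i2) {
        for (int n = 0; n < 8; ++n) {
            /* Nodes shared by several elements must be updated atomically. */
            #pragma omp atomic update
            fx[nodelist[8 * i2 + n]] += 1.0;
        }
    }

    for (int n = 0; n < NNODE; ++n)
        printf("fx[%d] = %.0f\n", n, fx[n]);
    return 0;
}

The combined target teams distribute parallel for construct spreads the element loop across GPU teams and threads, while atomic update protects the nodes shared between elements.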

3. Autofix

Let's use Codee's autofix capabilities to automatically optimize the code. We will create a new file using OpenACC to assess its performance.

OpenACC

We can copy-paste the suggested Codee invocation to generate the OpenACC offloading; replace the --in-place argument with -o to create a new file with the modification:

Codee command
codee rewrite --offload acc -o luleshmk_acc.c luleshmk.c:190:5 -- gcc luleshmk.c -lm -Ofast
Codee output
Date: 2024-09-09 Codee version: 2024.3.1 License type: Full
Compiler invocation: gcc luleshmk.c -lm -Ofast

Results for file '/global/homes/u/user/codee-demos/C/LULESHmk/src/luleshmk.c':
Successfully applied AutoFix to the loop at 'luleshmk.c:CalcFBHourglassForceForElems:190:5' [using offloading]:
[INFO] luleshmk.c:190:5 Parallel sparse reduction pattern identified for variable 'domain_m_fz' with associative, commutative operator '+'
[INFO] luleshmk.c:190:5 Parallel sparse reduction pattern identified for variable 'domain_m_fy' with associative, commutative operator '+'
[INFO] luleshmk.c:190:5 Parallel sparse reduction pattern identified for variable 'domain_m_fx' with associative, commutative operator '+'
[INFO] luleshmk.c:190:5 Available parallelization strategies for variable 'domain_m_fz'
[INFO] luleshmk.c:190:5 #1 OpenACC atomic access (* implemented)
[INFO] luleshmk.c:190:5 Available parallelization strategies for variable 'domain_m_fy'
[INFO] luleshmk.c:190:5 #1 OpenACC atomic access (* implemented)
[INFO] luleshmk.c:190:5 Available parallelization strategies for variable 'domain_m_fx'
[INFO] luleshmk.c:190:5 #1 OpenACC atomic access (* implemented)
[INFO] luleshmk.c:190:5 Parallel region defined by OpenACC directive 'parallel'
[INFO] luleshmk.c:190:5 Loop parallelized with OpenACC directive 'loop'
[INFO] luleshmk.c:190:5 Data region for host-device data transfers defined by OpenACC directive 'data'
[INFO] luleshmk.c:190:5 Make sure there is no aliasing among variables: domain_m_fz, domain_m_fy, domain_m_fx
Fine-tuning suggestions for better performance [using offloading]:
[TODO] Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses
Documentation: https://github.com/codee-com/open-catalog/tree/main/Glossary/Offloading-data-transfers.md

Successfully created luleshmk_acc.c

Minimum software stack requirements: OpenACC version 2.0 with offloading capabilities

By default, the generated OpenACC code offloads the computation using the parallel directive and manages data transfers with a data copy clause; several atomic clauses are added to ensure the correctness of the sparse reductions:

diff -C 1 luleshmk.c luleshmk_acc.c
***************
*** 189,190 ****
--- 189,196 ----

+ // Codee: Loop modified by Codee (2024-09-09 04:47:01)
+ // Codee: Technique applied: offloading with 'acc' pragmas
+ // TODO (Codee): Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses
+ #pragma acc data copyin(domain_m_nodelist[:], gamma[0:4][0:8], numElem) copy(domain_m_fx[:], domain_m_fy[:], domain_m_fz[:])
+ #pragma acc parallel
+ #pragma acc loop
for (Index_t i2 = 0; i2 < numElem; ++i2) {
***************
*** 205,236 ****
--- 211,266 ----

+ #pragma acc atomic update
domain_m_fx[n0si2] += hgfx[0];
+ #pragma acc atomic update
domain_m_fy[n0si2] += hgfy[0];
+ #pragma acc atomic update
domain_m_fz[n0si2] += hgfz[0];

<...>

As Codee suggests with this message:

TODO (Codee): Consider optimizing data transfers of arrays by adding the proper array ranges in data mapping clauses

To make the code compilable, it will be necessary to modify the first pragma to include the proper array ranges. Update the line as follows:

#pragma acc data copyin(domain_m_nodelist[0:MAX_NODELIST], gamma[0:4][0:8], numElem) copy(domain_m_fx[0:NUM_NODES], domain_m_fy[0:NUM_NODES], domain_m_fz[0:NUM_NODES])
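
Here, MAX_NODELIST and NUM_NODES are assumed to be size constants available in the code for the node-index list and the nodal force arrays, respectively. For reference, the standalone sketch below (illustrative only, not the generated luleshmk_acc.c) shows the same directive structure (data, parallel, loop, and atomic update) with explicit array ranges, applied to the miniature sparse reduction used earlier:

/* Standalone sketch (illustrative only, not the generated luleshmk_acc.c)
 * showing the OpenACC directives with explicit array ranges on a miniature
 * sparse reduction. Runs serially if compiled without OpenACC support. */
#include <stdio.h>

typedef int    Index_t;
typedef double Real_t;

#define NELEM 2
#define NNODE 12

int main(void) {
    Index_t nodelist[NELEM * 8] = {
        0, 1, 2, 3, 4, 5,  6,  7,
        4, 5, 6, 7, 8, 9, 10, 11
    };
    Real_t fx[NNODE] = {0};

    /* Explicit ranges tell the compiler how much of each array to transfer. */
    #pragma acc data copyin(nodelist[0:NELEM * 8]) copy(fx[0:NNODE])
    #pragma acc parallel
    #pragma acc loop
    for (Index_t i2 = 0; i2 < NELEM; ++i2) {
        for (int n = 0; n < 8; ++n) {
            #pragma acc atomic update
            fx[nodelist[8 * i2 + n]] += 1.0;
        }
    }

    for (int n = 0; n < NNODE; ++n)
        printf("fx[%d] = %.0f\n", n, fx[n]);
    return 0;
}

With the ranges spelled out, the compiler knows exactly how many elements of each array to copy to and from the device, which is what the Codee TODO message refers to.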

4. Execution

Finally, compile and run both the original and the optimized code to assess the speedup. The following SLURM scripts can be used as a reference; create launch.sh and LULESHMK.sh, and add execution permissions to the latter:

chmod u+x LULESHMK.sh
launch.sh
#!/bin/bash

#SBATCH --account=ntrain6
#SBATCH --job-name=codee_c_luleshmk

#SBATCH --constraint=gpu
#SBATCH --qos=shared
#SBATCH --reservation=codee_day2
#SBATCH --time=0:05:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
srun LULESHMK.sh
LULESHMK.sh
#!/bin/bash

module load PrgEnv-nvidia
rm -f luleshmk luleshmk_acc

nvc luleshmk.c -o luleshmk
./luleshmk

nvc luleshmk_acc.c -acc -target=gpu -Minfo=acc -o luleshmk_acc
./luleshmk_acc

The optimized version ran in 3.44 seconds, while the original took 18.44 seconds, a speedup of roughly 5.4x.

Nevertheless, if we use the -Ofast flag, the compiler optimizes the serial version so heavily that it runs in 0.68 seconds, which diminishes the speedup gained by offloading to the GPU.