LULESHmk (C)

Goal

This guide walks you through using Codee to optimize LULESHmk, a Lagrangian hydrodynamics simulation code, by parallelizing its computations on the CPU with OpenMP.

info

This guide was made using an Azure HBv4 machine and the AMD compilers. These steps have also been tested with the GNU and Intel compilers, so you can follow along by replacing the corresponding compilation flags.

Optimal performance is typically achieved using 48, 96, or 144 threads. For a deeper look at the architecture of the machine used, please visit the official Microsoft page, HBv4-series virtual machine overview.

Prerequisites

Ensure you have:

  • Access to an Azure machine and Codee installed on it.
  • AMD clang compiler.
  • The codee-demos repository on your machine.

To clone the necessary repository, run the following in your terminal:

git clone https://github.com/codee-com/codee-demos.git

Getting started

First, navigate to the source code for LULESHmk:

cd codee-demos/C/LULESHmk/src

Walkthrough

1. Explore the source code

The main computation is handled by the function CalcFBHourglassForceForElems():

void CalcFBHourglassForceForElems(Index_t numElem, Index_t *domain_m_nodelist,
                                  Real_t *domain_m_fx, Real_t *domain_m_fy,
                                  Real_t *domain_m_fz) {
    /*************************************************
     *
     * FUNCTION: Calculates the Flanagan-Belytschko anti-hourglass
     *           force.
     *
     *************************************************/
    Real_t gamma[4][8];

    gamma[0][0] = (1.);
    gamma[0][1] = (1.);
    <...>

    /*************************************************/
    /* compute the hourglass modes */

    for (Index_t i2 = 0; i2 < numElem; ++i2) {
        Real_t hgfx[8], hgfy[8], hgfz[8];

        CalcElemFBHourglassForce(i2, gamma, hgfx, hgfy, hgfz);

        // With the threaded version, we write into local arrays per elem
        // so we don't have to worry about race conditions
        Index_t n0si2 = domain_m_nodelist[(8) * i2 + 0];
        Index_t n1si2 = domain_m_nodelist[(8) * i2 + 1];
        Index_t n2si2 = domain_m_nodelist[(8) * i2 + 2];
        <...>

        domain_m_fx[n0si2] += hgfx[0];
        domain_m_fy[n0si2] += hgfy[0];
        domain_m_fz[n0si2] += hgfz[0];

        <...>
    }
}
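Because several elements can share the same node, the indirect updates through domain_m_nodelist mean that different loop iterations may accumulate into the same entry of the force arrays. The following self-contained sketch uses simplified stand-in names (nodelist and fx play the roles of domain_m_nodelist and domain_m_fx) to show why a naive #pragma omp parallel for over such a loop would race:

#include <stdio.h>

#define NUM_ELEMS 2
#define NUM_NODES 3

int main(void) {
    /* Element-to-node connectivity: both elements touch node 1 */
    int nodelist[NUM_ELEMS][2] = {{0, 1}, {1, 2}};
    double fx[NUM_NODES] = {0.0, 0.0, 0.0};

    /* A plain '#pragma omp parallel for' here would be unsafe:
     * iterations i=0 and i=1 can both update fx[1] concurrently,
     * which is exactly the sparse-reduction race Codee detects. */
    for (int i = 0; i < NUM_ELEMS; ++i) {
        fx[nodelist[i][0]] += 1.0;
        fx[nodelist[i][1]] += 1.0;
    }

    printf("fx[1] = %g\n", fx[1]); /* 2.0: both elements contributed */
    return 0;
}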

2. Run the checks report

Note

It is recommended to run the screening report first to obtain a ranking of the checkers, which can help you decide which one to implement first.

To explore how Codee can help speed up this loop by parallelizing it, use --target-arch to include CPU-related checks in the analysis:

Codee command
codee checks --verbose --target-arch cpu luleshmk.c:CalcFBHourglassForceForElems -- clang luleshmk.c -lm -O3
Codee output
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang luleshmk.c -lm -O3

[1/1] luleshmk.c ... Done

CHECKS REPORT

<...>

luleshmk.c:190:5 [PWR052] (level: L2): Consider applying multithreading parallelism to sparse reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR052
AutoFix (choose one option):
* Using OpenMP 'for' with atomic protection (recommended):
codee rewrite --check-id pwr052 --variant omp-for --in-place luleshmk.c:190:5 -- clang luleshmk.c -lm -O3
* Using OpenMP 'for' with explicit privatization:
codee rewrite --check-id pwr052 --variant omp-for --in-place --explicit-privatization domain_m_fz,domain_m_fy,domain_m_fx luleshmk.c:190:5 -- clang luleshmk.c -lm -O3
* Using OpenMP 'taskwait':
codee rewrite --check-id pwr052 --variant omp-taskwait --in-place luleshmk.c:190:5 -- clang luleshmk.c -lm -O3
* Using OpenMP 'taskloop':
codee rewrite --check-id pwr052 --variant omp-taskloop --in-place luleshmk.c:190:5 -- clang luleshmk.c -lm -O3

<...>

SUGGESTIONS

Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR021 --verbose --target-arch cpu luleshmk.c:CalcFBHourglassForceForElems -- clang luleshmk.c -lm -O3

1 file, 1 function, 1 loop, 381 LOCs successfully analyzed (3 checkers) and 0 non-analyzed files in 87 ms

Codee suggests various options to optimize the loop by automatically generating OpenMP directives.
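Before picking a variant, it helps to understand what the two main strategies mean: atomic protection keeps the loop as-is and guards each indirect update, while explicit privatization gives each thread its own copy of the output arrays and merges them at the end. The sketch below illustrates the privatization idea on a simplified sparse reduction; it is an illustration of the concept, not the code Codee generates, and the names sparse_reduction_privatized and fx_priv are made up for the example:

#include <stdlib.h>

void sparse_reduction_privatized(int numElem, const int *nodelist,
                                 double *fx, int numNode) {
    #pragma omp parallel
    {
        /* Thread-private copy, so the indirect updates are race-free */
        double *fx_priv = calloc((size_t)numNode, sizeof(double));

        #pragma omp for
        for (int i = 0; i < numElem; ++i) {
            for (int j = 0; j < 8; ++j)
                fx_priv[nodelist[8 * i + j]] += 1.0; /* stands in for hgfx[j] */
        }

        /* Merge the private copies; this final step still needs protection */
        #pragma omp critical
        for (int n = 0; n < numNode; ++n)
            fx[n] += fx_priv[n];

        free(fx_priv);
    }
}

Privatization avoids per-update atomic overhead at the cost of extra memory and a merge phase, which is why the report offers both strategies.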

3. Autofix

Let's use Codee's autofix capabilities to automatically optimize the code. We will study the optimization using the atomic protection strategy.

We can copy-paste the suggested Codee invocation to generate the OpenMP pragmas; replace the --in-place argument with -o to create a new file with the modification:

Codee command
codee rewrite --check-id pwr052 --variant omp-for -o luleshmk_codee.c luleshmk.c:190:5 -- clang luleshmk.c -lm -O3
Codee output
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang luleshmk.c -lm -O3

[1/1] luleshmk.c ... Done

Results for file '/home/codee/codee-demos/C/LULESHmk/src/luleshmk.c':
Successfully applied AutoFix to the loop at 'luleshmk.c:CalcFBHourglassForceForElems:190:5' [using multi-threading]:
[INFO] luleshmk.c:190:5 Parallel sparse reduction pattern identified for variable 'domain_m_fz' with associative, commutative operator '+'
[INFO] luleshmk.c:190:5 Parallel sparse reduction pattern identified for variable 'domain_m_fy' with associative, commutative operator '+'
[INFO] luleshmk.c:190:5 Parallel sparse reduction pattern identified for variable 'domain_m_fx' with associative, commutative operator '+'
[INFO] luleshmk.c:190:5 Available parallelization strategies for variable 'domain_m_fz'
[INFO] luleshmk.c:190:5 #1 OpenMP atomic access (* implemented)
[INFO] luleshmk.c:190:5 #2 OpenMP explicit privatization
[INFO] luleshmk.c:190:5 Available parallelization strategies for variable 'domain_m_fy'
[INFO] luleshmk.c:190:5 #1 OpenMP atomic access (* implemented)
[INFO] luleshmk.c:190:5 #2 OpenMP explicit privatization
[INFO] luleshmk.c:190:5 Available parallelization strategies for variable 'domain_m_fx'
[INFO] luleshmk.c:190:5 #1 OpenMP atomic access (* implemented)
[INFO] luleshmk.c:190:5 #2 OpenMP explicit privatization
[INFO] luleshmk.c:190:5 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] luleshmk.c:190:5 Parallel region defined by OpenMP directive 'parallel'
[INFO] luleshmk.c:190:5 Make sure there is no aliasing among variables: domain_m_fz, domain_m_fy, domain_m_fx

Successfully created luleshmk_codee.c

Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
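The last INFO message asks you to verify that domain_m_fx, domain_m_fy, and domain_m_fz do not alias each other. In C, one way to document that assumption at the interface is to qualify the pointer parameters with restrict (a sketch only; the LULESHmk sources may not do this):

/* Sketch: 'restrict' promises the compiler these pointers don't alias. */
void CalcFBHourglassForceForElems(Index_t numElem,
                                  Index_t *restrict domain_m_nodelist,
                                  Real_t *restrict domain_m_fx,
                                  Real_t *restrict domain_m_fy,
                                  Real_t *restrict domain_m_fz);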

Note the presence of the #pragma omp atomic update directives in the updated code:

diff -C 2 luleshmk.c luleshmk_codee.c
          Index_t n7si2 = domain_m_nodelist[(8) * i2 + 7];

+         #pragma omp atomic update
          domain_m_fx[n0si2] += hgfx[0];
+         #pragma omp atomic update
          domain_m_fy[n0si2] += hgfy[0];
+         #pragma omp atomic update
          domain_m_fz[n0si2] += hgfz[0];
  <...>
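The diff only shows the atomic updates; per the INFO messages above, Codee also wraps the loop in a parallel region and a work-sharing 'for' directive. The overall shape of the rewritten loop is roughly as follows (a sketch reconstructed from the report; the exact pragmas and clauses in luleshmk_codee.c may differ):

// Parallel region, per the "Parallel region defined by OpenMP directive
// 'parallel'" INFO message; the clause list is an assumption for the sketch.
#pragma omp parallel shared(numElem, gamma, domain_m_nodelist, domain_m_fx, domain_m_fy, domain_m_fz)
{
    #pragma omp for
    for (Index_t i2 = 0; i2 < numElem; ++i2) {
        Real_t hgfx[8], hgfy[8], hgfz[8];

        CalcElemFBHourglassForce(i2, gamma, hgfx, hgfy, hgfz);

        Index_t n0si2 = domain_m_nodelist[(8) * i2 + 0];
        <...>

        // Each indirect update is protected individually
        #pragma omp atomic update
        domain_m_fx[n0si2] += hgfx[0];
        #pragma omp atomic update
        domain_m_fy[n0si2] += hgfy[0];
        #pragma omp atomic update
        domain_m_fz[n0si2] += hgfz[0];
        <...>
    }
} // end parallel region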

4. Execution

Compile the original source code of LULESHmk (luleshmk.c) and the optimized version (luleshmk_codee.c) to compare their performance. For instance, using the AMD clang compiler:

Compiler commands
clang luleshmk.c -lm -O3 -o luleshmk && \
clang luleshmk_codee.c -lm -O3 -fopenmp -o luleshmk_codee

Then run the original executable (luleshmk) and the optimized one (luleshmk_codee), using 48 threads for the OpenMP version (set through the OMP_NUM_THREADS environment variable). The choice of thread count is based on experimentation; for more details about the machine's architecture, see the HBv4-series virtual machine overview.

./luleshmk
- Configuring the test...
- Executing the test...
- Verifying the test...
Run completed:
Problem size = 30
MPI tasks = 1
Iteration count = 932
Final Origin Energy = 9.330000e+02
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 8.178369e+06
TotalAbsDiff = 1.267647e+09
MaxRelDiff = 9.665601e-01


Elapsed time = 2.86 (s)
Grind time (us/z/c) = 0.11364839 (per dom) (0.11364839 overall)
FOM = 8799.069 (z/s)


numNodes = 27000
numElems = 30000
checksum_f = 3.28901e+11
checksum_e = 4.37594e+12
OMP_NUM_THREADS=48 ./luleshmk_codee
- Configuring the test...
- Executing the test...
- Verifying the test...
Run completed:
Problem size = 30
MPI tasks = 1
Iteration count = 932
Final Origin Energy = 9.330000e+02
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 8.178369e+06
TotalAbsDiff = 1.267647e+09
MaxRelDiff = 9.665601e-01


Elapsed time = 1.59 (s)
Grind time (us/z/c) = 0.063334412 (per dom) (0.063334412 overall)
FOM = 15789.205 (z/s)


numNodes = 27000
numElems = 30000
checksum_f = 3.28901e+11
checksum_e = 4.37594e+12

5. Results

Across 10 executions, the optimized version ran in 1.06 ± 0.03 seconds, while the original took 24.56 ± 0.01 seconds, a speedup of roughly 23x.