COULOMB (C)
This guide walks you through using Codee to optimize Coulomb, a code that computes electrical currents in 2D planes, by parallelizing its computations on the CPU with OpenMP.
This guide was written using an Azure HBv4 machine and the AMD compilers. The steps have also been tested with the GNU and Intel compilers, so you can follow along by replacing the corresponding compilation flags.
Optimal performance is typically achieved using 48, 96, or 144 threads. For a closer look at the architecture of the machine used, see the official Microsoft page, HBv4-series virtual machine overview.
Prerequisites
Ensure you have:
- Access to an Azure machine and Codee installed on it.
- AMD clang compiler.
- The codee-demos repository on your machine.
To clone the repository, run the following in your terminal:
git clone https://github.com/codee-com/codee-demos.git
Getting started
First, navigate to the source code for COULOMB:
cd codee-demos/C/COULOMB
Walkthrough
1. Explore the source code
The computation is handled by a triple-nested for loop within coulomb.c:
    for(int i=0; i<rows; i++) {
        for(int j=0; j<cols; j++) {
            double mat_ij = 0;
            for(int k = 0; k < size; k+=4) {
                double dx = vec[k+0] - (scaleX * j + x0);
                double dy = vec[k+1] - (scaleY * i + y0);
                double dz = vec[k+2] - (z0);
                double charge = 1e-9 * vec[k+3];
                double dist = sqrt(dx*dx + dy*dy + dz*dz);
                mat_ij += charge / dist;
            }
            mat[j + i * cols] = mat_ij / (4 * PI * e0);
        }
    }
2. Run the checks report
It is recommended to run the screening report first to obtain a ranking of the checkers, which can help you decide which one to implement first.
To explore how Codee can help speed up this loop by parallelizing it, use --target-arch to include CPU-related checks in the analysis:
codee checks --verbose --target-arch cpu coulomb.c:coulomb -- clang coulomb.c -O3 -lm
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang coulomb.c -O3 -lm
[1/1] coulomb.c ... Done
CHECKS REPORT
<...>
coulomb.c:26:2 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR050
AutoFix (choose one option):
* Using OpenMP 'for' (recommended):
codee rewrite --check-id pwr050 --variant omp-for --in-place coulomb.c:26:2 -- clang coulomb.c -O3 -lm
* Using OpenMP 'taskwait':
codee rewrite --check-id pwr050 --variant omp-taskwait --in-place coulomb.c:26:2 -- clang coulomb.c -O3 -lm
* Using OpenMP 'taskloop':
codee rewrite --check-id pwr050 --variant omp-taskloop --in-place coulomb.c:26:2 -- clang coulomb.c -O3 -lm
<...>
SUGGESTIONS
Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR045 --verbose --target-arch cpu coulomb.c:coulomb -- clang coulomb.c -O3 -lm
1 file, 1 function, 3 loops, 102 LOCs successfully analyzed (6 checkers) and 0 non-analyzed files in 82 ms
Codee suggests various options to optimize the loop using OpenMP parallelization.
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code, following the recommended option.
We can copy and paste the suggested Codee invocation to perform the parallelization; just replace the --in-place argument with -o to create a new file with the modified code:
codee rewrite --check-id pwr050 --variant omp-for -o coulomb_codee.c coulomb.c:26:2 -- clang coulomb.c -O3 -lm
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang coulomb.c -O3 -lm
[1/1] coulomb.c ... Done
Results for file '/home/codee/codee-demos/C/COULOMB/coulomb.c':
Successfully applied AutoFix to the loop at 'coulomb.c:coulomb:26:2' [using multi-threading]:
[INFO] coulomb.c:26:2 Parallel forall: variable 'mat'
[INFO] coulomb.c:26:2 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] coulomb.c:26:2 Parallel region defined by OpenMP directive 'parallel'
[INFO] coulomb.c:26:2 Make sure there is no aliasing among variables: vec, mat
Successfully created coulomb_codee.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
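The aliasing note in the output above matters because the parallel loop is only correct if the writes to mat never touch memory that is read through vec. A minimal sketch of a defensive check follows. Both buffers_overlap and check_no_aliasing are hypothetical helpers, not part of Codee or the demo sources, and the comparison assumes a flat address space so that pointers can be compared after conversion to uintptr_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the non-empty byte ranges
 * [a, a + a_bytes) and [b, b + b_bytes) share at least one byte. */
int buffers_overlap(const void *a, size_t a_bytes,
                    const void *b, size_t b_bytes) {
    uintptr_t a_begin = (uintptr_t)a;
    uintptr_t a_end = a_begin + a_bytes;
    uintptr_t b_begin = (uintptr_t)b;
    uintptr_t b_end = b_begin + b_bytes;
    return a_begin < b_end && b_begin < a_end;
}

/* Hypothetical guard to call before the parallel loop:
 * aborts (in debug builds) if vec and mat alias each other. */
void check_no_aliasing(const double *vec, size_t size,
                       const double *mat, size_t rows, size_t cols) {
    assert(!buffers_overlap(vec, size * sizeof *vec,
                            mat, rows * cols * sizeof *mat));
}
```

A check like this only catches overlapping buffers at run time. It does not replace reviewing how vec and mat are allocated.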
double scaleY = (y1 - y0) / rows;
+ // Codee: Loop modified by Codee (2025-03-25 09:37:31)
+ // Codee: Technique applied: multithreading with 'omp-for' pragmas
+ #pragma omp parallel default(none) shared(PI, cols, e0, mat, rows, scaleX, scaleY, size, vec, x0, y0, z0)
+ {
+ #pragma omp for schedule(auto)
for(int i=0; i<rows; i++) {
for(int j=0; j<cols; j++) {
***************
*** 38,41 ****
--- 43,47 ----
}
}
+ } // end parallel
4. Execution
Compile the original source code of Coulomb (coulomb.c) and the optimized version (coulomb_codee.c) to compare their performance. For instance, using the AMD clang compiler:
clang Vector.c Matrix2D.c coulomb.c -O3 -lm -o coulomb && \
clang Vector.c Matrix2D.c coulomb_codee.c -O3 -fopenmp -lm -o coulomb_codee
Then run the original executable (coulomb) and the optimized one (coulomb_codee), choosing a problem size of 600 and using 128 threads.
The number of threads was chosen based on experimentation; for more details about the architecture of the machine, see HBv4-series virtual machine overview.
- Executing test...
time (s)= 48.512988
size = 600
chksum = 69448968446
- Executing test...
time (s)= 0.501747
size = 600
chksum = 69448968446
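The exact command used to select the thread count is not shown here. In OpenMP programs it is usually set through the standard OMP_NUM_THREADS environment variable (for example, OMP_NUM_THREADS=128). The sketch below is one hedged way to confirm from inside a program how many threads a parallel region may use. available_threads and print_thread_config are hypothetical helpers, not part of the demo.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: upper bound on the number of threads a new
 * parallel region would use, or 1 when built without OpenMP support. */
int available_threads(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Hypothetical helper: prints the value so it can be checked against
 * the intended setting before timing a run. */
void print_thread_config(void) {
    int n = available_threads();
    assert(n >= 1);
    printf("OpenMP threads available: %d\n", n);
}
```

Printing this value once at startup helps rule out a run that silently used a different thread count than intended.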
5. Results
Across 10 executions, the optimized version ran in 0.39 ± 0.01 seconds, while the original took 24.56 ± 0.01 seconds, a speedup of roughly 63x.
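The speedup figure follows directly from the two mean times: 24.56 / 0.39 ≈ 63. The short sketch below reproduces that arithmetic and also derives a parallel efficiency, assuming the timed runs used the 128 threads mentioned in the Execution step. The function names are illustrative, and the efficiency value is a derived estimate, not a measurement reported by this guide.

```c
#include <assert.h>

/* Speedup: how many times faster the optimized run is than the original. */
double speedup(double t_original, double t_optimized) {
    assert(t_original > 0.0 && t_optimized > 0.0);
    return t_original / t_optimized;
}

/* Parallel efficiency: fraction of ideal linear scaling achieved. */
double parallel_efficiency(double achieved_speedup, int threads) {
    assert(threads > 0);
    return achieved_speedup / threads;
}
```

With the times above, speedup(24.56, 0.39) is about 62.97, and parallel_efficiency of that value over 128 threads is about 0.49. In other words, under that assumption, roughly half of ideal linear scaling.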