PI (C)
This guide walks you through using Codee to optimize PI, a code that estimates the value of pi, by parallelizing its computations on the CPU with OpenMP.
It was created using an Azure HBv4 machine and the AMD compilers. These steps have also been tested with the GNU and Intel compilers, so you can follow along by replacing the corresponding compilation flags.
For a deeper look into the architecture of the machine used, please visit the official Microsoft webpage, HBv4-series virtual machine overview.
Prerequisites
Ensure you have:
- Access to an Azure machine and Codee installed on it.
- AMD clang compiler.
- The codee-demos repository on your machine.
To clone the necessary repository, just run the following in your terminal:
git clone https://github.com/codee-com/codee-demos.git
Getting started
First, navigate to the source code for PI:
cd codee-demos/C/PI
Walkthrough
1. Explore the source code
The computation is handled by a single for loop within pi.c:
double sum = 0.0;
for (unsigned long i = 0; i < N; i++) {
    double x = (i + 0.5) / N;
    sum += sqrt(1 - x * x);
}
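Each iteration evaluates the quarter-circle function sqrt(1 - x*x) at the midpoint x = (i + 0.5)/N, so the loop is a midpoint-rule approximation of an integral whose exact value is pi/4; this is why the program later scales the accumulated sum by 4/N (out_result = 4.0 / N * sum):

\[
\frac{1}{N}\sum_{i=0}^{N-1}\sqrt{1-\left(\frac{i+0.5}{N}\right)^{2}}
\;\approx\; \int_{0}^{1}\sqrt{1-x^{2}}\,dx \;=\; \frac{\pi}{4}
\]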
2. Run the checks report
It is recommended to run the screening report first to obtain a ranking of the checkers, which can help you decide which ones to address first.
To explore how Codee can help speed up this loop by parallelizing it, use --target-arch to include CPU-related checks in the analysis:
codee checks --verbose --target-arch cpu pi.c:main -- clang pi.c -lm -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang pi.c -lm -O3
[1/1] pi.c ... Done
CHECKS REPORT
<...>
pi.c:31:5 [PWR051] (level: L2): Consider applying multithreading parallelism to scalar reduction loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR051
AutoFix (choose one option):
* Using OpenMP 'for' with built-in reduction (recommended):
codee rewrite --check-id pwr051 --variant omp-for --in-place pi.c:31:5 -- clang pi.c -lm -O3
* Using OpenMP 'for' with explicit privatization:
codee rewrite --check-id pwr051 --variant omp-for --in-place --explicit-privatization sum pi.c:31:5 -- clang pi.c -lm -O3
* Using OpenMP 'taskwait':
codee rewrite --check-id pwr051 --variant omp-taskwait --in-place pi.c:31:5 -- clang pi.c -lm -O3
* Using OpenMP 'taskloop':
codee rewrite --check-id pwr051 --variant omp-taskloop --in-place pi.c:31:5 -- clang pi.c -lm -O3
SUGGESTIONS
Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR054 --verbose --target-arch cpu pi.c:main -- clang pi.c -lm -O3
1 file, 1 function, 1 loop, 44 LOCs successfully analyzed (3 checkers) and 0 non-analyzed files in 26 ms
Codee suggests several options to optimize the loop, including automatic generation of the parallelized code.
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code with OpenMP.
We can copy-paste the suggested Codee invocation to generate the OpenMP multithreading version; replace the --in-place argument with -o to create a new file with the modification:
codee rewrite --check-id pwr051 --variant omp-for -o pi_codee.c pi.c:31:5 -- clang pi.c -lm -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang pi.c -lm -O3
[1/1] pi.c ... Done
Results for file '/home/codee/codee-demos/C/PI/pi.c':
Successfully applied AutoFix to the loop at 'pi.c:main:31:5' [using multi-threading]:
[INFO] pi.c:31:5 Parallel scalar reduction pattern identified for variable 'sum' with associative, commutative operator '+'
[INFO] pi.c:31:5 Available parallelization strategies for variable 'sum'
[INFO] pi.c:31:5 #1 OpenMP scalar reduction (* implemented)
[INFO] pi.c:31:5 #2 OpenMP atomic access
[INFO] pi.c:31:5 #3 OpenMP explicit privatization
[INFO] pi.c:31:5 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] pi.c:31:5 Parallel region defined by OpenMP directive 'parallel'
Successfully created pi_codee.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
By default, the generated OpenMP code parallelizes the loop with a parallel region, handles the reduction of the sum variable, and delegates the choice of iteration scheduling to the compiler and runtime through schedule(auto):
double sum = 0.0;
+ // Codee: Loop modified by Codee (2025-03-26 08:54:55)
+ // Codee: Technique applied: multithreading with 'omp-for' pragmas
+ #pragma omp parallel default(none) shared(N, sum)
+ {
+ #pragma omp for reduction(+: sum) schedule(auto)
for (unsigned long i = 0; i < N; i++) {
double x = (i + 0.5) / N;
sum += sqrt(1 - x * x);
}
+ } // end parallel
out_result = 4.0 / N * sum;
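Besides the built-in reduction shown above, the report also listed explicit privatization as strategy #3. The following standalone C sketch illustrates the general idea behind that strategy; it is only an approximation, not the code Codee generates with the --explicit-privatization option, and the hard-coded problem size is just an example value:

#include <math.h>
#include <stdio.h>

int main(void) {
    unsigned long N = 900000000UL;  // example problem size, not taken from pi.c
    double sum = 0.0;

    #pragma omp parallel default(none) shared(N, sum)
    {
        double sum_private = 0.0;  // per-thread partial sum (the explicit privatization)
        #pragma omp for schedule(auto)
        for (unsigned long i = 0; i < N; i++) {
            double x = (i + 0.5) / N;
            sum_private += sqrt(1 - x * x);
        }
        #pragma omp atomic update
        sum += sum_private;  // each thread merges its partial sum exactly once
    }

    printf("pi ~= %.8f\n", 4.0 / N * sum);
    return 0;
}

Compared with the reduction(+: sum) clause, this version makes the combination step explicit; the built-in reduction recommended by Codee is usually the simpler choice.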
4. Execution
Compile the original source code of PI (pi.c) and the optimized version (pi_codee.c) to compare their performance. For instance, using the AMD clang compiler:
clang pi.c -lm -O3 -o pi && \
clang pi_codee.c -lm -fopenmp -O3 -o pi_codee
Now run the original executable (pi) and the optimized one (pi_codee), choosing a problem size of 900000000 and using 48 threads. The number of threads was chosen based on experimentation; you can find more details about the architecture of the machine in the HBv4-series virtual machine overview.
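A possible way to launch both binaries, assuming the executable accepts the number of steps as its first command-line argument (an assumption about the demo's interface; adapt it to the usage your build prints) and using OMP_NUM_THREADS to cap the OpenMP version at 48 threads:

./pi 900000000
OMP_NUM_THREADS=48 ./pi_codee 900000000

The first output below corresponds to the original binary and the second to the parallelized one.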
- Input parameters
steps = 900000000
- Executing test...
time (s)= 3.323360
result = 3.14159265
error = 2.7e-15
- Input parameters
steps = 900000000
- Executing test...
time (s)= 0.095345
result = 3.14159265
error = 7.7e-14
5. Results
Across 10 executions, the optimized version ran in 0.04 ± 0.00 seconds, while the original took 24.56 ± 0.01 seconds, representing a ~27.5x speedup.
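To collect statistics like these yourself, one simple approach, reusing the hypothetical invocation from the execution step, is to repeat each run and extract the reported times before averaging them:

for i in $(seq 1 10); do ./pi 900000000; done | grep "time (s)"
for i in $(seq 1 10); do OMP_NUM_THREADS=48 ./pi_codee 900000000; done | grep "time (s)"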