MATMUL (C)
This guide walks you through using Codee to optimize MATMUL, a matrix multiplication code, by parallelizing its computations on the CPU with OpenMP.
This guide was created using an Azure HBv4 machine and the AMD compilers. The steps have also been tested with the GNU and Intel compilers, so you can follow along by substituting the corresponding compilation flags.
For a deeper look at the architecture of the machine used, see the official Microsoft page, HBv4-series virtual machine overview.
Prerequisites
Ensure you have:
- Access to an Azure machine and Codee installed on it.
- AMD clang compiler.
- The codee-demos repository on your machine.
To clone the necessary repository, execute the following in your terminal:
git clone https://github.com/codee-com/codee-demos.git
Getting started
First, navigate to the source code for MATMUL:
cd codee-demos/C/MATMUL
Walkthrough
1. Explore the source code
The computation is handled by a triple-nested for loop within main.c:
// Accumulation
for (size_t i = 0; i < m; i++) {
    for (size_t j = 0; j < n; j++) {
        for (size_t k = 0; k < p; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
2. Run the checks report
It is recommended to run the screening report first to obtain a ranking of the checkers, which can help you decide which fixes to apply first.
To explore how Codee can help speed up this loop by parallelizing its execution on the CPU, use --target-arch to include multithreading-related checks in the analysis:
codee checks --verbose --target-arch cpu main.c:matmul -- clang main.c -I include -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang main.c -I include -O3
[1/1] main.c ... Done
CHECKS REPORT
main.c:16:9 [PWR039] (level: L1): Consider loop interchange to improve the locality of reference and enable vectorization
Loops to interchange:
16: for (size_t j = 0; j < n; j++) {
17: for (size_t k = 0; k < p; k++) {
Suggestion: Interchange inner and outer loops in the loop nest to improve performance
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR039
AutoFix:
codee rewrite --check-id pwr039 --in-place main.c:16:9 -- clang main.c -I include -O3
<...>
main.c:15:5 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR050
AutoFix (choose one option):
* Using OpenMP 'for' (recommended):
codee rewrite --check-id pwr050 --variant omp-for --in-place main.c:15:5 -- clang main.c -I include -O3
* Using OpenMP 'taskwait':
codee rewrite --check-id pwr050 --variant omp-taskwait --in-place main.c:15:5 -- clang main.c -I include -O3
* Using OpenMP 'taskloop':
codee rewrite --check-id pwr050 --variant omp-taskloop --in-place main.c:15:5 -- clang main.c -I include -O3
<...>
SUGGESTIONS
Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR039 --verbose --target-arch cpu main.c:matmul -- clang main.c -I include -O3
1 file, 1 function, 5 loops, 55 LOCs successfully analyzed (7 checkers) and 0 non-analyzed files in 1322 ms
Codee suggests several options to optimize the loop, including automatic code generation for loop interchange, which improves memory access patterns, and for parallelization with OpenMP.
3. Autofix
Let's use Codee's autofix capabilities to automatically optimize the code. We will apply the loop interchange optimization first, and then apply the OpenMP pragmas.
Loop interchange
We can copy-paste the suggested Codee invocation to perform the loop interchange; replace the --in-place argument with -o to create a new file with the modification:
codee rewrite --check-id pwr039 -o main_li.c main.c:16:9 -- clang main.c -I include -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang main.c -I include -O3
[1/1] main.c ... Done
Results for file '/home/codee/codee-demos/C/MATMUL/main.c':
Successfully applied AutoFix to the loop at 'main.c:16:9' [using loop interchange]:
[INFO] Loops interchanged:
- main.c:16:9
- main.c:17:13
Successfully created main_li.c
Let's confirm that the loop interchange has been correctly applied:
// Accumulation
for (size_t i = 0; i < m; i++) {
! for (size_t j = 0; j < n; j++) {
! for (size_t k = 0; k < p; k++) {
C[i][j] += A[i][k] * B[k][j];
}
--- 14,21 ----
// Accumulation
for (size_t i = 0; i < m; i++) {
! // Codee: Loop modified by Codee (2025-03-25 16:57:35)
! // Codee: Technique applied: loop interchange
! for (size_t k = 0; k < p; k++) {
! for (size_t j = 0; j < n; j++) {
C[i][j] += A[i][k] * B[k][j];
}
OpenMP
For convenience, we can run the checks report again on main_li.c to make Codee generate updated autofix commands:
codee checks --verbose --target-arch cpu main_li.c:matmul -- clang main_li.c -I include -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang main_li.c -I include -O3
[1/1] main_li.c ... Done
CHECKS REPORT
<...>
main_li.c:15:5 [PWR050] (level: L2): Consider applying multithreading parallelism to forall loop
Suggestion: Use 'rewrite' to automatically optimize the code
Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR050
AutoFix (choose one option):
* Using OpenMP 'for' (recommended):
codee rewrite --check-id pwr050 --variant omp-for --in-place main_li.c:15:5 -- clang main_li.c -I include -O3
* Using OpenMP 'taskwait':
codee rewrite --check-id pwr050 --variant omp-taskwait --in-place main_li.c:15:5 -- clang main_li.c -I include -O3
* Using OpenMP 'taskloop':
codee rewrite --check-id pwr050 --variant omp-taskloop --in-place main_li.c:15:5 -- clang main_li.c -I include -O3
<...>
SUGGESTIONS
Use --check-id to focus on specific subsets of checkers, e.g.:
codee checks --check-id PWR053 --verbose --target-arch cpu main_li.c:matmul -- clang main_li.c -I include -O3
1 file, 1 function, 5 loops, 55 LOCs successfully analyzed (8 checkers) and 0 non-analyzed files in 28 ms
Let's now apply the OpenMP multithreading; as before, replace the --in-place argument with -o to create a new file with the modification:
codee rewrite --check-id pwr050 --variant omp-for -o main_codee.c main_li.c:15:5 -- clang main_li.c -I include -O3
Date: 2025-04-16 Codee version: 2025.2 License type: Full
Compiler invocation: clang main_li.c -I include -O3
[1/1] main_li.c ... Done
Results for file '/home/codee/codee-demos/C/MATMUL/main_li.c':
Successfully applied AutoFix to the loop at 'main_li.c:matmul:15:5' [using multi-threading]:
[INFO] main_li.c:15:5 Parallel forall: variable 'C'
[INFO] main_li.c:15:5 Loop parallelized with multithreading using OpenMP directive 'for'
[INFO] main_li.c:15:5 Parallel region defined by OpenMP directive 'parallel'
[INFO] main_li.c:15:5 Make sure there is no aliasing among variables: A, B, C
Successfully created main_codee.c
Minimum software stack requirements: OpenMP version 3.0 with multithreading capabilities
By default, the generated OpenMP code creates a parallel region with parallel, distributes the loop iterations across threads with for, and lets the compiler and runtime choose the scheduling policy with schedule(auto):
// Accumulation
+ // Codee: Loop modified by Codee (2025-03-25 17:01:24)
+ // Codee: Technique applied: multithreading with 'omp-for' pragmas
+ #pragma omp parallel default(none) shared(A, B, C, m, n, p)
+ {
+ #pragma omp for schedule(auto)
for (size_t i = 0; i < m; i++) {
// Codee: Loop modified by Codee (2025-03-25 16:57:35)
// Codee: Technique applied: loop interchange
for (size_t k = 0; k < p; k++) {
for (size_t j = 0; j < n; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
+ } // end parallel
}
int main(int argc, char *argv[]) {
int param_iters = 1;
4. Execution
Compile the original source code of MATMUL (main.c) and the optimized version (main_codee.c) to compare their performance. For instance, using the AMD clang compiler:
clang clock.c matrix.c main.c -I include/ -O3 -o matmul && \
clang clock.c matrix.c main_codee.c -I include/ -O3 -fopenmp -o matmul_codee
Then run the original executable (matmul) and the optimized one (matmul_codee), choosing a problem size of 3000 and using 96 threads.
The number of threads was chosen based on experimentation; for more details on the machine's architecture, see the HBv4-series virtual machine overview.
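The exact run commands depend on the demo's command-line interface (check the repository's README); assuming the problem size is passed as a positional argument, a run with 96 threads might look like:

```shell
# OMP_NUM_THREADS controls how many threads the OpenMP runtime uses
export OMP_NUM_THREADS=96

# Problem size of 3000 (assumed to be the first positional argument)
./matmul 3000
./matmul_codee 3000
```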
- Input parameters
n = 3000
- Executing test...
time (s)= 25.745224
size = 3000
chksum = 546933812237
- Input parameters
n = 3000
- Executing test...
time (s)= 5.191395
size = 3000
chksum = 546933812237
5. Results
Across 10 executions, the optimized version ran in 0.26 ± 0.01 seconds, while the original took 5.37 ± 0.02 seconds, a ~20.62x speedup.