MbedTLS optimization through vectorization

Goal

This guide walks you through using Codee to optimize MbedTLS, an open-source library of cryptographic algorithms.

This guide is part of the NERSC + Codee Training Series 2024. Code available for download at the previous link.

Getting started

First, navigate to the source code for MbedTLS:

cd codee-demos/C/MbedTLS

Next, load the latest Codee version available on Perlmutter:

module load codee/2024.3.1

Walkthrough

1. Generate the `compile_commands.json`

This project uses CMake, which has native support for exporting compilation databases. Add the -DCMAKE_EXPORT_COMPILE_COMMANDS=ON flag to the CMake invocation:

CMake invocation
cmake -DENABLE_TESTING=ON -DCMAKE_C_COMPILER=gcc \
    -DUSE_SHARED_MBEDTLS_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DMBEDTLS_FATAL_WARNINGS=OFF \
    -DCMAKE_C_FLAGS=-fopenmp-simd -B build -G "Unix Makefiles" && \
    cmake --build build -j

2. Run the global screening report

To explore the recommendations of the Open Catalog that are applicable to MbedTLS, let's run Codee's screening report; use --compile-commands to point to the compilation database:

Codee command

codee screening --compile-commands build/compile_commands.json

Codee output
Configuration file 'build/compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full

[  1/263] /global/homes/u/user/codee-demos/C/MbedTLS/tests/src/asn1_helpers.c ... Done
[  2/263] /global/homes/u/user/codee-demos/C/MbedTLS/tests/src/certs.c ... Done
<...>
[263/263] /global/homes/u/user/codee-demos/C/MbedTLS/build/tests/test_suite_oid.c ... Done

SCREENING REPORT

----Number of files----
Total | C   C++ Fortran
----- | --- --- -------
263   | 263 0   0

Lines of code Analysis time # checks Profiling
------------- ------------- -------- ---------
245177        42.55 s       1369     n/a

Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analysis time : time required to analyze the target
# checks : total actionable items (opportunities, recommendations, defects and remarks) detected
Profiling : estimation of overall execution time required by this target

RANKING OF CHECKERS

Checker Priority AutoFix #   Title
------- -------- ------- --- ----------------------------------------------------------------------------------------------------------------------------
RMK015  P27 (L1)         263 Tune compiler optimization flags to increase the speed of the code
PWR003  P18 (L1)         119 Explicitly declare pure functions
PWR053  P12 (L1)  x      72  Consider applying vectorization to forall loop
PWR054  P12 (L1)  x      4   Consider applying vectorization to scalar reduction loop
PWR024  P8 (L2)          3   Loop can be rewritten in OpenMP canonical form
PWR023  P6 (L2)          3   Add 'restrict' for pointer function parameters to hint the compiler that vectorization is safe
PWR018  P6 (L2)          1   Call to recursive function within a loop inhibits vectorization
PWR010  P4 (L3)          91  Avoid column-major array access in C/C++
PWR034  P4 (L3)          12  Avoid strided array access to improve performance
PWR001  P3 (L3)          458 Declare global variables as function parameters
PWR002  P3 (L3)          31  Declare scalar variables in the smallest possible scope
PWR029  P3 (L3)          1   Remove integer increment preventing performance optimization
PWR012  P2 (L3)          108 Pass only required fields from derived type as parameters
PWR035  P2 (L3)          95  Avoid non-consecutive array access to improve performance
PWR016  P2 (L3)          63  Use separate arrays instead of an Array-of-Structs
PWR028  P2 (L3)          17  Remove pointer increment preventing performance optimization
PWR036  P2 (L3)          14  Avoid indirect array access to improve performance
PWR049  P2 (L3)          6   Move iterator-dependent condition outside of the loop
RMK010  P0 (L3)          6   The vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body
RMK014  P0 (L3)          2   The vectorization cost model states the loop is not a SIMD opportunity due to unpredictable memory accesses in the loop body

SUGGESTIONS

  Use 'roi' to get a return of investment estimation report:
        codee roi --compile-commands build/compile_commands.json

  Focus the analysis on a specific file before proceeding with the Codee auto mode or the guided mode:
        codee screening specific/file.c --compile-commands build/compile_commands.json

263 files, 5460 functions, 12976 loops successfully analyzed (1369 checkers) and 0 non-analyzed files in 47.80 s

All the source files were successfully analyzed and 1369 checkers were reported. The different types of checkers reported can be seen in the RANKING section of the output.

3. Run the screening report for specific files

When using Codee for performance optimization, it is important to have the code hotspots identified to target Codee's reports. Let's run the screening report again, restricting the analysis to one of those hotspots:

Codee command

codee screening --compile-commands build/compile_commands.json library/aes.c

Codee output
Configuration file 'build/compile_commands.json' successfully parsed.
Date: 2024-09-06 Codee version: 2024.3.1 License type: Full

[1/1] library/aes.c (2 entries) ... Done

SCREENING REPORT

---Number of files---
Total | C C++ Fortran
----- | - --- -------
1     | 1 0   0

Lines of code Analysis time # checks Profiling
------------- ------------- -------- ---------
1657          401 ms        36       n/a

Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analysis time : time required to analyze the target
# checks : total actionable items (opportunities, recommendations, defects and remarks) detected
Profiling : estimation of overall execution time required by this target

RANKING OF CHECKERS

Checker Priority AutoFix # Title
------- -------- ------- - ----------------------------------------------------------------------------------------------------------------------------
RMK015  P27 (L1)         1 Tune compiler optimization flags to increase the speed of the code
PWR053  P12 (L1)  x      7 Consider applying vectorization to forall loop
PWR024  P8 (L2)          1 Loop can be rewritten in OpenMP canonical form
PWR001  P3 (L3)          6 Declare global variables as function parameters
PWR002  P3 (L3)          5 Declare scalar variables in the smallest possible scope
PWR036  P2 (L3)          9 Avoid indirect array access to improve performance
PWR028  P2 (L3)          5 Remove pointer increment preventing performance optimization
RMK010  P0 (L3)          1 The vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body
RMK014  P0 (L3)          1 The vectorization cost model states the loop is not a SIMD opportunity due to unpredictable memory accesses in the loop body

SUGGESTIONS

  Use 'roi' to get a return of investment estimation report:
        codee roi --compile-commands build/compile_commands.json library/aes.c

  Use 'checks' to find out details about the detected checks:
        codee checks --compile-commands build/compile_commands.json library/aes.c

1 file, 21 functions, 88 loops successfully analyzed (36 checkers) and 0 non-analyzed files in 625 ms

Note how the ranking of checkers lists the different types of checkers reported, ordered by priority, so the checkers at the top are the most important ones. In this case, the screening report indicates that the PWR053 checker was reported 7 times with a high priority (P12, level L1).

4. Run the checks report

Next, we need the full list of PWR053 occurrences, each one pointing at a specific line of code. Codee's checks report provides this list; add the --check-id PWR053 flag to filter the results by that checker.

Codee command

codee checks --compile-commands build/compile_commands.json library/aes.c --check-id PWR053

Codee output
Configuration file 'build/compile_commands.json' successfully parsed.
Date: 2024-09-03 Codee version: 2024.3 License type: Full

[1/1] library/aes.c (2 entries) ... Done

CHECKS REPORT

/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1048:13 [PWR053] (level: L1): Consider applying vectorization to forall loop
/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1062:13 [PWR053] (level: L1): Consider applying vectorization to forall loop
/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1162:9 [PWR053] (level: L1): Consider applying vectorization to forall loop
/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1169:9 [PWR053] (level: L1): Consider applying vectorization to forall loop
/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1194:9 [PWR053] (level: L1): Consider applying vectorization to forall loop
/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1202:9 [PWR053] (level: L1): Consider applying vectorization to forall loop
/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1211:9 [PWR053] (level: L1): Consider applying vectorization to forall loop

SUGGESTIONS

  Use --verbose to get more details, e.g:
        codee checks --verbose --compile-commands build/compile_commands.json library/aes.c --check-id PWR053

1 file, 21 functions, 88 loops successfully analyzed (7 checkers) and 0 non-analyzed files in 193 ms

5. Run the checks report in verbose mode

Re-run the checks report with --verbose to get more details for each checker, including the different autofix options:

Codee command

codee checks --compile-commands build/compile_commands.json library/aes.c --check-id PWR053 --verbose

Codee output
Configuration file 'build/compile_commands.json' successfully parsed.
Date: 2024-09-03 Codee version: 2024.3 License type: Full

[1/1] library/aes.c (2 entries) ... Done

CHECKS REPORT

/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1048:13 [PWR053] (level: L1): Consider applying vectorization to forall loop
  Suggestion: Use 'rewrite' to automatically optimize the code
  Documentation: https://github.com/codee-com/open-catalog/tree/main/Checks/PWR053
  AutoFix (choose one option):
    * Using OpenMP pragmas (recommended):
        codee rewrite --vector omp --in-place library/aes.c:1048:13 --compile-commands build/compile_commands.json
    * Using Clang compiler pragmas:
        codee rewrite --vector clang --in-place library/aes.c:1048:13 --compile-commands build/compile_commands.json
    * Using GCC pragmas:
        codee rewrite --vector gcc --in-place library/aes.c:1048:13 --compile-commands build/compile_commands.json
    * Using ICC pragmas:
        codee rewrite --vector icc --in-place library/aes.c:1048:13 --compile-commands build/compile_commands.json
    * Using combined pragmas, for example (for GCC and Clang pragmas):
        codee rewrite --vector gcc,clang --in-place library/aes.c:1048:13 --compile-commands build/compile_commands.json

<...>

1 file, 21 functions, 88 loops successfully analyzed (7 checkers) and 0 non-analyzed files in 182 ms

5. Autofix

Use Codee's autofix capabilities to automatically optimize the code. The recommended rewriting option is to apply vectorization with OpenMP pragmas.

Additionally, we will remove the loop filter. This way, the autofix is applied to all the loops reported as vectorizable within aes.c, saving us from having to run codee rewrite for each loop individually.

Codee command
codee rewrite --vector omp --in-place library/aes.c --compile-commands build/compile_commands.json

Codee output
Configuration file 'build/compile_commands.json' successfully parsed.
Date: 2024-09-03 Codee version: 2024.3 License type: Full

Results for file '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c':
  Successfully applied AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:aes_gen_tables:424:5' [using SIMD]:
  <...>
  Successfully applied AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:mbedtls_aes_setkey_enc:568:5' [using SIMD]:
  <...>
  Successfully applied AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:mbedtls_aes_crypt_cbc:1048:13' [using SIMD]:
  <...>
  Successfully applied AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:mbedtls_aes_crypt_cbc:1062:13' [using SIMD]:
  <...>
  Successfully applied AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:mbedtls_aes_crypt_xts:1162:9' [using SIMD]:
  <...>
  Successfully applied AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:mbedtls_aes_crypt_xts:1169:9' [using SIMD]:
  <...>
  Successfully applied AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:mbedtls_aes_crypt_xts:1194:9' [using SIMD]:
  <...>
  Could not apply AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:mbedtls_aes_crypt_xts:1202:9' [using SIMD]:
      [WARNING] /global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:1202:9 Loop is not OpenMP compliant
  <...>
  Successfully applied AutoFix to the loop at '/global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c:mbedtls_aes_crypt_xts:1211:9' [using SIMD]:
<...>
Successfully updated /global/homes/u/user/codee-demos/C/MbedTLS/library/aes.c

Minimum software stack requirements: OpenMP version 4.0 with simd capabilities

Review the source code changes, for instance, using a version control system:

git diff .
diff --git a/library/aes.c b/library/aes.c
index 4afc3c48a..9b51789e9 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -421,6 +421,9 @@ static void aes_gen_tables( void )
     /*
      * generate the forward and reverse tables
      */
+    // Codee: Loop modified by Codee (2024-09-03 00:49:13)
+    // Codee: Technique applied: vectorization with 'omp' pragmas
+    #pragma omp simd private(x, y, z)
     for( i = 0; i < 256; i++ )
     {
         x = FSb[i];
@@ -565,6 +568,9 @@ int mbedtls_aes_setkey_enc( mbedtls_aes_context *ctx, const unsigned char *key,
         return( mbedtls_aesni_setkey_enc( (unsigned char *) ctx->rk, key, keybits ) );
 #endif
 
+    // Codee: Loop modified by Codee (2024-09-03 00:49:13)
+    // Codee: Technique applied: vectorization with 'omp' pragmas
+    #pragma omp simd
     for( i = 0; i < ( keybits >> 5 ); i++ )
     {
         RK[i] = MBEDTLS_GET_UINT32_LE( key, i << 2 );
@@ -1045,6 +1051,9 @@ int mbedtls_aes_crypt_cbc( mbedtls_aes_context *ctx,
             if( ret != 0 )
                 goto exit;
 
+            // Codee: Loop modified by Codee (2024-09-03 00:49:13)
+            // Codee: Technique applied: vectorization with 'omp' pragmas
+            #pragma omp simd
             for( i = 0; i < 16; i++ )
                 output[i] = (unsigned char)( output[i] ^ iv[i] );
 
@@ -1059,6 +1068,9 @@ int mbedtls_aes_crypt_cbc( mbedtls_aes_context *ctx,
     {
         while( length > 0 )
         {
+            // Codee: Loop modified by Codee (2024-09-03 00:49:13)
+            // Codee: Technique applied: vectorization with 'omp' pragmas
+            #pragma omp simd
             for( i = 0; i < 16; i++ )
                 output[i] = (unsigned char)( input[i] ^ iv[i] );
 
@@ -1159,6 +1171,9 @@ int mbedtls_aes_crypt_xts( mbedtls_aes_xts_context *ctx,
             mbedtls_gf128mul_x_ble( tweak, tweak );
         }
 
+        // Codee: Loop modified by Codee (2024-09-03 00:49:13)
+        // Codee: Technique applied: vectorization with 'omp' pragmas
+        #pragma omp simd
         for( i = 0; i < 16; i++ )
             tmp[i] = input[i] ^ tweak[i];
 
@@ -1166,6 +1181,9 @@ int mbedtls_aes_crypt_xts( mbedtls_aes_xts_context *ctx,
         if( ret != 0 )
             return( ret );
 
+        // Codee: Loop modified by Codee (2024-09-03 00:49:13)
+        // Codee: Technique applied: vectorization with 'omp' pragmas
+        #pragma omp simd
         for( i = 0; i < 16; i++ )
             output[i] = tmp[i] ^ tweak[i];
 
@@ -1191,6 +1209,9 @@ int mbedtls_aes_crypt_xts( mbedtls_aes_xts_context *ctx,
          * byte of cyphertext we won't steal. At the same time, copy the
          * remainder of the input for this final round (since the loop bounds
          * are the same). */
+        // Codee: Loop modified by Codee (2024-09-03 00:49:13)
+        // Codee: Technique applied: vectorization with 'omp' pragmas
+        #pragma omp simd lastprivate(i)
         for( i = 0; i < leftover; i++ )
         {
             output[i] = prev_output[i];
@@ -1208,6 +1229,9 @@ int mbedtls_aes_crypt_xts( mbedtls_aes_xts_context *ctx,
 
         /* Write the result back to the previous block, overriding the previous
          * output we copied. */
+        // Codee: Loop modified by Codee (2024-09-03 00:49:13)
+        // Codee: Technique applied: vectorization with 'omp' pragmas
+        #pragma omp simd
         for( i = 0; i < 16; i++ )
             prev_output[i] = tmp[i] ^ t[i];
     }

6. Execution

Compile the code with the optimizations applied by Codee:

cmake --build build -j

And run the benchmark AES_XTS, corresponding to aes.c:

MbedTLS optimized benchmark invocation

./build/programs/test/benchmark aes_xts

Optimized benchmark output

  AES-XTS-128              :     874306 KiB/s,          4 cycles/byte
  AES-XTS-256              :     715399 KiB/s,          4 cycles/byte

Lastly, revert the changes applied by Codee so that the code returns to the original version:

git command
git restore .

Recompile so that the binaries return to their original version as well:

cmake --build build -j

And run the benchmark AES_XTS, to compare the results with the ones obtained earlier with the code optimized by Codee:

MbedTLS original benchmark invocation

./build/programs/test/benchmark aes_xts

Original benchmark output

  AES-XTS-128              :     715509 KiB/s,          4 cycles/byte
  AES-XTS-256              :     653419 KiB/s,          5 cycles/byte

Note how the version optimized by Codee achieves higher throughput than the original: a speedup of roughly 22% for AES-XTS-128 (874306 vs. 715509 KiB/s) and roughly 9.5% for AES-XTS-256 (715399 vs. 653419 KiB/s).
