Enabled differential benchmarking, modularized tests
kaukabrizvi committed Jul 26, 2024
1 parent fff7b15 commit f0d9c4c
Showing 2 changed files with 180 additions and 73 deletions.
98 changes: 59 additions & 39 deletions tests/regression/README.md
@@ -1,10 +1,15 @@
# Regression Testing for s2n-tls

This folder contains regression tests and benchmarking tools for the `s2n-tls` library. The tests focus on various aspects of TLS connections.


## Testing Philosophy

Currently, s2n-tls implements a wall-clock benchmarking tool which measures end-to-end handshake performance to compare s2n-tls with rustls and OpenSSL. In the past, s2n-tls tried detecting regressions through Criterion benchmarks in Rust, but subprocess and spin-up time contributed to the measurements, which made the results noisy and difficult to use in CI. This project has a slightly different focus, learning from those existing tools. Performance assertion in s2n-tls focuses on a benchmarking tool that can detail performance by API path, with enough repeatability and accuracy to detect regressions between two versions of s2n-tls so that performance analysis can occur at PR time. This means that the scope of each harness is limited and mutually exclusive of other harnesses, since we are interested in measuring the performance of the important paths a TLS connection typically follows.

### Why CPU instructions
The performance benchmarking framework uses CPU instruction counts across API paths to make the regression assertion. This measure reduces noise, ensuring that small performance differences are caught. While a difference in instruction count may not translate directly into a runtime difference, it is useful for comparing a PR to mainline and for digging into the specific sources of performance impact within the code.
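
Concretely, the regression check reduces to comparing an instruction count difference against a fixed threshold. Here is a minimal sketch of that assertion (the function and variable names are illustrative, not the harness's actual API; the real implementation lives in `tests/regression/src/lib.rs`):

```
/// Minimal sketch of the regression assertion, assuming the instruction
/// counts for the two versions under comparison have already been measured.
fn assert_no_regression(curr_count: i64, prev_count: i64, max_diff: i64) {
    // A positive difference means the current version executed more instructions.
    let diff = curr_count - prev_count;
    assert!(
        diff <= max_diff,
        "regression of {} instructions exceeds the threshold of {}",
        diff, max_diff
    );
}
```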

## Contents

1. **lib.rs**
@@ -28,68 +33,83 @@
To run the harnesses with Valgrind and store the annotated results, run:
ENABLE_VALGRIND=true cargo test
```

This will recursively call all tests with Valgrind enabled so that the performance output is generated and stored in `target/perf_outputs`. If you are looking for the scalar performance output of a PR, this provides insight into which portions of the code account for what share of the CPU instruction count.

## Running the Harnesses between versions (differential performance)
First, generate the scalar performance profiles for all harnesses on the current branch version of s2n-tls:
```
ENABLE_VALGRIND=true cargo test
```
Then `git checkout` or `git switch` to mainline or whichever version you are comparing against (make sure you have stashed or committed any changes first), and run:
```
ENABLE_VALGRIND=true TEST_SUFFIX=prev cargo test
DIFF_MODE=true cargo test
```
Finally, restore your original version by switching back to it. Diff mode asserts on the performance difference, computed as the current version's instruction count minus the previous version's. If the regression exceeds the constant `MAX_DIFF`, the test fails. Performance output profiles are stored in `target/perf_outputs` (a sample layout is shown below):
- `curr` for the current version
- `prev` for the version being compared to
- `diff` for the difference between the two
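
For example, after a differential run of the `test_rsa_handshake` harness, the output tree would look roughly like this (file names follow the `{test}_{suffix}.annotated.txt` pattern used by the harness code):

```
target/perf_outputs/
├── curr/test_rsa_handshake_curr.annotated.txt
├── prev/test_rsa_handshake_prev.annotated.txt
└── diff/test_rsa_handshake_diff.annotated.txt
```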

## Running the tests w/o Valgrind

```
cargo test
```

This will run the tests without Valgrind, checking that the harnesses complete as expected.

## Sample Output for Valgrind test (differential)

Running the differential test will run all harnesses and fail if any harness exceeds the performance threshold. For example, a regression test failure could look like:
```
running 2 tests
test tests::test_set_config ... FAILED
test tests::test_rsa_handshake ... ok

---- tests::test_set_config stdout ----
Instruction difference for test_set_config: 245746
thread 'tests::test_set_config' panicked at src/lib.rs:189:9:
Instruction count difference in test_set_config exceeds the threshold, regression of 245746 instructions
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    tests::test_set_config

test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.20s
```

It also produces annotated cachegrind diff files, stored in the `target/perf_outputs/diff` directory, which detail how many instructions a particular file or function accounts for and the contribution of individual lines of code to the overall instruction count difference. For example, these are the first few lines of the output generated for `test_set_config` from the above failure:

```
--------------------------------------------------------------------------------
-- Summary
--------------------------------------------------------------------------------
Ir______________

245,746 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
-- File:function summary
--------------------------------------------------------------------------------
Ir______________________  file:function

< 243,774 (99.2%, 99.2%)  /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/aws-lc-sys-0.19.0/aws-lc/crypto/fipsmodule/../internal.h:
    62,034 (25.2%)          constant_time_select_w
    47,264 (19.2%)          value_barrier_w
    47,264 (19.2%)          constant_time_select_int
    25,279 (10.3%)          constant_time_lt_w
    20,692  (8.4%)          constant_time_msb_w
    20,566  (8.4%)          constant_time_is_zero_w
    16,346  (6.7%)          constant_time_eq_w
     2,720  (1.1%)          CRYPTO_addc_u64
       608  (0.2%)          OPENSSL_memcpy
      -504 (-0.2%)          CRYPTO_subc_u64
       480  (0.2%)          CRYPTO_bswap4
       424  (0.2%)          OPENSSL_memset
       315  (0.1%)          CRYPTO_load_u32_be
       270  (0.1%)          CRYPTO_store_u32_be
```

### Understanding the Annotated Output
155 changes: 121 additions & 34 deletions tests/regression/src/lib.rs
@@ -40,9 +40,9 @@ mod tests {
        process::Command,
    };

    /// Configurable threshold for regression testing.
    /// Tests will fail if the instruction count difference is greater than the value of this constant.
    const MAX_DIFF: i64 = 1_000;

    struct InstrumentationControl;

@@ -55,20 +55,37 @@ mod tests {
            cg::cachegrind::start_instrumentation();
        }
    }

    /// Checks the environment variable that determines whether to run under Valgrind or solely test functionality.
    fn is_running_under_valgrind() -> bool {
        env::var("ENABLE_VALGRIND").is_ok()
    }

    /// Gets the test suffix from environment variables. If the var is not set, it defaults to `curr`.
    fn get_test_suffix() -> String {
        env::var("TEST_SUFFIX").unwrap_or_else(|_| "curr".to_string())
    }

    /// Determines whether diff mode is enabled.
    fn is_diff_mode() -> bool {
        env::var("DIFF_MODE").is_ok()
    }

    fn valgrind_test<F>(test_name: &str, test_body: F) -> Result<(), s2n_tls::error::Error>
    where
        F: FnOnce(&InstrumentationControl) -> Result<(), s2n_tls::error::Error>,
    {
        let suffix = get_test_suffix();
        if !is_running_under_valgrind() {
            if is_diff_mode() {
                run_diff_test(test_name);
                Ok(())
            } else {
                let ctrl = InstrumentationControl;
                test_body(&ctrl)
            }
        } else {
            run_valgrind_test(test_name, &suffix);
            Ok(())
        }
    }
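
    // For reference, a new harness simply wraps its measured body in `valgrind_test`.
    // The sketch below is hypothetical (the harness name and body are illustrative,
    // and it assumes InstrumentationControl exposes a stop_instrumentation method
    // mirroring the start_instrumentation call shown above):
    //
    // #[test]
    // fn test_example_harness() {
    //     valgrind_test("test_example_harness", |ctrl| {
    //         ctrl.stop_instrumentation(); // keep setup work out of the count
    //         let input = vec![0u8; 1024]; // illustrative setup
    //         ctrl.start_instrumentation(); // measure only the path under test
    //         let _sum: u64 = input.iter().map(|&b| b as u64).sum();
    //         Ok(())
    //     })
    //     .unwrap();
    // }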
@@ -87,7 +104,7 @@ mod tests {
        .unwrap();
    }

    /// Test which creates a TestPair from config using `rsa_4096_sha512`. Only measures a pair handshake.
    #[test]
    fn test_rsa_handshake() {
        valgrind_test("test_rsa_handshake", |ctrl| {
@@ -105,52 +122,123 @@ mod tests {
        })
        .unwrap();
    }

    /// Runs the specified test using Valgrind.
    fn run_valgrind_test(test_name: &str, suffix: &str) {
        let exe_path = std::env::args().next().unwrap();
        let output_file = create_output_file_path(test_name, suffix);
        let command = build_valgrind_command(&exe_path, test_name, &output_file);

        println!("Running command: {:?}", command);
        execute_command(command);

        let annotate_output = run_cg_annotate(&output_file);
        save_annotate_output(&annotate_output, suffix, test_name);

        let count = find_instruction_count(&annotate_output)
            .expect("Failed to get instruction count from file");

        println!("Instruction count for {}: {}", test_name, count);
    }

    /// Creates the path for the unannotated output file.
    fn create_output_file_path(test_name: &str, suffix: &str) -> String {
        create_dir_all(Path::new("target/cg_artifacts")).unwrap();
        format!("target/cg_artifacts/cachegrind_{}_{}.out", test_name, suffix)
    }

    /// Builds the Valgrind command.
    fn build_valgrind_command(exe_path: &str, test_name: &str, output_file: &str) -> Command {
        let output_command = format!("--cachegrind-out-file={}", output_file);
        let mut command = Command::new("valgrind");
        command
            .args(["--tool=cachegrind", &output_command, exe_path, test_name])
            // Ensures that the recursive call runs the actual harness code block rather than re-entering this wrapper
            .env_remove("ENABLE_VALGRIND");
        command
    }

    /// Executes the given command, panicking if Valgrind fails.
    fn execute_command(mut command: Command) {
        let status = command.status().expect("Failed to execute valgrind");
        if !status.success() {
            panic!("Valgrind failed");
        }
    }

    /// Runs the cg_annotate command on the output file.
    fn run_cg_annotate(output_file: &str) -> std::process::Output {
        let annotate_output = Command::new("cg_annotate")
            .arg(output_file)
            .output()
            .expect("Failed to run cg_annotate");
        if !annotate_output.status.success() {
            panic!("cg_annotate failed");
        }
        annotate_output
    }

    /// Saves the annotated output to the prev, curr, or diff directory accordingly.
    fn save_annotate_output(output: &std::process::Output, suffix: &str, test_name: &str) {
        let directory = format!("target/perf_outputs/{}", suffix);
        create_dir_all(Path::new(&directory)).unwrap();
        let annotate_file = format!(
            "target/perf_outputs/{}/{}_{}.annotated.txt",
            suffix, test_name, suffix
        );
        let mut file = File::create(&annotate_file).expect("Failed to create annotation file");
        file.write_all(&output.stdout)
            .expect("Failed to write annotation file");
    }

    /// Runs the diff test on stored Valgrind outputs; only called when diff mode is set.
    fn run_diff_test(test_name: &str) {
        let (prev_file, curr_file) = get_diff_files(test_name);
        ensure_diff_files_exist(&prev_file, &curr_file);

        let diff_output = run_cg_annotate_diff(&prev_file, &curr_file);
        save_diff_output(&diff_output, test_name);

        let diff = find_instruction_count(&diff_output)
            .expect("Failed to parse cg_annotate --diff output");

        assert_diff_within_threshold(diff, test_name);
    }

    /// Retrieves the file paths for the diff test.
    fn get_diff_files(test_name: &str) -> (String, String) {
        (
            format!("target/cg_artifacts/cachegrind_{}_prev.out", test_name),
            format!("target/cg_artifacts/cachegrind_{}_curr.out", test_name),
        )
    }

    /// Ensures that the required performance files exist before the diff functionality is used.
    fn ensure_diff_files_exist(prev_file: &str, curr_file: &str) {
        if !Path::new(prev_file).exists() || !Path::new(curr_file).exists() {
            panic!(
                "Required cachegrind files not found: {} or {}",
                prev_file, curr_file
            );
        }
    }

    /// Runs the cg_annotate --diff command on already generated performance files to compare them.
    fn run_cg_annotate_diff(prev_file: &str, curr_file: &str) -> std::process::Output {
        let diff_output = Command::new("cg_annotate")
            .args(["--diff", prev_file, curr_file])
            .output()
            .expect("Failed to run cg_annotate --diff");
        if !diff_output.status.success() {
            panic!("cg_annotate --diff failed");
        }
        diff_output
    }

    /// Saves the output of the cg_annotate diff command to a file.
    fn save_diff_output(output: &std::process::Output, test_name: &str) {
        create_dir_all(Path::new("target/perf_outputs/diff")).unwrap();
        let diff_file = format!("target/perf_outputs/diff/{}_diff.annotated.txt", test_name);
        let mut file = File::create(&diff_file).expect("Failed to create diff annotation file");
        file.write_all(&output.stdout)
            .expect("Failed to write diff annotation file");
    }

    /// Asserts that the instruction count difference is within the threshold.
    fn assert_diff_within_threshold(diff: i64, test_name: &str) {
        assert!(
            diff <= MAX_DIFF,
            "Instruction count difference in {} exceeds the threshold, regression of {} instructions",
            test_name,
            diff,
        );
    }

    /// Parses the annotated output for the overall instruction count total.
    fn find_instruction_count(output: &std::process::Output) -> Result<i64, io::Error> {
        let reader = io::BufReader::new(&output.stdout[..]);
        // Example of the line being parsed:
        // "79,278,369 (100.0%)  PROGRAM TOTALS"
        for line in reader.lines() {
            let line = line?;
            if line.contains("PROGRAM TOTALS") {
                if let Some(instructions) = line.split_whitespace().next() {
                    return instructions
                        .replace(',', "")
                        .parse::<i64>()
                        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e));
                }
            }
        }

        panic!("Failed to find instruction count in annotated file");
    }
}
