Modify regression threshold to configurable percentage (#4698)
kaukabrizvi authored Aug 14, 2024
1 parent 7d73fd6 commit fc9ad97
Showing 2 changed files with 36 additions and 18 deletions.
10 changes: 5 additions & 5 deletions tests/regression/README.md
@@ -5,7 +5,7 @@ This folder contains regression tests and benchmarking tools for the `s2n-tls` l

## Testing Philosophy

Currently, s2n-tls implements a wall clock benchmarking tool which measures end-to-end handshake performance to compare s2n-tls with rustls and OpenSSL. In the past, s2n-tls has tried benchmarking to detect regressions through criterion in Rust, but the subprocess and spin-up time contributed to performance measurement which made the results inaccurate and difficult to use in CI. The project has a slightly different focus, learning from these existing tools. Performance assertion in s2n-tls focuses on a benchmarking tool that can detail performance by API path and do so with enough repeatability and accuracy to detect regressions between two versions of s2n-tls so that performance analysis can occur at PR time. This means that the scope of each harness is limited and mutually exclusive of other harnesses since we are intersted in measuring the performance of the important paths a TLS connection typically follows.
Currently, s2n-tls implements a wall clock benchmarking tool which measures end-to-end handshake performance to compare s2n-tls with rustls and OpenSSL. In the past, s2n-tls has tried benchmarking to detect regressions through criterion in Rust, but the subprocess and spin-up time contributed to performance measurement which made the results inaccurate and difficult to use in CI. The project has a slightly different focus, learning from these existing tools. Performance assertion in s2n-tls focuses on a benchmarking tool that can detail performance by API path and do so with enough repeatability and accuracy to detect regressions between two versions of s2n-tls so that performance analysis can occur at PR time. This means that the scope of each harness is limited and mutually exclusive of other harnesses since we are interested in measuring the performance of the important paths a TLS connection typically follows.

### Why CPU instructions
The performance benchmarking framework uses CPU instruction counts across API paths to make the regression assertion. This metric reduces noise, so even small performance differences are caught. While a difference in instruction count may not translate directly into a runtime difference, it is useful when comparing a PR to mainline and for digging into the specific sources of performance impact within the code.
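The relative threshold this commit introduces can be sketched in isolation. The function below is illustrative only (not the repository's exact code), and the instruction counts are made-up values:

```rust
// Sketch of a percentage-based regression check: the difference between the
// current and previous instruction counts, relative to the previous count,
// must stay at or below a configurable threshold (e.g. 0.01 == 1%).
fn exceeds_threshold(prev: i64, curr: i64, max_diff: f64) -> bool {
    let diff = curr - prev;
    let diff_percentage = diff as f64 / prev as f64;
    diff_percentage > max_diff
}

fn main() {
    // A 0.5% regression is within a 1% threshold...
    assert!(!exceeds_threshold(1_000_000, 1_005_000, 0.01));
    // ...while a 2% regression exceeds it.
    assert!(exceeds_threshold(1_000_000, 1_020_000, 0.01));
    println!("ok");
}
```

A relative threshold scales with the size of each harness, so a single fixed instruction budget no longer has to fit both small and large workloads.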
@@ -33,7 +33,7 @@ To run the harnesses with Valgrind and store the annotated results, run:
PERF_MODE=valgrind cargo test
```

This will recursively call all tests with valgrind enabled so the performance output is generated and stored in target/perf_outputs. If you are looking for the scalar performance output of a PR, this will provide inisght into which portions of the code account for what share of the CPU instruction count.
This will recursively call all tests with valgrind enabled so the performance output is generated and stored in target/perf_outputs. If you are looking for the scalar performance output of a PR, this will provide insight into which portions of the code account for what share of the CPU instruction count.

## Running the Harnesses between versions (differential performance)
Run the scalar performance for all harnesses on the current branch version of s2n-tls
@@ -59,16 +59,16 @@ This will assert on the performance difference of the current version minus the
cargo test
```

This will run the tests without valgrind to test if the hanresses complete as expected
This will run the tests without valgrind to test if the harnesses complete as expected

## Output Files
- `target/$commit_id/test_name.raw`: Contains the raw cachegrind profile. On its own, the file is pretty much unreadable but is useful for the cg_annotate --diff functionality or to visualize the profile via tools like [KCachegrind](https://kcachegrind.github.io/html/Home.html).
- `target/$commit_id/test_name.annotated`: The scalar annotated profile associated with that particular commit id. This file contains detailed infromation on the contribution of functions, files, and lines of code to the overall scalar performance count.
- `target/$commit_id/test_name.annotated`: The scalar annotated profile associated with that particular commit id. This file contains detailed information on the contribution of functions, files, and lines of code to the overall scalar performance count.
- `target/diff/test_name.diff`: The annotated performance difference between two commits. This file contains the overall performance difference, details how many instructions a particular file/function accounts for, and breaks down the contribution of individual lines of code to the overall instruction count difference.
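As a rough illustration, the annotated files can be reduced to a single scalar by locating the totals line. This sketch assumes a `cg_annotate`-style line such as `79,278,369 (100.0%)  PROGRAM TOTALS`; the repository's actual `find_instruction_count` may differ:

```rust
// Illustrative parser: find the "PROGRAM TOTALS" line in annotated output,
// strip the thousands separators from its first field, and parse it as i64.
fn find_instruction_count(output: &str) -> Option<i64> {
    output
        .lines()
        .find(|line| line.contains("PROGRAM TOTALS"))
        .and_then(|line| line.split_whitespace().next())
        .and_then(|count| count.replace(',', "").parse::<i64>().ok())
}

fn main() {
    let sample = "79,278,369 (100.0%)  PROGRAM TOTALS";
    assert_eq!(find_instruction_count(sample), Some(79_278_369));
    assert_eq!(find_instruction_count("no totals here"), None);
    println!("ok");
}
```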

## Sample Output for Valgrind test (differential)

Running the differential test will run all harnesses and fail if any number of harnesses exceed the performance threshold. For example, a regression test faliure could look like:
Running the differential test will run all harnesses and fail if any number of harnesses exceed the performance threshold. For example, a regression test failure could look like:
```
running 2 tests
test tests::test_set_config ... FAILED
44 changes: 31 additions & 13 deletions tests/regression/src/lib.rs
@@ -55,8 +55,6 @@ mod tests {
process::{Command, Output},
};

const MAX_DIFF: i64 = 1_000;

struct InstrumentationControl;

impl InstrumentationControl {
@@ -86,7 +84,11 @@ }
}
}

fn valgrind_test<F>(test_name: &str, test_body: F) -> Result<(), s2n_tls::error::Error>
fn valgrind_test<F>(
test_name: &str,
max_diff: f64,
test_body: F,
) -> Result<(), s2n_tls::error::Error>
where
F: FnOnce(&InstrumentationControl) -> Result<(), s2n_tls::error::Error>,
{
@@ -97,7 +99,7 @@ }
}
RegressionTestMode::Diff => {
let (prev_profile, curr_profile) = RawProfile::query(test_name);
DiffProfile::new(&prev_profile, &curr_profile).assert_performance();
DiffProfile::new(&prev_profile, &curr_profile).assert_performance(max_diff);
}
RegressionTestMode::Default => {
let ctrl = InstrumentationControl;
@@ -144,6 +146,11 @@ mod tests {
format!("target/{}/{}.raw", self.commit_hash, self.test_name)
}

/// Returns the annotated profile associated with a raw profile
fn associated_annotated_profile(&self) -> AnnotatedProfile {
AnnotatedProfile::new(self)
}

/// Return the raw profiles for `test_name` in "git" order. `tuple.0` is older than `tuple.1`
///
/// This method will panic if there are not two profiles.
@@ -206,15 +213,23 @@ mod tests {
fn path(&self) -> String {
format!("target/{}/{}.annotated", self.commit_hash, self.test_name)
}

fn instruction_count(&self) -> i64 {
let output = &std::fs::read_to_string(self.path()).unwrap();
find_instruction_count(output).unwrap()
}
}

struct DiffProfile {
test_name: String,
prev_profile_count: i64,
}
impl DiffProfile {
fn new(prev_profile: &RawProfile, curr_profile: &RawProfile) -> Self {
let diff_profile = Self {
test_name: curr_profile.test_name.clone(),
// reads the annotated profile for previous instruction count
prev_profile_count: prev_profile.associated_annotated_profile().instruction_count()
};

// diff the raw profile
@@ -225,6 +240,7 @@ mod tests {
assert_command_success(diff_output.clone());

// write the diff to disk
create_dir_all("target/diff").unwrap();
let diff_content = String::from_utf8(diff_output.stdout)
.expect("Invalid UTF-8 in cg_annotate --diff output");
write(diff_profile.path(), diff_content).expect("Failed to write to file");
@@ -236,21 +252,23 @@ mod tests {
format!("target/diff/{}.diff", self.test_name)
}

fn assert_performance(&self) {
fn assert_performance(&self, max_diff: f64) {
let diff_content = std::fs::read_to_string(self.path()).unwrap();

let diff = find_instruction_count(&diff_content)
.expect("Failed to parse cg_annotate --diff output");
// percentage difference is the overall difference divided by the previous instruction count
let diff_percentage = diff as f64 / self.prev_profile_count as f64;
assert!(
diff <= MAX_DIFF,
"Instruction count difference exceeds the threshold, regression of {} instructions.
Check the annotated output logs in target/diff/{}.diff for debug information",
diff, self.test_name
diff_percentage <= max_diff,
"Instruction count difference exceeds the threshold, regression of {diff_percentage}% ({diff} instructions).
Check the annotated output logs in target/diff/{}.diff for debug information", self.test_name
);
}

}

// Pulls the instruction count as an integer from the annotated output file.
/// Pulls the instruction count as an integer from the annotated output file.
/// Accepts output from Annotated and Diff profile formats.
fn find_instruction_count(output: &str) -> Result<i64, io::Error> {
let reader = io::BufReader::new(output.as_bytes());
// Example of the line being parsed:
Expand Down Expand Up @@ -303,7 +321,7 @@ mod tests {
/// Test to create new config, set security policy, host_callback information, load/trust certs, and build config.
#[test]
fn test_set_config() {
valgrind_test("test_set_config", |ctrl| {
valgrind_test("test_set_config", 0.01, |ctrl| {
ctrl.stop_instrumentation();
ctrl.start_instrumentation();
let keypair_rsa = CertKeyPair::default();
@@ -317,7 +335,7 @@ mod tests {
/// Test which creates a TestPair from config using `rsa_4096_sha512`. Only measures a pair handshake.
#[test]
fn test_rsa_handshake() {
valgrind_test("test_rsa_handshake", |ctrl| {
valgrind_test("test_rsa_handshake", 0.01, |ctrl| {
ctrl.stop_instrumentation();
let keypair_rsa = CertKeyPair::default();
let config = set_config(&security::DEFAULT_TLS13, keypair_rsa)?;
