Commit c405770

adds upstream testing development document

Signed-off-by: James Kunstle <[email protected]>
1 parent e1ef2ee

docs/upstream-testing-plan.md

1 file changed: 102 additions, 0 deletions
# Testing in 'instructlab/instructlab'

We would like this library to be a lightweight, efficient, and hackable toolkit for training "small" LLMs.
Currently, this library is used heavily by `github.com/instructlab/instructlab` and the `ilab` tool, and is tested via that project's end-to-end tests.
To improve development velocity, project trustworthiness, and library utility beyond `ilab`, we need to bolster our project testing.

## Testing a training library
Training LLMs requires a lot of time and compute. We simply cannot "test" our library by executing an entire instruct-tuning run for each feature configuration we would like to verify.
This presents a challenge: we have to test our code as well as possible without frequently generating a complete final artifact (a "trained" model checkpoint).

To address this challenge, we propose the following three-tier testing methodology:

1. Unit tests: verification that subsystems work correctly
2. Smoke tests: verification that feature-bounded systems work correctly
3. Benchmark tests: verification that outputs behave correctly

### Responsibilities of: Unit Tests

This is one of the most common patterns in software testing and requires little introduction.
For our purposes, Unit Tests are tests that do not require accelerated hardware (e.g. GPUs).
The objective of a Unit Test is to check that an isolatable sub-system functions correctly.

This might include data management utilities (distributed data sampler, batch collator, etc.), model setup tools, loop instrumentation tools, etc.
Unit Tests should give very fast confirmation that the library is behaving correctly, especially when bumping dependency versions.

Unit Tests should be able to run on any development machine and should always run for incoming upstream PRs.
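
As a minimal sketch of what this looks like in practice (the `padded_collate` helper below is hypothetical and exists only for illustration; the library's real collator may have a different name and signature), a pytest-style Unit Test for a batch collator might be:

```python
# Unit-test sketch for a batch collator: pure CPU, no accelerated hardware needed.
# `padded_collate` is a hypothetical stand-in for this library's real collator.
import torch


def padded_collate(samples: list[dict], pad_token_id: int) -> dict:
    """Pad variable-length token sequences into a single rectangular batch."""
    max_len = max(len(s["input_ids"]) for s in samples)
    input_ids = torch.full((len(samples), max_len), pad_token_id, dtype=torch.long)
    for i, s in enumerate(samples):
        input_ids[i, : len(s["input_ids"])] = torch.tensor(s["input_ids"])
    return {"input_ids": input_ids}


def test_padded_collate_pads_to_longest_sequence():
    samples = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
    batch = padded_collate(samples, pad_token_id=0)

    # The batch is (num_samples, longest_sequence) and shorter rows are padded.
    assert batch["input_ids"].shape == (2, 3)
    assert batch["input_ids"][1].tolist() == [4, 5, 0]
```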

### Responsibilities of: Smoke Tests

The term "smoke test" comes from plumbing. Plumbers would pipe smoke, instead of water, through a completed pipe system to check that there were no leaks.
In computer software, smoke testing provides fast verification to decide whether further testing is necessary: if something immediately breaks, we quickly know that there's a problem and shouldn't spend resources on more in-depth testing.

This is effectively what we'd like to do with this class of tests: verify that everything _seems_ to be in order before spending the resources to actually train a model.

The challenge, however, is that "basic" testing in this context probably still requires accelerated hardware. We can't check that checkpointing a LoRA model while using CPU Offloading doesn't break without spinning up that exact testing scenario.

"Smoke tests" in this repo, therefore, will be defined as tests that demonstrate the functionality, though not the correctness, of features that we want to ship. This is very helpful to us because _most important failures don't happen during training itself: they happen during setup, checkpointing, and tear-down_.

Pictorially, we can represent smoke testing as abbreviating a training run ('0' are setup/tear-down/checkpointing steps, '.' are training steps):

00000.................................................00000.....................................................00000

to the following:

00000.00000.00000

Smoke tests would block PRs from being merged and would run automatically once all unit and lint tests pass.
This puts the onus on the Unit Tests (and the test writer) to verify code correctness as much as they possibly can without using hardware. If something requires many invocations of smoke tests to debug, it probably wasn't sufficiently debugged during development, or is insufficiently unit tested.

Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes; we should aim for them to run in <30 minutes.
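
As an illustration of the intent (a sketch only: the gating uses pytest, and the tiny workload stands in for the library's actual training entry point, which isn't spelled out here), a smoke test could skip itself on machines without accelerated hardware and stay well inside that time budget:

```python
# Smoke-test sketch: prove that setup, a short run, and checkpointing don't
# immediately break on real hardware. Correctness of outputs is out of scope.
# The save/load round-trip below is a stand-in for calling the library's
# actual training entry point with a handful of steps.
import pytest
import torch

requires_gpu = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="smoke tests need accelerated hardware"
)


@requires_gpu
def test_checkpoint_round_trip_smoke(tmp_path):
    # "Setup": put a small model on the accelerator.
    model = torch.nn.Linear(8, 8).to("cuda")

    # "Checkpointing": write the state dict and read it back on CPU.
    ckpt = tmp_path / "checkpoint.pt"
    torch.save(model.state_dict(), ckpt)

    restored = torch.nn.Linear(8, 8)
    restored.load_state_dict(torch.load(ckpt, map_location="cpu"))
    assert ckpt.exists()
```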

#### Smoke test coverage

This library attempts to provide training support for multiple accelerator variants, numbers of accelerators, two distributed backends, and many features. Therefore, the coverage matrix is roughly described by the following pseudocode:

```python
# Pseudocode for the smoke-test coverage matrix; all_permutations,
# check_opts_compatible, and run_accelerated_tests are placeholders for
# test-harness helpers.
for variant in ["nvidia", "amd", "intel", "cpu", "mps"]:  # parallel at runner level
    for framework in ["fsdp", "deepspeed", None]:  # None is for CPU and MPS
        for n_card in [1, 2, 4, 8, None]:  # 1, 2, 4 can run in parallel, then 8. None is for CPU and MPS
            for feat_perm in all_permutations(["lora", "cpu_offload", None, ...]):

                opts = [variant, n_card, framework, *feat_perm]

                if check_opts_compatible(*opts):
                    run_accelerated_tests(*opts)
```
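
As a sketch of what the compatibility filter might look like (the specific rules below are illustrative assumptions, not the project's actual support matrix):

```python
# Illustrative sketch of check_opts_compatible. The rules are assumptions made
# for this example; the real support matrix would live in the test harness.
def check_opts_compatible(variant, n_card, framework, *features) -> bool:
    # CPU and MPS runs use no distributed framework and no card count.
    if variant in ("cpu", "mps"):
        return framework is None and n_card is None
    # Accelerated runs need a distributed framework and at least one card.
    if framework is None or n_card is None:
        return False
    # Example feature rule (assumed): DeepSpeed is only exercised on NVIDIA and AMD.
    if framework == "deepspeed" and variant not in ("nvidia", "amd"):
        return False
    return True
```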

When we have to add a new accelerator variant or feature, we can plug it into its respective tier in the smoke-testing hierarchy.

### Responsibilities of: Benchmark Tests

Intermittently, we'll want to verify that our "presumably correct" training library _actually produces_ a high-performance model.
Predicting model performance by statically analyzing the logits or the parameters isn't viable, so the model will have to be benchmarked.

Benchmark testing might look like the following:

1. Tune a few models on a well-understood dataset using one or two permutations of features, unique to the benchmark run.
2. Benchmark each model on a battery of evaluation metrics that give us the strongest signal about model performance.

Implementing this testing scheme can be another project for a later time.

## Concerns and Notes

### Running smoke tests is expensive and shouldn't be done for every contribution

Repository maintainers may want to implement a manual smoke-testing policy based on what a PR actually changes, rather than have the tests run automatically. If an incoming PR changes the library interface but doesn't directly affect the training code, a unit test is probably the most appropriate way to check whether behavior still squares.
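
One lightweight way to encode such a policy (a sketch; the path prefixes are assumptions for illustration, not the repository's actual layout):

```python
# Sketch of a path-based policy for requesting smoke tests on a PR.
# The prefixes below are assumed for illustration.
TRAINING_CODE_PREFIXES = ("src/instructlab/training/", "requirements")


def needs_smoke_tests(changed_files: list[str]) -> bool:
    """Request smoke tests only when training code or dependencies change."""
    return any(f.startswith(TRAINING_CODE_PREFIXES) for f in changed_files)


# Example: a docs-only change would not trigger smoke tests.
assert needs_smoke_tests(["docs/upstream-testing-plan.md"]) is False
```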

### How often should we be running benchmark tests?

There are many options here, depending on the observed risk that new features or dependency bumps bring to the project. Most likely, a comprehensive benchmark test should be run once as a baseline, and then repeated only when new model architectures are supported, major versions of ilab SDG are released, etc.
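
If we do keep a stored baseline, the comparison step might look roughly like the sketch below (the metric names, file layout, and tolerance are assumptions for illustration):

```python
# Sketch of comparing fresh benchmark scores against a stored baseline.
# Metric names, the baseline file, and the tolerance are illustrative assumptions.
import json
from pathlib import Path

TOLERANCE = 0.02  # allowed drop below baseline before we call it a regression


def find_regressions(scores: dict[str, float], baseline_path: Path) -> list[str]:
    """Return the names of metrics that fell more than TOLERANCE below baseline."""
    baseline = json.loads(baseline_path.read_text())
    return [
        name
        for name, base_score in baseline.items()
        if scores.get(name, 0.0) < base_score - TOLERANCE
    ]


# Example usage with made-up numbers:
# regressions = find_regressions(
#     {"mmlu": 0.41, "mt_bench": 6.9}, Path("benchmarks/baseline.json")
# )
# assert not regressions, f"benchmark regressions: {regressions}"
```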

## Conclusion

Testing ML models has never been easy, and it's unlikely that the final implementation of these ideas will closely mirror this document. However, we hope that the _spirit_ of this document (fast feedback, efficient use of resources, prioritizing unit tests) will carry through.
