adds upstream testing development document #404
# Testing in 'instructlab/training'

We would like this library to be a lightweight, efficient, and hackable toolkit for training LLMs with SFT and RLHF.

Currently, this library is used heavily by the `ilab` tool from `github.com/instructlab/instructlab`, and it is tested via that project's end-to-end tests.
To improve development velocity, project trustworthiness, and the library's utility apart from `ilab`, we need to improve our in-repo testing.

## Testing a training library

Training LLMs requires a lot of time and compute. We simply cannot "test" our library by executing an entire instruct-tuning run for each feature configuration we would like to verify.

This presents a challenge: we have to test our code as well as possible without frequently generating a complete final artifact (a "trained" model checkpoint).

To compensate for this challenge, we propose a testing methodology consisting of linting plus four tiers of tests:

0. Linting: dev-time verification that code meets style and static analysis requirements
1. Unit tests: verification that indivisible units work correctly
2. Subsystem tests: verification that collections of units work correctly together
3. Smoke tests: verification that feature-bounded systems work correctly
4. Benchmark tests: verification that trained-model outputs perform as expected

### Responsibilities of: Unit Tests

This is one of the most common patterns in software testing and requires little introduction.

> **Review comment:** I'd include automated coverage targets / reports here. I know coverage targets are a bit controversial (Goodhart's law), but assuming we understand the risks, coverage reports and enforcement should help us not take our eye off the ball.
>
> **Reply:** That's a good idea. We can make an issue to investigate this in the future.

For our purposes, Unit Tests will be tests that do not require accelerated hardware (e.g. GPUs).

These tests will verify that the smallest units of meaningful code work correctly.
They should be able to run on any development machine and should always run for incoming upstream PRs.

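As a minimal sketch of what such a test could look like, the example below checks a CPU-only padding collator. `pad_collate` is a hypothetical stand-in for whichever small data utilities the library actually exposes; it is not an existing function in this repo and is defined inline only to keep the sketch self-contained.

```python
# CPU-only unit test. In a real suite, `pad_collate` would be imported from
# the library instead of being defined alongside its test.
import torch

def pad_collate(sequences, pad_token_id=0):
    """Pad variable-length token lists into a single (batch, max_len) tensor."""
    max_len = max(len(seq) for seq in sequences)
    return torch.tensor(
        [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    )

def test_pad_collate_pads_to_longest_sequence():
    batch = pad_collate([[1, 2, 3], [4]])
    assert batch.shape == (2, 3)
    assert batch[1].tolist() == [4, 0, 0]
```
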
### Responsibilities of: Subsystem Tests

Subsystem Tests verify that collections of units work correctly together.
This might include data management utilities (distributed data sampler, batch collator, etc.), model setup tools, loop instrumentation tools, etc.
Subsystem Tests should give very fast confirmation that the library is behaving correctly, especially when bumping dependency versions.

Subsystem Tests should be able to run on any development machine and should always run for incoming upstream PRs.

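A hedged sketch of a CPU-only subsystem test, composing a `DataLoader` with a padding collator. `ToyTokenDataset` and `pad_collate` are illustrative stand-ins for the library's real data-management pieces, not existing code in this repo.

```python
# CPU-only subsystem test: checks that the dataset, DataLoader batching, and
# the padding collator compose correctly.
import torch
from torch.utils.data import DataLoader, Dataset

class ToyTokenDataset(Dataset):
    """Tiny in-memory dataset of variable-length token lists."""
    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]

def pad_collate(batch, pad_token_id=0):
    # Same hypothetical helper as in the unit-test sketch above.
    max_len = max(len(seq) for seq in batch)
    return torch.tensor([seq + [pad_token_id] * (max_len - len(seq)) for seq in batch])

def test_dataloader_and_collator_compose():
    loader = DataLoader(
        ToyTokenDataset([[1, 2, 3], [4, 5], [6]]),
        batch_size=2,
        shuffle=False,
        collate_fn=pad_collate,
    )
    first, second = list(loader)
    assert first.shape == (2, 3)   # two rows, padded to the longest sequence
    assert second.shape == (1, 1)  # the final, smaller batch
```
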
### Responsibilities of: Smoke Tests

The term "smoke test" originates in plumbing: plumbers would pipe smoke, instead of water, through a completed pipe system to check that there were no leaks.

> **Review comment:** today I learned

In computer software, smoke testing provides fast verification to decide whether further testing is necessary: if something immediately breaks, we quickly know there's a problem and shouldn't spend resources on more in-depth testing.

This is effectively what we'd like to do with this class of tests: verify that everything _seems_ to be in order before actually building a model to verify it.

The challenge, however, is that "basic" testing in this context probably still requires accelerated hardware. We can't check that checkpointing a LoRA model while using CPU offloading doesn't break without spinning up that exact testing scenario.

"Smoke tests" in this repo, therefore, will be defined as tests that demonstrate that the features we want to ship function, but don't prove final model correctness. This is very helpful to us because _most important failures don't happen during training itself; they happen during setup, checkpointing, and tear-down_.

Pictorially, we can represent smoke testing as abbreviating a training run ('0' are setup/tear-down/checkpointing steps, '.' are training steps):

    00000.................................................00000.....................................................00000

to the following:

    00000.00000.00000

Smoke tests would block PRs from being merged and would run automatically once all unit and lint tests pass.
This puts the onus on the Unit Tests (and the test writer) to verify code correctness as much as they possibly can without using hardware. If something requires many invocations of smoke tests to debug, it probably wasn't sufficiently debugged during development, or is insufficiently unit tested.

Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes; we should aim for them to run in <30 minutes.

> **Review comment:** I'd like to keep the upper bound to 30 min. That's the hard cutoff; faster is always better.

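A rough sketch of the shape a smoke test could take, assuming a pytest `smoke` marker and plain PyTorch in place of the library's real training entry points (both assumptions, not existing code in this repo). It skips on machines without accelerators and spends its time on setup, checkpointing, and resume rather than on training steps.

```python
# Hedged sketch of a smoke test: run only a handful of optimizer steps, but
# still exercise the expensive edges (setup, checkpointing, resume). The tiny
# torch model stands in for a real training job; it is not this library's API.
import pytest
import torch

pytestmark = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="smoke tests need accelerated hardware"
)

@pytest.mark.smoke
def test_short_run_checkpoints_and_resumes(tmp_path):
    model = torch.nn.Linear(8, 8).cuda()
    optim = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # "00000.00000": a few real steps, then a checkpoint, then a few more.
    for _ in range(5):
        loss = model(torch.randn(4, 8, device="cuda")).sum()
        loss.backward()
        optim.step()
        optim.zero_grad()

    ckpt = tmp_path / "step_5.pt"
    torch.save({"model": model.state_dict(), "optim": optim.state_dict()}, ckpt)

    # Resume from the checkpoint and make sure training can continue.
    state = torch.load(ckpt)
    model.load_state_dict(state["model"])
    optim.load_state_dict(state["optim"])
    for _ in range(5):
        loss = model(torch.randn(4, 8, device="cuda")).sum()
        loss.backward()
        optim.step()
        optim.zero_grad()
```
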
#### Smoke tests coverage

This library attempts to provide training support for multiple accelerator variants, multiple accelerator counts, two distributed backends, and many features. Therefore, the completeness matrix is roughly described by the following pseudocode:

```python
# Pseudocode: all_permutations, check_opts_compatible, and
# run_accelerated_tests are placeholders for the real test machinery.
for variant in ["nvidia", "amd", "intel"]:  # parallelized at the runner level
    for framework in ["fsdp", "deepspeed"]:
        for n_card in [1, 4]:
            # every subset of optional features, including "no features"
            for feat_perm in all_permutations(["lora", "cpu_offload", ...]):
                opts = [variant, framework, n_card, *feat_perm]
                if check_opts_compatible(*opts):
                    run_accelerated_tests(*opts)
```

> **Review comment:** I'd recommend explicitly stating that tests are written in Python using the pytest framework. I'd rather not perpetuate the "functional tests" bash shell pattern found elsewhere in this repo, if possible.

When we have to add a new accelerator variant or feature, we can plug it into its respective tier in the smoke-testing hierarchy.

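As a sketch of how that matrix might translate into pytest parametrization: the `smoke` marker and the `check_opts_compatible` placeholder below are assumptions carried over from the pseudocode, not existing code in this repo.

```python
# Sketch: driving the coverage matrix with pytest parametrization.
import itertools
import pytest

FEATURES = ["lora", "cpu_offload"]
# Every subset of optional features, including the empty set.
FEATURE_SUBSETS = [
    combo
    for r in range(len(FEATURES) + 1)
    for combo in itertools.combinations(FEATURES, r)
]

def check_opts_compatible(framework, n_card, *features):
    return True  # placeholder for the real compatibility rules

@pytest.mark.smoke
@pytest.mark.parametrize("framework", ["fsdp", "deepspeed"])
@pytest.mark.parametrize("n_card", [1, 4])
@pytest.mark.parametrize("features", FEATURE_SUBSETS)
def test_accelerated_matrix(framework, n_card, features):
    # The accelerator variant (nvidia/amd/intel) is selected by the CI runner,
    # so it is not parametrized here.
    if not check_opts_compatible(framework, n_card, *features):
        pytest.skip("incompatible option combination")
    # ...launch the abbreviated training run for this configuration...
```
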
### Responsibilities of: Benchmark Tests

> **Review comment:** Benchmark tests are going to be far more computationally intensive than testing a training job, and are also unlikely to provide meaningful signals that a given model is learning correctly. We should not rely on these for testing training.
>
> **Reply:** Yeah, likely true. I can remove this / deprioritize it.

Intermittently, we'll want to verify that our "presumably correct" training library _actually produces_ a high-performance model.
Predicting model performance by statically analyzing the logits or the parameters isn't viable, so it'll have to be benchmarked.

Benchmark testing might look like the following:

1. Tune a few models on a well-understood dataset using one or two permutations of features, unique to the benchmark run.
2. Benchmark the models on a battery of evaluation metrics that give us the strongest signal regarding model performance.

> **Review comment:** One question here is whether the instructlab/training library should depend on the instructlab/eval library for its own benchmarking, or whether a more generic library (lm-eval) should be used instead. Does the eval library bring anything unique for testing purposes?
>
> **Reply:** I don't have a clear answer on this upfront. I feel like we'd want to dogfood our own lib but be flexible.

Implementing this testing schema can be another project for a later time.

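If and when we do implement it, the two-step flow above could reduce to a sketch like the following. `train_model`, `evaluate_checkpoint`, the dataset path, and the baseline scores are all hypothetical placeholders, not real library APIs or agreed-upon targets.

```python
# Hedged sketch only: every name below is a placeholder for whatever training
# entry point and evaluation harness (instructlab/eval or lm-eval) we settle on.
BASELINES = {"mmlu": 0.55, "gsm8k": 0.35}  # placeholder numbers

def run_benchmark_suite(dataset_path, features):
    ckpt_dir = train_model(dataset_path, features=features)      # hypothetical
    return evaluate_checkpoint(ckpt_dir, tasks=list(BASELINES))  # hypothetical

def test_benchmark_regression():
    scores = run_benchmark_suite("path/to/well_understood_dataset", features=("lora",))
    for task, baseline in BASELINES.items():
        # Small tolerance so run-to-run noise doesn't fail the suite.
        assert scores[task] >= baseline - 0.02, f"{task} regressed: {scores[task]:.3f}"
```
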
## Concerns and Notes

### Running smoke tests is expensive and shouldn't be done for every contribution

Repository maintainers may want to implement a manual smoke-testing policy based on what a PR actually changes, rather than have the test be automatic. If an incoming PR changes the library interface but doesn't directly affect the training code, a unit test would probably be the most appropriate way to check whether all behavior still squares.

### How often should we be running benchmark tests?

There are many options for this, depending on the observed risk that new features or dependency bumps bring to the project. Likely, a comprehensive benchmark test should be run once as a baseline, and then repeated only when new model architectures are supported, major versions of ilab SDG are released, etc.

> **Review comment:** We might want to do this more often, like on medium-to-large library changes, or upon a weekly release to ensure we didn't merge anything in the last week that broke stuff.

### How should these tests be implemented? Bash scripts? Pytest?

Pytest, with Tox to drive it. Writing tests in Bash is very hard to maintain.

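For example, the tier markers used in the sketches above could be registered once in a shared `conftest.py` (the marker names here are assumptions, not an existing convention in this repo):

```python
# conftest.py sketch: register the tier markers so pytest can select tiers
# with `-m`, e.g. `pytest -m "not smoke"` on a CPU-only machine.
def pytest_configure(config):
    config.addinivalue_line("markers", "smoke: tests that require accelerated hardware")
    config.addinivalue_line("markers", "benchmark: long-running benchmark tests")
```

CI for the always-on tiers could then run `pytest -m "not smoke and not benchmark"`, reserving `-m smoke` for accelerated runners.
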
## Conclusion

Testing ML models has never been easy, and it's unlikely that the final implementation of these ideas will closely mirror this document. However, we hope that the _spirit_ of this document (fast feedback, efficient use of resources, prioritization of unit testing) will be effective.

> **Review comment:** What is meant by RLHF in this case?
>
> **Reply:** "Reinforcement Learning from Human Feedback"