Remove Config objects #30

Merged · 27 commits · Mar 4, 2024

Commits (the diff shown further below is from 1 commit, 35220aa):
460ff9b  Export AnomalyDetector (ejnnr, Feb 28, 2024)
dbae3bf  Make tasks more flexible (ejnnr, Feb 29, 2024)
f16b9ca  Iterating on tasks (ejnnr, Feb 29, 2024)
9073a85  Mostly fix tests (ejnnr, Feb 29, 2024)
54c34a6  [WIP] Remove configs (ejnnr, Mar 1, 2024)
51e6a25  Remove unused DatasetConfigs (ejnnr, Mar 1, 2024)
48f8292  Rename task file (ejnnr, Mar 1, 2024)
79b51ec  WIP on removing ScriptConfig and TrainConfig (ejnnr, Mar 2, 2024)
bdd56fb  Remove backdoor loading/storing logic (ejnnr, Mar 2, 2024)
62e618a  Remove TrainConfig (ejnnr, Mar 2, 2024)
94c54ed  Adjust abstractions (ejnnr, Mar 2, 2024)
4c7e0c2  Remove loggers (ejnnr, Mar 3, 2024)
6809a7e  Fix bugs and tests (ejnnr, Mar 3, 2024)
6f0e472  Move save_path and max_batch_size arguments (ejnnr, Mar 3, 2024)
ae98812  Remove another unused file (ejnnr, Mar 3, 2024)
31a7993  Remove more unused code (ejnnr, Mar 3, 2024)
f0dacc5  Minor improvements and remove TODOs (ejnnr, Mar 3, 2024)
0267bd1  Fix demo notebook (ejnnr, Mar 3, 2024)
975289e  Add WaNet warning (ejnnr, Mar 3, 2024)
1b82635  Update gitignore (ejnnr, Mar 3, 2024)
35220aa  Update documentation somewhat (ejnnr, Mar 3, 2024)
f9ab02b  Remove simple_parsing dependency (ejnnr, Mar 3, 2024)
80463e2  Merge remote-tracking branch 'origin/main' into no-configs (ejnnr, Mar 4, 2024)
d61c676  Adjust tampering/LM code to no-config style (ejnnr, Mar 4, 2024)
565f456  Add convenience method to clone WanetBackdoor instance (VRehnberg, Mar 4, 2024)
2c1b38c  Minor changes to WaNet cloning (ejnnr, Mar 4, 2024)
f7e9300  Merge pull request #34 from VRehnberg/wanet-partial-clone-method (ejnnr, Mar 4, 2024)
Update documentation somewhat
ejnnr committed Mar 3, 2024
commit 35220aabc9c65952ef20999d193f00947d1a6819
4 changes: 2 additions & 2 deletions README.md
@@ -31,13 +31,13 @@ installing `cupbearer`, in particular if you want to control CUDA version etc.

## Running experiments
We provide scripts in `cupbearer.scripts` for more easily running experiments.
-See [demo.ipynb](demo.ipynb) for a quick example of how to use them---this is likely
+See [the demo notebook](notebooks/simple_demo.ipynb) for a quick example of how to use them---this is likely
also the best way to get an overview of how the components of `cupbearer` fit together.

These "scripts" are Python functions and designed to be used from within Python,
e.g. in a Jupyter notebook or via [submitit](https://github.com/facebookincubator/submitit/tree/main)
if on Slurm. But of course you could also write a simple Python wrapper and then use
-them from the CLI. Their configuration interface is designed to be very general,
+them from the CLI. The scripts are designed to be pretty general,
which sometimes comes at the cost of being a bit verbose---we recommend writing helper
functions for your specific use case on top of the general script interface.
Of course you can also use the components of `cupbearer` directly without going through
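
As an illustration of the submitit route mentioned above, here is a hedged sketch; the `train_detector` keyword arguments and the `my_task`/`my_detector` variables are placeholder assumptions, not the verified interface:

```python
# Sketch only: submitting a cupbearer script to Slurm via submitit.
# my_task / my_detector are hypothetical placeholders, and train_detector's
# keyword arguments are assumptions rather than the verified signature.
import submitit

from cupbearer.scripts import train_detector

executor = submitit.AutoExecutor(folder="slurm_logs")
executor.update_parameters(timeout_min=60, slurm_partition="gpu")  # cluster-specific
job = executor.submit(train_detector, task=my_task, detector=my_detector)
job.result()  # blocks until the Slurm job finishes
```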
139 changes: 0 additions & 139 deletions docs/adding_a_script.md

This file was deleted.

87 changes: 0 additions & 87 deletions docs/adding_a_task.md
@@ -1,88 +1 @@
# Adding a new task

The only component that a task absolutely needs is an implementation of the
`TaskConfigBase` abstract class:
```python
class TaskConfigBase(BaseConfig, ABC):
    @abstractmethod
    def build_reference_data(self) -> Dataset:
        pass

    @abstractmethod
    def build_model(self) -> Model:
        pass

    def build_params(self):
        return None

    @abstractmethod
    def build_anomalous_data(self) -> Dataset:
        pass
```
If your config has any parameters, you should use a dataclass to set them. E.g.
```python
@dataclass
class MyTaskConfig(TaskConfigBase):
    my_required_param: str
    my_optional_param: int = 42

    ...
```
This will automagically let you override these parameters from the command line
(and any parameters without default values will be required).

`build_reference_data` and `build_anomalous_data` both need to return `pytorch` `Dataset`s.
`build_model` needs to return a `models.Model`, which is a special type of `flax.linen.Module`.
`build_params` can return a parameter dict for the returned `Model` (if `None`, the model
will be randomly initialized, which is usually not what you want).

In practice, the datasets and the model will have to come from somewhere, so you'll
often implement a few things in addition to the task config class. There are predefined
interfaces for datasets and models, and if possible I suggest using those (either
using their existing implementations, or adding your own). For example, consider
the adversarial example task:
```python
@dataclass
class AdversarialExampleTask(TaskConfigBase):
    run_path: Path

    def __post_init__(self):
        self._reference_data = TrainDataFromRun(path=self.run_path)
        self._anomalous_data = AdversarialExampleConfig(run_path=self.run_path)
        self._model = StoredModel(path=self.run_path)

    def build_anomalous_data(self) -> Dataset:
        return self._anomalous_data.build_dataset()

    def build_model(self) -> Model:
        return self._model.build_model()

    def build_params(self):
        return self._model.build_params()

    def build_reference_data(self) -> Dataset:
        return self._reference_data.build_dataset()
```
This task only has one parameter, the path to the training run of a base model.
It then uses the training data of that run as reference data, and an adversarial
version of it as anomalous data. The model is just the trained base model, loaded
from disk.

You can also add new scripts in the `scripts` directory, to generate the datasets
and/or train the model. For example, the adversarial examples task has an
associated script `make_adversarial_examples.py`. (To get the model, we can simply
use the existing `train_classifier.py` script.)

There's no formal connection between scripts and the rest of the library---you can
leave it up to users to run the necessary preparatory scripts before using your new
task. But if feasible, you may want to automate this. For example, the `AdversarialExampleDataset`
automatically runs `make_adversarial_examples.py` if the necessary files are not found.

Finally, you need to register your task to make it accessible from the command line
in the existing scripts. Simply add the task config class to the `TASKS` dict in `tasks/__init__.py`
(with an arbitrary name as the key).

Then you should be able to run commands like
```bash
python -m cupbearer.scripts.train_detector --task my_task --detector my_detector --task.my_required_param foo
```
43 changes: 0 additions & 43 deletions docs/configuration.md

This file was deleted.

70 changes: 13 additions & 57 deletions docs/high_level_structure.md
@@ -3,32 +3,13 @@ In this document, we'll go over all the subpackages of `cupbearer` to see what r
they play and how to extend them. For more details of extending `cupbearer`, see
the other documentation files on specific subpackages.

-## Configuration
-Different parts of `cupbearer` interface with each other through many configuration
-dataclasses. Each dataset, model, task, detector, script, etc. should expose all its
-hyperparameters and configuration options through such a dataclass. That way,
-all options will automatically be configurable from the command line.
-
-Many of the configuration dataclass ABCs have one or several `build()` methods that
-create the actual object of interest based on the configuration. For example,
-the `DetectorConfig` ABC has an abstract `build()` method that must return an
-`AnomalyDetector` instance.
-
-See [configuration.md](configuration.md) for more details on the configuration
-dataclasses and what to keep in mind when writing your own.

## Helper subpackages
### `cupbearer.data`
The `data` package contains implementations of basic datasets, transforms,
and specialized datasets (e.g. datasets consisting only of adversarial examples).
-The key interface is the `DatasetConfig` class. It has a `build()` method that
-needs to return a pytorch `Dataset` instance.
-
-In principle, you don't need to use the `DatasetConfig` interface (or anything
-from the `data` package) to implement new tasks or detectors. Tasks and detectors
-just pass `Dataset` instances between each other. But unless you have a good reason
-to avoid the `DatasetConfig` interface, it's best to use it since it already works
-with the scripts and you get some features such as configuring transforms for free.
+Using this subpackage is optional; you can define tasks directly using standard
+pytorch `Dataset`s.
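
For instance, a minimal sketch of task data built from plain pytorch `Dataset`s (shapes and labels below are purely illustrative):

```python
# Plain pytorch Datasets are enough as task data; no cupbearer-specific
# wrapper is required. Shapes and labels here are illustrative only.
import torch
from torch.utils.data import TensorDataset

reference_data = TensorDataset(
    torch.randn(1000, 3, 32, 32), torch.zeros(1000, dtype=torch.long)
)
# Anomalous data in the same format, drawn from a shifted distribution.
anomalous_data = TensorDataset(
    torch.randn(1000, 3, 32, 32) + 0.5, torch.zeros(1000, dtype=torch.long)
)
```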

### `cupbearer.models`
Unlike the `data` package, you have to use the `models` package at the moment.
@@ -37,53 +18,32 @@ to the model's activations. Using the implementations from the `models` package
ensures a consistent way to get activations from models. As long as you don't want
to add new model architectures, most of the details of this package won't matter.

-For now, only linear computational graphs are supported, i.e. each model needs to
-be a fixed sequence of computational steps performed one after the other
-(like a `Sequential` module in many deep learning frameworks). A `Computation`
-is just a type alias for such a sequence of steps. The `Model` class takes such a
-`Computation` and is itself a `flax.linen.Module` that implements the computation.
-The main thing it does on top of `flax.linen.Sequential` is that it can also return
-all the activations of the model. It also has a function for plotting the architecture
-of the model.
-
-Similar to the `DataConfig` interface, there's a `ModelConfig` with a `build()`
-method that returns a `Model` instance.
+In the future, we'll likely deprecate the `HookedModel` interface and just support
+standard `torch.nn.Module`s via pytorch hooks.
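
For reference, the pytorch-hook approach would look roughly like this (standard torch APIs only, nothing cupbearer-specific):

```python
# Capturing activations from a plain torch.nn.Module with forward hooks.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 20),
    torch.nn.ReLU(),
    torch.nn.Linear(20, 2),
)
activations = {}

def make_hook(name):
    def hook(module, args, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if name:  # skip the root module itself
        module.register_forward_hook(make_hook(name))

model(torch.randn(1, 10))
# activations now maps "0", "1", "2" to each submodule's output tensor.
```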

### `cupbearer.utils`
-The `utils` package contains many miscellaneous helper functions. You probably won't
-interact with these too much, but here are a few that it may be good to know about:
-- `utils.trainer` contains a `Trainer` class that's a very simple version of pytorch
-lightning for flax. You certainly don't need to use this in any scripts you add,
-but it may save you some boilerplate. NOTE: we might deprecate this in the future
-and replace it with something like `elegy`.
-- `utils.utils.save` and `utils.utils.load` can save and store pytrees. They use the
-`orbax` checkpointer under the hood, but add some hacky support for saving/loading
-types.
-
-We'll cover a few more functions from the `utils` package when we talk about scripts.
+The `utils` package contains some miscellaneous helper functions. Most of these are
+for internal use, but see the example notebooks for the helpful ones.

## Tasks
-The `tasks` package contains the `TaskConfigBase` ABC, which is the interface any
-task needs to implement, as well as all the existing tasks. To add a new task:
-1. Create a new module or subpackage in `tasks`, where you implement a new class
-that inherits `TaskConfigBase`.
-2. Add your new class to the `TASKS` dictionary in `tasks/__init__.py`.
+The `tasks` package contains the `Task` class, which is the interface any
+task needs to implement, as well as all the existing tasks. To add a new task,
+you can either inherit `Task` or simply write a function that returns a `Task` instance.
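
In sketch form (the `Task` constructor arguments shown here are assumptions for illustration; check `cupbearer.tasks` for the actual interface):

```python
# Hedged sketch: the Task(...) keyword arguments below are assumed, not verified.
from cupbearer.tasks import Task

def my_task(model, reference_data, anomalous_data) -> Task:
    return Task(
        model=model,
        reference_data=reference_data,
        anomalous_data=anomalous_data,
    )
```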

-Often, you'll also need to implement a new type of dataset or model.
+Often, you'll also need to implement a new type of dataset or model for your task.
That code probably belongs in the `data` and `model` packages,
though sometimes it's a judgement call.

-See [adding_a_task.md](adding_a_task.md) for more details.

## Detectors
-The `detectors` package is similar to `tasks`, but for anomaly detectors. In addition
-to the `DetectorConfig` interface, it also contains an `AnomalyDetector` ABC, which
-any detection method needs to subclass for its actual implementation.
+The `detectors` package is similar to `tasks`, but for anomaly detectors. The key
+interface is `AnomalyDetector`.

See [adding_a_detector.md](adding_a_detector.md) for more details.

## Scripts
-The `scripts` package contains command line scripts and their configurations.
+The `scripts` package contains Python functions for running common workflows.
Two scripts are meant to be used by all detectors/tasks:
- `train_detector` trains a detector on a task and saves the trained detector to disk.
- `eval_detector` evaluates a stored (or otherwise specified) detector and evaluates
@@ -92,7 +52,3 @@ Two scripts are meant to be used by all detectors/tasks:
All other scripts are helper scripts for specific tasks or detectors. For example,
most tasks will need a script to train the model to be analyzed, and perhaps to prepare
the dataset.
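
Assuming both shared scripts take the task and detector directly, usage would look roughly like this (argument names are illustrative, not verified):

```python
# Hedged sketch of the two shared scripts; argument names are assumptions,
# and task / detector are hypothetical placeholders.
from cupbearer.scripts import eval_detector, train_detector

train_detector(task=task, detector=detector, save_path="runs/demo")
eval_detector(task=task, detector=detector, save_path="runs/demo")
```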

-There's a lot more to be said about scripts, see the [README](../README.md) for a brief
-overview of *running* scripts, and [adding_a_script.md](adding_a_script.md) for details
-on writing new scripts.