*Figure: Comparison of parameter-data scaling contours for datasets of two different gzip-compressibilities.*

🐦 Twitter   •   📄 arXiv   •   🤗 Datasets

🔗 Multimodal CodeGen for Web Data Extraction

# gzip Predicts Data-dependent Scaling Laws

This is the official code for *gzip Predicts Data-dependent Scaling Laws* (under review at NeurIPS 2024).

We find that:

  1. scaling laws are sensitive to differences in data complexity
  2. gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties

Our data-dependent scaling law's compute-optimal frontier shifts its preference from parameter count toward dataset size as training data becomes more complex (i.e., harder to compress).
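
As a rough sketch of the core measurement, gzip-compressibility can be estimated as the ratio of gzip-compressed bytes to raw bytes. The `gzip_compressibility` helper below is a hypothetical illustration, not necessarily the exact implementation in `data_utils.py`:

```python
import gzip

def gzip_compressibility(text: str) -> float:
    """Ratio of gzip-compressed size to raw size (lower = more compressible)."""
    raw = text.encode("utf-8")
    compressed = gzip.compress(raw)
    return len(compressed) / len(raw)

# Highly repetitive (simple) data compresses well -> small ratio;
# more complex data yields a ratio closer to 1.
print(gzip_compressibility("ab" * 1000))
print(gzip_compressibility("the quick brown fox jumps over the lazy dog"))
```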

## Code Overview

- `data_gen.py`: create PCFGs with specified syntactic properties and sample text datasets from them (see the PCFG sketch after this list)
- `data_utils.py`: gzip-compressibility measurement, tokenization and HuggingFace tooling, dataloaders, etc.
- `training.py`: execute a single training run for a given model and dataset, returning the loss at each training step
- `main.py`: launch a set of training runs across datasets and model sizes, hackily GPU-parallelized with threading (see the threading sketch after this list)
- `fsdp_training.py`: run larger jobs with cleaner data loading and FSDP training
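
For intuition on what `data_gen.py` produces, here is a minimal PCFG sampler with a toy grammar. The production format and `sample` helper are illustrative assumptions, not the repo's actual API:

```python
import random

# Toy PCFG: each nonterminal maps to (probability, expansion) pairs.
# Expansions are lists of symbols; symbols absent from GRAMMAR are terminals.
GRAMMAR = {
    "S":  [(0.7, ["NP", "VP"]), (0.3, ["VP"])],
    "NP": [(0.5, ["det", "noun"]), (0.5, ["noun"])],
    "VP": [(0.6, ["verb", "NP"]), (0.4, ["verb"])],
}

def sample(symbol: str) -> list[str]:
    """Recursively expand a symbol into a list of terminals."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [symbol]
    probs, expansions = zip(*GRAMMAR[symbol])
    expansion = random.choices(expansions, weights=probs, k=1)[0]
    return [tok for sym in expansion for tok in sample(sym)]

print(" ".join(sample("S")))  # e.g. "det noun verb noun"
```

Varying the number of nonterminals, productions, and their probabilities changes the dataset's syntactic complexity, and with it the dataset's gzip-compressibility.

The thread-per-GPU pattern used by `main.py` can be sketched as follows; `run_job` is a hypothetical stand-in for one (dataset, model size) training run, and this is only an outline of the pattern the README calls "hacky", not the repo's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def run_job(job_id: int, device: str) -> None:
    # Placeholder for a single training run pinned to one GPU.
    print(f"job {job_id} running on {device}")

n_gpus = max(torch.cuda.device_count(), 1)
jobs = list(range(8))  # e.g. all (dataset, model size) combinations

# One worker thread per GPU; jobs are assigned to devices round-robin.
with ThreadPoolExecutor(max_workers=n_gpus) as pool:
    for i in jobs:
        pool.submit(run_job, i, f"cuda:{i % n_gpus}")
```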

Upon request via email, we can also provide:

- JSONL records of all training runs (these are too large to host on GitHub)
- the Jupyter notebook used to fit scaling laws to the training runs and generate all visuals