*Figure: Comparison of parameter-data scaling contours for datasets of two different gzip-compressibilities.*

🐦 Twitter   •   📄 arXiv   •   🤗 Datasets

🔗 Multimodal CodeGen for Web Data Extraction

# gzip Predicts Data-dependent Scaling Laws

This is the official code for *gzip Predicts Data-dependent Scaling Laws* (under review at NeurIPS 2024).

We find that:

  1. scaling laws are sensitive to differences in data complexity
  2. gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties

Our data-dependent scaling law's compute-optimal frontier shifts its preference from parameter count toward dataset size as training data becomes more complex (i.e., harder to compress).
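
As a rough sketch of the core measurement, gzip-compressibility can be estimated as the ratio of gzip-compressed bytes to raw bytes. The `gzip_compressibility` helper below is a hypothetical illustration, not necessarily the exact implementation in `data_utils.py`:

```python
import gzip

def gzip_compressibility(text: str) -> float:
    """Ratio of gzip-compressed size to raw size (lower = more compressible)."""
    raw = text.encode("utf-8")
    compressed = gzip.compress(raw)
    return len(compressed) / len(raw)

# Highly repetitive (simple) data compresses well -> small ratio;
# more complex data yields a ratio closer to 1.
print(gzip_compressibility("ab" * 1000))
print(gzip_compressibility("the quick brown fox jumps over the lazy dog"))
```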

## Code Overview

- `data_gen.py`: create PCFGs with specified syntactic properties and sample text datasets from them (see the PCFG sketch after this list)
- `data_utils.py`: gzip-compressibility measurement, tokenization and HuggingFace tooling, dataloaders, etc.
- `training.py`: execute a single training run for a given model and dataset, returning the loss at each training step
- `main.py`: launch a set of training runs across datasets and model sizes, hackily GPU-parallelized with threading (see the threading sketch after this list)
- `fsdp_training.py`: run larger jobs with cleaner data loading and FSDP training
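
For intuition on what `data_gen.py` produces, here is a minimal PCFG sampler with a toy grammar. The production format and `sample` helper are illustrative assumptions, not the repo's actual API:

```python
import random

# Toy PCFG: each nonterminal maps to (probability, expansion) pairs.
# Expansions are lists of symbols; symbols absent from GRAMMAR are terminals.
GRAMMAR = {
    "S":  [(0.7, ["NP", "VP"]), (0.3, ["VP"])],
    "NP": [(0.5, ["det", "noun"]), (0.5, ["noun"])],
    "VP": [(0.6, ["verb", "NP"]), (0.4, ["verb"])],
}

def sample(symbol: str) -> list[str]:
    """Recursively expand a symbol into a list of terminals."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [symbol]
    probs, expansions = zip(*GRAMMAR[symbol])
    expansion = random.choices(expansions, weights=probs, k=1)[0]
    return [tok for sym in expansion for tok in sample(sym)]

print(" ".join(sample("S")))  # e.g. "det noun verb noun"
```

Varying the number of nonterminals, productions, and their probabilities changes the dataset's syntactic complexity, and with it the dataset's gzip-compressibility.

The thread-per-GPU pattern used by `main.py` can be sketched as follows; `run_job` is a hypothetical stand-in for one (dataset, model size) training run, and this is only an outline of the pattern the README calls "hacky", not the repo's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def run_job(job_id: int, device: str) -> None:
    # Placeholder for a single training run pinned to one GPU.
    print(f"job {job_id} running on {device}")

n_gpus = max(torch.cuda.device_count(), 1)
jobs = list(range(8))  # e.g. all (dataset, model size) combinations

# One worker thread per GPU; jobs are assigned to devices round-robin.
with ThreadPoolExecutor(max_workers=n_gpus) as pool:
    for i in jobs:
        pool.submit(run_job, i, f"cuda:{i % n_gpus}")
```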

Upon request via email, we can also provide:

- JSONL records of all training runs (these are too large to host on GitHub)
- the Jupyter notebook used to fit scaling laws to the training runs and generate all visuals