Modular dataset configuration #104

jlamypoirier · 2025-01-06T21:06:56Z

✨ Description

A set of composable and dynamic dataset configuration classes, that allow defining arbitrary dataset definition schemes.

Dynamic configuration classes: experimental implementation in GPTDatasetConfig. if it works well we could use elsewhere, ex. for defining plugins.. It defines a class registry that is populated in __init_subclass__, and works as long as the subclass is imported.

The data config now has a dataset entry, which can be any dataset config. That dataset config may contain further nested dataset configs, etc.
Datasets can be sampled or samplable. Both support build_and_sample(config), but samplable (subclass of sampled) also supports build(). Indexed datasets are a special case of samplable datasets which also support indexing (get and len).

The types for now are:

memmap: A typical Megatron dataset, indexed
concatenated: The concatenation of multiple indexed datasets if it were one. Currently unused.
slice: A contiguous slice of an indexed dataset, for subsampling or train/valid/test split.
blended: Blend sampled datasets according to the given probabilities.
dummy: Always returns the same sample. Only available as sampled.
legacy: Same as before this PR, for backward compatibility only. This is the only way to do dataset from json files, which we aim to replace with a concatenated one anyway.

Datasets are defined are data.datasets.[Phase]
Dataset classes may (and typically do) include nested dataset definitions.

Misc:

Remove split dataset machinery.

Breaking change: sample dataset source has been dropped since it's not that relevant. Otherwise configs are backward-compatible (for now).

For future work:

Concatenate dataset from all files in directory.
Remove capitalization on phase names in config
Make phase names more flexible? (plain str instead of enum)
shuffle wrapper for indexed datasets?
Move dataset definition next to phase definition in train config?
More features.
Simplify things?

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

tscholak · 2025-01-20T14:41:20Z

Hi @jlamypoirier,

Thanks for your reply! I wanted to address your points in detail, as I think there's a larger discussion here about accessibility and maintainability for the team as a whole.

The choice between __init_subclass__ is mostly personal preference here.

I don't think this is just about personal preference, far from it. Implementation choices like this have significant implications for accessibility and maintainability, especially for contributors who are unfamiliar with unconventional patterns like __init_subclass__ in this context. While __init_subclass__ works, it's not a pattern most Python developers encounter often, let alone with kwargs in class definitions.

In contrast, the metaclass approach, while also somewhat advanced, avoids the non-standard kwargs in class definitions. It aligns more closely with existing Python patterns for tagged unions, especially in libraries like Pydantic, which are familiar to many developers.

Do you have a specific reason for using a metaclass?

The primary reason isn't about metaclasses themselves but about making the codebase more approachable. The metaclass solution avoids the unconventional use of kwargs in class definitions, which I found confusing even with my experience. If it took me time to understand, it will likely be even harder for others.

With the metaclass approach (or something else that leads to sum-type patterns similar to Pydantic), contributors only need to understand:

A base class representing the sum type.
Subclasses with a ClassVar acting as a type tag.

This is simpler, more explicit, and easier to discover.

I personally prefer __init_subclass__ because it's way easier to deal with.

This might be true for you, but we should consider what's easiest for the team and future contributors. When patterns like this rely on nuanced knowledge of Python's internals (__init_subclass__, dynamic kwargs, etc.), it raises the barrier for others to understand or extend the code.

Using a familiar and explicit approach (like class-level tags) ensures that even those unfamiliar with advanced Python features can contribute. Ultimately, our goal should be to lower the barrier to entry, not increase it.

Concerning the kwarg for the class, it's not necessary, but I chose to have it for clarity and to deal with edge cases.

Respectfully, I think this design achieves the opposite of clarity. Using kwargs in a class definition is unconventional in Python, and most contributors won't immediately understand its purpose. It's a pattern that will make people hesitant to touch the codebase for fear of breaking something they don't fully grasp. This is a problem that Fast-LLM has today, and we want to make the library more approachable and not less.

Regarding edge cases: if the primary issue is avoiding registration of abstract or generic classes, there are simpler ways to achieve this without introducing kwargs. For instance:

A dedicated abstract flag could be added to the base class.
Explicit registration mechanisms could allow full control without relying on hidden behaviours introduced through __init_subclass__ or meta classes.

The main idea is we could have a complicated hierarchy of classes, and we don't want to add all of them to the registry (e.g., abstract/generic ones).

If we're talking about a "complicated hierarchy of classes" for a sum type, I'd suggest re-evaluating whether that level of complexity is necessary. Configuration data is fundamentally declarative. Complex inheritance hierarchies are often overkill and introduce unnecessary coupling between data and behaviour.

If we truly need this complexity, then yes, we can design around abstract classes not being registered. However, this should be done in a way that is transparent and easy to follow. The current kwargs-based solution obscures the intent, making it harder for others to understand and extend.

With a plain ClassVar, it's harder to tell whether we want to add the class to the registry because the value could have been inherited from the base class.

This is true, but it's not an unsolvable problem. For example:

Use explicit registration logic instead of relying on implicit behavior.
Introduce a flag like _abstract that prevents a class from being added to the registry.
Alternatively, require subclasses to explicitly override the tag field, which can serve as both a type discriminator and a registration signal.

This would remove ambiguity while keeping the design clean and approachable.

There are obviously solutions without the class kwargs, but I thought it was the cleanest way.

I understand why you might see this as clean from your perspective, but for others (myself included), the use of class kwargs is off-putting and unconventional. It introduces a barrier to understanding and contributes to a sense that this part of the codebase is overly specialized.

Also, I get the distinct feeling that this PR is laying the foundation for a pattern that will likely be extended to other parts of the codebase. Making it as intuitive and Pythonic as possible is critical. A design inspired by Pydantic's tagged unions (i.e., using explicit class-level tags and clear registration mechanisms) would achieve the same goals while being more familiar and accessible.

Ultimately, I'd like us to ensure this codebase is not only functional, flexible, and performance, but also accessible to contributors, both current and future. Patterns that require deep knowledge of Python internals or rely on unconventional mechanisms (like kwargs in class definitions) create barriers to collaboration. My goal here isn't to criticize but to ensure we also prioritize readability, maintainability, and accessibility for the team as a whole.

jlamypoirier · 2025-01-20T18:30:57Z

Hi Torsten! Thanks for your comments, but I must respectfully disagree on some aspects.

The metaclass vs __init_subclass__ and kwarg vs Classvar are independent questions, let's treat them as such.

__init_subclass__ is the new(ish) simpler and better way to register subclasses since python 3.6 https://peps.python.org/pep-0487/, I think this alone should settle it as per our contribution guidelines.

I don't really get either of your arguments for a metaclass, in fact I see them as arguments for the opposite. Metaclasses are a complicated and obscure concept most developers should never have to deal with. Yes some experienced developers may have worked with them in the past, but they will have no problem dealing with the simpler version. In my opinion most developers won't know either, and __init_subclass__ is way more accessible. Therefore, in terms of readability, maintainability, and accessibility, __init_subclass__ is clearly better.

With the metaclass approach (or something else that leads to sum-type patterns similar to Pydantic), contributors only need to understand:

A base class representing the sum type.

Subclasses with a ClassVar acting as a type tag.

This is simpler, more explicit, and easier to discover.

There is absolutely no difference with __init_subclass__. The kwarg vs Classvar thing is a separate concern.

I don't see much value trying to imitate pydantic:

There is no difference from the user point of view, other than __init_subclass__ avoiding potential metaclass interactions. One would have to dig deep to notice the difference in implementation.
It's not really a design choice. Pydantic's development started before python 3.6, so they had to use metaclass. They might also have other reasons for using metaclasses that don't apply here.

jlamypoirier · 2025-01-20T18:31:05Z

As for the separate kwarg vs Classvar matter, this is really a design choice, we can discuss it discuss it in our next meeting. I introduced the kwarg to try it out, because I think it may offer a good alternative to some of the messier parts of Fast-LLM, including the massive use of class variables. Yes I'm aware of the alternative ways to do it, the abstract flag is already there actually (though I'd prefer a better solution).

Either way, I think it's mostly a matter of documentation. We either have to tell users to add a class kwarg or a classvar, and in either case it magically triggers a hidden part of the code, so it's equally opaque to users. So the difference is really about the subjective matter of which one looks the scariest.

tscholak · 2025-01-20T22:29:10Z

Hi again,

I think we may still be talking past each other, so I'd like to clarify my position.

I really don't care whether we use __init_subclass__, a metaclass, or something else for the internal mechanism. That's an implementation detail, and it can live in a part of the codebase nobody ever touches again. The point I'm raising is about the interface contributors use to define and tag new configuration classes. This is what I care about:

No contributor should need to understand the internal mechanics to define config classes. Whether it's __init_subclass__ or not behind the scenes, it should be abstracted away. Nobody should need to dig into this unless they are maintaining that very core mechanism. Pydantic is a good example: nobody looks at its metaclass implementation, yet the interface is clear and consistent.
The tagging mechanism must be clear, conventional, and Pythonic. Using kwargs in class definitions isn't something any Python developers will expect. I've never seen this being used before, and I suspect others haven't either. Contributors need to understand how to define new configuration classes and tag them in a way that feels natural. Class-level attributes are well-known and widely used for tagging in Python. They are also familiar to developers who’ve worked with libraries like Pydantic.

As for the separate kwarg vs Classvar matter, this is really a design choice, we can discuss it in our next meeting.

I completely disagree that this is a subjective design choice. It's not about preference. It's about whether contributors can easily understand and use the tagging mechanism.

Either way, I think it's mostly a matter of documentation. We either have to tell users to add a class kwarg or a classvar, and in either case, it magically triggers a hidden part of the code, so it’s equally opaque to users.

I have to disagree here, too. This isn't just a matter of documentation. Ideally, contributors can look at a sum type in the codebase and immediately understand how to define their own.

A class-level attribute like kind: ClassVar[str] = "memmap" is intuitive and self-explanatory for most Python developers.
A kwarg in a class definition (e.g., type_="memmap") is not something anyone expects. It’s unconventional and requires explanation, which adds cognitive overhead.

When you say “it's equally opaque,” I think you're discounting the value of familiarity. Familiar patterns reduce friction, while unconventional ones add it. By choosing the familiar approach, we make the tagging mechanism approachable for everyone.

So the difference is really about the subjective matter of which one looks the scariest.

Yes, and I am here to tell you that the kwarg approach looks far scarier to most Python developers. Few people will have seen or used this pattern before, whereas tagging with class-level attributes is widely recognized and understood.

In summary, I strongly believe that the kwarg approach introduces unnecessary barriers to understanding and collaboration and should be replaced with class-level attributes before this PR is merged.

jlamypoirier · 2025-01-20T23:03:13Z

I still have a slight preference for kwargs, but I'll switch it back to a class attribute to avoid an empty debate about what a typical python developer may or may not find understandable.

No contributor should need to understand the internal mechanics to define config classes. Whether it's __init_subclass__ or not behind the scenes, it should be abstracted away. Nobody should need to dig into this unless they are maintaining that very core mechanism. Pydantic is a good example: nobody looks at its metaclass implementation, yet the interface is clear and consistent.

This won't be solved here either way, the only way to make the mechanics clear is by documenting it.

The tagging mechanism must be clear, conventional, and Pythonic. Using kwargs in class definitions isn't something any Python developers will expect. I've never seen this being used before, and I suspect others haven't either. Contributors need to understand how to define new configuration classes and tag them in a way that feels natural. Class-level attributes are well-known and widely used for tagging in Python. They are also familiar to developers who’ve worked with libraries like Pydantic.

Again highly subjective, but I won't argue further.

jlamypoirier · 2025-01-20T23:18:50Z

Alternatively, require subclasses to explicitly override the tag field, which can serve as both a type discriminator and a registration signal.

How would I do that? Other than implicitly by complaining on duplicates?

tscholak

Thanks @jlamypoirier, Lgtm

jlamypoirier added 3 commits January 6, 2025 16:06

Modular dataset configuration

147e33b

fixes

c41a2c5

fix

e013ba2

jlamypoirier marked this pull request as ready for review January 7, 2025 20:25

jlamypoirier requested a review from tscholak January 7, 2025 20:25

tscholak and others added 8 commits January 9, 2025 12:17

Merge branch 'main' into modular_dataset

6b45944

Merge branch 'main' into modular_dataset

952a03d

Generalize indexed

82285ae

fix

7011ca3

Modularize fim, decouple data from dataset, basic tests, misc

9574715

Make tests pass

5532b97

Remove split datasets

5d5e0ab

Make tests pass

baacc4e

jlamypoirier mentioned this pull request Jan 13, 2025

[meta] Fast-LLM Improvements Tracker 🌟 #100

Open

jlamypoirier added 2 commits January 13, 2025 14:31

misc

09640d8

misc

a73acf6

jlamypoirier mentioned this pull request Jan 15, 2025

Typing improvements #114

Merged

8 tasks

Fix merge

bb1b87f

jlamypoirier force-pushed the modular_dataset branch from cf74d2d to bb1b87f Compare January 15, 2025 19:38

jlamypoirier added 3 commits January 15, 2025 15:57

Type hints

148b448

misc

13e4f43

Dataset tweaks

0219006

jlamypoirier mentioned this pull request Jan 16, 2025

Dataset tweaks #118

Merged

8 tasks

Merge branch 'dataset_tweaks' into modular_dataset

b9b516f

jlamypoirier changed the base branch from main to dataset_tweaks January 16, 2025 22:08

jlamypoirier added 5 commits January 16, 2025 17:09

fix

8a33cef

misc

62fbe01

misc

1934828

misc

c0be45c

Merge branch 'dataset_tweaks' into modular_dataset

6358d08

jlamypoirier added 6 commits January 20, 2025 15:33

Merge branch 'dataset_tests' into modular_dataset

bae839b

Fix merge, update tests

41ad25f

Fix

d7c1f38

stuff

28bf7de

Merge branch 'dataset_tests' into modular_dataset

401f56e

fix

17e3aea

Drop class kwarg

ab2f468

tscholak approved these changes Jan 21, 2025

View reviewed changes

jlamypoirier added 3 commits January 21, 2025 20:19

fixes

27587a4

fixes

ca1f944

fixes

3c17819

Base automatically changed from dataset_tests to main January 22, 2025 03:02

jlamypoirier added 10 commits January 21, 2025 22:03

Merge branch 'dataset_tests' into modular_dataset

bd2fcec

Merge branch 'main' into modular_dataset

5405d42

Fix merge

0d8bf14

Fix merge

a0aae75

Fix merge

9dbbcf9

Fix merge

6dea63e

fixes

8245041

fixes

77b1324

fixes

54e5fa5

Match legacy

755c355

jlamypoirier merged commit efb1afb into main Jan 22, 2025
4 checks passed

jlamypoirier deleted the modular_dataset branch January 22, 2025 05:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modular dataset configuration #104

Modular dataset configuration #104

jlamypoirier commented Jan 6, 2025 •

edited

Loading

tscholak commented Jan 20, 2025

jlamypoirier commented Jan 20, 2025

jlamypoirier commented Jan 20, 2025 •

edited

Loading

tscholak commented Jan 20, 2025 •

edited by jlamypoirier

Loading

jlamypoirier commented Jan 20, 2025

jlamypoirier commented Jan 20, 2025

tscholak left a comment

Modular dataset configuration #104

Modular dataset configuration #104

Conversation

jlamypoirier commented Jan 6, 2025 • edited Loading

✨ Description

🔍 Type of change

tscholak commented Jan 20, 2025

jlamypoirier commented Jan 20, 2025

jlamypoirier commented Jan 20, 2025 • edited Loading

tscholak commented Jan 20, 2025 • edited by jlamypoirier Loading

jlamypoirier commented Jan 20, 2025

jlamypoirier commented Jan 20, 2025

tscholak left a comment

Choose a reason for hiding this comment

jlamypoirier commented Jan 6, 2025 •

edited

Loading

jlamypoirier commented Jan 20, 2025 •

edited

Loading

tscholak commented Jan 20, 2025 •

edited by jlamypoirier

Loading