Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modular dataset configuration #104

Merged
merged 56 commits into from
Jan 22, 2025
Merged

Modular dataset configuration #104

merged 56 commits into from
Jan 22, 2025

Conversation

jlamypoirier
Copy link
Collaborator

@jlamypoirier jlamypoirier commented Jan 6, 2025

✨ Description

A set of composable and dynamic dataset configuration classes, that allow defining arbitrary dataset definition schemes.

Dynamic configuration classes: experimental implementation in GPTDatasetConfig. if it works well we could use elsewhere, ex. for defining plugins.. It defines a class registry that is populated in __init_subclass__, and works as long as the subclass is imported.

The data config now has a dataset entry, which can be any dataset config. That dataset config may contain further nested dataset configs, etc.
Datasets can be sampled or samplable. Both support build_and_sample(config), but samplable (subclass of sampled) also supports build(). Indexed datasets are a special case of samplable datasets which also support indexing (get and len).

The types for now are:

  • memmap: A typical Megatron dataset, indexed
  • concatenated: The concatenation of multiple indexed datasets if it were one. Currently unused.
  • slice: A contiguous slice of an indexed dataset, for subsampling or train/valid/test split.
  • blended: Blend sampled datasets according to the given probabilities.
  • dummy: Always returns the same sample. Only available as sampled.
  • legacy: Same as before this PR, for backward compatibility only. This is the only way to do dataset from json files, which we aim to replace with a concatenated one anyway.

Datasets are defined are data.datasets.[Phase]
Dataset classes may (and typically do) include nested dataset definitions.

Misc:

  • Remove split dataset machinery.

Breaking change: sample dataset source has been dropped since it's not that relevant. Otherwise configs are backward-compatible (for now).

For future work:

  • Concatenate dataset from all files in directory.
  • Remove capitalization on phase names in config
  • Make phase names more flexible? (plain str instead of enum)
  • shuffle wrapper for indexed datasets?
  • Move dataset definition next to phase definition in train config?
  • More features.
  • Simplify things?

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review January 7, 2025 20:25
@jlamypoirier jlamypoirier requested a review from tscholak January 7, 2025 20:25
@jlamypoirier jlamypoirier mentioned this pull request Jan 15, 2025
8 tasks
@jlamypoirier jlamypoirier mentioned this pull request Jan 16, 2025
8 tasks
@jlamypoirier jlamypoirier changed the base branch from main to dataset_tweaks January 16, 2025 22:08
@tscholak
Copy link
Collaborator

Hi @jlamypoirier,

Thanks for your reply! I wanted to address your points in detail, as I think there's a larger discussion here about accessibility and maintainability for the team as a whole.

The choice between __init_subclass__ is mostly personal preference here.

I don't think this is just about personal preference, far from it. Implementation choices like this have significant implications for accessibility and maintainability, especially for contributors who are unfamiliar with unconventional patterns like __init_subclass__ in this context. While __init_subclass__ works, it's not a pattern most Python developers encounter often, let alone with kwargs in class definitions.

In contrast, the metaclass approach, while also somewhat advanced, avoids the non-standard kwargs in class definitions. It aligns more closely with existing Python patterns for tagged unions, especially in libraries like Pydantic, which are familiar to many developers.

Do you have a specific reason for using a metaclass?

The primary reason isn't about metaclasses themselves but about making the codebase more approachable. The metaclass solution avoids the unconventional use of kwargs in class definitions, which I found confusing even with my experience. If it took me time to understand, it will likely be even harder for others.

With the metaclass approach (or something else that leads to sum-type patterns similar to Pydantic), contributors only need to understand:

  • A base class representing the sum type.
  • Subclasses with a ClassVar acting as a type tag.

This is simpler, more explicit, and easier to discover.

I personally prefer __init_subclass__ because it's way easier to deal with.

This might be true for you, but we should consider what's easiest for the team and future contributors. When patterns like this rely on nuanced knowledge of Python's internals (__init_subclass__, dynamic kwargs, etc.), it raises the barrier for others to understand or extend the code.

Using a familiar and explicit approach (like class-level tags) ensures that even those unfamiliar with advanced Python features can contribute. Ultimately, our goal should be to lower the barrier to entry, not increase it.

Concerning the kwarg for the class, it's not necessary, but I chose to have it for clarity and to deal with edge cases.

Respectfully, I think this design achieves the opposite of clarity. Using kwargs in a class definition is unconventional in Python, and most contributors won't immediately understand its purpose. It's a pattern that will make people hesitant to touch the codebase for fear of breaking something they don't fully grasp. This is a problem that Fast-LLM has today, and we want to make the library more approachable and not less.

Regarding edge cases: if the primary issue is avoiding registration of abstract or generic classes, there are simpler ways to achieve this without introducing kwargs. For instance:

  • A dedicated abstract flag could be added to the base class.
  • Explicit registration mechanisms could allow full control without relying on hidden behaviours introduced through __init_subclass__ or meta classes.

The main idea is we could have a complicated hierarchy of classes, and we don't want to add all of them to the registry (e.g., abstract/generic ones).

If we're talking about a "complicated hierarchy of classes" for a sum type, I'd suggest re-evaluating whether that level of complexity is necessary. Configuration data is fundamentally declarative. Complex inheritance hierarchies are often overkill and introduce unnecessary coupling between data and behaviour.

If we truly need this complexity, then yes, we can design around abstract classes not being registered. However, this should be done in a way that is transparent and easy to follow. The current kwargs-based solution obscures the intent, making it harder for others to understand and extend.

With a plain ClassVar, it's harder to tell whether we want to add the class to the registry because the value could have been inherited from the base class.

This is true, but it's not an unsolvable problem. For example:

  1. Use explicit registration logic instead of relying on implicit behavior.
  2. Introduce a flag like _abstract that prevents a class from being added to the registry.
  3. Alternatively, require subclasses to explicitly override the tag field, which can serve as both a type discriminator and a registration signal.

This would remove ambiguity while keeping the design clean and approachable.

There are obviously solutions without the class kwargs, but I thought it was the cleanest way.

I understand why you might see this as clean from your perspective, but for others (myself included), the use of class kwargs is off-putting and unconventional. It introduces a barrier to understanding and contributes to a sense that this part of the codebase is overly specialized.

Also, I get the distinct feeling that this PR is laying the foundation for a pattern that will likely be extended to other parts of the codebase. Making it as intuitive and Pythonic as possible is critical. A design inspired by Pydantic's tagged unions (i.e., using explicit class-level tags and clear registration mechanisms) would achieve the same goals while being more familiar and accessible.

Ultimately, I'd like us to ensure this codebase is not only functional, flexible, and performance, but also accessible to contributors, both current and future. Patterns that require deep knowledge of Python internals or rely on unconventional mechanisms (like kwargs in class definitions) create barriers to collaboration. My goal here isn't to criticize but to ensure we also prioritize readability, maintainability, and accessibility for the team as a whole.

@jlamypoirier
Copy link
Collaborator Author

Hi Torsten! Thanks for your comments, but I must respectfully disagree on some aspects.

The metaclass vs __init_subclass__ and kwarg vs Classvar are independent questions, let's treat them as such.

__init_subclass__ is the new(ish) simpler and better way to register subclasses since python 3.6 https://peps.python.org/pep-0487/, I think this alone should settle it as per our contribution guidelines.

I don't really get either of your arguments for a metaclass, in fact I see them as arguments for the opposite. Metaclasses are a complicated and obscure concept most developers should never have to deal with. Yes some experienced developers may have worked with them in the past, but they will have no problem dealing with the simpler version. In my opinion most developers won't know either, and __init_subclass__ is way more accessible. Therefore, in terms of readability, maintainability, and accessibility, __init_subclass__ is clearly better.

With the metaclass approach (or something else that leads to sum-type patterns similar to Pydantic), contributors only need to understand:

  • A base class representing the sum type.
  • Subclasses with a ClassVar acting as a type tag.

This is simpler, more explicit, and easier to discover.

There is absolutely no difference with __init_subclass__. The kwarg vs Classvar thing is a separate concern.

I don't see much value trying to imitate pydantic:

  • There is no difference from the user point of view, other than __init_subclass__ avoiding potential metaclass interactions. One would have to dig deep to notice the difference in implementation.
  • It's not really a design choice. Pydantic's development started before python 3.6, so they had to use metaclass. They might also have other reasons for using metaclasses that don't apply here.

@jlamypoirier
Copy link
Collaborator Author

jlamypoirier commented Jan 20, 2025

As for the separate kwarg vs Classvar matter, this is really a design choice, we can discuss it discuss it in our next meeting. I introduced the kwarg to try it out, because I think it may offer a good alternative to some of the messier parts of Fast-LLM, including the massive use of class variables. Yes I'm aware of the alternative ways to do it, the abstract flag is already there actually (though I'd prefer a better solution).

Either way, I think it's mostly a matter of documentation. We either have to tell users to add a class kwarg or a classvar, and in either case it magically triggers a hidden part of the code, so it's equally opaque to users. So the difference is really about the subjective matter of which one looks the scariest.

@tscholak
Copy link
Collaborator

tscholak commented Jan 20, 2025

Hi again,

I think we may still be talking past each other, so I'd like to clarify my position.

I really don't care whether we use __init_subclass__, a metaclass, or something else for the internal mechanism. That's an implementation detail, and it can live in a part of the codebase nobody ever touches again. The point I'm raising is about the interface contributors use to define and tag new configuration classes. This is what I care about:

  1. No contributor should need to understand the internal mechanics to define config classes. Whether it's __init_subclass__ or not behind the scenes, it should be abstracted away. Nobody should need to dig into this unless they are maintaining that very core mechanism. Pydantic is a good example: nobody looks at its metaclass implementation, yet the interface is clear and consistent.

  2. The tagging mechanism must be clear, conventional, and Pythonic. Using kwargs in class definitions isn't something any Python developers will expect. I've never seen this being used before, and I suspect others haven't either. Contributors need to understand how to define new configuration classes and tag them in a way that feels natural. Class-level attributes are well-known and widely used for tagging in Python. They are also familiar to developers who’ve worked with libraries like Pydantic.

As for the separate kwarg vs Classvar matter, this is really a design choice, we can discuss it in our next meeting.

I completely disagree that this is a subjective design choice. It's not about preference. It's about whether contributors can easily understand and use the tagging mechanism.

Either way, I think it's mostly a matter of documentation. We either have to tell users to add a class kwarg or a classvar, and in either case, it magically triggers a hidden part of the code, so it’s equally opaque to users.

I have to disagree here, too. This isn't just a matter of documentation. Ideally, contributors can look at a sum type in the codebase and immediately understand how to define their own.

  • A class-level attribute like kind: ClassVar[str] = "memmap" is intuitive and self-explanatory for most Python developers.
  • A kwarg in a class definition (e.g., type_="memmap") is not something anyone expects. It’s unconventional and requires explanation, which adds cognitive overhead.

When you say “it's equally opaque,” I think you're discounting the value of familiarity. Familiar patterns reduce friction, while unconventional ones add it. By choosing the familiar approach, we make the tagging mechanism approachable for everyone.

So the difference is really about the subjective matter of which one looks the scariest.

Yes, and I am here to tell you that the kwarg approach looks far scarier to most Python developers. Few people will have seen or used this pattern before, whereas tagging with class-level attributes is widely recognized and understood.

In summary, I strongly believe that the kwarg approach introduces unnecessary barriers to understanding and collaboration and should be replaced with class-level attributes before this PR is merged.

@jlamypoirier
Copy link
Collaborator Author

I still have a slight preference for kwargs, but I'll switch it back to a class attribute to avoid an empty debate about what a typical python developer may or may not find understandable.

  1. No contributor should need to understand the internal mechanics to define config classes. Whether it's __init_subclass__ or not behind the scenes, it should be abstracted away. Nobody should need to dig into this unless they are maintaining that very core mechanism. Pydantic is a good example: nobody looks at its metaclass implementation, yet the interface is clear and consistent.

This won't be solved here either way, the only way to make the mechanics clear is by documenting it.

  1. The tagging mechanism must be clear, conventional, and Pythonic. Using kwargs in class definitions isn't something any Python developers will expect. I've never seen this being used before, and I suspect others haven't either. Contributors need to understand how to define new configuration classes and tag them in a way that feels natural. Class-level attributes are well-known and widely used for tagging in Python. They are also familiar to developers who’ve worked with libraries like Pydantic.

Again highly subjective, but I won't argue further.

@jlamypoirier
Copy link
Collaborator Author

Alternatively, require subclasses to explicitly override the tag field, which can serve as both a type discriminator and a registration signal.

How would I do that? Other than implicitly by complaining on duplicates?

Copy link
Collaborator

@tscholak tscholak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jlamypoirier, Lgtm

Base automatically changed from dataset_tests to main January 22, 2025 03:02
@jlamypoirier jlamypoirier merged commit efb1afb into main Jan 22, 2025
4 checks passed
@jlamypoirier jlamypoirier deleted the modular_dataset branch January 22, 2025 05:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants