-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modular dataset configuration #104
Conversation
cf74d2d
to
bb1b87f
Compare
Hi @jlamypoirier, Thanks for your reply! I wanted to address your points in detail, as I think there's a larger discussion here about accessibility and maintainability for the team as a whole.
I don't think this is just about personal preference, far from it. Implementation choices like this have significant implications for accessibility and maintainability, especially for contributors who are unfamiliar with unconventional patterns like In contrast, the metaclass approach, while also somewhat advanced, avoids the non-standard kwargs in class definitions. It aligns more closely with existing Python patterns for tagged unions, especially in libraries like Pydantic, which are familiar to many developers.
The primary reason isn't about metaclasses themselves but about making the codebase more approachable. The metaclass solution avoids the unconventional use of kwargs in class definitions, which I found confusing even with my experience. If it took me time to understand, it will likely be even harder for others. With the metaclass approach (or something else that leads to sum-type patterns similar to Pydantic), contributors only need to understand:
This is simpler, more explicit, and easier to discover.
This might be true for you, but we should consider what's easiest for the team and future contributors. When patterns like this rely on nuanced knowledge of Python's internals ( Using a familiar and explicit approach (like class-level tags) ensures that even those unfamiliar with advanced Python features can contribute. Ultimately, our goal should be to lower the barrier to entry, not increase it.
Respectfully, I think this design achieves the opposite of clarity. Using kwargs in a class definition is unconventional in Python, and most contributors won't immediately understand its purpose. It's a pattern that will make people hesitant to touch the codebase for fear of breaking something they don't fully grasp. This is a problem that Fast-LLM has today, and we want to make the library more approachable and not less. Regarding edge cases: if the primary issue is avoiding registration of abstract or generic classes, there are simpler ways to achieve this without introducing kwargs. For instance:
If we're talking about a "complicated hierarchy of classes" for a sum type, I'd suggest re-evaluating whether that level of complexity is necessary. Configuration data is fundamentally declarative. Complex inheritance hierarchies are often overkill and introduce unnecessary coupling between data and behaviour. If we truly need this complexity, then yes, we can design around abstract classes not being registered. However, this should be done in a way that is transparent and easy to follow. The current kwargs-based solution obscures the intent, making it harder for others to understand and extend.
This is true, but it's not an unsolvable problem. For example:
This would remove ambiguity while keeping the design clean and approachable.
I understand why you might see this as clean from your perspective, but for others (myself included), the use of class kwargs is off-putting and unconventional. It introduces a barrier to understanding and contributes to a sense that this part of the codebase is overly specialized. Also, I get the distinct feeling that this PR is laying the foundation for a pattern that will likely be extended to other parts of the codebase. Making it as intuitive and Pythonic as possible is critical. A design inspired by Pydantic's tagged unions (i.e., using explicit class-level tags and clear registration mechanisms) would achieve the same goals while being more familiar and accessible. Ultimately, I'd like us to ensure this codebase is not only functional, flexible, and performance, but also accessible to contributors, both current and future. Patterns that require deep knowledge of Python internals or rely on unconventional mechanisms (like kwargs in class definitions) create barriers to collaboration. My goal here isn't to criticize but to ensure we also prioritize readability, maintainability, and accessibility for the team as a whole. |
Hi Torsten! Thanks for your comments, but I must respectfully disagree on some aspects. The
I don't really get either of your arguments for a metaclass, in fact I see them as arguments for the opposite. Metaclasses are a complicated and obscure concept most developers should never have to deal with. Yes some experienced developers may have worked with them in the past, but they will have no problem dealing with the simpler version. In my opinion most developers won't know either, and
There is absolutely no difference with I don't see much value trying to imitate pydantic:
|
As for the separate Either way, I think it's mostly a matter of documentation. We either have to tell users to add a class kwarg or a classvar, and in either case it magically triggers a hidden part of the code, so it's equally opaque to users. So the difference is really about the subjective matter of which one looks the scariest. |
Hi again, I think we may still be talking past each other, so I'd like to clarify my position. I really don't care whether we use
I completely disagree that this is a subjective design choice. It's not about preference. It's about whether contributors can easily understand and use the tagging mechanism.
I have to disagree here, too. This isn't just a matter of documentation. Ideally, contributors can look at a sum type in the codebase and immediately understand how to define their own.
When you say “it's equally opaque,” I think you're discounting the value of familiarity. Familiar patterns reduce friction, while unconventional ones add it. By choosing the familiar approach, we make the tagging mechanism approachable for everyone.
Yes, and I am here to tell you that the kwarg approach looks far scarier to most Python developers. Few people will have seen or used this pattern before, whereas tagging with class-level attributes is widely recognized and understood. In summary, I strongly believe that the kwarg approach introduces unnecessary barriers to understanding and collaboration and should be replaced with class-level attributes before this PR is merged. |
I still have a slight preference for kwargs, but I'll switch it back to a class attribute to avoid an empty debate about what a typical python developer may or may not find understandable.
This won't be solved here either way, the only way to make the mechanics clear is by documenting it.
Again highly subjective, but I won't argue further. |
How would I do that? Other than implicitly by complaining on duplicates? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jlamypoirier, Lgtm
✨ Description
A set of composable and dynamic dataset configuration classes, that allow defining arbitrary dataset definition schemes.
Dynamic configuration classes: experimental implementation in
GPTDatasetConfig
. if it works well we could use elsewhere, ex. for defining plugins.. It defines a class registry that is populated in__init_subclass__
, and works as long as the subclass is imported.The data config now has a
dataset
entry, which can be any dataset config. That dataset config may contain further nested dataset configs, etc.Datasets can be sampled or samplable. Both support
build_and_sample(config)
, but samplable (subclass of sampled) also supportsbuild()
. Indexed datasets are a special case of samplable datasets which also support indexing (get and len).The types for now are:
memmap
: A typical Megatron dataset, indexedconcatenated
: The concatenation of multiple indexed datasets if it were one. Currently unused.slice
: A contiguous slice of an indexed dataset, for subsampling or train/valid/test split.blended
: Blend sampled datasets according to the given probabilities.dummy
: Always returns the same sample. Only available as sampled.legacy
: Same as before this PR, for backward compatibility only. This is the only way to do dataset from json files, which we aim to replace with a concatenated one anyway.Datasets are defined are
data.datasets.[Phase]
Dataset classes may (and typically do) include nested dataset definitions.
Misc:
Breaking change:
sample
dataset source has been dropped since it's not that relevant. Otherwise configs are backward-compatible (for now).For future work:
shuffle
wrapper for indexed datasets?🔍 Type of change
Select all that apply: