-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataInputType] Decouple downstream func from data_args #1120
base: main
Are you sure you want to change the base?
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed. |
I'm still a little confused as to why this is necessary? It comes at the cost of adding another data class, I don't really understand the benefit.
As implemented now (and I'm guessing in the future), we always parse data_args before downloading datasets. Is there a case where we would not want to depend on parsing data_args first? |
Were parsing the input arg to a more compatible format to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please rename
load_dataset_args
toload_dataset_kwargs
- Please do not create a new data class
LoadDatasetArguments
, instead putload_dataset_kwargs
directly onDatasetArguments
- Please remove
raw_kwargs
, as it's now been replaced byload_dataset_kwargs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually want to get to this
from datasets import load_dataset
dataset = load_dataset(
name=data_args.dataset, # if str
**data_args.load_dataset_kwargs
)
exactly |
For 2. The base is populated with tons of args, would not know which arg is used where. Better to keep it organized by having a class. |
SUMMARY:
Decouple
data_args
fromget_raw_dataset
.get_raw_dataset
currently depends on hardcoded attributes ofdata_args
- ex. trust_remote_code, streaming - which is only used forload_dataset
Currently, 'data_args` governs
CHANGES:
LoadDatasetArguments
dataclass and make the current code compatible.load_dataset
for 1. hf_stub, 2. hf_cache_path 3. local dataset path.Notes:
Precursor PR for Dataset refactor. The attributes of DatasetArguments attributes will be organized into their own dataclass.
TEST PLAN:
Pass existing tests