Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataInputType] Decouple downstream func from data_args #1120

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

horheynm
Copy link
Collaborator

@horheynm horheynm commented Feb 3, 2025

SUMMARY:
Decouple data_args from get_raw_dataset.

get_raw_dataset currently depends on hardcoded attributes of data_args - ex. trust_remote_code, streaming - which is only used for load_dataset

Currently, 'data_args` governs

  1. raw dataset
  2. processed dataset
  3. data loader

CHANGES:

  • Decouple data_args to to 1, 2, 3 above. For 1. Introduce LoadDatasetArguments dataclass and make the current code compatible.
  • Add tests to for load_dataset for 1. hf_stub, 2. hf_cache_path 3. local dataset path.

Notes:
Precursor PR for Dataset refactor. The attributes of DatasetArguments attributes will be organized into their own dataclass.

TEST PLAN:
Pass existing tests

Copy link

github-actions bot commented Feb 3, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@horheynm horheynm added the ready When a PR is ready for review label Feb 3, 2025
@kylesayrs
Copy link
Collaborator

I'm still a little confused as to why this is necessary? It comes at the cost of adding another data class, I don't really understand the benefit.

get_raw_dataset is downstream and should not depend on parsing to data_args first

As implemented now (and I'm guessing in the future), we always parse data_args before downloading datasets. Is there a case where we would not want to depend on parsing data_args first?

@horheynm
Copy link
Collaborator Author

horheynm commented Feb 4, 2025

I'm still a little confused as to why this is necessary? It comes at the cost of adding another data class, I don't really understand the benefit.

get_raw_dataset is downstream and should not depend on parsing to data_args first

As implemented now (and I'm guessing in the future), we always parse data_args before downloading datasets. Is there a case where we would not want to depend on parsing data_args first?

Were parsing the input arg to a more compatible format to load_dataset.
Currently the func depends on data_args, where some are hardcoded to be included vs the new support which will be used in load_dataset_args

Copy link
Collaborator

@kylesayrs kylesayrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please rename load_dataset_args to load_dataset_kwargs
  2. Please do not create a new data class LoadDatasetArguments, instead put load_dataset_kwargs directly on DatasetArguments
  3. Please remove raw_kwargs, as it's now been replaced by load_dataset_kwargs

Copy link
Collaborator

@kylesayrs kylesayrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually want to get to this

from datasets import load_dataset

dataset = load_dataset(
    name=data_args.dataset,  # if str
    **data_args.load_dataset_kwargs
)

@horheynm
Copy link
Collaborator Author

horheynm commented Feb 6, 2025

Eventually want to get to this

from datasets import load_dataset

dataset = load_dataset(
    name=data_args.dataset,  # if str
    **data_args.load_dataset_kwargs
)

exactly

@horheynm
Copy link
Collaborator Author

horheynm commented Feb 6, 2025

  1. Please rename load_dataset_args to load_dataset_kwargs
  2. Please do not create a new data class LoadDatasetArguments, instead put load_dataset_kwargs directly on DatasetArguments
  3. Please remove raw_kwargs, as it's now been replaced by load_dataset_kwargs

For 2. The base is populated with tons of args, would not know which arg is used where. Better to keep it organized by having a class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready When a PR is ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants