feat: add push_to_huggingface method to the ArgillaTrainer (#3976)
# Description

This PR adds a new method `push_to_huggingface` to the `ArgillaTrainer`
to simplify uploading the trained models to the [huggingface model
hub](https://huggingface.co/models).

This option is implemented for the following models:
- [x] `transformers`
- [x] `peft`
- [x] `setfit`
- [x] `spacy`
- [x] `spacy-transformers`
- [x] `trl`
- [ ] `sentence_transformers`
  This framework doesn't currently work. The following message is included in the corresponding test:
  *This framework is not implemented yet. Cross-Encoder models don't implement the functionality
  for pushing a model to huggingface, and SentenceTransformer models have the functionality,
  but it is outdated and doesn't work with the current versions of `huggingface-hub`.
  The present test is left here for the future, when we either implement the functionality
  in `argilla` or in `sentence-transformers`.*
- [ ] `openai`
     *Doesn't apply*
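
The resulting call pattern can be sketched as follows. This is a standalone illustration: `DummyTrainer` is a hypothetical stand-in for the real `ArgillaTrainer`, which additionally resolves the auth token and delegates to the framework-specific trainer.

```python
# Standalone sketch of the call pattern added by this PR.
# `DummyTrainer` is a hypothetical stand-in for `ArgillaTrainer`.
class DummyTrainer:
    def push_to_huggingface(self, repo_id: str, generate_card: bool = True, **kwargs) -> str:
        # The real method resolves an auth token, delegates the upload to the
        # framework trainer, and optionally pushes an auto-generated model card.
        suffix = " (with model card)" if generate_card else ""
        return f"pushed '{repo_id}' to the Hugging Face Hub{suffix}"


trainer = DummyTrainer()
print(trainer.push_to_huggingface("my-org/my-model"))
```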

Closes #3633 

**Type of change**

- [x] New feature (non-breaking change which adds functionality)
- [x] Refactor (change restructuring the codebase without changing
functionality)
- [x] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**


- [x] `tests/integration/client/feedback/training/test_openai.py`
- [x] `tests/integration/client/feedback/training/test_sentence_transformers.py`
- [x] `tests/integration/client/feedback/training/test_trainer.py`
- [x] `tests/integration/client/feedback/training/test_trl.py`

**Checklist**

- [x] I added relevant documentation
- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: David Berenstein <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Oct 31, 2023
1 parent dab25a9 commit 3ba2f9f
Showing 18 changed files with 634 additions and 12 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -16,6 +16,10 @@ These are the section headers that we use:

## [Unreleased]()

### Added

- Added functionality to push your models to huggingface hub with `ArgillaTrainer.push_to_huggingface` ([#3976](https://github.com/argilla-io/argilla/pull/3976)).

### Fixed

- Fix svg images out of screen with too large images ([#4047](https://github.com/argilla-io/argilla/pull/4047))
77 changes: 75 additions & 2 deletions docs/_source/practical_guides/fine_tune.md
@@ -82,15 +82,22 @@ A `TrainingTask` is used to define how the data should be processed and formatted
| for_direct_preference_optimization | `prompt-chosen-rejected` | `Union[Tuple[str, str, str], Iterator[Tuple[str, str, str]]]` ||
| for_chat_completion | `chat-turn-role-content` | `Union[Tuple[str, str, str, str], Iterator[Tuple[str, str, str, str]]]`||

#### Model card generation
#### Hugging Face Hub integration

This section presents some integrations with the 🤗 Hugging Face [Model Hub](https://huggingface.co/docs/hub/models-the-hub), the easiest way to share your Argilla models, as well as the possibility to generate an automated model card.

:::{note}
Take a look at the following [sample model](https://huggingface.co/plaguss/test_model) in the 🤗 Hugging Face Hub with the autogenerated model card, and check [https://huggingface.co/models?other=argilla](https://huggingface.co/models?other=argilla) for more shared Argilla models.
:::

##### Model card generation

The `ArgillaTrainer` automatically generates a [model card](https://huggingface.co/docs/hub/model-cards) when saving the model. After calling `trainer.train(output_dir="my_model")`, you should see the model card under the same output dir you passed through the train method: `./my_model/README.md`. Most of the fields in the card are automatically generated when possible, but the following fields can be (optionally) updated via the `framework_kwargs` variable of the `ArgillaTrainer` like so:

```python
model_card_kwargs = {
"language": ["en", "es"],
"license": "Apache-2.0",
"model_id": "all-MiniLM-L6-v2",
"dataset_name": "argilla/emotion",
"tags": ["nlp", "few-shot-learning", "argilla", "setfit"],
"model_summary": "Small summary of what the model does",
@@ -117,6 +124,72 @@ Even though it's generated internally, you can get the card by calling the `generate_model_card` method:
argilla_model_card = trainer.generate_model_card("my_model")
```

##### Upload your models to the Hugging Face Hub

If you don't have the `huggingface_hub` library installed yet, you can install it with the following command:

```console
pip install huggingface_hub
```

:::{note}

If your chosen framework is `spacy` or `spacy-transformers`, you should also install the following dependency:

```console
pip install spacy-huggingface-hub
```
:::

Then log in to the hub from the environment you are working in, depending on whether you are using a script or a Jupyter notebook:

::::{tab-set}

:::{tab-item} Console

Run the following command from a console window and insert your 🤗huggingface hub token:

```console
huggingface-cli login
```

:::

:::{tab-item} Notebook

Run the following command from a notebook cell and insert your 🤗huggingface hub token:


```python
from huggingface_hub import notebook_login

notebook_login()
```

:::

::::

Internally, the token will be used when calling the `push_to_huggingface` method.

Be sure to take a look at the huggingface hub
[requirements](https://huggingface.co/docs/hub/repositories-getting-started#requirements) in case you need more help publishing your models.

After your model is trained, you just need to call `push_to_huggingface` and wait for your model to be pushed to the hub (by default a model card is generated and pushed along with it; pass `generate_card=False` if you don't want one):

```python
# spaCy based models:
repo_id = output_dir

# Every other framework:
repo_id = "organization/model-name" # for example: argilla/newest-model

trainer.push_to_huggingface(repo_id, generate_card=True)
```

Due to spaCy's behavior when pushing models, the `repo_id` is generated internally: you only need to pass the path where the model was saved (the same `output_dir` you passed to the `train` method), and it will work out just the same.
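
The repo id convention above can be captured in a small helper. This is a sketch, not part of the library; `resolve_repo_id` and its parameters are hypothetical names:

```python
def resolve_repo_id(framework: str, output_dir: str, repo_name: str) -> str:
    """Pick the repo_id to pass to push_to_huggingface, per the convention above."""
    # spaCy based models derive the hub repo id from the local save path...
    if framework in ("spacy", "spacy-transformers"):
        return output_dir
    # ...every other framework expects an explicit "organization/model-name" id.
    return repo_name


print(resolve_repo_id("spacy", "my_model", "argilla/newest-model"))   # my_model
print(resolve_repo_id("setfit", "my_model", "argilla/newest-model"))  # argilla/newest-model
```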

### Tasks

#### Text Classification
1 change: 1 addition & 0 deletions environment_dev.yml
@@ -47,6 +47,7 @@ dependencies:
- spacy==3.5.3
- https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0.tar.gz
- spacy-transformers>=1.2.5
- spacy-huggingface-hub >= 0.0.10
- transformers[torch]>=4.30.0 # <- required for DPO with TRL
- evaluate
- seqeval
1 change: 1 addition & 0 deletions pyproject.toml
@@ -100,6 +100,7 @@ integrations = [
"snorkel >= 0.9.7",
"spacy == 3.5.3",
"spacy-transformers >= 1.2.5",
"spacy-huggingface-hub >= 0.0.10",
"transformers[torch] >= 4.30.0",
"evaluate",
"seqeval",
53 changes: 48 additions & 5 deletions src/argilla/client/feedback/training/base.py
@@ -19,6 +19,8 @@
from pathlib import Path
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union

from huggingface_hub import HfApi

from argilla.client.feedback.schemas.records import FeedbackRecord
from argilla.client.feedback.training.schemas import TrainingTaskForTextClassification, TrainingTaskTypes
from argilla.client.models import Framework, TextClassificationRecord
@@ -269,11 +271,11 @@ def save(self, output_dir: str, generate_card: bool = True) -> None:
        if generate_card:
            self.generate_model_card(output_dir)

    def generate_model_card(self, output_dir: str) -> "ArgillaModelCard":
    def generate_model_card(self, output_dir: Optional[str] = None) -> "ArgillaModelCard":
        """Generate and return a model card based on the model card data.

        Args:
            output_dir: If given, folder where the model card will be written.

        Returns:
            model_card: The model card.
@@ -288,11 +290,46 @@ def generate_model_card(self, output_dir: Optional[str] = None) -> "ArgillaModelCard":
            template_path=ArgillaModelCard.default_template_path,
        )

        model_card_path = Path(output_dir) / "README.md"
        model_card.save(model_card_path)
        self._logger.info(f"Model card generated at: {model_card_path}")
        if output_dir:
            model_card_path = Path(output_dir) / "README.md"
            model_card.save(model_card_path)
            self._logger.info(f"Model card generated at: {model_card_path}")

        return model_card

    def push_to_huggingface(self, repo_id: str, generate_card: Optional[bool] = True, **kwargs) -> None:
        """Push your model to [huggingface's model hub](https://huggingface.co/models).

        Args:
            repo_id:
                The name of the repository you want to push your model and tokenizer to.
                It should contain your organization name when pushing to a given organization.
            generate_card:
                Whether to generate (and push) a model card for your model. Defaults to True.
        """
        if not kwargs.get("token"):
            # Try obtaining the token with huggingface_hub utils as a last resort, or let it fail.
            from huggingface_hub import HfFolder

            if token := HfFolder.get_token():
                kwargs["token"] = token

        # One last check for the tests. We use a different env var name
        # than the one gathered with HfFolder.get_token.
        if token := kwargs.get("token", os.environ.get("HF_HUB_ACCESS_TOKEN", None)):
            kwargs["token"] = token

        url = self._trainer.push_to_huggingface(repo_id, **kwargs)

        if generate_card:
            model_card = self.generate_model_card()
            # For spaCy based models, overwrite the repo_id with the url variable returned
            # from its trainer.
            if getattr(self._trainer, "language", None):
                repo_id = url

            model_card.push_to_hub(repo_id, repo_type="model", token=kwargs.get("token"))
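
The token fallback order implemented above can be isolated into a dependency-free sketch. `resolve_token` and `stored_token` are hypothetical names; `stored_token` stands in for the value `HfFolder.get_token()` would return after `huggingface-cli login`:

```python
import os


def resolve_token(kwargs: dict, stored_token=None) -> dict:
    # 1. An explicit `token` kwarg always wins.
    if not kwargs.get("token") and stored_token:
        # 2. Otherwise fall back to the token stored by `huggingface-cli login`.
        kwargs["token"] = stored_token
    # 3. As a last resort (used by the tests), read the HF_HUB_ACCESS_TOKEN env var.
    if token := kwargs.get("token", os.environ.get("HF_HUB_ACCESS_TOKEN")):
        kwargs["token"] = token
    return kwargs


print(resolve_token({}, stored_token="hf_stored"))  # {'token': 'hf_stored'}
print(resolve_token({"token": "hf_explicit"}))      # {'token': 'hf_explicit'}
```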


class ArgillaTrainerSkeleton(ABC):
def __init__(
@@ -360,3 +397,9 @@ def get_model_card_data(self, card_data_kwargs: Dict[str, Any]) -> "FrameworkCardData":
        """
        Generates a `FrameworkCardData` instance to generate a model card from.
        """

    @abstractmethod
    def push_to_huggingface(self, repo_id: str, **kwargs) -> Optional[str]:
        """
        Uploads the model to [Huggingface Hub](https://huggingface.co/docs/hub/models-the-hub).
        """
3 changes: 3 additions & 0 deletions src/argilla/client/feedback/training/frameworks/openai.py
@@ -67,3 +67,6 @@ def get_model_card_data(self, **card_data_kwargs) -> "OpenAIModelCardData":
            task=self._task,
            **card_data_kwargs,
        )

    def push_to_huggingface(self, repo_id: str, **kwargs) -> None:
        raise NotImplementedError("This method is not implemented for `ArgillaOpenAITrainer`.")
22 changes: 22 additions & 0 deletions src/argilla/client/feedback/training/frameworks/peft.py
@@ -16,6 +16,7 @@

from argilla.client.feedback.training.frameworks.transformers import ArgillaTransformersTrainer
from argilla.training.peft import ArgillaPeftTrainer as ArgillaPeftTrainerV1
from argilla.utils.dependency import requires_dependencies

if TYPE_CHECKING:
from argilla.client.feedback.integrations.huggingface.model_card import PeftModelCardData
@@ -43,3 +44,24 @@ def get_model_card_data(self, **card_data_kwargs) -> "PeftModelCardData":
            update_config_kwargs=self.lora_kwargs,
            **card_data_kwargs,
        )

    @requires_dependencies("huggingface_hub")
    def push_to_huggingface(self, repo_id: str, **kwargs) -> None:
        """Uploads the model to [huggingface's model hub](https://huggingface.co/models).

        The full list of parameters can be seen at:
        [huggingface_hub](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.ModelHubMixin.push_to_hub).

        Args:
            repo_id:
                The name of the repository you want to push your model and tokenizer to.
                It should contain your organization name when pushing to a given organization.
        """
        if not self._transformers_model:
            raise ValueError(
                "The model must be initialized prior to this point. You can either call `train` or `init_model`."
            )
        model_url = self._transformers_model.push_to_hub(repo_id, **kwargs)
        self._logger.info(f"Model pushed to: {model_url}")
        tokenizer_url = self._transformers_tokenizer.push_to_hub(repo_id, **kwargs)
        self._logger.info(f"Tokenizer pushed to: {tokenizer_url}")
@@ -367,3 +367,20 @@ def get_model_card_data(self, **card_data_kwargs) -> "SentenceTransformerCardData":
            trainer_cls=self._trainer_cls,
            **card_data_kwargs,
        )

    def push_to_huggingface(self, repo_id: str, **kwargs) -> None:
        """Uploads the model to [huggingface's model hub](https://huggingface.co/models).

        The full list of parameters can be seen at:
        [sentence-transformers api docs](https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.save_to_hub).

        Args:
            repo_id:
                The name of the repository you want to push your model and tokenizer to.
                It should contain your organization name when pushing to a given organization.

        Raises:
            NotImplementedError: Always, as pushing is not implemented for `CrossEncoder`
                models, and the `SentenceTransformer` implementation is outdated.
        """
        raise NotImplementedError("This method is not implemented for `ArgillaSentenceTransformersTrainer`.")
22 changes: 21 additions & 1 deletion src/argilla/client/feedback/training/frameworks/setfit.py
@@ -18,7 +18,7 @@
from argilla.client.feedback.training.frameworks.transformers import ArgillaTransformersTrainer
from argilla.client.models import TextClassificationRecord
from argilla.training.setfit import ArgillaSetFitTrainer as ArgillaSetFitTrainerV1
from argilla.utils.dependency import require_dependencies
from argilla.utils.dependency import require_dependencies, requires_dependencies

if TYPE_CHECKING:
from argilla.client.feedback.integrations.huggingface.model_card import SetFitModelCardData
@@ -66,3 +66,23 @@ def get_model_card_data(self, **card_data_kwargs) -> "SetFitModelCardData":
            update_config_kwargs={**self.setfit_model_kwargs, **self.setfit_trainer_kwargs},
            **card_data_kwargs,
        )

    @requires_dependencies("huggingface_hub")
    def push_to_huggingface(self, repo_id: str, **kwargs) -> None:
        """Uploads the model to [huggingface's model hub](https://huggingface.co/models).

        The full list of parameters can be seen at:
        [huggingface_hub](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.ModelHubMixin.push_to_hub).

        Args:
            repo_id:
                The name of the repository you want to push your model and tokenizer to.
                It should contain your organization name when pushing to a given organization.

        Raises:
            ValueError: If the trainer hasn't been initialized yet, i.e. `train` hasn't been called.
        """
        if not self.__trainer:
            raise ValueError("The `trainer` must be initialized prior to this point. You should call `train`.")
        url = self.__trainer.push_to_hub(repo_id, **kwargs)
        self._logger.info(f"Model pushed to: {url}")
