feat: add push_to_huggingface method to the ArgillaTrainer (#3976)
# Description

This PR adds a new method `push_to_huggingface` to the `ArgillaTrainer`
to simplify uploading the trained models to the [huggingface model
hub](https://huggingface.co/models).

This option is implemented for the following models:
- [x] `transformers`
- [x] `peft`
- [x] `setfit`
- [x] `spacy`
- [x] `spacy-transformers`
- [x] `trl`
- [ ] `sentence_transformers`
  This framework doesn't currently work. The following message is included in the corresponding test:
  *This framework is not implemented yet. Cross-Encoder models don't implement the functionality
  for pushing a model to huggingface, and SentenceTransformer models have the functionality,
  but it is outdated and doesn't work with the current versions of `huggingface-hub`.
  The present test is left here for the future, when we either implement the functionality
  in `argilla` or in `sentence-transformers`.*
- [ ] `openai`
     *Doesn't apply*
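
The resulting call pattern can be sketched as follows. This is a standalone illustration: `DummyTrainer` is a hypothetical stand-in for the real `ArgillaTrainer`, which additionally resolves the auth token and delegates to the framework-specific trainer.

```python
# Standalone sketch of the call pattern added by this PR.
# `DummyTrainer` is a hypothetical stand-in for `ArgillaTrainer`.
class DummyTrainer:
    def push_to_huggingface(self, repo_id: str, generate_card: bool = True, **kwargs) -> str:
        # The real method resolves an auth token, delegates the upload to the
        # framework trainer, and optionally pushes an auto-generated model card.
        suffix = " (with model card)" if generate_card else ""
        return f"pushed '{repo_id}' to the Hugging Face Hub{suffix}"


trainer = DummyTrainer()
print(trainer.push_to_huggingface("my-org/my-model"))
```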

Closes #3633 

**Type of change**

- [x] New feature (non-breaking change which adds functionality)
- [x] Refactor (change restructuring the codebase without changing
functionality)
- [x] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**


- [x] `tests/integration/client/feedback/training/test_openai.py`
- [x] `tests/integration/client/feedback/training/test_sentence_transformers.py`
- [x] `tests/integration/client/feedback/training/test_trainer.py`
- [x] `tests/integration/client/feedback/training/test_trl.py`

**Checklist**

- [x] I added relevant documentation
- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: David Berenstein <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Oct 31, 2023
1 parent dab25a9 commit 3ba2f9f
Showing 18 changed files with 634 additions and 12 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -16,6 +16,10 @@ These are the section headers that we use:

## [Unreleased]()

### Added

- Added functionality to push your models to huggingface hub with `ArgillaTrainer.push_to_huggingface` ([#3976](https://github.com/argilla-io/argilla/pull/3976)).

### Fixed

- Fix svg images out of screen with too large images ([#4047](https://github.com/argilla-io/argilla/pull/4047))
77 changes: 75 additions & 2 deletions docs/_source/practical_guides/fine_tune.md
@@ -82,15 +82,22 @@ A `TrainingTask` is used to define how the data should be processed and formatted
| for_direct_preference_optimization | `prompt-chosen-rejected` | `Union[Tuple[str, str, str], Iterator[Tuple[str, str, str]]]` ||
| for_chat_completion | `chat-turn-role-content` | `Union[Tuple[str, str, str, str], Iterator[Tuple[str, str, str, str]]]`||

#### Model card generation
#### Hugging Face Hub integration

This section presents some integrations with the 🤗 Hugging Face [Model Hub](https://huggingface.co/docs/hub/models-the-hub), the easiest way to share your Argilla models, as well as the possibility to generate an automated model card.

:::{note}
Take a look at the following [sample model](https://huggingface.co/plaguss/test_model) in the 🤗 Hugging Face Hub with the autogenerated model card, and check [https://huggingface.co/models?other=argilla](https://huggingface.co/models?other=argilla) for more shared Argilla models.
:::

##### Model card generation

The `ArgillaTrainer` automatically generates a [model card](https://huggingface.co/docs/hub/model-cards) when saving the model. After calling `trainer.train(output_dir="my_model")`, you should see the model card under the same output dir you passed through the train method: `./my_model/README.md`. Most of the fields in the card are automatically generated when possible, but the following fields can be (optionally) updated via the `framework_kwargs` variable of the `ArgillaTrainer` like so:

```python
model_card_kwargs = {
"language": ["en", "es"],
"license": "Apache-2.0",
"model_id": "all-MiniLM-L6-v2",
"dataset_name": "argilla/emotion",
"tags": ["nlp", "few-shot-learning", "argilla", "setfit"],
"model_summary": "Small summary of what the model does",
@@ -117,6 +124,72 @@ Even though it's generated internally, you can get the card by calling the `generate_model_card` method:
argilla_model_card = trainer.generate_model_card("my_model")
```

##### Upload your models to the Hugging Face Hub

If you don't have the `huggingface_hub` library installed yet, you can install it with the following command:

```console
pip install huggingface_hub
```

:::{note}

If your chosen framework is `spacy` or `spacy-transformers`, you should also install the following dependency:

```console
pip install spacy-huggingface-hub
```
:::

Then log in to the hub from the environment you are working in, depending on whether you are using a script or a Jupyter notebook:

::::{tab-set}

:::{tab-item} Console

Run the following command from a console window and insert your 🤗huggingface hub token:

```console
huggingface-cli login
```

:::

:::{tab-item} Notebook

Run the following command from a notebook cell and insert your 🤗huggingface hub token:


```python
from huggingface_hub import notebook_login

notebook_login()
```

:::

::::

Internally, the token will be used when calling the `push_to_huggingface` method.

Be sure to take a look at the huggingface hub
[requirements](https://huggingface.co/docs/hub/repositories-getting-started#requirements) in case you need more help publishing your models.

After your model is trained, you just need to call `push_to_huggingface` and wait for your model to be pushed to the hub (by default a model card is generated and pushed along with it; pass `generate_card=False` if you don't want one):

```python
# spaCy based models:
repo_id = output_dir

# Every other framework:
repo_id = "organization/model-name" # for example: argilla/newest-model

trainer.push_to_huggingface(repo_id, generate_card=True)
```

Due to spaCy's behavior when pushing models, the `repo_id` is generated internally: you only need to pass the path where the model was saved (the same `output_dir` you passed to the `train` method), and it will work out just the same.
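
The repo id convention above can be captured in a small helper. This is a sketch, not part of the library; `resolve_repo_id` and its parameters are hypothetical names:

```python
def resolve_repo_id(framework: str, output_dir: str, repo_name: str) -> str:
    """Pick the repo_id to pass to push_to_huggingface, per the convention above."""
    # spaCy based models derive the hub repo id from the local save path...
    if framework in ("spacy", "spacy-transformers"):
        return output_dir
    # ...every other framework expects an explicit "organization/model-name" id.
    return repo_name


print(resolve_repo_id("spacy", "my_model", "argilla/newest-model"))   # my_model
print(resolve_repo_id("setfit", "my_model", "argilla/newest-model"))  # argilla/newest-model
```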

### Tasks

#### Text Classification
1 change: 1 addition & 0 deletions environment_dev.yml
@@ -47,6 +47,7 @@ dependencies:
- spacy==3.5.3
- https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0.tar.gz
- spacy-transformers>=1.2.5
- spacy-huggingface-hub >= 0.0.10
- transformers[torch]>=4.30.0 # <- required for DPO with TRL
- evaluate
- seqeval
1 change: 1 addition & 0 deletions pyproject.toml
@@ -100,6 +100,7 @@ integrations = [
"snorkel >= 0.9.7",
"spacy == 3.5.3",
"spacy-transformers >= 1.2.5",
"spacy-huggingface-hub >= 0.0.10",
"transformers[torch] >= 4.30.0",
"evaluate",
"seqeval",
53 changes: 48 additions & 5 deletions src/argilla/client/feedback/training/base.py
@@ -19,6 +19,8 @@
from pathlib import Path
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union

from huggingface_hub import HfApi

from argilla.client.feedback.schemas.records import FeedbackRecord
from argilla.client.feedback.training.schemas import TrainingTaskForTextClassification, TrainingTaskTypes
from argilla.client.models import Framework, TextClassificationRecord
@@ -269,11 +271,11 @@ def save(self, output_dir: str, generate_card: bool = True) -> None:
        if generate_card:
            self.generate_model_card(output_dir)

    def generate_model_card(self, output_dir: str) -> "ArgillaModelCard":
    def generate_model_card(self, output_dir: Optional[str] = None) -> "ArgillaModelCard":
        """Generate and return a model card based on the model card data.

        Args:
            output_dir: If given, folder where the model card will be written.

        Returns:
            model_card: The model card.
@@ -288,11 +290,46 @@ def generate_model_card(self, output_dir: Optional[str] = None) -> "ArgillaModelCard":
            template_path=ArgillaModelCard.default_template_path,
        )

        model_card_path = Path(output_dir) / "README.md"
        model_card.save(model_card_path)
        self._logger.info(f"Model card generated at: {model_card_path}")
        if output_dir:
            model_card_path = Path(output_dir) / "README.md"
            model_card.save(model_card_path)
            self._logger.info(f"Model card generated at: {model_card_path}")

        return model_card

    def push_to_huggingface(self, repo_id: str, generate_card: Optional[bool] = True, **kwargs) -> None:
        """Push your model to [huggingface's model hub](https://huggingface.co/models).

        Args:
            repo_id:
                The name of the repository you want to push your model and tokenizer to.
                It should contain your organization name when pushing to a given organization.
            generate_card:
                Whether to generate (and push) a model card for your model. Defaults to True.
        """
        if not kwargs.get("token"):
            # Try obtaining the token with huggingface_hub utils as a last resort, or let it fail.
            from huggingface_hub import HfFolder

            if token := HfFolder.get_token():
                kwargs["token"] = token

        # One last check for the tests. We use a different env var name
        # than the one gathered with HfFolder.get_token.
        if token := kwargs.get("token", os.environ.get("HF_HUB_ACCESS_TOKEN", None)):
            kwargs["token"] = token

        url = self._trainer.push_to_huggingface(repo_id, **kwargs)

        if generate_card:
            model_card = self.generate_model_card()
            # For spaCy based models, overwrite the repo_id with the url variable returned
            # from its trainer.
            if getattr(self._trainer, "language", None):
                repo_id = url

            model_card.push_to_hub(repo_id, repo_type="model", token=kwargs.get("token"))
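
The token fallback order implemented above can be isolated into a dependency-free sketch. `resolve_token` and `stored_token` are hypothetical names; `stored_token` stands in for the value `HfFolder.get_token()` would return after `huggingface-cli login`:

```python
import os


def resolve_token(kwargs: dict, stored_token=None) -> dict:
    # 1. An explicit `token` kwarg always wins.
    if not kwargs.get("token") and stored_token:
        # 2. Otherwise fall back to the token stored by `huggingface-cli login`.
        kwargs["token"] = stored_token
    # 3. As a last resort (used by the tests), read the HF_HUB_ACCESS_TOKEN env var.
    if token := kwargs.get("token", os.environ.get("HF_HUB_ACCESS_TOKEN")):
        kwargs["token"] = token
    return kwargs


print(resolve_token({}, stored_token="hf_stored"))  # {'token': 'hf_stored'}
print(resolve_token({"token": "hf_explicit"}))      # {'token': 'hf_explicit'}
```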


class ArgillaTrainerSkeleton(ABC):
def __init__(
@@ -360,3 +397,9 @@ def get_model_card_data(self, card_data_kwargs: Dict[str, Any]) -> "FrameworkCardData":
        """
        Generates a `FrameworkCardData` instance to generate a model card from.
        """

    @abstractmethod
    def push_to_huggingface(self, repo_id: str, **kwargs) -> Optional[str]:
        """
        Uploads the model to [Huggingface Hub](https://huggingface.co/docs/hub/models-the-hub).
        """
3 changes: 3 additions & 0 deletions src/argilla/client/feedback/training/frameworks/openai.py
@@ -67,3 +67,6 @@ def get_model_card_data(self, **card_data_kwargs) -> "OpenAIModelCardData":
            task=self._task,
            **card_data_kwargs,
        )

    def push_to_huggingface(self, repo_id: str, **kwargs) -> None:
        raise NotImplementedError("This method is not implemented for `ArgillaOpenAITrainer`.")
22 changes: 22 additions & 0 deletions src/argilla/client/feedback/training/frameworks/peft.py
@@ -16,6 +16,7 @@

from argilla.client.feedback.training.frameworks.transformers import ArgillaTransformersTrainer
from argilla.training.peft import ArgillaPeftTrainer as ArgillaPeftTrainerV1
from argilla.utils.dependency import requires_dependencies

if TYPE_CHECKING:
from argilla.client.feedback.integrations.huggingface.model_card import PeftModelCardData
@@ -43,3 +44,24 @@ def get_model_card_data(self, **card_data_kwargs) -> "PeftModelCardData":
            update_config_kwargs=self.lora_kwargs,
            **card_data_kwargs,
        )

    @requires_dependencies("huggingface_hub")
    def push_to_huggingface(self, repo_id: str, **kwargs) -> None:
        """Uploads the model to [huggingface's model hub](https://huggingface.co/models).

        The full list of parameters can be seen at:
        [huggingface_hub](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.ModelHubMixin.push_to_hub).

        Args:
            repo_id:
                The name of the repository you want to push your model and tokenizer to.
                It should contain your organization name when pushing to a given organization.
        """
        if not self._transformers_model:
            raise ValueError(
                "The model must be initialized prior to this point. You can either call `train` or `init_model`."
            )
        model_url = self._transformers_model.push_to_hub(repo_id, **kwargs)
        self._logger.info(f"Model pushed to: {model_url}")
        tokenizer_url = self._transformers_tokenizer.push_to_hub(repo_id, **kwargs)
        self._logger.info(f"Tokenizer pushed to: {tokenizer_url}")
@@ -367,3 +367,20 @@ def get_model_card_data(self, **card_data_kwargs) -> "SentenceTransformerCardData":
            trainer_cls=self._trainer_cls,
            **card_data_kwargs,
        )

    def push_to_huggingface(self, repo_id: str, **kwargs) -> None:
        """Uploads the model to [huggingface's model hub](https://huggingface.co/models).

        The full list of parameters can be seen at:
        [sentence-transformers api docs](https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.save_to_hub).

        Args:
            repo_id:
                The name of the repository you want to push your model and tokenizer to.
                It should contain your organization name when pushing to a given organization.

        Raises:
            NotImplementedError: Always, as pushing is not implemented for `CrossEncoder`
                models, and the `SentenceTransformer` implementation is outdated.
        """
        raise NotImplementedError("This method is not implemented for `ArgillaSentenceTransformersTrainer`.")
22 changes: 21 additions & 1 deletion src/argilla/client/feedback/training/frameworks/setfit.py
@@ -18,7 +18,7 @@
from argilla.client.feedback.training.frameworks.transformers import ArgillaTransformersTrainer
from argilla.client.models import TextClassificationRecord
from argilla.training.setfit import ArgillaSetFitTrainer as ArgillaSetFitTrainerV1
from argilla.utils.dependency import require_dependencies
from argilla.utils.dependency import require_dependencies, requires_dependencies

if TYPE_CHECKING:
from argilla.client.feedback.integrations.huggingface.model_card import SetFitModelCardData
@@ -66,3 +66,23 @@ def get_model_card_data(self, **card_data_kwargs) -> "SetFitModelCardData":
            update_config_kwargs={**self.setfit_model_kwargs, **self.setfit_trainer_kwargs},
            **card_data_kwargs,
        )

    @requires_dependencies("huggingface_hub")
    def push_to_huggingface(self, repo_id: str, **kwargs) -> None:
        """Uploads the model to [huggingface's model hub](https://huggingface.co/models).

        The full list of parameters can be seen at:
        [huggingface_hub](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.ModelHubMixin.push_to_hub).

        Args:
            repo_id:
                The name of the repository you want to push your model and tokenizer to.
                It should contain your organization name when pushing to a given organization.

        Raises:
            ValueError: If the trainer hasn't been initialized yet, i.e. `train` hasn't been called.
        """
        if not self.__trainer:
            raise ValueError("The `trainer` must be initialized prior to this point. You should call `train`.")
        url = self.__trainer.push_to_hub(repo_id, **kwargs)
        self._logger.info(f"Model pushed to: {url}")
