
Commit d1f52f6

Add READMEs to examples/ and nemo_curator/scripts directories (NVIDIA#332)
* save progress
* add remaining docs
* add titles and table
* remove trailing whitespace
* add --help instructions

---------

Signed-off-by: Sarah Yurick <[email protected]>
1 parent bc724ec commit d1f52f6

18 files changed: +208 -22 lines changed

CONTRIBUTING.md (+1 -1)

@@ -37,7 +37,7 @@ There should be at least one example per module in the curator.
 They should be incredibly lightweight and rely on the core `nemo_curator` modules for their functionality.
 Most should be designed for a user to get up and running on their local machines, but distributed examples are welcomed if it makes sense.
 Python scripts should be the primary way to showcase your module.
-Though, SLURM scripts or other cluster scripts should be included if there are special steps needed to run the module.
+Though, Slurm scripts or other cluster scripts should be included if there are special steps needed to run the module.

 The documentation should complement each example by going through the motivation behind why a user would use each module.
 It should include both an explanation of the module, and how it's used in its corresponding example.

docs/user-guide/cpuvsgpu.rst (+6 -6)

@@ -35,7 +35,7 @@ All of the ``examples/`` use it to set up a Dask cluster.
 It is possible to run entirely CPU-based workflows on a GPU cluster, though the process count (and therefore the number of parallel tasks) will be limited by the number of GPUs on your machine.

 * ``scheduler_address`` and ``scheduler_file`` are used for connecting to an existing Dask cluster.
-  Supplying one of these is essential if you are running a Dask cluster on SLURM or Kubernetes.
+  Supplying one of these is essential if you are running a Dask cluster on Slurm or Kubernetes.
   All other arguments are ignored if either of these are passed, as the cluster configuration will be done when you create the schduler and works on your cluster.

 * The remaining arguments can be modified `here <https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/utils/distributed_utils.py>`_.
@@ -82,15 +82,15 @@ Even if you start a GPU dask cluster, you can't operate on datasets that use a `
 The ``DocuemntDataset`` must either have been originally read in with a ``cudf`` backend, or it must be transferred during the script.

 -----------------------------------------
-Dask with SLURM
+Dask with Slurm
 -----------------------------------------

-We provide an example SLURM script pipeline in ``examples/slurm``.
+We provide an example Slurm script pipeline in ``examples/slurm``.
 This pipeline has a script ``start-slurm.sh`` that provides configuration options similar to what ``get_client`` provides.
-Every SLURM cluster is different, so make sure you understand how your SLURM cluster works so the scripts can be easily adapted.
-``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.
+Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted.
+``start-slurm.sh`` calls ``containter-entrypoint.sh``, which sets up a Dask scheduler and workers across the cluster.

-Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
+Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` script to run on multiple nodes.
 You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.

 -----------------------------------------
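For reference, the `get_client`/`add_distributed_args` pattern mentioned above looks roughly like the sketch below. The helper names come from the documentation in this diff, but their exact signatures have shifted across NeMo Curator releases, so treat this as an illustration of the pattern rather than the current API.

```python
import argparse

from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.utils.script_utils import add_distributed_args


def main(args):
    # get_client consumes the distributed arguments attached below. Passing
    # --scheduler-file (or --scheduler-address) connects to an existing Dask
    # cluster, e.g. one started by start-slurm.sh; otherwise a local CPU or
    # GPU cluster is created on the machine running the script.
    client = get_client(args, args.device)

    # ... run your NeMo Curator pipeline here ...

    client.close()


def attach_args(parser=argparse.ArgumentParser(description="My curation script")):
    # Adds --device, --scheduler-file, --scheduler-address, worker counts, etc.
    return add_distributed_args(parser)


if __name__ == "__main__":
    main(attach_args().parse_args())
```

When `--scheduler-file` points at the `scheduler.json` written by the Slurm scripts in `examples/slurm`, the same script scales out to multiple nodes without code changes.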

docs/user-guide/kubernetescurator.rst (+3 -3)

@@ -139,7 +139,7 @@ use ``kubectl cp``, but ``exec`` has fewer surprises regarding compressed files:
 Create a Dask Cluster
 ---------------------

-Use the ``create_dask_cluster.py`` to create a CPU or GPU dask cluster.
+Use the ``create_dask_cluster.py`` to create a CPU or GPU Dask cluster.

 .. note::
    If you are creating another Dask cluster with the same ``--name <name>``, first delete it via::
@@ -289,7 +289,7 @@ container, we will need to build a custom image with your code installed:
    # Fill in <private-registry>/<username>/<password>
    kubectl create secret docker-registry my-private-registry --docker-server=<private-registry> --docker-username=<username> --docker-password=<password>

-And with this new secret, you create your new dask cluster:
+And with this new secret, you create your new Dask cluster:

 .. code-block:: bash

@@ -360,7 +360,7 @@ At this point you can tail the logs and look for ``Finished!`` in ``/nemo-worksp

 Deleting Cluster
 ----------------
-After you have finished using the created dask cluster, you can delete it to release the resources:
+After you have finished using the created Dask cluster, you can delete it to release the resources:

 .. code-block:: bash


docs/user-guide/personalidentifiableinformationidentificationandremoval.rst (+2 -2)

@@ -26,7 +26,7 @@ The tool utilizes `Dask <https://dask.org>`_ to parallelize tasks and hence it c
 used to scale up to terabytes of data easily. Although Dask can be deployed on various
 distributed compute environments such as HPC clusters, Kubernetes and other cloud
 offerings such as AWS EKS, Google cloud etc, the current implementation only supports
-Dask on HPC clusters that use SLURM as the resource manager.
+Dask on HPC clusters that use Slurm as the resource manager.

 -----------------------------------------
 Usage
@@ -92,7 +92,7 @@ The PII redaction module can also be invoked via ``script/find_pii_and_deidentif

 ``python nemo_curator/scripts/find_pii_and_deidentify.py``

-To launch the script from within a SLURM environment, the script ``examples/slurm/start-slurm.sh`` can be modified and used.
+To launch the script from within a Slurm environment, the script ``examples/slurm/start-slurm.sh`` can be modified and used.


 ############################
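For completeness, a rough sketch of the Python-API equivalent of the script above, using the `PiiModifier` and `Modify` classes referenced elsewhere in this commit. The constructor arguments shown here are assumptions for illustration; check `find_pii_and_deidentify.py --help` or the example source for the authoritative options.

```python
from nemo_curator import Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers.pii_modifier import PiiModifier

# Assumed arguments: the entity list and anonymization action are illustrative.
modifier = PiiModifier(
    language="en",
    supported_entities=["PERSON", "EMAIL_ADDRESS"],
    anonymize_action="replace",
    device="gpu",
)

dataset = DocumentDataset.read_json("books.jsonl")
redacted = Modify(modifier)(dataset)
redacted.to_json("redacted/")
```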

examples/README.md (+25)

@@ -0,0 +1,25 @@
+# NeMo Curator Python API examples
+
+This directory contains multiple Python scripts with examples of how to use various NeMo Curator classes and functions.
+The goal of these examples is to give the user an overview of many of the ways your text data can be curated.
+These include:
+
+| Python Script | Description |
+|---------------------------------------|---------------------------------------------------------------------------------------------------------------|
+| blend_and_shuffle.py | Combine multiple datasets into one with different amounts of each dataset, then randomly permute the dataset. |
+| classifier_filtering.py | Train a fastText classifier, then use it to filter high and low quality data. |
+| download_arxiv.py | Download Arxiv tar files and extract them. |
+| download_common_crawl.py | Download Common Crawl WARC snapshots and extract them. |
+| download_wikipedia.py | Download the latest Wikipedia dumps and extract them. |
+| exact_deduplication.py | Use the `ExactDuplicates` class to perform exact deduplication on text data. |
+| find_pii_and_deidentify.py | Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. |
+| fuzzy_deduplication.py | Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. |
+| identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix the unicode in it. |
+| raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. |
+| semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. |
+| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. |
+| translation_example.py | Create and use an `IndicTranslation` model for language translation. |
+
+Before running any of these scripts, we strongly recommend displaying `python <script name>.py --help` to ensure that any needed or relevant arguments are specified.
+
+The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain even more examples of NeMo Curator's capabilities.
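As a taste of what these examples contain, here is a condensed sketch of the exact-deduplication flow from `exact_deduplication.py`. The argument names are assumptions for illustration; the script's source and `--help` output are authoritative.

```python
from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset

# Read JSONL documents; a cudf backend keeps the data on GPU for the
# GPU-accelerated deduplication path.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Hash each document's text field and group identical hashes to find duplicates.
exact_dup = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
duplicates = exact_dup(dataset)

duplicates.to_parquet("exact_duplicates/")
```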

examples/classifiers/README.md (+21)

@@ -0,0 +1,21 @@
+## Text Classification
+
+The Python scripts in this directory demonstrate how to run classification on your text data with each of these 4 classifiers:
+
+- Domain Classifier
+- Quality Classifier
+- AEGIS Safety Models
+- FineWeb Educational Content Classifier
+
+For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).
+
+Each of these scripts provide simple examples of what your own Python scripts might look like.
+
+At a high level, you will:
+
+1. Create a Dask client by using the `get_client` function
+2. Use `DocumentDataset.read_json` (or `DocumentDataset.read_parquet`) to read your data
+3. Initialize and call the classifier on your data
+4. Write your results to the desired output type with `to_json` or `to_parquet`
+
+Before running any of these scripts, we strongly recommend displaying `python <script name>.py --help` to ensure that any needed or relevant arguments are specified.
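Put together, those four steps might look roughly like the sketch below, shown with the Domain Classifier. The constructor arguments and the `get_client` call are placeholders; each script's source and `--help` output show the exact options.

```python
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client

# 1. Create a Dask client (these classifiers run on GPU workers).
client = get_client(cluster_type="gpu")

# 2. Read the input data; a cudf backend keeps it on the GPU.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# 3. Initialize and call the classifier on the data.
classifier = DomainClassifier()
result = classifier(dataset)

# 4. Write the labeled documents to the desired output type.
result.to_json("domain_labeled/")
```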

examples/k8s/README.md (+5)

@@ -0,0 +1,5 @@
+## Kubernetes
+
+The `create_dask_cluster.py` can be used to create a CPU or GPU Dask cluster.
+
+See [Running NeMo Curator on Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/kubernetescurator.html) for more information.

examples/nemo_run/README.md (+5)

@@ -0,0 +1,5 @@
+## NeMo-Run
+
+The `launch_slurm.py` script shows an example of how to run a Slurm job via Python APIs.
+
+See the [Dask with Slurm](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html?highlight=slurm#dask-with-slurm) and [NeMo-Run Quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html?highlight=slurm#execute-on-a-slurm-cluster) pages for more information.

examples/nemo_run/launch_slurm.py (+2 -2)

@@ -21,7 +21,7 @@
 @run.factory
 def nemo_curator_slurm_executor() -> SlurmExecutor:
     """
-    Configure the following function with the details of your SLURM cluster
+    Configure the following function with the details of your Slurm cluster
     """
     return SlurmExecutor(
         job_name_prefix="nemo-curator",
@@ -35,7 +35,7 @@ def nemo_curator_slurm_executor() -> SlurmExecutor:


 def main():
-    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
+    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the Slurm cluster
     container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
     # The NeMo Curator command to run
     # This command can be susbstituted with any NeMo Curator command

examples/slurm/README.md (+9)

@@ -0,0 +1,9 @@
+# Dask with Slurm
+
+This directory provides an example Slurm script pipeline.
+This pipeline has a script `start-slurm.sh` that provides configuration options similar to what `get_client` provides.
+Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted.
+`start-slurm.sh` calls `containter-entrypoint.sh`, which sets up a Dask scheduler and workers across the cluster.
+
+Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the `start-slurm.sh` script to run on multiple nodes.
+You can adapt your scripts easily too by simply following the pattern of adding `get_client` with `add_distributed_args`.

examples/slurm/start-slurm.sh (+1 -1)

@@ -23,7 +23,7 @@
 # Begin easy customization
 # =================================================================

-# Base directory for all SLURM job logs and files
+# Base directory for all Slurm job logs and files
 # Does not affect directories referenced in your script
 export BASE_JOB_DIR=`pwd`/nemo-curator-jobs
 export JOB_DIR=$BASE_JOB_DIR/$SLURM_JOB_ID

nemo_curator/modules/dataset_ops.py (+1 -1)

@@ -117,7 +117,7 @@ def blend_datasets(
     target_size: int, datasets: List[DocumentDataset], sampling_weights: List[float]
 ) -> DocumentDataset:
     """
-    Combined multiple datasets into one with different amounts of each dataset
+    Combines multiple datasets into one with different amounts of each dataset.
     Args:
         target_size: The number of documents the resulting dataset should have.
             The actual size of the dataset may be slightly larger if the normalized weights do not allow
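Based on the signature shown in this hunk, a minimal usage sketch (paths and sizes are placeholders):

```python
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.dataset_ops import blend_datasets

books = DocumentDataset.read_json("books/")
web = DocumentDataset.read_json("web/")

# Build a roughly 10,000-document blend drawing about 70% from web text and
# 30% from books; the sampling weights are normalized internally.
blended = blend_datasets(
    target_size=10_000,
    datasets=[web, books],
    sampling_weights=[0.7, 0.3],
)
blended.to_json("blended/")
```

This mirrors what the `blend_and_shuffle.py` example listed earlier in this commit does before shuffling the result.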

nemo_curator/nemo_run/slurm.py (+3 -3)

@@ -23,7 +23,7 @@
 @dataclass
 class SlurmJobConfig:
     """
-    Configuration for running a NeMo Curator script on a SLURM cluster using
+    Configuration for running a NeMo Curator script on a Slurm cluster using
     NeMo Run

     Args:
@@ -74,13 +74,13 @@ def to_script(self, add_scheduler_file: bool = True, add_device: bool = True):
             add_scheduler_file: Automatically appends a '--scheduler-file' argument to the
                 script_command where the value is job_dir/logs/scheduler.json. All
                 scripts included in NeMo Curator accept and require this argument to scale
-                properly on SLURM clusters.
+                properly on Slurm clusters.
             add_device: Automatically appends a '--device' argument to the script_command
                 where the value is the member variable of device. All scripts included in
                 NeMo Curator accept and require this argument.
         Returns:
             A NeMo Run Script that will intialize a Dask cluster, and run the specified command.
-            It is designed to be executed on a SLURM cluster
+            It is designed to be executed on a Slurm cluster
         """
         env_vars = self._build_env_vars()

nemo_curator/scripts/README.md

+29
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
# NeMo Curator CLI Scripts
2+
3+
The following Python scripts are designed to be executed from the command line (terminal) only.
4+
5+
Here, we list all of the Python scripts and their terminal commands:
6+
7+
| Python Command | CLI Command |
8+
|------------------------------------------|--------------------------------|
9+
| python add_id.py | add_id |
10+
| python blend_datasets.py | blend_datasets |
11+
| python download_and_extract.py | download_and_extract |
12+
| python filter_documents.py | filter_documents |
13+
| python find_exact_duplicates.py | gpu_exact_dups |
14+
| python find_matching_ngrams.py | find_matching_ngrams |
15+
| python find_pii_and_deidentify.py | deidentify |
16+
| python get_common_crawl_urls.py | get_common_crawl_urls |
17+
| python get_wikipedia_urls.py | get_wikipedia_urls |
18+
| python make_data_shards.py | make_data_shards |
19+
| python prepare_fasttext_training_data.py | prepare_fasttext_training_data |
20+
| python prepare_task_data.py | prepare_task_data |
21+
| python remove_matching_ngrams.py | remove_matching_ngrams |
22+
| python separate_by_metadata.py | separate_by_metadata |
23+
| python text_cleaning.py | text_cleaning |
24+
| python train_fasttext.py | train_fasttext |
25+
| python verify_classification_results.py | verify_classification_results |
26+
27+
For more information about the arguments needed for each script, you can use `add_id --help`, etc.
28+
29+
More scripts can be found in the `classifiers`, `fuzzy_deduplication`, and `semdedup` subdirectories.
+92

@@ -0,0 +1,92 @@
+## Text Classification
+
+The Python scripts in this directory demonstrate how to run classification on your text data with each of these 4 classifiers:
+
+- Domain Classifier
+- Quality Classifier
+- AEGIS Safety Models
+- FineWeb Educational Content Classifier
+
+For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).
+
+### Usage
+
+#### Domain classifier inference
+
+```bash
+# same as `python domain_classifier_inference.py`
+domain_classifier_inference \
+    --input-data-dir /path/to/data/directory \
+    --output-data-dir /path/to/output/directory \
+    --input-file-type "jsonl" \
+    --input-file-extension "jsonl" \
+    --output-file-type "jsonl" \
+    --input-text-field "text" \
+    --batch-size 64 \
+    --autocast \
+    --max-chars 2000 \
+    --device "gpu"
+```
+
+Additional arguments may be added for customizing a Dask cluster and client. Run `domain_classifier_inference --help` for more information.
+
+#### Quality classifier inference
+
+```bash
+# same as `python quality_classifier_inference.py`
+quality_classifier_inference \
+    --input-data-dir /path/to/data/directory \
+    --output-data-dir /path/to/output/directory \
+    --input-file-type "jsonl" \
+    --input-file-extension "jsonl" \
+    --output-file-type "jsonl" \
+    --input-text-field "text" \
+    --batch-size 64 \
+    --autocast \
+    --max-chars 2000 \
+    --device "gpu"
+```
+
+Additional arguments may be added for customizing a Dask cluster and client. Run `quality_classifier_inference --help` for more information.
+
+#### AEGIS classifier inference
+
+```bash
+# same as `python aegis_classifier_inference.py`
+aegis_classifier_inference \
+    --input-data-dir /path/to/data/directory \
+    --output-data-dir /path/to/output/directory \
+    --input-file-type "jsonl" \
+    --input-file-extension "jsonl" \
+    --output-file-type "jsonl" \
+    --input-text-field "text" \
+    --batch-size 64 \
+    --max-chars 6000 \
+    --device "gpu" \
+    --aegis-variant "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0" \
+    --token "hf_1234"
+```
+
+- `--aegis-variant` can be `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0`, `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0`, or a path to your own PEFT of LlamaGuard 2.
+- `--token` is your HuggingFace token, which is used when downloading the base Llama Guard model.
+
+Additional arguments may be added for customizing a Dask cluster and client. Run `aegis_classifier_inference --help` for more information.
+
+#### FineWeb-Edu classifier inference
+
+```bash
+# same as `python fineweb_edu_classifier_inference.py`
+fineweb_edu_classifier_inference \
+    --input-data-dir /path/to/data/directory \
+    --output-data-dir /path/to/output/directory \
+    --input-file-type "jsonl" \
+    --input-file-extension "jsonl" \
+    --output-file-type "jsonl" \
+    --input-text-field "text" \
+    --batch-size 64 \
+    --autocast \
+    --max-chars 2000 \
+    --device "gpu"
+```
+
+Additional arguments may be added for customizing a Dask cluster and client. Run `fineweb_edu_classifier_inference --help` for more information.
