
Commit d1f52f6

Add READMEs to examples/ and nemo_curator/scripts directories (NVIDIA#332)
* save progress
* add remaining docs
* add titles and table
* remove trailing whitespace
* add --help instructions

---------

Signed-off-by: Sarah Yurick <[email protected]>
1 parent bc724ec commit d1f52f6

18 files changed: +208 -22 lines changed

CONTRIBUTING.md (+1 -1)

@@ -37,7 +37,7 @@ There should be at least one example per module in the curator.
 They should be incredibly lightweight and rely on the core `nemo_curator` modules for their functionality.
 Most should be designed for a user to get up and running on their local machines, but distributed examples are welcomed if it makes sense.
 Python scripts should be the primary way to showcase your module.
-Though, SLURM scripts or other cluster scripts should be included if there are special steps needed to run the module.
+Though, Slurm scripts or other cluster scripts should be included if there are special steps needed to run the module.

 The documentation should complement each example by going through the motivation behind why a user would use each module.
 It should include both an explanation of the module, and how it's used in its corresponding example.

docs/user-guide/cpuvsgpu.rst (+6 -6)

@@ -35,7 +35,7 @@ All of the ``examples/`` use it to set up a Dask cluster.
 It is possible to run entirely CPU-based workflows on a GPU cluster, though the process count (and therefore the number of parallel tasks) will be limited by the number of GPUs on your machine.

 * ``scheduler_address`` and ``scheduler_file`` are used for connecting to an existing Dask cluster.
-  Supplying one of these is essential if you are running a Dask cluster on SLURM or Kubernetes.
+  Supplying one of these is essential if you are running a Dask cluster on Slurm or Kubernetes.
   All other arguments are ignored if either of these are passed, as the cluster configuration will be done when you create the schduler and works on your cluster.

 * The remaining arguments can be modified `here <https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/utils/distributed_utils.py>`_.
@@ -82,15 +82,15 @@ Even if you start a GPU dask cluster, you can't operate on datasets that use a `
 The ``DocuemntDataset`` must either have been originally read in with a ``cudf`` backend, or it must be transferred during the script.

 -----------------------------------------
-Dask with SLURM
+Dask with Slurm
 -----------------------------------------

-We provide an example SLURM script pipeline in ``examples/slurm``.
+We provide an example Slurm script pipeline in ``examples/slurm``.
 This pipeline has a script ``start-slurm.sh`` that provides configuration options similar to what ``get_client`` provides.
-Every SLURM cluster is different, so make sure you understand how your SLURM cluster works so the scripts can be easily adapted.
-``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.
+Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted.
+``start-slurm.sh`` calls ``containter-entrypoint.sh``, which sets up a Dask scheduler and workers across the cluster.

-Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
+Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` script to run on multiple nodes.
 You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.

 -----------------------------------------
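For reference, the `get_client`/`add_distributed_args` pattern mentioned above looks roughly like the sketch below. The helper names come from the documentation in this diff, but their exact signatures have shifted across NeMo Curator releases, so treat this as an illustration of the pattern rather than the current API.

```python
import argparse

from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.utils.script_utils import add_distributed_args


def main(args):
    # get_client consumes the distributed arguments attached below. Passing
    # --scheduler-file (or --scheduler-address) connects to an existing Dask
    # cluster, e.g. one started by start-slurm.sh; otherwise a local CPU or
    # GPU cluster is created on the machine running the script.
    client = get_client(args, args.device)

    # ... run your NeMo Curator pipeline here ...

    client.close()


def attach_args(parser=argparse.ArgumentParser(description="My curation script")):
    # Adds --device, --scheduler-file, --scheduler-address, worker counts, etc.
    return add_distributed_args(parser)


if __name__ == "__main__":
    main(attach_args().parse_args())
```

When `--scheduler-file` points at the `scheduler.json` written by the Slurm scripts in `examples/slurm`, the same script scales out to multiple nodes without code changes.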

docs/user-guide/kubernetescurator.rst (+3 -3)

@@ -139,7 +139,7 @@ use ``kubectl cp``, but ``exec`` has fewer surprises regarding compressed files:
 Create a Dask Cluster
 ---------------------

-Use the ``create_dask_cluster.py`` to create a CPU or GPU dask cluster.
+Use the ``create_dask_cluster.py`` to create a CPU or GPU Dask cluster.

 .. note::
    If you are creating another Dask cluster with the same ``--name <name>``, first delete it via::
@@ -289,7 +289,7 @@ container, we will need to build a custom image with your code installed:
    # Fill in <private-registry>/<username>/<password>
    kubectl create secret docker-registry my-private-registry --docker-server=<private-registry> --docker-username=<username> --docker-password=<password>

-And with this new secret, you create your new dask cluster:
+And with this new secret, you create your new Dask cluster:

 .. code-block:: bash

@@ -360,7 +360,7 @@ At this point you can tail the logs and look for ``Finished!`` in ``/nemo-worksp

 Deleting Cluster
 ----------------
-After you have finished using the created dask cluster, you can delete it to release the resources:
+After you have finished using the created Dask cluster, you can delete it to release the resources:

 .. code-block:: bash


docs/user-guide/personalidentifiableinformationidentificationandremoval.rst (+2 -2)

@@ -26,7 +26,7 @@ The tool utilizes `Dask <https://dask.org>`_ to parallelize tasks and hence it c
 used to scale up to terabytes of data easily. Although Dask can be deployed on various
 distributed compute environments such as HPC clusters, Kubernetes and other cloud
 offerings such as AWS EKS, Google cloud etc, the current implementation only supports
-Dask on HPC clusters that use SLURM as the resource manager.
+Dask on HPC clusters that use Slurm as the resource manager.

 -----------------------------------------
 Usage
@@ -92,7 +92,7 @@ The PII redaction module can also be invoked via ``script/find_pii_and_deidentif

 ``python nemo_curator/scripts/find_pii_and_deidentify.py``

-To launch the script from within a SLURM environment, the script ``examples/slurm/start-slurm.sh`` can be modified and used.
+To launch the script from within a Slurm environment, the script ``examples/slurm/start-slurm.sh`` can be modified and used.


 ############################
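For completeness, a rough sketch of the Python-API equivalent of the script above, using the `PiiModifier` and `Modify` classes referenced elsewhere in this commit. The constructor arguments shown here are assumptions for illustration; check `find_pii_and_deidentify.py --help` or the example source for the authoritative options.

```python
from nemo_curator import Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers.pii_modifier import PiiModifier

# Assumed arguments: the entity list and anonymization action are illustrative.
modifier = PiiModifier(
    language="en",
    supported_entities=["PERSON", "EMAIL_ADDRESS"],
    anonymize_action="replace",
    device="gpu",
)

dataset = DocumentDataset.read_json("books.jsonl")
redacted = Modify(modifier)(dataset)
redacted.to_json("redacted/")
```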

examples/README.md (+25)

@@ -0,0 +1,25 @@
+# NeMo Curator Python API examples
+
+This directory contains multiple Python scripts with examples of how to use various NeMo Curator classes and functions.
+The goal of these examples is to give the user an overview of many of the ways your text data can be curated.
+These include:
+
+| Python Script | Description |
+|---------------------------------------|---------------------------------------------------------------------------------------------------------------|
+| blend_and_shuffle.py | Combine multiple datasets into one with different amounts of each dataset, then randomly permute the dataset. |
+| classifier_filtering.py | Train a fastText classifier, then use it to filter high and low quality data. |
+| download_arxiv.py | Download Arxiv tar files and extract them. |
+| download_common_crawl.py | Download Common Crawl WARC snapshots and extract them. |
+| download_wikipedia.py | Download the latest Wikipedia dumps and extract them. |
+| exact_deduplication.py | Use the `ExactDuplicates` class to perform exact deduplication on text data. |
+| find_pii_and_deidentify.py | Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. |
+| fuzzy_deduplication.py | Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. |
+| identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix the unicode in it. |
+| raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. |
+| semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. |
+| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. |
+| translation_example.py | Create and use an `IndicTranslation` model for language translation. |
+
+Before running any of these scripts, we strongly recommend displaying `python <script name>.py --help` to ensure that any needed or relevant arguments are specified.
+
+The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain even more examples of NeMo Curator's capabilities.
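As a taste of what these examples contain, here is a condensed sketch of the exact-deduplication flow from `exact_deduplication.py`. The argument names are assumptions for illustration; the script's source and `--help` output are authoritative.

```python
from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset

# Read JSONL documents; a cudf backend keeps the data on GPU for the
# GPU-accelerated deduplication path.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Hash each document's text field and group identical hashes to find duplicates.
exact_dup = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
duplicates = exact_dup(dataset)

duplicates.to_parquet("exact_duplicates/")
```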

examples/classifiers/README.md (+21)

@@ -0,0 +1,21 @@
+## Text Classification
+
+The Python scripts in this directory demonstrate how to run classification on your text data with each of these 4 classifiers:
+
+- Domain Classifier
+- Quality Classifier
+- AEGIS Safety Models
+- FineWeb Educational Content Classifier
+
+For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).
+
+Each of these scripts provide simple examples of what your own Python scripts might look like.
+
+At a high level, you will:
+
+1. Create a Dask client by using the `get_client` function
+2. Use `DocumentDataset.read_json` (or `DocumentDataset.read_parquet`) to read your data
+3. Initialize and call the classifier on your data
+4. Write your results to the desired output type with `to_json` or `to_parquet`
+
+Before running any of these scripts, we strongly recommend displaying `python <script name>.py --help` to ensure that any needed or relevant arguments are specified.
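Put together, those four steps might look roughly like the sketch below, shown with the Domain Classifier. The constructor arguments and the `get_client` call are placeholders; each script's source and `--help` output show the exact options.

```python
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client

# 1. Create a Dask client (these classifiers run on GPU workers).
client = get_client(cluster_type="gpu")

# 2. Read the input data; a cudf backend keeps it on the GPU.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# 3. Initialize and call the classifier on the data.
classifier = DomainClassifier()
result = classifier(dataset)

# 4. Write the labeled documents to the desired output type.
result.to_json("domain_labeled/")
```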

examples/k8s/README.md (+5)

@@ -0,0 +1,5 @@
+## Kubernetes
+
+The `create_dask_cluster.py` can be used to create a CPU or GPU Dask cluster.
+
+See [Running NeMo Curator on Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/kubernetescurator.html) for more information.

examples/nemo_run/README.md (+5)

@@ -0,0 +1,5 @@
+## NeMo-Run
+
+The `launch_slurm.py` script shows an example of how to run a Slurm job via Python APIs.
+
+See the [Dask with Slurm](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html?highlight=slurm#dask-with-slurm) and [NeMo-Run Quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html?highlight=slurm#execute-on-a-slurm-cluster) pages for more information.

examples/nemo_run/launch_slurm.py (+2 -2)

@@ -21,7 +21,7 @@
 @run.factory
 def nemo_curator_slurm_executor() -> SlurmExecutor:
     """
-    Configure the following function with the details of your SLURM cluster
+    Configure the following function with the details of your Slurm cluster
     """
     return SlurmExecutor(
         job_name_prefix="nemo-curator",
@@ -35,7 +35,7 @@ def nemo_curator_slurm_executor() -> SlurmExecutor:


 def main():
-    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
+    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the Slurm cluster
     container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
     # The NeMo Curator command to run
     # This command can be susbstituted with any NeMo Curator command

examples/slurm/README.md (+9)

@@ -0,0 +1,9 @@
+# Dask with Slurm
+
+This directory provides an example Slurm script pipeline.
+This pipeline has a script `start-slurm.sh` that provides configuration options similar to what `get_client` provides.
+Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted.
+`start-slurm.sh` calls `containter-entrypoint.sh`, which sets up a Dask scheduler and workers across the cluster.
+
+Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the `start-slurm.sh` script to run on multiple nodes.
+You can adapt your scripts easily too by simply following the pattern of adding `get_client` with `add_distributed_args`.

examples/slurm/start-slurm.sh (+1 -1)

@@ -23,7 +23,7 @@
 # Begin easy customization
 # =================================================================

-# Base directory for all SLURM job logs and files
+# Base directory for all Slurm job logs and files
 # Does not affect directories referenced in your script
 export BASE_JOB_DIR=`pwd`/nemo-curator-jobs
 export JOB_DIR=$BASE_JOB_DIR/$SLURM_JOB_ID

nemo_curator/modules/dataset_ops.py (+1 -1)

@@ -117,7 +117,7 @@ def blend_datasets(
     target_size: int, datasets: List[DocumentDataset], sampling_weights: List[float]
 ) -> DocumentDataset:
     """
-    Combined multiple datasets into one with different amounts of each dataset
+    Combines multiple datasets into one with different amounts of each dataset.
     Args:
         target_size: The number of documents the resulting dataset should have.
             The actual size of the dataset may be slightly larger if the normalized weights do not allow
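Based on the signature shown in this hunk, a minimal usage sketch (paths and sizes are placeholders):

```python
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.dataset_ops import blend_datasets

books = DocumentDataset.read_json("books/")
web = DocumentDataset.read_json("web/")

# Build a roughly 10,000-document blend drawing about 70% from web text and
# 30% from books; the sampling weights are normalized internally.
blended = blend_datasets(
    target_size=10_000,
    datasets=[web, books],
    sampling_weights=[0.7, 0.3],
)
blended.to_json("blended/")
```

This mirrors what the `blend_and_shuffle.py` example listed earlier in this commit does before shuffling the result.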

nemo_curator/nemo_run/slurm.py (+3 -3)

@@ -23,7 +23,7 @@
 @dataclass
 class SlurmJobConfig:
     """
-    Configuration for running a NeMo Curator script on a SLURM cluster using
+    Configuration for running a NeMo Curator script on a Slurm cluster using
     NeMo Run

     Args:
@@ -74,13 +74,13 @@ def to_script(self, add_scheduler_file: bool = True, add_device: bool = True):
             add_scheduler_file: Automatically appends a '--scheduler-file' argument to the
                 script_command where the value is job_dir/logs/scheduler.json. All
                 scripts included in NeMo Curator accept and require this argument to scale
-                properly on SLURM clusters.
+                properly on Slurm clusters.
             add_device: Automatically appends a '--device' argument to the script_command
                 where the value is the member variable of device. All scripts included in
                 NeMo Curator accept and require this argument.
         Returns:
             A NeMo Run Script that will intialize a Dask cluster, and run the specified command.
-            It is designed to be executed on a SLURM cluster
+            It is designed to be executed on a Slurm cluster
         """
         env_vars = self._build_env_vars()

nemo_curator/scripts/README.md

+29
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
# NeMo Curator CLI Scripts
2+
3+
The following Python scripts are designed to be executed from the command line (terminal) only.
4+
5+
Here, we list all of the Python scripts and their terminal commands:
6+
7+
| Python Command | CLI Command |
8+
|------------------------------------------|--------------------------------|
9+
| python add_id.py | add_id |
10+
| python blend_datasets.py | blend_datasets |
11+
| python download_and_extract.py | download_and_extract |
12+
| python filter_documents.py | filter_documents |
13+
| python find_exact_duplicates.py | gpu_exact_dups |
14+
| python find_matching_ngrams.py | find_matching_ngrams |
15+
| python find_pii_and_deidentify.py | deidentify |
16+
| python get_common_crawl_urls.py | get_common_crawl_urls |
17+
| python get_wikipedia_urls.py | get_wikipedia_urls |
18+
| python make_data_shards.py | make_data_shards |
19+
| python prepare_fasttext_training_data.py | prepare_fasttext_training_data |
20+
| python prepare_task_data.py | prepare_task_data |
21+
| python remove_matching_ngrams.py | remove_matching_ngrams |
22+
| python separate_by_metadata.py | separate_by_metadata |
23+
| python text_cleaning.py | text_cleaning |
24+
| python train_fasttext.py | train_fasttext |
25+
| python verify_classification_results.py | verify_classification_results |
26+
27+
For more information about the arguments needed for each script, you can use `add_id --help`, etc.
28+
29+
More scripts can be found in the `classifiers`, `fuzzy_deduplication`, and `semdedup` subdirectories.
+92

@@ -0,0 +1,92 @@
+## Text Classification
+
+The Python scripts in this directory demonstrate how to run classification on your text data with each of these 4 classifiers:
+
+- Domain Classifier
+- Quality Classifier
+- AEGIS Safety Models
+- FineWeb Educational Content Classifier
+
+For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).
+
+### Usage
+
+#### Domain classifier inference
+
+```bash
+# same as `python domain_classifier_inference.py`
+domain_classifier_inference \
+    --input-data-dir /path/to/data/directory \
+    --output-data-dir /path/to/output/directory \
+    --input-file-type "jsonl" \
+    --input-file-extension "jsonl" \
+    --output-file-type "jsonl" \
+    --input-text-field "text" \
+    --batch-size 64 \
+    --autocast \
+    --max-chars 2000 \
+    --device "gpu"
+```
+
+Additional arguments may be added for customizing a Dask cluster and client. Run `domain_classifier_inference --help` for more information.
+
+#### Quality classifier inference
+
+```bash
+# same as `python quality_classifier_inference.py`
+quality_classifier_inference \
+    --input-data-dir /path/to/data/directory \
+    --output-data-dir /path/to/output/directory \
+    --input-file-type "jsonl" \
+    --input-file-extension "jsonl" \
+    --output-file-type "jsonl" \
+    --input-text-field "text" \
+    --batch-size 64 \
+    --autocast \
+    --max-chars 2000 \
+    --device "gpu"
+```
+
+Additional arguments may be added for customizing a Dask cluster and client. Run `quality_classifier_inference --help` for more information.
+
+#### AEGIS classifier inference
+
+```bash
+# same as `python aegis_classifier_inference.py`
+aegis_classifier_inference \
+    --input-data-dir /path/to/data/directory \
+    --output-data-dir /path/to/output/directory \
+    --input-file-type "jsonl" \
+    --input-file-extension "jsonl" \
+    --output-file-type "jsonl" \
+    --input-text-field "text" \
+    --batch-size 64 \
+    --max-chars 6000 \
+    --device "gpu" \
+    --aegis-variant "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0" \
+    --token "hf_1234"
+```
+
+- `--aegis-variant` can be `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0`, `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0`, or a path to your own PEFT of LlamaGuard 2.
+- `--token` is your HuggingFace token, which is used when downloading the base Llama Guard model.
+
+Additional arguments may be added for customizing a Dask cluster and client. Run `aegis_classifier_inference --help` for more information.
+
+#### FineWeb-Edu classifier inference
+
+```bash
+# same as `python fineweb_edu_classifier_inference.py`
+fineweb_edu_classifier_inference \
+    --input-data-dir /path/to/data/directory \
+    --output-data-dir /path/to/output/directory \
+    --input-file-type "jsonl" \
+    --input-file-extension "jsonl" \
+    --output-file-type "jsonl" \
+    --input-text-field "text" \
+    --batch-size 64 \
+    --autocast \
+    --max-chars 2000 \
+    --device "gpu"
+```
+
+Additional arguments may be added for customizing a Dask cluster and client. Run `fineweb_edu_classifier_inference --help` for more information.
