Update datasets.mdx #224

fern/pages/get-started/datasets.mdx
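
Before creating datasets, instantiate the Python SDK client. A minimal setup sketch: the client construction is taken from the page's earlier setup snippet, and the `import` line is the obvious completion.

```python PYTHON
import cohere

co = cohere.Client(api_key='Your API key')
```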

### Dataset Creation

Datasets are created by uploading files, specifying both a `name` for the dataset and the dataset `type`.

The file extension and file contents have to match the requirements for the selected dataset `type`. See the table below to learn more about the supported dataset types.

The dataset `name` is useful when browsing the datasets you've uploaded. In addition to its name, each dataset will also be assigned a unique `id` when it's created.
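
For instance, here is a minimal sketch of browsing your uploaded datasets; it assumes the SDK's `datasets.list` endpoint and a `datasets` field on its response.

```python PYTHON
# list uploaded datasets and print each one's name and generated id
for ds in co.datasets.list().datasets:
    print(ds.name, ds.id)
```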

Here is an example code snippet illustrating the process of creating a dataset, with both the `name` and the dataset `type` specified.

```python PYTHON
my_dataset = co.datasets.create(
    name="shakespeare",
    data=open("./shakespeare.jsonl", "rb"),
    type="chat-finetune-input")

print(my_dataset.id)
```

### Dataset Validation

Whenever a dataset is created, the data is validated asynchronously against the rules for the specified dataset `type`. This validation is kicked off automatically on the backend, and must be completed before a dataset can be used with other endpoints.

Here's a code snippet showing how to check the validation status of a dataset you've created.
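
A minimal sketch, assuming the response from `co.datasets.get` exposes the dataset and its `validation_status` field:

```python PYTHON
# fetch the dataset by id and inspect its current validation status
response = co.datasets.get(id=my_dataset.id)
print(response.dataset.validation_status)  # assumed field name on the response

# alternatively, block until validation has finished
co.wait(my_dataset)
```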

Here is an example of creating a dataset for use with embed jobs:

```python PYTHON
ds = co.datasets.create(
    name='sample_file',
    # insert your file path here - you can upload it on the right - we accept .csv and jsonl files
    data=open('embed_jobs_sample_data.jsonl', 'rb'),
    keep_fields=['wiki_id', 'url', 'views', 'title'],
    optional_fields=['langs'],
    type="embed-input"
)

# wait for the dataset to finish validation
co.wait(ds)
```
In the example below, we will create a new dataset and upload an evaluation set.

```python PYTHON
# classes for configuring the fine-tuning request
from cohere.finetuning import BaseModel, FinetunedModel, Settings

# create a dataset
my_dataset = co.datasets.create(
    name="shakespeare",
    type="chat-finetune-input",
    data=open("./shakespeare.jsonl", "rb"),
    eval_data=open("./shakespeare-eval.jsonl", "rb")
)

co.wait(my_dataset)

# start training a custom model using the dataset
co.finetuning.create_finetuned_model(
    request=FinetunedModel(
        name="shakespearean-model",
        settings=Settings(
            base_model=BaseModel(
                base_type="BASE_TYPE_CHAT",
            ),
            dataset_id=my_dataset.id
        ),
    )
)
```

### Dataset Types
Here is an example code snippet showing how to fetch a dataset by its unique `id`.

```python PYTHON
my_dataset = co.datasets.get(id="<DATASET_ID>")

# print each entry in the dataset
for record in my_dataset:
    print(record)

# save the dataset as jsonl (save_dataset is assumed to be the SDK's download helper; the path is a placeholder)
co.utils.save_dataset(dataset=my_dataset, file_path='./my_dataset.jsonl', format="jsonl")
```