Add the garden blog post #1144

Merged 6 commits on Jan 30, 2024
Changes from 5 commits
Binary file added docs/_static/introducing_garden/ai_ethics.png
Binary file added docs/_static/introducing_garden/clusters.png
Binary file added docs/_static/introducing_garden/nsfw.png
175 changes: 175 additions & 0 deletions docs/blog/introducing-garden.md
@@ -0,0 +1,175 @@
# 100x Faster Clustering with Lilac Garden

_Automatic and human-readable clustering of 1 million conversations from Chatbot Arena, computed in
20 minutes.
[Join the waitlist for Lilac Garden.](https://docs.google.com/forms/d/e/1FAIpQLSd53PiLVU-65n0czUbhQySvZVtybbSR1fH-WJDDycOZxVRvNQ/viewform)_

_Jan 30, 2024_

<a href="https://lilacai-lilac.hf.space/datasets#lilac/lmsys-chat-1m&query=%7B%7D&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22conversation__clusters%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22conversation__clusters%22%2C%22cluster_title%22%5D%7D">
<img title="Cluster view of conversations from Chatbot Arena" src="../_static/introducing_garden/clusters.png" />
</a>

At Lilac, our mission is to help you curate your text data so you can build the best possible LLM
apps. We’ve built a new LLM-powered text clustering pipeline that’s more accurate and easier to
understand than any clustering approach we’ve seen before. This is our first cloud service that we
are launching on [Lilac Garden](https://www.lilacml.com/#garden), our new accelerated computation
platform.

## The problem

Visualizing text datasets poses a unique set of challenges. Unlike visual data, where patterns and
relationships can often be discerned through graphical representations, understanding text is
inherently tricky. Our new LLM-powered text clustering pipeline addresses this challenge by
leveraging long-context embeddings, massively parallel GPU compute, and sophisticated LLMs to
generate concise, descriptive titles for each cluster. These clusters play a crucial role in
revealing the major segments of your dataset by grouping similar documents together. This approach
can dramatically speed up the process of curation, allowing the identification and removal of
problematic clusters, sub-sampling large clusters to reduce dataset size, and creating task-specific
datasets. Moreover, examining clusters in user-LLM interaction logs sheds light on subtle and
potentially hazardous ways in which users engage with your product.

> Review comment (Contributor) on the "Unlike visual data" sentence: this is a bit of a mouthful.
> How about "Unlike images or videos where we immediately can see patterns, text requires a
> different approach."

Don’t just take our word for it - check out Lilac’s demo page to see clusters for LLM fine-tuning
datasets
([OpenOrca](https://lilacai-lilac.hf.space/datasets#lilac/OpenOrca&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22question__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22question__cluster%22%2C%22cluster_title%22%5D%7D),
[Capybara](https://lilacai-lilac.hf.space/datasets#lilac/Capybara&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22conversation_input__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22conversation_input__cluster%22%2C%22cluster_title%22%5D%7D),
[UltraChat](https://lilacai-lilac.hf.space/datasets#lilac/UltraChat-200k&query=%7B%7D&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22messages__clusters%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22messages__clusters%22%2C%22cluster_title%22%5D%7D),
[Glaive](https://lilacai-lilac.hf.space/datasets#lilac/glaive-code-assistant&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22question__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22question__cluster%22%2C%22cluster_title%22%5D%7D),
[SlimOrca](https://lilacai-lilac.hf.space/datasets#lilac/SlimOrca&query=%7B%7D&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22conversation__clusters%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22conversation__clusters%22%2C%22cluster_title%22%5D%7D)),
LLM eval datasets
([MMLU](https://lilacai-lilac.hf.space/datasets#lilac/MMLU&query=%7B%7D&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22question__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22question__cluster%22%2C%22cluster_title%22%5D%7D),
[ARC](https://lilacai-lilac.hf.space/datasets#lilac/ARC-Challenge&query=%7B%7D&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22question__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22question__cluster%22%2C%22cluster_title%22%5D%7D),
[Winogrande](https://lilacai-lilac.hf.space/datasets#lilac/WinoGrande&query=%7B%7D&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22sentence__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22sentence__cluster%22%2C%22cluster_title%22%5D%7D)),
user logs
([Chatbot Arena](https://lilacai-lilac.hf.space/datasets#lilac/lmsys-chat-1m&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22conversation__clusters%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22conversation__clusters%22%2C%22cluster_title%22%5D%7D)),
content scraping
([HN comments](https://lilacai-lilac.hf.space/datasets#lilac/hncomments-1m&query=%7B%7D&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22text__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22text__cluster%22%2C%22cluster_title%22%5D%7D)),
and more.

## Jailbreak clusters

[Chatbot Arena](https://chat.lmsys.org/) is a website that allows users to interact with and compare
chatbot quality. The Arena team has released a subset of redacted user logs, and we discovered that
quite a few users test their jailbreaks and prompt injections on the Arena.

<a href="https://lilacai-lilac.hf.space/datasets#lilac/lmsys-chat-1m&query=%7B%22filters%22%3A%5B%7B%22path%22%3A%5B%22conversation__clusters%22%2C%22category_title%22%5D%2C%22op%22%3A%22equals%22%2C%22value%22%3A%22Inappropriate%20Content%22%7D%5D%7D&groupBy=%7B%22path%22%3A%5B%22conversation__clusters%22%2C%22cluster_title%22%5D%2C%22value%22%3A%22Inappropriate%20Content%20Requests%22%7D">
<img title="Jailbreaking models" src="../_static/introducing_garden/jailbreak_unhinged.png"/>
</a>

<a href="https://lilacai-lilac.hf.space/datasets#lilac/lmsys-chat-1m&query=%7B%22filters%22%3A%5B%7B%22path%22%3A%5B%22conversation__clusters%22%2C%22category_title%22%5D%2C%22op%22%3A%22equals%22%2C%22value%22%3A%22AI%20Role-Play%22%7D%5D%7D&groupBy=%7B%22path%22%3A%5B%22conversation__clusters%22%2C%22cluster_title%22%5D%2C%22value%22%3A%22Role-playing%20scenarios%20with%20DAN%20AI%20model%22%7D">
<img title="DAN Jailbreak" src="../_static/introducing_garden/jailbreak_dan.png"/>
</a>

Unsurprisingly, another popular theme is users trying to get chatbots to say NSFW things (and they
seem to be broadly successful).

```{warning}
The image below links to explicit content.
```

<a href="https://lilacai-lilac.hf.space/datasets#lilac/lmsys-chat-1m&groupBy=%7B%22path%22%3A%5B%22conversation__clusters%22%2C%22category_title%22%5D%2C%22value%22%3A%22Inappropriate%20Content%22%7D&rowId=%22fff8851628884e259af503b0570ba194%22">
<img title="NSFW" src="../_static/introducing_garden/nsfw.png"/>

</a>

Users are also trying to understand the guardrails of the chatbots by poking around controversial
topics.

<a href="https://lilacai-lilac.hf.space/datasets#lilac/lmsys-chat-1m&groupBy=%7B%22path%22%3A%5B%22conversation__clusters%22%2C%22category_title%22%5D%2C%22value%22%3A%22AI%20Ethics%20Conversations%22%7D&rowId=%22ff6668747746466ca528b1912133926a%22">
<img title="AI Ethics" src="../_static/introducing_garden/ai_ethics.png"/>
</a>

Our cluster breakdown page also supports grouping by custom fields. We can
<a href="https://lilacai-lilac.hf.space/datasets#lilac/lmsys-chat-1m&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22model%22%5D%2C%22innerPath%22%3A%5B%22conversation__clusters%22%2C%22cluster_title%22%5D%7D">group
by the Model column</a> to see what sorts of interactions users have with each model.

<a href="https://lilacai-lilac.hf.space/datasets#lilac/lmsys-chat-1m&viewPivot=true&pivot=%7B%22outerPath%22%3A%5B%22model%22%5D%2C%22innerPath%22%3A%5B%22conversation__clusters%22%2C%22cluster_title%22%5D%7D">
<img title="Grouping cluster by model type" src="../_static/introducing_garden/group_cluster_by_model.png" />
</a>
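
Conceptually, this pivot is a two-level count: rows are grouped by an outer field (here the model) and then by cluster title. Here is a minimal sketch in plain Python, assuming each row is a dict with hypothetical `model` and `cluster_title` keys. The sample rows are made up for illustration and are not taken from the dataset, and `pivot_counts` is our own helper, not a Lilac function.

```python
from collections import Counter, defaultdict

# Made-up rows for illustration only.
rows = [
    {"model": "model-a", "cluster_title": "Role-playing scenarios"},
    {"model": "model-a", "cluster_title": "Role-playing scenarios"},
    {"model": "model-a", "cluster_title": "Python debugging"},
    {"model": "model-b", "cluster_title": "Python debugging"},
]


def pivot_counts(rows, outer, inner):
    """Count rows per inner value within each outer value."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[outer]][row[inner]] += 1
    return table


pivot = pivot_counts(rows, outer="model", inner="cluster_title")
for model, counts in pivot.items():
    # most_common() lists the largest clusters for each model first.
    print(model, counts.most_common())
```

Swapping `outer` for any other column gives the same breakdown for that custom field.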

We’re excited to see what interesting clusters you find!

## Clustering, powered by Lilac Garden

Lilac users can `pip install` the latest version from PyPI to run
[clustering locally](https://docs.lilacml.com/datasets/dataset_cluster.html). Local clustering works
reasonably well up to ~10,000 data points, but it will take a few hours. For users with a GPU and
cuML installed, datasets of ~100,000 points should be feasible.

Clustering a dataset is simple and requires no hyperparameters.

```py
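import lilac as ll

# Load the Capybara dataset from the Hugging Face Hub.
dataset = ll.from_huggingface('LDJnr/Capybara')
# Cluster the user inputs locally (use_garden=False keeps the computation off Lilac Garden).
dataset.cluster(input='conversation.*.input', use_garden=False)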
```

For datasets larger than 10k rows – or if you’re impatient, like us –
[Lilac Garden](https://www.lilacml.com/#garden) is a remote computation service that powers
compute-heavy features like clustering, perplexity scoring, and embedding computation. Lilac Garden
clustered our largest datasets of 4 million data points in just an hour. A 10,000 data point dataset
would cluster in less than a minute on Garden - **100x faster** than local computation!
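
As a back-of-the-envelope check, the sketch below converts the runtimes quoted in this post into rows per second. It uses only the figures stated here; the helper name `rows_per_second` is ours, and real throughput will depend on document length and fixed startup overhead.

```python
def rows_per_second(rows: int, seconds: float) -> float:
    """Average throughput implied by a row count and a wall-clock time."""
    return rows / seconds


# Figures quoted in this post.
garden_4m = rows_per_second(4_000_000, 60 * 60)  # 4M points in one hour
arena_1m = rows_per_second(1_000_000, 20 * 60)   # 1M conversations in 20 minutes

# Time a 10,000-row dataset would take at the 4M-per-hour rate.
seconds_for_10k = 10_000 / garden_4m

print(f"{garden_4m:.0f} rows/s, {arena_1m:.0f} rows/s, {seconds_for_10k:.0f} s for 10k rows")
```

At roughly a thousand rows per second, a 10,000-row dataset needs only a few seconds of pure compute, which fits the "less than a minute" figure once overhead is included.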

Lilac Garden fundamentally changed how we interact with data, allowing us to use the power of LLMs
over entire datasets without waiting on and monitoring multi-day pipelines.

To join the waitlist,
[fill out this form](https://docs.google.com/forms/d/e/1FAIpQLSd53PiLVU-65n0czUbhQySvZVtybbSR1fH-WJDDycOZxVRvNQ/viewform).

### How we did it

Historically, clustering as a service has been hard to pull off for several reasons:

- Clustering is slow! (A 10k row dataset takes 15 minutes on an M2 Pro MacBook.)
- Clustering requires the entire dataset to be in RAM.
- Each new dataset requires fiddling with clustering hyperparameters.
- The dependencies are fiddly.
- Finding a good embedding function is hard.
- Even after you get your clusters, it’s hard to understand what each of the thousands of clusters
  actually is.

Today, these issues are all solvable, thanks to the work of many other companies and researchers.

The Lilac Clustering pipeline is a four-step process.

1. Embed documents with
[Jina V2](https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/),
which supports embedding long documents with a single embedding.
1. Reduce the dimensionality using [UMAP](https://umap-learn.readthedocs.io/en/latest/), which
projects embeddings with hundreds to thousands of dimensions down to a much smaller size,
emphasizing scale-free local structure and feeding well into HDBSCAN’s cluster detection algorithm.
1. Compute clusters with [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html),
enabling clustering without extensive hyperparameter tuning.
1. Attach human-legible titles to these clusters by asking [GPT](https://openai.com/product) to
title the cluster.

![Clustering pipeline](../_static/introducing_garden/cluster_pipeline.png)
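
The four steps above compose as a simple function pipeline. The sketch below shows only that composition: the stage functions are toy stand-ins written for this example (keyword features, truncation, exact-match grouping, a fixed title rule), not Jina V2, UMAP, HDBSCAN, or GPT, and `run_pipeline` is a hypothetical name rather than part of Lilac's API.

```python
from collections import defaultdict


def run_pipeline(docs, embed, reduce_dims, cluster, title):
    """Run the four stages in order and return {cluster_id: {"title", "docs"}}."""
    vectors = embed(docs)            # step 1: one embedding per document
    reduced = reduce_dims(vectors)   # step 2: project to fewer dimensions
    labels = cluster(reduced)        # step 3: one cluster id per document
    members = defaultdict(list)
    for doc, label in zip(docs, labels):
        members[label].append(doc)
    # step 4: ask a titling function for a short title per cluster
    return {cid: {"title": title(ds), "docs": ds} for cid, ds in members.items()}


# Toy stand-ins so the sketch runs without any models or GPUs.
def toy_embed(docs):
    # Three crude features: translation flag, Python flag, scaled length.
    return [
        [float("translate" in d.lower()), float("python" in d.lower()), len(d) / 100]
        for d in docs
    ]


def toy_reduce(vectors):
    return [v[:2] for v in vectors]  # keep only the first two dimensions


def toy_cluster(vectors):
    seen = {}  # identical reduced vectors share a cluster id
    return [seen.setdefault(tuple(v), len(seen)) for v in vectors]


def toy_title(docs):
    return "Translation" if "translate" in docs[0].lower() else "Python help"


docs = ["Translate 'hello' to Czech", "Fix my Python loop", "Translate this to Persian"]
result = run_pipeline(docs, toy_embed, toy_reduce, toy_cluster, toy_title)
for cid, info in result.items():
    print(cid, info["title"], len(info["docs"]))
```

Keeping each stage behind a plain callable is one way to swap a CPU implementation for a GPU or remote one without touching the rest of the flow.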

This process, run once, generates fine-grained clusters like _"Translating English to Czech"_ and
_"Persian-English Translation"_. We then repeat the process on the generated titles to create
cluster categories, like _"Translation"_, which categorizes the fine-grained clusters from the first
phase of the pipeline.
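
The second pass can be pictured as running the same cluster-then-title step over the titles themselves. Here is a minimal sketch, assuming hypothetical `cluster_fn` and `title_fn` callables that stand in for the real clustering and LLM titling stages. The keyword rules are illustrative only, and "Python debugging" is a made-up title added for contrast.

```python
def categorize(cluster_titles, cluster_fn, title_fn):
    """Second pass: treat cluster titles as the documents to cluster and title."""
    labels = cluster_fn(cluster_titles)
    groups = {}
    for cluster_title, label in zip(cluster_titles, labels):
        groups.setdefault(label, []).append(cluster_title)
    category_of = {}
    for members in groups.values():
        category = title_fn(members)  # one category title per group of titles
        for cluster_title in members:
            category_of[cluster_title] = category
    return category_of


# Toy stand-ins for the clustering and titling stages.
def toy_cluster(titles):
    return [0 if "translat" in t.lower() else 1 for t in titles]


def toy_title(titles):
    return "Translation" if all("translat" in t.lower() for t in titles) else "Other"


titles = ["Translating English to Czech", "Persian-English Translation", "Python debugging"]
category_of = categorize(titles, toy_cluster, toy_title)
print(category_of)
```

Each document then carries two labels, a fine-grained cluster title and a coarser category title.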

Thanks to the wizards at NVIDIA, [cuML](https://docs.rapids.ai/api/cuml/stable/) can accelerate
these computations tenfold. We then rely on [Modal](https://modal.com) to provide parallel GPUs and
an RPC framework to schedule all of these computations. The final step is to put a human-legible
label on each cluster by sending a carefully engineered prompt to GPT-3.5.

### Beyond clusters

Clustering can give us valuable insights into the shape of our data, or how users are interacting
with production systems. However, we can also use the generated clusters to curate data much faster
in Lilac:

- Drop problematic clusters
- Sample across clusters to reduce the size of our dataset
- Create a task-specific dataset from a much larger dataset
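
The three operations above can be sketched in plain Python over rows that carry cluster labels. This is an illustration under assumptions, not Lilac's API: the helper names are hypothetical, the rows are made up, and the `cluster_title` / `category_title` keys simply mirror the field names that appear in the demo links in this post.

```python
import random
from collections import defaultdict

# Made-up rows for illustration only.
rows = [
    {"text": "a", "cluster_title": "Translating English to Czech", "category_title": "Translation"},
    {"text": "b", "cluster_title": "Translating English to Czech", "category_title": "Translation"},
    {"text": "c", "cluster_title": "Translating English to Czech", "category_title": "Translation"},
    {"text": "d", "cluster_title": "Persian-English Translation", "category_title": "Translation"},
    {"text": "e", "cluster_title": "Inappropriate Content Requests", "category_title": "Inappropriate Content"},
    {"text": "f", "cluster_title": "Inappropriate Content Requests", "category_title": "Inappropriate Content"},
]


def drop_categories(rows, bad_categories):
    """Drop every row whose category is considered problematic."""
    return [r for r in rows if r["category_title"] not in bad_categories]


def sample_per_cluster(rows, max_per_cluster, seed=0):
    """Keep at most max_per_cluster randomly chosen rows from each cluster."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_cluster = defaultdict(list)
    for r in rows:
        by_cluster[r["cluster_title"]].append(r)
    sampled = []
    for members in by_cluster.values():
        sampled.extend(rng.sample(members, min(max_per_cluster, len(members))))
    return sampled


def select_category(rows, category):
    """Build a task-specific subset from a single category."""
    return [r for r in rows if r["category_title"] == category]


clean = drop_categories(rows, {"Inappropriate Content"})
small = sample_per_cluster(clean, max_per_cluster=2)
translation_only = select_category(rows, "Translation")
print(len(clean), len(small), len(translation_only))
```

Capping rows per cluster rather than sampling uniformly keeps small clusters represented while shrinking the large ones.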

We’ve uploaded all of the clustered datasets (with cluster titles and cluster IDs) to our
[Hugging Face page](https://huggingface.co/lilacai).

[Here’s a notebook](https://github.com/lilacai/lilac/blob/main/notebooks/ClusterSampling.ipynb) that
demonstrates how to download and filter using our clusters, using Translation clusters in SlimOrca
to create a translation-only SlimOrca dataset.

We’re excited to see what you end up doing with the clusters!
16 changes: 9 additions & 7 deletions docs/index.rst
@@ -12,21 +12,23 @@
    :parser: myst_parser.sphinx_

 .. toctree::
-   :caption: Getting Started
+   :caption: Blog posts
    :hidden:
    :maxdepth: 1
    :glob:

-   getting_started/quickstart.md
-   getting_started/installation.md
-   getting_started/quickstart_python.md
+   blog/introducing-lilac.md
+   blog/curate-coding-dataset.md
+   blog/introducing-garden.md

 .. toctree::
-   :caption: Blog posts
+   :caption: Getting Started
    :hidden:
    :maxdepth: 1
    :glob:

-   blog/*
+   getting_started/quickstart.md
+   getting_started/installation.md
+   getting_started/quickstart_python.md

 .. toctree::
    :caption: Datasets
2 changes: 1 addition & 1 deletion docs/poetry.lock
