Modernize the readme. (#751)
nsthorat authored Oct 9, 2023
1 parent 3918849 commit 92af135
Showing 2 changed files with 205 additions and 38 deletions.
240 changes: 203 additions & 37 deletions README.md
@@ -1,34 +1,55 @@
# 🌸 Lilac

[![Static Badge](https://img.shields.io/badge/Homepage-8A2BE2?link=http%3A%2F%2Flilacml.com%2F)](https://lilacml.com)
[![Downloads](https://static.pepy.tech/badge/lilac/month)](https://pepy.tech/project/lilac)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Twitter](https://img.shields.io/twitter/follow/lilac_ai)](https://twitter.com/lilac_ai)
[![](https://dcbadge.vercel.app/api/server/jNzw9mC8pp?compact=true&style=flat)](https://discord.gg/jNzw9mC8pp)

> **NEW: Try the [Lilac hosted demo with pre-loaded datasets](https://lilacai-lilac.hf.space/)**
## 👋 Welcome

[Lilac](http://lilacml.com) is an open-source product that helps you **analyze**, **structure**, and
**clean** unstructured data with AI.

Lilac can be used from our UI or from Python.
<h1 align="center">🌸 Lilac</h1>
<h3 align="center" style="font-size: 20px; margin-bottom: 4px">Curate better data for LLMs</h3>
<p align="center">
<a style="padding: 4px;" href="https://lilacai-lilac.hf.space/">
<span style="margin-right: 4px; font-size: 12px">🔗</span> <span style="font-size: 14px">Try the Lilac web demo!</span>
</a>
<br/><br/>
<a href="https://lilacml.com/">
<img alt="Site" src="https://img.shields.io/badge/Site-lilacml.com-ed2dd0?link=https%3A%2F%2Flilacml.com"/>
</a>
<a href="https://discord.gg/jNzw9mC8pp">
<img alt="Discord" src="https://img.shields.io/badge/Join-important.svg?color=ed2dd0&label=Discord&logo=slack" />
</a>
<a href="https://github.com/lilacai/lilac/blob/main/LICENSE">
<img alt="License Apache 2.0" src="https://img.shields.io/badge/License-Apache 2.0-blue.svg?style=flat&color=ed2dd0" height="20" width="auto">
</a>
<br/>
<a href="https://github.com/lilacai/lilac">
<img src="https://img.shields.io/github/stars/lilacai/lilac?style=social" />
</a>
<a href="https://twitter.com/lilac_ai">
<img src="https://img.shields.io/twitter/follow/lilac_ai" alt="Follow on Twitter" />
</a>
</p>

Lilac helps you **curate data** for LLMs, from RAGs to fine-tuning datasets.

Lilac runs **on-device** using open-source LLMs with a UI and Python API for:

- **Exploring** datasets with natural language (documents)
- **Annotating & structuring** data (e.g. PII detection, profanity, text statistics)
- **Semantic search** to find results similar to a query
- **Conceptual search** to find and tag results that match a fuzzy concept (e.g. low command of
English language)
- **Clustering** data semantically for understanding & deduplication
- **Labeling** and **Bulk Labeling** to curate data

https://github.com/lilacai/lilac/assets/2294279/cb1378f8-92c1-4f2a-9524-ce5ddd8e0c53

## 💻 Install
## 🔥 Getting started

To install Lilac on your machine:
### 💻 Install

```sh
pip install lilac
pip install lilac[all]
```

You can also use Lilac with no installation by
[forking our public HuggingFace Spaces demo](https://lilacai-lilac.hf.space/).
If you prefer not to install locally, you can
[fork the HuggingFace Spaces demo](https://lilacai-lilac.hf.space/). Documentation
[here](https://lilacml.com/huggingface/huggingface_spaces.html).

## 🔥 Getting started
### 🌐 Start a webserver

Start a Lilac webserver from the CLI:

@@ -44,42 +65,187 @@ import lilac as ll
ll.start_server(project_dir='~/my_project')
```

This will start a webserver at http://localhost:5432/.
This will start a webserver at http://localhost:5432/ where you can now load datasets and
explore them.

### Run via Docker

We haven't yet published a docker image, but you can build one locally:
Build the image after cloning the repo:

```sh
docker build -t lilac .
```

The docker runs on the virtual port `8000`. If you have an existing lilac data directory you will
have to mount it and set the `LILAC_PROJECT_DIR` environment variable:
The container listens on port `8000`; the command below maps it to port `5432` on the host machine.

If you have an existing lilac project, mount it and set the `LILAC_PROJECT_DIR` environment
variable:

```sh
docker run -it \
  -p 8000:8000 \
  -p 5432:8000 \
  --volume /host/path/to/data:/data \
  -e LILAC_PROJECT_DIR="/data" \
  lilac
```
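
Once the container is running, we can confirm that the mapped port accepts connections. This is a minimal sketch using only the Python standard library; the `is_port_open` helper is a name made up for this example and is not part of Lilac, and it assumes the host port `5432` from the command above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
  """Return True if a TCP connection to (host, port) succeeds."""
  try:
    # create_connection raises OSError if the connection is refused or times out.
    with socket.create_connection((host, port), timeout=timeout):
      return True
  except OSError:
    return False


if __name__ == '__main__':
  # Assumes the container was started with `-p 5432:8000`.
  print(is_port_open('localhost', 5432))
```

A `True` result only shows that something is listening on the port, not that the Lilac UI has finished loading.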

## 📁 Documentation
### 📊 Load data

Datasets can be loaded directly from HuggingFace, CSV, JSON,
[LangSmith from LangChain](https://www.langchain.com/langsmith), SQLite,
[LLamaHub](https://llamahub.ai/), Pandas, Parquet, and more. More documentation
[here](https://lilacml.com/datasets/dataset_load.html).

```python
import lilac as ll

ll.set_project_dir('~/my_project')

config = ll.DatasetConfig(
  namespace='local',
  name='imdb',
  source=ll.HuggingFaceSource(dataset_name='imdb'))

dataset = ll.create_dataset(config)
```

If you prefer, you can load datasets directly from the UI without writing any Python:

<img width="600" alt="image" src="https://github.com/lilacai/lilac/assets/1100749/d5d385ce-f11c-47e6-9c00-ea29983e24f0">

### 🔎 Explore

> [🔗 Try OpenOrca-100K before installing!](https://lilacai-lilac.hf.space/datasets#lilac/OpenOrca-100k)

Once we've loaded a dataset, we can explore it from the UI and get a sense for what's in the data.
More documentation [here](https://lilacml.com/datasets/dataset_explore.html).

<img alt="image" src="docs/_static/dataset/dataset_explore.png">

### ⚡ Annotate with Signals (PII, Text Statistics, Language Detection, Neardup, etc)

Annotating data with signals will produce another column in your data.

```python
import lilac as ll

ll.set_project_dir('~/my_project')

dataset = ll.get_dataset('local', 'imdb')

# [Language detection] Detect the language of each document.
dataset.compute_signal(ll.LangDetectionSignal(), 'text')

# [PII] Find emails, phone numbers, IP addresses, and secrets.
dataset.compute_signal(ll.PIISignal(), 'text')

# [Text Statistics] Compute readability scores, number of chars, TTR, non-ascii chars, etc.
dataset.compute_signal(ll.TextStatisticsSignal(), 'text')

# [Near Duplicates] Computes clusters based on minhash LSH.
dataset.compute_signal(ll.NearDuplicateSignal(), 'text')

# Print the resulting manifest, with the new field added.
print(dataset.manifest())
```
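
To give a feel for the MinHash idea behind the near-duplicate signal, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not Lilac's implementation: the helpers `shingles`, `minhash_signature` and `estimated_jaccard` are made up for this example, and the LSH bucketing step that groups similar signatures into clusters is left out.

```python
import hashlib


def shingles(text, k=3):
  """Return the set of word k-grams in the text."""
  words = text.lower().split()
  return {' '.join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
  """For each seeded hash function, keep the minimum hash value over all items."""
  signature = []
  for seed in range(num_hashes):
    signature.append(min(
      int.from_bytes(hashlib.sha1(f'{seed}:{item}'.encode()).digest()[:8], 'big')
      for item in items))
  return signature


def estimated_jaccard(sig_a, sig_b):
  """The fraction of matching signature slots estimates the Jaccard similarity."""
  matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
  return matches / len(sig_a)


doc_a = 'the movie was great and the acting was superb from start to finish'
doc_b = 'the movie was great and the acting was superb from start to end'
doc_c = 'completely unrelated words about cooking pasta at home tonight'

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print('a vs b:', estimated_jaccard(sig_a, sig_b))
print('a vs c:', estimated_jaccard(sig_a, sig_c))
```

Documents that share most of their shingles agree on most signature slots, so near-duplicates can be found by comparing short signatures instead of full texts.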

We can also compute signals from the UI:

<img width="600" alt="image" src="docs/_static/dataset/dataset_compute_signal_modal.png">

### 🔎 Search

Semantic and conceptual search require computing an embedding first:

```python
dataset.compute_embedding('gte-small', path='text')
```

#### Semantic search

In the UI, we can search by semantic similarity or by classic keyword search to find chunks of
documents similar to a query:

<img width="600" alt="image" src="https://github.com/lilacai/lilac/assets/1100749/4adb603e-8dca-43a3-a492-fd862e194a5a">

<img width="600" alt="image" src="https://github.com/lilacai/lilac/assets/1100749/fdee2127-250b-4e06-9ff9-b1023c03b72f">

Visit our website: [lilacml.com](http://lilacml.com)
We can run the same search in Python:

## 💻 Why Lilac?
```python
rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.SemanticSearch(
      path='text',
      # Example query text; replace it with your own.
      query='a glowing movie review',
      embedding='gte-small')
  ],
  limit=1)

Lilac is a visual tool and a Python API that helps you:
print(list(rows))
```
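
To build intuition for what a semantic search does with the embedding, here is a minimal, self-contained sketch that ranks toy vectors by cosine similarity to a query vector. The three-dimensional vectors and the `rank_by_similarity` helper are made up for illustration and are not part of Lilac's API; real embeddings such as `gte-small` have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
  """Cosine of the angle between two vectors of equal length."""
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(y * y for y in b))
  return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, limit=1):
  """Return the `limit` most similar (doc_id, score) pairs, best first."""
  scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
  scored.sort(key=lambda pair: pair[1], reverse=True)
  return scored[:limit]


# Hypothetical embeddings for three documents.
docs = {
  'review_a': [0.9, 0.1, 0.0],
  'review_b': [0.1, 0.9, 0.2],
  'review_c': [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

top = rank_by_similarity(query, docs, limit=2)
print(top)
```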

#### Conceptual search

Conceptual search is a much more controllable and powerful version of semantic search, where
"concepts" can be taught to Lilac by providing positive and negative examples of that concept.

Lilac provides a set of built-in concepts, but you can create your own for very specif

<img width="600" alt="image" src="https://github.com/lilacai/lilac/assets/1100749/9941024b-7c24-4d87-ae46-925f8da435e1">

We can create a concept in Python with a few examples, and search by it:

```python
db = ll.DiskConceptDB()
db.create(namespace='local', name='spam')
# Add examples of spam and not-spam.
db.edit('local', 'spam', ll.concepts.ConceptUpdate(
  insert=[
    ll.concepts.ExampleIn(label=False, text='This is normal text.'),
    ll.concepts.ExampleIn(label=True, text='asdgasdgkasd;lkgajsdl'),
    ll.concepts.ExampleIn(label=True, text='11757578jfdjja')
  ]
))

# Search by the spam concept.
rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.ConceptSearch(
      path='text',
      # Use the namespace the concept was created in.
      concept_namespace='local',
      concept_name='spam',
      embedding='gte-small')
  ],
  limit=1)

print(list(rows))
```
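
One way to picture how positive and negative examples define a concept is a simple scorer over embedding vectors. The sketch below is a hypothetical nearest-centroid-style scorer on toy two-dimensional vectors; it is an assumption made for illustration, and the model Lilac actually trains for concepts may work differently. The `concept_score` helper is not part of Lilac's API.

```python
def mean_vector(vectors):
  """Component-wise mean of a list of equal-length vectors."""
  count = len(vectors)
  return [sum(column) / count for column in zip(*vectors)]


def concept_score(vec, positives, negatives):
  """Positive when `vec` lies on the positive side of the midpoint between the centroids."""
  pos_centroid = mean_vector(positives)
  neg_centroid = mean_vector(negatives)
  direction = [p - n for p, n in zip(pos_centroid, neg_centroid)]
  midpoint = [(p + n) / 2 for p, n in zip(pos_centroid, neg_centroid)]
  return sum((v - m) * d for v, m, d in zip(vec, midpoint, direction))


# Hypothetical embeddings of labeled examples.
spam_examples = [[1.0, 0.0], [0.8, 0.2]]
normal_examples = [[0.0, 1.0], [0.2, 0.8]]

print(concept_score([0.9, 0.2], spam_examples, normal_examples))
print(concept_score([0.1, 0.7], spam_examples, normal_examples))
```

Adding more examples moves the centroids, which is the sense in which a concept can be refined by giving feedback.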

- **Explore** datasets with natural language (e.g. documents)
- **Enrich** your dataset with metadata (e.g. PII detection, profanity, text statistics, etc.)
- Conceptually **search** and tag your data (e.g. find paragraphs about injury)
- **Remove** unwanted or problematic data based on your own criteria
- **Analyze** patterns in your data
### 🏷️ Labeling

Lilac allows you to label individual points or slices of data:

<img width="600" alt="image" src="docs/_static/dataset/dataset_add_label_tag.png">

We can also label all data that matches a filter. In this case, we add the label "short" to all text
with a small number of characters. The character count was produced by the `text_statistics` signal.

<img width="600" alt="image" src="docs/_static/dataset/dataset_add_label_all_short.png">

We can do the same in Python:

```python
dataset.add_labels(
  'short',
  filters=[
    (('text', 'text_statistics', 'num_characters'), 'less', 1000)
  ]
)
```
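
To make the filter tuple concrete, here is a plain-Python sketch of what a `(path, operator, value)` filter does during bulk labeling. It runs on ordinary dictionaries and is not Lilac's implementation: the row layout, the `label_rows` helper, and the operator names other than `'less'` are assumptions for illustration.

```python
import operator

# Only 'less' appears in the example above; the other names are hypothetical.
OPS = {'less': operator.lt, 'greater': operator.gt, 'equals': operator.eq}


def get_path(row, path):
  """Follow a tuple of keys into a nested dictionary."""
  value = row
  for key in path:
    value = value[key]
  return value


def label_rows(rows, label, filters):
  """Attach `label` to every row that satisfies all filters; return the number labeled."""
  labeled = 0
  for row in rows:
    if all(OPS[op](get_path(row, path), target) for path, op, target in filters):
      row.setdefault('labels', []).append(label)
      labeled += 1
  return labeled


# Hypothetical rows that already carry a character count.
rows = [
  {'text': {'value': 'Great movie!', 'text_statistics': {'num_characters': 12}}},
  {'text': {'value': '(a very long review)', 'text_statistics': {'num_characters': 5000}}},
]

count = label_rows(
  rows, 'short', filters=[(('text', 'text_statistics', 'num_characters'), 'less', 1000)])
print(count, [row.get('labels') for row in rows])
```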

Lilac runs completely **on device** using powerful open-source LLM technologies.
Labels can be exported for downstream tasks. Detailed documentation
[here](https://lilacml.com/datasets/dataset_labels.html).

## 💬 Contact

3 changes: 2 additions & 1 deletion run_server_prod.sh
@@ -6,7 +6,8 @@ set -e
./scripts/build_server_prod.sh

# Run the node server.
poetry run uvicorn lilac.server:app --port 5432 --host 0.0.0.0 &
export LILAC_PROJECT_DIR='./data'
poetry run uvicorn lilac.server:app --port 6432 --host 0.0.0.0 &
pid[0]=$!

# When control+c is pressed, kill all process ids.