Modernize the readme. (#751)

https://github.com/lilacai/lilac/blob/nik-readme/README.md
databricks · Oct 9, 2023 · 92af135
1 parent 3918849
commit 92af135
Showing 2 changed files with 205 additions and 38 deletions.
diff --git a/README.md b/README.md
@@ -1,34 +1,55 @@
-# 🌸 Lilac
-
-[![Static Badge](https://img.shields.io/badge/Homepage-8A2BE2?link=http%3A%2F%2Flilacml.com%2F)](https://lilacml.com)
-[![Downloads](https://static.pepy.tech/badge/lilac/month)](https://pepy.tech/project/lilac)
-[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-[![Twitter](https://img.shields.io/twitter/follow/lilac_ai)](https://twitter.com/lilac_ai)
-[![](https://dcbadge.vercel.app/api/server/jNzw9mC8pp?compact=true&style=flat)](https://discord.gg/jNzw9mC8pp)
-
-> **NEW: Try the [Lilac hosted demo with pre-loaded datasets](https://lilacai-lilac.hf.space/)**
-
-## 👋 Welcome
-
-[Lilac](http://lilacml.com) is an open-source product that helps you **analyze**, **structure**, and
-**clean** unstructured data with AI.
-
-Lilac can be used from our UI or from Python.
+<h1 align="center">🌸 Lilac</h1>
+<h3 align="center" style="font-size: 20px; margin-bottom: 4px">Curate better data for LLMs</h3>
+<p align="center">
+  <a style="padding: 4px;"  href="https://lilacai-lilac.hf.space/">
+    <span style="margin-right: 4px; font-size: 12px">🔗</span> <span style="font-size: 14px">Try the Lilac web demo!</span>
+  </a>
+  <br/><br/>
+  <a href="https://lilacml.com/">
+        <img alt="Site" src="https://img.shields.io/badge/Site-lilacml.com-ed2dd0?link=https%3A%2F%2Flilacml.com"/>
+    </a>
+    <a href="https://discord.gg/jNzw9mC8pp">
+        <img alt="Discord" src="https://img.shields.io/badge/Join-important.svg?color=ed2dd0&label=Discord&logo=slack" />
+    </a>
+    <a href="https://github.com/lilacai/lilac/blob/main/LICENSE">
+          <img alt="License Apache 2.0" src="https://img.shields.io/badge/License-Apache 2.0-blue.svg?style=flat&color=ed2dd0" height="20" width="auto">
+    </a>
+    <br/>
+    <a href="https://github.com/lilacai/lilac">
+      <img src="https://img.shields.io/github/stars/lilacai/lilac?style=social" />
+    </a>
+    <a href="https://twitter.com/lilac_ai">
+      <img src="https://img.shields.io/twitter/follow/lilac_ai" alt="Follow on Twitter" />
+    </a>
+</p>
+
+Lilac helps you **curate data** for LLMs, from RAGs to fine-tuning datasets.
+
+Lilac runs **on-device** using open-source LLMs with a UI and Python API for:
+
+- **Exploring** datasets with natural language (documents)
+- **Annotating & structuring** data (e.g. PII detection, profanity, text statistics)
+- **Semantic search** to find results similar to a query
+- **Conceptual search** to find and tag results that match a fuzzy concept (e.g. low command of
+  English language)
+- **Clustering** data semantically for understanding & deduplication
+- **Labeling** and **Bulk Labeling** to curate data
 
 https://github.com/lilacai/lilac/assets/2294279/cb1378f8-92c1-4f2a-9524-ce5ddd8e0c53
 
-## 💻 Install
+## 🔥 Getting started
 
-To install Lilac on your machine:
+### 💻 Install
 
 ```sh
-pip install lilac
+pip install lilac[all]
 ```
 
-You can also use Lilac with no installation by
-[forking our public HuggingFace Spaces demo](https://lilacai-lilac.hf.space/).
+If you prefer no local installation, you can
+[fork the HuggingFace Spaces demo](https://lilacai-lilac.hf.space/). Documentation is
+[here](https://lilacml.com/huggingface/huggingface_spaces.html).
 
-## 🔥 Getting started
+### 🌐 Start a webserver
 
 Start a Lilac webserver from the CLI:
 
@@ -44,42 +65,187 @@ import lilac as ll
 ll.start_server(project_dir='~/my_project')
 ```
 
-This will open start a webserver at http://localhost:5432/.
+This will start a webserver at http://localhost:5432/ where you can load datasets and
+explore them.
 
 ### Run via Docker
 
-We haven't yet published a docker image, but you can build one locally:
+Build the image after cloning the repo:
 
 ```sh
 docker build -t lilac .
 ```
 
-The docker runs on the virtual port `8000`. If you have an existing lilac data direcotry you will
-have to mount it and set the `LILAC_PROJECT_DIR` environment variable:
+The container listens on port `8000`; the command below maps it to port `5432` on the host.
+
+If you have an existing Lilac project, mount it and set the `LILAC_PROJECT_DIR` environment
+variable:
 
 ```sh
 docker run -it \
-  -p 8000:8000 \
+  -p 5432:8000 \
   --volume /host/path/to/data:/data \
   -e LILAC_PROJECT_DIR="/data" \
   lilac
 ```
 
-## 📁 Documentation
+### 📊 Load data
+
+Datasets can be loaded directly from HuggingFace, CSV, JSON,
+[LangSmith from LangChain](https://www.langchain.com/langsmith), SQLite,
+[LLamaHub](https://llamahub.ai/), Pandas, Parquet, and more. More documentation
+[here](https://lilacml.com/datasets/dataset_load.html).
+
+```python
+import lilac as ll
+
+ll.set_project_dir('~/my_project')
+
+config = ll.DatasetConfig(
+  namespace='local',
+  name='imdb',
+  source=ll.HuggingFaceSource(dataset_name='imdb'))
+
+dataset = ll.create_dataset(config)
+```
+
+If you prefer, you can load datasets directly from the UI without writing any Python:
+
+<img width="600" alt="image" src="https://github.com/lilacai/lilac/assets/1100749/d5d385ce-f11c-47e6-9c00-ea29983e24f0">
+
+### 🔎 Explore
+
+> [🔗 Try OpenOrca-100K before installing!](https://lilacai-lilac.hf.space/datasets#lilac/OpenOrca-100k)
+
+Once we've loaded a dataset, we can explore it from the UI and get a sense for what's in the data.
+More documentation [here](https://lilacml.com/datasets/dataset_explore.html).
+
+<img alt="image" src="docs/_static/dataset/dataset_explore.png">
+
+### ⚡ Annotate with Signals (PII, Text Statistics, Language Detection, Neardup, etc)
+
+Annotating data with signals will produce another column in your data.
+
+```python
+import lilac as ll
+
+ll.set_project_dir('~/my_project')
+
+dataset = ll.get_dataset('local', 'imdb')
+
+# [Language detection] Detect the language of each document.
+dataset.compute_signal(ll.LangDetectionSignal(), 'text')
+
+# [PII] Find emails, phone numbers, ip addresses, and secrets.
+dataset.compute_signal(ll.PIISignal(), 'text')
+
+# [Text Statistics] Compute readability scores, number of chars, TTR, non-ascii chars, etc.
+dataset.compute_signal(ll.TextStatisticsSignal(), 'text')
+
+# [Near Duplicates] Computes clusters based on minhash LSH.
+dataset.compute_signal(ll.NearDuplicateSignal(), 'text')
+
+# Print the resulting manifest, with the new field added.
+print(dataset.manifest())
+```
+
+We can also compute signals from the UI:
+
+<img width="600" alt="image" src="docs/_static/dataset/dataset_compute_signal_modal.png">
+
+### 🔎 Search
+
+Semantic and conceptual search both require computing an embedding first:
+
+```python
+dataset.compute_embedding('gte-small', path='text')
+```
+
+#### Semantic search
+
+In the UI, we can search by semantic similarity or by classic keyword search to find chunks of
+documents similar to a query:
+
+<img width="600" alt="image" src="https://github.com/lilacai/lilac/assets/1100749/4adb603e-8dca-43a3-a492-fd862e194a5a">
+
+<img width="600" alt="image" src="https://github.com/lilacai/lilac/assets/1100749/fdee2127-250b-4e06-9ff9-b1023c03b72f">
 
-Visit our website: [lilacml.com](http://lilacml.com)
+We can run the same search in Python:
 
-## 💻 Why Lilac?
+```python
+rows = dataset.select_rows(
+  columns=['text', 'label'],
+  searches=[
+    ll.SemanticSearch(
+      path='text', query='great acting',  # Example query (illustrative value).
+      embedding='gte-small')
+  ],
+  limit=1)
 
-Lilac is a visual tool and a Python API that helps you:
+print(list(rows))
+```
+
+#### Conceptual search
+
+Conceptual search is a much more controllable and powerful version of semantic search, where
+"concepts" can be taught to Lilac by providing positive and negative examples of that concept.
+
+Lilac provides a set of built-in concepts, but you can also create your own.
+
+<img width="600" alt="image" src="https://github.com/lilacai/lilac/assets/1100749/9941024b-7c24-4d87-ae46-925f8da435e1">
+
+We can create a concept in Python with a few examples, and search by it:
+
+```python
+db = ll.DiskConceptDB()
+db.create(namespace='local', name='spam')
+# Add examples of spam and not-spam.
+db.edit('local', 'spam', ll.concepts.ConceptUpdate(
+  insert=[
+    ll.concepts.ExampleIn(label=False, text='This is normal text.'),
+    ll.concepts.ExampleIn(label=True, text='asdgasdgkasd;lkgajsdl'),
+    ll.concepts.ExampleIn(label=True, text='11757578jfdjja')
+  ]
+))
+
+# Search by the spam concept.
+rows = dataset.select_rows(
+  columns=['text', 'label'],
+  searches=[
+    ll.ConceptSearch(
+      path='text',
+      concept_namespace='local',
+      concept_name='spam',
+      embedding='gte-small')
+  ],
+  limit=1)
+
+print(list(rows))
+```
 
-- **Explore** datasets with natural language (e.g. documents)
-- **Enrich** your dataset with metadata (e.g. PII detection, profanity, text statistics, etc.)
-- Conceptually **search** and tag your data (e.g. find paragraphs about injury)
-- **Remove** unwanted or problematic data based on your own criteria
-- **Analyze** patterns in your data
+### 🏷️ Labeling
+
+Lilac allows you to label individual points, or slices of data:
+<img width="600" alt="image" src="docs/_static/dataset/dataset_add_label_tag.png">
+
+We can also label all data matching a filter. Here, we add the label "short" to all text with a
+small number of characters. The character count comes from the automatic `text_statistics` signal.
+
+<img width="600" alt="image" src="docs/_static/dataset/dataset_add_label_all_short.png">
+
+We can do the same in Python:
+
+```python
+dataset.add_labels(
+  'short',
+  filters=[
+    (('text', 'text_statistics', 'num_characters'), 'less', 1000)
+  ]
+)
+```
 
-Lilac runs completely **on device** using powerful open-source LLM technologies.
+Labels can be exported for downstream tasks. Detailed documentation
+[here](https://lilacml.com/datasets/dataset_labels.html).
 
 ## 💬 Contact
 

diff --git a/run_server_prod.sh b/run_server_prod.sh
@@ -6,7 +6,8 @@ set -e
 ./scripts/build_server_prod.sh
 
 # Run the node server.
-poetry run uvicorn lilac.server:app --port 5432 --host 0.0.0.0 &
+export LILAC_PROJECT_DIR='./data'
+poetry run uvicorn lilac.server:app --port 6432 --host 0.0.0.0 &
 pid[0]=$!
 
 # When control+c is pressed, kill all process ids.