
update READMEs
enjalot committed Feb 15, 2024
1 parent da67515 commit ea7aa60
Showing 6 changed files with 163 additions and 64 deletions.
2 changes: 0 additions & 2 deletions .env.example
@@ -4,6 +4,4 @@ VOYAGE_API_KEY=XXX
TOGETHER_API_KEY=XXX
COHERE_API_KEY=XXX
MISTRAL_API_KEY=XXX
ANTHROPIC_API_KEY=XXX
GOOGLE_API_KEY=XXX
LATENT_SCOPE_DATA='~/latent-scope-data'
72 changes: 61 additions & 11 deletions DEVELOPMENT.md
@@ -2,15 +2,13 @@

If you are interested in customizing or contributing to latent-scope, this document explains how to develop it.


## Python module
The `latentscope` directory contains the python source code for the [latent scope pip module](https://pypi.org/project/latentscope/). There are four primary parts to the module:

1. server - contains the Flask app and API routes
2. scripts - command line scripts for each step of the process
3. models - configuration and provider code for embedding and chat models
4. util - configuration helpers for the data directory and API keys

### Running locally
If you modify the python code and want to try out your changes, you can run locally like so:
@@ -24,7 +22,7 @@ ls-serve ~/latent-scope-data


## Web client
The `web` directory contains the JavaScript React source code for the web interface. Node.js must be installed on your system to run the development server or build a new version of the module.

```
cd web
npm install
npm run dev
```

This will call the local API at http://localhost:5001 as set in `web/.env.development`.


## Building for distribution
You can build a new version of the module; this will also package the latest version of the web interface.

```
python setup.py sdist bdist_wheel
```

@@ -53,8 +51,60 @@ pip install dist/latentscope-0.1.0-py3-none-any.whl


# Python Code

## Configuration
`latentscope/util`

The module uses `dotenv` to save important configuration environment variables. The most important variable is `DATA_DIR`, which determines where the input data is stored as well as all of the output from each step in the process.

API keys for the various proprietary model APIs are also stored in the .env file created by `dotenv`.
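As a rough sketch of how resolving the data directory could work, consider the following hypothetical helper (it only consults the process environment using the `LATENT_SCOPE_DATA` variable from `.env.example`; the real code in `latentscope/util` reads a `.env` file via `dotenv` and may differ):

```python
import os

def get_data_dir(default="~/latent-scope-data"):
    """Resolve the data directory from the environment, falling back to a default.

    Hypothetical sketch: the real latentscope/util code reads a .env file via
    dotenv; here we only consult the process environment."""
    raw = os.environ.get("LATENT_SCOPE_DATA", default)
    return os.path.expanduser(raw)
```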

## App
`latentscope/server/`
The Flask app that runs the API and hosts the web UI has multiple components. The main setup is in `app.py`, while `datasets.py`, `search.py`, and `tags.py` provide specific routes. `jobs.py` is explained below.

## Jobs
`latentscope/server/jobs.py`
The web UI runs each step of the process by kicking off a subprocess that calls the corresponding command line script. The progress of the subprocess is captured and saved in a job JSON file, which is updated continuously until the process completes or errors. This allows the web UI to poll and display the status of the commands.
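The pattern described above can be sketched like this (an illustrative simplification, not the actual `jobs.py` implementation; the JSON field names are assumptions):

```python
import json
import os
import subprocess

def run_job(job_id, command, jobs_dir):
    """Run one step of the process as a subprocess and record its status in a
    job JSON file that a web UI could poll. Illustrative sketch only; the real
    jobs.py implementation and its JSON fields will differ."""
    path = os.path.join(jobs_dir, f"{job_id}.json")

    def write(status, progress):
        with open(path, "w") as f:
            json.dump({"id": job_id, "command": command,
                       "status": status, "progress": progress}, f)

    write("running", [])
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    progress = []
    for line in proc.stdout:
        progress.append(line.rstrip())
        write("running", progress)  # update the file as output arrives
    proc.wait()
    write("completed" if proc.returncode == 0 else "error", progress)
    return path
```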

## Models
`latentscope/models`
The code to run models or call APIs is centralized here. The idea is to provide a uniform interface for embedding and another for summarization, and to allow the configuration of each model (context length, truncation, etc.) to be specified in a single JSON file. We can then use the JSON file to power UI choices about which model to use in the process.

### Embedding models
Embedding models are prepared in [latentscope/models/embedding_models.json](latentscope/models/embedding_models.json).
There is a `get_embedding_model(id)` function which will load the appropriate class based on the model provider. See `providers/` for the `transformers`, `openai`, `cohereai`, `togetherai`, and `voyageai` implementations.
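A minimal sketch of this dispatch pattern, with a stand-in provider class and a made-up inline models JSON (the real entries, class names, and signatures live in `latentscope/models` and will differ):

```python
import json

# A single made-up entry in the style of embedding_models.json; the real file
# contains the actual model ids and parameters.
MODELS = json.loads("""[
  {"id": "transformers-intfloat___e5-small-v2",
   "provider": "transformers",
   "name": "intfloat/e5-small-v2",
   "params": {"max_tokens": 512}}
]""")

class LocalEmbedder:
    """Stand-in for a provider class from providers/ (hypothetical)."""
    def __init__(self, name, params):
        self.name = name
        self.params = params

    def embed(self, texts):
        # a real provider would return model embeddings, not zeros
        return [[0.0, 0.0, 0.0] for _ in texts]

PROVIDERS = {"transformers": LocalEmbedder}

def get_embedding_model(model_id, models=MODELS):
    """Look up a model entry by id and instantiate its provider class."""
    entry = next(m for m in models if m["id"] == model_id)
    return PROVIDERS[entry["provider"]](entry["name"], entry.get("params", {}))
```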

### Chat models
Chat models for summarization of clusters are prepared in [latentscope/models/chat_models.json](latentscope/models/chat_models.json).
There is a `get_chat_model(id)` function which will load the appropriate class based on the model provider. Each provider can support a chat model if the interface is implemented. Adding more chat providers should be relatively straightforward and is still a TODO item.

## Scripts
The scripts are outlined in the [README.md](README.md). Each one provides a python interface as well as a command line interface for running its part of the process. They all use ids to read relevant data from disk and then output any relevant information to disk.

The idea is that each step in the process may be run with many different parameters, or may need to be rerun, and you shouldn't have to wonder what you did before.
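One way to sketch the incrementing id convention (`umap-001`, `umap-002`, ...) that keeps reruns from clobbering each other (a hypothetical helper; the real implementation may differ):

```python
import os
import re

def next_run_id(directory, prefix):
    """Compute the next zero-padded run id (umap-001, umap-002, ...) by
    scanning existing files. Hypothetical helper illustrating the naming
    convention; not the actual latentscope code."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)")
    highest = 0
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return f"{prefix}-{highest + 1:03d}"
```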


# React code
The React application that powers the web UI is split up into pages and components. It could certainly use some refactoring, and there is a lot of development planned around the UI as the underlying data is solidified.

## Pages

### Home
See list of datasets and scopes

### Setup
Setup scopes

### Explore
Explore scopes

### Jobs
Show list of jobs that have run for a dataset

### Job
Follow a specific job while it's running, or rerun a job that errored.

### Mobile
Mobile is currently unsupported as the regl-scatterplot component that powers the scatter plots doesn't work well on Android or at all on iOS.
109 changes: 66 additions & 43 deletions README.md
@@ -3,16 +3,16 @@
Quickly embed, project, cluster and explore a dataset. This project is a new kind of workflow + tool for visualizing and exploring datasets through the lens of latent spaces.

### Demo
This tool is meant to be run locally or on a trusted server to process data for viewing in the latent scope. You can see the result of the process in [live demos](https://enjalot.github.io/latent-scope):
* [datavis survey responses](https://enjalot.github.io/latent-scope/#/datasets/datavis-misunderstood/explore/scopes-001) - 700 survey responses
* [Dolly 15k](https://enjalot.github.io/latent-scope/#/datasets/dolly15k/explore/scopes-001) - 15k instructions
* [r/DadJokes](https://enjalot.github.io/latent-scope/#/datasets/dadabase/explore/scopes-004) - 50k dad jokes
* [emotion](https://enjalot.github.io/latent-scope/#/datasets/emotion/explore/scopes-001) - 400k emotion statements from Twitter

The source of each demo dataset is documented in the notebooks linked below. Each demo was chosen to represent different scales of data as well as some common use cases.

### Quick Start
To get started, install the [latent-scope module](https://pypi.org/project/latentscope/) and run the server from the command line:

```bash
python -m venv venv
pip install latent-scope
ls-serve ~/local-scope-data
```

Then open your browser to http://localhost:5001 and start processing your first dataset!

### Python interface
You can also ingest data from a Pandas dataframe using the Python interface:
```python
import pandas as pd
from latentscope import ls

df = pd.read_parquet("...")
ls.init("~/latent-scope-data")
ls.ingest("dadabase", df, text_column="joke")
ls.serve()
```
See the notebooks linked below for detailed examples of using the Python interface.

### Notebooks
See these notebooks for detailed examples of using the Python interface to prepare and load data.
* [dvs-survey](notebooks/dvs-survey.ipynb) - A small test dataset of 700 rows to quickly illustrate the process. This notebook shows how you can do every step of the process with the Python interface.
* [dadabase](notebooks/dadabase.ipynb) - A more interesting (and funny) dataset of 50k rows. This notebook shows how you can preprocess a dataset, ingest it into latentscope and then use the web interface to complete the process.
* [dolly15k](notebooks/dolly15k.ipynb) - Grab data from HuggingFace datasets and ingest into the process.
* [emotion](notebooks/emotion.ipynb) - 400k rows of emotional tweets.

### Command line quick start
When latent-scope is installed, it creates a suite of command line scripts that can be used to set up the scopes for exploring in the web application. Each step in the process outputs flat files to the data directory specified at init. These files are in standard formats that were designed to be ported into other pipelines or interfaces.

```bash
# like above, we make sure to install latent-scope
ls-umap datavis-misunderstood embedding-001 25 .1
# ls-cluster dataset_id umap_id samples min_samples
ls-cluster datavis-misunderstood umap-001 5 5
# ls-label dataset_id text_column cluster_id model_id context
ls-label datavis-misunderstood "answer" cluster-001 transformers-HuggingFaceH4___zephyr-7b-beta ""
# ls-scope dataset_id embedding_id umap_id cluster_id cluster_labels_id label description
ls-scope datavis-misunderstood embedding-001 umap-001 cluster-001 cluster-001-labels-001 "E5 demo" "E5 embeddings summarized by Zephyr 7B"
# start the server to explore your scope
ls-serve ~/latent-scope-data
```

This tool is meant to be a part of a larger process. Something that hopefully he
- We consider an input dataset the source of truth, a list of rows that can be indexed into. So all downstream operations, whether it's embeddings, pointing to nearest neighbors, or assigning data points to clusters, all use indices into the input dataset.


## Command Line Scripts: Detailed description
If you want to use the CLI instead of the web UI you can use the following scripts.

The scripts should be run in order once you have an `input.csv` file in your folder. Alternatively, the Setup page in the web UI will run these scripts for you via API calls to the server.
@@ -104,60 +106,81 @@ ls-ingest database-curated
Take the text from the input and embed it. Default is to use `BAAI/bge-small-en-v1.5` locally via HuggingFace transformers. API services are supported as well, see [latentscope/models/embedding_models.json](latentscope/models/embedding_models.json) for model ids.

```bash
# you can get a list of models available with:
ls-list-models
# ls-embed <dataset_name> <text_column> <model_id>
ls-embed dadabase joke transformers-intfloat___e5-small-v2
```

### 2. umap
Map the embeddings from high-dimensional space to 2D with UMAP. Will generate a thumbnail of the scatterplot.
```bash
# ls-umap <dataset_name> <embedding_id> <neighbors> <min_dist>
ls-umap dadabase embedding-001 50 0.1
```


### 3. cluster
Cluster the UMAP points using HDBSCAN. This will label each point with a cluster label.
```bash
# ls-cluster <dataset_name> <umap_id> <samples> <min-samples>
ls-cluster dadabase umap-001 5 3
```

### 4. label
We support auto-labeling clusters by summarizing them with an LLM. Supported models and APIs are listed in [latentscope/models/chat_models.json](latentscope/models/chat_models.json).
You can pass context that will be injected into the system prompt for your dataset.
```bash
# ls-label <dataset_id> <text_column> <cluster_id> <chat_model_id> <context>
ls-label dadabase "joke" cluster-001 openai-gpt-3.5-turbo ""
```
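To illustrate how the context might be injected into the system prompt, here is a hedged sketch (the actual prompt wording used by `ls-label` lives in the latentscope source and will differ):

```python
def build_label_prompt(cluster_texts, context="", max_samples=10):
    """Assemble chat messages asking an LLM to label one cluster.

    Hedged sketch of the idea only: the actual prompt wording used by
    ls-label is defined in the latentscope source and will differ."""
    system = "You label clusters of text items with a short, descriptive name."
    if context:
        system += " Additional context from the user: " + context
    samples = "\n".join(f"- {t}" for t in cluster_texts[:max_samples])
    user = (f"Here are items from one cluster:\n{samples}\n"
            "Reply with a short label for the cluster.")
    return system, user
```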

### 5. scope
The scope command ties together each step of the process to create an explorable configuration. You can have several scopes to view different choices, for example using different embeddings or even different parameters for UMAP and clustering. Switching between scopes in the UI is instant.

```bash
# ls-scope <dataset_id> <embedding_id> <umap_id> <cluster_id> <cluster_labels_id> <label> <description>
ls-scope datavis-misunderstood embedding-001 umap-001 cluster-001 cluster-001-labels-001 "E5 demo" "E5 embeddings summarized by GPT3.5-Turbo"
```
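A sketch of what writing a scope file might look like, tying one id from each step into a single JSON (the field names here are illustrative assumptions, not the real schema):

```python
import json
import os

def write_scope(directory, scope_id, embedding_id, umap_id, cluster_id,
                cluster_labels_id, label, description):
    """Write a scopes-00N.json tying together one choice from each step.

    Field names here are illustrative assumptions; the real scope files
    written by ls-scope may use different keys."""
    scope = {
        "id": scope_id,
        "embedding_id": embedding_id,
        "umap_id": umap_id,
        "cluster_id": cluster_id,
        "cluster_labels_id": cluster_labels_id,
        "label": label,
        "description": description,
    }
    path = os.path.join(directory, f"{scope_id}.json")
    with open(path, "w") as f:
        json.dump(scope, f, indent=2)
    return path
```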

### 6. serve
To start the web UI we run a small server. This also enables nearest neighbor similarity search and interactively querying subsets of the input data while exploring the scopes.

```bash
ls-serve ~/latent-scope-data
```
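To illustrate the idea behind the nearest neighbor similarity search, here is a brute-force cosine-similarity sketch (the server's actual implementation is not shown in this document and presumably uses optimized vector operations over the stored embeddings):

```python
import math

def nearest_neighbors(query, vectors, k=3):
    """Return the indices of the k vectors most similar to the query by
    cosine similarity. Pure-Python sketch of the concept only; a real
    server would use vectorized math over the stored embeddings."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]
```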


## Dataset directory structure
Each dataset will have its own directory in data/ created when you ingest your CSV. All subsequent steps of setting up a dataset write their data and metadata to this directory.
There are no databases in this tool, just flat files that are easy to copy and edit.
<pre>
├── data/
| ├── dataset1/
| | ├── input.parquet # from ingest.py, the dataset
| | ├── meta.json # from ingest.py, metadata for dataset, #rows, columns, text_column
| | ├── embeddings/
| | | ├── embedding-001.h5 # from embed.py, embedding vectors
| | | ├── embedding-001.json # from embed.py, parameters used to embed
| | | ├── embedding-002...
| | ├── umaps/
| | | ├── umap-001.parquet # from umap.py, x,y coordinates
| | | ├── umap-001.json # from umap.py, params used
| | | ├── umap-001.png # from umap.py, thumbnail of plot
| | | ├── umap-002....
| | ├── clusters/
| | | ├── clusters-001.parquet # from cluster.py, cluster indices
| | | ├── clusters-001-labels-default.parquet # from cluster.py, default labels
| | | ├── clusters-001-labels-001.parquet # from label_clusters.py, LLM generated labels
| | | ├── clusters-001.json # from cluster.py, params used
| | | ├── clusters-001.png # from cluster.py, thumbnail of plot
| | | ├── clusters-002...
| | ├── scopes/
| | | ├── scopes-001.json # from scope.py, combination of embed, umap, clusters and label choice
| | | ├── scopes-...
| | ├── tags/
| | | ├── ❤️.indices # tagged by UI, powered by tags.py
| | | ├── ... # can have arbitrary named tags
| | ├── jobs/
| | | ├── 8980️-12345...json # created when job is run via web UI
</pre>
30 changes: 29 additions & 1 deletion notebooks/dolly15k.ipynb
@@ -10,14 +10,42 @@
"import latentscope as ls"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# https://huggingface.co/datasets/databricks/databricks-dolly-15k\n",
"from datasets import load_dataset\n",
"dataset = load_dataset(\"databricks/databricks-dolly-15k\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dataset[\"train\"].to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.shape"
]
},
{
2 changes: 2 additions & 0 deletions notebooks/dvs-survey.ipynb
@@ -17,6 +17,8 @@
"metadata": {},
"outputs": [],
"source": [
"# this dataset is extracted from the Data Visualization Society annual survey 2019\n",
"# https://github.com/data-visualization-society/data_visualization_survey/tree/master/data\n",
"url = \"https://storage.googleapis.com/fun-data/latent-scope/examples/dvs-survey/datavis-misunderstood.csv\"\n",
"df = pd.read_csv(url)"
]
12 changes: 5 additions & 7 deletions web/README.md
@@ -1,8 +1,6 @@
# Web UI

```bash
npm install
npm run dev
```
