
update READMEs
enjalot committed Feb 15, 2024
1 parent da67515 commit ea7aa60
Showing 6 changed files with 163 additions and 64 deletions.
2 changes: 0 additions & 2 deletions .env.example
@@ -4,6 +4,4 @@ VOYAGE_API_KEY=XXX
TOGETHER_API_KEY=XXX
COHERE_API_KEY=XXX
MISTRAL_API_KEY=XXX
ANTHROPIC_API_KEY=XXX
GOOGLE_API_KEY=XXX
LATENT_SCOPE_DATA='~/latent-scope-data'
72 changes: 61 additions & 11 deletions DEVELOPMENT.md
@@ -2,15 +2,13 @@

If you are interested in customizing or contributing to latent-scope, this document explains how to develop it.


## Python module
The `latentscope` directory contains the python source code for the [latent scope pip module](https://pypi.org/project/latentscope/). There are four primary parts to the module:

1. server - contains the Flask app and API routes
2. scripts - command line scripts for each step of the process
3. models - configuration and provider code for embedding and chat models
4. util - configuration helpers for the data directory and API keys

### Running locally
If you modify the python code and want to try out your changes, you can run locally like so:
@@ -24,7 +22,7 @@ ls-serve ~/latent-scope-data


## Web client
The `web` directory contains the JavaScript React source code for the web interface. Node.js must be installed on your system to run the development server or build a new version of the module.

```
cd web
npm install
npm run dev
```

This will call the local API at http://localhost:5001 as set in `web/.env.development`.


## Building for distribution
You can build a new version of the module; this will also package the latest version of the web interface.

```
python setup.py sdist bdist_wheel
```

@@ -53,8 +51,60 @@ pip install dist/latentscope-0.1.0-py3-none-any.whl


# Python Code

## Configuration
`latentscope/util`

The module uses `dotenv` to save important configuration environment variables. The most important variable is `DATA_DIR`, which determines where the input data is stored as well as all of the output from each step in the process.

API keys for the various proprietary model APIs are also stored in the .env file created by `dotenv`.
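As a rough sketch of how resolving the data directory could work, consider the following hypothetical helper (it only consults the process environment using the `LATENT_SCOPE_DATA` variable from `.env.example`; the real code in `latentscope/util` reads a `.env` file via `dotenv` and may differ):

```python
import os

def get_data_dir(default="~/latent-scope-data"):
    """Resolve the data directory from the environment, falling back to a default.

    Hypothetical sketch: the real latentscope/util code reads a .env file via
    dotenv; here we only consult the process environment."""
    raw = os.environ.get("LATENT_SCOPE_DATA", default)
    return os.path.expanduser(raw)
```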

## App
`latentscope/server/`
The Flask app that runs the API and hosts the web UI has multiple components. The main setup is in `app.py`, while `datasets.py`, `search.py`, and `tags.py` provide specific routes. `jobs.py` is explained below.

## Jobs
`latentscope/server/jobs.py`
The web UI runs each step of the process by kicking off a subprocess that calls the corresponding command line script. The progress of the subprocess is captured and saved in a job JSON file, which is updated continuously until the process completes or errors. This allows the web UI to poll and display the status of the commands.
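The pattern described above can be sketched like this (an illustrative simplification, not the actual `jobs.py` implementation; the JSON field names are assumptions):

```python
import json
import os
import subprocess

def run_job(job_id, command, jobs_dir):
    """Run one step of the process as a subprocess and record its status in a
    job JSON file that a web UI could poll. Illustrative sketch only; the real
    jobs.py implementation and its JSON fields will differ."""
    path = os.path.join(jobs_dir, f"{job_id}.json")

    def write(status, progress):
        with open(path, "w") as f:
            json.dump({"id": job_id, "command": command,
                       "status": status, "progress": progress}, f)

    write("running", [])
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    progress = []
    for line in proc.stdout:
        progress.append(line.rstrip())
        write("running", progress)  # update the file as output arrives
    proc.wait()
    write("completed" if proc.returncode == 0 else "error", progress)
    return path
```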

## Models
`latentscope/models`
The code to run models or call APIs is centralized here. The idea is to provide a uniform interface for embedding and another for summarization, and to allow the configuration of each model (context length, truncation, etc.) to be specified in a single JSON file. We can then use the JSON file to power UI choices about which model to use in the process.

### Embedding models
Embedding models are prepared in [latentscope/models/embedding_models.json](latentscope/models/embedding_models.json).
There is a `get_embedding_model(id)` function which will load the appropriate class based on the model provider. See `providers/` for the `transformers`, `openai`, `cohereai`, `togetherai`, and `voyageai` implementations.
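A minimal sketch of this dispatch pattern, with a stand-in provider class and a made-up inline models JSON (the real entries, class names, and signatures live in `latentscope/models` and will differ):

```python
import json

# A single made-up entry in the style of embedding_models.json; the real file
# contains the actual model ids and parameters.
MODELS = json.loads("""[
  {"id": "transformers-intfloat___e5-small-v2",
   "provider": "transformers",
   "name": "intfloat/e5-small-v2",
   "params": {"max_tokens": 512}}
]""")

class LocalEmbedder:
    """Stand-in for a provider class from providers/ (hypothetical)."""
    def __init__(self, name, params):
        self.name = name
        self.params = params

    def embed(self, texts):
        # a real provider would return model embeddings, not zeros
        return [[0.0, 0.0, 0.0] for _ in texts]

PROVIDERS = {"transformers": LocalEmbedder}

def get_embedding_model(model_id, models=MODELS):
    """Look up a model entry by id and instantiate its provider class."""
    entry = next(m for m in models if m["id"] == model_id)
    return PROVIDERS[entry["provider"]](entry["name"], entry.get("params", {}))
```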

### Chat models
Chat models for summarization of clusters are prepared in [latentscope/models/chat_models.json](latentscope/models/chat_models.json).
There is a `get_chat_model(id)` function which will load the appropriate class based on the model provider. Each provider can support a chat model if the interface is implemented. Adding more chat providers should be relatively straightforward and is still a TODO item.

## Scripts
The scripts are outlined in the [README.md](README.md). Each one provides a python interface as well as a command line interface for running its part of the process. They all use ids to read relevant data from disk and then output any relevant information to disk.

The idea is that each step in the process may be run with many different parameters, or may need to be rerun, and you shouldn't have to wonder what you did before.
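One way to sketch the incrementing id convention (`umap-001`, `umap-002`, ...) that keeps reruns from clobbering each other (a hypothetical helper; the real implementation may differ):

```python
import os
import re

def next_run_id(directory, prefix):
    """Compute the next zero-padded run id (umap-001, umap-002, ...) by
    scanning existing files. Hypothetical helper illustrating the naming
    convention; not the actual latentscope code."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)")
    highest = 0
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return f"{prefix}-{highest + 1:03d}"
```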


# React code
The React application that powers the web UI is split up into pages and components. It could certainly use some refactoring, and there is a lot of development planned around the UI as the underlying data is solidified.

## Pages

### Home
See list of datasets and scopes

### Setup
Setup scopes

### Explore
Explore scopes

### Jobs
Show list of jobs that have run for a dataset

### Job
Follow a specific job while it's running, or rerun a job that errored.

### Mobile
Mobile is currently unsupported as the regl-scatterplot component that powers the scatter plots doesn't work well on Android or at all on iOS.
109 changes: 66 additions & 43 deletions README.md
@@ -3,16 +3,16 @@
Quickly embed, project, cluster and explore a dataset. This project is a new kind of workflow + tool for visualizing and exploring datasets through the lens of latent spaces.

### Demo
This tool is meant to be run locally or on a trusted server to process data for viewing in the latent scope. You can see the result of the process in [live demos](https://enjalot.github.io/latent-scope):
* [datavis survey responses](https://enjalot.github.io/latent-scope/#/datasets/datavis-misunderstood/explore/scopes-001) - 700 survey responses
* [Dolly 15k](https://enjalot.github.io/latent-scope/#/datasets/dolly15k/explore/scopes-001) - 15k instructions
* [r/DadJokes](https://enjalot.github.io/latent-scope/#/datasets/dadabase/explore/scopes-004) - 50k dad jokes
* [emotion](https://enjalot.github.io/latent-scope/#/datasets/emotion/explore/scopes-001) - 400k emotion statements from Twitter

The source of each demo dataset is documented in the notebooks linked below. Each demo was chosen to represent different scales of data as well as some common use cases.

### Quick Start
To get started, install the [latent-scope module](https://pypi.org/project/latentscope/) and run the server from the command line:

```bash
python -m venv venv
pip install latent-scope
ls-serve ~/local-scope-data
```

Then open your browser to http://localhost:5001 and start processing your first dataset!

### Python interface
You can also ingest data from a Pandas dataframe using the Python interface:
```python
import pandas as pd
from latentscope import ls

df = pd.read_parquet("...")
ls.init("~/latent-scope-data")
ls.ingest("dadabase", df, text_column="joke")
ls.serve()
```
See the notebooks linked below for detailed examples of using the Python interface.

### Notebooks
See these notebooks for detailed examples of using the Python interface to prepare and load data.
* [dvs-survey](notebooks/dvs-survey.ipynb) - A small test dataset of 700 rows to quickly illustrate the process. This notebook shows how you can do every step of the process with the Python interface.
* [dadabase](notebooks/dadabase.ipynb) - A more interesting (and funny) dataset of 50k rows. This notebook shows how you can preprocess a dataset, ingest it into latentscope and then use the web interface to complete the process.
* [dolly15k](notebooks/dolly15k.ipynb) - Grab data from HuggingFace datasets and ingest into the process.
* [emotion](notebooks/emotion.ipynb) - 400k rows of emotional tweets.

### Command line quick start
When latent-scope is installed, it creates a suite of command line scripts that can be used to set up the scopes for exploring in the web application. Each step in the process outputs flat files to the data directory specified at init. These files are in standard formats that were designed to be ported into other pipelines or interfaces.

```bash
# like above, we make sure to install latent-scope
ls-umap datavis-misunderstood embedding-001 25 .1
# ls-cluster dataset_id umap_id samples min_samples
ls-cluster datavis-misunderstood umap-001 5 5
# ls-label dataset_id text_column cluster_id model_id context
ls-label datavis-misunderstood "answer" cluster-001 transformers-HuggingFaceH4___zephyr-7b-beta ""
# ls-scope dataset_id embedding_id umap_id cluster_id cluster_labels_id label description
ls-scope datavis-misunderstood embedding-001 umap-001 cluster-001 cluster-001-labels-001 "E5 demo" "E5 embeddings summarized by Zephyr 7B"
# start the server to explore your scope
ls-serve ~/latent-scope-data
```

This tool is meant to be a part of a larger process. Something that hopefully he
- We consider an input dataset the source of truth, a list of rows that can be indexed into. So all downstream operations, whether it's embeddings, pointing to nearest neighbors, or assigning data points to clusters, all use indices into the input dataset.


## Command Line Scripts: Detailed description
If you want to use the CLI instead of the web UI you can use the following scripts.

The scripts should be run in order once you have an `input.csv` file in your folder. Alternatively, the Setup page in the web UI will run these scripts for you via API calls to the server.
@@ -104,60 +106,81 @@ ls-ingest database-curated
Take the text from the input and embed it. Default is to use `BAAI/bge-small-en-v1.5` locally via HuggingFace transformers. API services are supported as well, see [latentscope/models/embedding_models.json](latentscope/models/embedding_models.json) for model ids.

```bash
# you can get a list of models available with:
ls-list-models
# ls-embed <dataset_name> <text_column> <model_id>
ls-embed dadabase joke transformers-intfloat___e5-small-v2
```

### 2. umap
Map the embeddings from high-dimensional space to 2D with UMAP. Will generate a thumbnail of the scatterplot.
```bash
# ls-umap <dataset_name> <embedding_id> <neighbors> <min_dist>
ls-umap dadabase embedding-001 50 0.1
```


### 3. cluster
Cluster the UMAP points using HDBSCAN. This will label each point with a cluster label.
```bash
# ls-cluster <dataset_name> <umap_id> <samples> <min-samples>
ls-cluster dadabase umap-001 5 3
```

### 4. label
We support auto-labeling clusters by summarizing them with an LLM. Supported models and APIs are listed in [latentscope/models/chat_models.json](latentscope/models/chat_models.json).
You can pass context that will be injected into the system prompt for your dataset.
```bash
# ls-label <dataset_id> <text_column> <cluster_id> <chat_model_id> <context>
ls-label dadabase "joke" cluster-001 openai-gpt-3.5-turbo ""
```
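To illustrate how the context might be injected into the system prompt, here is a hedged sketch (the actual prompt wording used by `ls-label` lives in the latentscope source and will differ):

```python
def build_label_prompt(cluster_texts, context="", max_samples=10):
    """Assemble chat messages asking an LLM to label one cluster.

    Hedged sketch of the idea only: the actual prompt wording used by
    ls-label is defined in the latentscope source and will differ."""
    system = "You label clusters of text items with a short, descriptive name."
    if context:
        system += " Additional context from the user: " + context
    samples = "\n".join(f"- {t}" for t in cluster_texts[:max_samples])
    user = (f"Here are items from one cluster:\n{samples}\n"
            "Reply with a short label for the cluster.")
    return system, user
```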

### 5. scope
The scope command ties together each step of the process to create an explorable configuration. You can have several scopes to view different choices, for example using different embeddings or even different parameters for UMAP and clustering. Switching between scopes in the UI is instant.

```bash
# ls-scope <dataset_id> <embedding_id> <umap_id> <cluster_id> <cluster_labels_id> <label> <description>
ls-scope datavis-misunderstood embedding-001 umap-001 cluster-001 cluster-001-labels-001 "E5 demo" "E5 embeddings summarized by GPT3.5-Turbo"
```
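A sketch of what writing a scope file might look like, tying one id from each step into a single JSON (the field names here are illustrative assumptions, not the real schema):

```python
import json
import os

def write_scope(directory, scope_id, embedding_id, umap_id, cluster_id,
                cluster_labels_id, label, description):
    """Write a scopes-00N.json tying together one choice from each step.

    Field names here are illustrative assumptions; the real scope files
    written by ls-scope may use different keys."""
    scope = {
        "id": scope_id,
        "embedding_id": embedding_id,
        "umap_id": umap_id,
        "cluster_id": cluster_id,
        "cluster_labels_id": cluster_labels_id,
        "label": label,
        "description": description,
    }
    path = os.path.join(directory, f"{scope_id}.json")
    with open(path, "w") as f:
        json.dump(scope, f, indent=2)
    return path
```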

### 6. serve
To start the web UI we run a small server. This also enables nearest neighbor similarity search and interactively querying subsets of the input data while exploring the scopes.

```bash
ls-serve ~/latent-scope-data
```
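To illustrate the idea behind the nearest neighbor similarity search, here is a brute-force cosine-similarity sketch (the server's actual implementation is not shown in this document and presumably uses optimized vector operations over the stored embeddings):

```python
import math

def nearest_neighbors(query, vectors, k=3):
    """Return the indices of the k vectors most similar to the query by
    cosine similarity. Pure-Python sketch of the concept only; a real
    server would use vectorized math over the stored embeddings."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]
```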


## Dataset directory structure
Each dataset will have its own directory in data/ created when you ingest your CSV. All subsequent steps of setting up a dataset write their data and metadata to this directory.
There are no databases in this tool, just flat files that are easy to copy and edit.
<pre>
├── data/
| ├── dataset1/
| | ├── input.parquet # from ingest.py, the dataset
| | ├── meta.json # from ingest.py, metadata for dataset, #rows, columns, text_column
| | ├── embeddings/
| | | ├── embedding-001.h5 # from embed.py, embedding vectors
| | | ├── embedding-001.json # from embed.py, parameters used to embed
| | | ├── embedding-002...
| | ├── umaps/
| | | ├── umap-001.parquet # from umap.py, x,y coordinates
| | | ├── umap-001.json # from umap.py, params used
| | | ├── umap-001.png # from umap.py, thumbnail of plot
| | | ├── umap-002....
| | ├── clusters/
| | | ├── clusters-001.parquet # from cluster.py, cluster indices
| | | ├── clusters-001-labels-default.parquet # from cluster.py, default labels
| | | ├── clusters-001-labels-001.parquet # from label_clusters.py, LLM generated labels
| | | ├── clusters-001.json # from cluster.py, params used
| | | ├── clusters-001.png # from cluster.py, thumbnail of plot
| | | ├── clusters-002...
| | ├── scopes/
| | | ├── scopes-001.json # from scope.py, combination of embed, umap, clusters and label choice
| | | ├── scopes-...
| | ├── tags/
| | | ├── ❤️.indices # tagged by UI, powered by tags.py
| | | ├── ... # can have arbitrary named tags
| | ├── jobs/
| | | ├── 8980️-12345...json # created when job is run via web UI
</pre>
30 changes: 29 additions & 1 deletion notebooks/dolly15k.ipynb
@@ -10,14 +10,42 @@
"import latentscope as ls"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# https://huggingface.co/datasets/databricks/databricks-dolly-15k\n",
"from datasets import load_dataset\n",
"dataset = load_dataset(\"databricks/databricks-dolly-15k\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dataset[\"train\"].to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.shape"
]
},
{
2 changes: 2 additions & 0 deletions notebooks/dvs-survey.ipynb
@@ -17,6 +17,8 @@
"metadata": {},
"outputs": [],
"source": [
"# this dataset is extracted from the Data Visualization Society annual survey 2019\n",
"# https://github.com/data-visualization-society/data_visualization_survey/tree/master/data\n",
"url = \"https://storage.googleapis.com/fun-data/latent-scope/examples/dvs-survey/datavis-misunderstood.csv\"\n",
"df = pd.read_csv(url)"
]
12 changes: 5 additions & 7 deletions web/README.md
@@ -1,8 +1,6 @@
# Web UI

```bash
npm install
npm run dev
```
