notes 🖊️
cboettig committed Jan 5, 2025
1 parent e6fc859 commit 5107814
Showing 2 changed files with 55 additions and 0 deletions.
README.md: 28 additions, 0 deletions
@@ -13,3 +13,31 @@ license: bsd-2-clause
:hugs: Shiny App on Huggingface: <https://huggingface.co/spaces/boettiger-lab/geo-llm-r>

Work in progress. This is a proof-of-principle for an LLM-driven interface to dynamic mapping. Key technologies include duckdb, geoparquet, pmtiles, maplibre, and open LLMs (via VLLM + LiteLLM). The R interface is built on ellmer (LLMs), mapgl (maplibre), shiny, and duckdb.

# Setup

## GitHub with HuggingFace Deploy

All edits should be pushed to GitHub. Edits to the `main` branch are automatically deployed to HuggingFace via GitHub Actions.
When using this scaffold, you will first need to set up the auto-deploy system:

- [Create a new HuggingFace Space](https://huggingface.co/new-space) (any template is fine; it will be overwritten).
- [Create a HuggingFace Token](https://huggingface.co/settings/tokens/new?tokenType=write) with write permissions if you do not have one.
- In the GitHub Settings of your repository, add the token as a "New Repository Secret" under the `Secrets and Variables` -> `Actions` section of settings (`https://github.com/{USER}/{REPO}/settings/secrets/actions`).
- Edit the `.github/workflows/deploy.yml` file to specify your HuggingFace username and the HF Space to publish to.

## Language Model setup

This example is designed to leverage open-source or open-weights models; you will need to adjust the API URL and API key accordingly. This could be a local model served with `vllm` or `ollama`, and commercial models should work too. The demo app currently runs against a VLLM+LiteLLM-backed Llama3 variant hosted on the National Research Platform.
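
A minimal sketch (not the app's actual configuration) of pointing ellmer at an OpenAI-compatible endpoint; the environment variable names and model name below are placeholders:

```r
library(ellmer)

# Point ellmer at any OpenAI-compatible endpoint (vllm, LiteLLM, or a
# commercial provider). All values here are placeholders.
chat <- chat_openai(
  base_url      = Sys.getenv("OPENAI_API_URL"),  # e.g. your VLLM/LiteLLM proxy
  api_key       = Sys.getenv("OPENAI_API_KEY"),
  model         = "llama3",                      # hypothetical model name
  system_prompt = "Translate the user's question into a duckdb SQL query."
)
```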

The LLM plays only a simple role: generating SQL queries from background information about the data, including the table schema (see the system prompt for details). Most open models I have experimented with do not support the [tool use](https://ellmer.tidyverse.org/articles/tool-calling.html) or [structured data](https://ellmer.tidyverse.org/articles/structured-data.html) interfaces very well compared to commercial models. An important trick used here when working with open models is simply requesting that the reply be structured as JSON. Open models are quite decent at this, and at SQL construction, given the necessary context about the data. The map and chart elements merely react to the resulting data frames, so the entire analysis is as transparent and reproducible as it would be if the user had composed their request in SQL instead of plain English.
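
A sketch of that JSON trick (the prompt wording, question, and object names are illustrative, not the app's actual code):

```r
library(jsonlite)
library(duckdb)

# Ask for a JSON-wrapped SQL string, then run it against the duckdb data.
prompt <- paste(
  "Respond only with a JSON object of the form {\"query\": \"<SQL>\"}.",
  "Question: which tracts have the highest overall vulnerability?"
)

reply <- chat$chat(prompt)        # chat object from the setup sketch above
sql   <- fromJSON(reply)$query

con <- dbConnect(duckdb())
df  <- dbGetQuery(con, sql)       # the map and chart elements react to `df`
```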

## Software Dependencies

The Dockerfile includes all dependencies required for the HuggingFace deployment, and can be used directly, or as a template, to serve RStudio Server.

## Data pre-processing

Pre-processing the data into cloud-native formats and hosting it on a high-bandwidth, highly available server is essential for efficient and scalable rendering. Pre-computing expensive operations, such as zonal statistics across all features, is also necessary. These steps are described in [preprocess.md](preprocess.md) and the corresponding scripts.
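
For illustration, a hedged sketch of why cloud-native hosting pays off: duckdb's httpfs extension lets the app query a remotely hosted parquet file without downloading it in full (the URL below is a placeholder, not the project's actual data location):

```r
library(duckdb)

# duckdb range-requests a remote (Geo)Parquet file over HTTP, so a query
# only transfers the row groups and columns it needs.
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
dbGetQuery(con, "
  SELECT COUNT(*) AS n_features
  FROM read_parquet('https://example.org/data/features.parquet')
")
```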


preprocess.md: 27 additions, 0 deletions
@@ -0,0 +1,27 @@
---
---


# Vector Layers

The heart of this application design is a vector dataset serialized as both (Geo)Parquet and PMTiles.
The parquet version allows for real-time calculations through rapid SQL queries via duckdb,
and the PMTiles version allows the data to be quickly visualized at any zoom through maplibre.
maplibre can also efficiently filter the PMTiles data given the feature ids returned by duckdb.
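
A rough sketch of that round trip (the column names, layer id, and exact mapgl filter syntax are assumptions, and it presumes a Shiny session with a maplibre map whose output id is "map"):

```r
library(duckdb)
library(mapgl)

# duckdb returns the ids of matching features from the parquet copy...
con <- dbConnect(duckdb())
ids <- dbGetQuery(con, "
  SELECT FIPS FROM read_parquet('features.parquet')
  WHERE score > 0.9
")$FIPS

# ...and maplibre filters the PMTiles layer to those ids.
maplibre_proxy("map") |>
  set_filter("features-layer", c(list("in", "FIPS"), as.list(ids)))
```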

`gdal_translate` can generate both PMTiles and geoparquet, though `tippecanoe` provides more
options for PMTiles generation and can produce nicer tile sets.
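
One possible way to run those conversions from R is sf's GDAL wrapper (a sketch assuming a GDAL build with the Parquet and PMTiles drivers; file names are placeholders, not the project's actual paths):

```r
library(sf)

# Vector translation to GeoParquet and PMTiles via GDAL.
gdal_utils("vectortranslate", source = "features.gpkg",
           destination = "features.parquet", options = c("-f", "Parquet"))

gdal_utils("vectortranslate", source = "features.gpkg",
           destination = "features.pmtiles", options = c("-f", "PMTiles"))
```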

The demo uses the CDC Social Vulnerability data because it follows the hierarchical partitioning
used by the Census (Country -> State -> County -> Tract).

# Raster Layers

## Generating static tiles

## Zonal statistics calculations

The application is essentially driven by the vector layer data using SQL.
I find it helpful to pre-compute 'zonal' statistics, e.g. the mean value of each raster layer
within each feature of the 'focal' vector data set(s).
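
One way to pre-compute these (a sketch, not the project's actual pipeline; the package choice, file names, and column names are assumptions):

```r
library(sf)
library(terra)
library(exactextractr)

# Compute the mean of a raster within each vector feature and store it as a
# new column on the features. Writing parquet assumes GDAL's Parquet driver.
features <- st_read("features.gpkg")   # focal vector features (placeholder)
r        <- rast("layer.tif")          # example raster layer (placeholder)

features$layer_mean <- exact_extract(r, features, "mean")
write_sf(features, "features_with_zonal.parquet")
```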
