---
title: "`starpilot`"
subtitle: "A tool to suggest GitHub repos from your user's stars"
format: 
    revealjs:
        theme: "solarized"
        mermaid: 
          format: "png"
          theme: "neutral"
editor: 
  markdown: 
    wrap: 72
---

```{r label = "setup"}
if (!require(ggplot2)) {
    install.packages("ggplot2")
}
```

# Introduction

::: notes
- Hi, I'm Dave, and thanks for coming to see me talk about a recent side project.
:::

## Goals for the talk

-   Share an AI project I've been working on
-   Share some of the design decisions I made to build it
-   Share some of the experience I had building it
-   Share some of the code I wrote to build it

::: notes
Before we start I'd like to understand more about all of you, as we have a mixture of folks potentially from AIWales and from PyData Cardiff.
:::

## Level Set {auto-animate="true"}

🙌 🙌🏻 🙌🏼 🙌🏽 🙌🏾 🙌🏿 🙌🏿

::: {.incremental data-id="level-set"}
-   Retrieval Augmented Generation
-   Vector Embedding
-   Vector Store
-   Vector Similarity
:::

::: notes
-   I'm going to assume you know the meaning of terms like 'AI', 'LLM',
    'GPT'
-   If you don't, hello, welcome, you are in a great place to learn, but
    I won't go over them now. Feel free to make a neighbor into a friend
    and ask them, or get Joe to explain it in his session.
-   Can you define these terms?
-   I'm not going to put anyone on the spot here, but I'd love to have
    people raise their hands if they feel they can define these terms
-   This will help me understand the level of knowledge in the room and hopefully
    tune the rest of the talk
:::

## Level Set {auto-animate="true"}

::: {data-id="level-set"}
-   Retrieval Augmented Generation
    -   Using a language model supported by a vector store
-   Vector Embedding
    -   Converting text into a list of numbers to represent semantic
        meaning
-   Vector Store
    -   Database optimized for storing embedding vectors
-   Vector Similarity
    -   Quantifying the similarity between embedding vectors
:::

::: notes
-   So here are the definitions of the terms, keep your hands up if you
    got most of them right
-   If you knew these terms already, great! This talk is largely about
    applying these concepts in code.
-   If you didn't, that's also great, because you're in the right place
    to learn about them.
:::

## Level Set {auto-animate="true"}

🙌 🙌🏻 🙌🏼 🙌🏽 🙌🏾 🙌🏿 🙌🏿

> Who is comfortable reading `python`?

# The Project

## Goals for an AI project

::: incremental
-   As a Data Scientist
-   I want to build something novel with AI
-   So that I can learn and experiment with the technology
-   And maybe make something useful for other people
:::

::: notes
-   Apply Retrieval Augmented Generation to a goal I have (and maybe
    other people)
-   Understand how LangChain (and friends) works through building
    something novel
-   Experiment in a way that leans into what LLMs + AI are good at
    -   Semantic meaning
:::

## The Project is `starpilot`

> A tool to suggest GitHub repos from your user's stars based on semantic similarity

![](images/563000_The github octocat riding a spaceship through a st_xl-1024-v1-0.png){fig-align="center" height=400}

::: notes
-   The project is called `starpilot`
-   It's a tool that suggests GitHub repos based on semantic similarity
-   The idea is that you can give it a query, and it will return the
    most semantically similar repos to that query that your GitHub user has starred
:::

## Demo 1

```{=html}
<script src="https://asciinema.org/a/661841.js" id="asciicast-661841" async="true"></script>
```
::: notes
-   So here is a smash cut to the most basic demo.
-   I call starpilot to execute the 'shoot' command with the query
    "machine learning"
-   Starpilot returns 10 repos that are all semantically similar to the
    query, and all exist, no hallucinations here
-   Most interestingly the repo `DeepLearningExamples` is returned,
    which is, depending on your level of pedantry, not a 'machine
    learning' repo but a 'deep learning' repo
-   `DeepLearningExamples` never states the term 'machine learning'.
    It's about deep learning, which is semantically similar to machine
    learning. This is exactly the kind of result I was hoping for. The
    tool is looking at the search *intention*, not the search *terms*
:::

## Why is 'semantic similarity' important?

-   Keyword search tests "how close a string is to another string"
    -   Exact Match: `cool cats != chill felines`
    -   Partial Match: `cool cat` partially fits `cooler cats`
-   Semantic search tests "how similar in *meaning* is the phrase to
    another phrase"
    -   `cool cats` is equivalent in meaning to `chill felines`, also
        `groovy kitties`, `suave maine coon` and `sphynx with style`

::: notes
-   I think applying AI, LLMs and RAG to a problem like this might be a
    good fit
-   This is because the problem is about semantic similarity
-   I'm looking for a library that does "DataFrames". Will I also be
    interested in libraries that do "Data Tables"? Probably.
-   I'm looking for a 'Front end' framework, will I also be interested
    in 'UI' frameworks? Probably.
:::
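
A rough sketch of the keyword side of that comparison, using only the Python standard library. `exact_match` and `partial_match` are hypothetical helper names for illustration, and `difflib` is just a stand-in for whatever string matcher a real keyword search would use.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Keyword search at its strictest: the strings must be identical."""
    return a.lower() == b.lower()


def partial_match(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; it knows nothing about meaning."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = [
    ("cool cat", "cooler cats"),     # shares most of its characters
    ("cool cats", "chill felines"),  # same meaning, different characters
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: exact={exact_match(a, b)}, partial={partial_match(a, b):.2f}")
```

The second pair scores low on both measures even though the phrases mean the same thing, which is the gap semantic search is meant to close.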

## Why is this data suitable for 'semantic' but not 'keyword' search?

-   GitHub stars are public and common across
    languages/stacks/specialisms
-   GitHub repos are text rich data
-   GitHub repos are a 'weak standard'
    -   Topics are free text
    -   Descriptions are free text

::: notes
-   Without access to suitable data, this project can't get off the
    ground, but luckily GitHub is a treasure trove of easily accessible
    data for this use case
-   The data is text rich, and the text is 'rich' in meaning
-   However, the data is also 'varied' in structure, quality and context
-   Repos vary subjectively by community, project requirements, and
    developer style
-   There is no 'repo police' saying a repo must have a certain
    structure, or a certain set of tags
-   This is where the AI comes in
:::

# An Intuitive Example

## Polars vs Pandas {auto-animate="true"}

::: columns
::: {.column width="50%" data-id="pandas"}
-   data-analysis
-   flexible
-   alignment
-   python
-   data-science
:::

::: {.column width="50%" data-id="polars"}
-   dataframe-library
-   dataframe
-   dataframes
-   arrow
-   python
-   out-of-core
:::
:::

::: notes
-   To illustrate this point, let's look at a few tags from GitHub
-   This is the topic list for the `Polars` and `Pandas` repos
-   They are functionally extremely close. They are both Python
    libraries for data manipulation
-   While there may be some subjective views about when to use which
    they both solve practically identical problems and fit practically
    identical use-cases
-   If we were to search for 'DataFrames for Python' we would want to
    return both of these repos
-   However, a keyword search on 'dataframe' would only return one of
    them, because the only topic they share is `python`
:::

## Polars vs Pandas {auto-animate="true"}

::: columns
::: {.column width="50%" data-id="pandas"}
> Flexible and powerful data analysis / manipulation library for Python,
> providing labeled data structures similar to R data.frame objects,
> statistical functions, and much more
:::

::: {.column width="50%" data-id="polars"}
> Dataframes powered by a multithreaded, vectorized query engine,
> written in Rust
:::
:::

::: notes
-   Here are the descriptions for the `Polars` and `Pandas` repos
-   Again very similar, but keyword search would not return both of
    these repos
-   I'd suggest that unless you already knew the difference between
    these two libraries, you'd be hard pressed to say which one was
    which in the last 2 slides
:::

## Polars vs Pandas {auto-animate="true"}

::: columns
::: {.column width="50%" data-id="pandas"}
![](images/panda4.png)
:::

::: {.column width="50%" data-id="polars"}
![](images/polars3.png)
:::
:::

::: notes
-   If you managed to guess correctly, congratulations, you've won a
    smug sense of self-satisfaction
-   Even if you didn't, that's fine, because the point is that these two
    libraries are very similar in meaning, but very different in the way
    they are described
-   Also, I didn't set this up, but I love the way that the panda was
    generated with 2 laptops for some reason, and the polar bear is sort
    of almost a panda
-   It's like the polar bear is all "Look, I'm Pandas too!" and the
    Panda is furiously dual wielding laptops to attempt to prove it's
    multi-threaded
:::

# Embedding Example

-   "pandas package"
-   "polars module"
-   "panda breeding"
-   "polar environment"

::: notes
-   To create an embedding, the source data is converted into a list of
    numbers through a process called 'embedding'
-   The numbers represent the semantic meaning of the text
-   You can imagine each number as a 'relevance score' for a given topic
-   I'm deliberately simplifying this, but it's a good enough mental
    model for now
-   Suppose we have the following 4 pieces of text we want to embed
    separately
:::

## Embedding Example {auto-animate="true"}

```{mermaid label="embedding diagram"}
flowchart LR

subgraph strings
pp("pandas package")
pm("polars module")
pb("panda breeding")
pe("polar environment")
end

subgraph embedder
embed([function])

pp --> embed
pm --> embed
pb --> embed
pe --> embed
end

subgraph vector
ppv("0.37, 0.88")
pmv("0.49, 0.66")
pbv("0.83, 0.24")
pev("0.70, 0.12")

embed --> ppv
embed --> pmv
embed --> pbv
embed --> pev
end
```

::: notes
-   The embedding process takes the text and converts it into a list of
    numbers
-   In our silly example, the numbers are 2D, but in reality they have
    hundreds or thousands of dimensions
:::

## Embedding Example {auto-animate="true"}

```{r label="intuition example"}
library(ggplot2)

df <- data.frame(
    query = c("pandas package", "polars module", "panda breeding", "polar environment"),
    relevance_to_zoologists = c(0.37, 0.49, 0.83, 0.70),
    relevance_to_data_scientists = c(0.88, 0.66, 0.24, 0.12),
    stringsAsFactors = FALSE
)

ggplot(df, aes(x=relevance_to_zoologists, y=relevance_to_data_scientists)) +
    geom_point() +
    geom_text(aes(label=query), hjust=0.5, vjust=-0.5, check_overlap = TRUE) +
    scale_x_continuous(limits = c(0, 1)) +
    scale_y_continuous(limits = c(0, 1)) +
    xlab("Relevance to Zoologists") +
    ylab("Relevance to Data Scientists") +
    coord_fixed(ratio = 1)
```

::: notes
-   Here's a plot of the 4 pieces of text in 2D space
-   The x-axis represents the relevance to zoologists
-   The y-axis represents the relevance to data scientists
-   The points are the 4 pieces of text converted into an embedding
:::

## Embedding Example {auto-animate="true"}

```{r label="intuition example with query"}
# Add the 'data frames' query
reference_vector <- c(0.1, 0.6)
df <- rbind(df, data.frame(query = "data frames", relevance_to_zoologists = reference_vector[1], relevance_to_data_scientists = reference_vector[2]))

# Function to calculate cosine similarity
cos_sim <- function(x, reference_vector) {
    dot_product <- sum(x * reference_vector)
    magn_x <- sqrt(sum(x^2))
    magn_reference <- sqrt(sum(reference_vector^2))
    similarity <- dot_product / (magn_x * magn_reference)
    return(similarity)
}

# Apply function to calculate cosine similarities
df$cosine_similarity <- apply(
    df[, c("relevance_to_zoologists", "relevance_to_data_scientists")],
    1,
    function(x) {cos_sim(x, reference_vector)}
)

# Remove query similarity to self
df$cosine_similarity[nrow(df)] <- NA

# Plot
p <- ggplot(df, aes(x=relevance_to_zoologists, y=relevance_to_data_scientists)) +
    geom_point() +
    geom_text(aes(label=query), hjust=0.5, vjust=-0.5, check_overlap = TRUE) +
    geom_segment(aes(
        x = 0, y = 0, xend = relevance_to_zoologists, yend = relevance_to_data_scientists
        ),
        arrow = arrow(length = unit(0.03, "npc")), linewidth = 1, color = "black"
    ) +
    scale_x_continuous(limits = c(0, 1)) +
    scale_y_continuous(limits = c(0, 1)) +
    xlab("Relevance to Zoologists") +
    ylab("Relevance to Data Scientists") +
    coord_fixed(ratio = 1)

# Loop to draw arcs and label (skip the last row, which is the query itself)
arc_cols = c("blue", "red", "green", "purple")
for (i in seq_len(nrow(df) - 1)) {

    seg_pos <- 4 / (4 + i)

    curve_data <- df[i, ]
    curve_data$xpos <- seg_pos * reference_vector[1]
    curve_data$ypos <- seg_pos * reference_vector[2]

    p <- p + geom_curve(
        data = curve_data,
        aes(
            x = xpos, y = ypos,
            xend = relevance_to_zoologists / 2, yend = relevance_to_data_scientists / 2),
            curvature = -0.4, color = arc_cols[i], linetype = "dashed", linewidth = 1.2
        ) +
    geom_text(
        data = curve_data, aes(
            x = relevance_to_zoologists / 2, y = relevance_to_data_scientists / 2, 
            label = round(cosine_similarity, 2)), check_overlap = TRUE, color = "black",
            vjust = -2.5, hjust=0.7
        )
}
p
```

::: notes
-   So now let's invent a query to put in the embedding space: "data
    frames"
-   When we embed this query, it will be placed in the embedding space
-   From this known point we then compute the cosine similarity to the
    other points
-   This is a simple mathematical operation that measures the similarity
    between two vectors, based on the angle between them
-   The closer the angle is to 0, the more similar the vectors are, and
    the higher the cosine similarity
-   This is shown in 2 dimensions, but in practice the embeddings have
    hundreds or thousands of dimensions, and the cosine similarity is
    calculated in the same way
:::
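
The same calculation can be sketched in plain Python, reusing the made-up 2D vectors from this example. The function name `cosine_similarity` and the dictionary layout are my own choices for illustration; real embeddings come from a model, not from hand-picked numbers.

```python
import math

# Toy 2D "embeddings": (relevance to zoologists, relevance to data scientists)
embeddings = {
    "pandas package": (0.37, 0.88),
    "polars module": (0.49, 0.66),
    "panda breeding": (0.83, 0.24),
    "polar environment": (0.70, 0.12),
}
query_vector = (0.1, 0.6)  # the made-up embedding for "data frames"


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rank the stored texts from most to least similar to the query
ranked = sorted(
    ((text, cosine_similarity(vec, query_vector)) for text, vec in embeddings.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for text, score in ranked:
    print(f"{score:.2f}  {text}")
```

The two data-science phrases come out on top, and the two zoology phrases land well below them.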

# First Experiment

## Solution Requirements

::: notes
-   A method to extract the data from GitHub
-   A connection to an LLM
-   A framework to parse 100s (1000s?) of 'similarly' structured data
-   A data store for processed data
-   An interactive interface for the developer to access the data
:::

## Solution Architecture

:::{}
![](images/github.png){.fragment width=400}

![](images/langchain.png){.fragment width=400}
![](images/chroma.svg){.fragment width=400}
![](images/typer.svg){.fragment width=400}
:::

::: notes
-   So to build this I need a few tools, libraries and data sources
-   GitHub User Stars
    -   REST API then GraphQL API
        -   I'll mention more about the change between those in a moment
    -   JSON files
-   LangChain
    -   LLM REST API access
    -   Document parsing tools
    -   Vector store prompt query framework
-   Chroma
    -   On disk vector store
-   Typer
    -   'FastAPI' style CLI framework
:::

## Similarity Search

```{mermaid label="minimum viable solution",  .smaller}
flowchart LR

subgraph 0 Resources
	GH([GitHub Repo])
	DB[(Chroma)]
	LLM([LLM])
end

subgraph 1 ETL
	raw(JSON)
	docs_in(Docs)
	vec(Embedding)

	GH--->raw
	raw-->docs_in
	docs_in-->LLM
	LLM-->vec
	vec-->DB
end

subgraph 2 Execution
	cli([cli])
	q(query)
	LLM([LLM])
	q_vec(Query Vector)
	docs_out(Document)
	a(suggestions)

	cli-->q
	q -->LLM
	LLM -->q_vec
	q_vec -->DB
	DB-->docs_out
	docs_out-->a
end
```

## Outcome {auto-animate="true"}

-   IT'S ALIVE
    -   Data gets read
    -   Read data gets embedded
    -   Embedded data gets stored
    -   Embedded data gets queried
    
## Outcome {auto-animate="true"}

-   IT STINKS
    -   Reading data is punishingly long
        -   200 stars in 30 minutes
    -   Querying data is basic
        -   Semantic search works, but is broad
        -   'DataFrames for Python' returns Pandas, and Tidyverse in R,
            and DataTables in JavaScript

# Refinement

## Refined Version

```{mermaid label = "self query solution"}
flowchart LR

LLM([LLM])
GH([GitHub Repo])
raw(raw_GH/*.json)
prepped(enhanced_GH/*.json)
style prepped fill:red
docs_in(Documents)
vec_doc(Vector Document)
DB[(Chroma)]
self_query(Self Query)
vec_q(Vector Embedding)
db_filter(Filter Value)
cli([cli])
q(query)
docs_out(Document)
a(suggestions)

subgraph 0 Resources
	GH
	DB
	LLM
end

subgraph 1 ETL
	GH--->raw
	raw-->prepped
	prepped-->docs_in
	docs_in-->LLM
	LLM-->vec_doc
	vec_doc-->DB
end

subgraph 2 Execution
	LLM-->self_query
	self_query-->db_filter
	style self_query fill:red
	self_query-->vec_q
	db_filter-->DB
	style db_filter fill:red
	vec_q --> DB
	cli-->q
	q -->LLM
	DB-->docs_out
	docs_out-->a
end
```

## What this gets us

-   Faster Data reads from GitHub
    -   20x
-   Enhanced data preprocessing
    -   Tags, Stars, Primary Language
-   Selective data load
    -   'Popular' repos
-   Self Querying
    -   Language specific results


::: notes
-   Faster data read with `GraphQL`
    -   Reduces network overhead of a per repo call
        -   Returns are now paginated `json` files
    -   20x speed up
-   Enhanced data preprocessing
    -   Extracts more data from the repo
        -   Repo tags
        -   Stars
        -   Primary Language
    -   More selective on data load
        -   Only load 'popular' repos
        -   Only load 'relevant' repos
-   Enhanced data preprocessing + Self Querying
    -   Enables more nuanced queries like "Dataframes for Python" -\>
        "Pandas"
-   So let's look at some key parts of the code now
:::
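
A minimal sketch of what paginated reads over GraphQL could look like, assuming GitHub's public GraphQL schema. This is not starpilot's actual query: `STARS_QUERY`, `iter_starred_pages` and `fake_post` are hypothetical names, and the network call is replaced by a canned stand-in so the example runs offline.

```python
from typing import Callable, Dict, Iterator, List

# Field names follow GitHub's public GraphQL schema; the selection is illustrative.
STARS_QUERY = """
query ($login: String!, $cursor: String) {
  user(login: $login) {
    starredRepositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes { name url description stargazerCount primaryLanguage { name } }
    }
  }
}
"""


def iter_starred_pages(login: str, post: Callable[[Dict], Dict]) -> Iterator[List[Dict]]:
    """Yield one list of repos per page, following the cursor until exhausted.

    `post` sends the payload and returns the decoded JSON response. In real use
    it would wrap an authenticated HTTP POST to https://api.github.com/graphql.
    """
    cursor = None
    while True:
        payload = {"query": STARS_QUERY, "variables": {"login": login, "cursor": cursor}}
        page = post(payload)["data"]["user"]["starredRepositories"]
        yield page["nodes"]
        if not page["pageInfo"]["hasNextPage"]:
            return
        cursor = page["pageInfo"]["endCursor"]


# Stand-in for the network so the sketch runs offline: two canned pages.
_PAGES = {
    None: {"nodes": [{"name": "pandas"}], "pageInfo": {"hasNextPage": True, "endCursor": "abc"}},
    "abc": {"nodes": [{"name": "polars"}], "pageInfo": {"hasNextPage": False, "endCursor": None}},
}


def fake_post(payload: Dict) -> Dict:
    """Return the canned page that matches the requested cursor."""
    cursor = payload["variables"]["cursor"]
    return {"data": {"user": {"starredRepositories": _PAGES[cursor]}}}


names = [repo["name"] for page in iter_starred_pages("octocat", fake_post) for repo in page]
print(names)
```

One request returns up to 100 repos, so the number of round trips scales with pages rather than with stars.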

# Example Data

## Example JSON in

``` json
{
    "name": "langchain",
    "nameWithOwner": "langchain-ai/langchain",
    "url": "https://github.com/langchain-ai/langchain",
    "homepageUrl": "https://python.langchain.com",
    "description": "\ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications",
    "stargazerCount": 77908,
    "primaryLanguage": "Python",
    "languages": [
        "Python",
        "Makefile",
        "HTML",
        "Dockerfile",
        "TeX",
        "JavaScript",
        "Shell",
        "XSLT",
        "Jupyter Notebook",
        "MDX"
    ],
    "owner": "langchain-ai",
    "content": "langchain \ud83e\udd9c\ud83d\udd17 Build context-aware reasoning applications Python"
}
```

::: notes
-   This is an example of the json file that is written to disk
-   It contains the metadata and content of a repo
-   Content is created by starpilot by joining the `name`,
    `description` and `primaryLanguage` fields
-   At this stage I have one json per starred repo, and these now need
    to be processed and loaded into the vector store
-   To do this there is a `prepare_documents` function in Starpilot
:::
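
Judging by the example above, the `content` value looks like the `name`, `description` and `primaryLanguage` joined by spaces. A minimal sketch of that step follows; `build_content` is a hypothetical helper written for this slide, not starpilot's own code.

```python
def build_content(record: dict) -> str:
    """Join the text fields that describe a repo, skipping any that are missing."""
    parts = [record.get("name"), record.get("description"), record.get("primaryLanguage")]
    return " ".join(part for part in parts if part)


record = {
    "name": "langchain",
    "description": "🦜🔗 Build context-aware reasoning applications",
    "primaryLanguage": "Python",
}
print(build_content(record))
```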

# The `prepare_documents` function

## `prepare_documents` {auto-animate="true"}

-   The `prepare_documents` function is responsible for reading in the
    json files and creating `Document` objects

``` {.python}
def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    """
    Prepare the documents for ingestion into the vectorstore
    """
    ...
    return documents
```

::: notes
-   Here's the function signature
-   It takes a directory of json files as input
-   It returns a list of `Document` objects
-   A `Document` is a class defined in the `LangChain` library
-   The `Document` object contains the content *and* metadata for a
    given repo
:::

## `prepare_documents` {auto-animate="true"}

Get the file paths for each `json` file for each repo

``` {.python code-line-numbers="7-10"}
def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    """
    Prepare the documents for ingestion into the vectorstore
    """

    file_paths = []
    for file in os.listdir(repo_contents_dir):
        file_paths.append(os.path.join(repo_contents_dir, file))

    ...

    return documents
```

::: notes
-   `prepare_documents` first creates a list of file paths to the json
    files
-   Each item in this list is the direct path to a specific json file,
    each of which is the content of a specific repo read through
    `GraphQL`
:::

## `prepare_documents` {auto-animate="true"}

Load the `Document` objects from the json files

``` {.python code-line-numbers="5-16"}
def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    ...
    documents = []
    for file_path in track(file_paths, description="Loading documents..."):
        logger.debug("Loading document", file=file_path)
        loader = JSONLoader(
            file_path,
            jq_schema=".",
            content_key="content",
            metadata_func=_metadata_func,
            text_content=False,
        )
        if (loaded_document := loader.load())[0].page_content != "":
            documents.extend(loaded_document)
    ...
    return documents
```

::: notes
-   The function reads in the json files written per repo into memory
-   The `JSONLoader` class is used to load the json files
-   The `Document` objects are then returned as a list
-   `jq_schema` defines the "root" of the json file
-   `content_key` defines the key in the json file that contains the
    content
-   `metadata_func` is a function that extracts metadata from the json
    file
-   Finally if the `Document` object has content it is added to the list
    of `Document` objects
:::

## `prepare_documents` {auto-animate="true"}

The `metadata_func` argument takes a function that extracts metadata
from the json file

``` {.python code-line-numbers="12"}
def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:
    ...
    documents = []
    for file_path in track(file_paths, description="Loading documents..."):
        logger.debug("Loading document", file=file_path)
        loader = JSONLoader(
            file_path,
            jq_schema=".",
            content_key="content",
            metadata_func=_metadata_func,
            text_content=False,
        )
        if (loaded_document := loader.load())[0].page_content != "":
            documents.extend(loaded_document)
    ...
    return documents
```

::: notes
-   However, what's the `_metadata_func` function?
:::

## `prepare_documents` {auto-animate="true"}

The `_metadata_func` function is used to extract metadata from the json
file

``` {.python code-line-numbers="5-18"}
def prepare_documents(
    repo_contents_dir: str = "./repo_content",
) -> List[Document]:

    def _metadata_func(record: dict, metadata: dict) -> dict:
        metadata["url"] = record.get("url")
        metadata["name"] = record.get("name")
        metadata["stargazerCount"] = record["stargazerCount"]
        if (primary_language := record.get("primaryLanguage")) is not None:
            metadata["primaryLanguage"] = primary_language
        if (description := record.get("description")) is not None:
            metadata["description"] = description
        if (topics := record.get("topics")) is not None:
            metadata["topics"] = " ".join(topics)
        if (languages := record.get("languages")) is not None:
            metadata["languages"] = " ".join(languages)

        return metadata

    ...

    return documents
```

::: notes
-   `_metadata_func` is a function that extracts metadata for the repo
    and processes it to be included as the metadata of the `Document`
    object
-   It extracts the `url`, `name`, `stargazerCount`, `primaryLanguage`,
    `description`, `topics` and `languages` fields from the json file as
    long as they exist
-   The items that are extracted are then added to the `metadata`
    dictionary
-   The `metadata` dictionary is then returned as part of the `Document`
    object
:::
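
To see what comes out the other end, here is a standalone sketch that runs the same logic against a trimmed copy of the example record, with no LangChain involved. The `source` value passed in is a made-up placeholder for whatever metadata the loader has already set.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    """Copy selected fields from a repo record into the metadata dictionary."""
    metadata["url"] = record.get("url")
    metadata["name"] = record.get("name")
    metadata["stargazerCount"] = record["stargazerCount"]
    # Optional text fields are only added when present
    for key in ("primaryLanguage", "description"):
        if (value := record.get(key)) is not None:
            metadata[key] = value
    # Lists are flattened into one space-separated string
    for key in ("topics", "languages"):
        if (values := record.get(key)) is not None:
            metadata[key] = " ".join(values)
    return metadata


record = {
    "name": "langchain",
    "url": "https://github.com/langchain-ai/langchain",
    "stargazerCount": 77908,
    "primaryLanguage": "Python",
    "languages": ["Python", "Makefile", "HTML"],
}
# Hypothetical pre-existing metadata, standing in for what the loader supplies
metadata = metadata_func(record, {"source": "repo_content/langchain.json"})
print(metadata)
```

Every value ends up as a plain string or number, which keeps the metadata simple to filter on later.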

## Using the `prepare_documents` function {auto-animate="true"}

`prepare_documents` is used to load the `Document` objects into `Chroma`

``` {.python code-line-numbers="5"}
from langchain_community.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

Chroma.from_documents(
    documents=utils.prepare_documents(),
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="./vectorstore-chroma",
)
```

::: notes
-   To use this list of `Documents` , we need to pass it to the `Chroma`
    class using the `from_documents` method.
:::

## Using the `prepare_documents` function {auto-animate="true"}

`Chroma.from_documents` also needs an `embedding` object and a
`persist_directory`

``` {.python code-line-numbers="6-7"}
from langchain_community.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

Chroma.from_documents(
    documents=utils.prepare_documents(),
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="./vectorstore-chroma",
)
```

::: notes
-   The `Chroma` class is a class that is part of the `LangChain`
    library
-   We also need to pass an `embedding` object to the `Chroma` class
-   On call, this method will create a vector store from the `Document`
    objects and then for each `Document` object, it will embed the
    content of the `Document` object using the `OpenAIEmbeddings` class
    and store the embeddings in the vector store
-   The metadata of the `Document` object is also stored in the vector
    store, but not embedded. We'll see why this is important later
:::

# Accessing the Vector Store

## Semantic similarity search {auto-animate="true"}

Semantic search accepts a query string and returns the most relevant
`Document` objects

``` {.python code-line-numbers="4"}
def shoot(
    vectorstore_path: str,
    k: int,
    query: str,
    method: SearchMethods = SearchMethods.similarity,
) -> List[Document]:
    """
    Create a retriever from a vectorstore and query it
    """
```

::: notes
-   This is the function signature for the `shoot` function in
    starpilot
:::

## Semantic similarity search {auto-animate="true"}

Create a `retriever` object from the vector store and query it

``` {.python code-line-numbers="9-21"}
def shoot(
    vectorstore_path: str,
    k: int,
    query: str,
    method: SearchMethods = SearchMethods.similarity,
) -> List[Document]:
    """
    Create a retriever from a vectorstore and query it
    """
    retriever = Chroma(
        persist_directory=vectorstore_path,
        embedding_function=OpenAIEmbeddings(
            model="text-embedding-3-large"
        ),
    ).as_retriever(
        search_type=method,
        search_kwargs={
            "k": k,
        },
    )
    return retriever.get_relevant_documents(query)
```

::: notes
-   This is the `shoot` function in starpilot, which is used to perform
    a semantic search on the vector store
-   First, we need to create a retriever object from the vector store
    class
-   We need to specify the embedding function that was used to embed the
    content of the `Document` object
-   We also need to specify the search method that we want to use, which
    is effectively the algorithm to compute the similarity between the
    query vector and the vectors in the vector store
-   Finally, we need to pass the query string to the
    `get_relevant_documents` method of the retriever object, and it will
    return the most relevant `Document` objects
:::

# Self Querying

## Self Querying {auto-animate="true"}

Self-querying is a way to pre-filter the results of a semantic search by
metadata fields

```` {.python code-line-numbers="1-16"}
def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    """
    A self-query of the vectorstore that allows the user to search for a repo while filtering by attributes

    Example:
    ```
    starpilot astrologer "What can I use to build a web app with Python?"
    starpilot astrologer "Suggest some Rust machine learning crates"
    ```

    """
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]

    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )

    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser

    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results
````

::: notes
-   So this is the `astrologer` function in starpilot
-   It's distinct from the `shoot` function in that it allows the user
    to filter the results by metadata fields
-   To do this we use more LangChain classes to orchestrate the query
:::
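
Conceptually, a self-query ends up as two things: a search string and a metadata filter. The sketch below fakes both with toy 2D vectors to show how a filter narrows the candidates before similarity ranking. Everything here (the `Repo` class, `search`, the vectors) is hypothetical and is not how LangChain or Chroma implement it.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Repo:
    name: str
    vector: Tuple[float, float]  # stand-in for a real embedding
    metadata: dict


def cosine(a, b) -> float:
    """Cosine similarity between two 2D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(repos, query_vector, where: Callable[[dict], bool], k: int = 4) -> List[Repo]:
    """Keep repos whose metadata passes `where`, then rank them by similarity."""
    candidates = [r for r in repos if where(r.metadata)]
    candidates.sort(key=lambda r: cosine(r.vector, query_vector), reverse=True)
    return candidates[:k]


repos = [
    Repo("pandas", (0.2, 0.9), {"primaryLanguage": "Python"}),
    Repo("polars", (0.3, 0.8), {"primaryLanguage": "Rust"}),
    Repo("tidyverse", (0.25, 0.85), {"primaryLanguage": "R"}),
]

# "Dataframes for Python" -> query "data frame" + filter eq("primaryLanguage", "Python")
query_vector = (0.1, 0.6)
results = search(repos, query_vector, where=lambda m: m.get("primaryLanguage") == "Python")
print([r.name for r in results])
```

Without the filter all three repos are close matches; with it, only the Python one survives.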

## Self Querying {auto-animate="true"}

Create a list of `AttributeInfo` objects that describe the metadata
fields

```` {.python code-line-numbers="17-43"}
def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    """
    A self-query of the vectorstore that allows the user to search for a repo while filtering by attributes

    Example:
    ```
    starpilot astrologer "What can I use to build a web app with Python?"
    starpilot astrologer "Suggest some Rust machine learning crates"
    ```

    """
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]

    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )

    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser

    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results
````

::: notes
-   To enable self-querying, we are effectively creating a prompt for an
    llm
-   The prompt is going to receive the user's query, then it will
    destructure the query into the part that is the query and the part
    that is the filter
-   To do this we need to describe those metadata fields from earlier we
    want to filter by
-   LangChain provides `AttributeInfo` objects to describe the metadata
    fields
-   We provide one for each field and collect them into a list
:::
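
::: notes
As a rough illustration of why the field descriptions matter, the sketch
below renders a list of field descriptions into text that could be
pasted into a prompt. `FieldInfo` and `render_schema` are hypothetical
stand-ins written for this note, not LangChain classes or starpilot
code, and the exact text LangChain generates will differ.

```python
import json
from dataclasses import dataclass


@dataclass
class FieldInfo:
    """Hypothetical stand-in for LangChain's AttributeInfo."""

    name: str
    description: str
    type: str


def render_schema(content_description: str, fields: list[FieldInfo]) -> str:
    """Turn the field descriptions into JSON text an LLM prompt could include."""
    attributes = {
        field.name: {"description": field.description, "type": field.type}
        for field in fields
    }
    schema = {"content": content_description, "attributes": attributes}
    return json.dumps(schema, indent=2)


fields = [
    FieldInfo("languages", "the programming languages of a repo", "string"),
    FieldInfo(
        "stargazerCount",
        "the number of stars a repository has on GitHub",
        "number",
    ),
]

schema_text = render_schema("content describing a repository on GitHub", fields)
print(schema_text)
```

The model only ever sees text like this, so a vague description or a
wrong type is likely to produce a vague or wrong filter.
:::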

## Self Querying {auto-animate="true"}

Construct the query prompt using `get_query_constructor_prompt`, which
returns a `BasePromptTemplate` object

```` {.python code-line-numbers="45-82"}
def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    """
    A self-query of the vectorstore that allows the user to search for a repo while filtering by attributes

    Example:
    ```
    starpilot astrologer "What can I use to build a web app with Python?"
    starpilot astrologer "Suggest some Rust machine learning crates"
    ```

    """
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]

    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )

    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser

    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results
````

::: notes
-   As well as the metadata fields, we describe the kind of data we are
    querying
-   We also provide examples of queries and filters in a 'few-shot'
    style prompting format
-   The examples should reference the metadata fields we described
    earlier
-   The `get_query_constructor_prompt` function also accepts a list of
    allowed comparators, which are specific to the vector store
    implementation. `Chroma` will have different comparators to
    `Weaviate` for example
:::
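
::: notes
The filter strings in the few-shot examples, such as
`gte("stargazerCount", 100)`, are a small expression language. The
sketch below shows one way such a string could be checked against a set
of allowed comparators. `parse_filter` is a hypothetical helper written
for this note and only handles a single comparison; LangChain's own
parser is more general and is not reproduced here.

```python
import ast
import re

# Lower-case names matching the comparators allowed on the slide.
ALLOWED_COMPARATORS = {"eq", "ne", "gt", "gte", "lt", "lte"}

# Matches a single call such as: gte("stargazerCount", 100)
FILTER_PATTERN = re.compile(r"^(\w+)\((.*)\)$")


def parse_filter(expression: str) -> tuple[str, str, object]:
    """Split a filter string into (comparator, attribute, value)."""
    match = FILTER_PATTERN.match(expression.strip())
    if match is None:
        raise ValueError(f"Not a filter expression: {expression!r}")

    comparator, raw_args = match.groups()
    if comparator not in ALLOWED_COMPARATORS:
        raise ValueError(f"Comparator {comparator!r} is not allowed")

    # literal_eval only accepts Python literals, so no code is executed.
    args = ast.literal_eval(f"({raw_args})")
    if not isinstance(args, tuple) or len(args) != 2:
        raise ValueError("Expected exactly an attribute and a value")

    attribute, value = args
    return comparator, attribute, value


parsed = parse_filter('gte("stargazerCount", 100)')
print(parsed)
```

Rejecting comparators that are not in the allowed set is what stops the
model from emitting a filter the vector store cannot execute.
:::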

## Self Querying {auto-animate="true"}

Use LangChain Expression Language in a chain to make a
`QueryConstructor` object

```` {.python code-line-numbers="84-88"}
def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    """
    A self-query of the vectorstore that allows the user to search for a repo while filtering by attributes

    Example:
    ```
    starpilot astrologer "What can I use to build a web app with Python?"
    starpilot astrologer "Suggest some Rust machine learning crates"
    ```

    """
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]

    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )

    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser

    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results
````

::: notes
-   LCEL (LangChain Expression Language) lets you pipe components
    together into a single execution chain
-   Here the prompt, the LLM and the output parser are piped together
    to make the `QueryConstructor` object
-   Similar to the `|` operator in bash or `.pipe()` in pandas
:::
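
::: notes
The `|` syntax is ordinary Python operator overloading. The sketch below
is a toy version of the idea, assuming each step is a single function of
one argument. The `Step` class and the fake prompt, LLM and parser are
hypothetical and written for this note; they are not LangChain's
implementation.

```python
from typing import Any, Callable


class Step:
    """Toy pipeable step: wraps a function and supports the `|` operator."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Piping returns a new step that runs self first, then other.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Hypothetical stand-ins for the prompt, the LLM and the output parser.
prompt = Step(lambda query: f"Rewrite as a structured query: {query}")
fake_llm = Step(lambda text: text.upper())
parser = Step(lambda text: {"raw": text})

chain = prompt | fake_llm | parser
result = chain.invoke("rust crates")
print(result)
```

Nothing runs when the chain is built; work only happens when `invoke` is
called, which is also how the chain on the slide behaves.
:::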

## Self Querying {auto-animate="true"}

Create a `SelfQueryRetriever` object from the `QueryConstructor` object
and the `Chroma` object

```` {.python code-line-numbers="90-104"}
def astrologer(
    query: str,
    k: Optional[int] = typer.Option(
        4, help="Number of results to fetch from the vectorstore"
    ),
) -> List[Document]:
    """
    A self-query of the vectorstore that allows the user to search for a repo while filtering by attributes

    Example:
    ```
    starpilot astrologer "What can I use to build a web app with Python?"
    starpilot astrologer "Suggest some Rust machine learning crates"
    ```

    """
    metadata_field_info = [
        AttributeInfo(
            name="languages",
            description="the programming languages of a repo. Example: ['python', 'R', 'Rust']",
            type="string",
        ),
        AttributeInfo(
            name="name",
            description="the name of a repository. Example: 'langchain'",
            type="string",
        ),
        AttributeInfo(
            name="topics",
            description="the topics a repository is tagged with. Example: ['data-science', 'machine-learning', 'web-development', 'tidyverse']",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="the url of a repository on GitHub",
            type="string",
        ),
        AttributeInfo(
            name="stargazerCount",
            description="the number of stars a repository has on GitHub",
            type="number",
        ),
    ]

    document_content_description = "content describing a repository on GitHub"

    prompt = get_query_constructor_prompt(
        document_content_description,
        metadata_field_info,
        examples=[
            (
                "Python machine learning repos",
                {
                    "query": "machine learning",
                    "filter": 'eq("primaryLanguage", "Python")',
                },
            ),
            (
                "Rust Dataframe crates",
                {"query": "data frame", "filter": 'eq("primaryLanguage", "Rust")'},
            ),
            (
                "What R packages do time series analysis",
                {"query": "time series", "filter": 'eq("primaryLanguage", "R")'},
            ),
            (
                "data frame packages with 100 stars or more",
                {
                    "query": "data frame",
                    "filter": 'gte("stargazerCount", 100)',
                },
            ),
        ],
        allowed_comparators=[
            Comparator.EQ,
            Comparator.NE,
            Comparator.GT,
            Comparator.GTE,
            Comparator.LT,
            Comparator.LTE,
        ],
    )

    llm = ChatOpenAI(model="gpt-3.5-turbo",)

    output_parser = StructuredQueryOutputParser.from_components()

    query_constructor = prompt | llm | output_parser

    vectorstore = Chroma(
        persist_directory="./vectorstore-chroma",
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
    )

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vectorstore,
        structured_query_translator=ChromaTranslator(),
        search_kwargs={"k": k},
    )

    results = retriever.invoke(query)

    return results
````

::: notes
-   Finally we instantiate the `SelfQueryRetriever` class with the
    `QueryConstructor` object we created earlier and the `Chroma` object
    like last time
-   We then call the `invoke` method on the `SelfQueryRetriever` object
    with the query string
-   The `invoke` method will return the most relevant `Document` objects
    *after* filtering by the metadata fields
:::
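
::: notes
To make "filter, then rank" concrete, here is a minimal sketch using
made-up two-dimensional vectors and made-up star counts. The
`filtered_search` function and the `repo-*` documents are hypothetical
and written for this note; a real vector store such as Chroma does this
internally and far more efficiently.

```python
import math
import operator

# Map comparator names to Python's comparison functions.
COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def filtered_search(query_vector, documents, comparator, attribute, value, k=4):
    """Keep documents whose metadata passes the filter, then rank by similarity."""
    compare = COMPARATORS[comparator]
    candidates = [
        doc for doc in documents if compare(doc["metadata"][attribute], value)
    ]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: the vectors and star counts are invented for illustration.
documents = [
    {"name": "repo-a", "vector": [0.9, 0.1], "metadata": {"stargazerCount": 2500}},
    {"name": "repo-b", "vector": [1.0, 0.0], "metadata": {"stargazerCount": 12}},
    {"name": "repo-c", "vector": [0.1, 0.9], "metadata": {"stargazerCount": 6000}},
]

results = filtered_search([1.0, 0.0], documents, "gte", "stargazerCount", 100, k=2)
names = [doc["name"] for doc in results]
print(names)
```

`repo-b` is the closest match to the query vector, but it never reaches
the ranking step because it fails the star-count filter.
:::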

## Demo `astrologer` function

```{=html}
<script src="https://asciinema.org/a/UvFTn7EMZoUVC8eMbWU59mNyc.js" id="asciicast-UvFTn7EMZoUVC8eMbWU59mNyc" async="true"></script>
```

# Evaluation

## What I liked {auto-animate="true"}

-   `LangChain` is "extensively" documented
-   LangChain company-backed videos and tutorials
-   LangChain is a suitable orchestration framework
    -   It's like `scikit-learn` for LLMs

::: notes
-   `LangChain` has a huge amount of documentation
-   It's relatively easy to find an example from a notebook that
    stitches together what you might want
-   There are also all the expected class-oriented API docs
-   The LangChain *company* has worked really hard to produce lots of
    examples and tutorials
    -   Lance from LangChain has great hands-on YouTube videos
-   LangChain fulfills the goal of 'being the glue between
    LLM/AI/Vectorstore services'
    -   Kind of like `scikit-learn` for LLMs
:::

## What I didn't like {auto-animate="true"}

-   *Heavily* OOP
-   Reading the source code is still necessary
    -   `kwargs` are often silently ignored and undocumented
    -   'Wrappers' and 'Connectors' don't implement the full API
-   Moves fast and breaks things (\@`v0.1.*`)
-   Observability is secondary

::: notes
-   *heavily* OOP
-   Must read source code to understand exactly what is and isn't
    supported for each class
    -   Entirely unclear when `kwargs` are actually needed, or when they
        are silently ignored
    -   'Wrappers' and 'Connectors' don't implement the full API
-   Moves fast and breaks things (\@`v0.1.*`)
    -   Documentation is not always up to date and often conflicting
    -   Huge structural changes (\@`v0.2.*`) split into `langchain` and
        `langchain_community`
-   Observability is secondary
    -   LangChain is a *company* after all
    -   They are investing a lot into the langchain ecosystem, and
        AI/LLM community as a whole
    -   They do however still need to make themselves financially
        sustainable, and they are attempting to do that by
        monetizing observability for production AI applications through
        LangSmith
    -   I'm mixed about that, in that I think it's important to make
        sure they don't end up as a vast voluntary project that leads to
        burnout and the framework failing, but at the same time they are
        taking a very product centric route out of that.
    -   I feel that as a result *developer* logging and observability
        suffers as a side effect.
:::
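
::: notes
The "silently ignored `kwargs`" problem is easy to reproduce in plain
Python. The sketch below uses a hypothetical `make_retriever` function
and a hypothetical `score_threshold` option, both invented for this
note; it does not reproduce any specific LangChain class.

```python
def make_retriever(k: int = 4, **kwargs) -> dict:
    """Hypothetical wrapper: accepts any keyword argument but only uses `k`."""
    # Everything in kwargs is dropped here without a warning or an error.
    return {"k": k}


# The caller believes a score threshold is applied, but it never is.
config = make_retriever(k=2, score_threshold=0.8)
print(config)
```

Because the call succeeds, the only way to discover that the option had
no effect is to read the function body, which matches the experience
described above.
:::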

# Conclusion

## Outcome for developers

-   `starpilot` is available now: `DaveParr/starpilot`
    -   Star it, fork it, PR it!
-   `LangChain` is also available now: `langchain-ai/langchain`
-   This talk is also also available now:
    `DaveParr/starpilot-presentation`

::: notes
-   All the code I've shown today, and all the code I didn't show, is
    open source licensed, so dig in, fork it and build a new feature.
    This is already my most popular repo to date, so I'm excited to see
    where it goes.
-   `LangChain` is also open source licensed, but much harder to get
    involved in, and tougher to read
-   This talk is also open source, along with speaker notes, so you can
    go back and remind yourself of the last 30 minutes of your life
:::

## Outcome for me

-   I have a working prototype
-   I have a better understanding of the LangChain library
-   I've carried this understanding into another, similar project for
    Magic the Gathering, which is underway

::: notes
-   I have something that works on my machine, that I can use to get
    more out of my GitHub stars, so that's nice
-   I've also been able to get hands on with building something novel,
    which is something that I enjoy from a learning perspective
-   And I've started to transfer these and other concepts into a new
    project on Magic the Gathering, which builds on these ideas with a
    new dataset, utilises Hypothetical Document Embeddings, and has a
    Streamlit frontend, which might be a future talk?
:::

# Thanks

-   GitHub: [`daveparr`](https://github.com/DaveParr)
-   Blog: [`daveparr.info`](https://daveparr.info)
-   [Starpilot](https://github.com/DaveParr/starpilot)
    -   [Starpilot Presentation](https://github.com/DaveParr/starpilot-presentation)