langchain[minor],docs[minor]: Add MatryoshkaRetriever (#4458)
* langchain[minor],docs[minor]: Add AdaptiveRetrieval to lc experimental

* chore: lint files

* cr

* chore: lint files

* cr

* Update docs/core_docs/docs/modules/experimental/retrievers/matryoshka_retrieval.mdx

* chore: lint files

* Major improvements & fixes

* tests

* cr

* cr

* rm deno lock

* fixed tests

* rm build artifacts

* chore: lint files

* add note about stringified stuff

* cleanup docs

* chore: lint files

* retrieval to retriever

* docs file name retrieval to retriever

* correct import

* cr
bracesproul authored Feb 29, 2024
1 parent 9490454 commit 36c03e4
Showing 18 changed files with 493 additions and 0 deletions.
4 changes: 4 additions & 0 deletions docs/core_docs/.gitignore
@@ -33,6 +33,7 @@ yarn-error.log*
/.quarto/

# AUTO_GENERATED_DOCS
docs/expression_language/streaming.mdx
docs/use_cases/tool_use/human_in_the_loop.mdx
docs/use_cases/question_answering/streaming.mdx
docs/use_cases/question_answering/sources.mdx
@@ -41,4 +42,7 @@ docs/use_cases/question_answering/local_retrieval_qa.mdx
docs/use_cases/question_answering/conversational_retrieval_agents.mdx
docs/use_cases/question_answering/citations.mdx
docs/use_cases/question_answering/chat_history.mdx
docs/modules/model_io/output_parsers/custom.mdx
docs/modules/memory/chat_messages/custom.mdx
docs/modules/data_connection/vectorstores/custom.mdx
docs/modules/model_io/output_parsers/types/openai_tools.mdx
@@ -0,0 +1,48 @@
# Matryoshka Retriever

This retriever implements the technique described in the [Supabase](https://supabase.com/) blog post
["Matryoshka embeddings: faster OpenAI vector search using Adaptive Retrieval"](https://supabase.com/blog/matryoshka-embeddings).

## Overview

This class performs "Adaptive Retrieval" for searching text embeddings efficiently using the
Matryoshka Representation Learning (MRL) technique. It retrieves documents similar to a query
embedding in two steps:

- **First-pass**: Uses a lower dimensional sub-vector from the MRL embedding for an initial, fast,
but less accurate search.

- **Second-pass**: Re-ranks the top results from the first pass using the full, high-dimensional
embedding for higher accuracy.

This code demonstrates using MRL embeddings for efficient vector search by combining faster,
lower-dimensional initial search with accurate, high-dimensional re-ranking.
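
The two passes described above can be sketched in plain TypeScript. This is a minimal illustration under stated assumptions, not the `MatryoshkaRetriever` implementation: the `Doc` type, the `topK` and `adaptiveRetrieve` helpers, the `smallK` parameter, and the toy vectors are all hypothetical, and the search is brute force rather than backed by a vector store.

```typescript
// Hypothetical document shape: each document carries both embeddings.
type Doc = { id: string; small: number[]; large: number[] };

// Cosine similarity between two equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k highest-scoring items, best first.
function topK<T>(items: T[], score: (item: T) => number, k: number): T[] {
  return items
    .map((item) => ({ item, s: score(item) }))
    .sort((x, y) => y.s - x.s)
    .slice(0, k)
    .map(({ item }) => item);
}

// First pass: cheap search over the small vectors.
// Second pass: re-rank only the surviving candidates with the large vectors.
function adaptiveRetrieve(
  docs: Doc[],
  querySmall: number[],
  queryLarge: number[],
  smallK: number,
  largeK: number
): Doc[] {
  const candidates = topK(
    docs,
    (d) => cosineSimilarity(d.small, querySmall),
    smallK
  );
  return topK(
    candidates,
    (d) => cosineSimilarity(d.large, queryLarge),
    largeK
  );
}

// Toy data: 2-dimensional "small" and 4-dimensional "large" vectors.
const docs: Doc[] = [
  { id: "a", small: [1, 0], large: [1, 0, 1, 0] },
  { id: "b", small: [0.9, 0.1], large: [0.9, 0.1, 0, 1] },
  { id: "c", small: [0, 1], large: [0, 1, 0, 1] },
];

const reranked = adaptiveRetrieve(docs, [1, 0], [1, 0, 0, 1], 2, 2);
console.log(reranked.map((d) => d.id));
```

In this toy data, document `a` wins the first pass but `b` overtakes it once the full vectors are compared, which is the case the second pass exists to catch. Document `c` never reaches the second pass, so `smallK` controls how much recall the first pass can lose.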

## Example

### Setup

import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx";

<IntegrationInstallTooltip></IntegrationInstallTooltip>

```bash npm2yarn
npm install @langchain/openai @langchain/community
```

To follow the example below, you need an OpenAI API key:

```bash
export OPENAI_API_KEY=your-api-key
```

We'll also be using `chroma` for our vector store. Follow the instructions [here](/docs/integrations/vectorstores/chroma) to set it up.

import CodeBlock from "@theme/CodeBlock";
import Example from "@examples/retrievers/matryoshka_retriever.ts";

<CodeBlock language="typescript">{Example}</CodeBlock>

:::note
Due to the constraints of some vector stores, the large embedding metadata field is stringified (`JSON.stringify`) before being stored. This means that the metadata field will need to be parsed (`JSON.parse`) when retrieved from the vector store.
:::
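
The round trip mentioned in the note can be sketched as follows. This is an illustration only: the `large_embedding` key and both helper functions are hypothetical names, and the key the retriever actually uses may differ.

```typescript
// Hypothetical metadata key; check the retriever's options for the real one.
const LARGE_EMBEDDING_KEY = "large_embedding";

type Metadata = Record<string, unknown>;

// Store the large embedding as a JSON string, since some vector stores
// only accept scalar metadata values.
function packLargeEmbedding(metadata: Metadata, embedding: number[]): Metadata {
  return { ...metadata, [LARGE_EMBEDDING_KEY]: JSON.stringify(embedding) };
}

// Parse the stringified embedding back into a number array, validating
// the shape before returning it.
function unpackLargeEmbedding(metadata: Metadata): number[] {
  const raw = metadata[LARGE_EMBEDDING_KEY];
  if (typeof raw !== "string") {
    throw new Error(`Expected a string under "${LARGE_EMBEDDING_KEY}"`);
  }
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed) || !parsed.every((v) => typeof v === "number")) {
    throw new Error("Stored embedding is not an array of numbers");
  }
  return parsed as number[];
}

const stored = packLargeEmbedding({ source: "demo" }, [0.1, 0.2, 0.3]);
const embedding = unpackLargeEmbedding(stored);
console.log(embedding.length);
```

Validating the parsed value guards against metadata that was written by a different tool or truncated by the store.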
1 change: 1 addition & 0 deletions environment_tests/test-exports-bun/src/entrypoints.js
@@ -89,6 +89,7 @@ export * from "langchain/retrievers/document_compressors/embeddings_filter";
export * from "langchain/retrievers/hyde";
export * from "langchain/retrievers/score_threshold";
export * from "langchain/retrievers/vespa";
export * from "langchain/retrievers/matryoshka_retriever";
export * from "langchain/cache";
export * from "langchain/stores/doc/in_memory";
export * from "langchain/stores/file/in_memory";
1 change: 1 addition & 0 deletions environment_tests/test-exports-cf/src/entrypoints.js
@@ -89,6 +89,7 @@ export * from "langchain/retrievers/document_compressors/embeddings_filter";
export * from "langchain/retrievers/hyde";
export * from "langchain/retrievers/score_threshold";
export * from "langchain/retrievers/vespa";
export * from "langchain/retrievers/matryoshka_retriever";
export * from "langchain/cache";
export * from "langchain/stores/doc/in_memory";
export * from "langchain/stores/file/in_memory";
1 change: 1 addition & 0 deletions environment_tests/test-exports-cjs/src/entrypoints.js
@@ -89,6 +89,7 @@ const retrievers_document_compressors_embeddings_filter = require("langchain/ret
const retrievers_hyde = require("langchain/retrievers/hyde");
const retrievers_score_threshold = require("langchain/retrievers/score_threshold");
const retrievers_vespa = require("langchain/retrievers/vespa");
const retrievers_matryoshka_retriever = require("langchain/retrievers/matryoshka_retriever");
const cache = require("langchain/cache");
const stores_doc_in_memory = require("langchain/stores/doc/in_memory");
const stores_file_in_memory = require("langchain/stores/file/in_memory");
1 change: 1 addition & 0 deletions environment_tests/test-exports-esbuild/src/entrypoints.js
@@ -89,6 +89,7 @@ import * as retrievers_document_compressors_embeddings_filter from "langchain/re
import * as retrievers_hyde from "langchain/retrievers/hyde";
import * as retrievers_score_threshold from "langchain/retrievers/score_threshold";
import * as retrievers_vespa from "langchain/retrievers/vespa";
import * as retrievers_matryoshka_retriever from "langchain/retrievers/matryoshka_retriever";
import * as cache from "langchain/cache";
import * as stores_doc_in_memory from "langchain/stores/doc/in_memory";
import * as stores_file_in_memory from "langchain/stores/file/in_memory";
1 change: 1 addition & 0 deletions environment_tests/test-exports-esm/src/entrypoints.js
@@ -89,6 +89,7 @@ import * as retrievers_document_compressors_embeddings_filter from "langchain/re
import * as retrievers_hyde from "langchain/retrievers/hyde";
import * as retrievers_score_threshold from "langchain/retrievers/score_threshold";
import * as retrievers_vespa from "langchain/retrievers/vespa";
import * as retrievers_matryoshka_retriever from "langchain/retrievers/matryoshka_retriever";
import * as cache from "langchain/cache";
import * as stores_doc_in_memory from "langchain/stores/doc/in_memory";
import * as stores_file_in_memory from "langchain/stores/file/in_memory";
1 change: 1 addition & 0 deletions environment_tests/test-exports-vercel/src/entrypoints.js
@@ -89,6 +89,7 @@ export * from "langchain/retrievers/document_compressors/embeddings_filter";
export * from "langchain/retrievers/hyde";
export * from "langchain/retrievers/score_threshold";
export * from "langchain/retrievers/vespa";
export * from "langchain/retrievers/matryoshka_retriever";
export * from "langchain/cache";
export * from "langchain/stores/doc/in_memory";
export * from "langchain/stores/file/in_memory";
1 change: 1 addition & 0 deletions environment_tests/test-exports-vite/src/entrypoints.js
@@ -89,6 +89,7 @@ export * from "langchain/retrievers/document_compressors/embeddings_filter";
export * from "langchain/retrievers/hyde";
export * from "langchain/retrievers/score_threshold";
export * from "langchain/retrievers/vespa";
export * from "langchain/retrievers/matryoshka_retriever";
export * from "langchain/cache";
export * from "langchain/stores/doc/in_memory";
export * from "langchain/stores/file/in_memory";
1 change: 1 addition & 0 deletions examples/package.json
@@ -25,6 +25,7 @@
"dependencies": {
"@clickhouse/client": "^0.2.5",
"@elastic/elasticsearch": "^8.4.0",
"@faker-js/faker": "^8.4.1",
"@getmetal/metal-sdk": "^4.0.0",
"@getzep/zep-js": "^0.9.0",
"@gomomento/sdk": "^1.51.1",
67 changes: 67 additions & 0 deletions examples/src/retrievers/matryoshka_retriever.ts
@@ -0,0 +1,67 @@
import { MatryoshkaRetriever } from "langchain/retrievers/matryoshka_retriever";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { faker } from "@faker-js/faker";

const smallEmbeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
  dimensions: 512, // Min num for small
});
const largeEmbeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-large",
  dimensions: 3072, // Max num for large
});

const vectorStore = new Chroma(smallEmbeddings, {
  numDimensions: 512,
});

const retriever = new MatryoshkaRetriever({
  vectorStore,
  largeEmbeddingModel: largeEmbeddings,
  largeK: 5,
});

const irrelevantDocs = Array.from({ length: 250 }).map(
  () =>
    new Document({
      pageContent: faker.lorem.word(7), // Similar length to the relevant docs
    })
);
const relevantDocs = [
  new Document({
    pageContent: "LangChain is an open source github repo",
  }),
  new Document({
    pageContent: "There are JS and PY versions of the LangChain github repos",
  }),
  new Document({
    pageContent: "LangGraph is a new open source library by the LangChain team",
  }),
  new Document({
    pageContent: "LangChain announced GA of LangSmith last week!",
  }),
  new Document({
    pageContent: "I heart LangChain",
  }),
];
const allDocs = [...irrelevantDocs, ...relevantDocs];

/**
* IMPORTANT:
* The `addDocuments` method on `MatryoshkaRetriever` will
* generate the small AND large embeddings for all documents.
*/
await retriever.addDocuments(allDocs);

const query = "What is LangChain?";
const results = await retriever.getRelevantDocuments(query);
console.log(results.map(({ pageContent }) => pageContent).join("\n"));
/**
I heart LangChain
LangGraph is a new open source library by the LangChain team
LangChain is an open source github repo
LangChain announced GA of LangSmith last week!
There are JS and PY versions of the LangChain github repos
*/
4 changes: 4 additions & 0 deletions langchain/.gitignore
@@ -930,6 +930,10 @@ retrievers/vespa.cjs
retrievers/vespa.js
retrievers/vespa.d.ts
retrievers/vespa.d.cts
retrievers/matryoshka_retriever.cjs
retrievers/matryoshka_retriever.js
retrievers/matryoshka_retriever.d.ts
retrievers/matryoshka_retriever.d.cts
cache.cjs
cache.js
cache.d.ts
1 change: 1 addition & 0 deletions langchain/langchain.config.js
@@ -295,6 +295,7 @@ export const config = {
"retrievers/self_query/weaviate": "retrievers/self_query/weaviate",
"retrievers/self_query/vectara": "retrievers/self_query/vectara",
"retrievers/vespa": "retrievers/vespa",
"retrievers/matryoshka_retriever": "retrievers/matryoshka_retriever",
// cache
cache: "cache/index",
"cache/cloudflare_kv": "cache/cloudflare_kv",
13 changes: 13 additions & 0 deletions langchain/package.json
@@ -942,6 +942,10 @@
"retrievers/vespa.js",
"retrievers/vespa.d.ts",
"retrievers/vespa.d.cts",
"retrievers/matryoshka_retriever.cjs",
"retrievers/matryoshka_retriever.js",
"retrievers/matryoshka_retriever.d.ts",
"retrievers/matryoshka_retriever.d.cts",
"cache.cjs",
"cache.js",
"cache.d.ts",
@@ -3641,6 +3645,15 @@
"import": "./retrievers/vespa.js",
"require": "./retrievers/vespa.cjs"
},
"./retrievers/matryoshka_retriever": {
"types": {
"import": "./retrievers/matryoshka_retriever.d.ts",
"require": "./retrievers/matryoshka_retriever.d.cts",
"default": "./retrievers/matryoshka_retriever.d.ts"
},
"import": "./retrievers/matryoshka_retriever.js",
"require": "./retrievers/matryoshka_retriever.cjs"
},
"./cache": {
"types": {
"import": "./cache.d.ts",
1 change: 1 addition & 0 deletions langchain/src/load/import_map.ts
@@ -45,6 +45,7 @@ export * as retrievers__document_compressors__embeddings_filter from "../retriev
export * as retrievers__hyde from "../retrievers/hyde.js";
export * as retrievers__score_threshold from "../retrievers/score_threshold.js";
export * as retrievers__vespa from "../retrievers/vespa.js";
export * as retrievers__matryoshka_retriever from "../retrievers/matryoshka_retriever.js";
export * as stores__doc__in_memory from "../stores/doc/in_memory.js";
export * as stores__file__in_memory from "../stores/file/in_memory.js";
export * as stores__message__in_memory from "../stores/message/in_memory.js";