Core abstractions review
* Reviewed that documentation matches the code and fixed docs discrepancies
* Refactored the `Embedding` object to be a record and include metadata and an ID that will be retrieved afterwards from the vector store
* Implemented constructor that accepts explicit credentials in addition to supporting environment variables
* The OpenAI embeddings generator now injects the original text, the model name and a timestamp in the embeddings metadata automatically
javiertoledo committed Aug 29, 2023
1 parent d3920c7 commit fadd524
Showing 13 changed files with 258 additions and 151 deletions.
83 changes: 56 additions & 27 deletions .idea/workspace.xml

@@ -1,12 +1,14 @@
package com.theagilemonkeys.ellmental.core.schema;

import java.util.List;

public class Embedding {

public Embedding(List<Double> vector) {
this.vector = vector;
}
public List<Double> vector;

}
import java.util.Map;
import java.util.UUID;

/**
* An embedding represents a point in the embeddings space that captures the semantics of a given text.
*/
public record Embedding(
UUID id,
List<Double> vector,
Map<String, String> metadata
) {}
35 changes: 28 additions & 7 deletions docs-site/docs/03_components/01_core_abstractions.md
@@ -1,6 +1,23 @@
# Core Abstractions

eLLMental uses different 3rd party components and APIs and provides a unified interface. To ensure extensibility and avoid tight coupling with any specific API, the library provides a series of abstract classes that define the expected interface for these components to work with eLLMental. To use eLLMental, you can provide your own implementation or use one of the built-in concrete implementations.
eLLMental integrates several third-party components and APIs behind a unified interface. To ensure extensibility and avoid tight coupling with any specific API, the library provides a series of abstract classes that define the interface these components must implement to work with eLLMental. To use eLLMental, you can provide your own implementation or use one of the built-in concrete implementations.

## `Embedding` object

In eLLMental, embeddings are represented by the `Embedding` record, which has the following attributes:

- `id`: A unique identifier of the embedding.
- `vector`: A numeric vector that represents the semantic location of the text.
- `metadata`: Additional information associated with the embedding. It can be used to store the original text, the model used to generate the embedding, or any other information you may find useful.

```java
public record Embedding(
UUID id,
List<Double> vector,
Map<String, String> metadata
) {}

```
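
As a minimal sketch of how the record might be used, the example below builds an `Embedding` by hand and reads it back through the accessor methods that Java generates for records. The `EmbeddingRecordExample` class, the vector values and the `input` metadata entry are illustrative assumptions, and the record is redeclared locally only so the snippet compiles on its own.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Local copy of the record shown above, so the sketch compiles on its own.
record Embedding(UUID id, List<Double> vector, Map<String, String> metadata) {}

// Hypothetical example class, not part of eLLMental.
class EmbeddingRecordExample {
    // Builds an embedding by hand; the vector values and metadata are made up.
    static Embedding sample() {
        return new Embedding(
            UUID.randomUUID(),
            List.of(0.12, -0.48, 0.33),
            Map.of("input", "Sample string"));
    }

    public static void main(String[] args) {
        Embedding embedding = sample();
        // Records expose their components through accessor methods, not public fields.
        System.out.println(embedding.id());
        System.out.println(embedding.vector().size());
        System.out.println(embedding.metadata().get("input"));
    }
}
```

Because the components are read with `vector()` rather than a public `vector` field, code written against the previous `Embedding` class needs the parentheses added.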

## `EmbeddingsGenerationModel`

@@ -17,19 +34,22 @@ public abstract class EmbeddingsGenerationModel {
eLLMental provides an implementation to use [OpenAI's embeddings model](https://platform.openai.com/docs/guides/embeddings). This model is only accessible via API, so you'll need to initialize it with a valid OpenAI API key.

```java
// You can explicitly initialize the model with your API key, or use the constructor without parameters and set the `OPEN_AI_API_KEY` environment variable.
EmbeddingsGenerationModel openAIModel = new OpenAIEmbeddingsModel("YOUR_OPENAI_API_KEY");

// You'll rarely need to interact directly with the `openAIModel`, but you can use it to generate an embedding:
openAIModel.generateEmbedding("Sample string");
// You'll rarely need to interact directly with the `openAIModel`, but you can use it to generate an embedding object:
Embedding embedding = openAIModel.generateEmbedding("Sample string");
```

The OpenAI embeddings generator automatically adds four entries to the metadata: the original text (`input`), the provider (`source`), the model used to generate the embedding (`model`) and the creation timestamp (`createdAt`).
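
The following sketch shows one way to read that metadata back. It assumes the key names used by `OpenAIEmbeddingsModel` in this commit (`input`, `source`, `model`, `createdAt`) and that `createdAt` holds the output of `LocalDateTime.toString()`; the `EmbeddingMetadataExample` class and the sample values are hypothetical.

```java
import java.time.LocalDateTime;
import java.util.Map;

// Hypothetical helper, not part of eLLMental.
class EmbeddingMetadataExample {
    // Parses the `createdAt` entry, assumed to be an ISO-8601 local date-time string.
    static LocalDateTime createdAt(Map<String, String> metadata) {
        return LocalDateTime.parse(metadata.get("createdAt"));
    }

    public static void main(String[] args) {
        // Sample metadata shaped like the entries the OpenAI generator is assumed to write.
        Map<String, String> metadata = Map.of(
            "input", "Sample string",
            "source", "OpenAI",
            "model", "text-embedding-ada-002",
            "createdAt", LocalDateTime.now().toString());

        System.out.println(metadata.get("model"));
        System.out.println(createdAt(metadata));
    }
}
```

Since metadata values are plain strings, anything typed (such as the timestamp) has to be parsed by the caller.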

## `EmbeddingsStore`

This abstract class defines the expected interface for a persistence mechanism capable of storing and querying embeddings:

```java
public abstract class EmbeddingsStore {
public abstract void store(Embedding embedding, Metadata metadata);
public abstract void store(Embedding embedding);
public abstract List<Embedding> similaritySearch(Embedding reference, int limit);
}
```
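
To illustrate what providing your own implementation could look like, here is a minimal in-memory sketch that ranks stored embeddings by cosine similarity. `InMemoryEmbeddingsStore` is a hypothetical class, not something eLLMental ships, and the choice of cosine similarity is an assumption; the `Embedding` record and `EmbeddingsStore` class are redeclared locally only so the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Local copies of the eLLMental types shown above, so the sketch compiles on its own.
record Embedding(UUID id, List<Double> vector, Map<String, String> metadata) {}

abstract class EmbeddingsStore {
    public abstract void store(Embedding embedding);
    public abstract List<Embedding> similaritySearch(Embedding reference, int limit);
}

// Hypothetical store (not part of eLLMental): keeps embeddings in a list and
// ranks them by cosine similarity to the reference vector.
class InMemoryEmbeddingsStore extends EmbeddingsStore {
    private final List<Embedding> embeddings = new ArrayList<>();

    @Override
    public void store(Embedding embedding) {
        embeddings.add(embedding);
    }

    @Override
    public List<Embedding> similaritySearch(Embedding reference, int limit) {
        // Sorting by negated similarity puts the most similar embeddings first.
        return embeddings.stream()
            .sorted(Comparator.comparingDouble(
                (Embedding e) -> -cosine(reference.vector(), e.vector())))
            .limit(limit)
            .toList();
    }

    // Cosine similarity; assumes both vectors have the same length and are non-zero.
    static double cosine(List<Double> a, List<Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.size(); i++) {
            dot += a.get(i) * b.get(i);
            normA += a.get(i) * a.get(i);
            normB += b.get(i) * b.get(i);
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A linear scan like this is only reasonable for tests or small data sets; a vector database is the better fit once the number of embeddings grows.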
@@ -39,9 +59,10 @@ public abstract class EmbeddingsStore {
eLLMental provides a concrete implementation for Pinecone, which requires a URL, an API key and a namespace.

```java
EmbeddingsStore pineconeStore = new PineconeEmbeddingsStore("YOUR_PINECONE_URL", "YOUR_PINECONE_API_KEY", "YOUR_PINECONE_SPACE");
// You can explicitly initialize the store with your credentials, or use the constructor with no parameters and set the `PINECONE_URL`, `PINECONE_API_KEY` and `PINECONE_NAMESPACE` environment variables.
EmbeddingsStore pineconeStore = new PineconeEmbeddingsStore("YOUR_PINECONE_URL", "YOUR_PINECONE_API_KEY", "YOUR_PINECONE_NAMESPACE");

// You can insert or perform similarity searches using this object. Metadata is optional.
pineconeStore.store(someEmbedding, someMetadata);
// You can now insert or perform similarity searches using the pineconeStore instance:
pineconeStore.store(someEmbedding);
List<Embedding> similarEmbeddings = pineconeStore.similaritySearch(referenceEmbedding, 5);
```
20 changes: 0 additions & 20 deletions docs-site/docs/03_components/02_embeddings_space.md
@@ -11,26 +11,6 @@ Leveraging the power of embeddings models, this component allows you to represen
[//]: # (TODO: Highlight this as a warning message in Docusaurus)
> **Warning**: The current version supports the [OpenAI embeddings model](https://platform.openai.com/docs/guides/embeddings) and [Pinecone](https://www.pinecone.io) for storage. Please provide the necessary credentials for these services. Note: these services may involve costs, so always review their pricing details before use.
## `Embedding` object

In eLLMental embeddings are represented by the `Embedding` class, which has the following attributes:

- `id`: The unique identifier of the embedding.
- `embedding`: The numeric representation of the text.
- `metadata`: Additional information associated with the embedding.
- `createdAt`: The timestamp of the embedding creation.
- `modelId`: The identifier of the embeddings model used to generate the embedding.

```java
public class Embedding {
private final String id;
private final float[] embedding;
private final Map<String, String> metadata;
private final Instant createdAt;
private final String modelId;
}
```

The `EmbeddingsSpaceComponent` interface defines the following methods:

## Constructor
@@ -16,10 +16,7 @@ public static void main(String[] args) {

// Step 2: save the generated embeddings to a store (Pinecone in this case)
EmbeddingsStore embeddingStore = new PineconeEmbeddingsStore();
Map<String, String> metadata = new HashMap<>();
metadata.put("key1", "value1");
metadata.put("key2", "value2");
embeddingStore.store(embedding, metadata);
embeddingStore.store(embedding);

// Step 3: search for the embedding in the store
List<Embedding> searchEmbeddings = embeddingStore.similaritySearch(embedding, 5);
@@ -3,6 +3,9 @@

import com.theagilemonkeys.ellmental.core.schema.Embedding;

/**
* Abstract class that defines the interface expected by eLLMental for a valid embeddings generation model.
*/
public abstract class EmbeddingsGenerationModel {
public abstract Embedding generateEmbedding(String text);
}
@@ -11,30 +11,52 @@

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/**
* OpenAI `EmbeddingsGenerationModel` implementation.
*/
public class OpenAIEmbeddingsModel extends EmbeddingsGenerationModel {
private static OpenAiService service;
private static String openAIKey;
private static OpenAiService cachedService;
public static String embeddingOpenAiModel = "text-embedding-ada-002";

/**
* Constructor that initializes the OpenAI embeddings model.
* It will try to load the API key from the environment variable `OPEN_AI_API_KEY`.
*/
public OpenAIEmbeddingsModel() {
var dotenv = Dotenv
.configure()
.ignoreIfMissing()
.ignoreIfMalformed()
.load();
String openAIKey = dotenv.get("OPEN_AI_API_KEY");
if (openAIKey == null) {
throw new EnvironmentVariableNotDeclaredException("Environment variable OPEN_AI_API_KEY is not declared.");
}
service = new OpenAiService(openAIKey);
this(null);
}

/**
* Constructor that initializes the OpenAI embeddings model with an explicit API Key.
*
* @param APIKey OpenAI API key. If null, the key is loaded from the `OPEN_AI_API_KEY` environment variable.
*/
public OpenAIEmbeddingsModel(String APIKey) {
if (APIKey != null) {
openAIKey = APIKey;
} else { // Tries to load it from the environment variable
var dotenv = Dotenv
.configure()
.ignoreIfMissing()
.ignoreIfMalformed()
.load();
openAIKey = dotenv.get("OPEN_AI_API_KEY");
}
}

/**
* Generates an embedding for the given input string.
*
* @param inputString Input string to generate the embedding for.
* @return An embedding object for the given text.
*/
public Embedding generateEmbedding(String inputString) {

// TODO: the embeddings function from the library uses an array as input. We are
// only using a length 1 array.
// Check if we should implement an array option.
/* TODO: the embeddings function from the library takes an array as input.
We only ever pass an array of length 1. Should we implement an array option? */
List<String> embeddingsInput = new ArrayList<>();
embeddingsInput.add(inputString);

@@ -43,8 +65,27 @@ public Embedding generateEmbedding(String inputString) {
.input(embeddingsInput)
.build();

List<Double> embedding = service.createEmbeddings(embeddingRequest).getData().get(0).getEmbedding();
List<Double> vector = getService()
.createEmbeddings(embeddingRequest)
.getData()
.get(0)
.getEmbedding();

return new Embedding(embedding);
Map<String, String> metadata = new java.util.HashMap<>();
metadata.put("input", inputString);
metadata.put("source", "OpenAI");
metadata.put("model", embeddingOpenAiModel);
metadata.put("createdAt", java.time.LocalDateTime.now().toString());
return new Embedding(UUID.randomUUID(), vector, metadata);
}

private OpenAiService getService() {
if (cachedService == null) {
if (openAIKey == null) {
throw new EnvironmentVariableNotDeclaredException("Environment variable OPEN_AI_API_KEY is required.");
}
cachedService = new OpenAiService(openAIKey);
}
return cachedService;
}
}
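
The constructors above resolve the API key from an explicit argument first and fall back to the environment otherwise. A minimal sketch of that pattern in isolation is shown below; `ApiKeyResolver` is a hypothetical helper, not part of eLLMental. It uses an injected lookup function (for example `System::getenv`) instead of Dotenv, and it fails as soon as the key cannot be resolved, whereas the class in this commit defers that check to `getService()`.

```java
import java.util.function.Function;

// Hypothetical helper (not part of eLLMental): resolves a credential from an
// explicit value first, then from an environment lookup.
class ApiKeyResolver {
    private final Function<String, String> environment;

    // The lookup is injected so the fallback can be exercised without touching
    // the real environment; production code could pass System::getenv.
    ApiKeyResolver(Function<String, String> environment) {
        this.environment = environment;
    }

    String resolve(String explicitKey, String variableName) {
        // An explicit key always wins over the environment.
        if (explicitKey != null) {
            return explicitKey;
        }
        String fromEnvironment = environment.apply(variableName);
        if (fromEnvironment == null) {
            throw new IllegalStateException(
                "Environment variable " + variableName + " is required.");
        }
        return fromEnvironment;
    }

    public static void main(String[] args) {
        ApiKeyResolver resolver = new ApiKeyResolver(System::getenv);
        System.out.println(resolver.resolve("explicit-key", "OPEN_AI_API_KEY"));
    }
}
```

Keeping the resolved key in an instance field rather than a static one would also let two models with different credentials coexist in the same process.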
@@ -15,8 +15,8 @@ public void testGenerateEmbedding(){
Embedding embedding = openAI.generateEmbedding("The Agile Monkeys rule!");
TestValues testValues = new TestValues();

assertEquals(embedding.vector.size(), testValues.testGenerateEmbeddingExpectedValue.size());
assertArrayEquals(embedding.vector.toArray(), testValues.testGenerateEmbeddingExpectedValue.toArray());
assertEquals(testValues.testGenerateEmbeddingExpectedValue.size(), embedding.vector().size());
assertArrayEquals(testValues.testGenerateEmbeddingExpectedValue.toArray(), embedding.vector().toArray());
}
}
