Add split by token.
lgrammel committed Apr 28, 2023
1 parent 48a3b79 commit cb9d151
Showing 18 changed files with 156 additions and 57 deletions.
10 changes: 6 additions & 4 deletions README.md
@@ -44,7 +44,7 @@ Features used: function composition (no agent), pdf loading, split-extract-rewri

Splits a text into chunks and generates embeddings.

- Features used: direct function calls (no agent), split text, generate embeddings
+ Features used: direct function calls (no agent), split text (gpt3-tokenizer), generate embeddings

## Features

@@ -80,7 +80,8 @@ Features used: direct function calls (no agent), split text, generate embeddings
- Utility functions to combine and convert prompts
- Text functions
- Extract information (extract & rewrite; extract recursively)
- - Split text into chunks
+ - Splitters: split text into chunks
+   - By character, by token (GPT3-tokenizer)
- Helpers: load, generate
- Data sources
- Webpage as HTML text
@@ -150,8 +151,9 @@ export async function runWikipediaAgent({
},
execute: $.tool.executeExtractInformationFromWebpage({
extract: $.text.extractRecursively.asExtractFunction({
- split: $.text.splitRecursivelyAtCharacter.asSplitFunction({
- maxCharactersPerChunk: 2048 * 4, // needs to fit into a gpt-3.5-turbo prompt
+ split: $.text.splitRecursivelyAtToken.asSplitFunction({
+ tokenizer: $.provider.openai.gptTokenizer(),
+ maxChunkSize: 2048, // needs to fit into a gpt-3.5-turbo prompt
}),
extract: $.text.generateText.asFunction({
prompt: $.prompt.extractChatPrompt(),
7 changes: 4 additions & 3 deletions docs/concepts/index.md
@@ -13,9 +13,10 @@ You can use all almost all helper functions in JS Agent directly. This includes
Here is an example of splitting a text into chunks and using the OpenAI embedding API directly to get the embedding of each chunk ([full example](https://github.com/lgrammel/js-agent/tree/main/examples/split-and-embed-text)):

```typescript
- const chunks = $.text.splitRecursivelyAtCharacter({
+ const chunks = await $.text.splitRecursivelyAtToken({
text,
- maxCharactersPerChunk: 1024 * 4,
+ tokenizer: $.provider.openai.gptTokenizer(),
+ maxChunkSize: 128,
});

const embeddings = [];
@@ -44,7 +45,7 @@ Here is the example that creates a Twitter thread on a topic using the content o
```typescript
const rewriteAsTwitterThread = $.text.splitExtractRewrite.asExtractFunction({
split: $.text.splitRecursivelyAtCharacter.asSplitFunction({
- maxCharactersPerChunk: 1024 * 4,
+ maxChunkSize: 1024 * 4,
}),
extract: $.text.generateText.asFunction({
model: gpt4,
5 changes: 3 additions & 2 deletions docs/docs/tutorial-wikipedia-agent/complete-agent.md
@@ -60,8 +60,9 @@ async function runWikipediaAgent({
},
execute: $.tool.executeExtractInformationFromWebpage({
extract: $.text.extractRecursively.asExtractFunction({
- split: $.text.splitRecursivelyAtCharacter.asSplitFunction({
- maxCharactersPerChunk: 2048 * 4, // needs to fit into a gpt-3.5-turbo prompt
+ split: $.text.splitRecursivelyAtToken.asSplitFunction({
+ tokenizer: $.provider.openai.gptTokenizer(),
+ maxChunkSize: 2048, // needs to fit into a gpt-3.5-turbo prompt
}),
extract: $.text.generateText.asFunction({
prompt: $.prompt.extractChatPrompt(),
@@ -18,8 +18,9 @@ const readWikipediaArticleAction = $.tool.extractInformationFromWebpage({
},
execute: $.tool.executeExtractInformationFromWebpage({
extract: $.text.extractRecursively.asExtractFunction({
- split: $.text.splitRecursivelyAtCharacter.asSplitFunction({
- maxCharactersPerChunk: 2048 * 4, // needs to fit into a gpt-3.5-turbo prompt
+ split: $.text.splitRecursivelyAtToken.asSplitFunction({
+ tokenizer: $.provider.openai.gptTokenizer(),
+ maxChunkSize: 2048, // needs to fit into a gpt-3.5-turbo prompt
}),
extract: $.text.generateText.asFunction({
prompt: $.prompt.extractChatPrompt(),
@@ -18,16 +18,18 @@ export async function createTwitterThreadFromPdf({

const rewriteAsTwitterThread = $.text.splitExtractRewrite.asExtractFunction({
split: $.text.splitRecursivelyAtCharacter.asSplitFunction({
- maxCharactersPerChunk: 1024 * 4,
+ maxChunkSize: 1024 * 4,
}),
extract: $.text.generateText.asFunction({
+ id: "extract",
model: gpt4,
prompt: $.prompt.extractAndExcludeChatPrompt({
excludeKeyword: "IRRELEVANT",
}),
}),
include: (text) => text !== "IRRELEVANT",
rewrite: $.text.generateText.asFunction({
+ id: "rewrite",
model: gpt4,
prompt: async ({ text, topic }) => [
{
2 changes: 1 addition & 1 deletion examples/pdf-to-twitter-thread/src/main.ts
@@ -26,7 +26,7 @@ createTwitterThreadFromPdf({
openAiApiKey,
context: {
recordCall: (call) => {
- console.log(`...${call.metadata.id ?? "unknown"}...`);
+ console.log(`${call.metadata.id ?? "unknown"}...`);
},
},
})
4 changes: 2 additions & 2 deletions examples/split-and-embed-text/src/main.ts
@@ -1,6 +1,6 @@
import { Command } from "commander";
import dotenv from "dotenv";
- import { splitAndEmbed } from "./splitAndEmbed";
+ import { splitAndEmbedText } from "./splitAndEmbedText";

dotenv.config();

@@ -19,7 +19,7 @@ if (!openAiApiKey) {
throw new Error("OPENAI_API_KEY is not set");
}

- splitAndEmbed({
+ splitAndEmbedText({
textFilePath: file,
openAiApiKey,
})
@@ -1,7 +1,7 @@
import * as $ from "js-agent";
import fs from "node:fs/promises";

- export async function splitAndEmbed({
+ export async function splitAndEmbedText({
textFilePath,
openAiApiKey,
}: {
@@ -10,9 +10,10 @@ export async function splitAndEmbed({
}) {
const text = await fs.readFile(textFilePath, "utf8");

- const chunks = $.text.splitRecursivelyAtCharacter({
+ const chunks = await $.text.splitRecursivelyAtToken({
text,
- maxCharactersPerChunk: 1024 * 4,
+ tokenizer: $.provider.openai.gptTokenizer(),
+ maxChunkSize: 128,
});

const embeddings = [];
5 changes: 3 additions & 2 deletions examples/wikipedia/src/runWikipediaAgent.ts
@@ -36,8 +36,9 @@ export async function runWikipediaAgent({
},
execute: $.tool.executeExtractInformationFromWebpage({
extract: $.text.extractRecursively.asExtractFunction({
- split: $.text.splitRecursivelyAtCharacter.asSplitFunction({
- maxCharactersPerChunk: 2048 * 4, // needs to fit into a gpt-3.5-turbo prompt
+ split: $.text.splitRecursivelyAtToken.asSplitFunction({
+ tokenizer: $.provider.openai.gptTokenizer(),
+ maxChunkSize: 2048, // needs to fit into a gpt-3.5-turbo prompt
}),
extract: $.text.generateText.asFunction({
prompt: $.prompt.extractChatPrompt(),
10 changes: 6 additions & 4 deletions packages/agent/README.md
@@ -42,7 +42,7 @@ Features used: function composition (no agent), pdf loading, split-extract-rewri

Splits a text into chunks and generates embeddings.

- Features used: direct function calls (no agent), split text, generate embeddings
+ Features used: direct function calls (no agent), split text (gpt3-tokenizer), generate embeddings

## Features

@@ -78,7 +78,8 @@ Features used: direct function calls (no agent), split text, generate embeddings
- Utility functions to combine and convert prompts
- Text functions
- Extract information (extract & rewrite; extract recursively)
- - Split text into chunks
+ - Splitters: split text into chunks
+   - By character, by token (GPT3-tokenizer)
- Helpers: load, generate
- Data sources
- Webpage as HTML text
@@ -148,8 +149,9 @@ export async function runWikipediaAgent({
},
execute: $.tool.executeExtractInformationFromWebpage({
extract: $.text.extractRecursively.asExtractFunction({
- split: $.text.splitRecursivelyAtCharacter.asSplitFunction({
- maxCharactersPerChunk: 2048 * 4, // needs to fit into a gpt-3.5-turbo prompt
+ split: $.text.splitRecursivelyAtToken.asSplitFunction({
+ tokenizer: $.provider.openai.gptTokenizer(),
+ maxChunkSize: 2048, // needs to fit into a gpt-3.5-turbo prompt
}),
extract: $.text.generateText.asFunction({
prompt: $.prompt.extractChatPrompt(),
1 change: 1 addition & 0 deletions packages/agent/package.json
@@ -42,6 +42,7 @@
"fastify": "4.14.1",
"fastify-type-provider-zod": "1.1.9",
"html-to-text": "9.0.5",
+ "gpt3-tokenizer": "1.1.5",
"hyperid": "3.1.1",
"pdfjs-dist": "3.5.141",
"pino": "8.11.0",
21 changes: 21 additions & 0 deletions packages/agent/src/provider/openai/GPTTokenizer.ts
@@ -0,0 +1,21 @@
import GPT3Tokenizer from "gpt3-tokenizer";
import { Tokenizer } from "../../tokenizer/Tokenizer";

export const gptTokenizer = ({
type = "gpt3",
}: {
type?: "gpt3" | "codex";
} = {}): Tokenizer => {
const gptTokenizer = new GPT3Tokenizer({ type });

return Object.freeze({
encode: async (text: string) => {
const encodeResult = gptTokenizer.encode(text);
return {
tokens: encodeResult.bpe,
texts: encodeResult.text,
};
},
decode: async (tokens: Array<number>) => gptTokenizer.decode(tokens),
});
};
1 change: 1 addition & 0 deletions packages/agent/src/provider/openai/index.ts
@@ -1,3 +1,4 @@
+ export * from "./GPTTokenizer.js";
export * from "./OpenAIChatCompletion.js";
export * from "./OpenAIEmbedding.js";
export * from "./OpenAITextCompletion.js";
2 changes: 1 addition & 1 deletion packages/agent/src/text/split/index.ts
@@ -1,2 +1,2 @@
- export * from "./splitRecursivelyAtCharacter";
+ export * from "./splitRecursively";
export * from "./SplitFunction";
75 changes: 75 additions & 0 deletions packages/agent/src/text/split/splitRecursively.ts
@@ -0,0 +1,75 @@
import { Tokenizer } from "../../tokenizer/Tokenizer";
import { SplitFunction } from "./SplitFunction";

function splitRecursivelyImplementation({
maxChunkSize,
segments,
}: {
maxChunkSize: number;
segments: string | Array<string>;
}): Array<string> {
if (segments.length < maxChunkSize) {
return Array.isArray(segments) ? [segments.join("")] : [segments];
}

const half = Math.ceil(segments.length / 2);
const left = segments.slice(0, half);
const right = segments.slice(half);

return [
...splitRecursivelyImplementation({
segments: left,
maxChunkSize,
}),
...splitRecursivelyImplementation({
segments: right,
maxChunkSize,
}),
];
}

export const splitRecursivelyAtCharacter = async ({
maxChunkSize,
text,
}: {
maxChunkSize: number;
text: string;
}) =>
splitRecursivelyImplementation({
maxChunkSize,
segments: text,
});

splitRecursivelyAtCharacter.asSplitFunction =
({ maxChunkSize }: { maxChunkSize: number }): SplitFunction =>
async ({ text }: { text: string }) =>
splitRecursivelyAtCharacter({ maxChunkSize, text });

export const splitRecursivelyAtToken = async ({
tokenizer,
maxChunkSize,
text,
}: {
tokenizer: Tokenizer;
maxChunkSize: number;
text: string;
}) =>
splitRecursivelyImplementation({
maxChunkSize,
segments: (await tokenizer.encode(text)).texts,
});

splitRecursivelyAtToken.asSplitFunction =
({
tokenizer,
maxChunkSize,
}: {
tokenizer: Tokenizer;
maxChunkSize: number;
}): SplitFunction =>
async ({ text }: { text: string }) =>
splitRecursivelyAtToken({
tokenizer,
maxChunkSize,
text,
});
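The halving strategy in the new `splitRecursively.ts` is easy to see in isolation. Below is a minimal, self-contained sketch of the same recursion (the `splitHalving` name is hypothetical and not part of this commit): when the segment collection is shorter than `maxChunkSize` it becomes a single chunk; otherwise it is halved and each half is split recursively.

```typescript
// Minimal sketch of the recursive-halving split (hypothetical name; the
// commit's splitRecursivelyImplementation follows the same shape).
function splitHalving(
  segments: string | string[],
  maxChunkSize: number
): string[] {
  // Small enough: emit the whole segment run as one chunk.
  if (segments.length < maxChunkSize) {
    return Array.isArray(segments) ? [segments.join("")] : [segments];
  }
  // Too large: halve and recurse on each half.
  const half = Math.ceil(segments.length / 2);
  return [
    ...splitHalving(segments.slice(0, half), maxChunkSize),
    ...splitHalving(segments.slice(half), maxChunkSize),
  ];
}

// "abcdefghij" (10 chars) with maxChunkSize 4 halves twice:
// → ["abc", "de", "fgh", "ij"]
const chunks = splitHalving("abcdefghij", 4);
```

The character variant counts characters while the token variant counts tokenizer segments; both guarantee every chunk stays under `maxChunkSize`, though chunks may come out smaller than a greedy packer would produce.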
31 changes: 0 additions & 31 deletions packages/agent/src/text/split/splitRecursivelyAtCharacter.ts

This file was deleted.

7 changes: 7 additions & 0 deletions packages/agent/src/tokenizer/Tokenizer.ts
@@ -0,0 +1,7 @@
export type Tokenizer = {
encode: (text: string) => PromiseLike<{
tokens: Array<number>;
texts: Array<string>;
}>;
decode: (tokens: Array<number>) => PromiseLike<string>;
};
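Any object with async `encode`/`decode` satisfies this contract, not only the gpt3-tokenizer wrapper added in this commit. As an illustration only (the `makeWordTokenizer` below is hypothetical and not in the commit), here is a toy word-level tokenizer that round-trips text through the interface:

```typescript
type Tokenizer = {
  encode: (text: string) => PromiseLike<{
    tokens: Array<number>;
    texts: Array<string>;
  }>;
  decode: (tokens: Array<number>) => PromiseLike<string>;
};

// Toy word-level tokenizer (illustration only): each distinct word or
// whitespace run gets a numeric id; decode maps ids back and rejoins.
const makeWordTokenizer = (): Tokenizer => {
  const idToText: string[] = [];
  const textToId = new Map<string, number>();
  return Object.freeze({
    encode: async (text: string) => {
      // Split on whitespace but keep the separators so decoding is lossless.
      const texts = text.split(/(\s+)/).filter((s) => s.length > 0);
      const tokens = texts.map((t) => {
        if (!textToId.has(t)) textToId.set(t, idToText.push(t) - 1);
        return textToId.get(t)!;
      });
      return { tokens, texts };
    },
    decode: async (tokens: Array<number>) =>
      tokens.map((id) => idToText[id]).join(""),
  });
};
```

Because `texts` concatenates back to the original input, a token-based splitter can join token texts into chunk strings without loss; the gpt3-tokenizer wrapper relies on the same property of its BPE segments.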
16 changes: 15 additions & 1 deletion pnpm-lock.yaml

