Add wiki 500k benchmark results
shreyashnigam committed Jan 7, 2025
1 parent 71259a2 commit 448d408
Showing 1 changed file with 36 additions and 4 deletions.
40 changes: 36 additions & 4 deletions benchmarks/README.md
@@ -26,25 +26,57 @@ Ever wondered how much CHONKier other text splitting libraries are? Well, wonder

> ZOOOOOM! Watch Chonkie run! 🏃‍♂️💨
### Wikipedia 500K Articles
The following benchmarks were run on the first 500K articles from the Hugging Face `wikimedia/wikipedia` dataset.

All tests were run on a `c3-highmem-4` VM from Google Cloud with 32 GB RAM and a 200 GB SSD Persistent Disk attachment.

#### Token Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 2 min 17 sec | 1x (I'm fast af boi) |
| 🔗 LangChain | 2 min 42 sec | 1.18x slower |
| 📚 LlamaIndex | 50 min | 21.9x slower |

#### Sentence Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 7 min 16 sec | 1x (solo CHONK) |
| 📚 LlamaIndex | 10 min 55 sec | 1.5x slower |
| 🔗 LangChain | N/A | Doesn't exist |

#### Recursive Chunking

| Library | Time | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 3 min 42 sec | 1x (🔃🔃) |
| 🔗 LangChain | 7 min 36 sec | 2.05x slower |
| 📚 LlamaIndex | N/A | Doesn't exist |

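The Speed Factor values in the Wikipedia tables above appear to be each library's wall-clock time divided by Chonkie's time for the same task. Here is a minimal sketch that reproduces them from the Time column. The helper names `to_seconds` and `speed_factor` are illustrative assumptions, not part of any Chonkie benchmark script.

```python
import re


def to_seconds(duration: str) -> int:
    """Parse strings such as '2 min 17 sec' or '50 min' into seconds."""
    match = re.fullmatch(r"(?:(\d+) min)?\s*(?:(\d+) sec)?", duration.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"Unrecognised duration: {duration!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


def speed_factor(baseline: str, other: str) -> float:
    """Return how many times slower `other` is than `baseline`."""
    return to_seconds(other) / to_seconds(baseline)


# Token chunking: Chonkie (baseline) vs. LangChain and LlamaIndex
print(round(speed_factor("2 min 17 sec", "2 min 42 sec"), 2))
print(round(speed_factor("2 min 17 sec", "50 min"), 1))
```

The same division applies to the millisecond timings in the Paul Graham Essays tables, where no parsing step is needed.
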
### Paul Graham Essays Dataset

The following benchmarks were run on the Paul Graham Essays Dataset using the GPT-2 tokenizer.
Because Chonkie believes in transparency, we note that timings marked with ** were taken after a warm-up phase.

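For readers unfamiliar with warm-up timing, here is a rough sketch of how such a measurement could be taken. It is not the script used to produce these numbers: `time_ms` and `whitespace_chunker` are hypothetical stand-ins, and the warm-up and repeat counts are arbitrary assumptions.

```python
import time
from statistics import mean


def time_ms(fn, text, warmup=3, repeats=10):
    """Average runtime of fn(text) in milliseconds, after untimed warm-up calls."""
    for _ in range(warmup):
        fn(text)  # warm-up calls are run but not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def whitespace_chunker(text, chunk_size=512):
    """Hypothetical stand-in for a real chunker: groups whitespace-split words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


sample_text = "the quick brown fox " * 10_000
print(f"{time_ms(whitespace_chunker, sample_text):.2f} ms")
```

Discarding the first few calls keeps one-off costs, such as lazy imports or cache population, out of the reported average.
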
#### Token Chunking

| Library | Time (ms) | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 8.18** | 1x (fastest CHONK) |
| 🔗 LangChain | 8.68 | 1.06x slower |
| 📚 LlamaIndex | 272 | 33.25x slower |

#### Sentence Chunking

| Library | Time (ms) | Speed Factor |
|---------|-----------|--------------|
| 🦛 Chonkie | 52.6 | 1x (solo CHONK) |
| 📚 LlamaIndex | 91.2 | 1.73x slower |
| 🔗 LangChain | N/A | Doesn't exist |

#### Semantic Chunking

| Library | Time | Speed Factor |
|---------|------|--------------|