Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: optimize string comparisons to reduce llm calls #216

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion docetl/operations/link_resolve.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,11 @@
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Dict, List, Tuple

import jinja2
from jinja2 import Template
from rich.prompt import Confirm
import re
from slugify import slugify

from docetl.operations.base import BaseOperation
from docetl.operations.utils import RichLoopBar, rich_as_completed
Expand Down Expand Up @@ -139,6 +140,12 @@ def execute(self, input_data: List[Dict]) -> Tuple[List[Dict], float]:
return input_data, total_cost

def compare(self, link_idx, id_idx, link_value, id_value, item):
# Try basic string matching first
if isinstance(link_value, str) and isinstance(id_value, str):
if link_value.lower() == id_value.lower() or slugify(link_value) == slugify(id_value):
self.replacements[link_value] = id_value
return 0.0

prompt = self.prompt_template.render(
link_value = link_value,
id_value = id_value,
Expand Down
29 changes: 28 additions & 1 deletion docetl/operations/resolve.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Dict, List, Tuple, Optional
import re
from slugify import slugify

import jinja2
from jinja2 import Template
Expand All @@ -23,6 +25,14 @@ def find_cluster(item, cluster_map):
item = cluster_map[item]
return item

def are_strings_similar(str1: Any, str2: Any) -> bool:
"""Check if two strings are similar using basic normalization."""
if str1 is None or str2 is None:
return False
if str1.lower() == str2.lower() or slugify(str1) == slugify(str2):
return True
return False


class ResolveOperation(BaseOperation):
class schema(BaseOperation.schema):
Expand Down Expand Up @@ -54,7 +64,7 @@ def compare_pair(
max_retries_per_timeout: int = 2,
) -> Tuple[bool, float]:
"""
Compares two items using an LLM model to determine if they match.
Compares two items using basic string matching first, falling back to LLM for complex cases.

Args:
comparison_prompt (str): The prompt template for comparison.
Expand All @@ -65,6 +75,7 @@ def compare_pair(
Returns:
Tuple[bool, float]: A tuple containing a boolean indicating whether the items match and the cost of the comparison.
"""
# Check blocking keys first (case-insensitive exact match)
if blocking_keys:
if all(
key in item1
Expand All @@ -74,6 +85,22 @@ def compare_pair(
):
return True, 0

# For each key that exists in both items, try basic string matching
common_keys = set(item1.keys()) & set(item2.keys())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One worry I have here is that it won't be performant...LLM calls are not performant but at least they can be multithreaded. Do you have an idea about the performance characteristics of slugify?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already check for case-insensitive exact matches in lines 80-86 (in the edited file), so I wonder what the marginal benefit of using slugify is

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tagging @redhog to give feedback on this, since they raised the issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, I'll see if I can profile on some of the test cases with it on and off, but most of what I can read online says that the slugify library is pretty efficient.

if common_keys:
exact_matches = 0
for key in common_keys:
if are_strings_similar(item1[key], item2[key]):
exact_matches += 1

# If all common fields match exactly, return True
if exact_matches == len(common_keys):
return True, 0
# If no fields match at all, likely not a match
if exact_matches == 0 and len(common_keys) > 1:
return False, 0

# For complex cases, fall back to LLM
prompt_template = Template(comparison_prompt)
prompt = prompt_template.render(input1=item1, input2=item2)
response = self.runner.api.call_llm(
Expand Down
32 changes: 30 additions & 2 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ rapidfuzz = "^3.10.0"
fastapi = { version = "^0.115.0", optional = true }
uvicorn = { version = "^0.31.0", optional = true }
websockets = "^13.1"
python-slugify = "^8.0.4"

[tool.poetry.extras]
parsing = ["python-docx", "openpyxl", "pydub", "python-pptx", "azure-ai-documentintelligence", "paddlepaddle", "paddleocr", "pymupdf"]
Expand Down