
Fix: bugs when using opensource models #609

Closed
wants to merge 4 commits

Conversation

PaulSZH95

Description

Bug fixes for using open-source models.

Related Issues

#575 #528

Proposed Changes

1 - The clean_up_json function now parses from the first instance of '{' in the LLM output. This accommodates open-source models, which tend to be more verbose (see the first sketch below).

2 - The embed function now decodes the encoded token chunks back to text before embedding. This allows open-source models that use a different tokenizer to still work (see the second sketch below).
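
A minimal sketch of the first change, assuming the function receives the raw LLM output as a string; the fallback when no '{' is present is a hypothetical hardening, not part of this PR:

    def clean_up_json(json_str: str) -> str:
        """Clean up a JSON string emitted by an LLM.

        Verbose open-source models often wrap the JSON payload in
        explanatory text, so slice from the first '{' onward.
        """
        start = json_str.find("{")
        if start == -1:
            # No object found; return unchanged (hypothetical fallback).
            return json_str
        return json_str[start:]

And a sketch of the second change, assuming a tiktoken-style tokenizer and a hypothetical embed_text client call. The point is that token chunks are decoded back to plain text before being sent to the embedding model, so a model with a different tokenizer still receives strings it can handle:

    import tiktoken

    def embed_chunks(text: str, chunk_size: int = 512) -> list[list[float]]:
        """Token-chunk the input, then decode each chunk back to text
        before embedding, so the embedding model applies its own tokenizer."""
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
        # Decode each token chunk back into a string instead of passing
        # raw token IDs to the embedding endpoint.
        texts = [enc.decode(chunk) for chunk in chunks]
        return [embed_text(t) for t in texts]  # embed_text: hypothetical client call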

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

I have not tested with OpenAI's models, only with llm: Groq and embedding: LM Studio.

@PaulSZH95 requested a review from a team as a code owner July 18, 2024 09:05
@PaulSZH95
Author

@microsoft-github-policy-service agree

@PaulSZH95 requested a review from a team as a code owner July 25, 2024 01:32
@@ -6,6 +6,7 @@

 def clean_up_json(json_str: str):
     """Clean up json string."""
+    json_str = json_str[json_str.index('{'):]
This will raise an error in a global query:

json_str = json_str[json_str.index('{'):]
ValueError: substring not found
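
For reference, a guarded variant (an editor's sketch, not the code in this PR) that avoids the ValueError by falling back to the original string when no '{' is found:

    def clean_up_json(json_str: str) -> str:
        """Clean up json string."""
        brace = json_str.find("{")  # find() returns -1 instead of raising
        return json_str[brace:] if brace != -1 else json_str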

@PaulSZH95 (Author) Aug 1, 2024
With respect to your error: no matter how well you write your JSON parser, you will still encounter errors from time to time.

Reason: the model isn't always able to output JSON in the format you require.

So far the only solution is to retry when errors occur, since errors will still happen occasionally even with a fine-tuned GPT-4 model. This is speaking from experience, but I have not seen any model score 100% on HumanEval benchmarks. P.S. LangChain's approach is also retrying; you probably notice it less because the retries are hidden away unless you opt for verbosity.

A better fix would probably be to retry when faced with a parsing error, rather than changing the parsing logic (see the sketch below).
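
A minimal sketch of that retry approach; call_llm and the prompt are hypothetical stand-ins for the actual LLM client:

    import json

    def query_with_retries(prompt: str, max_retries: int = 3) -> dict:
        """Re-ask the model when its output fails to parse as JSON,
        rather than making the parser tolerate every output format."""
        last_error = None
        for _ in range(max_retries):
            raw = call_llm(prompt)  # hypothetical LLM call
            try:
                # json.JSONDecodeError is a subclass of ValueError.
                return json.loads(clean_up_json(raw))
            except ValueError as err:
                last_error = err  # malformed output; try again
        raise RuntimeError(f"Model never produced valid JSON: {last_error}")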

@natoverse (Collaborator)

We have resolved several issues related to text encoding and JSON parsing that are rolled up into version 0.2.2. Please try again with that version and re-open if this is still an issue.

(This may not resolve embedding formats, but our expectation is that any proxy will translate them to maintain compatibility with the default GraphRAG LLM calls.)

@natoverse closed this Aug 9, 2024