Fix: bugs when using opensource models #609
Conversation
@microsoft-github-policy-service agree
@@ -6,6 +6,7 @@
 def clean_up_json(json_str: str):
     """Clean up json string."""
+    json_str = json_str[json_str.index('{'):]
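The change above can be sketched as a standalone helper (a minimal sketch for illustration; the repository's actual `clean_up_json` performs additional cleanup beyond this slice):

```python
import json


def clean_up_json(json_str: str) -> str:
    """Strip any verbose preamble before the first '{' in an LLM response."""
    # Verbose open-source models often prefix JSON with prose such as
    # "Here is the JSON you asked for: {...}". Slicing from the first '{'
    # discards that preamble so json.loads can succeed.
    # Note: str.index raises ValueError when no '{' is present at all.
    return json_str[json_str.index('{'):]


cleaned = clean_up_json('Sure! Here is the result: {"answer": 42}')
print(json.loads(cleaned)["answer"])
# → 42
```

Note the failure mode called out in the review below: if the model's output contains no `{` at all, `str.index` raises `ValueError: substring not found`.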
This will raise an error in the global query:

    json_str = json_str[json_str.index('{'):]
    ValueError: substring not found
With respect to your error: no matter how well you write your JSON parser, you will still encounter errors from time to time.
Reason: the model isn't always able to output JSON in the format you require of it.
So far the only solution is reiteration when errors occur, as errors will still happen occasionally even if you fine-tune your GPT-4 model. This is speaking from experience, but so far I have not seen any model score 100% on HumanEval-style benchmarks. P.S. LangChain's approach is also reiteration; you probably notice it less because the retries are hidden away unless you opt into verbosity.
A good fix would probably be reiteration when faced with a parsing error, rather than a fix to the parsing logic.
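The reiteration approach suggested above can be sketched as a retry loop around the parse (a minimal sketch; `call_llm` is a hypothetical stand-in for whatever client the caller uses, hard-coded here so the example is self-contained):

```python
import json


def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM client call.
    return '{"answer": "ok"}'


def query_with_retries(prompt: str, max_retries: int = 3) -> dict:
    """Re-ask the model until its output parses as JSON, up to max_retries."""
    last_error = None
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            # Strip any preamble before the first '{' and parse.
            # str.index raises ValueError if no '{' is present;
            # json.loads raises JSONDecodeError (a ValueError subclass)
            # if the remainder is malformed. Both trigger a retry.
            return json.loads(raw[raw.index('{'):])
        except ValueError as err:
            last_error = err
    raise RuntimeError(
        f"No valid JSON after {max_retries} attempts"
    ) from last_error


print(query_with_retries("any prompt")["answer"])
# → ok
```

With a real model the retry prompt would typically also feed back the parse error, but the loop shape is the same.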
We have resolved several issues related to text encoding and JSON parsing that are rolled up into version 0.2.2. Please try again with that version and re-open this issue if it persists. (This may not resolve embedding formats, but our expectation is that any proxy will translate to maintain compatibility with the default GraphRAG LLM calls.)
Description
Bug fixes
Related Issues
#575 #528
Proposed Changes
1 - The `clean_up_json` function now parses from the first instance of `'{'` in LLM outputs. This accommodates open-source models, which are more verbose.
2 - The `embed` function now decodes encoded chunked tokens back to text. This allows open-source models with a different tokenizer to still work.
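The second change can be sketched like this (a minimal sketch; `WordTokenizer` and `decode_chunks` are hypothetical stand-ins for illustration, not the repository's actual interfaces, which use tiktoken by default):

```python
class WordTokenizer:
    """Hypothetical tokenizer standing in for the configured one."""

    def encode(self, text: str) -> list:
        return text.split()

    def decode(self, tokens: list) -> str:
        return " ".join(tokens)


def decode_chunks(chunks: list, tokenizer) -> list:
    """Decode token-id chunks back to plain text before embedding.

    The pipeline chunks documents with its configured tokenizer. An
    open-source embedding backend uses a different tokenizer and cannot
    consume those raw token chunks, so each chunk is decoded back to
    text first; plain-text chunks pass through unchanged.
    """
    return [
        chunk if isinstance(chunk, str) else tokenizer.decode(chunk)
        for chunk in chunks
    ]


tok = WordTokenizer()
chunks = [tok.encode("graph rag chunk one"), "already plain text"]
print(decode_chunks(chunks, tok))
# → ['graph rag chunk one', 'already plain text']
```

The decoded strings can then be sent to any embedding endpoint, regardless of which tokenizer it uses internally.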
Checklist
Additional Notes
I have not tested with OpenAI's models, only with llm: Groq and embeddings: LM Studio.