Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor parse_script function for readability and maintainability #90

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 32 additions & 20 deletions src/rawdog/parsing.py
Original file line number Diff line number Diff line change
@@ -1,29 +1,41 @@
import ast
import json
import re


def parse_script(response: str) -> tuple[str, str]:
"""Split the response into a message and a script.
"""Split the response into a message and a script, handling variations of 'python' prefix and JSON content.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove the added clause.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood regarding simplifying the handling of prefix variations. I've noted your experience in this context, which points to a python only prefix by the LLM typically. Yet, considering the possibility of accommodating different user inputs, this refactor aims to provide broader support and resilience against various user input formats. Balanced against the potential overhead, would you agree that maintaining the enhanced input handling via regex provides a more safeguarded approach, or would you still prefer to remove that complexity? Your insights will aid in resolving this conclusively.


Expected use is: run the script if there is one, otherwise print the message.
Args:
response (str): The input response containing a message and optionally a script.

Returns:
tuple[str, str]: A tuple containing the message (or error message) and the script, if valid.
"""
# Parse delimiter
n_delimiters = response.count("```")
if n_delimiters < 2:
return response, ""
segments = response.split("```")
message = f"{segments[0]}\n{segments[-1]}"
script = "```".join(segments[1:-1]).strip() # Leave 'inner' delimiters alone
try:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the script variable may have been transformed into a JSON object at line 33, this call to ast.parse(script) might raise an error since ast.parse expects a string as an input, not a Python object like a dictionary or list. We should ensure that ast.parse is only called if script is a string. Consider adding a check to verify the type of script before parsing it as Python code, and adjust the error handling accordingly.

# Extract message and script using split on triple backticks
parts = response.split('```')
if len(parts) < 3:
return response, "" # Not enough parts, return original message and empty script

# Clean and identify parts
message = parts[0] + parts[-1] # Consider the first and last parts as the message
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred joining with a newline.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback, Jake. Given that the primary way markdown is added by the LLM is with just python, the additional complexity to handle variations may indeed be unnecessary for most scenarios. I'll make the adjustment to accommodate the more common case and simplify the code as suggested.

script = '```'.join(parts[1:-1]).strip() # Join any inner parts as the script

# Remove 'python' or similar prefixes from the script
script = re.sub(r"^\s*python[0-9]*\s*", "", script, flags=re.IGNORECASE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I should have said something sooner about how it's fine to ignore the bot. In my experience markdown is always prepended with just python by the llm. I think the extra complexity to handle python3 is unnecessary.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate your insights, Jake. Based on your experience that markdown is typically accompanied with just the 'python' prefix by the LLM, I agree that it simplifies the implementation. Nevertheless, given that the current implementation now already includes regex to handle potential variations like 'python3', would you prefer to keep this aspect for potential extensibility and robustness? This can ideally prepare the function to handle less typical cases without immediate rework in the future. Please let me know your preference.


# Attempt to interpret script as JSON, revert if it fails
try:
parsed_script = json.loads(script)
script = json.dumps(parsed_script) # Convert back to string to validate as Python code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the original code runs json.loads(script) which should fail because script should be python. It doesn't even propagate the error if it occurs. Do you know @granawkins

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of json.loads(script) in the script parsing logic is indeed a bit confusing initially, but it serves a specific purpose. The idea is to handle cases where the script portion might contain data in JSON format instead of Python code. This can happen if the user intends to send JSON data as part of the response for some reason. By attempting to parse it as JSON, the function can handle more types of input flexibly. If the parsing fails, it indicates the script is actually Python code or some other format that isn't JSON, and the code simply continues to process it as such. It's a case of trying to gracefully handle various input formats that might be encountered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at one point I was getting errors with the script and suspected it had returned json for one reason or another, so I added this just in case. If it raised an exception that just meant it wasn't json after all, and we'd be fine.

I think we ought to take this out - I expect I was mistake about it being a json error. Anyway adding json.dumps here is a mistake.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, Gran. Since you suggest that the json.loads and the subsequent json.dumps might be superfluous and potentially error-prone for our typical use cases, I'll proceed with removing these parts. This simplification should avoid unnecessary confusion and make the function's intent clearer, focusing only on handling python code.

Suggested change
script = json.dumps(parsed_script) # Convert back to string to validate as Python code
# Validate as Python code
ast.parse(script)
return message, script

By removing the JSON parsing, the function aligns more directly with its primary purpose of handling Python code. Thanks for your insights!

except json.JSONDecodeError:
pass # Keep script as is if not JSON

# Check for common mistakes
if script.split("\n")[0].startswith("python"):
script = "\n".join(script.split("\n")[1:])
try: # Make sure it isn't json
script = json.loads(script)
except Exception:
pass
try: # Make sure it's valid python
# Validate as Python code
ast.parse(script)
except SyntaxError:
return f"Script contains invalid Python:\n{response}", ""
return message, script
return message, script
except SyntaxError as e:
return f"Error in Python syntax: {e}", ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather return the response than e. e will just be another variation of "synax error" but the prefix already says that and the user would presumably rather see the errored code to find the syntax error.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid point concerning returning the original response with the syntax error message to assist users in locating and correcting the error directly in their code. Implementing this could enhance the error handling experience by providing more context about the issue. Would integrating a portion of the problematic script into the error message be an acceptable solution? For instance, including the first few lines of the syntax-error-containing script might provide immediate clarity without overwhelming the user with the entire script. Here's an example suggestion:

Suggested change
return f"Error in Python syntax: {e}", ""
return f"Error in Python syntax: {e}. Problematic part: {script[:50]}...", ""

This modification would give users a quick look at where the error might be without needing to recheck their whole script, assuming it's not shown elsewhere in the interface.

except Exception as e:
return f"Unhandled error: {e}", ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other exceptions could we encounter? I can't think of any and if we do I'd rather propagate them and fail.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid point, Jake. Propagating other exceptions could indeed be beneficial for debugging purposes, especially in a production environment where error tracking and handling specifics help in quickly identifying issues. The current catch-all exception handling might obscure unexpected scenarios. Would it be better to specify the types of exceptions we expect or remove the generic exception block entirely to allow propagation?

Loading