
prompt_engineering_dump #5

Open · wants to merge 2 commits into main

Conversation

andre-scheinwald

.csv file is one of the iterations of the prompt engineering code I was attempting. I will probably have to double-check this one; some of it looked good, but I don't remember whether there was an issue or I just wanted to add more.

.py file contains the prompt engineering code. Line 242 and below is what I was working on most recently.

@rosewangrmi
Collaborator

@andre-scheinwald There are 3 versions of code in the dump. Which version works the best? Please clean up and keep only the version that produces the best results.


rosewangrmi left a comment


@andre-scheinwald Please clean up the code and keep the best version that works. The next steps I would take to improve performance are: 1) label the bad results and find under what circumstances they occur; 2) loop through individual documents and see if that improves performance. I doubt it will matter much, since I did not see cross-document hallucination, but it is just an idea. The end goal of this exercise is to create a prompt that generates the best possible results with minimal manual correction.

'gas_capture': True,
'gas_flare': True,
'gas_to_energy_project': True,
'coordinates': {'latitude': -22.82601389, 'longitude': -42.05100556},
Collaborator

I noticed from the results CSV that when the original coordinates are in a format like 22º49’37.07’’ S, the translated coordinates are not accurate. You may want to add more instruction to the prompt to address this. For example, you can ask it to take several steps to get to a location: 1) first extract the coordinates as written; 2) then translate the coordinates into decimal degrees.
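Alternatively, the conversion itself can be taken out of the model's hands: have the prompt extract the raw DMS string only, then convert it deterministically in Python. A minimal sketch (the regex is an assumption about the symbol variants that appear in the PDFs, not something verified against every file):

```python
import re

# Matches strings like "22º49’37.07’’ S"; symbol alternatives (° vs º,
# straight vs curly quotes) are a guess at the formats in the source PDFs.
DMS_RE = re.compile(
    r"""(\d+(?:\.\d+)?)\s*[º°]\s*      # degrees
        (\d+(?:\.\d+)?)\s*[’'′]\s*     # minutes
        (\d+(?:\.\d+)?)\s*[’'′″"]*\s*  # seconds
        ([NSEW])""",
    re.VERBOSE,
)

def dms_to_decimal(dms: str) -> float:
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    m = DMS_RE.search(dms)
    if not m:
        raise ValueError(f"unrecognized DMS string: {dms!r}")
    deg, minutes, seconds, hemisphere = m.groups()
    value = float(deg) + float(minutes) / 60 + float(seconds) / 3600
    # South and West hemispheres are negative by convention.
    return -value if hemisphere in "SW" else value
```

This removes arithmetic from the model entirely, so a wrong answer can only come from a bad extraction, which is much easier to spot in review.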

messages=[
{
"role": "user",
"content": """Please review all files and answer the following questions for every single file: What is the landfill name, location (region, city, and country),
Collaborator

Did you try looping through individual documents to see if that can produce a better result?
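The suggested per-document loop could look something like this. `ask_model` is a hypothetical callable wrapping whatever chat-completion client the script uses, and the prompt text is abbreviated for illustration, not the real one:

```python
# Abbreviated stand-in for the real extraction prompt in the .py file.
PROMPT_TEMPLATE = (
    "Please review the following document and answer: what is the landfill "
    "name, location (region, city, and country), and coordinates?\n\n{document}"
)

def extract_per_document(documents: dict, ask_model) -> list:
    """Call the model once per document instead of once over all files,
    so a bad answer can be traced back to a single source file."""
    results = []
    for filename, text in documents.items():
        answer = ask_model(PROMPT_TEMPLATE.format(document=text))
        results.append({"file": filename, "raw_answer": answer})
    return results
```

One call per file also keeps each request well under the context limit and makes the cost of a retry a single document rather than the whole batch.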


df3 = pd.concat([df, df2], ignore_index=True)

df3.to_csv(r'C:\Users\andre.scheinwald\OneDrive - RMI\Documents\Python Scripts\cdm_scraping\brazil_landfill_name_and_coords_extraction.csv', index=False)
Collaborator

Please spend some time reviewing the results and label the bad ones to find plausible causes. For example, if I think unit conversion could be the reason the model does not produce the coordinates in the format I want, I will help it break the task into steps or provide more examples to get the right results. Prompt engineering is an iterative trial-and-error process. Every bad result is an opportunity (a gold mine) to help us improve the prompt.
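A first-pass labeling rule can do much of this triage automatically before manual review. A sketch, where the Brazil bounding box is a rough approximation and the label names are assumptions, not part of the original pipeline:

```python
def label_coordinates(lat, lon):
    """Return a review label for one extracted coordinate pair.
    The bounding box for Brazil is approximate, for illustration only."""
    if lat is None or lon is None:
        return "missing"
    if not (-90 <= lat <= 90) or not (-180 <= lon <= 180):
        return "out_of_range"    # likely a failed DMS-to-decimal conversion
    if not (-34 <= lat <= 6) or not (-74 <= lon <= -34):
        return "outside_brazil"  # often a dropped hemisphere sign
    return "ok"
```

Counting the non-"ok" labels by cause then tells you which failure mode to attack in the prompt first.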

Manually went through the PDF documents and compared the prompt engineering results against the information in extraction.csv. Created a new file, extraction.xlsx, which records correct and incorrect data and highlights unverified data; corrections are stored as notes in this file. Added an additional column to flag files that can't be verified.

Then generated accents.csv, which is the final file to use for matching existing facilities in the db to what we have here. It contains the corrections and uses ANSI encoding to preserve accent marks.
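For reference, the encoding can be made explicit in the `to_csv` call so the accent handling is reproducible. A sketch with illustrative data (on Windows, "ANSI" usually means cp1252; `utf-8-sig` is an alternative that Excel also opens with accents intact):

```python
import os
import tempfile

import pandas as pd

# Illustrative data and file name, not the real accents.csv.
df = pd.DataFrame({"name": ["São Paulo", "Brasília"]})
path = os.path.join(tempfile.gettempdir(), "accents_demo.csv")

# Explicit encoding, so accented characters survive the round trip.
df.to_csv(path, index=False, encoding="cp1252")
back = pd.read_csv(path, encoding="cp1252")
```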