CSE5525_Structured_Data_Extraction

Evaluating and improving relatively small language models at extracting data from short text passages according to a schema that is provided at inference time. The repository contains data preparation and validation code, evaluation code, and code for refinement methods such as few-shot prompting and possibly self-consistency.

For a summary of this project, please see the project's final report.

To set the environment variables

On Windows:

set OPENAI_API_KEY=your_api_key_here
set ANTHROPIC_API_KEY=your_api_key_here
set GOOGLE_DEEPMIND_API_KEY=your_api_key_here
set DEEPINFRA_API_KEY=your_api_key_here

On macOS/Linux:

export OPENAI_API_KEY=your_api_key_here
export ANTHROPIC_API_KEY=your_api_key_here
export GOOGLE_DEEPMIND_API_KEY=your_api_key_here
export DEEPINFRA_API_KEY=your_api_key_here

In PyCharm, you can modify a script's run configuration to include the environment variables.

In VS Code, you can do the same through the env field of a launch configuration in launch.json.

If your Gemini (Google DeepMind) API key is on the free tier, also set the GOOGLE_DEEPMIND_API_KEY_IS_FREE_TIER environment variable to True. This slows down processing but avoids job failures.
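Before starting a long run, it can help to confirm that the variables above are visible to Python. The following is a minimal sketch, not code from this repository: the helper names `missing_api_keys` and `is_free_tier` are hypothetical, it assumes all four keys are needed for a full run, and it assumes the free-tier flag is enabled by the string True in any letter case.

```python
import os

# Variable names taken from the setup instructions above.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_DEEPMIND_API_KEY",
    "DEEPINFRA_API_KEY",
)


def missing_api_keys(env=None):
    """Return the names of keys that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


def is_free_tier(env=None):
    """Read the free-tier flag; assumes 'True' in any letter case means enabled."""
    env = os.environ if env is None else env
    flag = env.get("GOOGLE_DEEPMIND_API_KEY_IS_FREE_TIER", "")
    return flag.strip().lower() == "true"


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All API keys are set.")
    print("Gemini free-tier mode:", is_free_tier())
```

If you only plan to query a subset of the providers, trim `REQUIRED_KEYS` accordingly.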

Notes and assumptions

When comparing extractions, we ignore case-based mismatches.
We also ignore singular vs. plural discrepancies.
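
The two comparison rules above could be approximated as follows. This is an illustrative sketch only: the function names are hypothetical, and the singularization rule is a deliberately naive assumption (it strips a trailing "s" and maps "ies" to "y"), so the repository's actual evaluation code may handle plurals differently.

```python
def naive_singular(word):
    """Rough singularization (assumption): 'ies' -> 'y', else drop one trailing 's'."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 2 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(value):
    """Lowercase the value and singularize each whitespace-separated token."""
    return " ".join(naive_singular(token) for token in value.lower().split())


def values_match(expected, extracted):
    """Compare two extracted strings, ignoring case and simple plural differences."""
    return normalize(expected) == normalize(extracted)


if __name__ == "__main__":
    print(values_match("Apples", "apple"))
    print(values_match("Cities", "city"))
    print(values_match("cat", "dog"))
```

Irregular plurals and "-es" endings (for example "boxes" vs. "box") are not covered by this rule; a lemmatizer would be a more robust choice if those cases matter.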

