SYS-1732: SpaCy NLP - director name extraction proof of concept #4

Merged
3 commits merged into main from SYS-1732/spacy-experiments
Jan 24, 2025

Conversation

@ztucker4 ztucker4 commented Jan 23, 2025

Implements SYS-1732

Adds a new script, spacy_experiment.py, demonstrating the use of spaCy to extract names from pre-extracted 245 $c data. Both the "small" and "medium" versions of spaCy's model are used. The script assumes that the file f245c_directors.txt is present.

Since spaCy does not work with Python 3.13 (see discussion here), the Dockerfile has been updated to use a base Python image with Python 3.12. While the latest version of spaCy is 3.8.4, requirements.txt specifies version 3.7.5 due to incompatibilities with Mac M1 architecture in later versions (discussion here).

To test, rebuild the container (docker compose build) and run the script (docker compose run ftva_data python spacy_experiment.py). You should see terminal output describing the total number of entities. Two output files should be produced, output_core_web_sm.csv and output_core_web_md.csv. These are CSV files with three columns: the original 245 data, the names extracted by spaCy, and the non-name "entities" extracted by spaCy.
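For readers without the repository at hand, the three-column output described above could be produced along these lines. This is a minimal sketch, not the contents of spacy_experiment.py: the function names, the column headers, and the "; " separator are all assumptions made for illustration. The only spaCy calls used are `spacy.load`, `doc.ents`, `ent.text`, and `ent.label_`, and `PERSON` is the label the English models assign to people.

```python
import csv
import io


def split_entities(entities):
    """Split (text, label) pairs into person names and all other entities."""
    names = [text for text, label in entities if label == "PERSON"]
    others = [text for text, label in entities if label != "PERSON"]
    return names, others


def extract_entities(nlp, text):
    """Run a loaded spaCy pipeline over text and return (text, label) pairs."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def write_rows(handle, rows):
    """Write one CSV row per input line: original data, names, other entities.

    The header names below are hypothetical, not taken from the PR.
    """
    writer = csv.writer(handle)
    writer.writerow(["original_245c", "names", "other_entities"])
    for original, entities in rows:
        names, others = split_entities(entities)
        writer.writerow([original, "; ".join(names), "; ".join(others)])


def run_model(model_name, lines):
    """Load a spaCy model (e.g. "en_core_web_md") and tag each line."""
    import spacy  # third-party; imported here so the helpers above work without it

    nlp = spacy.load(model_name)
    return [(line, extract_entities(nlp, line)) for line in lines]
```

With spaCy and a model installed, something like `write_rows(open("output_core_web_md.csv", "w", newline=""), run_model("en_core_web_md", lines))` would tie the pieces together, where `lines` holds the 245 $c strings.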

@ztucker4 ztucker4 requested a review from akohler January 23, 2025 22:32
@akohler akohler left a comment

@ztucker4 Thanks for putting this together. Neither the Small nor Medium model is perfect (nor the Large, which I also loaded and tested), but overall Medium is best with this data:

Differences between Small and Medium: 27

Model: Small
Better than Medium: 8
Left word out of name: 7
Left name out completely: 11
Tokenized name wrong: 1

Model: Medium
Better than Small: 17
Left word out of name: 5
Left name out completely: 3
Tokenized name wrong: 2
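
A difference count like the one above can be approximated by comparing the two output files row by row. This is a rough sketch under stated assumptions, not the method used in the review: it assumes both CSV files list the same inputs in the same order and that the extracted names sit in the second column. The function name `count_differences` is hypothetical.

```python
import csv
import io


def count_differences(small_file, medium_file, column=1):
    """Count rows whose value in `column` differs between two CSV files.

    Assumes both files have the same number of rows in the same order.
    """
    small_rows = list(csv.reader(small_file))
    medium_rows = list(csv.reader(medium_file))
    if len(small_rows) != len(medium_rows):
        raise ValueError("files have different numbers of rows")
    return sum(
        1
        for small, medium in zip(small_rows, medium_rows)
        if small[column] != medium[column]
    )
```

This only counts differing rows. Judging which model did better on each row, as in the breakdown above, still needs a manual read.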

A few small changes before merging: switching from ast to json (since we know the source of the sample data), and a more Pythonic way to flatten the list of lists into a list of strings.
Thanks --Andy
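
The two requested changes might look roughly like the following. This is a sketch only: the resolved review threads are not shown here, so the function names are hypothetical, and it assumes each line of the sample data is a JSON array of strings. `json.loads` is stricter than `ast.literal_eval` (it requires double-quoted strings, for example), which is acceptable when the source of the data is known to emit valid JSON.

```python
import json
from itertools import chain


def parse_line(line):
    """Parse one line of sample data as JSON instead of a Python literal."""
    return json.loads(line)


def flatten(list_of_lists):
    """Flatten a list of lists of strings into a single list of strings."""
    return list(chain.from_iterable(list_of_lists))
```

A nested comprehension, `[item for sublist in list_of_lists for item in sublist]`, is an equivalent alternative to `chain.from_iterable`.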

@ztucker4 ztucker4 requested a review from akohler January 24, 2025 01:00
@akohler akohler merged commit b85eaa9 into main Jan 24, 2025
@akohler akohler deleted the SYS-1732/spacy-experiments branch January 24, 2025 01:10