SYS-1732: SpaCy NLP - director name extraction proof of concept #4
Implements SYS-1732
Adds a new script, spacy_experiment.py, demonstrating the use of spaCy to extract names from pre-extracted 245 $c data. The "small" and "medium" versions of spaCy's model are used. The script assumes the presence of the file f245c_directors.txt.

Since spaCy does not work with Python 3.13 (see discussion here), the Dockerfile has been updated to use a base Python image with Python 3.12. While the latest version of spaCy is 3.8.4, requirements.txt specifies version 3.7.5 due to incompatibilities with Mac M1 architecture in later versions (discussion here).

To test, rebuild the container (docker compose build) and run the script (docker compose run ftva_data python spacy_experiment.py). You should see terminal output describing the total number of entities. Two output files should be produced, output_core_web_sm.csv and output_core_web_md.csv. These are CSV files with three columns: the original 245 data, the names extracted by spaCy, and the non-name "entities" extracted by spaCy.
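
For reviewers unfamiliar with spaCy, here is a rough sketch of the kind of flow described above. It is not the contents of spacy_experiment.py. The helper names (split_entities, extract_to_csv, main), the CSV header names, the "; " separator, and the model names en_core_web_sm and en_core_web_md are all assumptions for illustration; only the input and output file names come from this description.

```python
import csv


def split_entities(doc):
    """Split a spaCy Doc's entities into PERSON names and everything else."""
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    others = [ent.text for ent in doc.ents if ent.label_ != "PERSON"]
    return names, others


def extract_to_csv(lines, nlp, out_path):
    """Run nlp over each line, write one CSV row per line, return the entity count."""
    total = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Hypothetical header names; the real script may label columns differently.
        writer.writerow(["field_245c", "names", "other_entities"])
        for line in lines:
            names, others = split_entities(nlp(line))
            total += len(names) + len(others)
            writer.writerow([line, "; ".join(names), "; ".join(others)])
    return total


def main():
    # Imported here so the helpers above can be exercised without spaCy installed.
    import spacy

    with open("f245c_directors.txt", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]

    # Assumed model names for the "small" and "medium" English pipelines.
    for model in ("en_core_web_sm", "en_core_web_md"):
        nlp = spacy.load(model)
        out_path = f"output_{model.removeprefix('en_')}.csv"
        total = extract_to_csv(lines, nlp, out_path)
        print(f"{model}: {total} entities written to {out_path}")
```

Calling main() inside the container would require both models to be installed. Passing nlp in as a parameter keeps the CSV logic testable with a stub in place of a loaded model.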