PDF parsers that extract structured data from FIA F1 documents. This work is part of theOehrly/Fast-F1#445 and jolpica/jolpica-f1.
- Wed/Thu: get tyre compound from "Event Notes/Pirelli Preview"
- Thu/Fri: get driver entry list, from "Entry List"
  - the data dump is going to the `RoundEntry` table
- Sat: get quali. lap times and classification, from "Quali. Lap Times" and "Quali. Final Classification". This gives two data dumps:
  - classification, to be inserted into the `SessionEntry` table
  - lap times, to be inserted into the `Lap` table
  - the two PDFs have to be parsed jointly, because "Quali. Final Classification" only tells us the lap time of the fastest lap, but doesn't tell us which lap number it is. We need to combine this with "Quali. Lap Times" to get the full info.
- Sun: get race lap times, pit stops, and classification, from "Race History Chart", "Pit Stop Summary", and "Race Final Classification". We get three data dumps from them:
  - classification, to be inserted into the `SessionEntry` table
  - lap times, to be inserted into the `Lap` table
  - pit stops, to be inserted into the `PitStop` table
  - the three PDFs can be parsed individually. But to enable cross validation, we parse them jointly?
On a sprint weekend, sprint quali./shootout can be parsed as if it were a usual quali. session, and the sprint race as if it were a usual race.
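
The joint parsing of the two quali. PDFs described above can be sketched as follows. This is only an illustration in plain Python, not fiadoc's actual implementation: the `LapTime` record, the `find_fastest_lap_no` helper, and all values are invented for the example. It assumes lap times are available as integer milliseconds, so that the fastest time from "Quali. Final Classification" can be matched exactly against the rows from "Quali. Lap Times".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LapTime:
    """One row of a hypothetical parsed "Quali. Lap Times" PDF."""
    car_no: int
    lap_no: int
    time_ms: int


def find_fastest_lap_no(laps, car_no, fastest_ms):
    """Return the lap number(s) of `car_no` whose time equals `fastest_ms`."""
    return [lap.lap_no for lap in laps
            if lap.car_no == car_no and lap.time_ms == fastest_ms]


# Invented example rows, standing in for the parsed lap times
laps = [
    LapTime(car_no=1, lap_no=2, time_ms=91_500),
    LapTime(car_no=1, lap_no=5, time_ms=90_200),
    LapTime(car_no=16, lap_no=3, time_ms=90_450),
]

# Invented classification: car number -> fastest lap time, without lap number
classification = {1: 90_200, 16: 90_450}

# Recover the missing lap number by matching the two sources
fastest_lap_no = {
    car_no: find_fastest_lap_no(laps, car_no, fastest_ms)
    for car_no, fastest_ms in classification.items()
}
print(fastest_lap_no)  # {1: [5], 16: [3]}
```

The helper returns a list rather than a single number, so that zero or multiple matches (for example, two laps with an identical time) stay visible instead of being silently resolved.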
```python
from fiadoc.parser import RaceParser

# Parse the race PDFs of one event jointly
parser = RaceParser('data/pdf/2023_18_race_final_classification.pdf',
                    'data/pdf/2023_18_race_lap_analysis.pdf',
                    'data/pdf/2023_18_race_history_chart.pdf',
                    'data/pdf/2023_18_race_lap_chart.pdf',
                    2023,
                    18,
                    'race')

# Dump the parsed classification and lap times
parser.classification_df.to_pkl('data/race_final_classification.pkl')
parser.lap_times_df.to_pkl('data/race_lap_times.pkl')
```
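
As for the cross validation that joint parsing of the race PDFs would enable, one possible check is sketched below. It is a hypothetical example in plain Python, not part of fiadoc: the `cross_validate` function, the data layout, and all values are assumptions made for illustration. It assumes the classification gives the number of laps completed per car, and that every pit stop lap should also appear in the lap times.

```python
from collections import Counter


def cross_validate(classification, lap_times, pit_stops):
    """Return a list of inconsistencies between the three sources.

    classification: dict of car number -> laps completed
    lap_times:      list of (car number, lap number, lap time in ms)
    pit_stops:      list of (car number, lap number)
    """
    problems = []

    # The number of lap time rows per car should match the classification
    laps_seen = Counter(car_no for car_no, _lap_no, _time_ms in lap_times)
    for car_no, laps_completed in classification.items():
        if laps_seen[car_no] != laps_completed:
            problems.append(
                f"car {car_no}: classification has {laps_completed} laps, "
                f"lap times have {laps_seen[car_no]}"
            )

    # Every pit stop should happen on a lap that has a lap time
    known_laps = {(car_no, lap_no) for car_no, lap_no, _time_ms in lap_times}
    for car_no, lap_no in pit_stops:
        if (car_no, lap_no) not in known_laps:
            problems.append(
                f"car {car_no}: pit stop on lap {lap_no} has no lap time"
            )

    return problems


# Invented example data with one deliberate inconsistency
classification = {1: 3, 16: 2}
lap_times = [
    (1, 1, 95_000), (1, 2, 93_000), (1, 3, 92_500),
    (16, 1, 96_000), (16, 2, 94_000),
]
pit_stops = [(1, 2), (16, 3)]

problems = cross_validate(classification, lap_times, pit_stops)
print(problems)  # ['car 16: pit stop on lap 3 has no lap time']
```

Collecting all inconsistencies into a list, rather than raising on the first one, makes it easier to judge whether a mismatch comes from a parsing error or from the source documents themselves.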