Implement first version of parsing metrics #3544

Open
chloedia opened this issue Jan 2, 2025 — with Linear · 1 comment

Collaborator

chloedia commented Jan 2, 2025

For a list of potential datasets for parsing, see CORE-335.

We pushed a copy of OmniDocBench to our Hugging Face organization: https://huggingface.co/datasets/Quivr/OmniDocBench/

We will use OmniDocBench, which includes 9 different types of PDF pages, 5 layouts, and 3 languages.

Screenshot 2025-01-22 at 09.18.02.png

Pages can have a non-white background and rotated text.

data_diversity.png

Below are some examples of the different page types, taken from https://huggingface.co/datasets/opendatalab/OmniDocBench/tree/main

show_pdf_types_1.png

show_pdf_types_2.png
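
As a possible starting point for a first metric, here is a minimal sketch of a normalized edit distance between parsed text and ground-truth text. This issue does not specify which metrics will be implemented, so the choice of edit distance and the function names `levenshtein` and `normalized_edit_distance` are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance divided by the longer length: 0.0 is identical, 1.0 is fully different."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        # Assumption: two empty strings count as a perfect match.
        return 0.0
    return levenshtein(predicted, reference) / longest
```

A per-page score like this could then be averaged per page type, layout, or language to see where a parser struggles.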


linear bot commented Jan 2, 2025

@jacopo-chevallard jacopo-chevallard changed the title Implement a first draft of metrics Implement a first draft of parsing metrics Jan 22, 2025
@jacopo-chevallard jacopo-chevallard changed the title Implement a first draft of parsing metrics Implement first version of parsing metrics Jan 22, 2025