
V2 is a huge improvement #121

Open
Lastofthefirst opened this issue May 12, 2024 · 4 comments
Comments

@Lastofthefirst

v2 is a huge improvement well done!

When I went to update, on the first run I got the error "no module named surya". I tried pip install surya with no luck, but the surya repo shows pip install surya-ocr, and that worked. The same thing happened with pdftext. Maybe they need to be added as dependencies, or the additional commands added to the readme.
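For reference, here is what ended up working for me; I'm assuming from the surya and pdftext repos that these package names are the intended dependencies:

# installs that resolved the missing-module errors on my machine
pip install surya-ocr
pip install pdftext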

Here is an example paper I was trying to convert from PDF to Markdown:

33.3+Smith.pdf

Here is what the previous version generated:

33.3+Smith.md

And here is what I got with v2:

33.3+Smith.md

I used the command: marker_single /Downloads/33.3+Smith.pdf /Downloads --batch_multiplier 2 --langs English

It's really a huge improvement. It seems like the section-heading font causes an issue in both cases. I am still hitting an issue with footnotes, but it looks a lot better and takes a lot less cleanup. There is also something strange where certain words have a space in them, and in v1 they had a strange symbol instead. Take for example the word "scientific" (which, to find with ctrl-f, you have to search for "scienti"). Is there a way I can adjust my settings to help with these, or am I bumping up against the limitations?
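In case it's relevant, the strange v1 symbols look like PDF ligature glyphs (ﬁ, ﬂ), so as a rough workaround idea, not a marker setting, just a guess at post-processing the generated Markdown, something like this could map leftover glyphs back to plain letters. It obviously can't recover the letters that were dropped entirely in the v2 output:

# rough post-processing of the output file, not a marker option:
# replace common Unicode ligature glyphs with their plain-letter equivalents
sed 's/ﬁ/fi/g; s/ﬂ/fl/g; s/ﬀ/ff/g; s/ﬃ/ffi/g; s/ﬄ/ffl/g' 33.3+Smith.md > 33.3+Smith.cleaned.md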

Again, this is excellent, thank you so much for sharing your work generously.

@relsas

relsas commented May 13, 2024

Hi!
I'm also impressed by Marker version 2.5! After testing LlamaParse, GROBID, Nougat, and, a long time ago, Textract, it appears to me to be currently the best PDF parser. It does a far better job of identifying tables than Nougat, I haven't found any dropped pages yet, formulas are most often identified well, it captures most figures, and it already runs stably and relatively fast.

However, in my examples, it basically deletes all footnotes (which is better than mixing them into the text, of course), and it does not yet capture table notes/legends correctly.

Attached are an example paper and the MD result.

Please continue to develop it - great work!

Earnings_Prediction_Using_Recurrent_Neural_Networks.md
Earnings_Prediction_Using_Recurrent_Neural_Networks.pdf

@xuboot

xuboot commented May 15, 2024

@relsas
How did you test it? In my test it was very slow. I ran it on a Tesla T4 GPU.


@relsas

relsas commented May 15, 2024

Hi,
I use a laptop with a mobile RTX 4090 and 64 GB RAM. I guess "slow" is relative - parsing a paper like the uploaded one takes about one minute. As the GPU is barely used, I figure that batch processing will speed it up further. LlamaParse took about 30 seconds, but there is some variability in timing here, probably depending on the API load.
Nougat-base took about 1:30 minutes with much more GPU load (and no extracted figures).
