Create a special column type when it contains PDF bytes or PDF URL #2991

Open
severo opened this issue Jul 22, 2024 · 3 comments
Labels
blocked-by-upstream (The issue must be fixed in a dependency), feature request (Request for a new feature), P2 (Nice to have)

Comments

@severo
Collaborator

severo commented Jul 22, 2024

In that case, we would generate an image (thumbnail of the first page), stored as an asset, to populate /first-rows and /rows and display in the dataset viewer.

asked internally on Slack: https://huggingface.slack.com/archives/C064HCHEJ2H/p1721215883166569 cc @Pleias

@severo added the "feature request" label Jul 22, 2024
@lhoestq
Member

lhoestq commented Jul 22, 2024

The priority is to have the PDF type detection and thumbnail IMO.

One way to tackle this is to add the PDF type detection in datasets for the bytes case. This way it will be easy to:

  • reuse the same logic as audio/image for the viewer
  • show the rendering of the first page of the PDF in the Viewer (rendered using e.g. pypdfium2)
  • add the "document" modality using the same logic as image/audio
  • (later and if there is interest) define DocumentFolder (or PdfFolder)
  • (later and if there is interest) support PDFs in WebDataset TAR files
  • (later and if there is interest) use a library such as pypdfium2 to handle reading/writing
  • (later and if there is interest) help teams/communities with document AI data loading (cc @molbap for viz)
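To make the bytes case concrete, here is a minimal sketch of what detection plus first-page rendering could look like. It is not the logic in datasets or the viewer: the function names `looks_like_pdf` and `render_first_page_thumbnail`, the 1024-byte search window, and the 256-pixel thumbnail width are all assumptions chosen for illustration. Detection relies only on the `%PDF-` header; rendering assumes pypdfium2 and Pillow are installed.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Heuristic check: does the PDF header appear near the start of the bytes?

    Assumption: a small search window is used instead of a strict
    `startswith`, to tolerate files with a few leading junk bytes.
    """
    return PDF_MAGIC in data[:search_window]


def render_first_page_thumbnail(data: bytes, max_width: int = 256):
    """Render page 0 of a PDF to a PIL image roughly `max_width` pixels wide.

    Hypothetical helper; requires the third-party pypdfium2 and Pillow packages.
    """
    import pypdfium2 as pdfium  # imported lazily so detection works without it

    pdf = pdfium.PdfDocument(data)
    try:
        page = pdf[0]
        # Page width is in PDF points; scale so the output is about max_width px.
        scale = max_width / page.get_width()
        # Copy the image so it does not depend on the bitmap buffer afterwards.
        image = page.render(scale=scale).to_pil().copy()
    finally:
        pdf.close()
    return image
```

A column could then be flagged as "document" when its sampled cells pass `looks_like_pdf`, mirroring how image and audio bytes are recognized; where exactly that check would live is left open in this thread.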

Then for the URL case we can extend the image URL detection step in the viewer, but I'm not sure whether it's possible to render a thumbnail of a PDF in JS from a URL.
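For the URL case, a rough sketch of what a PDF URL check might look like, assuming it works on the URL string alone, without fetching anything. The viewer's actual image URL detection is not shown in this thread, so `looks_like_pdf_url` is a hypothetical name and the `.pdf` suffix rule is an assumption.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(value: str) -> bool:
    """Heuristic: an http(s) URL whose path ends in .pdf (case-insensitive)."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Only the path is inspected, so query strings and fragments are ignored.
    return parsed.path.lower().endswith(".pdf")
```

This would miss PDFs served from URLs without a `.pdf` suffix; catching those would need a request to inspect the `Content-Type` header, which is a heavier step than a string check.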

@severo
Collaborator Author

severo commented Jul 22, 2024

> I'm not sure if it's possible to render a thumbnail of a PDF in JS from a URL

good point, we can't do the same here.

@severo added the "blocked-by-upstream" and "P2" labels Jul 22, 2024
@severo
Collaborator Author

severo commented Jul 22, 2024

I opened huggingface/datasets#7058


2 participants