Mass Import Files? #239

Open
1 of 2 tasks
Raichuu41 opened this issue Jul 3, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@Raichuu41

Raichuu41 commented Jul 3, 2024

Description

Question/Discussion: What is the best way to mass-import many files? I need to import about 200,000 text files. Currently, my only working solution is to upload the files into GitHub folders in batches of 500 and then import those folders manually through the GitHub reader, one by one, starting each import once the previous one has completed. Is there an easier way to do this, possibly by sending the file bytes directly to an API endpoint?
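
The batching step of the workaround described above could be scripted roughly as follows. This is only a minimal sketch: the `batched` helper and the `docs/file_N.txt` paths are hypothetical names for illustration, and it covers only the grouping of files into batches of 500, not the upload or the import itself.

```python
from pathlib import Path
from typing import Iterator


def batched(items: list[Path], size: int) -> Iterator[list[Path]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical file list; in practice this might come from Path("docs").glob("*.txt").
paths = [Path(f"docs/file_{i}.txt") for i in range(1200)]
batches = list(batched(paths, 500))

for index, batch in enumerate(batches):
    # Each batch would go into its own folder, e.g. batch_000, batch_001, ...
    print(f"batch_{index:03d}: {len(batch)} files")
```

With 200,000 files and a batch size of 500, this yields 400 folders, which is why a manual folder-by-folder import is impractical.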

Is this a bug or a feature?

  • Bug
  • Feature

Steps to Reproduce

[see above]

Additional context

[None]

@thomashacker
Collaborator

Good point. There is currently no feature for mass importing files, but we'll add it to the feature list.

@thomashacker thomashacker added the enhancement New feature or request label Jul 4, 2024
@Raichuu41
Author

Raichuu41 commented Jul 30, 2024

> Good point. There is currently no feature for mass importing files, but we'll add it to the feature list.

@thomashacker
Would it be possible for me to support you in implementing mass file import? I really need the functionality and have the skills to build it. If possible, we could have a short online meeting of 30-60 minutes in which you introduce the current importing mechanism and explain how it can be extended.

@thomashacker
Collaborator

We're currently implementing mass file import in the upcoming v2 version, which should be released in a couple of weeks. If you need the functionality now, you can add it yourself; the source code of both the frontend and the backend is available here 😄

@thomashacker
Collaborator

Implemented the mass import functionality in the newest release.

@Raichuu41
Author

Raichuu41 commented Sep 11, 2024

> Implemented the mass import functionality in the newest release

Where is this implemented? I see no documentation for it. I found one backend endpoint, @app.websocket("/ws/import_files"). Is it this one? If so, how can I make it work? I tried to follow the code, but my requests keep failing. The data I send passes validation as a DataBatchPayload, but the call fails in add_batch() when self.check_batch() tries to generate the fileConfig: the result does not validate as a FileConfig. Following the code, the input to that validation is just the concatenated values of the chunks field, as shown here (goldenverba/server/helpers.py):

chunks = self.batches[fileID]["chunks"]
data = "".join([chunks[chunk] for chunk in chunks])

So I assume the chunk value for DataBatchPayload needs to be a FileConfig? If so, why is it defined only as a string and not as a FileConfig? Some documentation would be nice. Maybe this isn't even the intended functionality. In general, it is good practice to reference the issue in the commit that resolves it. I am also confused about why there is no longer any technical documentation in the repository. The hyperlink still exists in the README, but it points to nothing, and the technical markdown file has been deleted with no replacement.
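
One possible reading of the helper snippet above, not confirmed by the maintainers in this thread, is that each chunk is a string fragment of one serialized FileConfig, and that the fragments are joined back into a single string before validation. The sketch below illustrates that idea only. The `split_into_chunks` and `reassemble` helpers and the fields of the sample dictionary are hypothetical, and the real FileConfig schema is not shown in this thread.

```python
import json


def split_into_chunks(text: str, chunk_size: int) -> dict[int, str]:
    """Split `text` into ordered string fragments keyed by their position."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return {
        order: text[start:start + chunk_size]
        for order, start in enumerate(range(0, len(text), chunk_size))
    }


def reassemble(chunks: dict[int, str]) -> str:
    """Join the fragments back together in key order.

    The quoted helper joins in dictionary iteration order; sorting the keys
    here is an added safeguard in case fragments arrive out of order.
    """
    return "".join(chunks[order] for order in sorted(chunks))


# Hypothetical stand-in for a FileConfig; the field names are assumptions.
file_config = {"fileID": "example.txt", "content": "hello world " * 20}

serialized = json.dumps(file_config)                    # one JSON string
chunks = split_into_chunks(serialized, chunk_size=64)   # many string chunks
restored = json.loads(reassemble(chunks))               # parsed after joining

assert restored == file_config
```

Under this reading, a single chunk would not be a FileConfig by itself, which would explain why the field is typed as a string: only the joined result is expected to parse and validate.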

@thomashacker
Collaborator

Good point! We added mass file import via the frontend; the FastAPI endpoints are currently only optimized to communicate with the frontend. Can you share more information on what functionality you need? We're working on a user API to make it easier to use programmatically in the future.

And I agree. We're currently reworking the technical documentation, and it will be re-added soon 🚀

@thomashacker thomashacker reopened this Sep 13, 2024
@Japhys

Japhys commented Sep 28, 2024

I am doing a mass import now and get this message. I can work around it because there are not that many files, but I thought I'd mention it anyway.

WebSocket connection died

after every successfully processed PDF. These PDF files are pretty large, though, usually consisting of 100-200 pages.


3 participants