Add an allow list for specific file types to be uploaded into the search index #3497
Comments
Some data for deciding the open questions. Mime type counts by file suffix:
File suffixes in prod data:
A lot of the weird MIME types are caused by weird files, so I think it's best to ignore those. Assuming we just want text-like file types, MIME type wouldn't work well: octet-stream is too generic and covers too many things, and application/pdf similarly has a few extras mixed in. If we assume the suffixes on the file names are generally right, we could use those instead. If so, we could limit it to the following file types:
That would still cover every text file type with at least 10 records. We need to check whether any of these are prohibitively large.
Skimming through some of the largest PDFs (50 MB+), the issue seems to be just poor compression/optimization. Some are clearly scans of printouts and some are just oddly formatted, but they all seem valid. We may just need to check the file size. I verified the lower env doesn't have anything too big either.
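As a rough illustration of the file size check floated above, here is a minimal Python sketch. The function name and the 50 MB cap are assumptions for illustration only; no threshold has been decided in this issue.

```python
# Hypothetical cap for illustration; the actual limit is still an open question.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def is_within_size_limit(size_bytes: int, max_bytes: int = MAX_FILE_SIZE_BYTES) -> bool:
    """Return True if a file of size_bytes is at or under the cap."""
    # Negative sizes are treated as invalid rather than as "small".
    return 0 <= size_bytes <= max_bytes
```

A check like this could run alongside the suffix filter, so an oversized but otherwise valid PDF is skipped instead of being sent to the index.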
Summary
Before we turn on the process that loads opportunities into our search index, we want to add a filter so that only certain file types get uploaded. For example, we don't want to upload an mp4 file, since the file needs to be roughly a text file.
Filter to only files with the following suffixes (case-insensitive):
Note that it's more efficient to filter by doing something like than any sort of looping:
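One way to do a case-insensitive suffix check without looping, sketched in Python on the assumption that the loader is written in Python. The suffixes in the tuple are placeholders, not the agreed allow list, and `is_allowed_file` is a hypothetical name.

```python
# Placeholder suffixes for illustration; substitute the allow list from this issue.
ALLOWED_SUFFIXES = (".pdf", ".txt", ".docx")


def is_allowed_file(file_name: str) -> bool:
    """Return True if file_name ends with an allowed suffix, ignoring case."""
    # str.endswith accepts a tuple of suffixes, so no explicit loop is needed.
    return file_name.lower().endswith(ALLOWED_SUFFIXES)
```

With these placeholder values, `is_allowed_file("Report.PDF")` is accepted and `is_allowed_file("video.mp4")` is rejected.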
Acceptance criteria