bug/file type detection fallback strategy not working #3596

Closed
WHALEEYE opened this issue Sep 4, 2024 · 1 comment · Fixed by #3828
Labels
bug Something isn't working

Comments


WHALEEYE commented Sep 4, 2024

Describe the bug
The following line is meant to detect whether libmagic is installed, but it only detects the python-magic package, not the underlying libmagic library that package depends on.

LIBMAGIC_AVAILABLE = bool(importlib.util.find_spec("magic"))

Therefore, as long as python-magic is installed, LIBMAGIC_AVAILABLE is always True, even when libmagic itself is missing. In that case import magic raises an error, so the fallback strategy is never reached.
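
A minimal sketch of a stricter check, for illustration only: instead of asking whether the package spec exists, it attempts the import and treats a failure as "not available". The helper names `spec_found` and `imports_cleanly` are hypothetical, and this is not necessarily how #3828 addresses the problem. Catching `OSError` alongside `ImportError` is an assumption, made to cover shared-library loading failures as well.

```python
import importlib
import importlib.util


def spec_found(module_name: str) -> bool:
    """Original-style check: True whenever the Python package is installed."""
    return importlib.util.find_spec(module_name) is not None


def imports_cleanly(module_name: str) -> bool:
    """Stricter check: True only when the module can actually be imported."""
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError):
        # Assumption: a missing native dependency surfaces as one of these
        # exceptions when the wrapper package is imported.
        return False
    return True


# With python-magic installed but libmagic missing, spec_found("magic") is
# still True, while imports_cleanly("magic") reports the failed import.
LIBMAGIC_AVAILABLE = imports_cleanly("magic")
```

The trade-off is that the module is imported eagerly at check time, which costs a little more than `find_spec` but reflects whether the dependency is usable.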

WHALEEYE added the bug label Sep 4, 2024
scanny (Collaborator) commented Dec 16, 2024

Fixed by #3828, out later this week.

scanny closed this as completed Dec 16, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 17, 2024
**Summary**
Fixes a bug where a CSV file with asserted content-type
`application/vnd.ms-excel` was incorrectly identified as an XLS file and
failed partitioning.

**Additional Context**
The `content_type` argument to partitioning is often authored by the
client system (e.g. Unstructured SDK) and is both unreliable and outside
the control of the user. In this case the `.csv -> XLS` mapping is
correct for certain purposes (Excel is often used to load and edit CSV
files) but not for partitioning, and the user has no readily available
way to override the mapping.

XLS files as well as seven other common binary file types can be
efficiently and reliably detected (at least 99.999% of the time) using
code we already have in the file detector.

- Promote this direct-inspection strategy to be tried first.
- When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use
that file-type.
- When one of those types is NOT detected, clear the asserted
`content_type` when it matches any of those types. This prevents the
problem seen in the bug where the asserted content type was used to
determine the file-type.
- The remaining content_type, guessed MIME-type, and filename-extension
mapping strategies are tried, in that order, only when direct inspection
fails. This is largely the same as it was before.
- Fix #3781 while we were in the neighborhood.
- Fix #3596 as well, essentially an earlier report of #3781.
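
The ordering described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's actual code: `detect_file_type`, the `inspect` callable, and the two small mapping tables are all hypothetical names invented for this example.

```python
import os
from typing import Callable, Optional

# The eight types the commit message says direct inspection can detect.
DIRECTLY_DETECTABLE = {"DOC", "DOCX", "EPUB", "ODT", "PPT", "PPTX", "XLS", "XLSX"}

# Hypothetical, deliberately tiny mapping tables for illustration only.
MIME_TO_TYPE = {"application/vnd.ms-excel": "XLS", "text/csv": "CSV"}
EXT_TO_TYPE = {".csv": "CSV", ".xls": "XLS"}


def detect_file_type(
    inspect: Callable[[], Optional[str]],
    content_type: Optional[str] = None,
    guessed_mime: Optional[str] = None,
    filename: Optional[str] = None,
) -> Optional[str]:
    # 1. Direct inspection is tried first and wins when it finds a match.
    detected = inspect()
    if detected in DIRECTLY_DETECTABLE:
        return detected

    # 2. Inspection ruled those types out, so an asserted content type that
    #    claims one of them cannot be trusted and is cleared.
    if MIME_TO_TYPE.get(content_type) in DIRECTLY_DETECTABLE:
        content_type = None

    # 3. Remaining strategies, in order: asserted content type, guessed
    #    MIME type, then filename extension.
    if content_type in MIME_TO_TYPE:
        return MIME_TO_TYPE[content_type]
    if guessed_mime in MIME_TO_TYPE:
        return MIME_TO_TYPE[guessed_mime]
    if filename:
        ext = os.path.splitext(filename)[1].lower()
        return EXT_TO_TYPE.get(ext)
    return None


# A CSV file mislabeled as Excel by the client: inspection finds no XLS
# signature, the asserted type is cleared, and the guessed MIME type decides.
csv_result = detect_file_type(
    inspect=lambda: None,
    content_type="application/vnd.ms-excel",
    guessed_mime="text/csv",
    filename="data.csv",
)
```

In this sketch the asserted `content_type` only influences the result for types that direct inspection cannot settle, which is the behavior the commit message describes.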