ZIP archive misidentified as video/x-ms-wmv #77

Open
mdavidn opened this issue Sep 14, 2022 · 3 comments

mdavidn commented Sep 14, 2022

I have a valid ZIP archive that happens to include the bytes wmv2 in the first four kilobytes. Active Storage misidentifies the file as Windows Media Video. When scanning over such a broad range of bytes, WMV magic needs a lower priority than other matches.

```ruby
Marcel::MimeType.for Pathname.new('A-453.zip'), name: 'A-453.zip', declared_type: 'application/zip'
# => "video/x-ms-wmv"

File.read('A-453.zip')[0...4]
# => "PK\u0003\u0004"

File.read('A-453.zip').index('wmv2')
# => 585

`unzip -t A-453.zip`.chomp.split("\n").last
# => "No errors detected in compressed data of A-453.zip."
```
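
To illustrate why a rule that scans a wide byte range can fire on unrelated files, here is a minimal sketch in plain Ruby. It does not use Marcel; `WMV_RULE`, `ZIP_LIKE`, and `naive_match?` are hypothetical names, and the matching logic is a simplified assumption about how an unanchored substring rule behaves, not Marcel's actual implementation.

```ruby
# Hypothetical, simplified stand-in for an unanchored magic rule (not Marcel's code).
WMV_RULE = { type: 'video/x-ms-wmv', range: 0..8192, value: 'wmv2'.b }.freeze

# Synthetic buffer: ZIP local-file-header signature, padding, then the bytes "wmv2"
# at offset 585, mirroring the offset reported above.
ZIP_LIKE = ("PK\x03\x04".b + ("\x00".b * 581) + 'wmv2'.b + ("\x00".b * 100)).freeze

# Returns true when the rule's value appears anywhere inside the scanned window.
def naive_match?(data, rule)
  window = data.byteslice(rule[:range].begin, rule[:range].size) || ''.b
  window.include?(rule[:value])
end

puts ZIP_LIKE.index('wmv2'.b)         # offset of the stray bytes
puts naive_match?(ZIP_LIKE, WMV_RULE) # the substring rule fires despite the ZIP signature
```

Because the rule is not anchored to a fixed offset, the ZIP signature at byte 0 plays no part in the decision.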

mdavidn commented Sep 14, 2022

Here's my workaround for now, added to an initializer.

```ruby
if Marcel::MimeType.for("PK\03\04wmv2") == 'video/x-ms-wmv'
  Marcel::Magic.remove('video/x-ms-wmv')
end
```

Contributor

pixeltrix commented Sep 29, 2022

Just been bitten by this with a PDF as well. Looking at the definition here, it seems that any instance of the string wmv2 in the first 8 KB will trigger this match:

marcel/data/tika.xml

Lines 7701 to 7715 in 8e28563

```xml
<mime-type type="video/x-ms-wmv">
  <sub-class-of type="video/x-ms-asf" />
  <glob pattern="*.wmv"/>
  <magic priority="60">
    <match value="Windows Media Video" type="unicodeLE" offset="0:8192" />
    <match value="VC-1 Advanced Profile" type="unicodeLE" offset="0:8192" />
    <match value="wmv2" type="unicodeLE" offset="0:8192" />
  </magic>
</mime-type>
<mime-type type="video/x-ms-wmx">
  <glob pattern="*.wmx"/>
</mime-type>
<mime-type type="video/x-ms-wvx">
  <glob pattern="*.wvx"/>
</mime-type>
```

This seems wildly broad for a magic string. I think the issue is that the Tika rule is designed to match a codec type, so it would only apply in the context of a file ending in .wmv, whereas Marcel applies it as a general magic string. There could be other mismatches like this in the Tika source file 😬
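
Since the definition declares video/x-ms-wmv a sub-class of video/x-ms-asf, one possible application-side guard is to accept a WMV result only when the data also starts with the ASF header object GUID. The following is a sketch under that assumption; `plausible_wmv?` and `guarded_type` are hypothetical helpers, not part of Marcel.

```ruby
# ASF header object GUID as it appears in the first 16 bytes of an ASF container.
ASF_HEADER_GUID = ['3026b2758e66cf11a6d900aa0062ce6c'].pack('H*').freeze

# Hypothetical helper: true only when the data begins with the ASF header GUID.
def plausible_wmv?(data)
  data.byteslice(0, 16).to_s.b == ASF_HEADER_GUID
end

# Hypothetical helper: keeps the detected type unless it is a WMV result that
# the container check contradicts, in which case the fallback type is used.
def guarded_type(detected, data, fallback)
  return detected unless detected == 'video/x-ms-wmv'

  plausible_wmv?(data) ? detected : fallback
end

puts guarded_type('video/x-ms-wmv', "PK\x03\x04wmv2".b, 'application/zip')
```

In an application, `detected` would be whatever Marcel::MimeType.for returned and `fallback` could be the declared type; whether to trust the declared type is a judgment call for each app.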


Aquaj commented Nov 7, 2024

Encountered this on a PDF as well.

Instead of removing the rule, we reordered the Marcel::MAGIC rules to prioritize our most common file types:

```ruby
# config/initializers/prioritize_used_filetypes_in_mimetype_detection.rb
# frozen_string_literal: true

# Necessary because some magic rules are too broad and give us false-positives,
# so it's more accurate to assume our users are using the filetypes we actually
# expect in the app.
#
# See: https://github.com/rails/marcel/issues/77

EXPECTED_FILE_FORMATS = [
  'application/msword', # .doc
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document', # .docx
  'application/vnd.ms-outlook', # .msg
  'application/pdf',
  'text/plain',
  # ...
].freeze

# Get magic rules for our filetypes
priority_magic_rules = Marcel::MAGIC.select { |type, _rules| type.in?(EXPECTED_FILE_FORMATS) }

# Move all expected filetypes' rules to the top of the list
Marcel::MAGIC.delete_if { |type, _rules| type.in?(EXPECTED_FILE_FORMATS) }
Marcel::MAGIC.unshift(*priority_magic_rules.to_a)
```
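
The effect of this reordering can be seen on a small stand-in table. This sketch assumes the rule list is an ordered array of [type, rules] pairs in which an earlier entry wins; `MAGIC_TABLE`, `EXPECTED`, and the placeholder rule symbols are hypothetical, and plain Ruby `include?` is used so the example runs without ActiveSupport.

```ruby
# Stand-in for an ordered rule list of [type, rules] pairs (rules are placeholders).
MAGIC_TABLE = [
  ['video/x-ms-wmv', :wmv_rules],
  ['application/zip', :zip_rules],
  ['application/pdf', :pdf_rules]
].freeze

EXPECTED = ['application/pdf', 'application/zip'].freeze

# Split into expected and other entries, then put the expected ones first.
prioritized, rest = MAGIC_TABLE.partition { |type, _rules| EXPECTED.include?(type) }
REORDERED = (prioritized + rest).freeze

puts REORDERED.map(&:first).inspect
```

Note that the promoted entries keep their original relative order from the table; the order of the expected-types list itself does not affect the result.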
