Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make image format detection and SvgImageFile initialisation more robust #144

Open
zerolab opened this issue Mar 8, 2024 · 8 comments
Open

Comments

@zerolab
Copy link
Collaborator

zerolab commented Mar 8, 2024

https://pastebin.com/jNWdCZ2N

File "/usr/local/lib/python3.8/site-packages/willow/image.py", line 313, in __init__
cms-app-6b79f8884b-tmjzb cms-app     self.dom = ElementTree.parse(f)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/site-packages/defusedxml/ElementTree.py", line 130, in parse
cms-app-6b79f8884b-tmjzb cms-app     return _parse(source, parser)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", line 1202, in parse
cms-app-6b79f8884b-tmjzb cms-app     tree.parse(source, parser)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", line 601, in parse
cms-app-6b79f8884b-tmjzb cms-app     parser.feed(data)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", 

Update: the issue seems to be with

Willow/willow/image.py

Lines 82 to 99 in 830aa3d

def open(cls, f):
# Detect image format
image_format = filetype.guess_extension(f)
if image_format is None and cls.maybe_xml(f):
image_format = "svg"
# Find initial class
initial_class = INITIAL_IMAGE_CLASSES.get(image_format)
if not initial_class:
if image_format:
raise UnrecognisedImageFormatError(
f"Cannot load {image_format} images ({INITIAL_IMAGE_CLASSES!r})"
)
else:
raise UnrecognisedImageFormatError("Unknown image format")
return initial_class(f)
(and specifically

Willow/willow/image.py

Lines 86 to 87 in 830aa3d

if image_format is None and cls.maybe_xml(f):
image_format = "svg"
) where the user is uploading a PNG (allegedly)

@Stormheg
Copy link
Member

Stormheg commented Mar 8, 2024

@zerolab It would be helpful to have an image file that causes this bug.

@zerolab
Copy link
Collaborator Author

zerolab commented Mar 8, 2024

Yeah.. I have been trying.. https://wagtailcms.slack.com/archives/C81FGJR2S/p1709908265029289

I suspect they are uploading images without an extension, and then there's potentially something with the system that filetype cannot determine the file so Willow thinks it may be svg-ish

@zerolab
Copy link
Collaborator Author

zerolab commented Mar 12, 2024

OK, so finally got confirmation. It was images without an extension, so

Willow/willow/image.py

Lines 86 to 87 in 830aa3d

if image_format is None and cls.maybe_xml(f):
image_format = "svg"
is kicking in

@zerolab zerolab changed the title Make SvgImageFile initialisation more robust when provided a malformed SVG Make image format detection and SvgImageFile initialisation more robust Mar 12, 2024
@Nigel2392
Copy link

Nigel2392 commented Mar 12, 2024

Honestly - does it matter if we have a file that fails?

I thought he mentioned there was a file extension visible in the database. (his examples also show extension)

We should probably fallback on extensions after checking the mimetype; and then try to guess if it's SVG, no? Seems like a generally good improvement to me.

@nqcm
Copy link

nqcm commented Mar 12, 2024

We have a custom model inheriting from AbstractImage:

class Asset(AbstractImage):
    file = models.ImageField(
        verbose_name=_('file'),
        upload_to=settings.ASSET_UPLOAD_PREFIX,
        width_field='width',
        height_field='height',
        storage=MediaStorage()
    )

MediaStorage here is using django-gcloud-storage

When I upload an image I get the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

Which is weird as I am uploading a png or jpeg image.I tried different images on different environments so this issue is not arising from a corrupted image file.

After further investigation and @zerolab and @Nigel2392 think the problem is arising when willow is guessing the wrong file type /extension in

Willow/willow/image.py

Lines 86 to 87 in 830aa3d

if image_format is None and cls.maybe_xml(f):
image_format = "svg"

The code blame shows that this part of the code was added last year. I am upgrading from Wagtail 4.2 to Wagtail 5, which would explain why it never occurred before.

When I run the following:

import filetype
from wagtail.images import get_image_model

Image = get_image_model()

image = Image.objects.get(file="dev/chair-unsplash.jpg")
with image.open_file() as image_file:
    ext = filetype.guess_extension(image_file)
    print(f"Image {image.pk} {image.file} has extension {ext}")

I get Image 149277 dev/chair-unsplash.jpg has extension None

My guess is that it is unable to guess the correct file type for the BLOB when opening a file from gcloud storage.

@nqcm
Copy link

nqcm commented Mar 12, 2024

I still get the issue even if the extension column in DB is jpeg. The file itself has the extension as well.

@nqcm
Copy link

nqcm commented Mar 13, 2024

Hi, I am doing some more investigation and here's my findings

In [4]: image = Image.objects.get(file="cms-dev/chair-unsplash_TAJSD0d.jpg")

In [5]: image.file
Out[5]: <ImageFieldFile: cms-dev/chair-unsplash_TAJSD0d.jpg>

When I try to get the path I get unimplemented error, which is understandable as it is being read from Gcloud into a temp fle:

In [11]: image.file.path
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[11], line 1
----> 1 image.file.path

File /usr/local/lib/python3.10/site-packages/django/db/models/fields/files.py:59, in FieldFile.path(self)
     56 @property
     57 def path(self):
     58     self._require_file()
---> 59     return self.storage.path(self.name)

File /usr/local/lib/python3.10/site-packages/django/core/files/storage.py:128, in Storage.path(self, name)
    122 def path(self, name):
    123     """
    124     Return a local filesystem path where the file can be retrieved using
    125     Python's built-in open() function. Storage systems that can't be
    126     accessed using open() should *not* implement this method.
    127     """
--> 128     raise NotImplementedError("This backend doesn't support absolute paths.")

NotImplementedError: This backend doesn't support absolute paths.
  1. When using the open_file method I get the extension is None:
In [21]: with image.open_file() as image_file:
    ...:     ext = filetype.guess_extension(image_file)
    ...:     print(f"Image {image.pk} {image.file} has extension {ext}")
    ...:
[24/Mar/2024 11:28:35] | urllib3.connectionpool._get_conn:292 | DEBUG | Resetting dropped connection: storage.googleapis.com
[24/Mar/2024 11:28:35] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?projection=noAcl&prettyPrint=false HTTP/1.1" 200 839
[24/Mar/2024 11:28:36] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /download/storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?generation=1710246298001500&alt=media HTTP/1.1" 200 323686
Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension None
  1. When I open it using default_storage.open method I am getting the extension correctly:
In [22]: with default_storage.open(image.file.name, 'rb') as file:
    ...:     extension = filetype.guess_extension(file.read())
    ...:     print(f"Image {image.pk} {image.file} has extension {extension}")
    ...:
[24/Mar/2024 11:32:10] | urllib3.connectionpool._get_conn:292 | DEBUG | Resetting dropped connection: storage.googleapis.com
[24/Mar/2024 11:32:11] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?projection=noAcl&prettyPrint=false HTTP/1.1" 200 839
[24/Mar/2024 11:32:11] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /download/storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?generation=1710246298001500&alt=media HTTP/1.1" 200 323686
Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension jpg
  1. I also get the extension if I do the following:
storage = image.file.storage

with storage.open(image.file.name, 'rb') as file:
    extension = filetype.guess_extension(file.read())
    print(f"Image {image.pk} {image.file} has extension {extension}")

# Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension jpg

Does that mean that the open_file method is not reading the file correctly?

@nqcm
Copy link

nqcm commented Apr 4, 2024

Just an update, I was able to solve the issue by making changes to the storage class. The package I was using was downloading the image file from Gcloud and using a SpooledTemporaryFile to store it temporarily. See the code.

I changed it o use a NamedTemporaryFile and willow's open_file method started to read the mime type correctly.

May be someone can add compatibility in Willow for SpooledTemporaryFile.

But thanks for all your time and help. Really appreciate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants