Skip to content

Use read1 instead of read to get magic number #7698

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

groutr
Copy link

@groutr groutr commented Mar 29, 2023

Addresses #7697.

I changed the isinstance check because neither read nor read1 are provided by IOBase. Only RawIOBase and BufferedIOBase provide read and read1 respectively.

I think that there is little benefit to using .tell(). I suggest the following:

filename_or_obj.seek(0)
magic_number = filename_or_obj.read1(count)
filename_or_obj.seek(0)

@groutr groutr changed the title Use read1 instead of read. Use read1 instead of read to get magic number Mar 29, 2023
@headtr1ck
Copy link
Collaborator

I think some backends rely on this magic number to determine the exact file format.
Not sure if this change will cause problems if one doesn't get the full 8 bytes?

@dcherian
Copy link
Contributor

Not sure if this change will cause problems if one doesn't get the full 8 bytes?

Agree, this seems a bit unsafe? https://stackoverflow.com/questions/57726771/what-the-difference-between-read-and-read1-in-python

@groutr
Copy link
Author

groutr commented Mar 30, 2023

Agreed, and a reference to a pretty authoritative source: https://github.com/python/cpython/blob/3.11/Modules/_io/bufferedio.c#L915

It's confusing the method has a parameter called filename_or_obj but doesn't actually handle filenames.

One workaround is to use os.read when passed a filename, and .read() when passed a file object. Something similar to:

def get_magic_number(filename_or_obj, count=8):
    if isinstance(filename_or_obj, (str, os.PathLike)):
        fd = os.open(filename_or_obj, os.RDONLY)  # Append os.O_BINARY on windows
        magic_number = os.read(fd, count)
        if len(magic_number) != count:
            raise TypeError("Error reading magic number")
        os.close(fd)
    elif isinstance(filename_or_obj, io.BufferedIOBase):
        if filename_or_obj.seekable():
            pos = filename_or_obj.tell()
            filename_or_obj.seek(0)
            magic_number = filename_or_obj.read(count)
            filename_or_obj.seek(pos)
        else:
            raise TypeError("File not seekable.")
    else:
        raise TypeError("Cannot read magic number.")
    return magic_number

On my laptop (w/ SSD) using os.read is about 2x faster than using .read()

@headtr1ck
Copy link
Collaborator

I think this logic is done one level above in the call stack. But yes, maybe a different name for the argument would be better.

@dcherian
Copy link
Contributor

One workaround is to use os.read when passed a filename, and .read() when passed a file object.

Not sure about the details here. I think it would be good to discuss in an issue before proceeding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants