UnicodeDecodeError: 'utf-8' codec can't decode byte ..... json.decode() can't handle the 'plus minus' symbol #432

Closed
svikolev opened this issue Sep 8, 2022 · 2 comments
Labels: bug

svikolev commented Sep 8, 2022

Bug report

Bug summary

The Zeiss Colibri 2 driver has information fields (property names and values grabbed using pycromanager):

{'Description': 'Zeiss Colibri adapter',
'Info LED-445nm': '445nm ±24nm, 1000m',
'Info LED-505nm': '505nm ±30nm, 1000m',
'Info LED-555nm': '555nm ±150nm, 850m',
'Intensity LED-445nm': '0',
...etc.}
Note the plus-minus (±) symbol.

When acquiring events with an Acquisition, this metadata is saved, but when it is read back for display the following error is encountered:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 412: invalid start byte

Using a debugger, I determined that the offending byte is the plus-minus symbol, which the JSON decoder (UTF-8) apparently does not recognize.

Previously this was encountered in data.py, but with the updated pycromanager I encountered it in ndtiff\nd_tiff_v2.py while trying to open a dataset:

File "C:\ProgramData\Anaconda3\envs\Pumps38\lib\site-packages\ndtiff\nd_tiff_v2.py", line 87, in read_metadata
return json.loads(
File "C:\ProgramData\Anaconda3\envs\Pumps38\lib\json_init_.py", line 343, in loads
s = s.decode(detect_encoding(s), 'surrogatepass')

I would expect the 'surrogatepass' error handler to just skip it, but apparently it does not.
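
For reference, here is a minimal sketch of the failure (the property string is assumed, not copied from an actual file): the single byte 0xb1 is '±' in ISO-8859-1 but an invalid start byte in UTF-8, and 'surrogatepass' only permits lone surrogate code points rather than skipping arbitrary invalid bytes.

raw = b'{"Info LED-445nm": "445nm \xb124nm, 1000m"}'

print(raw.decode("iso-8859-1"))   # works: {"Info LED-445nm": "445nm ±24nm, 1000m"}
try:
    raw.decode("utf-8")           # 0xb1 is not valid as a UTF-8 start byte
except UnicodeDecodeError as e:
    print(e)                      # 'utf-8' codec can't decode byte 0xb1 ...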

In both an older version of pycromanager and the current one, I was able to fix this by taking inspiration from bridge.py line 165:

message = json.loads(reply[0].decode("iso-8859-1"))

and adding .decode("iso-8859-1") in the read_metadata function in ndtiff\nd_tiff_v2.py (or previously in data.py) before the bytes are passed to json.loads():

def read_metadata(self, index):
    return json.loads(
        self._read(
            index["metadata_offset"], index["metadata_offset"] + index["metadata_length"]
        ).decode("iso-8859-1")
    )
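
I assume this workaround never raises because ISO-8859-1 assigns a character to every possible byte value, so the decode cannot fail (though it could silently produce the wrong characters for bytes that really are UTF-8):

# ISO-8859-1 ("latin-1") maps every byte 0x00-0xFF to a code point,
# so decoding arbitrary bytes with it always succeeds.
bytes(range(256)).decode("iso-8859-1")  # never raises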

I am not sure what the best fix would be. Please let me know if I can be more clear, and I will respond promptly.

Version Info

  • Operating system: Windows 10
  • pycromanager version: pycromanager 0.18.3 pyhd8ed1ab_0 conda-forge
  • MicroManager version: 2.0.1 20220720
  • Python version: 3.8
  • Python environment: Jupyter notebook and PyCharm IDE

PS: Thank you so much for the great work on micro and pycro. I wish I had switched from ZEN Pro a long time ago. I will try to contribute a use case soon.

@henrypinkard (Member)

It's not entirely clear to me why this is happening. I think it would be helpful if you dug in a bit more and reported back.

Here is what's supposed to happen:

  1. NDTiff Java library is saving JSON metadata in encoding UTF-8 (The default encoding for String.getBytes())

  2. ndtiff python package is loading that metadata and interpreting it as UTF-8. Essentially it calls:

with open("path/to/file") as file:
    file.seek(start)
    string_read = file.read(end - start)
metadata = json.loads(string_read)

I believe open should have already figured out things are encoded in UTF-8. When I do a quick test and type file into the interpreter, I get:

<_io.TextIOWrapper name='tmp.txt' mode='r' encoding='UTF-8'>

So maybe something is going wrong there because Windows doesn't think it's UTF-8 (I tested on a Mac).
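
One hedged way to check this: when no encoding is given, open() in text mode falls back to the locale's preferred encoding, which on Windows is often cp1252 rather than UTF-8.

import locale

# On many Windows setups this prints 'cp1252'; on Mac/Linux it is usually 'UTF-8'.
print(locale.getpreferredencoding(False))

If that turns out to be the culprit, passing encoding="utf-8" explicitly to open() would be the telltale test.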

I can run another test of this using:

import json
import numpy as np
raw = np.array([123, 34, -62, -79, 34, 58, 49, 125], dtype=np.int8)
json.loads(raw.tobytes().decode("utf-8"))

This string of bytes is valid JSON in UTF-8 with the ± character. Running it gives me:

{'±': 1}

It behaves just like the same thing without special characters:

raw = np.array([123, 34, 97, 34, 58, 49, 125], dtype=np.int8)
json.loads(raw.tobytes().decode("utf-8"))
{'a': 1}

So I'm not sure when in the chain of encoding-saving-loading-decoding this goes wrong on your system.

I don't think that the solution you propose above will work, because NDTiff is saving as UTF-8, so decoding as ISO won't always give the same thing. This contrasts with the bridge, which encodes in ISO and decodes in ISO. I don't remember exactly why I chose to do this for the bridge, though I do remember it being confusing.
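
A quick illustrative sketch of that mismatch: decoding UTF-8 bytes as ISO-8859-1 does not raise, but it silently turns each multi-byte character into two wrong ones.

import json

utf8_bytes = '{"±": 1}'.encode("utf-8")             # b'{"\xc2\xb1": 1}'
print(json.loads(utf8_bytes.decode("iso-8859-1")))  # {'Â±': 1} -- mojibake, no error
print(json.loads(utf8_bytes.decode("utf-8")))       # {'±': 1}  -- correct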

Try playing around with these and see if you can figure out where it goes wrong on your system.

By the way, here are useful references for UTF-8 and ISO-8859-1 that I was using to interpret those numbers.

> PS: Thank you so much for the great work on micro and pycro. I wish I had switched from ZEN Pro a long time ago. I will try to contribute a use case soon.

Awesome, thanks!

@henrypinkard (Member)

#467
