MSConvert outputs wrong encoding for Waters raw files #3186

Closed
t0mdavid-m opened this issue Oct 8, 2024 · 3 comments · Fixed by #3204

Comments

t0mdavid-m commented Oct 8, 2024

We are using MSConvert to convert Waters raw files to mzML. Unfortunately, we have been experiencing issues with downstream processing, which seem to be caused by the encoding of the mzML files.

The mzML files are encoded with Windows-1252, but the header reports UTF-8:

<?xml version="1.0" encoding="utf-8"?> 

This causes our XML parser to assume UTF-8 and fail when it runs into non-ASCII characters (in our case µ, byte 0xB5 in Windows-1252, was the problem).
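To illustrate why a UTF-8 parser rejects such a file, here is a minimal sketch of a structural UTF-8 check. It is not the parser used by the reporter; the function name `firstInvalidUtf8Byte` is made up for this example, and the check is deliberately simplified (it does not reject overlong forms or surrogates). A lone 0xB5 has the bit pattern of a continuation byte, so it cannot start a UTF-8 sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset of the first byte that does not fit
// the UTF-8 lead/continuation structure, or std::string::npos if none is found.
// Simplified on purpose: overlong encodings and surrogates are not rejected.
std::size_t firstInvalidUtf8Byte(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t continuationCount = 0;
        if (lead < 0x80)                continuationCount = 0; // 0xxxxxxx (ASCII)
        else if ((lead & 0xE0) == 0xC0) continuationCount = 1; // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) continuationCount = 2; // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) continuationCount = 3; // 11110xxx
        else return i; // stray continuation byte (10xxxxxx) or invalid lead

        if (i + continuationCount >= bytes.size())
            return i; // sequence is cut off at the end of the buffer

        for (std::size_t k = 1; k <= continuationCount; ++k)
        {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80)
                return i; // expected a continuation byte
        }
        i += continuationCount + 1;
    }
    return std::string::npos;
}

// Usage sketch: the Windows-1252 byte for µ is flagged, its UTF-8 form is not.
inline void demoFirstInvalidUtf8Byte()
{
    assert(firstInvalidUtf8Byte("id=\"\xB5" "BSM\"") == 4);
    assert(firstInvalidUtf8Byte("id=\"\xC2\xB5" "BSM\"") == std::string::npos);
}
```

The string literals are split after the hex escapes because `\xB5B` would otherwise be read as a single, out-of-range escape.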

I would suggest either converting the files to UTF-8 or reporting the correct encoding in the header (I could not find any requirement for UTF-8 in the mzML specification).
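As a rough sketch of the first option (converting to UTF-8), the following transcodes a Windows-1252 byte string. This is only an illustration, not what MSConvert does; the names `cp1252ToCodePoint` and `cp1252ToUtf8` are hypothetical. One assumption to note: the five bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control code points of the same value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: maps one Windows-1252 byte to its Unicode code point.
// Only 0x80-0x9F differ from Latin-1; undefined bytes map to themselves (assumption).
std::uint32_t cp1252ToCodePoint(unsigned char byte)
{
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    if (byte >= 0x80 && byte <= 0x9F)
        return high[byte - 0x80];
    return byte;
}

// Hypothetical helper: re-encodes a Windows-1252 string as UTF-8.
// Every mapped code point is below U+10000, so at most three bytes are emitted.
std::string cp1252ToUtf8(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    for (char ch : input)
    {
        const std::uint32_t cp = cp1252ToCodePoint(static_cast<unsigned char>(ch));
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Usage sketch: µ (0xB5) becomes the two-byte UTF-8 sequence C2 B5.
inline void demoCp1252ToUtf8()
{
    assert(cp1252ToUtf8("\xB5" "BSM") == "\xC2\xB5" "BSM");
}
```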

Member

chambm commented Oct 8, 2024

What element/attribute had the non-ASCII character? Filepaths should be UTF-8 encoded and XML ids and idrefs should be xHHHH encoded.

Author

t0mdavid-m commented Oct 8, 2024

In the file I am looking at, this actually only occurs in ids and idrefs.

For instance:
id="µBSM System Pressure"

It looks like no xHHHH encoding occurs. Instead, the character appears to be included as is, in the original encoding. The only character reference I could find in the document was &quot;.
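For reference, a minimal sketch of what xHHHH escaping of such an id could look like. This is not ProteoWizard's implementation: the function names are hypothetical, the `_xHHHH_` delimiter form is an assumption, and the set of characters left unescaped is a simplification (rules for the first character of an XML id are ignored). Each byte is treated as its own code point, which holds for ASCII and for 0xA0-0xFF in Windows-1252.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: characters copied through unchanged (ASCII-only test).
bool isPlainIdChar(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '_' || c == '-' || c == '.';
}

// Hypothetical helper: replaces every other byte with _xHHHH_ (assumed form),
// where HHHH is the byte value as a four-digit uppercase hex number.
std::string escapeId(const std::string& id)
{
    std::string out;
    for (char ch : id)
    {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (isPlainIdChar(c))
        {
            out += ch;
            continue;
        }
        char buffer[8]; // "_xHHHH_" is 7 characters plus the terminating NUL
        std::snprintf(buffer, sizeof buffer, "_x%04X_", static_cast<unsigned>(c));
        out += buffer;
    }
    return out;
}

// Usage sketch with the id from this report (0xB5 is µ in Windows-1252).
inline void demoEscapeId()
{
    assert(escapeId("\xB5" "BSM System Pressure") ==
           "_x00B5_BSM_x0020_System_x0020_Pressure");
}
```

With escaping like this the id contains only ASCII, so the mismatch between the declared and the actual file encoding no longer matters for that attribute.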

Member

chambm commented Oct 8, 2024

Interesting. Looks like Visual C++'s isalpha() function says µ is an alphabetic character. But their example code implies it's not:

https://learn.microsoft.com/en-us/cpp/c-runtime-library/character-classification?view=msvc-170

Generally these routines execute faster than tests you might write and should be favored over. For example, the following code executes slower than a call to isalpha(c):

if (((c >= 'A') && (c <= 'Z')) || ((c >= 'a') && (c <= 'z')))
    return TRUE;

Looks like I need to go for the simpler approach.
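The actual fix is not shown in this thread. As an assumption about what "the simpler approach" might mean, a locale-independent ASCII range check could look like the sketch below; `isAsciiAlpha` is a hypothetical name. Unlike `isalpha()`, it does not consult the locale, and it is safe to call with a plain `char` (passing a negative `char` value to `isalpha()` is undefined behavior unless it is first cast to `unsigned char`).

```cpp
#include <cassert>

// Hypothetical helper: true only for the 52 ASCII letters, regardless of
// locale or of whether plain char is signed.
constexpr bool isAsciiAlpha(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Usage sketch with the byte from this report (0xB5 is µ in Windows-1252).
inline void demoIsAsciiAlpha()
{
    assert(isAsciiAlpha('m'));
    assert(!isAsciiAlpha('\xB5'));
}
```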
