Property values with encodings different from the input stream's encoding not read correctly (vCard 2.1) #23

GoogleCodeExporter · 2015-03-21T12:39:51Z

What steps will reproduce the problem?
1. InputStream is = new FileInputStream("test/resources/Herr Steve Jobs.vcf");
2. List<VCard> vCards = Ezvcard.parse(is).all();
3. vCards.get(0).getAddresses().get(0).getStreetAddress();

What is the expected output?
ß

What is the actual output?
�

What version of ez-vcard are you using?
0.9.6

What version of Java are you using?
8

Please provide any additional information below.
Reading the InputStream through an InputStreamReader means that every character
returned by nextChar() has already been decoded with the system default charset.
This prevents the value from being decoded properly with the charset given in
the property's encoding parameter.

nextChar() already returns U+FFFD (REPLACEMENT CHARACTER, ef bf bd in UTF-8)
when it encounters the DF byte in the file, and decoding it afterwards with
Windows-1252 does not restore U+00DF. The information is destroyed.
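The loss can be reproduced outside the parser. The following is a minimal sketch, assuming the system default charset is UTF-8 (the report does not say which default was in effect) and using a hypothetical class name. It compares decoding the DF byte directly as Windows-1252 with decoding it as UTF-8 first and trying to repair the result afterwards.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyDecodeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes with the wrong charset (UTF-8, an assumption), then tries to "repair" the text. */
    static String decodeThenRepair(byte[] raw) {
        // A lone 0xDF byte is malformed UTF-8, so it is replaced by U+FFFD here.
        String wrong = new String(raw, StandardCharsets.UTF_8);
        // Round-tripping through windows-1252 cannot bring the original byte back.
        return new String(wrong.getBytes(CP1252), CP1252);
    }

    /** Decodes the raw bytes directly with windows-1252. */
    static String decodeDirectly(byte[] raw) {
        return new String(raw, CP1252);
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xDF };
        System.out.println("direct:   " + decodeDirectly(raw));
        System.out.println("repaired: " + decodeThenRepair(raw));
    }
}
```

Only the direct decode yields U+00DF, which is why the fix has to happen before any Reader touches the bytes.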

Each block should be read as a byte stream, and the string should be
created with the encoding given in the encoding parameter of the line.

Original issue reported on code.google.com by [email protected] on 18 Nov 2014 at 5:03

Attachments:

[Herr Steve Jobs.vcf](https://storage.googleapis.com/google-code-attachments/ez-vcard/issue-23/comment-0/Herr Steve Jobs.vcf)

mangstadt · 2016-07-31T20:50:58Z

I got this to work by modifying the parser to read one byte at a time instead of one char at a time. But it only works for ASCII-compatible character sets because each byte had to be cast to a char in order to parse the vCard syntax (the property name, parameters, etc). If the vCard file as a whole is encoded in something like UTF-16 (which encodes each character in 2 bytes instead of 1), it fails.

This "cast byte to char" approach also fails for a number of more obscure character sets. I tested this by looping through all available character sets supported by the JVM and saving a vCard file in each one. I then attempted to read the file using the "cast byte to char" approach, and many failed.

```java
String s = "BEGIN:VCARD\r\nVERSION:4.0\r\nFN:Name\r\nEND:VCARD\r\n";
for (Charset c : Charset.availableCharsets().values()) {
    File file = new File("temp.vcf");
    FileOutputStream out = new FileOutputStream(file);
    BufferedWriter w = new BufferedWriter(new OutputStreamWriter(out, c));
    w.write(s);
    w.close();

    //read the vCard file...
}
```
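A related check that needs no temp file is sketched below. It is an assumption about what "fails" means here, not the test that was actually run: a charset is treated as safe for the "cast byte to char" approach only if it encodes the sample vCard to exactly the same bytes as US-ASCII. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatibilityCheck {
    static final String VCARD = "BEGIN:VCARD\r\nVERSION:4.0\r\nFN:Name\r\nEND:VCARD\r\n";

    /** True if the charset encodes this (pure ASCII) text to the same bytes as US-ASCII. */
    static boolean encodesLikeAscii(String text, Charset charset) {
        if (!charset.canEncode()) {
            return false; // decode-only charsets cannot be used to write the file at all
        }
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(ascii, text.getBytes(charset));
    }

    public static void main(String[] args) {
        // List every charset for which casting each byte to a char would misread the syntax.
        for (Charset c : Charset.availableCharsets().values()) {
            if (!encodesLikeAscii(VCARD, c)) {
                System.out.println("not ASCII-compatible for this vCard: " + c.name());
            }
        }
    }
}
```

This only tests the characters that appear in the sample string, so it says nothing about how a charset treats other ASCII characters.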

The workaround described by David only works if the vCard file is encoded in an ASCII-compatible character encoding (which ISO-8859-1 is). If the file is encoded in, say, UTF-16, it fails because the parser is trying to parse the file using ISO-8859-1, which is not compatible with UTF-16.

The problem boils down to this: How do you parse a file that contains text encoded in multiple character encodings? This is not something that happens often. 99.99% of the time, a text file is encoded in a single character set, not multiple.

The Reader class does not let you switch character encodings mid-stream. _Therefore, the only way to do this is to treat the vCard file as a binary file and manually convert each byte to a character as it is read off the stream, switching character sets when the property value is reached._ How this is done, I don't know. It might be possible using the CharsetDecoder class.
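One way the CharsetDecoder idea might look is sketched below. It is not ez-vcard code. It assumes the line has already been read off the stream as raw bytes and unfolded, that the property name and parameters are ASCII-compatible, that no parameter value contains a colon, and that the value is not quoted-printable encoded. `MixedCharsetLineSketch` and `decodeValue` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MixedCharsetLineSketch {
    private static final String CHARSET_PARAM = "CHARSET=";

    /** Decodes the value of one property line, honoring a CHARSET parameter if present. */
    static String decodeValue(byte[] line, Charset defaultCharset) {
        // Find the first ':' byte; this is where the ASCII-compatibility assumption matters.
        int colon = 0;
        while (colon < line.length && line[colon] != ':') {
            colon++;
        }
        if (colon == line.length) {
            throw new IllegalArgumentException("No ':' separator found");
        }

        // Property name and parameters are decoded as ASCII.
        String head = new String(line, 0, colon, StandardCharsets.US_ASCII);
        Charset charset = defaultCharset;
        for (String param : head.split(";")) {
            if (param.regionMatches(true, 0, CHARSET_PARAM, 0, CHARSET_PARAM.length())) {
                charset = Charset.forName(param.substring(CHARSET_PARAM.length()));
            }
        }

        // Decode only the value bytes, reporting bad input instead of silently replacing it.
        ByteBuffer value = ByteBuffer.wrap(line, colon + 1, line.length - colon - 1);
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(value)
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Value is not valid " + charset, e);
        }
    }

    public static void main(String[] args) {
        // ISO-8859-1 is used only to build test bytes: it maps U+00DF to the single byte 0xDF.
        byte[] line = "ADR;CHARSET=windows-1252:;;Hauptstra\u00DFe 1"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeValue(line, StandardCharsets.UTF_8));
    }
}
```

Because the colon search works on raw bytes, this sketch has the same limitation as the "cast byte to char" approach: it does not help when the file as a whole is UTF-16.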

I tried wrapping the raw InputStream in a new Reader object when the property value was reached, but that didn't work. For some reason, its read() method returned -1, even though the stream had not ended. This problem can be demonstrated as follows:

```java
@Test
public void multiple_readers() throws Exception {
    String s = "Hello world!";
    byte[] b = s.getBytes("UTF-8");

    ByteArrayInputStream in = new ByteArrayInputStream(b);
    Reader r1 = new InputStreamReader(in, "UTF-8");
    Reader r2 = new InputStreamReader(in, "UTF-8");

    assertEquals('H', r1.read());
    assertEquals('e', r2.read()); //fails
}
```
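A likely explanation is read-ahead: the InputStreamReader documentation notes that more bytes may be read from the underlying stream than the current read needs. The sketch below (hypothetical class name) checks how many bytes are left in the stream after a single `read()` on the first reader. With a short input like this one, typical JDK implementations buffer the whole stream, leaving nothing for a second reader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderReadAheadDemo {
    /** Returns how many bytes remain in the stream after one char is read through a Reader. */
    static int bytesLeftAfterOneRead(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        try {
            Reader r1 = new InputStreamReader(in, StandardCharsets.UTF_8);
            r1.read(); // asks for one char, but the reader fills its internal byte buffer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.available();
    }

    public static void main(String[] args) {
        byte[] b = "Hello world!".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes left for a second reader: " + bytesLeftAfterOneRead(b));
    }
}
```

This is why switching charsets by stacking a fresh Reader on the same InputStream does not work: the bytes of the property value may already sit in the first reader's buffer.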

GoogleCodeExporter added Type-Defect Priority-Medium auto-migrated labels Mar 21, 2015

mangstadt removed Priority-Medium Type-Defect labels Oct 13, 2015
