WordCount2 - doesn't work with non-Ascii characters #11

abatyuk · 2015-02-06T14:42:12Z

// Exercise: Use other versions of the Bible:
//   The data directory contains similar files for the Tanach (t3utf.dat - in Hebrew),

It doesn't actually count any Hebrew words, and the same happens with Cyrillic. Only numbers and markup tokens such as sup/font are counted.

deanwampler · 2015-02-06T16:25:39Z

I've noticed that, actually. I need to investigate why. I'm not very good
with character encoding issues ;)

abatyuk · 2015-02-06T17:03:26Z

I'll see what I can do over the weekend. I have a few ideas for how to investigate.

deanwampler · 2015-02-06T18:03:16Z

I did a little reading and the issue is probably the underlying Hadoop API. SparkContext.textFile uses the Hadoop Text type, a subtype of Writable. Text is only designed for UTF-8. I believe Hebrew and Cyrillic require UTF-16, unless I'm mistaken.
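
A note on the encoding hypothesis: UTF-8 can represent every Unicode code point, Hebrew and Cyrillic included, so Text being UTF-8-only would not by itself explain the missing words. Another candidate worth checking is the tokenizer. This is only an assumption, since the thread doesn't show how WordCount2 splits lines, but if it splits on `\W+`, then Java's regex engine treats `\W` as ASCII-only by default and every non-Latin letter becomes a separator. The sketch below uses only plain JDK calls; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the workshop code.
public class UnicodeSplitDemo {
    // Default behavior: \W means "anything outside [a-zA-Z_0-9]",
    // so Hebrew and Cyrillic letters are treated as separators.
    static final Pattern ASCII_NON_WORD = Pattern.compile("\\W+");

    // With UNICODE_CHARACTER_CLASS, \W follows the Unicode notion of a word character.
    static final Pattern UNICODE_NON_WORD =
        Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    // Counts the non-empty tokens left after splitting on the given separator.
    static long countTokens(Pattern separator, String line) {
        return Arrays.stream(separator.split(line))
                     .filter(token -> !token.isEmpty())
                     .count();
    }

    // Encodes to UTF-8 bytes and decodes again; true if nothing was lost.
    static boolean roundTripsThroughUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of the compiler's encoding.
        String hebrew = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD";   // two Hebrew words
        String russian = "\u043F\u0440\u0438\u0432\u0435\u0442 \u043C\u0438\u0440"; // two Russian words
        String mixed = "1 \u05D1\u05E8\u05D0\u05E9\u05D9\u05EA font";          // digit, Hebrew word, ASCII word

        for (String line : new String[] {hebrew, russian, mixed}) {
            System.out.println("UTF-8 round trip ok: " + roundTripsThroughUtf8(line)
                + ", ASCII \\W+ tokens: " + countTokens(ASCII_NON_WORD, line)
                + ", Unicode \\W+ tokens: " + countTokens(UNICODE_NON_WORD, line));
        }
    }
}
```

If the tokenizer does turn out to split on `\W+`, prefixing the pattern with the embedded flag `(?U)` is the inline equivalent of `Pattern.UNICODE_CHARACTER_CLASS`, which would be a small change to try before digging into the Hadoop layer.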
