WordCount2 - doesn't work with non-Ascii characters #11

Open
abatyuk opened this issue Feb 6, 2015 · 3 comments

@abatyuk
Contributor

abatyuk commented Feb 6, 2015

// Exercise: Use other versions of the Bible:
//   The data directory contains similar files for the Tanach (t3utf.dat - in Hebrew),

It doesn't actually count any Hebrew words, and the same happens with Cyrillic: it only counted numbers and tokens such as sup, font, etc.

@deanwampler
Owner

I've noticed that, actually. I need to investigate why. I'm not very good
with character encoding issues ;)

Dean Wampler, Ph.D.
Typesafe
"Functional Programming for Java Developers",
"Programming Scala", and
"Programming Hive" - all from O'Reilly
twitter: @deanwampler, @chicagoscala
http://typesafe.com
http://polyglotprogramming.com

@abatyuk
Contributor Author

abatyuk commented Feb 6, 2015

I'll see what I can do over the weekend. I have a few ideas on how to investigate.

@deanwampler
Owner

I did a little reading, and the issue may lie in the underlying Hadoop API. SparkContext.textFile uses the Hadoop Text type, a subtype of Writable, and Text is designed only for UTF-8. UTF-8 can encode Hebrew and Cyrillic, though, so this would explain the problem only if the data files are actually stored in another encoding, such as UTF-16. I may be mistaken.
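
One way to test the encoding hypothesis outside Spark is a small JVM-only check. The sketch below is not taken from WordCount2. The object name, the sample line, and above all the assumption that words are split with the regex `\W+` are illustrative guesses, since the thread does not show the tokenizing code. It checks whether Hebrew and Cyrillic text survives a UTF-8 round trip, then compares the JVM's default (ASCII-only) `\W+` with the same pattern compiled under `Pattern.UNICODE_CHARACTER_CLASS`.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.regex.Pattern

// Hypothetical helper for investigating the report; not part of the workshop code.
object Utf8WordSplitCheck {
  // Sample line written with Unicode escapes so the source file encoding does not matter:
  // a Cyrillic word, a Hebrew word, an ASCII word, and a number.
  val sample = "\u0441\u043b\u043e\u0432\u043e \u05e9\u05dc\u05d5\u05dd word 42"

  // Encode to UTF-8 bytes and decode again; true if no characters were lost.
  def roundTripsThroughUtf8(s: String): Boolean =
    new String(s.getBytes(UTF_8), UTF_8) == s

  // ASSUMPTION: the word count splits on \W+. By default the JVM treats only
  // [a-zA-Z_0-9] as word characters, so other scripts act as separators.
  def asciiSplit(s: String): Seq[String] =
    s.split("""\W+""").filter(_.nonEmpty).toSeq

  // Same pattern, but with Unicode-aware character classes enabled.
  private val unicodeNonWord =
    Pattern.compile("""\W+""", Pattern.UNICODE_CHARACTER_CLASS)

  def unicodeSplit(s: String): Seq[String] =
    unicodeNonWord.split(s).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    println(roundTripsThroughUtf8(sample))    // true
    println(asciiSplit(sample).mkString(", "))   // word, 42
    println(unicodeSplit(sample).mkString(", ")) // all four tokens
  }
}
```

If the round trip succeeds but the ASCII split keeps only Latin words and numbers, that matches the symptom reported above, and the tokenizing pattern would be worth checking alongside the file encoding.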
