Wikipedia extraction seems to be giving bigrams #3

Open
MikeHopcroft opened this issue Aug 27, 2016 · 1 comment

Comments

@MikeHopcroft
Contributor

Repro:

BitFunnel: 9e9e96ecb32841c53edc4542813ed1531fd4c4a9
Workbench: 580b74b

StatisticsBuilder c:\git\Wikipedia\Manifest100.txt c:\temp\wiki\out100 -statistics -text

The output shouldn't contain bigrams or capital letters:

Bigram where none expected (also capital letter):
72a2c4b53c781027,1,1,0.000144196,zephyrinus
bd01f0b68e57b2a7,1,1,0.000144196,sveshtari
3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry
50c9029d9d3c5378,1,1,0.000144196,darabont
a2f5153a7612c5d0,1,1,0.000144196,up─üsik─ü

3ca7b8a975b95d4d,1,1,0.000144196,crisplock

Capital letter
49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell

7d8b10a0a2b9f455,1,0,0.000144196,Evolutionarily stable strategy

Random garbage
b651bc4fddcd84af,1,1,0.000144196,86p
6cc733ca24bc18e,1,1,0.000144196,ಹರಿವೆ
5847567b67dc03cb,1,1,0.000144196,xis

a31bc33fc17f3fc6,1,1,0.000144196,लाख

b3d2c5a33dd1efc6,1,1,0.000144196,k├╢nigsberger
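
A quick way to triage a larger output file for the first two symptoms might look like the sketch below. It assumes each row has the same comma-separated shape as the rows above, with the term text as the fifth field; the function name `flag_suspect_terms` is hypothetical and not part of BitFunnel or Workbench. It does not try to detect the "random garbage" rows.

```python
def flag_suspect_terms(lines):
    """Return (term, reasons) pairs for rows whose term looks like a phrase
    or contains an uppercase character.

    Assumption: the term is the fifth comma-separated field of each row.
    """
    suspects = []
    for line in lines:
        # Split at most four times so a comma inside the term stays in the term.
        fields = line.strip().split(",", 4)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        term = fields[4]
        reasons = []
        if any(ch.isspace() for ch in term):
            reasons.append("whitespace")
        if any(ch.isupper() for ch in term):
            reasons.append("uppercase")
        if reasons:
            suspects.append((term, reasons))
    return suspects


sample = [
    "72a2c4b53c781027,1,1,0.000144196,zephyrinus",
    "3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry",
    "49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell",
]

for term, reasons in flag_suspect_terms(sample):
    print(term, reasons)
```

On the three sample rows, only the two multi-word, capitalized terms are flagged.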

@MikeHopcroft
Contributor Author

This is probably caused by the Lucene analyzer not being run in Workbench (see #6).
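
To illustrate why a skipped analysis step would produce exactly these symptoms, here is a minimal stand-in written in Python. It is not Lucene's analyzer and not Workbench code; `simple_analyze` is a hypothetical helper that only lowercases and splits on whitespace, which is enough to show how the raw phrases above would otherwise become separate lowercase terms.

```python
def simple_analyze(text):
    """Very rough stand-in for an analyzer: lowercase, then split on whitespace.

    Assumption: this only mimics the two effects relevant to this issue
    (case folding and tokenization); a real analyzer does more.
    """
    return text.lower().split()


print(simple_analyze("Algebraic geometry"))
print(simple_analyze("Alexander Graham Bell"))
```

If a step like this is skipped, a whole phrase such as `Alexander Graham Bell` would reach the statistics output as a single capitalized "term", which matches the rows reported above.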
