Multilingual text indexation [JIRA: RIAK-2439] #620

Open
Guibod opened this issue Mar 13, 2016 · 4 comments


Guibod commented Mar 13, 2016

Hi guys,

Is there a proper way to define language-specific stemming and tokenization on a single field in a single index?

I'm struggling to find a proper solution that is Riak compatible, but nothing seems clear to me.

Here is a copy of my Stack Overflow question:

I want to store multilingual text in Riak (English, French, and Spanish for illustration, but many more languages in practice), and I want to use Riak Search to handle grouping, stemming, and tokenizing of the text values.

In my Schema.yml I have:

<field name="text" type="string" indexed="true" stored="true" multiValued="false"/>

And:

<fieldType name="text_en" class="solr.TextField" />
<fieldType name="text_es" class="solr.TextField" />
<fieldType name="text_fr" class="solr.TextField" />

Each fieldType enables language-specific optimisation.
There is no DynamicFieldType in Solr, as stated in this other help request on Stack Overflow: http://stackoverflow.com/questions/23747373/solr-dynamic-field-types

As suggested there, I have three options:

  • Separate field per language: load into separate fields (not dynamic) that have appropriate tokenizers and filters per language
  • Separate index/core per language
  • Everything in one field, with custom code to manage it

Separate field

This would force me to store each value in a different field of my Riak document. That doesn't scale to 20 or more languages.

    <field name="text_en" type="text_en" indexed="true" stored="true" multiValued="false"/>
    <field name="text_es" type="text_es" indexed="true" stored="true" multiValued="false"/>
    <field name="text_fr" type="text_fr" indexed="true" stored="true" multiValued="false"/>
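
A minimal sketch of what the separate-field approach could look like on the application side, assuming each document carries a language code and the schema defines one `text_<lang>` field per language as above. The helper name `to_indexed_doc` and the `SUPPORTED_LANGS` set are hypothetical and not part of Riak or its Python client.

```python
# Hypothetical helper: route a text value to its language-specific field.
# Assumption: the schema defines one text_<lang> field for every entry here.
SUPPORTED_LANGS = {"en", "es", "fr"}


def to_indexed_doc(text, lang):
    """Return a dict whose single key is the language-specific field name."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError("no text_%s field defined in the schema" % lang)
    return {"text_%s" % lang: text}


doc = to_indexed_doc("Bonjour le monde", "fr")
print(doc)
```

With this layout each stored document populates only one of the language fields, so supporting another language would mean adding one fieldType/field pair to the schema and one entry to the set.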

Separate indexes

That's pretty simple: I can configure my Solr index for a given language and keep only one field. It's an interesting solution since it gives me per-language sharding, which is pretty convenient for maintenance.

BUT that implies I can no longer search across multiple languages, since I can't find a multi-index search feature in my Python library or in the documentation.

Custom code

I don't really understand this option; most probably it means writing my own Java class to handle my case. That's clearly NOT my preference.

Is there another way around this problem?

Basho-JIRA changed the title from "Multilingual text indexation" to "Multilingual text indexation [JIRA: RIAK-2439]" on Mar 13, 2016

@zeeshanlakhani
Contributor

I'd suggest taking the separate-field approach, @Guibod, but making sure to distribute that search index across various types/buckets. Did you try that and run into a bottleneck after 20 langs? In indexing or in querying? I'm thinking of something like http://pavelbogomolenko.github.io/multi-language-handling-in-solr.html, and we can discuss how best to tune your configuration to your needs/expectations.


Guibod commented Mar 15, 2016

Thanks @zeeshanlakhani, my only issue with sharding per language is that I don't know how to search across multiple indexes for the time being. I'm pretty new to Solr and rely a lot on the Python library at the moment.
I can pretty easily store data in separate buckets/bucket types/indexes, and each of them can be fine-tuned with a proper analyzer. But I don't know how to query across multiple indexes.

See:
http://basho.github.io/riak-python-client/query.html#querying-an-index

# Python API explicitly requires ONE index
results = bucket.search("counter:[10 TO *]", index='website',
                        sort="counter desc", rows=5)

Should I use map/reduce on the search results? If so, how can I do that?
Should I extend the current API with some Solr magic trick, such as a multi-index query?

@zeeshanlakhani
Contributor

@Guibod you can't do multi-index search w/ riak search, but I was wondering what your bottlenecks would look like using one search_index/core and creating a bucket-type per lang (associating each bucket-type w/ that one search_index).
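
As a rough illustration of the single-index idea: if every language-specific field lives in the same search index, one query string can check all of them at once. The sketch below only builds that string; the field names are taken from the schema discussed earlier in the thread, and the helper name `multilingual_query` is hypothetical.

```python
# Hypothetical helper: build one Solr query that checks every language field.
# Assumption: these field names match the schema of the single search index.
LANG_FIELDS = ["text_en", "text_es", "text_fr"]


def multilingual_query(term, fields=LANG_FIELDS):
    """OR the same quoted term across all language-specific fields."""
    # Escape backslashes and double quotes so the term stays one quoted phrase.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return " OR ".join('%s:"%s"' % (field, escaped) for field in fields)


print(multilingual_query("maison"))
```

The resulting string could be passed as the query of a single-index search such as the `bucket.search(...)` call quoted above, with each field presumably analyzed by its own language-specific chain.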


Guibod commented Mar 16, 2016

The main issue is that I would be stuck with monolingual search.
I want to set up proper indexing per language (using string_en, string_fr), and then allow multilingual search.

Most of the time I will aggregate data for data visualisation, and I can map/reduce results from Riak into a proper aggregation by my own means.
But in some cases, I'll need to show ordered content in detail. This is going to be really painful to code in my API; I'd gladly rely on cross-index search rather than searching individual indexes and sorting the results myself.
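
For reference, the "search each index and sort the results myself" fallback could be sketched roughly as follows. This is only an illustration under stated assumptions: `search_fn` is a hypothetical callable wrapping whatever client call is used, it is assumed to return a list of dicts with a numeric "score" key, and `fake_search` is stand-in data rather than real search output.

```python
import heapq


def search_all(search_fn, query, indexes, rows=5):
    """Query each index separately and return the top `rows` hits overall.

    `search_fn(query, index, rows)` is a hypothetical callable that returns
    a list of dicts, each with a numeric "score" key.
    """
    hits = []
    for index in indexes:
        for doc in search_fn(query, index, rows):
            # Remember which index produced the hit.
            hits.append(dict(doc, index=index))
    # Caveat: relevance scores from different indexes are not strictly comparable.
    return heapq.nlargest(rows, hits, key=lambda d: d["score"])


def fake_search(query, index, rows):
    # Stand-in data for illustration only, not real search output.
    data = {
        "docs_en": [{"id": "a", "score": 2.0}, {"id": "b", "score": 0.5}],
        "docs_fr": [{"id": "c", "score": 1.5}],
    }
    return data[index][:rows]


top = search_all(fake_search, "text:maison", ["docs_en", "docs_fr"], rows=2)
print([d["id"] for d in top])
```

Merging on a stored field such as a counter or a date, instead of the relevance score, would avoid the comparability caveat, at the cost of one query per index.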
