Multilingual text indexation [JIRA: RIAK-2439] #620

Open
@Guibod

Hi guys,

Is there a proper way to define an index with language-specific stemming and tokenization on a single field in a single index?

I'm struggling to find a solution that is compatible with Riak, and nothing I have found so far is clear to me.

Here is a copy of my Stack Overflow question:

I want to store multilingual text (English, French, and Spanish for illustration, but there are many more languages) in Riak, and I want to use Riak Search to help with grouping, stemming, and tokenizing the text values.

In my Schema.yml I have:

<field name="text" type="string" indexed="true" stored="true" multiValued="false"/>

And:

<fieldType name="text_en" class="solr.TextField" />
<fieldType name="text_es" class="solr.TextField" />
<fieldType name="text_fr" class="solr.TextField" />

Each fieldType enables language-specific optimisation.
There is no DynamicFieldType in Solr, as stated in this other Stack Overflow question: http://stackoverflow.com/questions/23747373/solr-dynamic-field-types

As suggested there, I have three options:

  • Separate field per language: load into separate (non-dynamic) fields that have appropriate tokenizers and filters per language
  • Separate index/core per language
  • Everything in one field, with custom code to manage it

Separate field

This would force me to store each value in a different field of my Riak document, depending on its language. That doesn't scale to 20 or more languages.

    <field name="text_en" type="text_en" indexed="true" stored="true" multiValued="false"/>
    <field name="text_es" type="text_es" indexed="true" stored="true" multiValued="false"/>
    <field name="text_fr" type="text_fr" indexed="true" stored="true" multiValued="false"/>
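
To make the cost of this option concrete, here is a minimal Python sketch of what the application side might look like. The helpers `build_doc` and `build_query` and the `LANGUAGES` tuple are hypothetical names of mine, not part of Riak or its client; the sketch only assumes one `text_<lang>` field per language as in the schema lines above.

```python
# Hypothetical helpers: route text to a per-language field and query all of them.
LANGUAGES = ("en", "es", "fr")  # assumed to mirror the text_<lang> schema fields


def build_doc(text, lang):
    """Return a document whose only text field is the one for `lang`."""
    if lang not in LANGUAGES:
        raise ValueError("unsupported language: %r" % lang)
    return {"text_%s" % lang: text}


def build_query(term, languages=LANGUAGES):
    """Build a query string that ORs the same term across every language field.

    The term is inserted as-is; a real implementation would need to escape it.
    """
    return " OR ".join("text_%s:%s" % (lang, term) for lang in languages)


doc = build_doc("bonjour le monde", "fr")
query = build_query("monde")
```

Every new language means one more schema field, one more entry in `LANGUAGES`, and one more clause in each query, which is the scaling concern described above.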

Separate indexes

That's pretty simple: I can configure each Solr index for a given language and keep only one field. It's an interesting solution since it would give me per-language sharding, which is pretty convenient for maintenance.

BUT that implies that I can no longer search across multiple languages, since I can't find a multi-index search feature in my Python library or in the documentation.
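
One possible workaround would be to fan the query out from the client and merge the results myself. The sketch below is only an illustration: `search_all` and `fake_search` are hypothetical names, the index names are made up, and I am assuming the search callable returns a dict with a `docs` list whose entries carry a `score` value (with the Riak Python client that callable would presumably be `client.fulltext_search`). Scores computed by separate indexes are not strictly comparable, so the merged ordering is approximate at best.

```python
def search_all(search, indexes, query, rows=10):
    """Run `query` against every index and merge hits by descending score.

    `search` is any callable (index, query) -> {"docs": [...]}.
    """
    merged = []
    for index in indexes:
        for hit in search(index, query).get("docs", []):
            hit = dict(hit)        # copy so the original result is not mutated
            hit["_index"] = index  # remember which index the hit came from
            merged.append(hit)
    # Scores may arrive as strings, so convert before sorting.
    merged.sort(key=lambda h: float(h.get("score", 0)), reverse=True)
    return merged[:rows]


# Stand-in for a real client so the sketch runs without a Riak node.
def fake_search(index, query):
    canned = {
        "docs_en": {"docs": [{"_yz_rk": "a", "score": "1.2"}]},
        "docs_fr": {"docs": [{"_yz_rk": "b", "score": "2.5"}]},
    }
    return canned.get(index, {"docs": []})


results = search_all(fake_search, ["docs_en", "docs_fr", "docs_es"], "text:monde")
```

This costs one round trip per language index and makes pagination awkward, so it is not a real substitute for a built-in multi-index search.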

Custom code

I don't really understand this one; most probably it means writing my own Java class to handle my case. That's clearly NOT my preference.

Is there another way around this problem?
