Merge pull request #1996 from dadoonet/semantic-search

Add support for Semantic search

dadoonet authored Jan 17, 2025
2 parents 08254a4 + e09e36d commit 543b2dd
Showing 27 changed files with 573 additions and 69 deletions.
9 changes: 4 additions & 5 deletions contrib/docker-compose-example-elasticsearch/.env
@@ -26,12 +26,11 @@ ES_PORT=9200
 # Port to expose Kibana to the host
 KIBANA_PORT=5601
 
-# Enterprise Search settings
-ENTERPRISE_SEARCH_PORT=3002
-ENCRYPTION_KEYS=q3t6w9z$C&F)J@McQfTjWnZr4u7x!A%D
-
 # Increase or decrease based on the available host memory (in bytes)
-MEM_LIMIT=1073741824
+# When using basic, that should be enough as we don't run ML jobs
+# MEM_LIMIT=1073741824
+# When using trial, you need 4gb to be able to run inference with Elasticsearch
+MEM_LIMIT=4294967296
 
 # Project namespace (defaults to the current folder name if not set)
 COMPOSE_PROJECT_NAME=fscrawler
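
Since the compose file consumes this value via ``mem_limit: ${MEM_LIMIT}``, the larger limit takes effect the next time the stack is started. A minimal sketch of standard Docker Compose usage (the ``.env`` file is read automatically from the working directory):

    # run from the directory containing docker-compose.yml and .env
    docker compose up -d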
@@ -110,7 +110,6 @@ services:
       - ELASTICSEARCH_USERNAME=kibana_system
       - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
       - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
-      - ENTERPRISESEARCH_HOST=http://enterprisesearch:${ENTERPRISE_SEARCH_PORT}
     mem_limit: ${MEM_LIMIT}
     healthcheck:
       test:
5 changes: 4 additions & 1 deletion contrib/docker-compose-it/.env
@@ -23,7 +23,10 @@ ES_PORT=9200
 KIBANA_PORT=5601
 
 # Increase or decrease based on the available host memory (in bytes)
-MEM_LIMIT=1073741824
+# When using basic, that should be enough as we don't run ML jobs
+# MEM_LIMIT=1073741824
+# When using trial, you need 4gb to be able to run inference with Elasticsearch
+MEM_LIMIT=4294967296
 
 # Project namespace (defaults to the current folder name if not set)
 COMPOSE_PROJECT_NAME=fscrawler
@@ -26,12 +26,11 @@ ES_PORT=9200
 # Port to expose Kibana to the host
 KIBANA_PORT=5601
 
-# Enterprise Search settings
-ENTERPRISE_SEARCH_PORT=3002
-ENCRYPTION_KEYS=q3t6w9z$C&F)J@McQfTjWnZr4u7x!A%D
-
 # Increase or decrease based on the available host memory (in bytes)
-MEM_LIMIT=1073741824
+# When using basic, that should be enough as we don't run ML jobs
+# MEM_LIMIT=1073741824
+# When using trial, you need 4gb to be able to run inference with Elasticsearch
+MEM_LIMIT=4294967296
 
 # Project namespace (defaults to the current folder name if not set)
 COMPOSE_PROJECT_NAME=fscrawler
5 changes: 4 additions & 1 deletion contrib/src/main/resources/docker-compose-it/.env
@@ -23,7 +23,10 @@ ES_PORT=9200
 KIBANA_PORT=5601
 
 # Increase or decrease based on the available host memory (in bytes)
-MEM_LIMIT=1073741824
+# When using basic, that should be enough as we don't run ML jobs
+# MEM_LIMIT=1073741824
+# When using trial, you need 4gb to be able to run inference with Elasticsearch
+MEM_LIMIT=4294967296
 
 # Project namespace (defaults to the current folder name if not set)
 COMPOSE_PROJECT_NAME=fscrawler
88 changes: 85 additions & 3 deletions docs/source/admin/fs/elasticsearch.rst
@@ -24,6 +24,8 @@ Here is a list of Elasticsearch settings (under ``elasticsearch.`` prefix):
 +-----------------------------------+---------------------------+---------------------------------+
 | ``elasticsearch.pipeline``        | ``null``                  | :ref:`ingest_node`              |
 +-----------------------------------+---------------------------+---------------------------------+
+| ``elasticsearch.semantic_search`` | ``true``                  | :ref:`semantic_search`          |
++-----------------------------------+---------------------------+---------------------------------+
 | ``elasticsearch.nodes``           | ``https://127.0.0.1:9200``| `Node settings`_                |
 +-----------------------------------+---------------------------+---------------------------------+
 | ``elasticsearch.path_prefix``     | ``null``                  | `Path prefix`_                  |
@@ -87,7 +89,10 @@ to define the index settings and mappings:
   and the mapping for the ``path`` field.
 
 - ``fscrawler_mapping_attachment``: defines the mapping for the ``attachment`` field.
-- ``fscrawler_mapping_content``: defines the mapping for the ``content`` field.
+- ``fscrawler_mapping_content_semantic``: defines the mapping for the ``content`` field when using semantic search.
+  It also creates a ``semantic_text`` field named ``content_semantic``. Please read the :ref:`semantic_search` section.
+
+- ``fscrawler_mapping_content``: defines the mapping for the ``content`` field when semantic search is not available.
 - ``fscrawler_mapping_meta``: defines the mapping for the ``meta`` field.
 
 You can see the content of those templates by running:
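
For instance, a minimal sketch using the standard Elasticsearch component template API (the exact command in the collapsed part of the docs may differ):

    GET _component_template/fscrawler*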
@@ -117,6 +122,29 @@ If you want to define your own index settings and mapping to set
 analyzers for example, you can update the needed component template
 **before starting the FSCrawler**.
 
+The following example uses a ``french`` analyzer to index the
+``content`` field and still allow using semantic search.
+
+.. code:: json
+
+    PUT _component_template/fscrawler_mapping_content_semantic
+    {
+      "template": {
+        "mappings": {
+          "properties": {
+            "content": {
+              "type": "text",
+              "analyzer": "french",
+              "copy_to": "content_semantic"
+            },
+            "content_semantic": {
+              "type": "semantic_text"
+            }
+          }
+        }
+      }
+    }
+
 The following example uses a ``french`` analyzer to index the
 ``content`` field.
 
Expand Down Expand Up @@ -148,6 +176,58 @@ You might to try `elasticsearch Reindex
API <https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html>`__
though.

+.. _semantic_search:
+
+Semantic search
+"""""""""""""""
+
+.. versionadded:: 2.10
+
+FSCrawler can use `semantic search <https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-search.html>`__
+to improve the search results.
+
+.. note::
+
+    Semantic search is available starting from Elasticsearch 8.17.0 and requires a trial or enterprise license.
+
+Semantic search is enabled by default when Elasticsearch 8.17.0 or above and a trial or enterprise license are
+detected, but you can disable it by setting ``semantic_search`` to ``false``:
+
+.. code:: yaml
+
+    name: "test"
+    elasticsearch:
+      semantic_search: false
+
+When activated, the ``content`` field is indexed as usual but a new field named ``content_semantic``
+is created and uses the `semantic_text <https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-text.html>`__
+field type. This field type is used to store the semantic information extracted from the content by using the defined
+inference API (defaults to the `ELSER model <https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html>`__).
+
+You can change the model to use by changing the component template. For example, a recommended model when your content
+is not only in English is the Elastic `multilingual-e5-small <https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-multilingual-e5-small.html>`__:
+
+.. code:: json
+
+    PUT _component_template/fscrawler_mapping_content_semantic
+    {
+      "template": {
+        "mappings": {
+          "properties": {
+            "content": {
+              "type": "text",
+              "copy_to": "content_semantic"
+            },
+            "content_semantic": {
+              "type": "semantic_text",
+              "inference_id": ".multilingual-e5-small-elasticsearch"
+            }
+          }
+        }
+      }
+    }
+
 Bulk settings
 ^^^^^^^^^^^^^
 
@@ -330,8 +410,7 @@ Then you can use the encoded API Key in FSCrawler settings:
 
 Basic Authentication (deprecated)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The best practice is to use `API Key`_ or `Access Token`_. But if you have no other choice,
-you can still use Basic Authentication.
+The best practice is to use `API Key`_. But if you have no other choice, you can still use Basic Authentication.
 
 You can provide the ``username`` and ``password`` to FSCrawler:
 
@@ -465,6 +544,9 @@ FSCrawler may create the following fields depending on configuration and availab
 +============================+========================================+==============================================+=====================================================================+
 | ``content``                | Extracted content                      | ``"This is my text!"``                       |                                                                     |
 +----------------------------+----------------------------------------+----------------------------------------------+---------------------------------------------------------------------+
+| ``content_semantic``       | Semantic version of the extracted      | Semantic representation                      |                                                                     |
+|                            | content                                |                                              |                                                                     |
++----------------------------+----------------------------------------+----------------------------------------------+---------------------------------------------------------------------+
 | ``attachment``             | BASE64 encoded binary file             | BASE64 Encoded document                      |                                                                     |
 |                            |                                        |                                              |                                                                     |
 +----------------------------+----------------------------------------+----------------------------------------------+---------------------------------------------------------------------+
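
To illustrate how the new ``content_semantic`` field can be queried once documents are indexed, here is a minimal search sketch (assuming Elasticsearch 8.17+ and its ``semantic`` query; the index name ``test_docs`` is only an example, and this request is not part of the commit itself):

    GET test_docs/_search
    {
      "query": {
        "semantic": {
          "field": "content_semantic",
          "query": "how do I configure the crawler?"
        }
      }
    }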
6 changes: 5 additions & 1 deletion docs/source/admin/fs/index.rst
@@ -104,8 +104,12 @@ The job file (``~/.fscrawler/test/_settings.yaml``) for the job name ``test`` mu
   index: "test_docs"
   # optional, defaults to "test_folders", used when es.index_folders is set to true
   index_folder: "test_fold"
+  # optional, defaults to "true"
+  push_templates: "true"
+  # optional, defaults to "true", used with Elasticsearch 8.17+ with a trial or enterprise license
+  semantic_search: "true"
+# only used when started with --rest option
 rest:
-  # only is started with --rest option
   url: "http://127.0.0.1:8080/fscrawler"
 
 Here is a list of existing top level settings:
13 changes: 12 additions & 1 deletion docs/source/dev/build.rst
@@ -89,7 +89,7 @@ To run the test suite against an elasticsearch instance running locally, just ru
 
 .. hint::
 
-    If you are using a secured instance, use ``tests.cluster.user``, ``tests.cluster.apiKey``::
+    If you are using a secured instance, use ``tests.cluster.apiKey``::
 
         mvn verify -pl fr.pilato.elasticsearch.crawler:fscrawler-it \
        -Dtests.cluster.apiKey=APIKEYHERE \
@@ -102,6 +102,17 @@
        -Dtests.cluster.pass=changeme \
        -Dtests.cluster.url=https://127.0.0.1:9200 \
 
+    If the cluster is using a self-generated SSL certificate, you can bypass checking the certificate by using
+    ``tests.cluster.check_ssl``::
+
+        mvn verify -pl fr.pilato.elasticsearch.crawler:fscrawler-it \
+       -Dtests.cluster.apiKey=APIKEYHERE \
+       -Dtests.cluster.url=https://127.0.0.1:9200 \
+       -Dtests.cluster.check_ssl=false
+
+    Note that, by default, the integration tests will try both options: first checking the SSL certificate,
+    and then ignoring it.
+
 .. hint::
 
     To run tests against another instance (i.e. running on
14 changes: 8 additions & 6 deletions docs/source/release/2.10.rst
@@ -4,16 +4,18 @@ Version 2.10
 New
 ---
 
+* Add support for automatic semantic search when using an 8.17+ version with a trial or enterprise
+  license. See :ref:`semantic_search`. Thanks to dadoonet.
 * Using the REST API ``_document``, you can now fetch a document from the local dir, from an http website
-  or from an S3 bucket. Thanks to dadoonet.
-* You can now remove a document in Elasticsearch using FSCrawler ``_document`` endpoint. Thanks to dadoonet.
+  or from an S3 bucket. See :ref:`rest-service`. Thanks to dadoonet.
+* You can now remove a document in Elasticsearch using FSCrawler ``_document`` endpoint. See :ref:`rest-service`. Thanks to dadoonet.
 * Implement our own HTTP Client for Elasticsearch. Thanks to dadoonet.
-* Add option to set path to custom tika config file. Thanks to iadcode.
-* Support for Index Templates. Thanks to dadoonet.
+* Add option to set path to custom tika config file. See :ref:`local-fs-settings`. Thanks to iadcode.
+* Support for Index Templates. See :ref:`mappings`. Thanks to dadoonet.
 * Support for Aliases. You can now index to an alias. Thanks to dadoonet.
-* Support for Access Token and Api Keys instead of Basic Authentication. Thanks to dadoonet.
+* Support for Access Token and Api Keys instead of Basic Authentication. See :ref:`credentials`. Thanks to dadoonet.
 * Allow loading external jars. This adds a new ``external`` directory from where jars can be loaded
-  to the FSCrawler JVM. For example, you could provide your own Custom Tika Parser code. Thanks to dadoonet.
+  to the FSCrawler JVM. For example, you could provide your own Custom Tika Parser code. See :ref:`layout`. Thanks to dadoonet.
 * Add temporal information in folder index. Thanks to bdauvissat
 
 Fix
@@ -0,0 +1,33 @@
+/*
+ * Licensed to David Pilato under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package fr.pilato.elasticsearch.crawler.fs.client;
+
+public class ESSemanticQuery extends ESQuery {
+    private final String value;
+
+    public ESSemanticQuery(String field, String value) {
+        super(field);
+        this.value = value;
+    }
+
+    public String getValue() {
+        return value;
+    }
+}
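
For context, a hypothetical usage sketch of the new query class (the field name mirrors the ``content_semantic`` field created by the templates above; the surrounding query-builder API is assumed, not shown in this commit):

    // Build a semantic query targeting the content_semantic field
    ESSemanticQuery query = new ESSemanticQuery("content_semantic", "how do I configure the crawler?");
    // The value is the free-text input sent to the semantic_text field at search time
    String searchText = query.getValue();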