Cannot set locale:zh when using docsearch-scraper #12

Open
justin5267 opened this issue May 30, 2022 · 2 comments

Comments

@justin5267

I am using the docsearch scraper to index my website. To have Chinese text segmented automatically, I need to set locale: zh on the content field.

First, I tried adding locale: zh to the docsearch scraper config file, but it had no effect: every field in the resulting collection still has an empty locale.

{
  "index_name": "docs2",
  "start_urls": ["https://www.diglaws.com/"],
  "sitemap_urls": ["https://www.diglaws.com/sitemap.xml"],
  "selectors": {
    "lvl0": {
      "selector": "#article_title",
      "global": true
    },
    "lvl1": "#article_content h1",
    "lvl2": "#article_content h2",
    "lvl3": "#article_content h3",
    "lvl4": "#article_content h4",
    "lvl5": "#article_content h5",
    "lvl6": "#article_content h6",
    "text": {
      "selector": "#article_content p, #article_content li, #article_content blockquote",
      "locale": "zh"
    }
  }
}
>>> client.collections['docs2'].retrieve()
{'created_at': 1653898837, 'default_sorting_field': 'item_priority', 'fields': [{'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'anchor', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'content', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'url', 'optional': False, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'version', 'optional': True, 'sort': False, 'type': 'string[]'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl0', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl1', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl2', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl3', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl4', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl5', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl6', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': '.*_tag', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'language', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'tags', 'optional': True, 'sort': False, 'type': 'string[]'}, {'facet': False, 
'index': True, 'infix': False, 'locale': '', 'name': 'item_priority', 'optional': False, 'sort': True, 'type': 'int64'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'locale_tag', 'optional': True, 'sort': False, 'type': 'string'}], 'name': 'docs2_1653898837', 'num_documents': 54668, 'symbols_to_index': [], 'token_separators': []}

Then I tried adding a locale tag to the page metadata, but that did not work either.
<meta name="docsearch:locale_tag" content="zh" />

Finally, I tried to update the field's definition, but that is not supported:
Typesense currently does not support in-place updates to a field's definition once it is added to the schema.
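
Since the field cannot be changed in place, one workaround is to create a new collection whose schema is derived from the existing one. The following is a minimal sketch of that derivation only, not part of the scraper. The helper name `schema_with_locale` and the new collection name are hypothetical, and it assumes the dictionary returned by `retrieve()` can be reused for `collections.create` once server-side keys such as `created_at` and `num_documents` are dropped.

```python
import copy


def schema_with_locale(existing, new_name, locale, field_names):
    """Build a new collection schema from `existing` with `locale` set on chosen fields.

    `existing` is assumed to be the dict returned by
    client.collections[...].retrieve(). Only string-typed fields are changed.
    """
    fields = []
    for field in existing["fields"]:
        field = copy.deepcopy(field)  # leave the retrieved schema untouched
        if field["name"] in field_names and field["type"] in ("string", "string[]"):
            field["locale"] = locale
        fields.append(field)

    # Server-side keys (created_at, num_documents, ...) are deliberately not copied.
    schema = {"name": new_name, "fields": fields}
    if existing.get("default_sorting_field"):
        schema["default_sorting_field"] = existing["default_sorting_field"]
    return schema


if __name__ == "__main__":
    # Trimmed-down stand-in for the retrieve() output shown above.
    existing = {
        "name": "docs2_1653898837",
        "default_sorting_field": "item_priority",
        "num_documents": 54668,
        "fields": [
            {"name": "content", "type": "string", "optional": True, "locale": ""},
            {"name": "item_priority", "type": "int64", "optional": False, "locale": ""},
        ],
    }
    print(schema_with_locale(existing, "docs2_zh", "zh", {"content"}))
```

The resulting dict could then be passed to `client.collections.create(...)` and the documents re-imported into the new collection; that part is not shown because it depends on the server version and on how the documents were exported.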

I hope the docsearch scraper config file can offer a locale option, so that setting locale: zh on a specific selector, or globally, makes the scraper create the corresponding fields with that locale.
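
To make the request concrete, here is a rough sketch of how a global option might be turned into field definitions. It is only an illustration under assumptions: the top-level `locale` config key and the `build_fields` helper are hypothetical and do not exist in the scraper, and the field list is a trimmed subset of the collection shown above.

```python
HIERARCHY_LEVELS = 7  # hierarchy.lvl0 .. hierarchy.lvl6, as in the collection above


def build_fields(config):
    """Return Typesense field definitions, applying an optional global locale.

    `config` is the parsed scraper config; the "locale" key is a hypothetical
    option, not something the scraper currently reads.
    """
    locale = config.get("locale")

    def text_field(name, **extra):
        field = {"name": name, "type": "string", "optional": True, **extra}
        if locale:
            # Only text fields get a locale; numeric and URL fields are left alone.
            field["locale"] = locale
        return field

    fields = [text_field("content")]
    fields += [text_field(f"hierarchy.lvl{i}", facet=True) for i in range(HIERARCHY_LEVELS)]
    fields.append({"name": "url", "type": "string", "facet": True})
    fields.append({"name": "item_priority", "type": "int64"})
    return fields


if __name__ == "__main__":
    for field in build_fields({"index_name": "docs2", "locale": "zh"}):
        print(field)
```

A per-selector option would need a mapping from selectors to fields as well, which is left out here.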

@justin5267

In addition, I tried exporting the collection, creating a new collection with a manually defined schema, and importing the same JSONL file, but the import failed with the error shown after the code:

schema = {
    "name": "docs6",
    "fields": [
        {"name": ".*", "type": "auto", "locale": "zh"},
    ],
}
client.collections.create(schema)

with open('0530.jsonl') as jsonl_file:
    client.collections['docs6'].documents.import_(jsonl_file.read().encode('utf-8'), {'action': 'create'})

{"code":400,"document":"{\\"content\\":\\"敬请期待!\\",\\"content_camel\\":\\"敬请期待!\\",\\"hierarchy\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"hierarchy_camel\\":[{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null}],\\"hierarchy_radio\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"hierarchy_radio_camel\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"id\\":\\"4135\\",\\"item_priority\\":0,\\"no_variables\\":true,\\"objectID\\":\\"24f11103459d1ea33a3b2feac731300fb8973cc0\\",\\"tags\\":[],\\"type\\":\\"content\\",\\"url\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"url_without_anchor\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"url_without_variables\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"weight\\":{\\"level\\":0,\\"page_rank\\":0,\\"position\\":0}}","error":"Type of field `hierarchy_camel` is invalid.","success":false}'

@justin5267

I modified typesense_helper.py to add locale: zh to several fields, and Chinese characters are now segmented as expected.

self.typesense_client.collections.create({
    'name': self.collection_name_tmp,
    'fields': [
        {'name': 'anchor', 'type': 'string', 'optional': True},
        {'name': 'content', 'type': 'string', 'locale': 'zh', 'optional': True},
        {'name': 'url', 'type': 'string', 'facet': True},
        {'name': 'version', 'type': 'string[]', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl0', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl1', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl2', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl3', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl4', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl5', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl6', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': '.*_tag', 'type': 'string', 'facet': True, 'optional': True},
        {'name': 'language', 'type': 'string', 'facet': True, 'optional': True},
        {'name': 'tags', 'type': 'string[]', 'facet': True, 'optional': True},
        {'name': 'item_priority', 'type': 'int64'},
    ],
    'default_sorting_field': 'item_priority'
})

I am not sure the problem is fully solved, because the following error is displayed while the scraper runs, and I don't know whether it matters.

> DocSearch: http://www.diglaws.com/civil_procedure_law/A2-2.html 27 records)
2022-05-31 00:24:50 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.diglaws.com/civil_procedure_law/A2-2.html> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Justin\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "C:\Users\Justin\test_site\utility\typesense-docsearch-scraper-master\cli\..\scraper\src\documentation_spider.py", line 182, in parse_from_start_url
    return self.parse(response)
  File "C:\Users\Justin\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\__init__.py", line 70, in parse
    raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: DocumentationSpider.parse callback is not defined

@jasonbosco transferred this issue from typesense/typesense on May 30, 2022