Commit 977321f

committed
Elasticsearch notes
1 parent 9eccb24 commit 977321f

1 file changed
+120 -0 lines changed
@@ -0,0 +1,120 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Planning your approach to search, and Elasticsearch, before you code\n",
"\n",
"I am a data scientist, more familiar with using code as a means to perform analysis than with the requirements of full-stack app development. If I need to query my database, I write a query. Users of my analysis can't be expected to roll their own queries, and neither should they be permitted more than a highly abstracted view of the database.\n",
"\n",
"That means they need a search engine. Ordinarily this is a black box for me: I simply spin up a Django app, plug in [Django Haystack](http://haystacksearch.org/), and off we go.\n",
"\n",
"However, on my recent deployment of [Sqwyre.com](https://sqwyre.com) I wasn't paying attention and deployed the latest versions of Ubuntu (18.04) and [Elasticsearch](https://www.elastic.co/) (6.5). That doesn't work with Haystack, which only supports Elasticsearch up to version 2.\n",
"\n",
"I faced the choice of downgrading my app to accommodate Haystack, or taking the opportunity to learn a little about search engines, Elasticsearch in particular, and take advantage of some of the new features that have been introduced.\n",
"\n",
"The learning curve is steep. Search has its own vernacular, its own assumed domain knowledge, and a deep and talented community of professionals. It's a bit like joining an experienced free climbing club as a non-climber. Everyone is friendly, but no-one can even conceptualise that you can't figure out how to get a foot off the ground, let alone dangle hundreds of metres in the air from a single hand-hold.\n",
"\n",
"This Notebook summarises what I learned, and how I would go about thinking through and planning the components I would need to implement a search user experience.\n",
"\n",
"## Who this is for\n",
"\n",
"This is for a person new to search, but not necessarily new to coding. Search may be primary to the app you are developing, or it may be a way to search an archive and only secondary to your app experience.\n",
"\n",
"You are unlikely to work with search regularly, and you may need to implement it only once while you focus on the parts of your app that are more important to you, but not necessarily as obvious to your users as search.\n",
"\n",
"This Notebook provides only a few illustrative code sketches, since the available documentation is usually very good in that regard. What it does do is provide a framework to understand what the documentation is offering, and what the different approaches and methods do when supporting search.\n",
"\n",
"## Development stack\n",
"\n",
"I chose [django-elasticsearch-dsl-drf](https://django-elasticsearch-dsl-drf.readthedocs.io/) since it supports my development stack (Django with Django Rest Framework).\n",
"\n",
"Where Haystack abstracts most of the search complexity from you, django-elasticsearch-dsl-drf offers much greater flexibility and nuance. That comes at a cost: you need to know more about what you're doing.\n",
"\n",
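"To make that concrete, here is a minimal sketch of what registering a model for indexing looks like with django-elasticsearch-dsl (the indexing half of this stack). The `Article` model and its fields are hypothetical, and this uses the `Document` / `registry.register_document` API of more recent releases (older releases used `DocType`), so treat it as illustrative rather than definitive:\n",
"\n",
"```python\n",
"# documents.py - a minimal, hypothetical index definition\n",
"from django_elasticsearch_dsl import Document, fields\n",
"from django_elasticsearch_dsl.registries import registry\n",
"\n",
"from .models import Article  # hypothetical Django model\n",
"\n",
"\n",
"@registry.register_document\n",
"class ArticleDocument(Document):\n",
"    # an explicitly declared, search-specific field\n",
"    topic = fields.KeywordField()\n",
"\n",
"    class Index:\n",
"        # the Elasticsearch index these documents are written to\n",
"        name = 'articles'\n",
"\n",
"    class Django:\n",
"        model = Article\n",
"        # model fields copied directly into the index\n",
"        fields = ['title', 'content', 'published']\n",
"```\n",
"\n",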
"## Search user experience requirements\n",
"\n",
"In 2019, developers of user-facing apps have the advantage of a settled approach to providing a search experience, but also the challenge that this experience is led and dominated by Google.\n",
"\n",
"At the outset, users will expect any text box labeled `search` to behave exactly as Google does:\n",
"\n",
" - Pre-emptive / auto suggestions as you type;\n",
" - Intelligent recommendations based on context (i.e. both the terms entered, and any previous terms, act to influence both suggestions and results);\n",
" - Results prioritised based on intelligent analysis of terms, context, and previous searches;\n",
" - Individual results containing sufficient information to permit further filtering, or selecting a result;\n",
" - A meaningful sample of the most likely result/s extracted and presented in some accessible form at the top of the results;\n",
" - All of this performed all but instantaneously.\n",
"\n",
"How you go about doing this will depend on the content you wish to index, and the terms you wish to search on from that content.\n",
49+
"\n",
50+
"There is also likely to be one major difference between your user-base and that of Google's (and assumption I can make, since you're reading this Notebook): you don't have millions of users performing billions of searches per day.\n",
51+
"\n",
52+
"That has a very specific implication.\n",
53+
"\n",
54+
"A search engine can be thought of as a type of NoSQL database as compared to a structured SQL database. It is, effectively, a set of edge-directed documents. The objective is to support very fast querying and serindipitous connections between nodes in the database.\n",
55+
"\n",
56+
"The weights on the edges strengthen or weaken relationships.\n",
57+
"\n",
58+
"For Google, and other mainstream search engines indexing billions of documents with billions of searches, the context mined from the search, and the way users interact with search, can act to inform the database by rebalancing nodes.\n",
59+
"\n",
60+
"A common spelling error can be easily connected to what users actually choose, and - since the term is actually different - you can ask the user 'you said x, did you mean y?'. The engine doesn't need to know anything about spelling, grammar, or even language. It simply needs to collate what users type, and what they click, and score everything accordingly.\n",
61+
"\n",
62+
"Your initial efforts are unlikely to yield such rich fruit. You are also likely to be offering a more specialist search corpus, meaning that each individual result may be read infrequently.\n",
63+
"\n",
64+
"You aren't going to get your users to help you fix and improve relationships in your database for free. You're going to have to do a lot of up-front work thinking through likely search pathways, and ways in which you can improve the experience for users.\n",
65+
"\n",
"## Some definitions\n",
"\n",
"I will do my best to keep this generic, but my experience is with a PostgreSQL database, Django, and Elasticsearch. This should hold for all applications based on SQL databases and Lucene-based search engines (including Elasticsearch and Apache Solr); however, your mileage may vary.\n",
"\n",
"You may think of your data as individual records, or models, in a structured database. To a search engine, each record is a document.\n",
"\n",
" - **document:** an individual database record to be included in the search index;\n",
" - **index:** a store of documents, each forming a node in a database; the relations between nodes are defined by the edges linking them (usually based on similarity, or relational proximity);\n",
" - **suggester:** suggestions are offered by a suggester; this is what you usually think of as auto suggestions offered as you type;\n",
" - **fuzzy matching:** users misspell terms, they include common words (like prepositions), they make mistakes ... all of these things confuse computers, which need precision; search needs to accommodate this with some mechanism for tolerating 'fuzziness' and still returning meaningful results;\n",
" - **edge ngram:** an ngram is a method for tokenising text; for example, the phrase `catch the monkey` can be tokenised into a series of 3-character grams as: `cat`, `atc`, `tch`, `ch `, `h t`, etc.; an edge ngram is anchored to the start of each token (`c`, `ca`, `cat`, `catc`, `catch`), which means that partial terms can be evaluated (since users aren't likely to know precisely what text your documents contain) - see the sketch after this list;\n",
" - **functional suggester:** where a suggester may be based on partial, or fuzzy, matching, sometimes you need precision; this can be in search categories, or such things as post- or zipcodes;\n",
" - **filtering:** just as you would in a relational structured database, you may want to filter search results (e.g. newest, by a specific creator, within a specific category, by geography); this acts to reduce the search space;\n",
" - **faceting:** this is similar to filtering, but what may be filtered on is produced on the fly from the search results, whereas filters are offered in advance; e.g. search for a term, filter by the latest, and then offer the user a set of topics derived from the results which they can select to filter further.\n",
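"\n",
"To make the ngram / edge ngram distinction concrete, here is a small plain-Python sketch. It is only an illustration of the tokenisation idea (Elasticsearch uses its own analysers; these function names are made up):\n",
"\n",
"```python\n",
"def ngrams(text, n=3):\n",
"    # sliding window of n characters across the whole string\n",
"    return [text[i:i + n] for i in range(len(text) - n + 1)]\n",
"\n",
"\n",
"def edge_ngrams(token, min_gram=1, max_gram=5):\n",
"    # grams anchored to the start of the token\n",
"    return [token[:i] for i in range(min_gram, min(max_gram, len(token)) + 1)]\n",
"\n",
"\n",
"print(ngrams('catch the monkey'))\n",
"# ['cat', 'atc', 'tch', 'ch ', 'h t', ' th', 'the', ...]\n",
"\n",
"print(edge_ngrams('catch'))\n",
"# ['c', 'ca', 'cat', 'catc', 'catch']\n",
"```\n",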
"\n",
"Implementing search does not mean choosing and implementing a single method. You may need to combine multiple approaches to create a single search experience for the user. Depending on where they are in the journey, you may use suggesters, or functional suggesters. You may search on different fields simultaneously, ranking the results, recombining them, and returning them to the user.\n",
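"\n",
"With django-elasticsearch-dsl-drf, that combination is typically expressed as a viewset with a stack of filter backends. The sketch below is a rough, hypothetical example built on the `ArticleDocument` defined earlier; the field names, and the assumption that `title` carries a completion sub-field called `title.suggest`, are mine rather than the library's defaults:\n",
"\n",
"```python\n",
"# views.py - a hypothetical viewset combining several search approaches\n",
"from django_elasticsearch_dsl_drf.constants import SUGGESTER_COMPLETION\n",
"from django_elasticsearch_dsl_drf.filter_backends import (\n",
"    CompoundSearchFilterBackend,\n",
"    DefaultOrderingFilterBackend,\n",
"    FacetedSearchFilterBackend,\n",
"    FilteringFilterBackend,\n",
"    OrderingFilterBackend,\n",
"    SuggesterFilterBackend,\n",
")\n",
"from django_elasticsearch_dsl_drf.serializers import DocumentSerializer\n",
"from django_elasticsearch_dsl_drf.viewsets import DocumentViewSet\n",
"\n",
"from .documents import ArticleDocument  # the hypothetical document above\n",
"\n",
"\n",
"class ArticleDocumentSerializer(DocumentSerializer):\n",
"    class Meta:\n",
"        document = ArticleDocument\n",
"        fields = ('title', 'content', 'published', 'topic')\n",
"\n",
"\n",
"class ArticleDocumentViewSet(DocumentViewSet):\n",
"    document = ArticleDocument\n",
"    serializer_class = ArticleDocumentSerializer\n",
"\n",
"    # each backend contributes one piece of the search experience\n",
"    filter_backends = [\n",
"        FilteringFilterBackend,\n",
"        OrderingFilterBackend,\n",
"        DefaultOrderingFilterBackend,\n",
"        CompoundSearchFilterBackend,\n",
"        FacetedSearchFilterBackend,\n",
"        SuggesterFilterBackend,\n",
"    ]\n",
"\n",
"    # full-text search across both fields\n",
"    search_fields = ('title', 'content')\n",
"    # filters reduce the search space in advance\n",
"    filter_fields = {'topic': 'topic', 'published': 'published'}\n",
"    # facets are derived from the result set on the fly\n",
"    faceted_search_fields = {'topic': 'topic'}\n",
"    # as-you-type suggestions from a completion sub-field\n",
"    suggester_fields = {\n",
"        'title_suggest': {\n",
"            'field': 'title.suggest',\n",
"            'suggesters': [SUGGESTER_COMPLETION],\n",
"        },\n",
"    }\n",
"    ordering_fields = {'published': 'published'}\n",
"    ordering = ('-published',)\n",
"```\n",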
"\n",
"Search triggers a cascade of queries and responses between the user and your application, and may be one of the more processing- and bandwidth-intensive parts of your user's interaction with your application.\n",
"\n",
"## First step, understand your document corpus\n",
"\n",
"If your documents are individual blog posts, then you're likely to want users to be able to search on titles and content, while filtering and/or faceting on publication dates, authors, topics, and categories. If your data are geospatial, you may want to search on addresses, and filter or facet on additional criteria (or fields) associated with each document.\n",
"\n",
"The more complex your data, the more decisions you need to make up front relating to what precisely you wish users to search on, and what to offer in terms of filters.\n",
"\n",
"One thing to avoid is 'overfitting'. If you have 100 documents which are extremely similar (say, a database of mid-range motorcar reviews), there's likely to be a great deal of repetition in the content and the data. What are the terms that are unique and worth searching on? What's common and can be used as category data? What is of no value for search?\n",
"\n",
"Do you want to split title from content search? If you combine them, you need to provide that as a field to the search indexer. If you're searching addresses, or technical phrases, you may need to pre-process the terms and then decide on which search approach to perform.\n",
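"\n",
"If you do decide to combine title and content, one way of doing it (a sketch, re-working the hypothetical `ArticleDocument` from earlier) is to declare an extra document field and populate it at indexing time with a `prepare_` method:\n",
"\n",
"```python\n",
"from django_elasticsearch_dsl import Document, fields\n",
"from django_elasticsearch_dsl.registries import registry\n",
"\n",
"from .models import Article  # hypothetical Django model\n",
"\n",
"\n",
"@registry.register_document\n",
"class ArticleDocument(Document):\n",
"    # a single combined field for full-text search\n",
"    combined = fields.TextField()\n",
"\n",
"    def prepare_combined(self, instance):\n",
"        # prepare_<field> methods are called when the document is indexed\n",
"        return ' '.join([instance.title, instance.content])\n",
"\n",
"    class Index:\n",
"        name = 'articles'\n",
"\n",
"    class Django:\n",
"        model = Article\n",
"        fields = ['title', 'content']\n",
"```\n",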
"\n",
"Without understanding your data, and how you intend for it to be searched and filtered, you won't know how to design the search process."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"nbformat": 4,
"nbformat_minor": 2
}
