From 28fcf4b00941179835db328b799e3d6d200ff2ee Mon Sep 17 00:00:00 2001
From: Vladyslav Voloshyn
Date: Wed, 7 Mar 2018 00:26:50 +0200
Subject: [PATCH 1/3] adding info about elasticsearch

---
 databases_201.rst | 76 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/databases_201.rst b/databases_201.rst
index d78dc658..ded66ebe 100644
--- a/databases_201.rst
+++ b/databases_201.rst
@@ -49,4 +49,80 @@ FlockDB
 Neo4j
 -----
 
+ElasticSearch
+-----
+
+**Intro**
+
+*ElasticSearch* main logo - "you know, for search". It's near real-time search engine and document-oriented datastore. It's written in Java, so lots of processes depends on JVM (Java Virtual Machine). System has awesome REST API, its documents and queries - all is represented via JSON (https://en.wikipedia.org/wiki/JSON). All configuration can be set up in elasticsearch.yml file. Analogues: Apache Solr, Sphinx.
+
+**Under the microscope**
+
+Elasticsearch is great for scaling. And this is probably one of the most noticable features. You can literally scale elasticsearch cluster for any load. So to begin with, what cluster is?
+
+Cluster is simply a set of nodes, physical, virtual or containerised.. whatever. Nodes consist of indexes, indexes consist of shards. Index may have several types. Types are only logical, not physical. Type can be considered as a table when comparing Relational DB.
+
+Shards consist of Lucene segments (segment is a chunk of Lucene index). And this is the firsts rule for Elasticsearch - all spins around Apache Lucene (Java library for full-text search).
+
+Segments are created as you index new documents. Data is never removed from them because deleting only marks documents as deleted. Finally, data never changes because updating documents implies re-indexing. And here is the second rule - Elasticsearch is awesome for reading, but not so cool for updating/deleting. The more segments you have to go through, the slower the search. To avoid having an extremely large number of segments in an index, Lucene merges them from time to time => excluding the deleted documents, and creating new and bigger segments.
+
+Shard can be primary or replica. Replica is a copy of primary. Primary shards are created only at the beginning and cannot be changed (deleted/added) while replica can. When primary is down, replica is promoted to primary, new replica created. Shards are special unit that can be balanced between nodes to maintain equal distribution/load/fault tolerance.
+
+**Cluster states and roles**
+
+Cluster has 3 health states:
+- *Green* - all primary and replica shards available
+- *Yellow* - one or more replicas are down
+- *Red* - one or more primaries are down
+
+As for roles:
+- *Master* - responsible for checking alive, healthy state, fault detection
+- *Data* - stores data
+- *Client* - accepts requests, use to remove load from master/data nodess
+
+**Documents**
+
+Document is uniquely identified by index-type-id combination. Elastic is generally schema-free, but can be defined via mappings. Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Field types can be several types: string, numeric, date, boolean, arrays, multi-fields. Plus predefined fields (used internally by Elastic): _all, _ttl, _timestamp, _source, _id, _type, _index.
+
+**How Elastic indexes?**
+
+- Request (document) received => need to choose shard
+- By default documents are distributed evenly between shards
+- Shard is determined by hashing document's ID
+- When shard is determined => sending document to primary. Then backing up to replicas
+- Lucene segments are first stored in memory
+- Refreshing process makes newly indexed documents available for search
+- Flushing process transfers indexed data from memory to the disk
+- Flush is triggered in one of the following conditions: memory buffer is full, time passed since last flush, log hit size treshold
+- Bigger segments are periodically created from smaller segments to consolidate the inverted indices and make searches faster - it's called merge
+
+**How Elastic searches?**
+
+- Request forwards to shard containing data
+- Using round-robin Elastic choose either primary or replica shard
+- Gather all the results (aggregation from different segments/shards) and gives it back
+- Ranking algorithm is TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
+
+**How document is updated?**
+
+- Retrieve the existing document
+- Apply the changes you specify
+- Removes the old document and indexes the new document (with the update applied) in its place
+- Version of document is bumped
+
+**How document is deleted?**
+
+- Delete individual documents or groups of documents only marks them to be deleted, so they don’t show up in searches, and gets them out of the index later in an asynchronous manner. Delete old docs occurs during merge.
+- Delete complete indices easy to do performance-wise, happens almost instantly
+- Interesting fact: you can close indices => doesn’t allow read or write operations and its data isn’t loaded in memory. Remains on disk, easy to restore.
+- When you remove a mapping type, all the documents associated with it also are removed => awesome in terms effectivity.
+
+**Interesting features**
+
+Aliases, caches, warmers, filters, custom routing, pagination, bulk requests, tokenizers... and much more!
+
+**Limitations**
 
+- Lucene index can’t have more than 2.1 billion documents or more than 274 billion distinct terms
+- JVM => Gold rule is to allocate half of the node’s RAM to Elasticsearch, but no more than 32 GB
+- Refresh, flush and merge operations are expensive in terms of performance (CPU, I/O), need to be aware of this
\ No newline at end of file

From 3e2497005f4b3cd42845705be584374418161bfe Mon Sep 17 00:00:00 2001
From: Vladyslav Voloshyn
Date: Wed, 7 Mar 2018 00:34:34 +0200
Subject: [PATCH 2/3] fix lint and moving article to appropriate group

---
 databases_201.rst | 52 +++++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/databases_201.rst b/databases_201.rst
index ded66ebe..4f58f0cc 100644
--- a/databases_201.rst
+++ b/databases_201.rst
@@ -25,32 +25,8 @@ CouchDB
 Hadoop
 ------
 
-Key-value Stores
-================
-
-Riak
-----
-
-Cassandra
----------
-
-Dynamo
-------
-
-BigTable
---------
-
-Graph Databases
-===============
-
-FlockDB
--------
-
-Neo4j
------
-
 ElasticSearch
------
+-------------
 
 **Intro**
 
@@ -125,4 +101,28 @@ Aliases, caches, warmers, filters, custom routing, pagination, bulk requests, to
 
 - Lucene index can’t have more than 2.1 billion documents or more than 274 billion distinct terms
 - JVM => Gold rule is to allocate half of the node’s RAM to Elasticsearch, but no more than 32 GB
-- Refresh, flush and merge operations are expensive in terms of performance (CPU, I/O), need to be aware of this
\ No newline at end of file
+- Refresh, flush and merge operations are expensive in terms of performance (CPU, I/O), need to be aware of this
+
+Key-value Stores
+================
+
+Riak
+----
+
+Cassandra
+---------
+
+Dynamo
+------
+
+BigTable
+--------
+
+Graph Databases
+===============
+
+FlockDB
+-------
+
+Neo4j
+-----
\ No newline at end of file

From b99421c2b5043351fc5708e65dd053866949bc21 Mon Sep 17 00:00:00 2001
From: Vlad Voloshyn
Date: Wed, 25 Jul 2018 18:11:19 +0300
Subject: [PATCH 3/3] updated docs according to required changes

---
 databases_201.rst | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/databases_201.rst b/databases_201.rst
index 4f58f0cc..3fb4ed3c 100644
--- a/databases_201.rst
+++ b/databases_201.rst
@@ -30,35 +30,45 @@ ElasticSearch
 
 **Intro**
 
-*ElasticSearch* main logo - "you know, for search". It's near real-time search engine and document-oriented datastore. It's written in Java, so lots of processes depends on JVM (Java Virtual Machine). System has awesome REST API, its documents and queries - all is represented via JSON (https://en.wikipedia.org/wiki/JSON). All configuration can be set up in elasticsearch.yml file. Analogues: Apache Solr, Sphinx.
+*ElasticSearch* - it's near real-time search engine and document-oriented datastore.
+
+It's written in Java, so lots of processes depends on JVM (Java Virtual Machine). System has awesome REST API, its documents and queries - all is represented via JSON. All configuration can be set up in elasticsearch.yml file.
+
+Analogues: Apache Solr, Sphinx.
 
 **Under the microscope**
 
-Elasticsearch is great for scaling. And this is probably one of the most noticable features. You can literally scale elasticsearch cluster for any load. So to begin with, what cluster is?
+Elasticsearch is great for scaling. And this is probably one of the most noticable features. You can scale elasticsearch cluster for any load. So to begin with, what is a cluster?
 
-Cluster is simply a set of nodes, physical, virtual or containerised.. whatever. Nodes consist of indexes, indexes consist of shards. Index may have several types. Types are only logical, not physical. Type can be considered as a table when comparing Relational DB.
+Cluster is simply a set of nodes, physical, virtual or containerised... whatever. Nodes consist of indexes, indexes consist of shards. Index may have several types. Types are only logical, not physical. Type can be considered as a table when comparing Relational DB.
 
 Shards consist of Lucene segments (segment is a chunk of Lucene index). And this is the firsts rule for Elasticsearch - all spins around Apache Lucene (Java library for full-text search).
 
-Segments are created as you index new documents. Data is never removed from them because deleting only marks documents as deleted. Finally, data never changes because updating documents implies re-indexing. And here is the second rule - Elasticsearch is awesome for reading, but not so cool for updating/deleting. The more segments you have to go through, the slower the search. To avoid having an extremely large number of segments in an index, Lucene merges them from time to time => excluding the deleted documents, and creating new and bigger segments.
+Segments are created as you index new documents. Data is never removed from them because deleting only marks documents as deleted. Finally, data never changes because updating documents implies re-indexing.
+
+And here is the second rule - Elasticsearch is awesome for reading, but not so cool for updating/deleting. The more segments you have to go through, the slower the search. To avoid having an extremely large number of segments in an index, Lucene merges them from time to time => excluding the deleted documents, and creating new and bigger segments.
 
 Shard can be primary or replica. Replica is a copy of primary. Primary shards are created only at the beginning and cannot be changed (deleted/added) while replica can. When primary is down, replica is promoted to primary, new replica created. Shards are special unit that can be balanced between nodes to maintain equal distribution/load/fault tolerance.
 
 **Cluster states and roles**
 
 Cluster has 3 health states:
+
 - *Green* - all primary and replica shards available
 - *Yellow* - one or more replicas are down
 - *Red* - one or more primaries are down
 
 As for roles:
+
 - *Master* - responsible for checking alive, healthy state, fault detection
 - *Data* - stores data
 - *Client* - accepts requests, use to remove load from master/data nodess
 
 **Documents**
 
-Document is uniquely identified by index-type-id combination. Elastic is generally schema-free, but can be defined via mappings. Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Field types can be several types: string, numeric, date, boolean, arrays, multi-fields. Plus predefined fields (used internally by Elastic): _all, _ttl, _timestamp, _source, _id, _type, _index.
+Document is uniquely identified by index-type-id combination. Elastic is generally schema-free, but can be defined via mappings. Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.
+
+Field can be several types: string, numeric, date, boolean, arrays, multi-fields. Plus predefined fields (used internally by Elastic): _all, _ttl, _timestamp, _source, _id, _type, _index.
 
 **How Elastic indexes?**
 
@@ -77,7 +87,7 @@ Document is uniquely identified by index-type-id combination. Elastic is general
 - Request forwards to shard containing data
 - Using round-robin Elastic choose either primary or replica shard
 - Gather all the results (aggregation from different segments/shards) and gives it back
-- Ranking algorithm is TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
+- Ranking algorithm is `TF-IDF <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_
 
 **How document is updated?**
 
@@ -95,7 +105,7 @@ Document is uniquely identified by index-type-id combination. Elastic is general
 - Delete complete indices easy to do performance-wise, happens almost instantly
 - Interesting fact: you can close indices => doesn’t allow read or write operations and its data isn’t loaded in memory. Remains on disk, easy to restore.
 - When you remove a mapping type, all the documents associated with it also are removed => awesome in terms effectivity.
 
 **Interesting features**
 
-Aliases, caches, warmers, filters, custom routing, pagination, bulk requests, tokenizers... and much more!
+Aliases, caches, warmers, filters, custom routing, pagination, bulk requests, tokenizers, SQL support from 6.3 version... and much more!
 
 **Limitations**