
Commit 7bf5fb8

Added Sharding assets and notes
1 parent d323e11 commit 7bf5fb8

File tree

5 files changed: +62, -4 lines changed


assets/11.png (110 KB)

assets/12.png (43.6 KB)

assets/13.png (85.3 KB)

assets/14.png (21.5 KB)

notes/primer.md (+62, -4)
@@ -294,16 +294,74 @@
- **Database Partitioning**
- **Partitioning** means *dividing data across tables or databases.*
- Partitioning methods:
- **Horizontal Partitioning** - ***Involves putting different rows into different tables.*** Perhaps customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest. The two partition tables are then CustomersEast and CustomersWest, and a view with a union might be created over both of them to provide a complete view of all customers. **Each partition has the same schema and columns but different rows. Likewise, the data held in each partition is unique and independent of the data held in the other partitions.**
- **Vertical Partitioning** - ***Involves creating tables with fewer columns and using additional tables to store the remaining columns.* The data held within one vertical partition is independent of the data in all the others, and each holds both distinct rows and columns.** It is almost always implemented at the application level: a piece of code routes reads and writes to a designated database.
- **Normalization** - The process of removing redundant columns from a table and putting them in secondary tables that are linked to the primary table by primary key and foreign key relationships.
- **Row splitting** - Divides the original table vertically into tables with fewer columns. Each logical row in a split table matches the same logical row in the other tables, as identified by a UNIQUE KEY column that is identical in all of the partitioned tables.

![Image 11](../assets/11.png)
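
A minimal sketch of both methods above at the application level, using in-memory dicts in place of real tables (all table and field names here are hypothetical):

```python
# Hypothetical application-level partitioning; dicts stand in for tables.

# Horizontal partitions: same columns, different rows, split by ZIP code.
customers_east: dict[int, dict] = {}  # ZIP < 50000
customers_west: dict[int, dict] = {}  # ZIP >= 50000

# Vertical partitions: same rows, different columns, linked by customer_id.
customer_core: dict[int, dict] = {}    # frequently accessed columns
customer_blobs: dict[int, bytes] = {}  # rarely accessed blob column

def add_customer(customer_id: int, name: str, zip_code: int, photo: bytes) -> None:
    # Horizontal routing: the row's ZIP code decides which table receives it.
    table = customers_east if zip_code < 50000 else customers_west
    table[customer_id] = {"name": name, "zip": zip_code}
    # Vertical split: the wide blob column lives in its own table.
    customer_core[customer_id] = {"name": name, "zip": zip_code}
    customer_blobs[customer_id] = photo

def all_customers() -> dict[int, dict]:
    # The "view with a union" over both horizontal partitions.
    return {**customers_east, **customers_west}

add_customer(1, "Ada", 10001, b"...")    # routed to customers_east
add_customer(2, "Grace", 94102, b"...")  # routed to customers_west
assert len(all_customers()) == 2
```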
- **Benefits of partitioning:**
- Indexes are smaller (quick index scans)
- Allows the DB optimizer to sequentially scan a partition instead of using an index
- Split the table by columns (vertically) and put the columns with the entire slice into another table (blobs)
- Example: fields that are blobs can be put in another table, in another tablespace stored on an HDD, while the rest of the data goes to your SSD.
- [Wiki article](https://en.wikipedia.org/wiki/Partition_%28database%29)
- [Hussain's video on partitioning](https://www.youtube.com/watch?v=QA25cMWp9Tk)
- [Difference between Horizontal vs Vertical partitioning - StackOverflow answer](https://stackoverflow.com/questions/18302773/what-are-horizontal-and-vertical-partitions-in-database-and-what-is-the-differen/18302815)
- [Difference between Normalization and Row Splitting in Vertical Partitioning - StackOverflow answer](https://stackoverflow.com/questions/20388923/database-partitioning-horizontal-vs-vertical-difference-between-normalizatio)
- **Database Sharding (based on Horizontal Partitioning)**
- **Sharding involves breaking up one's data into two or more smaller chunks, called logical shards.** **The logical shards are then distributed across separate database nodes, referred to as physical shards, which can hold multiple logical shards.** Despite this, the data held within all the shards collectively represent an entire logical dataset.
- Database shards exemplify a *[shared-nothing architecture](https://en.wikipedia.org/wiki/Shared-nothing_architecture)*. This means that the shards are autonomous; they don't share any of the same data or computing resources. In some cases, though, it may make sense to replicate certain tables into each shard to serve as **reference tables.** For example, say there's a database for an application that depends on fixed conversion rates for weight measurements. Replicating a table containing the necessary conversion rate data into each shard helps ensure that all of the data required for queries is held in every shard.
- **The shard or partition key is a portion of the primary key that determines how data should be distributed.** A partition key allows you to retrieve and modify data efficiently by routing operations to the correct database. Entries with the same partition key are stored in the same node. **A logical shard** is a collection of data sharing the same partition key. A database node, sometimes referred to as a **physical shard**, contains multiple logical shards.
- **Sharding splits a homogeneous type of data into multiple databases.** Such an algorithm is easily generalizable, which is why sharding can be implemented at either the application or database level. In many databases, sharding is a first-class concept, and the database knows how to store and retrieve data within a cluster. Almost all modern databases are natively sharded. Cassandra, HBase, HDFS, and MongoDB are popular distributed databases. Notable examples of non-sharded modern databases are SQLite, Redis (spec in progress), Memcached, and ZooKeeper.
- Oftentimes, sharding is implemented at the application level, meaning that the application includes code that defines which shard to transmit reads and writes to. However, many database management systems have sharding capabilities built in, allowing you to implement sharding directly at the database level.
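
A minimal sketch of that routing, with hypothetical shard and node names: the partition key hashes to a logical shard, and a lookup table maps logical shards to the physical nodes that hold them:

```python
import hashlib

# Hypothetical layout: four logical shards spread across two physical nodes.
LOGICAL_TO_PHYSICAL = {
    "shard_0": "node_a",
    "shard_1": "node_a",
    "shard_2": "node_b",
    "shard_3": "node_b",
}

def stable_hash(key: str) -> int:
    # Deterministic across processes, unlike Python's built-in hash() on strings.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def logical_shard(partition_key: str) -> str:
    # Entries with the same partition key always land in the same logical shard.
    return f"shard_{stable_hash(partition_key) % len(LOGICAL_TO_PHYSICAL)}"

def physical_shard(partition_key: str) -> str:
    # A physical node holds several logical shards.
    return LOGICAL_TO_PHYSICAL[logical_shard(partition_key)]

assert physical_shard("user:42") == physical_shard("user:42")  # consistent routing
```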
- **Benefits:**
- Scale out - horizontal scaling
- Makes the setup more flexible
- Speeds up query response times (you don't need to search through the entire database; the query does a lookup on the respective shard based on the partition key and query parameters)
- Makes applications more reliable by mitigating the impact of outages.
- **Drawbacks:**
- Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate.
- Adds additional [programming and operational complexity](http://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/) to the application.
- Data becomes less convenient to access when it is spread across multiple databases. Operations may need to search through many databases to retrieve data. These queries are called **cross-partition operations**, and they tend to be inefficient (see the sketch after this list).
- We can have an uneven distribution of data and operations on a particular shard (**hotspots**).
- It's very difficult to roll back a sharded database to a non-sharded one.
- Sharding often requires a "roll your own" approach, so documentation for sharding or tips for troubleshooting problems are often difficult to find.
- SQL complexity - increased bugs because developers have to write more complicated SQL to handle the sharding logic.
- Additional software - the software that partitions, balances, coordinates, and ensures integrity can fail.
- Single point of failure - corruption of one shard due to network/hardware/system problems causes failure of the entire table.
- Fail-over server complexity - fail-over servers must have copies of the fleets of database shards.
- Backup complexity - database backups of the individual shards must be coordinated with the backups of the other shards.
- Operational complexity - adding/removing indexes, adding/deleting columns, and modifying the schema all become much more difficult.
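
A minimal sketch of the cross-partition problem, with hypothetical data: a lookup by partition key touches exactly one shard, while a query on any other field must scatter to every shard and gather the results:

```python
# Hypothetical shards: user rows keyed by user_id (the partition key),
# placed by user_id % 3.
shards = [
    {3: {"name": "Edsger", "city": "Austin"}},  # holds user_ids where id % 3 == 0
    {1: {"name": "Ada", "city": "London"}},     # holds user_ids where id % 3 == 1
    {2: {"name": "Grace", "city": "NYC"}},      # holds user_ids where id % 3 == 2
]

def get_user(user_id: int) -> dict | None:
    # Partition key given: route to exactly one shard.
    return shards[user_id % len(shards)].get(user_id)

def find_by_city(city: str) -> list[dict]:
    # No partition key: a cross-partition operation that must visit every shard.
    return [row for shard in shards for row in shard.values() if row["city"] == city]

assert get_user(1)["name"] == "Ada"               # one shard touched
assert find_by_city("NYC")[0]["name"] == "Grace"  # all shards touched
```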
- **When to shard and when not to?**
- **Sharding schemes**
- **Algorithmic Sharding**

![Image 12](../assets/12.png)
- Algorithmically sharded databases use a sharding function `f(partition_key) -> database_id` to locate data. A simple sharding function may be `hash(key) % NUM_DB`.
- Here, each key **consistently maps** to the same node. We can do this by computing a numeric hash of the key and taking it modulo the total number of nodes to determine which node owns the key.
- Reads are performed within a single database as long as a partition key is given. Queries without a partition key require searching every database node. Non-partitioned queries do not scale with respect to the size of the cluster, so they are discouraged.
- Algorithmic sharding distributes data by its sharding function only. It doesn't consider the payload size or space utilization. To uniformly distribute data, each partition should be similarly sized.
- **Pros** - In algorithmic sharding, the client can determine a given partition's database without any help from an external service. Algorithmic sharding is suitable for key-value databases with homogeneous values.
- **Cons** - When a node is added or removed, the ownership of almost all keys is affected, resulting in a massive redistribution of data across the nodes of the cluster. While this is not a correctness issue in a distributed cache (cache misses will repopulate the data), it can have a huge performance impact since the entire cache has to be warmed again.
- Examples of such a system include Memcached. Memcached is not sharded on its own but expects client libraries to distribute data within a cluster. Such logic is fairly easy to implement at the application level. ⇒ [Sharding data across a Memcache tier](https://www.linuxjournal.com/article/7451)
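
A minimal sketch of `hash(key) % NUM_DB` routing, and of the redistribution problem from the cons above: growing a hypothetical cluster from 4 to 5 nodes remaps most keys:

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic across processes (built-in hash() on str is randomized).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def database_id(partition_key: str, num_db: int) -> int:
    # The sharding function: f(partition_key) -> database_id.
    return stable_hash(partition_key) % num_db

keys = [f"user:{i}" for i in range(10_000)]

# Adding a fifth node changes the owner of roughly 4/5 of all keys.
moved = sum(1 for k in keys if database_id(k, 4) != database_id(k, 5))
print(f"{moved / len(keys):.0%} of keys moved")  # ~80%
```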
- **Dynamic Sharding**

![Image 13](../assets/13.png)

- Here, an external locator service determines the location of records in their respective shards. If the cardinality of partition keys is low, the locator can be assigned per key. In the general case, a single locator addresses a range of partition keys.

![Image 14](../assets/14.png)
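
A minimal sketch of a dynamic locator, with hypothetical key ranges: instead of a fixed function, a mutable range table maps spans of partition keys to shards, so a range can be reassigned or split without rehashing everything else:

```python
import bisect

# Hypothetical locator table: sorted upper bounds of key ranges -> shard names.
# Keys 0-999 -> shard_a, 1000-4999 -> shard_b, 5000+ -> shard_c.
range_bounds = [1000, 5000]
range_shards = ["shard_a", "shard_b", "shard_c"]

def locate(partition_key: int) -> str:
    # The locator service: look up which range the key falls into.
    return range_shards[bisect.bisect_right(range_bounds, partition_key)]

assert locate(42) == "shard_a"
assert locate(2500) == "shard_b"
assert locate(99999) == "shard_c"

# Rebalancing: split a hot range without touching any other range.
range_bounds.insert(1, 3000)        # new boundary inside shard_b's range
range_shards.insert(2, "shard_b2")  # keys 3000-4999 now live on shard_b2
assert locate(4000) == "shard_b2"
assert locate(2500) == "shard_b"    # unaffected keys keep their old owner
```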
(Remaining...)
- [How sharding works](https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6)
- [Data sharding in distributed SQL database](https://blog.yugabyte.com/how-data-sharding-works-in-a-distributed-sql-database/)
- [Understanding database sharding](https://www.digitalocean.com/community/tutorials/understanding-database-sharding)
- [Principles of Sharding](https://www.citusdata.com/blog/2017/08/09/principles-of-sharding-for-relational-databases/)
- [Why you don't want to shard by Percona](https://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/)
- [Unorthodox approach to database design](http://highscalability.com/blog/2009/8/6/an-unorthodox-approach-to-database-design-the-coming-of-the.html)
- Why shard or partition a database
- [Sharding vs Horizontal Partitioning](https://stackoverflow.com/questions/20771435/database-sharding-vs-partitioning)
