|
| 1 | +.. _certification: |
| 2 | + |
| 3 | +======================================= |
| 4 | +Certification in Percona XtraDB Cluster |
| 5 | +======================================= |
| 6 | + |
| 7 | +|Percona XtraDB Cluster| replicates actions executed on one node to all other |
| 8 | +nodes in the cluster and does so fast enough to appear as if it is |
| 9 | +synchronous (aka virtually synchronous). |
| 10 | + |
| 11 | +There are two main types of actions: DDL and DML. DDL actions are executed |
| 12 | +using Total Order Isolation (let's ignore Rolling Schema Upgrade for now) and |
| 13 | +DML actions are executed using the normal Galera replication protocol. |
| 14 | + |
| 15 | +.. note:: |
| 16 | + |
| 17 | + This manual page assumes the reader is aware of Total Order Isolation and |
| 18 | + the MySQL replication protocol. |
| 19 | + |
| 20 | +DML (``INSERT``/``UPDATE``/``DELETE``) operations effectively change the state |
| 21 | +of the database, and all such operations are recorded in |XtraDB| by |
| 22 | +registering a unique object identifier (aka key) for each change (an update |
| 23 | +or a new addition). |
| 24 | + |
| 25 | +* A transaction can change “n” different data objects. Each such object change |
| 26 | + is recorded in |XtraDB| using a so-called ``append_key`` operation. The |
| 27 | + ``append_key`` operation registers the key of the data object that has |
| 28 | + undergone a change by the transaction. The key for rows can be represented in |
| 29 | + three parts as ``db_name``, ``table_name``, and ``pk_columns_for_table`` (if |
| 30 | + ``pk`` is absent, a hash of the complete row is calculated). In short, this |
| 31 | + is compact metadata recording which rows the transaction has |
| 32 | + touched or modified. This information is passed on as part of the |
| 33 | + write-set for certification to all the nodes of a cluster while the |
| 34 | + transaction is in the commit phase. |
| 35 | + |
| 36 | +* For a transaction to commit it has to pass XtraDB/Galera certification, |
| 37 | + ensuring that transactions don't conflict with any other changes posted on |
| 38 | + the cluster group/channel. Certification will add the keys modified by the |
| 39 | + given transaction to its own central certification vector (CCV), represented by |
| 40 | + ``cert_index_ng``. If the said key is already part of the vector, then |
| 41 | + conflict resolution checks are triggered. |
| 42 | + |
| 43 | +* Conflict resolution traces the reference transaction (the one that last modified |
| 44 | + this item in the cluster group). If this reference transaction is from some other |
| 45 | + node, that suggests the same data was modified by the other node and changes |
| 46 | + of that node have been certified by the local node that is executing the |
| 47 | + check. In such cases, the transaction that arrived later fails to certify. |
| 48 | + |
| 49 | +Changes made to DB objects are bin-logged. This is the same as how |MySQL| |
| 50 | +does it for replication with its Master-Slave ecosystem, except that a packet |
| 51 | +of changes from a given transaction is created and named as a write-set. |
| 52 | + |
| 53 | +Once the client/user issues a ``COMMIT``, |Percona XtraDB Cluster| will run a |
| 54 | +commit hook. The commit hook ensures the following: |
| 55 | + |
| 56 | +* Flush the binary logs. |
| 57 | + |
| 58 | +* Check if the transaction needs replication (not needed for read-only |
| 59 | + transactions like ``SELECT``). |
| 60 | + |
| 61 | +* If a transaction needs replication, then it invokes a pre_commit hook in |
| 62 | + the Galera ecosystem. During this pre-commit hook, a write-set is written in |
| 63 | + the group channel by a “replicate” operation. All nodes (including the one |
| 64 | + that executed the transaction) subscribe to this group-channel and read |
| 65 | + the write-set. |
| 66 | + |
| 67 | +* ``gcs_recv_thread`` is first to receive the packet, which is then processed |
| 68 | + through different action handlers. |
| 69 | + |
| 70 | +* Each packet read from the group-channel is assigned an ``id``, which is a |
| 71 | + counter maintained locally by each node in sync with the group. When any new |
| 72 | + node joins the group/cluster, a seed-id for it is initialized to the current |
| 73 | + active id from the group/cluster. (There is an inherent assumption/protocol |
| 74 | + enforcement that all nodes read the packets from a channel in the same order, and |
| 75 | + that way even though each packet doesn't carry ``id`` information it is |
| 76 | + inherently established using the locally maintained ``id`` value). |
| 77 | + |
| 78 | +.. code-block:: c |
| 79 | +
|
| 80 | + /* Common situation - |
| 81 | + * increment and assign act_id only for totally ordered actions |
| 82 | + * and only in PRIM (skip messages while in state exchange) */ |
| 83 | + rcvd->id = ++group->act_id_; |
| 84 | +
|
| 85 | + [This is an elegant way to solve the problem of id coordination in a |
| 86 | + multi-master system. Otherwise, a node would first have to get an id from a |
| 87 | + central system or through a separate agreed protocol and then use it for the |
| 88 | + packet, thereby doubling the round-trip time]. |
| 89 | +
|
| 90 | +What happens if two nodes get ready with their packet at the same time? |
| 91 | +
|
| 92 | +* Both nodes will be allowed to put the packet on the channel. That means the |
| 93 | + channel will see packets from different nodes queued one-behind-another. |
| 94 | +
|
| 95 | +* It is interesting to understand what happens if two nodes modify the same set of |
| 96 | + rows. For example: |
| 97 | +
|
| 98 | + .. code-block:: bash |
| 99 | +
|
| 100 | + create -> insert (1,2,3,4)....nodes are in sync till this point. |
| 101 | + node-1: update i = i + 10; |
| 102 | + node-2: update i = i + 100; |
| 103 | +
|
| 104 | + Let's associate a transaction-id (trx-id) with the update transaction that |
| 105 | + is executed on node-1 and node-2 in parallel. (The real algorithm is a bit |
| 106 | + more involved (with uuid + seqno) but conceptually the same, so for ease |
| 107 | + we're using trx_id here.) |
| 108 | +
|
| 109 | + node-1: |
| 110 | + update action: trx-id=n1x |
| 111 | + node-2: |
| 112 | + update action: trx-id=n2x |
| 113 | +
|
| 114 | +Both node packets are added to the channel but the transactions are |
| 115 | +conflicting. Let's see which one succeeds. The protocol says: FIRST WRITE WINS. |
| 116 | +So in this case, whoever is first to write to the channel will get certified. |
| 117 | +Let's say node-2 is first to write the packet and node-1 writes its packet |
| 118 | +immediately after it. |
| 119 | +
|
| 120 | +.. note:: |
| 121 | + Each node subscribes to all packets, including its own packet. See below |
| 122 | + for details. |
| 123 | +
|
| 124 | +Node-2: |
| 125 | + - Will see its own packet and will process it. |
| 126 | + - Then it will see the node-1 packet that it tries to certify but fails. |
| 127 | +
|
| 128 | +Node-1: |
| 129 | + - Will see the node-2 packet and will process it. (Note: InnoDB allows isolation |
| 130 | + and so node-1 can process node-2 packets independent of node-1 transaction |
| 131 | + changes) |
| 132 | + - Then it will see the node-1 packet that it tries to certify but fails. |
| 133 | + (Note: even though the packet originated from node-1, it will undergo |
| 134 | + certification to catch cases like these. This is the beauty of listening to its |
| 135 | + own events: it makes the processing path consistent even if events are locally |
| 136 | + generated) |
| 137 | +
|
| 138 | +The certification protocol will be described using the example from above. As |
| 139 | +discussed above, the central certification vector (CCV) is updated to reflect |
| 140 | +the reference transaction. |
| 141 | +
|
| 142 | +Node-2: |
| 143 | + - node-2 sees its own packet for certification, adds it to its local CCV and |
| 144 | + performs certification checks. Once these checks pass it updates the |
| 145 | + reference transaction by setting it to ``n2x`` |
| 146 | + - node-2 then gets the node-1 packet for certification. The said key is already |
| 147 | + present in CCV with a reference transaction set to ``n2x``, whereas the |
| 148 | + write-set proposes setting it to ``n1x``. This causes a conflict, which in |
| 149 | + turn causes the node-1 originated transaction to fail the certification |
| 150 | + test. |
| 151 | +
|
| 152 | +This results in a certification failure, and the node-1 packet is rejected. |
| 153 | +
|
| 154 | +Node-1: |
| 155 | + - node-1 sees the node-2 packet for certification, which is then processed, the |
| 156 | + local CCV is updated and the reference transaction is set to ``n2x`` |
| 157 | + - Using the same case explained above, node-1 certification also rejects the |
| 158 | + node-1 packet. |
| 159 | +
|
| 160 | +This suggests that the node doesn't need to wait for certification to complete, |
| 161 | +but just needs to ensure that the packet is written to the channel. The applier |
| 162 | +transaction will always win and the local conflicting transaction will be |
| 163 | +rolled back. |
| 164 | +
|
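
The walk-through above can be sketched as a small simulation. This is only an illustration under stated assumptions, not Galera's implementation: ``WriteSet``, ``certify``, ``replay``, and the ``last_seen`` field are hypothetical names, plain ``trx_id`` strings stand in for uuid + seqno, and the conflict test (the key was changed after the position the origin node had applied) is a simplified model of the check described in this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    trx_id: str      # e.g. "n1x"; stands in for uuid + seqno
    keys: tuple      # (db_name, table_name, pk) tuples touched by the transaction
    last_seen: int   # channel position the origin node had applied when the trx ran


def certify(ccv, position, ws):
    """Return True if ws passes; on success record it as reference trx for its keys."""
    for key in ws.keys:
        ref = ccv.get(key)
        # Conflict: the key was changed after the origin node's last applied position.
        if ref is not None and ref[0] > ws.last_seen:
            return False
    for key in ws.keys:
        ccv[key] = (position, ws.trx_id)
    return True


def replay(channel):
    """Process the totally ordered channel on one node, starting from an empty CCV."""
    ccv, results = {}, {}
    for position, ws in enumerate(channel, start=1):
        results[ws.trx_id] = certify(ccv, position, ws)
    return results, ccv


row = ("test", "t", 1)
# node-2 wrote to the channel first, node-1 immediately after it.
channel = [WriteSet("n2x", (row,), last_seen=0), WriteSet("n1x", (row,), last_seen=0)]

node1_view, _ = replay(channel)
node2_view, ccv = replay(channel)
print(node1_view == node2_view, node2_view)
```

Because every node replays the same channel in the same order, both nodes reach the same verdict independently: ``n2x`` is certified and ``n1x`` is rejected.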
| 165 | +What happens if one of the nodes has local changes that are not synced with |
| 166 | +the group? |
| 167 | +
|
| 168 | +.. code-block:: bash |
| 169 | +
|
| 170 | + create (id primary key) -> insert (1), (2), (3), (4); |
| 171 | + node-1: wsrep_on=0; insert (5); wsrep_on=1 |
| 172 | + node-2: insert(5). |
| 173 | + insert(5) will generate a write-set that will then be replicated to node-1. |
| 174 | + node-1 will try to apply it but will fail with duplicate-key-error, as 5 |
| 175 | + already exists. |
| 176 | +
|
| 177 | + XtraDB will flag this as an error, which would eventually cause node-1 to |
| 178 | + shut down. |
| 179 | +
|
| 180 | +With all that in place, how is GTID incremented if all the packets are |
| 181 | +processed by all nodes (including ones that are rejected due to certification)? |
| 182 | +GTID is incremented only when the transaction passes certification and is ready |
| 183 | +for commit. That way errant-packets don't cause GTID to increment. Also, |
| 184 | +don't confuse the group packet ``id`` quoted above with GTID. Without |
| 185 | +errant-packets, you may end up seeing these two counters going hand-in-hand, |
| 186 | +but they are in no way related. |
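
The difference between the two counters can be shown with a toy model. This is a sketch under assumptions, not the server's bookkeeping: ``advance_counters`` and both counter names are hypothetical, and each boolean stands for whether one packet passed certification.

```python
def advance_counters(certification_results):
    """Walk a sequence of per-packet certification outcomes and count both ways."""
    packet_id = 0   # group packet id: advances for every packet read from the channel
    gtid_seqno = 0  # GTID-like counter: advances only when certification passes
    for passed in certification_results:
        packet_id += 1
        if passed:
            gtid_seqno += 1
    return packet_id, gtid_seqno


no_conflicts = advance_counters([True, True, True])
with_errant_packet = advance_counters([True, False, True])
print(no_conflicts, with_errant_packet)
```

With no rejected packets the two counters happen to move together; a single rejected packet is enough to make them diverge.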