preface
Look at the title. This is another question asked in the interview. In fact, this question was asked several times during the interview when I changed my job last time. I didn't care about it before. I think this thing may be more profound. I just don't understand it. However, with the increasing volume of java development industry, we must be fully prepared for this job change. Fill in all the holes that fell before and go out for abuse (interview).
What is distributed transaction
We all know that local transactions have four characteristics: atomicity, consistency, isolation and durability.
The acid of local transactions is generally completed by relational databases, and non relational databases can also be implemented by databases, except for redis, a weak transaction that cannot be rolled back.
However, in a distributed system, an operation is completed by multiple services. This one-time transaction operation involves multiple systems cooperating through the network, which is called distributed transaction.
In addition, if multiple data sources are used to connect different databases in the same service, when a transaction needs to operate multiple data sources, it is also a distributed transaction.
CAP
Cap theory is the theoretical basis of dealing with distributed transactions in distributed systems. The main reason is that the following three attributes cannot be met at the same time in the current distributed system:
In distributed systems. A service can only guarantee any two of the above attributes, not three at the same time.
Data consistency and service availability cannot be guaranteed when partition fault tolerance is guaranteed. If you want to improve the service availability, you need to add multiple nodes. Although the more nodes, the better the availability, the worse the data consistency.
In this way, it is almost impossible to meet "consistency", "availability" and "partition fault tolerance" at the same time in distributed system design.
Cap application combination @ h_ 502_ 67@ [Ca] give up partition fault tolerance: give up partition fault tolerance and ensure network availability. The simplest way is to put all data on the same node. Although this can not guarantee 100% system error, at least there will be no negative impact caused by network partition. [CP] abandonment of availability: abandonment of availability means that once the network partition or other system problems are encountered, the affected services need to wait for a certain time, and the system cannot provide normal services externally during the application waiting period, that is, it is unavailable for a short time. [AP] abandon consistency: the so-called abandonment of consistency does not completely eliminate the need for consistency, but abandons strong consistency to ensure the final consistency of data.
Base theory @ h_ 502_ 67@ in distributed systems, usability is often pursued, and the degree of importance is generally higher than consistency. So there is another theory, base theory, which is an extension of cap theory. Base availability; Soft state; Eventually consistent; Base theory is the result of a trade-off between consistency and availability in cap theory. The main idea is that strong consistency cannot be achieved, but each application can adopt appropriate methods to achieve the final consistency of the system according to its own business characteristics.
Solutions for distributed transactions
There are several ways to solve distributed transactions on the market.
2PC (two-stage commit) @ h_502_67 @ two-stage commit is mainly to divide the commit transaction and execute the transaction into two steps. The first stage: the transaction coordinator notifies the participants to prepare to commit the transaction, and the participants return success to the coordinator after successful preparation. If one participant returns unsuccessful preparation, the transaction execution fails. The second stage: the transaction coordinator fails according to each parameter According to the returned results of the first phase of the participant, initiate the request for final transaction submission. If one participant fails to submit, all participants will execute rollback and the transaction execution fails. This is a strong consistency implementation, because in the process of transaction execution among multiple services, it is possible that the transaction of the first service has been committed and the submission of the second service has failed. Although the transaction of the second service can be rolled back, it is possible that the transaction of the first service has been completed and cannot be rolled back. Therefore, in most cases, the second service is submitted for retry until the retry is successful. If it still fails after a certain number of retries, it needs early warning and manual intervention. Two phase commit is a distributed transaction that ensures strong consistency as much as possible, so it is synchronous blocking. Synchronous blocking leads to the problem of locking resources for a long time, so it is generally inefficient, And there is the problem of single point of failure (the coordinator may hang up, or the coordinator may hang up with one of the services, and the coordinator does not know whether the hung service is executing transactions or not). Therefore, in extreme cases, there is still a risk of data inconsistency. In addition, 2pc is actually more suitable for the situation of multiple data sources, and the data sources are relational databases. This can make two All transactions in the database are in the Prepare phase at the same time. When committing, the transactions in the two databases commit together.
3PC (three-stage submission) @ h_502_67 @ 3pc actually has one more pre submission stage than 2pc. What 3pc does in the first stage is to ask participants whether they are qualified to execute transactions, and the main purpose is to check whether they are available. The second stage is the same as the first stage of 2pc. The purpose of 3pc is to solve the problem. The 2pc stage coordination group and participants hang up The coordinator of the new election does not know whether it should be submitted or rolled back. If the new coordinator finds that a participant is in the pre commit or commit phase when he comes, the representative has passed the confirmation phase of all participants, so he can directly commit the transaction. Therefore, the purpose of the new accrual stage is to let the coordinator know what stage each participant is at present and how to synchronize the status of each participant later. However, 3pc cannot guarantee whether the participants on the reconnection have executed the transaction when both the coordinator and a participant hang up.
TCC@H_502_67 @Compared with the above two schemes, TCC is more like a transaction solution between distributed services. More widely used. The full name of TCC refers to try, confirm and cancel try: locking and reserving resources of transaction participants. Confirm: this stage is to perform real transaction operations in each participant service. Cancel: if there is an error in the execution of the business method of any service, it needs to be compensated here, that is, roll back the executed business. This method is cumbersome. Three operations, try confirm cancel, must be defined for each transaction. Moreover, TCC is quite invasive to the business, and each business needs to write the corresponding cancellation method. Moreover, if the revocation method is unsuccessful, it also guarantees idempotence. However, there are still scenarios. Think of some scenarios that involve strong consistency such as payment and transaction, but multiple services. It is more reasonable to use TCC. This can strictly ensure that distributed transactions either succeed or fail to roll back.
Local message table @ h_ 502_ 67@ the idea of local message table is mainly guaranteed by local transactions between services. This is to create a message table locally in the service, usually in the database. When executing distributed transactions, insert a piece of data into the local message table after performing local operations. Then, the message is sent to MQ. After receiving the message, the next service performs a local operation. After the operation is successful, the status in the message table is updated. If the execution of the next service fails, the status in the message table will not change. In this way, the timed task is used to brush the message table for retry. However, it is necessary to ensure that the retried service is idempotent, so as to ensure the consistency of the final data.
Reliable message @ h_ 502_ 67@ reliable messaging actually refers to the implementation of distributed transactions by message middleware. For example, rocketmq of company a implements distributed transactions with message oriented middleware. For example, system a will send a prepared message to MQ first. After the message is sent successfully, the local transaction will be executed. If the local transaction is executed successfully, it will tell MQ that the transaction is executed successfully. Otherwise, a rollback message is sent. After receiving the prepared message, system B starts to execute the local transaction. If the transaction is executed successfully, it also tells MQ that the sending and execution is successful. MQ will automatically periodically poll all prepared messages and call back to your interface. Ask you if the local transaction failed. Do you want to retry or rollback all messages that did not send confirmation? Generally speaking, you can check the database here to see whether the local transaction has been executed before. If it has been rolled back, it can also be rolled back here. This is to avoid the possibility that the local transaction is executed successfully, but the confirmation message fails to be sent. At this time, you need to implement the anti query interface yourself. What if the transaction of system B fails in this scheme? Try again, and try again and again automatically until it succeeds. If it is really impossible, either roll back important capital business. For example, after system B rolls back locally, find a way to notify system a to roll back too; Or send an alarm and roll back and compensate manually.
Best effort notification @ h_ 502_ 67@ best effort notification is actually a final consistency scheme. It is mainly to send a message to MQ after system a completes the local transaction, and then ask system B to execute the transaction operation. If system B completes the execution, it will consume the message. If system B fails, it will retry and retry multiple times until it succeeds. If it fails after a certain number of times, it can only be intervened manually.