Meaning of consistency in distributed systems

Jan 9, 2017 00:00 · 1300 words · 7 minutes read distributed systems

Consistency and locality are two goals that every distributed system pursues, yet they pull against each other. First, the data must be consistent to some degree so that the computation stays logically correct. Second, you want the data to be as close to its readers as possible, so that the system performs better. But keeping data close to readers who are spread across many places inevitably requires multiple local caches, which makes consistency harder to achieve. So most problems in distributed systems turn out to be trade-offs. Designers must decide how much consistency and how much locality the system needs, based on the specific requirements of the application.

Consistency

Eventual consistency

Eventual consistency is a relatively weak requirement. After a modification, other parts of the system may keep reading the old data for a while, but sooner or later they will see the latest change. In general, eventual consistency is sufficient when the data is stateless or has no timing requirements. Most user-facing systems are eventually consistent. Take a cloud drive: after you modify a file on your phone, the desktop client may take a few seconds or even longer to show the latest version.

Eventual consistency is simple to implement: the system only has to guarantee that a write, once accepted, will eventually be applied. When a user sends a write request, the system first saves the request as a log entry on persistent media (usually a hard disk) to guarantee the durability of the change, and can then report success to the user. After that, the system actually applies the change and sends a message notifying all other users of it. If sending fails, it retries. So sooner or later the change spreads to every corner of the system.

While the data has not yet converged to its final consistent state, stale reads can easily cause problems. This situation often occurs inside a system. A classic example is the Java threading model:
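Before turning to that example, the write path described above (log the request, acknowledge, apply, then notify with retries) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names (`EventuallyConsistentStore`, `FlakyReplica`, `step`, `deliver`) are hypothetical, the "disk" is an in-memory list, and the background work is driven by explicit calls to `step()` instead of a real thread or network.

```python
import json


class FlakyReplica:
    """Hypothetical replica that drops the first `failures` deliveries."""

    def __init__(self, failures):
        self.failures = failures
        self.data = {}

    def deliver(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            return False  # simulated send failure
        self.data[key] = value
        return True


class EventuallyConsistentStore:
    """Toy primary: log first, acknowledge, apply and propagate later."""

    def __init__(self, replicas):
        self.log = []                     # stand-in for a durable on-disk log
        self.data = {}                    # state applied on the primary
        self.applied = 0                  # number of log entries applied so far
        self.replicas = replicas
        self.sent = [0] * len(replicas)   # per-replica position in the log

    def write(self, key, value):
        # Persist the request, then report success right away.
        self.log.append(json.dumps({"key": key, "value": value}))
        return "ok"

    def step(self):
        # Apply every logged entry that has not been applied yet.
        while self.applied < len(self.log):
            entry = json.loads(self.log[self.applied])
            self.data[entry["key"]] = entry["value"]
            self.applied += 1
        # Send entries to each replica in log order; a failed send is
        # retried from the same entry on the next call to step().
        for i, replica in enumerate(self.replicas):
            while self.sent[i] < self.applied:
                entry = json.loads(self.log[self.sent[i]])
                if not replica.deliver(entry["key"], entry["value"]):
                    break
                self.sent[i] += 1
```

Sending entries in log order per replica is a design choice of this sketch: it keeps a retried old value from overwriting a newer one. Until enough `step()` calls have succeeded, a replica can still return stale data, which is exactly the window that eventual consistency allows.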

Example: Java threading model

Variables shared across threads are eventually consistent at best, unless they are declared volatile

When you modify a shared variable, the change is not necessarily visible to the other threads right away. Each thread may keep working with its own copy of the value (for example in a register or a CPU cache), and without volatile or other synchronization the Java memory model gives no guarantee about when, or in what order, the change becomes visible to them. This is the origin of the classic visibility problem.

This design choice was made for performance. Of course, the threading model as a whole also exists for the sake of performance. The price of that performance is that the programmer takes on more responsibility for writing correct code.

Strict consistency

On the other hand, when the coexistence of old and new data could affect the correctness or usability of the application, we must use strict consistency. It guarantees that every modification becomes visible to all readers immediately. Such a strong guarantee brings many desirable properties, but as the analysis below shows, it also costs a great deal of performance. So we generally use strict consistency only where it is most needed.

Taking Chubby, Google’s distributed lock service, as an example, the following steps describe how a write completes under strict consistency.

Invalidation

When the system receives a write request, it first sends a message to every user that caches the data at this location, telling them to invalidate that cache entry. The system waits until all of these users have acknowledged the invalidation before moving on to the next step.

Once this step completes, we can be sure that no cached copy of the old data remains anywhere in the system.

Execute the commit

As before, the system first writes the operation to its log and then actually performs it. If the system crashes after writing the log entry, the recovery process replays the logged operation. In other words, as long as the hard disk does not fail at the same time, a change recorded in the log is certain to be applied eventually.

Propagate events

This step is not needed for consistency; Chubby keeps it only for backward compatibility. After the modification, Chubby sends a “data has changed” notification to all users subscribed to this location.

Reply to the requesting party

After completing the steps above and receiving acknowledgements from all the users involved, the system returns a success message to the writer. As the following discussion shows, the performance cost lies not only in the communication round trips; a far greater cost is incurred when a communication times out.
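The four steps above can be put together in a short Python sketch. It is not Chubby's actual implementation: `StrictServer`, `Client` and their methods are hypothetical names, everything runs in one process, and every client is assumed to be reachable and to acknowledge immediately.

```python
class Client:
    """Hypothetical client with a local cache and an event inbox."""

    def __init__(self, name):
        self.name = name
        self.cache = {}
        self.events = []

    def read(self, server, key):
        # Serve from the local cache; fetch (and register as a cacher) on a miss.
        if key not in self.cache:
            self.cache[key] = server.fetch(self, key)
        return self.cache[key]

    def invalidate(self, key):
        self.cache.pop(key, None)
        return True  # acknowledgement back to the server

    def notify(self, key):
        self.events.append(key)


class StrictServer:
    def __init__(self):
        self.log = []          # stand-in for the durable log
        self.data = {}
        self.cachers = {}      # key -> set of clients caching that key
        self.subscribers = {}  # key -> list of clients subscribed to changes

    def fetch(self, client, key):
        self.cachers.setdefault(key, set()).add(client)
        return self.data.get(key)

    def subscribe(self, client, key):
        self.subscribers.setdefault(key, []).append(client)

    def write(self, key, value):
        # 1. Invalidation: every cacher must acknowledge before we continue.
        for client in self.cachers.pop(key, set()):
            if not client.invalidate(key):
                # Assumption: a real server would wait for the lease to expire.
                raise RuntimeError("client did not acknowledge invalidation")
        # 2. Commit: log the operation first, then apply it.
        self.log.append((key, value))
        self.data[key] = value
        # 3. Event propagation (not required for consistency).
        for client in self.subscribers.get(key, []):
            client.notify(key)
        # 4. Reply to the requesting party.
        return "ok"
```

Because the write does not return until every cached copy has been invalidated, a later `read` from any client misses its cache and fetches the new value from the server.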

Lease

The lease is arguably one of the most practical ways of coping with communication over unreliable channels, a problem often loosely filed under the “Byzantine generals” problem (strictly speaking, it is closer to the Two Generals problem). It also degrades gracefully when either end fails, which makes it very well suited to building distributed systems.

Simply put, the server and the user agree on a lease term, and each side keeps its own timer. To keep the session alive, one or both parties are responsible for renewing the lease with the other before the term ends. As long as the lease keeps being renewed, the session continues. When a renewal fails, both sides still maintain the session until the lease expires. At that point both sides consider the session broken and can release its resources to serve other sessions.

Because of the clock error discussed before, the server waits somewhat longer than the user’s lease term, which avoids race conditions caused by clock error and communication delay. Waiting a fixed extra amount of time is, in effect, equivalent to the server having the slowest-running clock. Since it is generally the server that releases and reuses resources, giving the server the slowest clock guarantees that no race condition occurs.
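A minimal sketch of this asymmetry follows, assuming a hypothetical `Lease` class and a `FakeClock` that the example advances by hand. For simplicity both sides read the same fake clock here; the server's extra `grace` period stands in for clock error and message delay, and its size is an arbitrary illustrative value.

```python
class FakeClock:
    """Manually advanced clock so that the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Lease:
    """One side's view of a lease, measured on that side's clock."""

    def __init__(self, clock, duration, grace=0.0):
        self.clock = clock
        self.duration = duration
        self.grace = grace  # extra waiting time; only the server sets this
        self.expires_at = clock.now + duration + grace

    def renew(self):
        # A successful renewal restarts the term from the current time.
        self.expires_at = self.clock.now + self.duration + self.grace

    def is_valid(self):
        return self.clock.now < self.expires_at


clock = FakeClock()
client_lease = Lease(clock, duration=16.0)             # the user's view
server_lease = Lease(clock, duration=16.0, grace=2.0)  # the server waits longer
```

With these numbers, 16 seconds after the last renewal the user already treats the lease as expired and stops using its cache, while the server keeps the session for a further 2 seconds before reclaiming its resources. The server therefore never reuses a resource that the user still believes it holds.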

The earlier discussion of invalidating caches and waiting for the users to reply did not mention what happens when communication with a user is interrupted. In fact, most distributed systems use a lease of this kind to handle that case. In Chubby, when communication is normal, sending the invalidation messages and collecting the replies may take on the order of milliseconds. Once communication with a single user is interrupted, the server waits a full 16 seconds until the lease expires. The time lost is then several orders of magnitude above the average. This is a common theme in distributed systems: we care more about the worst case under failure than about the average case in normal operation.
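The gap between the normal case and the failure case can be made concrete with a small calculation. The function below is a hypothetical helper, not part of Chubby; it assumes the server contacts all users in parallel and, for an unreachable user, waits for one full lease term (in practice only the remainder of the lease would be left). The millisecond delays are illustrative values.

```python
def invalidation_wait(ack_delays, lease_seconds):
    """Time the server spends in the invalidation step, in seconds.

    ack_delays holds one entry per user: the acknowledgement delay in
    seconds, or None if the user is unreachable.
    """
    waits = [
        lease_seconds if delay is None else min(delay, lease_seconds)
        for delay in ack_delays
    ]
    # Requests go out in parallel, so the slowest user decides the total.
    return max(waits, default=0.0)


normal = invalidation_wait([0.002, 0.005], lease_seconds=16.0)
one_down = invalidation_wait([0.002, None], lease_seconds=16.0)
```

Here `normal` is 5 milliseconds while `one_down` is the full 16 seconds, a slowdown by a factor of several thousand caused by a single unreachable user.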

In Chubby’s example, the lease duration is 16 seconds or less. This means that a user can keep cached data for at most 16 seconds without renewing, and that the user is responsible for reporting its health to the server and renewing the lease within those 16 seconds.

We sometimes call these periodic, time-limited messages “heartbeats”.