Caching - distributed systems

Jan 25, 2017 00:00 · 1266 words · 6 minutes read distributed systems

Locality

In general, there is some kind of asymmetry in our communication model. This asymmetry includes the difference in speed of access to different media, the difference in speed of access to local and network resources, and the difference in frequency between read and write. We use this asymmetric structure proposed by the design principle is very simple and crude:

Try to avoid slow access

Second, locality is an important assumption that we take effect on caching techniques:

For most computer programs, it is often in a local time period (time locality), only frequent access to all the data in a local (spatial locality)

Only when the hypothesis is true, we make sense for the previous access to the data cache. Because according to the locality, the data of the previous visit is likely to be visited again.

Offline: Considering that the cache is often loaded with a sequence of blocks, those structures that follow this order, such as arrays, and algorithms that use this sequential structure design, we say that they are “cache-friendly “(Cache friendly).

Based on the difference in speed of different media, we have a single layer in the classic “cache - memory - disk” hierarchical cache structure. In the distributed cache structure, this hierarchical structure is not uncommon.

Local cache and proxy

In a specific application for Lock Service, each user may need to detect very frequently the presence of a lock variable. At this point if the lock variable cache in the local, will greatly ease the server’s access pressure. Because of this, in the previous discussion of the write process, we need to revoke the cache to ensure consistency. Can be said to be based on the use of reading than the frequent use of this scene, we made a read and write asymmetric structure.

This read is much more common than writing a cheap asymmetric structure and is most common in systems with “subscribe” attributes. Such as ADSL and some specific database implementation. But we will also see to write cheaper than the negative countdown, that is, BigTable.

Now is the equivalent of our “cache - disk” model. When the cache misses, the user must access the server. When the number of users increased, this missed visit may also be amazing. At this point we need to design a second cache. We add several proxies to the server, each of which maintains a public cache and serves several users. In this way, when the local cache miss, but also a layer of proxy cache protection. And such a cache tree structure can be infinitely unfolded, such as distance from the server side of the physical distance caused by the difference in access speed, the design of multi-level agents, the user directly access the server to further reduce the chance.

An example of a slow access to the cache: a user A, followed by access to data x, y, x; a user B, followed by access to data y, y, z.

  1. When there is no cache, A and B directly to the server to take data, a total of 6 visits;
  2. When the local cache, A and B do not need to visit the duplicate location, A access x, y, B access y, z, a total of four access to the server;
  3. When only the agent, A and B of the request are handed over by the agent, the agent only to the server to request x, y, z each time, resulting in three server access, 6 proxy access;
  4. When there is a local cache and proxy, A only need to ask up x, y, B only need to ask up y, z, proxy to the server to request x, y, z each time, a total of three server-side access, access.

Common application of the local is more significant, that is repeated a lot of frequent requests. At this time, each add a layer of cache level, the equivalent of a power to reduce the frequency of access to the next level. However, it is clear that if two users with no business relevance to the same agent services, this significant locality is likely to exist. If the situation is the former, of course, our application can be added by adding agents to scale.

It should be noted that if multiple caching is used, when the write operation is sent to the server, the cache needs to be recursively reclaimed from top to bottom.

Use Bloom filter to reduce actual access

Bloom filter is a variant of the Hash Set, designed to answer whether a data exists in the database, only when the Bloom filter answer exists, the user really access the database. When the Bloom filter does not exist, it is 100% correct, that is, the user can safely not access the database; when the Bloom filter answer exists, it has a certain false alarm rate, but even so, the user is only increased A visit to the database only, and will not cause greater damage to the system. This is why it is named “filter” - there will always be a leaky fish.

The principle is very simple, in the original hash function corresponds to an array of standard model, we now let a number of different hash function share an array. When a key value is inserted, multiple hash functions are set to the corresponding bit of the array according to the respective hash value. When detecting the existence of a key value, it must be found in the corresponding hash value of all the hash 1, that it exists. Since the hash function is deterministic to the same input hash value, if the key value is actually inserted, the corresponding hash value position must be set. So Bloom filter on the key value does not exist to determine the 100% correct.

But there is still a conflict, for example, in the Bloom filter with three hash functions, x corresponds to the hash value 1, 3, 5, y corresponds to the hash value 2, 4, 6, z corresponds to the hash value 1, 3. When both x and y are inserted, 1, 2, 3 are set so that even if z does not actually exist, the Bloom filter will still be mistaken for existence when judging z.

So here we have a trade-off, we apply Bloom filter objectives are:

  • As little as possible the rate of false positives (which requires fewer insertions and larger array capacity, since the false positive rate is clearly strongly related to the ratio set to 1)
  • As fast as possible (this requires that the number of hash functions be as small as possible, but less the hash function obviously increases the likelihood of conflict)

Also take into account it is filtered out of the table does not exist in the key value of the visit, then if a table key is almost full of the case, the use of Bloom filter does not make sense. Therefore, Bloom filter is suitable for sparse tables.

Data compression

Finally, a little mention of compression, compression of the basic principles are to find similar to remove the redundancy. In a table, it is clear that there is a higher degree of similarity between a row of data than a row of data. So the data compression by column more meaningful. But the drawbacks are obvious: if there is a user want to read a whole line of data, it will span all the columns of compressed files, resulting in very low access efficiency. From here we can see why sometimes writing will be cheaper than reading.