TinyURL
Mar 21, 2017
TinyURL System
If you are not familiar with TinyURL, I will briefly explain it here. TinyURL is a URL shortening service, a web service that provides short aliases that redirect to long URLs.
High Level Idea
At first glance, each long URL and its corresponding alias form a key-value pair, so the idea that immediately strikes our mind is hashing. The question can then be simplified like this: given a URL, how can we find a hash function F that maps the URL to a short alias?
F(URL) = alias
conditions:
- Each URL can only be mapped to a unique alias
- Each alias can be mapped back to a unique URL, and the reverse lookup must be cheap
Naive solution
So to make things easier, we will assume the alias is something like http://tinyurl.com/aliashash, where aliashash is a fixed-length string. If the length is 7 and the characters come from [A-Z, a-z, 0-9], we can serve 62^7 ≈ 3,500 billion URLs.
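A quick sanity check on that number, as a throwaway Python snippet:

```python
# Number of distinct 7-character aliases over [A-Za-z0-9] (62 symbols per position)
print(62 ** 7)  # 3521614606208, i.e. about 3,500 billion aliases
```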
To begin with, let's store all the mappings in a single database. Each entry is then simply a pair: an integer ID and the original long URL.
So when someone hits the tinyURL, we look up the ID and redirect to the corresponding URL.
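Here is a minimal sketch of this naive single-database design, using SQLite; the table name, column names, and function names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect("tinyurl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS mappings (
        id  INTEGER PRIMARY KEY AUTOINCREMENT,  -- the ID behind the alias
        url TEXT NOT NULL                       -- the original long URL
    )
""")

def shorten(url):
    """Store a long URL and return the numeric ID behind its alias."""
    cur = conn.execute("INSERT INTO mappings (url) VALUES (?)", (url,))
    conn.commit()
    return cur.lastrowid

def resolve(alias_id):
    """Look up the long URL for an ID; the web layer would then issue a redirect."""
    row = conn.execute("SELECT url FROM mappings WHERE id = ?",
                       (alias_id,)).fetchone()
    return row[0] if row else None
```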
Performance
How are we going to generate the ID? Can we use a GUID (Globally Unique Identifier)? What would be the pros and cons versus an incremental ID?
On looking further into database insertion and querying, we will notice that random strings as IDs can sacrifice performance a little bit. Insertion is costly because the IDs are not sequential: each write can land on a different index page. With incremental IDs, insertion is much easier, since new rows always go to the last page.
So one way to optimize this is to use incremental IDs: on every creation we increment the ID by 1. We also need a function that maps each integer ID to a 7-char string, effectively a base-62 encoding.
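A minimal sketch of such a mapping, assuming a plain base-62 encoding with zero-padding to keep every alias at exactly 7 characters:

```python
import string

# 62-symbol alphabet: digits, lowercase, uppercase
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62

def encode(num, length=7):
    """Map an integer ID to a fixed-length base-62 alias."""
    chars = []
    while num > 0:
        num, rem = divmod(num, BASE)
        chars.append(ALPHABET[rem])
    # Pad with the zero symbol so every alias is exactly `length` characters
    return "".join(reversed(chars)).rjust(length, ALPHABET[0])

def decode(alias):
    """Map a base-62 alias back to its integer ID."""
    num = 0
    for ch in alias:
        num = num * BASE + ALPHABET.index(ch)
    return num
```

For example, encode(125) returns "0000021" and decode("0000021") returns 125, so the mapping is reversible without any table scan.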
On the flip side, using incremental IDs makes the mapping less flexible: if a user wants to create a custom tinyURL, we cannot provide it. With the GUID-style solution, we can just calculate a hash of the URL and use it as the entry ID. Some traditional hash functions are CRC32 and SHA-1.
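One way that could look, reusing encode() from the sketch above; note that collisions are possible and would need to be detected and resolved in practice, which this sketch omits:

```python
import hashlib

def alias_from_url(url):
    """Derive a 7-char alias directly from the URL via SHA-1.
    Collision handling is deliberately left out of this sketch."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return encode(int(digest, 16) % (62 ** 7))
```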
Now consider the storage cost for each entry, where the ID is a 7-char string. Assuming a max URL length of 2083 characters and 4 bytes per character, each entry takes 7 * 4 bytes + 2083 * 4 bytes ≈ 8.4 KB. If we store a million URL mappings, we need around 8.4 GB of storage.
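Spelled out (under the same 4-bytes-per-character assumption):

```python
ENTRY_BYTES = (7 + 2083) * 4        # 8,360 bytes, roughly 8.4 KB per entry
TOTAL_BYTES = ENTRY_BYTES * 10**6   # 8,360,000,000 bytes, roughly 8.4 GB
```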
Distributed machines
On scaling the application, a single machine is not capable of storing all the mappings. The more general problem is how to store a hash mapping across multiple machines. If you know about key-value stores in distributed systems, you can see that this becomes a complicated process. DynamoDB, a distributed key-value store, is worth checking out if you are more interested.
The basic approach is to figure out which machine to go to based on the ID, following the concept of database sharding. Multiple machines act as proxies, responsible for dispatching each request to the corresponding backend store based on the lookup key. The databases that store the mappings can be split in various ways, for example hash(key) % 1024 to divide the mappings across 1024 machines.
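A sketch of that dispatching logic; Python's built-in hash() is salted per process for strings, so a stable hash (MD5 here) is used instead:

```python
import hashlib

NUM_SHARDS = 1024

def shard_for(key):
    """Route a lookup key to one of the 1024 backend stores."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The proxy layer then dispatches to the chosen store, e.g.:
# store = stores[shard_for(alias)]; long_url = store.get(alias)
```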
A few key terms to keep in mind, each of which makes the system more complicated:
- Replication
- Resharding
- Concurrency
The problem becomes more complicated once you dig into the details, especially around scaling. There are infinite ways to extend this problem further; this is just a head start for designing an application at large scale.