Build up a High Availability Distributed Key-Value Store


There are many awesome and powerful distributed NoSQL in the world, like Couchbase, MongoDB, Canssandra, etc. but developing a new one is still a challengeable, interesting and attractive thing for me, why?

  • It can satisfy our special needs for our cloud services.
  • We just need a key-value store, with some simple additional functionalities, we don’t need a complex solution.
  • We can control the whole thing, especially for fixing bugs and improvement.
  • Inventing the wheel is not good, but I can learn much in the process.

A key-value store may need below features:

  • Simple protocol.
  • Simple API.
  • High performance.
  • High availability.
  • Cluster support.

I knew this would be a hard journey first. But after a long hard work, I develop ledis-cluster, a key-value store based on LedisDB + xcodis + redis-failover.

Pre Solution Thinking

Before I develop ledis-cluster, I thought some other solutions which are valuable to be recorded here too.


Aha, first I just wanted to use MySQL as a key-value store. This thought amazed my colleagues before, and I think now it may surprise many other guys too.

MySQL is a relational database and can be used as a key-value store easily and sufficiently. We can use a simple table to store key value data like below:

    k varbinary(256),
    v blob,
) ENGINE=innodb;

When I worked in Tencent game infrastructure department, we used this way to serve many Tencent games and it works well.

But I don’t want to use MySQL as a key-value store now, MySQL is a little heavy and needs some experienced operations people, this is impossible for our team.


Redis is an awesome NoSQL, it has an amazing performance, supports many useful data structures (kv, hash, list, set and zset), supplies a simple protocol for client user.

I have read the Redis’s code (it is very simple!) many times, used it for about three years in many productions, and I am absolutely confident of maintaining it.

But Redis has a serious problem: memory limitation. We can not store huge data in one machine. Using redis cluster is a good way to solve memory limitation, and there are many existing solutions, like official redis cluster, twemproxy or codis , but I still think another stuff saving huge data exceeding memory limitation in one machine is needed, so I develop LedisDB.


LedisDB is a fast NoSQL, similar to Redis. It has some good features below:

  • Uses Redis protocol, most of the Redis clients can use LedisDB directly.
  • Supports multi data structures(kv, hash, list, set, zset).
  • Uses rocksdb, leveldb or other fast databases as the backend to store huge data, exceeding memory limitation.
  • High performance, see benchmark. Although it is a little slower than Redis, it can still be used in production.

A simple example:

//start ledis server
//another shell
ledis-cli -p 6380
ledis> set a 1
ledis> get a

As we see, LedisDB is simple, we can switch to it easily if we used Redis before.

LedisDB now supports rocksdb, leveldb, goleveldb, boltdb and lmdb as the backend storage, we can choose the best one for our actual environment. In our company projects, we use rocksdb which has a awesome performance and many configurations to be tuned, and I will also use it for the following example.

Data Security Guarantee

LedisDB can store huge data in one machine, so the data security needs to be considered cautiously. LedisDB uses below ways to guarantee it.


We can back up LedisDB and then restore later. Redis saving RDB may block service for some time, but LedisDB doesn’t have this problem. Thanks to rocksdb fast generating snapshot technology, backing up LedisDB is very fast and easy.


LedisDB will first log write operations in binlog, then commit changes into backend storage, this is similar to MySQL.

Redis also has AOF, but the AOF file may grow largely, then rewriting AOF may also block service for some time. LedisDB will rotate binlog and write to the new one when current binlog is larger than maximum size (1GB).


An old saying goes like this: “don’t put all your eggs in one basket”. Similarly, don’t put all our data in one machine.

LedisDB supports asynchronous or semi-synchronous replication. We can not break CAP(Consistency, Availability, Partition tolerance) theorem, for replication, partition tolerance must exist, so we have to choose between consistency and availability.

If we want to guarantee full data security, we may use semi-synchronous replication, but most of time, asynchronous replication is enough.

Monitor and Failover

In the actual production environment, we use a master LedisDB and one or more slaves to construct the topology. We must monitor them in real time because any machine in the topology may be down at any time.

If a slave is down, we may not care too much, this is not a serious problem. But if the master is down (aha, a terrible accident!), we must resolve it quickly.

Generally, we can not expect the master to re-work quickly and infallibly, so electing a best new master from current slaves and doing failover is a better way when master is down.

Redis uses a sentinel feature to monitor the topology and do failover when the master is down. But this sentinel can not be used in LedisDB, so I develop another sentinel: redis-failover, monitoring and doing failover for Redis/LedisDB.

redis-failover uses ROLE command to check master and get all slaves every second. If the master is down, redis-failover will select the best slave from last ROLE returned slaves. The election algorithm is simple, using INFO command to get “slave_priority” and “slave_repl_offset” value, if a slave has a higher priority or a larger repliction offset with same priority, the slave will be elected as the new master.

redis-failover may have single point problem too, I use zookeeper or raft to support redis-failover cluster. Zookeeper or raft will elect a leader and let it monitor and do failover, if the leader is down, a new leader will be elected quickly.


Although LedisDB can store huge data, the growing data may still exceed the capability of the system in the near future.

Splitting data and storing them into multi machines may be the only feasible way(We don’t have money to buy a mainframe), but how to split the data? and how to find the data by a key? I think an easy solution is to define a key routing rule (mapping key to the actual machine).

For example, we have two machines, n0 and n1, and the key routing rule is simple hash like crc32(key) % 2. For key “abc”, the calculation result is 0, so we know that the corresponding data is in n0.

The above solution is easy, but we can not use it in production. If we add another machine, the machine number is 3, all the old data mapping relationship will be broken, and we have to relocate huge amount of data.

Using consistency hash may be better, but I prefer using hash + routing table. We don’t map a key to a machine directly, but to a virtual node named slot, then define a routing table mapping slot to the actual machine.

Continuing with the example, assume we use 1024 slots and 2 machines, the slot and machine mapping is [slot0 — slot511] -> n0, [slot512 — slot1023] -> n1. For a key, first using crc32(key) % 1024 to get a slot index, then we can find the machine with this slot from the routing table.

This solution may be complex but have a big advantage for re-sharding. If we add another machine n2, change the routing table that mapping slot0 to n2, and we only need to migrate all slot0 data from n0 to n2. The bigger for slot number, the smaller for split data in a slot, and we only migrate little data for one slot.

xcodis uses above way to support LedisDB cluster. Now the slot number is 256, which is a little small that may increase the probability of mapping some busy keys into a slot.

Because of origin LedisDB db index implementation limitation, xcodis can not use bigger slot number than 256, so a better way is to support customizing a routing table for a busy key later. For example, for a key, xcodis should first try to find the associated slot in the routing table, if not found, then use hash.

Another radical choice is to change LedisDB code and upgrade all data saved before. This is may be a huge work, so I will not consider it unless I have no idea to resolve above problems.

xcodis is a proxy supporting redis/LedisDB cluster, the benefit of proxy is that we can hide all cluster information from client users and users can use it easily like using a single server.

In addition to proxy, there are also some other ways to support cluster too:

Official Redis cluster, but it is still in development and should not be used in production now, and it can not be used in LedisDB.
Customizing client SDK, the SDK can know whole cluster information and do the right key routing for the user. But this way is not universal and we must write many SDKs for different languages (c, java, php, go, etc.), a hard work!

Final Architecture

At last, the final architecture may look below:

kv architecture

  • Use LedisDB to save huge data in one machine.
  • Use Master/slave to guarantee data security.
  • Use redis-failover to monitor the system and do failover.
  • Use xcodis to support cluster.

This architecture may be not perfect, but is simple and enough for us. Now we have only use LedisDB and xcodis in our projects, not the whole architecture, but we have been testing and will try to deploy it in production in the near future.


Building up a key-value store is not a easy work, and I don’t think what I do above can beat other existing awesome NoSQLs, but it’s a valuable attempt, I have learned much and meet many new friends in the progress.

Now, I’am the only person to develop the whole thing and need help, if you have interested in what I do, please contact me, maybe we really can build up an awesome NoSQL. :-)

Mail: siddontang@gmail.com

Github: github.com/siddontang




kv的使用场景很多,一个很典型的场景就是用户session的存放,key为用户当前的session id,而value则是用户当前会话需要保存的一些信息。因为kv的场景很多,所以选择一个好的kv服务就很重要了。


  • 协议简单
  • 高性能
  • 高可用
  • 易扩容

市面上已经有很多满足条件kv服务,但笔者秉着no zuo no die的精神,决定使用LedisDB + xcodis + redis-failover来构建一个高可用分布式kv存储服务。





    k VARBINARY(256),
    v BLOB,
) ENGINE=innodb;




Couchbase是一个高性能的分布式NoSQL,它甚至能支持跨data center的备份。笔者研究了很长一段时间,但最终并没有决定采用,主要笔者没信心去搞定它的代码。








  • 采用Redis协议,大部分Redis的client都能直接使用。
  • 提供类似Redis的API,支持kv,hash,list,set,zset。
  • 底层采用多种db存储实际数据,支持rocksdb(推荐),leveldb,goleveldb,boltdb,lmdb,没有Redis内存限制问题,因为将数据放到硬盘里面了。
  • 高性能,参考benchmark,虽然比Redis略慢,但完全可用于生产环境。


//start ledis server

//another shell
ledis-cli -p 6380

ledis> set a 1
ledis> get a




  • LedisDB提供了dump和load工具,我们可以很方便的对其备份。在dump的时候,我们仅仅使用的是rocksdb的snapshot机制,非常快速,同时不会阻塞当前服务。这点可能是相对于Redis RDB的优势。虽然Redis的RDB在save的时候也是fork一个子进程进行处理,但如果Redis的数据量巨大,仍然可能造成Redis的卡顿。
  • LedisDB提供类似MySQL的binlog支持,任何操作都是写入binlog之后再最终提交到底层db的。如果服务崩溃,我们能通过binlog进行数据恢复。binlog文件有大小限制,当超过阀值之后,LedisDB会写入一个新的binlog中,而不是像Redis的AOF一样进行rewrite处理。
  • LedisDB支持同步或者异步replication,同步复制能保证数据的强一致,但是会牺牲系统的性能,而异步复制虽然高效,但可能会面对数据丢失问题。这其实就是一个CAP选择问题,在P(partition tolerance)铁定存在的情况下,选择C(consistency)还是选择A(availability)?通常情况下,笔者会选择A。


在生产环境中,为了保证数据安全,一个master我们会通常配备一个或者多个slave(笔者喜欢将其称为replication topology),当master当掉的时候,监控系统会选择一个最优的slave(也就是拥有master数据最多的那个),将其提升为新的master,并且将其他slave指向该new master。这套流程也就是我们通常说的failover。

Redis提供了sentinel机制来实现整个replication topology的failover。但sentinel是跟redis绑定的,所以不能直接在LedisDB上面使用,所以笔者开发了redis-failover,一个能支持redis,或者LedisDB failover的sentinel。

redis-failover通过定期向master发送role命令来获知当前replication topology,主要是slaves的信息。当master当掉之后,redis-failover就会从先前获取的slaves里面选择一个最优的slave,提升为master,选择最优的算法很简单,通过info命令得到”slave_priority”和”slave_repl_offset”,如果哪个slave的priority最大,就选择那个,如果priority都一样,则选择replication offset最大的那个。





要构建LedisDB的cluster,笔者考虑了如下三种方案,这里,我们不说啥hash取模或者consistency hash了,如果cluster真能通过这两种技术简单搞定,那还要这么费力干啥。

  • Redis cluster。

    redis cluster是redis官方提供的cluster解决方案,性能高,并且能支持resharding。可是直到现在,redis cluster仍处于开发阶段,至少笔者是不敢将其用于生产环境中。另外,笔者觉得它真的很复杂,还是别浪费脑细胞去搞定这套架构了。

  • 定制client。


  • Proxy

    记得有人说过,计算机科学领域的任何问题, 都可以通过添加一个中间层来解决。而proxy则是用来解决cluster问题的一个中间层。






kv architecture