xcodis

Building a High-Availability Distributed Key-Value Store

Preface

There are many awesome and powerful distributed NoSQL databases in the world, like Couchbase, MongoDB, Cassandra, etc., but developing a new one is still a challenging, interesting and attractive thing for me. Why?

  • It can satisfy the special needs of our cloud services.
  • We just need a key-value store with some simple additional functionality; we don’t need a complex solution.
  • We can control the whole thing, especially for fixing bugs and making improvements.
  • Reinventing the wheel is not good, but I can learn a lot in the process.

A key-value store may need the following features:

  • Simple protocol.
  • Simple API.
  • High performance.
  • High availability.
  • Cluster support.

I knew from the start this would be a hard journey. But after a lot of hard work, I developed ledis-cluster, a key-value store based on LedisDB + xcodis + redis-failover.

Thinking About Existing Solutions

Before I developed ledis-cluster, I considered some other solutions that are worth recording here too.

MySQL

Aha, at first I just wanted to use MySQL as a key-value store. This idea surprised my colleagues before, and I think it may surprise many others too.

MySQL is a relational database, but it can be used as a key-value store easily and adequately. We can use a simple table to store key-value data, like below:

CREATE TABLE kv (
    k varbinary(256),
    v blob,
    PRIMARY KEY(k)
) ENGINE=innodb;
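
As a rough sketch of how get and set map onto this table (the Go code and the DSN below are purely illustrative, assuming the github.com/go-sql-driver/mysql driver; they are not part of any project mentioned here):

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL driver; the DSN below is only an example
)

func main() {
	// The DSN (user, password, database name) is hypothetical.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// "set" becomes an upsert on the primary key.
	if _, err := db.Exec(
		"INSERT INTO kv (k, v) VALUES (?, ?) ON DUPLICATE KEY UPDATE v = VALUES(v)",
		[]byte("abc"), []byte("1")); err != nil {
		panic(err)
	}

	// "get" becomes a primary-key lookup.
	var v []byte
	if err := db.QueryRow("SELECT v FROM kv WHERE k = ?", []byte("abc")).Scan(&v); err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", v)
}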

When I worked in Tencent's game infrastructure department, we used this approach to serve many Tencent games, and it worked well.

But I don't want to use MySQL as a key-value store now. MySQL is a little heavy and needs experienced operations people, which is not feasible for our team.

Redis

Redis is an awesome NoSQL database: it has amazing performance, supports many useful data structures (kv, hash, list, set and zset), and supplies a simple protocol for clients.

I have read Redis's code (it is very simple!) many times, have used it for about three years in many production systems, and I am absolutely confident I can maintain it.

But Redis has a serious problem: the memory limitation. We can not store a huge amount of data in one machine. Using a Redis cluster is a good way to work around the memory limitation, and there are many existing solutions, like the official Redis cluster, twemproxy or codis, but I still think something that can store data exceeding the memory limit on one machine is needed, so I developed LedisDB.

LedisDB

LedisDB is a fast NoSQL database, similar to Redis. It has the following good features:

  • Uses the Redis protocol, so most Redis clients can use LedisDB directly.
  • Supports multiple data structures (kv, hash, list, set, zset).
  • Uses rocksdb, leveldb or other fast storage engines as the backend, so it can store data far larger than memory.
  • High performance, see the benchmark. Although it is a little slower than Redis, it can still be used in production.

A simple example:

//start ledis server
ledis-server
//another shell
ledis-cli -p 6380
ledis> set a 1
OK
ledis> get a
"1"

As we can see, LedisDB is simple; we can switch to it easily if we have used Redis before.

LedisDB now supports rocksdb, leveldb, goleveldb, boltdb and lmdb as the backend storage, so we can choose the best one for our environment. In our company's projects we use rocksdb, which has awesome performance and many tunable configuration options, and I will also use it for the following examples.

Data Security Guarantee

LedisDB can store a huge amount of data in one machine, so data safety needs to be considered carefully. LedisDB uses the following mechanisms to guarantee it.

Backup

We can back up LedisDB and restore it later. Saving an RDB in Redis may block the service for some time, but LedisDB doesn't have this problem. Thanks to rocksdb's fast snapshot mechanism, backing up LedisDB is very fast and easy.

Binlog

LedisDB first logs write operations in a binlog, then commits the changes to the backend storage; this is similar to MySQL.

Redis also has AOF, but the AOF file may grow very large, and rewriting the AOF may block the service for some time. LedisDB instead rotates the binlog and writes to a new file when the current binlog exceeds the maximum size (1GB).
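
To make the rotation idea concrete, here is a minimal sketch in Go (an illustration of the mechanism only, not LedisDB's actual binlog code; the file naming scheme and struct are assumptions):

package binlog

import (
	"fmt"
	"os"
)

const maxBinlogSize = 1 << 30 // 1GB, matching the limit described above

type binlog struct {
	dir   string
	index int
	f     *os.File
}

// rotateIfNeeded closes the current binlog file and opens a new one
// once the current file grows past maxBinlogSize.
func (b *binlog) rotateIfNeeded() error {
	st, err := b.f.Stat()
	if err != nil {
		return err
	}
	if st.Size() < maxBinlogSize {
		return nil
	}
	b.f.Close()
	b.index++
	f, err := os.Create(fmt.Sprintf("%s/binlog.%07d", b.dir, b.index))
	if err != nil {
		return err
	}
	b.f = f
	return nil
}

// write appends a log record, rotating first if necessary.
func (b *binlog) write(record []byte) error {
	if err := b.rotateIfNeeded(); err != nil {
		return err
	}
	_, err := b.f.Write(record)
	return err
}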

Replication

An old saying goes: "don't put all your eggs in one basket". Similarly, we shouldn't put all our data on one machine.

LedisDB supports asynchronous or semi-synchronous replication. We can not break the CAP (Consistency, Availability, Partition tolerance) theorem: for replication, partition tolerance must be assumed, so we have to choose between consistency and availability.

If we want to guarantee maximum data safety, we can use semi-synchronous replication, but most of the time asynchronous replication is enough.

Monitor and Failover

In an actual production environment, we use a LedisDB master and one or more slaves to construct the replication topology. We must monitor them in real time because any machine in the topology may go down at any time.

If a slave goes down, we may not care too much; it is not a serious problem. But if the master goes down (aha, a terrible accident!), we must resolve it quickly.

Generally, we can not expect the master to come back quickly and reliably, so electing the best new master from the current slaves and failing over to it is a better approach when the master is down.

Redis uses its sentinel feature to monitor the topology and do failover when the master is down. But this sentinel can not be used with LedisDB, so I developed another sentinel: redis-failover, which monitors and does failover for Redis/LedisDB.

redis-failover uses the ROLE command to check the master and get all its slaves every second. If the master is down, redis-failover will select the best slave from the slaves returned by the last ROLE call. The election algorithm is simple: it uses the INFO command to get the "slave_priority" and "slave_repl_offset" values; if a slave has a higher priority, or a larger replication offset when priorities are equal, that slave will be elected as the new master.
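
A minimal sketch of this election rule in Go (the slave struct is just an illustration; its fields mirror the INFO values described above):

package failover

// slave holds the two INFO fields used by the election:
// slave_priority and slave_repl_offset.
type slave struct {
	addr       string
	priority   int
	replOffset int64
}

// electBestSlave picks the slave with the highest priority,
// breaking ties by the largest replication offset.
func electBestSlave(slaves []slave) (slave, bool) {
	if len(slaves) == 0 {
		return slave{}, false
	}
	best := slaves[0]
	for _, s := range slaves[1:] {
		if s.priority > best.priority ||
			(s.priority == best.priority && s.replOffset > best.replOffset) {
			best = s
		}
	}
	return best, true
}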

redis-failover may have a single point of failure too, so I use zookeeper or raft to support a redis-failover cluster. Zookeeper or raft elects a leader and lets it monitor and do failover; if the leader goes down, a new leader will be elected quickly.

Cluster

Although LedisDB can store a huge amount of data, the growing data may still exceed the capacity of a single system in the near future.

Splitting the data and storing it across multiple machines may be the only feasible way (we don't have the money to buy a mainframe), but how do we split the data? And how do we find the data by key? I think an easy solution is to define a key routing rule (mapping a key to the actual machine).

For example, we have two machines, n0 and n1, and the key routing rule is a simple hash like crc32(key) % 2. For the key "abc", the result is 0, so we know that the corresponding data is on n0.

The above solution is easy, but we can not use it in production. If we add another machine, the machine count becomes 3, all the old key-to-machine mappings break, and we have to relocate a huge amount of data.

Using consistent hashing may be better, but I prefer using a hash + routing table. We don't map a key to a machine directly, but to a virtual node called a slot, and then define a routing table mapping slots to actual machines.

Continuing with the example, assume we use 1024 slots and 2 machines; the slot-to-machine mapping is [slot0 - slot511] -> n0, [slot512 - slot1023] -> n1. For a key, we first use crc32(key) % 1024 to get a slot index, then look up the machine for that slot in the routing table.

This solution may be more complex, but it has a big advantage for re-sharding. If we add another machine n2, we change the routing table to map slot0 to n2, and we only need to migrate the slot0 data from n0 to n2, as shown in the sketch below. The larger the slot count, the less data each slot holds, so we only migrate a little data per slot.
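
A minimal sketch of this slot routing in Go, assuming 1024 slots and the two machines from the example (the node names and the moveSlot helper are illustrative, not xcodis's actual API):

package routing

import "hash/crc32"

const slotNum = 1024

// table maps each slot index to a machine address.
type table struct {
	slots [slotNum]string
}

// newTable assigns slot0-slot511 to n0 and slot512-slot1023 to n1,
// as in the example above.
func newTable(n0, n1 string) *table {
	t := &table{}
	for i := 0; i < slotNum; i++ {
		if i < slotNum/2 {
			t.slots[i] = n0
		} else {
			t.slots[i] = n1
		}
	}
	return t
}

// nodeFor hashes the key to a slot, then looks up the machine.
func (t *table) nodeFor(key []byte) string {
	slot := crc32.ChecksumIEEE(key) % slotNum
	return t.slots[slot]
}

// moveSlot re-shards a single slot to a new machine; only the data in
// that slot needs to be migrated.
func (t *table) moveSlot(slot int, node string) {
	t.slots[slot] = node
}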

xcodis uses the above approach to support a LedisDB cluster. Right now the slot count is 256, which is a little small and may increase the probability of mapping several busy keys into the same slot.

Because of a limitation in the original LedisDB db index implementation, xcodis can not use a slot count larger than 256, so a better way is to support a customized routing entry for a busy key later. For example, for a key, xcodis would first try to find an associated slot in the routing table, and only fall back to the hash if there is no entry.
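
Continuing the routing sketch above (same package and imports), such a per-key override might look like this; the keySlots map and the pinned key are hypothetical, describing the planned feature rather than current xcodis behaviour:

// keySlots pins specific busy keys to a fixed slot; everything else
// falls back to the normal hash routing.
var keySlots = map[string]uint32{
	"hot:counter": 3, // hypothetical busy key pinned to slot 3
}

func slotFor(key []byte) uint32 {
	if slot, ok := keySlots[string(key)]; ok {
		return slot
	}
	return crc32.ChecksumIEEE(key) % slotNum
}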

Another, more radical choice is to change the LedisDB code and upgrade all previously saved data. This would be a huge amount of work, so I will not consider it unless there is no other way to resolve the above problems.

xcodis is a proxy supporting a Redis/LedisDB cluster. The benefit of a proxy is that we can hide all cluster information from clients, and users can use it as easily as a single server.

In addition to a proxy, there are some other ways to support a cluster:

  • The official Redis cluster, but it is still in development, should not be used in production yet, and can not be used with LedisDB.
  • A customized client SDK: the SDK can know the whole cluster layout and do the right key routing for the user. But this approach is not universal, and we would have to write many SDKs for different languages (C, Java, PHP, Go, etc.), which is hard work!

Final Architecture

In the end, the architecture looks like this:

kv architecture

  • Use LedisDB to store a huge amount of data on one machine.
  • Use master/slave replication to guarantee data safety.
  • Use redis-failover to monitor the system and do failover.
  • Use xcodis to support cluster.

This architecture may not be perfect, but it is simple and sufficient for us. Right now we only use LedisDB and xcodis in our projects, not the whole architecture, but we have been testing it and will try to deploy it in production in the near future.

Summary

Building a key-value store is not easy work, and I don't think what I have done above can beat the existing awesome NoSQL databases, but it is a valuable attempt; I have learned a lot and met many new friends in the process.

Right now I am the only person developing the whole thing, and I need help. If you are interested in what I am doing, please contact me; maybe we really can build an awesome NoSQL database. :-)

Mail: siddontang@gmail.com

Github: github.com/siddontang

Building a High-Availability Distributed Key-Value Store Service

Preface

Whenever we build server-side applications, we face the problem of where to store data. Different kinds of data are stored in different ways: relational data is usually stored in MySQL, document data might go into MongoDB, and here we only consider the simplest case, kv (key-value).

There are many use cases for kv. A very typical one is storing user sessions: the key is the user's current session id, and the value is the information that needs to be kept for the current session. Because kv is used in so many places, choosing a good kv service is very important.

For me, a decent kv service may only need to satisfy the following points:

  • Simple protocol
  • High performance
  • High availability
  • Easy to scale out

There are already many kv services on the market that meet these requirements, but in the spirit of "no zuo no die", I decided to build a highly available distributed kv storage service with LedisDB + xcodis + redis-failover.

Existing Solutions

Before going on, I want to talk about some solutions I once considered or have already used.

MySQL

OK, don't laugh, I really do mean MySQL. Even though MySQL is a relational database, its performance for storing kv data is really not bad at all. The table structure is very simple, perhaps like this:

CREATE TABLE kv (
    k VARBINARY(256),
    v BLOB,
    PRIMARY KEY(k)
) ENGINE=innodb;

When I was still in Tencent's Interactive Entertainment division, some game projects used MySQL purely as a kv store, for example to hold player data: the game server reads the data by player id, modifies it, and then writes it back. Given the frightening number of users of Tencent games, the fact that MySQL held up is proof enough that using MySQL as a kv store is feasible.

I don't know how many game projects still do this today, though; personally, using MySQL as a kv service feels a bit like using a sledgehammer to crack a nut, and MySQL is still a bit heavy.

Couchbase

Couchbase is a high-performance distributed NoSQL database; it even supports cross-data-center replication. I studied it for quite a long time, but in the end decided not to adopt it, mainly because I had no confidence I could master its code.

Redis

Redis is a high-performance NoSQL database; besides kv it also provides other data structures such as hash, list, set and zset.

I started using Redis three years ago, and since Redis's code is simple it is easy to understand and control. So even now Redis is my first choice for storing a lot of non-relational data, and naturally I use Redis for kv data too.

But Redis also has some shortcomings, the biggest being the memory limitation: the total amount of data Redis stores had better not exceed physical memory, or performance will suffer. I also find Redis's RDB and AOF mechanisms rather painful: the system may stall during an RDB save, and an AOF rewrite can cause a similar problem.

Because of the memory limit, Redis can not store a very large amount of data. To solve this we can only adopt a cluster solution, but the official Redis cluster is still under development and can not really be used in production. So I developed LedisDB.

LedisDB

I developed LedisDB mainly to solve Redis's memory limitation. Its main features are:

  • Uses the Redis protocol; most Redis clients can use it directly.
  • Provides a Redis-like API, supporting kv, hash, list, set and zset.
  • Stores the actual data in one of several underlying engines, supporting rocksdb (recommended), leveldb, goleveldb, boltdb and lmdb; there is no Redis-style memory limit because the data lives on disk.
  • High performance, see the benchmark; although a little slower than Redis, it is entirely usable in production.

A simple example:

//start ledis server
ledis-server 

//another shell
ledis-cli -p 6380

ledis> set a 1
OK
ledis> get a
"1"

As you can see, LedisDB is very similar to Redis, so users can migrate from Redis to LedisDB very easily. In a real production environment I recommend choosing rocksdb as the underlying storage module: not only is its performance high, it also provides many configuration options that let users tune for their particular situation (of course, understanding that pile of options is quite a pain). All of the LedisDB usage notes below are based on rocksdb.

Data Safety

Although LedisDB can store a large amount of data and is easy to use, as a data storage service the safety of the data is something that needs to be considered very carefully.

  • LedisDB provides dump and load tools, so we can back it up easily. During a dump we simply use rocksdb's snapshot mechanism, which is very fast and does not block the running service. This may be an advantage over Redis's RDB: although a Redis RDB save forks a child process to do the work, if the Redis data set is huge it can still cause Redis to stall.
  • LedisDB provides MySQL-like binlog support: every operation is written to the binlog before it is finally committed to the underlying db. If the service crashes, we can recover the data from the binlog. A binlog file has a size limit; once it exceeds the threshold, LedisDB writes to a new binlog instead of rewriting it the way Redis's AOF does.
  • LedisDB supports synchronous or asynchronous replication. Synchronous replication guarantees strong consistency of the data but sacrifices performance, while asynchronous replication is efficient but may lose data. This is really a CAP choice: given that P (partition tolerance) always exists, do we choose C (consistency) or A (availability)? Usually I choose A.

Failover

In production, to keep the data safe, a master is usually given one or more slaves (I like to call this the replication topology). When the master goes down, the monitoring system selects the best slave (that is, the one holding the most of the master's data), promotes it to be the new master, and points the other slaves at the new master. This whole flow is what we usually call failover.

Redis provides the sentinel mechanism to implement failover for the whole replication topology. But sentinel is tied to Redis and can not be used directly with LedisDB, so I developed redis-failover, a sentinel that supports failover for Redis or LedisDB.

redis-failover periodically sends the role command to the master to learn the current replication topology, mainly the slaves' information. When the master goes down, redis-failover picks the best slave from the previously obtained slaves and promotes it to master. The election algorithm is simple: use the info command to get "slave_priority" and "slave_repl_offset"; the slave with the highest priority wins, and if the priorities are equal, the one with the largest replication offset wins.

redis-failover itself can be a single point of failure, so redis-failover needs to support clustering. A redis-failover cluster internally elects a leader that performs the actual monitoring and failover; when the leader goes down, a new election is held.

At this stage, redis-failover can do leader election through an external zookeeper, and it also supports internal leader election using the raft algorithm.

Distributed Cluster

As the amount of data keeps growing, a single machine will eventually be unable to store it all, and we have to solve this with a cluster, that is, by spreading the data across different machines.

To build a LedisDB cluster I considered the following three approaches. I won't go into plain hash-mod or consistent hashing here; if a cluster could really be handled by those two techniques alone, why go to all this trouble?

  • Redis cluster.

    redis cluster is the cluster solution provided officially by Redis; it has high performance and supports resharding. But even now, redis cluster is still under development, and at least I don't dare to use it in production. Besides, I find it really complex; better not to waste brain cells trying to master that architecture.

  • A customized client.

    With a customized client we can know the routing rules for different keys and therefore locate the actual data. A friend of mine is working on this, but a customized client has one serious problem: every client has to implement it itself, so it is not really a universal solution.

  • Proxy

    I remember someone saying that any problem in computer science can be solved by adding another layer of indirection, and a proxy is exactly such a layer for the cluster problem.

    Twemproxy is a decent option, but it does not support resharding, and it seems Twitter no longer uses it internally, so I did not consider it.

    I had originally planned to write a proxy myself, but then codis appeared: a distributed proxy that supports resharding and has been proven in Wandoujia's production environment, so I decided to use codis right away.

    But codis does not support LedisDB, and to satisfy their own needs they use a modified Redis, so I implemented xcodis, a codis-based proxy that supports LedisDB as well as vanilla Redis.

Architecture

The final architecture is as follows.

kv architecture

We use LedisDB to solve Redis's single-machine data capacity problem, the replication mechanism to keep the data safe, redis-failover to handle failover, and finally xcodis for cluster management.

At the moment this architecture has not yet been proven in production, but we keep testing it internally, and there are also users abroad helping me verify it, so I am quite confident about it and hope to put it online soon. If anyone is also interested in this architecture and wants to be an early adopter, I'd be very glad to provide support.