redis数据结构——List的底层实现

本文描述redis底层的list所用到的数据结构，包括ziplist和quicklist

Redis List 要解决的问题

Redis的List是一个有序表，可以从表两端存入数据或弹出数据，可以通过下标索引访问表中节点，可以查询表中数据是否存在

LinkedList 双向链表

redis3之后已不用linked list做为LIST的底层数据结构，为了引出后面的ziplist和quicklist，这里做简单的介绍

LinkedList是一个双向链表，每个节点都占用独立的一块内存，每个节点有指向前后的指针，首尾插入和删除的时间复杂度是O(1)，中间插入和查找的时间复杂度是O(n)

优点：便于在表的两端追加和删除数据

缺点：带来大量的内存碎片化，前后指针会占用额外的内存

Ziplist 压缩表

为什么要设计压缩表

由于LinkedList存在的问题导致空间利用率不高，我们希望设计一个存储效率高的数据结构。

不能用定长数组，节点的长度无法设定，太长存在浪费，太短可能放不下

所以ziplist 中节点采用了变长编码方式

结构

头部

zlbytes：ziplist占用的总字节数，长度4字节

zltail：ziplist表中最后一个entry在ziplist的偏移字节数，为了通过头地址找到最后一个节点的地址，长度4字节

zllen：ziplist表中 entry 的个数，长度2字节

数据部分

entry的结构

ziplist中每个节点<entry>由prevrawlen、encoding、data构成：

prevrawlen：表示前一个entry占用的字节数，用于向前遍历，变长

encoding：表示当前entry的data的编码格式，用于读取数据和向后遍历，变长

data：entry的数据部分，是一个字节数组

为了提高内存空间利用率，entry会根据存放的数据长度而采用变长的编码，通过prevrawlen和encoding两个字段来控制

prevrawlen 的编码

二进制	表示
0000,0000（0）	无前置元素
0000,0001 ~ 1111,1101（1-253）	前entry的字节数可以用一个字节表示，直接读该值即可
1111,1110（254）	表示前置entry的字节数用一个字节不够存放，再开辟4字节，读后4字节的值

encoding 的编码

编码类型	运算	二进制	表示	编码长度	data长度
ZIP_STR_06B	0<<6	0000,0000	前2位表示06B字符串，后6位表示data长度，最大表示2^6-1=63字节	1Byte	小于63Byte
ZIP_STR_14B	1<<6	0100,0000	前2位表示14B字符串，还要加上1字节，共6+8=14位表示长度，最大表示2^14-1	14bit	小于2^14-1 Byte
ZIP_STR_32B	2<<6	1000,0000	前2位表示32B字符串，后6位不要了，再加上4字节，共4*8=32位表示长度，最大表示2^32-1	4Byte	小于2^32-1Byte
ZIP_INT_16B	0xc0\|0<<4	1100,0000	前2位表示整型，3-4位表示data在后2字节	1Byte	2Byte
ZIP_INT_32B	0xc0\|1<<4	1101,0000	前2位表示整型，3-4位表示data在后4字节	1Byte	4Byte
ZIP_INT_64B	0xc0\|2<<4	1110,0000	前2位表示整型，3-4位表示data在后8字节	1Byte	8Byte
ZIP_INT_24B	0xc0\|3<<4	1111,0000	前2位表示整型，3-4位表示data在后面3字节	1Byte	3Byte
ZIP_INT_8B	0xfe	1111,1110	前2位表示整型，3-8位表示再读1字节	1Byte	1Byte
ZIP_INT_IMM_MASK	0x0f	1111,(0001~1101)	不需要再读，后4位直接表示整数，范围是0~12	4bit	4bit

结束符

zlend：全1的一个字节

举个例子

源码

/* Fills a struct with all information about an entry.
 * This function is the "unsafe" alternative to the one below.
 * Generally, all function that return a pointer to an element in the ziplist
 * will assert that this element is valid, so it can be freely used.
 * Generally functions such ziplistGet assume the input pointer is already
 * validated (since it's the return value of another function). */
static inline void zipEntry(unsigned char *p, zlentry *e) {
    ZIP_DECODE_PREVLEN(p, e->prevrawlensize, e->prevrawlen);
    ZIP_ENTRY_ENCODING(p + e->prevrawlensize, e->encoding);
    ZIP_DECODE_LENGTH(p + e->prevrawlensize, e->encoding, e->lensize, e->len);
    assert(e->lensize != 0); /* check that encoding was valid. */
    e->headersize = e->prevrawlensize + e->lensize;
    e->p = p;
}

/* Different encoding/length possibilities */
#define ZIP_STR_MASK 0xc0
#define ZIP_INT_MASK 0x30
#define ZIP_STR_06B (0 << 6)
#define ZIP_STR_14B (1 << 6)
#define ZIP_STR_32B (2 << 6)
#define ZIP_INT_16B (0xc0 | 0<<4)
#define ZIP_INT_32B (0xc0 | 1<<4)
#define ZIP_INT_64B (0xc0 | 2<<4)
#define ZIP_INT_24B (0xc0 | 3<<4)
#define ZIP_INT_8B 0xfe
/* 4 bit integer immediate encoding |1111xxxx| with xxxx between
 * 0001 and 1101. */
#define ZIP_INT_IMM_MASK 0x0f   /* Mask to extract the 4 bits value. To add
                                   one is needed to reconstruct the value. */
#define ZIP_INT_IMM_MIN 0xf1    /* 11110001 */
#define ZIP_INT_IMM_MAX 0xfd    /* 11111101 */

ziplist 的操作

unsigned char *ziplistNew(void);
unsigned char *ziplistMerge(unsigned char **first, unsigned char **second);
unsigned char *ziplistPush(unsigned char *zl, unsigned char *s, unsigned int slen, int where);
unsigned char *ziplistIndex(unsigned char *zl, int index);
unsigned char *ziplistNext(unsigned char *zl, unsigned char *p);
unsigned char *ziplistPrev(unsigned char *zl, unsigned char *p);
unsigned int ziplistGet(unsigned char *p, unsigned char **sval, unsigned int *slen, long long *lval);
unsigned char *ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen);
unsigned char *ziplistDelete(unsigned char *zl, unsigned char **p);
unsigned char *ziplistDeleteRange(unsigned char *zl, int index, unsigned int num);
unsigned int ziplistCompare(unsigned char *p, unsigned char *s, unsigned int slen);
unsigned char *ziplistFind(unsigned char *p, unsigned char *vstr, unsigned int vlen, unsigned int skip);
unsigned int ziplistLen(unsigned char *zl);
size_t ziplistBlobLen(unsigned char *zl);
void ziplistRepr(unsigned char *zl);
int ziplistSafeToAdd(unsigned char* zl, size_t add);

插入逻辑分析

/* Insert item at "p". */
unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
    size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen;
    unsigned int prevlensize, prevlen = 0;
    size_t offset;
    int nextdiff = 0;
    unsigned char encoding = 0;
    long long value = 123456789; /* initialized to avoid warning. Using a value
                                    that is easy to see if for some reason
                                    we use it uninitialized. */
    zlentry tail;

    /* Find out prevlen for the entry that is inserted. */
    if (p[0] != ZIP_END) {
        ZIP_DECODE_PREVLEN(p, prevlensize, prevlen);
    } else {
        unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
        if (ptail[0] != ZIP_END) {
            prevlen = zipRawEntryLength(ptail);
        }
    }

    /* See if the entry can be encoded */
    if (zipTryEncoding(s,slen,&value,&encoding)) {
        /* 'encoding' is set to the appropriate integer encoding */
        reqlen = zipIntSize(encoding);
    } else {
        /* 'encoding' is untouched, however zipStoreEntryEncoding will use the
         * string length to figure out how to encode it. */
        reqlen = slen;
    }
    /* We need space for both the length of the previous entry and
     * the length of the payload. */
    reqlen += zipStorePrevEntryLength(NULL,prevlen);
    reqlen += zipStoreEntryEncoding(NULL,encoding,slen);

    /* When the insert position is not equal to the tail, we need to
     * make sure that the next entry can hold this entry's length in
     * its prevlen field. */
    int forcelarge = 0;
    nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0;
    if (nextdiff == -4 && reqlen < 4) {
        nextdiff = 0;
        forcelarge = 1;
    }

    /* Store offset because a realloc may change the address of zl. */
    offset = p-zl;
    zl = ziplistResize(zl,curlen+reqlen+nextdiff);
    p = zl+offset;

    /* Apply memory move when necessary and update tail offset. */
    if (p[0] != ZIP_END) {
        /* Subtract one because of the ZIP_END bytes */
        memmove(p+reqlen,p-nextdiff,curlen-offset-1+nextdiff);

        /* Encode this entry's raw length in the next entry. */
        if (forcelarge)
            zipStorePrevEntryLengthLarge(p+reqlen,reqlen);
        else
            zipStorePrevEntryLength(p+reqlen,reqlen);

        /* Update offset for tail */
        ZIPLIST_TAIL_OFFSET(zl) =
            intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen);

        /* When the tail contains more than one entry, we need to take
         * "nextdiff" in account as well. Otherwise, a change in the
         * size of prevlen doesn't have an effect on the *tail* offset. */
        zipEntry(p+reqlen, &tail);
        if (p[reqlen+tail.headersize+tail.len] != ZIP_END) {
            ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
        }
    } else {
        /* This element will be the new tail. */
        ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl);
    }

    /* When nextdiff != 0, the raw length of the next entry has changed, so
     * we need to cascade the update throughout the ziplist */
    if (nextdiff != 0) {
        offset = p-zl;
        zl = __ziplistCascadeUpdate(zl,p+reqlen);
        p = zl+offset;
    }

    /* Write the entry */
    p += zipStorePrevEntryLength(p,prevlen);
    p += zipStoreEntryEncoding(p,encoding,slen);
    if (ZIP_IS_STR(encoding)) {
        memcpy(p,s,slen);
    } else {
        zipSaveInteger(p,value,encoding);
    }
    ZIPLIST_INCR_LENGTH(zl,1);
    return zl;
}

ziplist 的优缺点

优点

内存利用率高

缺点

查询效率随着节点数量增多而降低，故只适合存储较少的节点；
不擅长做修改操作，一旦数据改动，会引发内存realloc，可能导致内存拷贝

Realloc 操作中，会判断分配空间中空余空间是否够用

如果够用，直接返回；

如果不够用，则对分配的空间扩容

// 扩容规则
if (newlen < SDS_MAX_PREALLOC)
    newlen *= 2;
else
    newlen += SDS_MAX_PREALLOC;

ZipList对其他数据结构的支持

有很多数据结构都是基于ziplist实现的，如下

如果Entry直接无关，是一个链表
如果Entry之间是key-value的关系，则是HashTable
如果Entry之间是key的关系，则是SET
如果Entry带有分值排序，则是ZSET

QuickList 快表

为什么要设计快表

双向链表便于在表的两端进行push和pop操作，但是它的内存开销比较大。首先，它在每个节点上除了要保存数据之外，还要额外保存两个指针；其次，双向链表的各个节点是单独的内存块，地址不连续，节点多了容易产生内存碎片。
ziplist由于是一整块连续内存，存储效率很高。但是，它不利于修改操作，每次数据变动都会引发一次内存的realloc。特别是当ziplist长度很长的时候，一次realloc可能会导致大批量的数据拷贝，进一步降低性能。
双向链表和ziplist的查询时间复杂度是O(n)，随着表的长度增大，查询效率降低

设计快表，进而解决以上问题

结构

A doubly linked list of ziplists

头部

head：头指针

tail：尾指针

count：quicklist中的节点数

len：quicklist 的 node 数

fill：控制每个Node上ziplist的长度，含义如下

fill取值	含义
正数，如5	表示每个quicklist节点的ziplist最多包含5个数据项
负数，-5	每个quicklist节点上的ziplist大小不能超过64 Kb
负数，-4	每个quicklist节点上的ziplist大小不能超过32 Kb
负数，-3	每个quicklist节点上的ziplist大小不能超过16 Kb
负数，-2（默认）	每个quicklist节点上的ziplist大小不能超过8 Kb
负数，-1	每个quicklist节点上的ziplist大小不能超过4 Kb

compress：控制两端各有几个node不压缩，0=off

quicklist的ziplist设置多长合适

同样是存储12个数据项，既可以是一个quicklist包含3个节点，而每个节点的ziplist又包含4个数据项，也可以是一个quicklist包含6个节点，而每个节点的ziplist又包含2个数据项。

这是一个需要找平衡点的问题。只从存储效率上分析一下：

每个quicklist节点上的ziplist越短，则内存碎片越多。内存碎片多了，有可能在内存中产生很多无法被利用的小碎片，从而降低存储效率。这种情况的极端是每个quicklist节点上的ziplist只包含一个数据项，这就蜕化成一个普通的双向链表了。
每个quicklist节点上的ziplist越长，在一个ziplist中插入删除节点的成本就会增大，查询效率也会降低。这种情况的极端是整个quicklist只有一个节点，所有的数据项都分配在这仅有的一个节点的ziplist里面。这其实蜕化成一个ziplist了。

quicklist 的操作

quicklist的创建

/* Create a new quicklist.
 * Free with quicklistRelease(). */
quicklist *quicklistCreate(void) {
    struct quicklist *quicklist;

    quicklist = zmalloc(sizeof(*quicklist));
    quicklist->head = quicklist->tail = NULL;
    quicklist->len = 0;
    quicklist->count = 0;
    quicklist->compress = 0;
    quicklist->fill = -2;
    quicklist->bookmark_count = 0;
    return quicklist;
}

push操作

不管是在头部还是尾部插入数据，都包含两种情况：

如果头节点（或尾节点）上ziplist大小没有超过限制（即_quicklistNodeAllowInsert返回1），那么新数据被直接插入到ziplist中（调用ziplistPush）
如果头节点（或尾节点）上ziplist太大了，那么新创建一个quicklistNode节点（对应的也会新创建一个ziplist），然后把这个新创建的节点插入到quicklist双向链表中（调用_quicklistInsertNodeAfter）

pop操作

先从头部或尾部节点的ziplist中把对应的数据项删除，如果在删除后ziplist为空了，那么对应的头部或尾部节点也要删除。删除后还可能涉及到里面节点的解压缩问题

快表的时间复杂度分析

查询：O(count) + O(ziplist->size)

表头插入/删除：O(ziplist->size)

表尾插入/删除：O(1)

表中插入/删除：O(count) + O(ziplist->size)

总结

Redis 的 List 底层采用 quicklist 实现，quicklist 将 linkedlist 和 ziplist 结合，在空间利用率和访问效率上做了平衡
对于存放数据长度不确定的情况，可以采用变长编码的思路，提高内存利用率
在对分配空间扩容时，在数据量较少时成倍扩容，到达一定量级后固定步长扩容
在数据规模不同的情况下，可以采用特定的数据结构，达到针对性的优化。如ziplist在数据量较少的情况，可以做为HashTable，Set，ZSet 的实现方式

Reference

Redis内部数据结构详解(4)——ziplist

Redis内部数据结构详解(5)——quicklist