Redis 源码阅读 --- Database

Redis 数据库实现相关源码阅读笔记，源码文件 server.h & server.c & db.c & evict.c & expire.c & blocked.c。

1. 小结

Redis 数据库对象主要由以下几部分组成：
- id: Database ID，值域为 $0$ 至 databases $- 1$
- dict：数据库具体内容 —— key-value pair
  - key 为 sds strings，不使用 redis object 是为了节省内存
  - value 为 redis object
- expires：记录数据对象过期时间的 key-ttl pair
  - key 为 sds strings，会复用 dict 中的键值以节省内存
  - value 为long long 整型，记录了该 key 将于何时 (absolute unix time in milliseconds) 过期
  - 被动清理过期键：一般地，在查找 key 是否在 dict 中时，会先判断该 key 是否已过期，若为主库，则会删除已过期键
  - 主动清理过期键：调用 activeExpireCycle
- avg_ttl：统计过期时间平均值
- expires_cursor：用于渐进式主动过期
- blocking_keys & ready_keys：用于实现带阻塞的命令
- watched_keys：用于实现事务
- defrag_later：用于内存碎片整理
键过期以被动的方式进行，只会在访问该 key 的时候触发：
- 在客户端访问该 key 时调用 expireIfNeeded 触发过期删除操作
  
  当从库处于 read-only 模式时，访问从库的 key 不会触发删除操作
- Redis 服务器定时调用 activeExpireCycle 主动扫描键空间以清理过期 key
  
  可使用配置项 active-expire-effort 修改主动清理操作单次执行时间上限
过期 key 的删除分为同步和异步两种方式，由配置项 lazyfree-lazy-expire 决定采用何种方式。
当设置了 maxmemory 限制时，Redis 会根据所设置的内存策略 maxmemory-policy 决定淘汰 key 的方式。

Tips

为性能考虑，应尽量避免调用 gettimeofday 来获取时间，可使用 Redis 缓存的时间值。

2. redisDb

Redis 数据库由多个 dict 组成，其结构定义如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


/* Redis database representation. There are multiple databases identified
 * by integers from 0 (the default database) up to the max configured
 * database. The database number is the 'id' field in the structure. */
typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP)*/
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    int id;                     /* Database ID */
    long long avg_ttl;          /* Average TTL, just for stats */
    unsigned long expires_cursor; /* Cursor of the active expire cycle. */
    list *defrag_later;         /* List of key names to attempt to defrag
                                 * one by one, gradually. */
} redisDb;

在之前 Redis dict 结构的介绍中，我们知道 dict 可以有不同的 dictType，不同类型的 dict 的区别在于 kv 类型及其相关操作不同，在 redisDb 中的各个 dict 其类型如下:

field	dictType	key	value
`dict`	dbDictType	sds strings	redis objects
`expires`	keyptrDictType	sds strings	long long
`blocking_keys`	keylistDictType	redis objects	lists
`ready_keys`	objectKeyPointerValueDictType	redis objects	dummy pointers
`watched_keys`	keylistDictType	redis objects	lists

dbDictType

Redis 数据库 keysapce 的 dictType 为 dbDictType, 其定义如下，键使用 sds 而不使用 robj，可以节省内存。

1
2
3
4
5
6
7
8
9


/* Db->dict, keys are sds strings, vals are Redis objects. */
dictType dbDictType = {
    dictSdsHash,                /* hash function */
    NULL,                       /* key dup */
    NULL,                       /* val dup */
    dictSdsKeyCompare,          /* key compare */
    dictSdsDestructor,          /* key destructor */
    dictObjectDestructor        /* val destructor */
};

keyptrDictType

键类型为 sds，会复用 redisDb->dict 中的键值，值类型为 long long，记录了该对象何时 (absolute unix time in milliseconds) 会过期。

1
2
3
4
5
6
7
8
9


/* Db->expires */
dictType keyptrDictType = {
    dictSdsHash,                /* hash function */
    NULL,                       /* key dup */
    NULL,                       /* val dup */
    dictSdsKeyCompare,          /* key compare */
    NULL,                       /* key destructor */
    NULL                        /* val destructor */
};

keylistDictType

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


/* Keylist hash table type has unencoded redis objects as keys and
 * lists as values. It's used for blocking operations (BLPOP) and to
 * map swapped keys to a list of clients waiting for this keys to be loaded. */
dictType keylistDictType = {
    dictObjHash,                /* hash function */
    NULL,                       /* key dup */
    NULL,                       /* val dup */
    dictObjKeyCompare,          /* key compare */
    dictObjectDestructor,       /* key destructor */
    dictListDestructor          /* val destructor */
};

objectKeyPointerValueDictType

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


/* Generic hash table type where keys are Redis Objects,
 * Values dummy pointers. */
dictType objectKeyPointerValueDictType = {
    dictEncObjHash,            /* hash function */
    NULL,                      /* key dup */
    NULL,                      /* val dup */
    dictEncObjKeyCompare,      /* key compare */
    dictObjectDestructor,      /* key destructor */
    NULL                       /* val destructor */
};

3. redisDb APIs

3.1. 查找

lookupKey

1
2
3
4


/* Low level key lookup API, not actually called directly from commands
 * implementations that should instead rely on lookupKeyRead(),
 * lookupKeyWrite() and lookupKeyReadWithFlags(). */
robj *lookupKey(redisDb *db, robj *key, int flags);

其中 flags 用于指示是否更新 robj->lru，其取值有二：

LOOKUP_NONE: no special flags are passed.
LOOKUP_NOTOUCH: don’t alter the last access time of the key.

除此之外，若此时正有子线程正在进行保存 RDB、重写 AOF 或加载 module 相关操作时，也不会更新该数据对象的 lru 值，以避免一个数据对象被疯狂复制多次。

key 存在于 db 中时，返回其关联的 value 值，否则返回 NULL。

lookupKeyReadWithFlags

1
2
3


/* Lookup a key for read operations, or return NULL if the key is not found
 * in the specified DB. */
robj *lookupKeyReadWithFlags(redisDb *db, robj *key, int flags);

在该函数中，在调用 lookupKey 判断 key 是否存在于 db 中之前，会先调用 expireIfNeeded 判断该 key 是否已过期(有关过期的内容将在下文详述)。从库为了和主库保持数据的一致性，并不会主动清理过期键，而是会同步主库在清理过期键时发送的 DEL 命令，但在该查找操作中，如果 key 已过期，从库虽未删除该 key，但仍然会返回 NULL，表示该 key 逻辑上不存在，以和主库保持一致的视图。

调用该函数有以下副作用：

在 Redis 主库中，若 key 达到其过期时间，则会被删除，同步或异步删除则取决于配置的 lazyfree-lazy-expire 值 (调用 expireIfNeeded 引起)
可能会更新 robj->lru (调用 lookupKey 引起)
会更新数据库的 hits 或 misses 状态信息
如果启用了键空间通知，则会触发 keymiss 通知

lookupKeyRead

1
2
3
4
5


/* Like lookupKeyReadWithFlags(), but does not use any flag, which is the
 * common case. */
robj *lookupKeyRead(redisDb *db, robj *key) {
    return lookupKeyReadWithFlags(db,key,LOOKUP_NONE);
}

lookupKeyWriteWithFlags

1
2
3
4
5
6
7
8
9


/* Lookup a key for write operations, and as a side effect, if needed, expires
 * the key if its TTL is reached.
 *
 * Returns the linked value object if the key exists or NULL if the key
 * does not exist in the specified DB. */
robj *lookupKeyWriteWithFlags(redisDb *db, robj *key, int flags) {
    expireIfNeeded(db,key);
    return lookupKey(db,key,flags);
}

others

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


robj *lookupKeyWrite(redisDb *db, robj *key) {
    return lookupKeyWriteWithFlags(db, key, LOOKUP_NONE);
}

robj *lookupKeyReadOrReply(client *c, robj *key, robj *reply) {
    robj *o = lookupKeyRead(c->db, key);
    if (!o) addReply(c,reply);
    return o;
}

robj *lookupKeyWriteOrReply(client *c, robj *key, robj *reply) {
    robj *o = lookupKeyWrite(c->db, key);
    if (!o) addReply(c,reply);
    return o;
}

3.2. 添加

dbAdd

1
2
3
4
5


/* Add the key to the DB. It's up to the caller to increment the reference
 * counter of the value if needed.
 *
 * The program is aborted if the key already exists. */
void dbAdd(redisDb *db, robj *key, robj *val);

在将 key 加入 db->dict 时，会复制 key 的字符串值，将 key 以 sds 形式加入 db。
如果值类型为 list、zset 或 stream，key 此时在 db->blocking_keys 中且不在 db->ready_keys 时，会将 key 加入 db->ready_keys 中
若 Redis 服务器为 cluster 模式，会将 key 和对应的 slot 建立映射，表示 key 存在于该 slot 中

dbAddRDBLoad

1
2
3
4


/* This is a special version of dbAdd() that is used only when loading
 * keys from the RDB file: the key is passed as an SDS string that is
 * retained by the function (and not freed by the caller). */
int dbAddRDBLoad(redisDb *db, sds key, robj *val);

即使 key 此时在 db->dict 中，也不会触发 aborted，而是会返回 0
如果 key 成功加入 db->dict 中，则返回 1
由于传入的 key 类型为 sds，所以和 dbAdd 相比并无复制字符串操作，因此在成功加入时，调用方不可对 key 进行释放操作，此时 key 的所有权交予 db->dict；但在返回 0 时，key 的释放操作需由调用方完成
若 Redis 服务器为 cluster 模式，会将 key 和对应的 slot 建立映射，表示 key 存在于该 slot 中

dbOverwrite

1
2
3
4
5
6


/* Overwrite an existing key with a new value. Incrementing the reference
 * count of the new value is up to the caller.
 * This function does not modify the expire time of the existing key.
 *
 * The program is aborted if the key was not already present. */
void dbOverwrite(redisDb *db, robj *key, robj *val);

setKey & genericSetKey

1
2
3
4
5
6
7
8
9


/* Common case for genericSetKey() where the TTL is not retained. */
void setKey(client *c, redisDb *db, robj *key, robj *val) {
    genericSetKey(c,db,key,val,0,1);
}

/* High level Set operation. This function can be used in order to set
 * a key, whatever it was existing or not, to a new object. */
void genericSetKey(client *c, redisDb *db, robj *key,
                   robj *val, int keepttl, int signal);

在 genericSetKey 中：

keepttl 指示是否保持该 key 的 ttl 不变，其值由具体的写入命令参数决定
signal 指示是否需要调用 signalModifiedKey 通知该 key 已被修改，除了在 redis module 中，该值一直为 1，而在 redis module 中，该值为 0。使用 Redis module 可以通过调用 RedisModule_SignalModifiedKey 通知该 key 已被修改
在该函数中无论 key 先前是否存在，都会将 key-val 对写入 db 中
val 的引用计数会增加
在 keepttl 不为 1 时，与 key 关联的过期时间会被移除

3.3. 删除

删除某个 key

1
2
3
4
5
6
7
8
9


/* Delete a key, value, and associated expiration entry if any, from the DB */
int dbSyncDelete(redisDb *db, robj *key);

/* This is a wrapper whose behavior depends on the Redis lazy free
 * configuration. Deletes the key synchronously or asynchronously. */
int dbDelete(redisDb *db, robj *key) {
    return server.lazyfree_lazy_server_del ? dbAsyncDelete(db,key) :
                                             dbSyncDelete(db,key);
}

删除指定的 db

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


long long emptyDb(int dbnum, int flags, void(callback)(void*)) {
    return emptyDbGeneric(server.db, dbnum, flags, callback);
}

/* Remove all keys from all the databases in a Redis server.
 * If callback is given the function is called from time to time to
 * signal that work is in progress.
 *
 * The dbnum can be -1 if all the DBs should be flushed, or the specified
 * DB number if we want to flush only a single Redis database number.
 *
 * Flags are be EMPTYDB_NO_FLAGS if no special flags are specified or
 * 1. EMPTYDB_ASYNC if we want the memory to be freed in a different thread.
 * 2. EMPTYDB_BACKUP if we want to empty the backup dictionaries created by
 *    disklessLoadMakeBackups. In that case we only free memory and avoid
 *    firing module events.
 * and the function to return ASAP.
 *
 * On success the fuction returns the number of keys removed from the
 * database(s). Otherwise -1 is returned in the specific case the
 * DB number is out of range, and errno is set to EINVAL. */
long long emptyDbGeneric(redisDb *dbarray, int dbnum, int flags,
                         void(callback)(void*));

3.4. Utils

dbRandomKey

1
2
3
4
5


/* Return a random key, in form of a Redis object.
 * If there are no keys, NULL is returned.
 *
 * The function makes sure to return keys not already expired. */
robj *dbRandomKey(redisDb *db);

dbUnshareStringValue

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


/* Prepare the string object stored at 'key' to be modified destructively
 * to implement commands like SETBIT or APPEND.
 *
 * An object is usually ready to be modified unless one of the two conditions
 * are true:
 *
 * 1) The object 'o' is shared (refcount > 1), we don't want to affect
 *    other users.
 * 2) The object encoding is not "RAW".
 *
 * If the object is found in one of the above conditions (or both) by the
 * function, an unshared / not-encoded copy of the string object is stored
 * at 'key' in the specified 'db'. Otherwise the object 'o' itself is
 * returned. */
robj *dbUnshareStringValue(redisDb *db, robj *key, robj *o);

4. expire

设置了过期时间的 key，会将 <key, absolute-expire-time> 映射加入到 redisDb->expires 字典中，以支持后续的键过期行为。Redis 的键过期以被动的方式进行，只会在访问到该 key 的时候触发。访问的方式大致可以分为两种：

在客户端访问到该 key 时调用 expireIfNeeded 触发过期删除操作，例如查找、删除、随机返回 key和 Scan操作时
为防止有些过期 key 一直未被访问从而无法释放，Redis 主库还会定时调用 activeExpireCycle 主动去扫描过期键空间清理过期 key
- main->initServer->serverCron->databasesCron->activeExpireCycle
- main->initServer->beforeSleep->activeExpireCycle

过期 key 的删除分为同步和异步两种方式，由配置项 lazyfree-lazy-expire 决定采用何种方式。

4.1. Expire APIs

removeExpire

1
2


/* Remove an expire fo the specified key. */
int removeExpire(redisDb *db, robj *key);

只有当 key 在 redisDb->dict 中时，才会尝试从 redisDb->expires 删除与 key 对应的 dictEntry 并释放其占用的内存。

setExpire

1
2


/* Set an expire to the specified key. */
void setExpire(client *c, redisDb *db, robj *key, long long when);

当调用请求来自客户端时，c 为客户端上下文，否则 c 为 NULL
when 为过期时间，其为 unix time 的绝对值，即 key 在 when 这一时刻过期，单位为 ms
redisDb->expires 复用 redisDb->dict 的键字符串

getExpire

1
2
3


/* Return the expire time of the specified key, or -1 if no expire
 * is associated with this key (i.e. the key is non volatile) */
long long getExpire(redisDb *db, robj *key);

propagateExpire

1
2
3
4
5
6
7
8
9


/* Propagate expires into slaves and the AOF file.
 * When a key expires in the master, a DEL operation for this key is sent
 * to all the slaves and the AOF file if enabled.
 *
 * This way the key expiry is centralized in one place, and since both
 * AOF and the master->slave link guarantee operation ordering, everything
 * will be consistent even if we allow write operations against expiring
 * keys. */
void propagateExpire(redisDb *db, robj *key, int lazy);

keyIsExpired

1
2


/* Check if the key is expired. */
int keyIsExpired(redisDb *db, robj *key);

如果此时正在 loading data，则不会进行过期操作，因此该函数直接返回 0
如果正在运行 lua 脚本，则会把 lua 脚本开始运行的时间 server.lua_time_start 作为当前时间，这是为了防止在 lua 脚本运行期间 key 过期同步至从库，从而导致在从库上 lua 脚本的运行和主库不一致的情况发生，详情可见 issue 1525
如果此时正在某个命令运行期间，会使用缓存的 server.mstime 作为当前时间

expireIfNeeded

1
2
3


/* The return value of the function is 0 if the key is still valid,
 * otherwise the function returns 1 if the key is expired. */
int expireIfNeeded(redisDb *db, robj *key);

正如之前所述，在客户端访问 key 时可能需要先调用 expireIfNeeded 检测该 key 是否已过期。该函数的运行模式与 Redis 服务器角色有关：

在 Redis 主库中，如果此时 key 已过期，会触发删除 key 操作，并将删除操作同步至 salves 和 AOF
在 Redis 从库中，如果此时 key 已过期，则只会立马返回 1 表示该 key 已过期，但不会执行删除操作，这是因为从库为了和主库的数据保持一致，其过期 key 的清理由同步主库的 del 消息完成

4.2. 主动清理过期键

之前我们介绍过 Redis 主库除了在访问 key 时触发过期操作，还会定期调用 activeExpireCycle 清理过期 key，其接口如下：

1

void activeExpireCycle(int type);

type 参数表示该函数使用何种模式运行，共有两种模式：

ACTIVE_EXPIRE_CYCLE_FAST: 在 beforeSleep 中调用 activeExpireCycle 时使用该模式，将单次清理过期 key 的执行时间限制在 config_cycle_fast_duration(默认为 $1ms$) 之内
ACTIVE_EXPIRE_CYCLE_SLOW: 在 databaseCron 中调用 activeExpireCycle 时使用该模式，将定期清理过期 key 操作使用 CPU 占比控制在 config_cycle_slow_time_perc(默认为 $25%$) 之内，单次执行时间限制为 config_cycle_slow_time_perc*1000000/server.hz/100

上述出现的参数可使用配置项 active-expire-effort 配置，对应关系如下：

active-expire-effort	fast_duration(ms)	slow_time_perc	keys_per_loop	acceptable_stale
$a + 1$	$1 + \frac{a}{4}$	$25\% + 2a\%$	$20 + 5a$	$10\% - a\%$
$[1, 10]$	$[1, 3.25]$	$[25\%, 43\%]$	$[20, 65]$	$[10\%, 1\%]$

其中，keys_per_loop(config_keys_per_loop) 指的是每次迭代扫描的 key 的数量上限，acceptable_stale(config_cycle_acceptable_stale) 指的是可容忍的已过期键比例。从上表可以的值，当 active-expire-effort 值越大时，每次清理过期 key 的调用执行时间上限越长，单个迭代内扫描 key 数量上限越大，对已过期键占比的容忍度越低。

该函数以“伪增量”(期间写入的过期 key 可能在本次全面遍历中未被访问)形式对过期键空间进行扫描，使用局部静态变量记录上次执行的状态：

current_db 记录上次调用此函数时的 server.db 的位置，每个 redisDb 内部的扫描位置则由 redisDb->expires_cursor 记录
timelimit_exit 记录上次执行是否超时。每次调用该函数扫描的 db 数量上限默认为 CRON_DBS_PER_CALL 如果该值大于 server.dbnum 或上次该函数执行超时时，每次调用该函数扫描的 db 数量上限则设置为 server.dbnum
last_fast_cycle 记录上次以 fast 模式运行该函数的起始时间，以防止上次 fast 模式扫描过程超时且尚未终止时启动新的 fast 模式扫描过程

在以下情况下，该函数会立马返回，不启动扫描过程：

在 client paused 期间，Redis 数据库是静态的，其不仅会阻止客户端的写入，也不会执行 evict 和 expire 操作，因此，若 server.clients_paused 为真时，该函数会直接返回
在 fast 模式下，上次该函数并未超时退出，且此时已过期键占比近似值 server.stat_expired_stale_perc 小于可容忍上限 config_cycle_acceptable_stale
在 fast 模式下，上次 fast 模式扫描过程超时且可能尚未终止

在扫描过程中，该函数会试图遍历指定的 db 数量，其终止条件有二：

函数运行时间超出时间限制
每个 db 在单次迭代中已过期键占比低于可容忍值，则切换至下一 db，遍历 db 数量达到指定上限时扫描终止

在扫描过程中，该函数还会更新 Redis 运行状态：

server.stat_expiredkeys：已过期键总数
redisDb->avg_ttl：ttl 近似平均值，先前值为 0 时，使用本次迭代值作为 ttl 近似平均值；否则以先前值权重 $98\%$，单次迭代值权重 $2\%$ 进行加权平均
server.stat_expired_time_cap_reached_count：Expire cycles 超时退出次数
server.stat_expire_cycle_time_used: Expire cycles 使用总时间 (ms)
server.stat_expired_stale_perc：已过期键占比近似值，以先前值占比 $95\%$，本次调用运行值占比 $5\%$ 进行加权平均

4.3. 处于可写模式下的从库清理过期键

TODO

4.4. Expire Commands

TODO

5. evict

在之前关于 redis object 的介绍中，我们知道 redis obejct 有一个 filed 为 lru，其在创建时初始化，在每次查找操作时更新，它的作用就是用于挑选目标 key 进行淘汰。

Redis 作为内存数据库，通常会设置 maxmemory 指定可用的最大内存空间大小。若设置了 maxmemory 值，则 Redis Server 会在每次执行命令之前调用 freeMemoryIfNeededAndSafe 以检查目前内存使用量是否已经达到 maxmemory，若为真，会根据设置的 maxmemory-policy 决定淘汰 key 的方式，可选策略如下：

volatile-lru: Evict using approximated LRU, only keys with an expire set.
allkeys-lru: Evict any key using approximated LRU.
volatile-lfu: Evict using approximated LFU, only keys with an expire set.
allkeys-lfu: Evict any key using approximated LFU.
volatile-random: Remove a random key having an expire set.
allkeys-random: Remove a random key, any key.
volatile-ttl: Remove the key with the nearest expire time (minor TTL)
noeviction: Don’t evict anything, just return an error on write operations.

接下来我们重点介绍在 Redis 中 LRU/LFU 相关策略的实现。

5.1. Least Recently Used(LRU)

获取 lruclock

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


/* Return the LRU clock, based on the clock resolution. This is a time
 * in a reduced-bits format that can be used to set and check the
 * object->lru field of redisObject structures. */
unsigned int getLRUClock(void) {
    return (mstime()/LRU_CLOCK_RESOLUTION) & LRU_CLOCK_MAX;
}

/* This function is used to obtain the current LRU clock.
 * If the current resolution is lower than the frequency we refresh the
 * LRU clock (as it should be in production servers) we return the
 * precomputed value, otherwise we need to resort to a system call. */
unsigned int LRU_CLOCK(void);

通过调用 getLRUClock 可以获得此时的 lruclock 值，其精度由 LRU_CLOCK_RESOLUTION 控制，数值最大位数为 LRU_BITS。

从 issues 2552 可知，为性能考虑，应尽量避免调用 gettimeofday 来获取时间，因此 Redis 还提供了 LRU_CLOCK 接口在 LRU_CLOCK_RESOLUTION 精度范围内，返回 Redis Server 此时正缓存的 lruclock 值。

计算 idle time

1
2
3


/* Given an object returns the min number of milliseconds the object was never
 * requested, using an approximated LRU algorithm. */
unsigned long long estimateObjectIdleTime(robj *o);

5.2. Least Frequently Used(LFU)

从 redisObject 的定义中，我们知道 LFU 和 LRU 复用同一 field lru，因此在 LFU 中访问频率的大小只可用 LRU_BITS 位来表示。Redis 使用下图所示的结构表示频率：

1
2
3
4


         16 bits      8 bits
    +----------------+--------+
    + Last decr time | LOG_C  |
    +----------------+--------+

Last decr time: 最近一次访问时间
LOG_C：对数计数器

增加计数

因为位数有限，因此 LFU 使用 8-bit 对数计数器 LOG_C 表示访问频率，在初始 createObject 时，值为 LFU_INIT_VAL，其大小表示其接下来可能会被访问的“潜能”。

在每次访问 Redis 数据对象时，会调用 LFULogIncr 尝试增加对数计数，其代码如下所示：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


/* Logarithmically increment a counter. The greater is the current counter value
 * the less likely is that it gets really implemented. Saturate it at 255. */
uint8_t LFULogIncr(uint8_t counter) {
    if (counter == 255) return 255;
    double r = (double)rand()/RAND_MAX;
    double baseval = counter - LFU_INIT_VAL;
    if (baseval < 0) baseval = 0;
    double p = 1.0/(baseval*server.lfu_log_factor+1);
    if (r < p) counter++;
    return counter;
}

可以看出，counter 增加满足以下公式：

$$ NewC = \begin{cases} 255, & curC = 255 \cr curC + 1, & curC \le \text{LFU_INIT_VAL} \cr curC_{1 - p} \enspace | \enspace (curC + 1)_p, & curC > \text{LFU_INIT_VAL} \cr \end{cases} $$

其中当目前计数值大于 LFU_INIT_VAL 时，会以概率 $p$ 加 1，从代码中可以看出 $p$ 的大小和当前计数值大小反相关。代码中出现的常量 lfu_log_factor 可通过 redis.conf 配置，其值越大，计数器加 1 的概率就越小，可参考如下表格(表格中 counter 数值非数学期望值)：

factor	100 hits	1000 hits	100K hits	1M hits	10M hits
0	104	255	255	255	255
1	18	49	255	255	255
10	10	18	142	255	255
100	8	11	49	143	255

减小计数

使用 LFU 策略时，需要并不是历史总访问频率的大小，而是“最近”访问频率的信息，因此需要设计合理降低对数计数器的方法。Redis 使用 16-bit 的 Last decr time 记录数据对象最近一次被访问的时间(取名为 Last decr time 可能是因为在每次更新 LFU 时会先调用 LFUDecrAndReturn 下调当前计数值)。相关源码如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


/* Return the current time in minutes, just taking the least significant
 * 16 bits. The returned time is suitable to be stored as LDT (last decrement
 * time) for the LFU implementation. */
unsigned long LFUGetTimeInMinutes(void) {
    return (server.unixtime/60) & 65535;
}

/* Given an object last access time, compute the minimum number of minutes
 * that elapsed since the last access. */
unsigned long LFUTimeElapsed(unsigned long ldt) {
    unsigned long now = LFUGetTimeInMinutes();
    if (now >= ldt) return now-ldt;
    return 65535-ldt+now;
}

/* If the object decrement time is reached decrement the LFU counter but
 * do not update LFU fields of the object, we update the access time
 * and counter in an explicit way when the object is really accessed. */
unsigned long LFUDecrAndReturn(robj *o) {
    unsigned long ldt = o->lru >> 8;
    unsigned long counter = o->lru & 255;
    unsigned long num_periods = server.lfu_decay_time ?
                                LFUTimeElapsed(ldt) / server.lfu_decay_time : 0;
    if (num_periods)
        counter = (num_periods > counter) ? 0 : counter - num_periods;
    return counter;
}

可以看出，在 LFUDecrAndReturn 中，参数 lfu_decay_time 类似于元素的半衰期，控制着计数值随时间流逝而衰减的行为，其值同样可通过 redis.conf 配置。

当 lfu_decay_time 为 0 时，计数器不随时间衰减
当 lfu_decay_time 大于 0 时，每经过 lfu_decay_time 分钟，计数器减 1，直至为 0

还可以发现在 LFUDecrAndReturn 中并不会改变 robj->lru 的值，只是返回此时经过衰减之后的计数值。其值的更改会在每次真正访问该数据对象时发生。

5.3. Eviction pool

Redis 使用淘汰候选数组 EvictionPoolLRU 保存着可淘汰 key 的信息，其中每个候选 key 的相关信息由 evictionPoolEntry 表示，其结构定义如下:

1
2
3
4
5
6
7
8


struct evictionPoolEntry {
    unsigned long long idle;    /* Object idle time (inverse frequency for LFU) */
    sds key;                    /* Key name. */
    sds cached;                 /* Cached SDS object for key name. */
    int dbid;                   /* Key DB number. */
};

static struct evictionPoolEntry *EvictionPoolLRU;

key 字段存储该数据对象的键值，但并不会复用 sds，而是会复制数据对象的键值
cached 字段是为了在键长度小于 EVPOOL_CACHED_SDS_SIZE 时复用之前复制数据对象键值时分配的内存空间，否则会申请新的内存空间来存储数据对象的键值

在数组 EvictionPoolLRU 中，

使用 LRU 相关策略时，entry 按照 idle time 从小到大排序
使用 LFU 相关策略时，entry 按照访问频率从大到小排序

当 redis server 初始化时，会调用 evictionPoolAlloc 初始化 EvictionPoolLRU，此时数组中尚无候选 key 信息，cached 字段指向长为 EVPOOL_CACHED_SDS_SIZE 的 sds。数组会在每次需要淘汰 key 时通过调用 evictionPoolPopulate 填充。其接口如下：

1
2
3
4


void evictionPoolPopulate(int dbid,
                          dict *sampledict,
                          dict *keydict,
                          struct evictionPoolEntry *pool);

dbid: database ID
sampledict: 采样的 keysapce，随 maxmemory-policy 而定
- 若只能淘汰过期 key，则采样空间为 redisDb->expires
- 若所有 key 均可淘汰，则采样空间为 redisDb->dict
keydict: redisDb 的 keysapce，即 redisDb->dict
pool: 静态变量 EvictionPoolLRU

该函数的大致处理流程如下：

从采样空间 sampledict 中尝试采样 maxmemory_samples 个数据对象
- maxmemory_samples 可由 redis.conf 配置
- 返回的采样结果中，数据对象可能有重复，其总数也可能小于 maxmemory_samples
获得采样数据对象的 idle 值
- 如果淘汰策略为 LRU 相关， idle = estimateObjectIdleTime(o)，即为该对象的 idle time
- 如果淘汰策略为 LFU 相关，idle = 255 - LFUDecrAndReturn(o)，此时对数计数器值越大，idle 越小
- 如果淘汰策略为 TTL 相关，idle = ULLONG_MAX - expire_unix_time，越早过期，idle 越大
根据 idle 值，将该数据对象插入 EvictionPoolLRU 中相应的位置，维持数组的有序性(idle 从小到大)

5.4. Free memory

获得目前数据库内存使用情况

1
2
3
4
5
6
7
8
9


/* We don't want to count AOF buffers and slaves output buffers as
 * used memory: the eviction should use mostly data size. This function
 * returns the sum of AOF and slaves buffer. */
size_t freeMemoryGetNotCountedMemory(void);

/* Get the memory status from the point of view of the maxmemory directive:
 * if the memory used is under the maxmemory setting then C_OK is returned.
 * Otherwise, if we are over the memory limit, the function returns C_ERR. */
int getMaxmemoryState(size_t *total, size_t *logical, size_t *tofree, float *level);

Redis 在进行淘汰 key 操作时，会排除 slave 和 AOF 客户端缓冲区占用的内存，该排除部分的值大小可调用 freeMemoryGetNotCountedMemory 获得。

在 getMaxmemoryState 中，除了返回 C_OK 或 C_ERR 表示是否超最大内存限制之外，还会视情况填充传入的非 NULL 参数：

total: total amount of bytes used. (Populated both for C_ERR and C_OK)
logical: the amount of memory used minus the slaves/AOF buffers. (Populated when C_ERR is returned)
tofree: the amount of memory that should be released in order to return back into the memory limits. (Populated when C_ERR is returned)
level: the ratio of memory usage (used_memory/maxmemory). (Populated both for C_ERR and C_OK)

freeMemoryIfNeeded

正如前文所述，Redis 服务器设置了 maxmemory 值时，会在每次执行命令之前调用 freeMemoryIfNeededAndSafe 以检查目前内存使用量是否已经达到 maxmemory，该函数源码如下:

1
2
3
4


int freeMemoryIfNeededAndSafe(void) {
    if (server.lua_timedout || server.loading) return C_OK;
    return freeMemoryIfNeeded();
}

可以看出，freeMemoryIfNeededAndSafe 会在以下两种情况下，不检测当前内存使用状况，直接返回 C_OK:

当前有执行超时的 lua 脚本
当前正在加载数据

在 freeMemoryIfNeeded 函数中，会根据当前内存使用状况和 maxmemory 值尝试进行内存释放操作以使内存使用量满足 maxmemory 的限制。其接口如下:

1

int freeMemoryIfNeeded(void);

一般情况下，当前内存使用量低于 maxmemory 或在进行释放内存操作后低于 maxmemory 时，函数返回 C_OK，否则返回 C_ERR。
默认情况下从库无 maxmemory 限制，直接返回 C_OK

可通过将 replica-ignore-maxmemory 设置为 no，使得从库也执行驱逐策略，但可能会造成主从数据不一致
在 client paused 期间，Redis 数据库是静态的，其不仅会阻止客户端的写入，也不会执行 evict 和 expire 操作，因此，若 server.clients_paused 为真时，该函数会直接返回 C_OK
当内存管理策略为 noeviction 时，会检查当前 lazyfree 队列中是否有任务，若为真则等待 lazyfree 任务完成后调用 getMaxmemoryState 得到此时的内存使用状态并返回
根据内存管理策略采样选择最合适的 key 进行淘汰直至内存使用量低于 maxmemory
- 当内存管理策略不为 random 相关类型时，每次删除一个 key 之前都会调用 evictionPoolPopulate 填充 evict pool，随后从后往前遍历得到此时最适合删除的 key
- 当内存管略为 volatile 相关类型时，因为采样空间 redisDb->expires 可能为空，因此可能无法通过淘汰 key 使得内存使用量低于 maxmemory，此时只能和 noeviction 一样，等待 lazyfree 的操作结果

Table of Contents