0

0

Redis latency spikes and the 99th percentile

php中文网

php中文网

发布时间:2016-06-07 16:40:59

|

1814人浏览过

|

来源于php中文网

原创

One interesting thing about the Stripe blog post about Redis is that they included latency graphs obtained during their tests. In order to persist on disk Redis requires to call the fork() system call. Usually forking using physical server

One interesting thing about the Stripe blog post about Redis is that they included latency graphs obtained during their tests. In order to persist on disk Redis requires to call the fork() system call. Usually forking using physical servers, and most hypervisors, is fast even with big processes. However Xen is slow to fork, so with certain EC2 instance types (and other virtual servers providers as well), it is possible to have serious latency spikes every time the parent process forks in order to persist on disk. The Stripe graph is pretty clear in this regard.

img://antirez.com/misc/stripe-latency.png

As you can guess, if you perform a latency test during the fork, all the requests crossing the moment the parent process forks will be delayed up to one second (taking as example the graph above, not sure about what was the process size nor the EC2 instance). This will produce a number of samples with high latency, and will affect the 99th percentile result.

To change instance type, configuration, setup, or whatever in order to improve this behavior is a good idea, and there are use cases where even a single request having a too high latency is unacceptable. However apparently it is not obvious how latency spikes of 1 second every 30 minutes (or more, if you use AOF with the right rewrite triggers) is very different from latency spikes which are evenly distributed in the set of requests.

With evenly distributed spikes, if the generation of a page needs to perform a number of requests to a Redis server in order to create the output, it is very likely that a page view will incur in the latency penalty: this impacts the quality of service in a great way potentially, check this link: http://latencytipoftheday.blogspot.it/2014/06/latencytipoftheday-most-page-loads.html.

However 1 second of latency every 30 minutes run is a completely different thing. For once, the percentile with good latency gets better *as the number of requests increase*, since the more the requests are, the more this second of latency will be unlikely to get over-represented in the samples (if you have just 1 request per minute, and one of those requests happen to hit the high latency, it will affect the 99.99th percentile much more than what happens with 100 requests per second).

Second: most page views will be unaffected. The only users that will see the 1 second delay are the ones that make a request crossing the fork call. All the other requests will experience an extremely low probability of hitting a request that has a latency which is significantly worse than the average latency. Also note that a page view crossing the fork time, even when composed of 100 requests, can’t be delayed for more than a second, since the requests are completed as soon as the fork() call terminates.

The bottom line here is that, if there are hard latency requirements for each single request, it is clear that a setup where a request can be delayed 1 second from time to time is a big problem. However when the goal is to provide a good quality of service, the distribution of the latency spikes have a huge effect on the outcome. Redis latency spikes due to fork on Xen are isolated points in the line of time, so they affect a percentage of page views, even when the page views are composed of a big number of Redis requests, which is proportional to the latency spike total time percentage, which is, 1 second every 1800 seconds in this case, so only the 0.05% of the page views will be affected.

Latency characteristics are hard to capture with a single metric: the full percentile curve and the distribution of the spikes, can provide a better picture. In general good rule of thumbs are a good way to start a research, and in general it is true that the average latency is a poor metric. However to promote a rule of thumb into an absolute truth has also its disadvantages, since many complex things remain complex, and in need for close inspection, regardless of our willingness to over simplify them.

At the same time, fork delays in EC2 instances are one of the worst experiences for Redis users in one of the most popular runtime environments today, so I’m starting to test Redis with EC2 regularly now: we’ll soon have EC2 specific optimization pages on the Redis official documentation, and a way to operate master-slaves replicas with persistence disabled in a safer way.

If you need EC2 + Redis master with persistence disabled now, the simplest to deploy “quick fix” is to disable automatic restarts of Redis instances, and use Sentinel for failover, so that crashed masters will not automatically return available, and will be failed over by Sentinel. The system administrator can restart the master manually after checking that the failover was successful and there is a new active master.

EDIT: Make sure to check the Hacker News thread that contains interesting information about EC2, Xen and fork time: https://news.ycombinator.com/item?id=8532851. Also not all the EC2 instances are the same, and certain types provide great fork time on pair with bare metal systems: https://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure#.VFJQ-JPF8yF Comments

相关专题

更多
ip地址修改教程大全
ip地址修改教程大全

本专题整合了ip地址修改教程大全,阅读下面的文章自行寻找合适的解决教程。

33

2025.12.26

压缩文件加密教程汇总
压缩文件加密教程汇总

本专题整合了压缩文件加密教程,阅读专题下面的文章了解更多详细教程。

18

2025.12.26

wifi无ip分配
wifi无ip分配

本专题整合了wifi无ip分配相关教程,阅读专题下面的文章了解更多详细教程。

46

2025.12.26

漫蛙漫画入口网址
漫蛙漫画入口网址

本专题整合了漫蛙入口网址大全,阅读下面的文章领取更多入口。

91

2025.12.26

b站看视频入口合集
b站看视频入口合集

本专题整合了b站哔哩哔哩相关入口合集,阅读下面的文章查看更多入口。

283

2025.12.26

俄罗斯搜索引擎yandex入口汇总
俄罗斯搜索引擎yandex入口汇总

本专题整合了俄罗斯搜索引擎yandex相关入口合集,阅读下面的文章查看更多入口。

370

2025.12.26

虚拟号码教程汇总
虚拟号码教程汇总

本专题整合了虚拟号码接收验证码相关教程,阅读下面的文章了解更多详细操作。

35

2025.12.25

错误代码dns_probe_possible
错误代码dns_probe_possible

本专题整合了电脑无法打开网页显示错误代码dns_probe_possible解决方法,阅读专题下面的文章了解更多处理方案。

25

2025.12.25

网页undefined啥意思
网页undefined啥意思

本专题整合了undefined相关内容,阅读下面的文章了解更多详细内容。后续继续更新。

72

2025.12.25

热门下载

更多
网站特效
/
网站源码
/
网站素材
/
前端模板

精品课程

更多
相关推荐
/
热门推荐
/
最新课程
进程与SOCKET
进程与SOCKET

共6课时 | 0.3万人学习

Redis+MySQL数据库面试教程
Redis+MySQL数据库面试教程

共72课时 | 6.2万人学习

关于我们 免责申明 举报中心 意见反馈 讲师合作 广告合作 最新更新
php中文网:公益在线php培训,帮助PHP学习者快速成长!
关注服务号 技术交流群
PHP中文网订阅号
每天精选资源文章推送

Copyright 2014-2025 https://www.php.cn/ All Rights Reserved | php.cn | 湘ICP备2023035733号