(timestamp-) Consistent backups in HBase

php中文网
发布: 2016-06-07 16:28:13
原创
1112人浏览过

The topic consistent backups in HBase comes up every now and then. In this article I will outline a scheme that does provide timestamp-consistent backups. Consistent backups are possible in HBase. With "consistent" I mean "consistent as of

The topic consistent backups in HBase comes up every now and then.
In this article I will outline a scheme that does provide timestamp-consistent backups.

Consistent backups are possible in HBase. With "consistent" I mean "consistent as of a specific timestamp" (this is limited by the HBase timestamp granularity.)

Over the past few months I have contributed various changes to HBase that now hopefully lead full circle to a more coherent story for "system of record" type data retention.

The basic setup is simple. You'll need
  • HBASE-4071 (0.92 - for MIN_VERSIONS)
  • HBASE-4536 (0.94 - keeping deleted cells and raw scans)
  • HBASE-4682 (0.94 - deleted cells in exports)
(HBase 0.94 will have all the relevant patches once it is releases - which should be any time now)

HBase's built-in Export tool can then be used to generate consistent snapshots.

Normally the data collected by an Export job is "smeared" over the time interval it takes to execute the job; an Export-Scan sees a row (cells of a row to be precise) as of the time when it happens to get to it, and rows can change while the Export is running.

That is problematic, because it is not possible to recreate a consistent view of the database from export generated this way.

The trick then is to keep all data in HBase long enough for the backup job to finish, and to only collect information before the start time of the export job.

Say the backup takes no more than T to complete (yes, it is hard to know ahead of time how long an Export is going to run). In that case the table's column families can be setup as follows:
  • set TTL to T + some headroom (so maybe 2T to be safe)
  • set VERSIONS to a very large number, (max int = 2147483647 for example, i.e. nothing is evicted from HBase due to VERSIONS)
  • set MIN_VERSIONS to how many versions you want to keep around, otherwise all versions could be removed if their TTL expired
  • set KEEP_DELETED_CELLS to true (this makes sure that deleted cell and delete markers are kept until they expire by the TTL)
In the shell (setting TTL to two hours):

create

, {name=>, versions=>2147483647, ttl=>7200, min_versions=>1, keep_deleted_cells=>true}
Export can now we used as follows:
  • set versions to 2147483647 (i.e. all versions)
  • set startTime to -2147483648 (min int, i.e. everything is guaranteed to be included)
  • set endTime to the start time of the Export (i.e. the current time when the export starts. This is the crucial part)
  • enable support for deleted cell

hbase org.apache.hadoop.hbase.mapreduce.Export
-D hbase.mapreduce.include.deleted.rows=true

2147483647 -2147483648


As long as the Export finishes within 2T, a consistent snapshot as of the time the Export was started is created. Otherwise some data might be missing, as it could have been compacted away before the Export had a chance to see it.

Since the backups also copied deleted rows and delete markers, a backup restored to an HBase instance can be queried using a time range (see Scan) to retrieve the state of the data at any arbitrary time.

Export is current limited to a single table, but given enough storage in your live cluster this can be extended to multiple table Exports, simply by setting the endTime of all Exports jobs to the start time of the first job.


This same trick can also be used for incremental backups. In that case the TTL has to be large enough to cover the interval between incremental backups.
If, for example, the incremental backups frequency is daily, the TTL above can be set to 2 days (TTL=>172800). Then use Export again:

hbase org.apache.hadoop.hbase.mapreduce.Export
-D hbase.mapreduce.include.deleted.rows=true

2147483647

Fotor AI Image Upscaler
Fotor AI Image Upscaler

Fotor推出的AI图片放大工具

Fotor AI Image Upscaler 73
查看详情 Fotor AI Image Upscaler

The longer TTL guarantees that there will be no gaps that are not covered by the incremental backups.

An example:
  1. A Put (p1) happens at T1
  2. Full backup starts at T2, time interval [0, T2)
  3. Another Put (p2) at T3
  4. full backup jobs finishes
  5. A Delete happens at T4 
  6. Incremental backup starts at T5, time interval [T2, T5)
  7. Yet another Put (p3)
  8. Incremental backup finishes
Note that in this scenario is does not matter when the backup jobs finish.

The full backup contains only p1. The incremental backup contains p2 and the Delete. p3 is not included in any backup, yet.
The state at T2 (p1) and T5 (p1, p2, delete) can be directly restored. Using time range Scans or Gets the state as of T4 and T3 can also be retrieved, once both backups have been restored into the same HBase instance (you need HBASE-4536 for this to work correctly with Deletes).

Finally, if keeping enough data to cover the time between two incremental backups in the live HBase cluster is problematic for your organization, it is also possible to archive HBase's Write Ahead Logs (WAL) and then replay with the built-in WALPlayer (HBASE-5604), but that is for another post.

最佳 Windows 性能的顶级免费优化软件
最佳 Windows 性能的顶级免费优化软件

每个人都需要一台速度更快、更稳定的 PC。随着时间的推移,垃圾文件、旧注册表数据和不必要的后台进程会占用资源并降低性能。幸运的是,许多工具可以让 Windows 保持平稳运行。

下载
来源:php中文网
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系admin@php.cn
最新问题
开源免费商场系统广告
热门教程
更多>
最新下载
更多>
网站特效
网站源码
网站素材
前端模板
关于我们 免责申明 举报中心 意见反馈 讲师合作 广告合作 最新更新
php中文网:公益在线php培训,帮助PHP学习者快速成长!
关注服务号 技术交流群
PHP中文网订阅号
每天精选资源文章推送

Copyright 2014-2025 https://www.php.cn/ All Rights Reserved | php.cn | 湘ICP备2023035733号