
Real life workloads allow more efficient data granularity and enable very large SSD capacities

Luca Bert | September 2023

Large capacity SSDs (i.e., 30TB+) bring a whole new set of challenges. The two most relevant are:

  1. Large capacity SSDs are enabled by high-density NAND, such as QLC (quad-level cell NAND, storing 4 bits per cell), which brings more challenges than TLC NAND (triple-level cell, storing 3 bits per cell).
  2. SSD capacity growth commands an equivalent growth of the local DRAM used for maps, which has traditionally followed a 1:1000 ratio (DRAM to storage capacity).

We are now at the point where the 1:1000 ratio is no longer sustainable. But do we really need it? Why not a ratio of 1:4000? Or 1:8000? Those would reduce the DRAM requirement by a factor of 4 or 8, respectively. What prevents us from doing this?

This blog explores the thought process behind this approach and tries to chart a path forward for large capacity SSDs.

First, why does DRAM need to scale at a 1:1000 ratio with NAND capacity? The SSD needs to map the logical block addresses (LBAs) coming from the system to NAND pages and needs to keep a live copy of the whole map so it knows where data can be written to or read back. LBAs are 4KB in size and map addresses are typically 32 bits (4 bytes), so we need one 4-byte entry for every 4KB LBA; hence the 1:1000 ratio. Note that very large capacities need a bit more than this but, for simplicity, we'll stick to this ratio as it makes the reasoning simpler and won't materially change the outcome.
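As a back-of-the-envelope illustration of this ratio (a minimal sketch only; real controllers add further metadata overhead, and the 4-byte entry width is the nominal value from above):

```python
# Rough DRAM sizing for the SSD map: one 4-byte entry per mapped unit.
ENTRY_BYTES = 4          # 32-bit map entry
TB = 10**12

def map_dram_bytes(capacity_bytes: int, iu_bytes: int) -> int:
    """DRAM needed to keep one map entry per indirection unit (IU)."""
    return (capacity_bytes // iu_bytes) * ENTRY_BYTES

for iu_kb in (4, 16, 32, 64):
    gb = map_dram_bytes(30 * TB, iu_kb * 1024) / 10**9
    print(f"30TB SSD, {iu_kb}KB IU -> ~{gb:.1f} GB of map DRAM")
# 4KB IU -> ~29.3 GB (the ~1:1000 ratio); 16KB IU -> ~7.3 GB (~1:4000)
```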

Having one map entry for each LBA is the most effective granularity as it allows the system to write (i.e., create one map entry) at the lowest possible granularity. This is typically benchmarked as 4KB random writes, which are commonly used to measure and compare SSD write performance and endurance.

However, this may not be tenable in the long run. What if, instead, we had one map entry for every 4 LBAs? Or 8, 16, 32+ LBAs? If we used one map entry for every 4 LBAs (i.e., one entry every 16KB) we would save DRAM, but what happens when the system wants to write 4KB? Given the entry covers 16KB, the SSD needs to read the 16KB page, modify the 4KB being written, and write back the entire 16KB page. This would impact performance ("read 16KB, modify 4KB, write back 16KB" rather than just "write 4KB") but, most of all, it would impact endurance (the system writes 4KB but the SSD ends up writing 16KB to NAND), thus reducing the SSD life by a factor of 4. This is worrisome on QLC technology, which has a much more challenging endurance profile. If there is one thing that cannot be wasted on QLC, it is endurance!

So, the common reasoning is that the map granularity (or Indirection Unit, "IU", in more formal terms) cannot be changed, otherwise SSD life (endurance) would severely decline.

While all of the above is correct, do systems really write data at 4KB granularity? And how often? One can certainly buy a system just to run FIO with a 4KB random write profile but, realistically, people don't use systems that way. They buy them to run applications, databases, file systems, object stores, and so on. Do any of them use 4KB writes?

We decided to measure it. We selected a diverse set of application benchmarks, from TPC-H (data analytics) to YCSB (cloud operations), running on various databases (Microsoft® SQL Server®, RocksDB, Apache Cassandra®), various file systems (EXT4, XFS) and, in some cases, complete software-defined storage solutions such as Red Hat® Ceph® Storage, and measured how many 4KB writes are issued and how much they contribute to write amplification, i.e., the extra writes that reduce device life.

Before going into the details of the analysis we need to discuss why write size matters when endurance is at stake.

A 4KB write will create a "write 16KB to modify 4KB" operation and thus a 4x write amplification factor ("WAF"). But what if we get an 8KB write? Assuming it falls inside the same IU, it will be a "write 16KB to modify 8KB", so WAF=2. A bit better. And if we write 16KB? It may not contribute to WAF at all, since it is "write 16KB to modify 16KB". Hence, only the small writes contribute meaningfully to WAF.

There is also the subtler case in which writes are not aligned to the IU boundary, so there is always some misalignment that contributes to WAF, but that contribution also decreases rapidly with size.
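The arithmetic is simple enough to sketch in a few lines (an illustrative model only, assuming every IU touched by a write must be rewritten in full):

```python
# Illustrative model of IU-induced write amplification: every IU a write
# touches is rewritten in full (read-modify-write).
IU = 16 * 1024  # 16KB indirection unit

def iu_waf(write_bytes: int, offset_bytes: int = 0) -> float:
    """NAND bytes written divided by host bytes written, for one write."""
    first_iu = offset_bytes // IU
    last_iu = (offset_bytes + write_bytes - 1) // IU
    ius_touched = last_iu - first_iu + 1
    return ius_touched * IU / write_bytes

print(iu_waf(4 * 1024))                       # aligned 4KB      -> 4.0
print(iu_waf(16 * 1024))                      # aligned 16KB     -> 1.0
print(iu_waf(256 * 1024, offset_bytes=4096))  # misaligned 256KB -> ~1.06
```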

The chart below shows this trend:

Figure 1: 16KB IU-induced WAF, showing that larger IOs have a smaller impact

 

Large writes have minimal impact on WAF. A 256KB write, for example, may have no impact (WAF=1x) if aligned, or a minimal one (WAF=1.06x) if misaligned. Much better than the scary 4x incurred by 4KB writes!

We then need to profile all writes coming to the SSD and look at their alignment within an IU to compute the WAF contribution of each of them; the larger the writes, the better. To do this, we instrumented systems to trace IOs across several benchmarks. We collected samples for 20 minutes (generally between 100 and 300 million samples per benchmark) and then post-processed them to look at size and IU alignment, and to add each IO's contribution to WAF.
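A simplified sketch of that post-processing step follows; the trace record layout, a plain (offset_bytes, size_bytes) pair per write, is an assumption for illustration, and the size buckets mirror those in Figure 2 below:

```python
# Bucket each traced write by size and accumulate its IU-induced WAF.
from collections import defaultdict

IU = 16 * 1024
BUCKETS_KB = [4, 8, 16, 32, 64, 128, 256]

def bucket(size_bytes: int) -> str:
    for kb in BUCKETS_KB:
        if size_bytes <= kb * 1024:
            return f"<={kb}KB"
    return ">256KB"

def profile(trace):
    # bucket -> [IO count, host bytes, NAND bytes]
    stats = defaultdict(lambda: [0, 0, 0])
    for offset, size in trace:
        ius = (offset + size - 1) // IU - offset // IU + 1
        s = stats[bucket(size)]
        s[0] += 1
        s[1] += size
        s[2] += ius * IU
    for b, (count, host, nand) in stats.items():
        print(f"{b}: {count} IOs, measured WAF {nand / host:.2f}")

# Example: profile(parsed_trace) over a list of (offset, size) records
```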

The table below shows how many IOs fall into each size bucket:

Figure 2: Real data from benchmarks on IU-induced WAF (by IO count)

 

As shown, most writes fall either in the small 4-8KB (bad) buckets or in the 256KB+ (good) buckets.

If we apply the WAF chart above and assume that all of these IOs are misaligned, we get what is reported in the "worst case" column: most WAF values sit in the 1.x range, a few in the 2.x range and, very exceptionally, the 3.x range. Much better than the feared 4x, but not good enough to make this viable.

However, not all IOs are misaligned. Why would they be? Why would a modern file system create structures that are misaligned at such small granularities? Answer: they don't.

We measured each of the 100+ million IOs for each benchmark and post-processed them to determine how they align with a 16KB IU. The results are in the last column, "measured" WAF. The overhead is generally less than 5%, i.e., WAF <= 1.05. This means one can grow the IU size by 4x, building large capacity SSDs with QLC NAND and existing, smaller DRAM technologies, at a life cost below 5% and not the 400% postulated! These are astonishing results.

One may argue: "There are a lot of small writes at 4KB and 8KB, and they do have a 400% or 200% individual WAF contribution. Shouldn't the aggregated WAF be much higher because of such small but numerous IO contributions?" True, they are many, but they are small, so the volume of data they move is minimal. In the table above, a 4KB write counts as a single write just as a 256KB write does, but the latter carries 64x the data of the former.
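Continuing the hypothetical trace format above, this is the difference between averaging per-IO WAF by count and weighting it by the bytes each IO moves (the mix of IOs below is made up for illustration):

```python
# Count-weighted vs volume-weighted WAF over (host_bytes, nand_bytes) pairs.
def waf_by_count(ios):
    return sum(nand / host for host, nand in ios) / len(ios)

def waf_by_volume(ios):
    return sum(nand for _, nand in ios) / sum(host for host, _ in ios)

# 1000 small 4KB writes (16KB to NAND each) plus 100 aligned 256KB writes
ios = [(4096, 16384)] * 1000 + [(262144, 262144)] * 100
print(waf_by_count(ios))   # ~3.73: dominated by the many small IOs
print(waf_by_volume(ios))  # ~1.41: the large IOs carry most of the volume
```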

If we weight the table above by IO volume (i.e., accounting for each IO's size and the data it moves) rather than by IO count, we arrive at the following:

Figure 3: Real data from benchmarks on IU-induced WAF (by volume)

 

As we can see, the color grading of the most intense IOs now skews to the right, meaning the large IOs move the overwhelming majority of the data and hence the overall WAF contribution is small.

One last thing to note is that not all SSD workloads are suitable for this approach. The last line, for example, represents the metadata portion of a Ceph storage node, which performs very small IOs, causing a high WAF of 2.35x. Large-IU drives are not a good fit for metadata alone. However, if we mix data and metadata in Ceph (a common approach with NVMe SSDs), the size and amount of data trumps the size and amount of metadata, so the combined WAF is minimally affected.
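As a quick illustration of why mixing helps (the 95/5 volume split and the 1.05x data WAF are assumptions for illustration; only the 2.35x metadata WAF comes from the measurements above), the combined WAF is simply the volume-weighted average:

```python
# Volume-weighted WAF of a mixed data+metadata workload.
data_share, data_waf = 0.95, 1.05  # assumed data share of volume and its WAF
meta_share, meta_waf = 0.05, 2.35  # measured Ceph metadata WAF from above
combined = data_share * data_waf + meta_share * meta_waf
print(f"combined WAF ~ {combined:.2f}")  # ~1.12, close to the data-only figure
```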

Our tests show that, across real applications and the most common benchmarks, moving to a 16KB IU is a viable approach. The next step is convincing the industry to stop benchmarking SSDs with 4KB random writes in FIO, which has never been realistic and, at this point, is detrimental to evolution.

Impact of Different IU Sizes

An obvious follow-up question is: why a 16KB IU size? Why not 32KB or 64KB, and does it even matter?

This is a very fair question that requires specific investigation and should be turned into a more specific question: what is the impact of different IU sizes for any given benchmark?

Since we already have traces that are independent of the IU size, we just need to run them through the appropriate model and look at the impact.
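In the sketch above, this amounts to parameterizing the per-IO model by IU size (again using the hypothetical (offset_bytes, size_bytes) trace format; the tiny trace below is made up for illustration):

```python
# Sweep the IU size over the same trace, as in Figure 4.
def waf_for_iu(trace, iu: int) -> float:
    host = nand = 0
    for offset, size in trace:
        ius = (offset + size - 1) // iu - offset // iu + 1
        host += size
        nand += ius * iu
    return nand / host

# Tiny made-up trace: mostly large writes with a few small ones.
trace = [(0, 262144), (262144, 262144), (4096, 4096), (8192, 8192)]
for iu_kb in (4, 16, 32, 64):
    print(f"{iu_kb}KB IU: WAF {waf_for_iu(trace, iu_kb * 1024):.2f}")
```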

Figure 4 shows the impact of IU size on WAF:

Figure 4: Impact of IU size on WAF

 

A few results emerge from the chart:

  • IU size matters, and WAF degrades as the IU size grows. There is no right or wrong solution; each is a different trade-off that must be weighed against one's needs and goals.
  • The WAF degradation is not nearly as bad as feared, as in many of the cases we have seen above. Even in the worst case of a 64KB IU and the most aggressive benchmark, it is less than 2x, not the 16x one might fear.
  • Metadata, as seen before, is always a poor fit for large IUs, and the larger the IU, the worse it becomes.
  • JESD219A, the industry-standard profile for benchmarking WAF, is not good but acceptable at a 4KB IU, with an extra 3% WAF that is generally tolerable, but it becomes unusable at larger IUs, with a case in point of almost 9x at a 64KB IU.

DMTS - SYSTEM ARCHITECTURE

Luca Bert

Luca is a Distinguished Member of SSD System Architecture with over 30 years of Enterprise Storage experience. His focus is mainly on innovative features and their use in systems to further SSD value. He holds a Master’s Degree in Solid State Physics from University of Torino (Italy).