SSD Technology - Basics 1
DESCRIPTION
SSD important basics and in-depth concepts.
TRANSCRIPT
SSD Overview
Terminologies Associated with SSD
Write Endurance estimation example
3PAR Architecture’s Flash Friendliness
− A NAND flash device is made up of multiple memory cells.
− A NAND memory cell is a MOS transistor with a floating gate, which permanently stores charge.
− Programming puts electrons on the floating gate; erase takes them off.
− One program/erase (P/E) cycle is one round trip by the electrons.
− Back-and-forth round trips gradually damage the tunnel oxide, i.e. as more P/E cycles happen, the tunnel oxide degrades. If it degrades beyond a point, the cell becomes useless.
− If the tunnel oxide layer is thick, it can sustain more P/E cycles, thereby increasing endurance. Endurance is typically measured in number of P/E cycles:
  • 50nm MLC ~ 10,000 P/E cycles
  • 34nm/25nm/20nm MLC ~ 3,000 – 5,000 P/E cycles
  • While physical cell size shrinks at smaller die geometries, drive endurance is reduced.
Basic Building blocks of NAND Flash devices
SSD Layout
Types of Memory cells
− SLC (Single Level Cell) → 1 bit per memory cell
− MLC (Multi Level Cell) → 2 bits per memory cell
− TLC (Triple Level Cell) → 3 bits per memory cell
− 16LC (16 Level Cell) → 4 bits per memory cell
Types of NAND Flash
SSD Layout
Higher Density = Higher Capacity & Lower Endurance
Understanding the Internal Constructs of an SSD Drive
SSD Layout - Summary
Cells
Pages (Multiple Cells)
Blocks (Multiple Pages)
Plane (Multiple Blocks)
Die (Multiple Planes)
TSOP (Multiple Dies)
SSD (Multiple TSOPs)
A basic I/O (Reads/Writes) happens at a ‘Page’ Level.
However, Erase is done in terms of 'Blocks'.
This leads to situations where there can be more ‘Writes’ on the back-end than the actual ‘Writes’ from the Host.
Data is accessed (Read/Write) in terms of 'Pages', but Erase is done in terms of 'Blocks'.
A Page = multiple memory cells
• one page is the smallest structure which can be read or written. The standard page size is 4 KiB.
A Block = multiple pages
• one block is the smallest structure which can be erased
• e.g. one block = 128 pages at 4 KiB → 512 KiB block (HGST). On some SSDs (25nm/20nm Intel/Micron or 24nm/19nm Sandisk/Toshiba), one block = 256 pages @ 8 KiB → 2 MiB block.
Pages & Blocks
SSD Layout
Block = 128 Pages = 512 KiB
Page = 4 KiB
This is like earning in Rs. and spending in US$ :)
Host I/O happens at the Page level; Erase happens at the Block level (see the sketch below).
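To make the Rs./US$ analogy concrete, here is a minimal Python sketch, using the illustrative 4 KiB page / 128-page block figures from this slide, of the worst case where updating a single page forces the drive to rewrite an entire block. This is why back-end writes can exceed host writes.

```python
# Worst-case illustration (hypothetical numbers from the slide): the host
# rewrites one 4 KiB page, but the drive has to relocate the whole block.
PAGE_KIB = 4            # smallest unit that can be read or written
PAGES_PER_BLOCK = 128   # a block is the smallest unit that can be erased
BLOCK_KIB = PAGE_KIB * PAGES_PER_BLOCK   # 512 KiB

host_write_kib = PAGE_KIB       # what the host actually changed
backend_write_kib = BLOCK_KIB   # worst case: every page in the block is rewritten

print(f"Host write     : {host_write_kib} KiB")
print(f"Back-end write : {backend_write_kib} KiB "
      f"({backend_write_kib // host_write_kib}x the host write)")
```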
The next SSD construct is a Plane
• Multiple blocks make up a plane
• e.g. 1,024 Blocks = 1 Plane
SSD Layout
The next higher construct is a 'Die'
• Multiple planes make up a die
• e.g. 4 Planes = 1 Die
TSOPs (thin small outline packages)
Multiple 'Dies' make up a TSOP
• typically one or two dies in a TSOP
• up to eight dies possible → 64 GiByte in a TSOP
SSDs
Multiple TSOPs (e.g. ten) make up an SSD
• current capacities up to 800/1400 GB (a rough capacity walk-up is sketched below)
SSD Layout
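As a rough walk-up of the hierarchy, the sketch below composes the illustrative per-level figures quoted on these slides (4 KiB pages, 128-page blocks, 1,024 blocks per plane, 4 planes per die, up to 8 dies per TSOP, ~10 TSOPs per SSD). These toy numbers compose to about 160 GiB, so a real 800+ GB drive clearly uses denser dies than this example assumes.

```python
def ssd_capacity_gib(page_kib, pages_per_block, blocks_per_plane,
                     planes_per_die, dies_per_tsop, tsops_per_ssd):
    """Walk the hierarchy: page -> block -> plane -> die -> TSOP -> SSD."""
    block_kib = page_kib * pages_per_block
    plane_kib = block_kib * blocks_per_plane
    die_kib = plane_kib * planes_per_die
    tsop_kib = die_kib * dies_per_tsop
    ssd_kib = tsop_kib * tsops_per_ssd
    return ssd_kib / (1024 * 1024)   # KiB -> GiB

# Illustrative figures from the slides; a real die implied by the 64 GiB TSOP
# figure would hold more blocks/planes than this toy example.
print(ssd_capacity_gib(page_kib=4, pages_per_block=128, blocks_per_plane=1024,
                       planes_per_die=4, dies_per_tsop=8, tsops_per_ssd=10))  # ~160 GiB
```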
Jargon explained…
Terms associated with SSDs
• Over Provisioning
• Wear Levelling
• Garbage Collection
• Write Amplification
• Drive Endurance / Write Endurance
• DWPD - Drive/Device Writes Per Day. This is a way of rating endurance and can be used to match an application with a specific SSD type (SLC, eMLC etc.). The associated assumption is that this daily usage figure is good for an operating period of 5 years.
Over Provisioning
Each SSD drive has a higher physical capacity than the advertised capacity of the drive. This extra area is 'spare' area or overprovisioned space.
• Typically between 7% and 28% of net capacity
• e.g. 800 GByte visible, but actual capacity is 1200 GiByte (also called soft capacity)
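As a quick illustration of the 7%–28% range (assuming, as the bullet above does, that over-provisioning is expressed relative to the net, user-visible capacity), a minimal sketch:

```python
def raw_capacity_gb(user_capacity_gb, op_fraction):
    """Physical capacity needed for a given over-provisioning fraction,
    with OP expressed relative to the user-visible (net) capacity."""
    return user_capacity_gb * (1 + op_fraction)

user_gb = 800   # advertised capacity from the slide's example
for op in (0.07, 0.28):
    print(f"{op:.0%} over-provisioning -> {raw_capacity_gb(user_gb, op):.0f} GB physical")
```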
What is the extra area used for?
o keep free pages ready for quick writing, with less impact on host latency (reduce or avoid what is known as the 'write cliff')
o wear leveling (ensure that all blocks are evenly utilized, so as to increase the life of the drive)
o bad block replacement (substitute or remap bad blocks from the spare area)
Space Utilization & Management Techniques
[Chart: Write Latency (ms) over time, illustrating the 'write cliff']
What is the 'Write Cliff'? It is when latency increases exponentially because there are not enough clean pages left to flush incoming writes into.
Space Utilization & Management Techniques
Wear levelling
• Since flash memory cells can only be erased (written) a limited number of times, the controller/drive firmware has intelligence to ensure that all cells are evenly utilized.
• 'Wear Leveling' distributes the wear-out over all memory cells – blocks are redistributed in order to ensure all blocks are evenly utilized.
• This is where the over-provisioned capacity of the drive comes into the picture.
• Types of Levelling
  • Dynamic Levelling
  • Static Levelling
Example:
LBA 1 was initially mapped to Block-A Page-1. When a subsequent write to that data happens, LBA 1 is remapped to Block-C Page-5.
LBA 2, which was earlier on Block-A Page-2, is remapped to Block-C Page-6.
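A minimal Python sketch of that remapping idea, using a hypothetical logical-to-physical map standing in for the drive's mapping layer (the block/page names mirror the example above):

```python
# Toy logical-to-physical map. On every write, the firmware redirects the LBA
# to a fresh, less-worn page instead of rewriting the original location; the
# old page becomes 'dirty' and is reclaimed later by garbage collection.
l2p = {1: ("Block-A", 1), 2: ("Block-A", 2)}   # LBA -> (block, page)

def write(lba, new_location):
    old = l2p.get(lba)
    l2p[lba] = new_location
    print(f"LBA {lba}: {old} -> {new_location}")

write(1, ("Block-C", 5))
write(2, ("Block-C", 6))
```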
Space Utilization & Management Techniques
Garbage collection
Since I/O is done at the Page level but erase happens at the Block level, there will be times when some pages are filled and some pages are 'dirty' and need to be overwritten/deleted. Garbage Collection is the background process for aggregating all the used pages into a new set of blocks while aggregating all 'dirty' pages so that they can be erased.
How does this process work?
• Periodically, at times even without I/O, the SSD controller merges partly-filled blocks.
• This helps to increase the number of deleted blocks that can be erased and kept ready so as to aid/improve 'writes'.
• This is usually a background process to merge partially used pages and free up blocks for proactive erasure to aid future writes.
In this example, the 'orange' pages are dirty and due for deletion. To erase those blocks, the 'green' (valid) pages are first remapped to another block, and the old blocks are then erased to facilitate future writes. A small sketch of this follows.
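A minimal Python sketch of the idea, with hypothetical block and page names (valid pages stand in for the 'green' pages, dirty pages for the 'orange' ones):

```python
# Toy garbage collection: copy the still-valid pages out of partly-dirty
# blocks into a fresh block, then erase the old blocks so they are ready
# for future writes.
blocks = {
    "Block-A": {"p0": "valid", "p1": "dirty", "p2": "valid", "p3": "dirty"},
    "Block-B": {"p0": "dirty", "p1": "valid", "p2": "dirty", "p3": "valid"},
}

fresh_block = {}
for name, pages in blocks.items():
    valid_pages = [p for p, state in pages.items() if state == "valid"]
    for p in valid_pages:
        fresh_block[f"{name}:{p}"] = "valid"   # relocate valid page
    pages.clear()                              # the whole block is erased at once
    print(f"{name}: relocated {len(valid_pages)} valid page(s), block erased")

print("Fresh block now holds:", list(fresh_block))
```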
Space Utilization & Management Techniques
Today, most Storage systems are capable of handling ‘deletes’ of Pages intelligently. Deleted pages are proactively marked for ‘garbage collection’ so that pages can be reclaimed for future allocation. While ‘unmap’ is a useful capability to have for Spinning Media, it is of utmost importance for SSD Drives!
• SCSI UNMAP (commercial grade drives)
• ATA TRIM commands (for drives with SATA interfaces – consumer grade drives)
Basically, the OS 'tells' the SSD which LBAs are not needed anymore and can be erased. This helps to increase the number of free blocks by initiating garbage collection, thereby increasing write performance.
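Continuing the hypothetical mapping model from the wear-levelling slide, a minimal sketch of what an unmap/TRIM hint does inside the drive: the reported LBAs are dropped from the map and their backing pages are marked dirty, so garbage collection can reclaim them without copying any data.

```python
# Toy TRIM/UNMAP handling: the OS reports LBAs it no longer needs; the drive
# marks the backing pages 'dirty' so garbage collection can erase them
# without relocating any data first.
l2p = {10: ("Block-A", 0), 11: ("Block-A", 1), 12: ("Block-B", 0)}
page_state = {loc: "valid" for loc in l2p.values()}

def trim(lbas):
    for lba in lbas:
        loc = l2p.pop(lba, None)
        if loc is not None:
            page_state[loc] = "dirty"   # reclaimable with no copy cost

trim([10, 11])
print(page_state)   # Block-A pages are now dirty, Block-B page still valid
```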
Space Utilization – Write Amplification
Source : wikipedia.org
Write Amplification (WA)
• An undesirable but unavoidable phenomenon with SSDs where the amount of physical data written to the drive is actually higher than the amount of data written or sent by the host.
Why does this happen
• Since flash memory must be erased before it can be rewritten, these operations result in moving (or rewriting) user data and metadata more than once. This multiplying effect increases the number of writes required over the life of the SSD.
• Many factors affect the write amplification
• some can be controlled by the user
• some are a direct result of the data written to and usage of the SSD.
Sequential I/O has the lowest WA factor while Random I/O has the highest.
Typical WA values range from 1.1 for sequential I/O to as high as 3 to 3.3 for random I/O (a quick calculation is sketched below).
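A minimal sketch of what those figures mean for the NAND, using the typical WA values quoted above and a hypothetical host write volume:

```python
def nand_writes_tb(host_writes_tb, write_amplification):
    """Physical (NAND) writes generated by a given amount of host writes."""
    return host_writes_tb * write_amplification

host_tb = 100   # hypothetical host writes over some period
for label, wa in [("sequential", 1.1), ("random", 3.3)]:
    print(f"{label:10s} WA = {wa}: {nand_writes_tb(host_tb, wa):.0f} TB written to NAND")
```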
Factors Impacting WA
• Wear Levelling
• Garbage Collection
• Over Provisioning
• I/O Pattern – Random/Sequential
• Data Compression / De-dupe
Space Utilization & Drive Endurance
Drive Endurance / Write Endurance
Prolonged drive usage (writes) affects the life of the drive; this is referred to as Drive or Write Endurance.
Typically quantified in terms of Device/Drive Writes Per Day (DWPD).
How does an SSD handle endurance?
− Bad blocks
Over time, erase slows down with P/E cycles. If a NAND block fails to erase, it reports back and the drive controller will use another block instead (the block is remapped to another block).
No data is lost – a failed NAND block is not a problem (as long as there is enough spare capacity to remap that block).
− Write data errors
Due to prolonged usage, blocks may encounter write data errors.
RBER (raw bit error rate) – soft errors are usually corrected by ECC. RBER often gradually increases with P/E cycles (hardware errors).
UBER (uncorrectable bit error rate) – usually very low (< 1 error out of every 10^15 to 10^16 accesses); an uncorrectable error usually results in a block remap.
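To put those UBER figures in perspective, a rough back-of-the-envelope calculation (treating the 10^15–10^16 figure as bits read, which is how UBER is normally quoted, and assuming a hypothetical 1 PB of reads):

```python
# Expected uncorrectable errors for 1 PB of reads (1 PB = 8e15 bits),
# at the UBER range quoted on the slide. Purely illustrative arithmetic.
bits_read = 1e15 * 8
for uber in (1e-15, 1e-16):
    print(f"UBER {uber:.0e}: ~{bits_read * uber:.1f} uncorrectable errors expected")
```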
Space Utilization & Drive Endurance
Drive Endurance / Write Endurance
The device choice (SLC, eMLC, cMLC etc.) can be arrived at based on the required Write Endurance, which in turn can be derived from the 'Write' workload.
The Write Endurance is typically specified in units of device-writes-per-day (DWPD). It is defined as the amount of writes (PetaBytesWritten) that can be sustained over the entire product lifetime (DaysPerLife) normalized to the drive’s capacity (Capacity):
Hence DWPD = (PetaBytesWritten / DaysPerLife) / Capacity.
This theoretical formula derives how much data can be written per day per drive based on information given in the drive manufacturer's datasheet. Looking at a Hitachi datasheet, they specify that a 400GB eMLC SSD drive can endure 7.3 PB of writes over its lifetime. Assuming the writes are sustained over 24 hours x 365 days x 5 years (1825 days), the DWPD for this drive works out to:
DWPD = (7.3 / 1825) / 0.0004 = 10 (with the drive capacity converted to PB for the calculation). This means on a 400GB SSD drive you could write 4TB per day over its life of 5 years.
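The same arithmetic as a minimal Python sketch (the 7.3 PB and 400 GB figures are from the Hitachi datasheet example above):

```python
def dwpd(petabytes_written, days_per_life, capacity_pb):
    """DWPD = (PetaBytesWritten / DaysPerLife) / Capacity, with everything in PB."""
    return (petabytes_written / days_per_life) / capacity_pb

pbw = 7.3             # lifetime write rating from the datasheet example
days = 5 * 365        # 1825 days of sustained use
cap_pb = 400 / 1e6    # 400 GB expressed in PB

print(f"DWPD       = {dwpd(pbw, days, cap_pb):.1f}")   # -> 10.0
print(f"Writes/day = {pbw / days * 1000:.0f} TB")       # -> 4 TB
```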
Caution: this is for a single drive and hence does not factor in the additional overhead imposed by RAID protection. Also please read the additional notes below.
Space Utilization & Drive Endurance
Calculating the Required DWPD for sample Real World workloads
Let's take a Core Banking System example delivering 100 TPS (financial transactions per second).
No. of TPS: 100
No. of DB Transactions per Fin. Transaction: 20 – assumption based on a Finacle workload
No. of I/Os per DB Transaction: 5
Total Host IOPS: 10,000
Read Percentage: 70
Host Read IOPS: 7,000
Host Write IOPS: 3,000
Average Working Time (hours): 12 – assuming a 12 hour sustained workload window
Block Size (KB): 8
Estimated Host Write MB/s: 23.44 – write load at the logical volume level (write_logical_MBps)
Estimated Back-end MB/s (physical disk): 46.88 – RAID overhead of 2 write I/Os for every host write; applicable for both RAID 1 and RAID 5
No. of SSD Drives Used: 8 – since the minimum recommendation is 8 drives in a CPG
Estimated Back-end MB/s per Physical Drive: 5.86 (write_physical_MBps)
Average Work Time in Seconds: 43,200 – 12 hours x 3600 (seconds_perday)
Physical Drive Capacity Used (GB): 480 (Capacity)
Required DWPD: 0.53
While the theoretical DWPD on the 400GB drive is 10, in a real world scenario the requirement could be a lot lower.
If we know the workload, the required DWPD can be calculated:
Required DWPD = (write_physical_MBps * seconds_perday) / (Capacity * 1000)
This example assumes a single application load on the SSD drives. If multiple workloads share the same SSD CPG, then the writes of all those applications have to be aggregated before calculating the required DWPD; a sketch of the calculation follows.
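The worked example above as a minimal Python sketch (the function and parameter names are illustrative; the values are those from the table):

```python
def required_dwpd(tps, db_tx_per_fin_tx, ios_per_db_tx, read_pct,
                  block_kb, hours_per_day, raid_write_multiplier,
                  num_drives, drive_capacity_gb):
    """Back-of-the-envelope required DWPD for a shared SSD tier,
    following the worked example on this slide."""
    total_iops = tps * db_tx_per_fin_tx * ios_per_db_tx
    write_iops = total_iops * (100 - read_pct) / 100
    host_write_mbps = write_iops * block_kb / 1024            # logical MB/s
    backend_mbps = host_write_mbps * raid_write_multiplier    # physical MB/s
    per_drive_mbps = backend_mbps / num_drives
    seconds_per_day = hours_per_day * 3600
    return (per_drive_mbps * seconds_per_day) / (drive_capacity_gb * 1000)

# Values from the slide's Core Banking example
print(round(required_dwpd(tps=100, db_tx_per_fin_tx=20, ios_per_db_tx=5,
                          read_pct=70, block_kb=8, hours_per_day=12,
                          raid_write_multiplier=2, num_drives=8,
                          drive_capacity_gb=480), 2))   # -> 0.53
```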
At this level, even a cMLC drive can comfortably sustain the workload over 5 years.