SSD Technology - Basics 1
DESCRIPTION
SSD important basics and in-depth concepts.
TRANSCRIPT
SSD Overview
Terminologies Associated with SSD
Write Endurance estimation example
3PAR Architecture’s Flash Friendliness
− A NAND flash device is made up of multiple memory cells.
− A NAND memory cell is a MOS transistor with a floating gate, which permanently stores charge.
− Programming puts electrons on the floating gate; erase takes them off.
− One program/erase (P/E) cycle is one round trip by the electrons.
− Back-and-forth round trips gradually damage the tunnel oxide, i.e. as more P/E cycles happen, the tunnel oxide degrades. If it degrades beyond a point, the cell becomes useless.
− If the tunnel oxide layer is thick, it can sustain more P/E cycles, thereby increasing endurance. Endurance is typically measured in number of P/E cycles:
  • 50nm MLC ~ 10,000 P/E cycles
  • 34nm/25nm/20nm MLC ~ 3,000 – 5,000 P/E cycles
  • While physical cell size shrinks at smaller die geometries, drive endurance is reduced.
Basic Building blocks of NAND Flash devices
SSD Layout
Types of Memory cells
− SLC (Single Level Cell) → 1 bit per memory cell
− MLC (Multi Level Cell) → 2 bits per memory cell
− TLC (Triple Level Cell) → 3 bits per memory cell
− 16LC (16 Level Cell) → 4 bits per memory cell
Types of NAND Flash
SSD Layout
Higher Density = Higher Capacity & Lower Endurance
Understanding the Internal Constructs of an SSD Drive
SSD Layout - Summary
Cells
Pages (Multiple Cells)
Blocks (Multiple Pages)
Plane (Multiple Blocks)
Die (Multiple Planes)
TSOP (Multiple Dies)
SSD (Multiple TSOPs)
A basic I/O (Reads/Writes) happens at a ‘Page’ Level.
However, Erase is done in terms of 'Blocks'.
This leads to situations where there can be more ‘Writes’ on the back-end than the actual ‘Writes’ from the Host.
Data is accessed (Read/Write) in terms of 'Pages', but Erase is done in terms of 'Blocks'.
A Page = multiple memory cells
• one page is the smallest structure which can be read or written. The standard page size is 4 KiB.
A Block = multiple pages
• one block is the smallest structure which can be erased
• e.g. one block = 128 pages at 4 KiB → 512 KiB block (HGST). On some SSDs (25nm/20nm Intel/Micron or 24nm/19nm Sandisk/Toshiba), one block = 256 pages @ 8 KiB → 2 MiB block.
Pages & Blocks
SSD Layout
Block = 128 Pages = 512 KiB
Page = 4 KiB
This is like earning in Rs. and spending in US$ :)
Host I/O happens at the Page level; Erase happens at the Block level (see the sketch below).
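To make the Rs./US$ analogy concrete, here is a minimal Python sketch, using the illustrative 4 KiB page / 128-page block figures from this slide, of the worst case where updating a single page forces the drive to rewrite an entire block. This is why back-end writes can exceed host writes.

```python
# Worst-case illustration (hypothetical numbers from the slide): the host
# rewrites one 4 KiB page, but the drive has to relocate the whole block.
PAGE_KIB = 4            # smallest unit that can be read or written
PAGES_PER_BLOCK = 128   # a block is the smallest unit that can be erased
BLOCK_KIB = PAGE_KIB * PAGES_PER_BLOCK   # 512 KiB

host_write_kib = PAGE_KIB       # what the host actually changed
backend_write_kib = BLOCK_KIB   # worst case: every page in the block is rewritten

print(f"Host write     : {host_write_kib} KiB")
print(f"Back-end write : {backend_write_kib} KiB "
      f"({backend_write_kib // host_write_kib}x the host write)")
```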
The next SSD construct is a Plane
• Multiple blocks make up a plane
• e.g. 1,024 Blocks = 1 Plane
SSD Layout
The next higher construct is a 'Die'
• Multiple planes make up a die
• e.g. 4 Planes = 1 Die
TSOPs (thin small outline packages)
Multiple 'Dies' make up a TSOP
• typically one or two dies in a TSOP
• up to eight dies possible → 64 GiByte in a TSOP
SSDs
Multiple TSOPs (e.g. ten) make up an SSD
• current capacities up to 800/1400 GB (a rough capacity walk-up is sketched below)
SSD Layout
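As a rough walk-up of the hierarchy, the sketch below composes the illustrative per-level figures quoted on these slides (4 KiB pages, 128-page blocks, 1,024 blocks per plane, 4 planes per die, up to 8 dies per TSOP, ~10 TSOPs per SSD). These toy numbers compose to about 160 GiB, so a real 800+ GB drive clearly uses denser dies than this example assumes.

```python
def ssd_capacity_gib(page_kib, pages_per_block, blocks_per_plane,
                     planes_per_die, dies_per_tsop, tsops_per_ssd):
    """Walk the hierarchy: page -> block -> plane -> die -> TSOP -> SSD."""
    block_kib = page_kib * pages_per_block
    plane_kib = block_kib * blocks_per_plane
    die_kib = plane_kib * planes_per_die
    tsop_kib = die_kib * dies_per_tsop
    ssd_kib = tsop_kib * tsops_per_ssd
    return ssd_kib / (1024 * 1024)   # KiB -> GiB

# Illustrative figures from the slides; a real die implied by the 64 GiB TSOP
# figure would hold more blocks/planes than this toy example.
print(ssd_capacity_gib(page_kib=4, pages_per_block=128, blocks_per_plane=1024,
                       planes_per_die=4, dies_per_tsop=8, tsops_per_ssd=10))  # ~160 GiB
```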
Jargon explained…
Terms associated with SSDs
• Over Provisioning
• Wear Levelling
• Garbage Collection
• Write Amplification
• Drive Endurance / Write Endurance
• DWPD - Drive/Device Writes Per Day. This is a way of rating endurance and can be used to match an application with a specific SSD type (SLC, eMLC etc.). The associated assumption is that this daily usage figure is good for an operating period of 5 years.
Over Provisioning
Each SSD drive has a higher physical capacity than the advertised capacity of the drive. This extra area is 'spare' area or overprovisioned space.
• Typically between 7% and 28% of net capacity
• e.g. 800 GByte visible, but actual capacity is 1200 GiByte (also called soft capacity)
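As a quick illustration of the 7%–28% range (assuming, as the bullet above does, that over-provisioning is expressed relative to the net, user-visible capacity), a minimal sketch:

```python
def raw_capacity_gb(user_capacity_gb, op_fraction):
    """Physical capacity needed for a given over-provisioning fraction,
    with OP expressed relative to the user-visible (net) capacity."""
    return user_capacity_gb * (1 + op_fraction)

user_gb = 800   # advertised capacity from the slide's example
for op in (0.07, 0.28):
    print(f"{op:.0%} over-provisioning -> {raw_capacity_gb(user_gb, op):.0f} GB physical")
```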
What is the extra area used for?
o keep free pages ready for quick writing, with less impact on host latency (reduce or avoid what is known as the 'write cliff')
o wear leveling (ensure that all blocks are evenly utilized, so as to increase the life of the drive)
o bad block replacement (substitute or remap bad blocks from the spare area)
Space Utilization & Management Techniques
[Chart: Write Latency (ms) over time, illustrating the 'write cliff']
What is the 'Write Cliff'? It is when latency increases exponentially because there are not enough clean pages left to flush incoming writes into.
Space Utilization & Management Techniques
Wear levelling
• Since flash memory cells can only be erased (written) a limited number of times, the controller/drive firmware has intelligence to ensure that all cells are evenly utilized.
• 'Wear Leveling' distributes the wear-out over all memory cells – blocks are redistributed in order to ensure all blocks are evenly utilized.
• This is where the over-provisioned capacity of the drive comes into the picture.
• Types of Levelling
  • Dynamic Levelling
  • Static Levelling
Example:
LBA 1 was initially mapped to Block-A Page-1. When a subsequent write to that data happens, LBA 1 is remapped to Block-C Page-5.
LBA 2, which was earlier on Block-A Page-2, is remapped to Block-C Page-6.
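A minimal Python sketch of that remapping idea, using a hypothetical logical-to-physical map standing in for the drive's mapping layer (the block/page names mirror the example above):

```python
# Toy logical-to-physical map. On every write, the firmware redirects the LBA
# to a fresh, less-worn page instead of rewriting the original location; the
# old page becomes 'dirty' and is reclaimed later by garbage collection.
l2p = {1: ("Block-A", 1), 2: ("Block-A", 2)}   # LBA -> (block, page)

def write(lba, new_location):
    old = l2p.get(lba)
    l2p[lba] = new_location
    print(f"LBA {lba}: {old} -> {new_location}")

write(1, ("Block-C", 5))
write(2, ("Block-C", 6))
```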
Space Utilization & Management Techniques
Garbage collection
Since I/O is done at the Page level but erase happens at the Block level, there will be times when some pages are filled and some pages are 'dirty' and need to be overwritten/deleted. Garbage Collection is the background process for aggregating all the used pages into a new set of blocks while aggregating all 'dirty' pages so that they can be erased.
How does this process work?
• Periodically, at times even without I/O, the SSD controller merges partly-filled blocks.
• This helps to increase the number of deleted blocks that can be erased and kept ready so as to aid/improve 'writes'.
• This is usually a background process to merge partially used pages and free up blocks for proactive erasure to aid future writes.
In this example, the 'orange' pages are dirty and due for deletion. To erase those blocks, the 'green' (valid) pages are first remapped to another block, and the old blocks are then erased to facilitate future writes. A small sketch of this follows.
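A minimal Python sketch of the idea, with hypothetical block and page names (valid pages stand in for the 'green' pages, dirty pages for the 'orange' ones):

```python
# Toy garbage collection: copy the still-valid pages out of partly-dirty
# blocks into a fresh block, then erase the old blocks so they are ready
# for future writes.
blocks = {
    "Block-A": {"p0": "valid", "p1": "dirty", "p2": "valid", "p3": "dirty"},
    "Block-B": {"p0": "dirty", "p1": "valid", "p2": "dirty", "p3": "valid"},
}

fresh_block = {}
for name, pages in blocks.items():
    valid_pages = [p for p, state in pages.items() if state == "valid"]
    for p in valid_pages:
        fresh_block[f"{name}:{p}"] = "valid"   # relocate valid page
    pages.clear()                              # the whole block is erased at once
    print(f"{name}: relocated {len(valid_pages)} valid page(s), block erased")

print("Fresh block now holds:", list(fresh_block))
```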
Space Utilization & Management Techniques
Today, most Storage systems are capable of handling ‘deletes’ of Pages intelligently. Deleted pages are proactively marked for ‘garbage collection’ so that pages can be reclaimed for future allocation. While ‘unmap’ is a useful capability to have for Spinning Media, it is of utmost importance for SSD Drives!
• SCSI UNMAP (commercial grade drives)
• ATA TRIM commands (for drives with SATA interfaces – consumer grade drives)
Basically, the OS 'tells' the SSD which LBAs are not needed anymore and can be erased. This helps to increase the number of free blocks by initiating garbage collection, thereby increasing write performance.
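Continuing the hypothetical mapping model from the wear-levelling slide, a minimal sketch of what an unmap/TRIM hint does inside the drive: the reported LBAs are dropped from the map and their backing pages are marked dirty, so garbage collection can reclaim them without copying any data.

```python
# Toy TRIM/UNMAP handling: the OS reports LBAs it no longer needs; the drive
# marks the backing pages 'dirty' so garbage collection can erase them
# without relocating any data first.
l2p = {10: ("Block-A", 0), 11: ("Block-A", 1), 12: ("Block-B", 0)}
page_state = {loc: "valid" for loc in l2p.values()}

def trim(lbas):
    for lba in lbas:
        loc = l2p.pop(lba, None)
        if loc is not None:
            page_state[loc] = "dirty"   # reclaimable with no copy cost

trim([10, 11])
print(page_state)   # Block-A pages are now dirty, Block-B page still valid
```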
Space Utilization – Write Amplification
Source : wikipedia.org
Write Amplification (WA)
• An undesirable but unavoidable phenomenon with SSDs where the amount of physical data written to the drive is actually higher than the amount of data written or sent by the host.
Why does this happen
• Since flash memory must be erased before it can be rewritten, these operations result in moving (or rewriting) user data and metadata more than once. This multiplying effect increases the number of writes required over the life of the SSD.
• Many factors affect the write amplification
• some can be controlled by the user
• some are a direct result of the data written to and usage of the SSD.
Sequential I/O has the lowest WA factor while Random I/O has the highest.
Typical WA values range from 1.1 for sequential I/O to as high as 3 to 3.3 for random I/O (a quick calculation is sketched below).
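A minimal sketch of what those figures mean for the NAND, using the typical WA values quoted above and a hypothetical host write volume:

```python
def nand_writes_tb(host_writes_tb, write_amplification):
    """Physical (NAND) writes generated by a given amount of host writes."""
    return host_writes_tb * write_amplification

host_tb = 100   # hypothetical host writes over some period
for label, wa in [("sequential", 1.1), ("random", 3.3)]:
    print(f"{label:10s} WA = {wa}: {nand_writes_tb(host_tb, wa):.0f} TB written to NAND")
```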
Factors Impacting WA
• Wear Levelling
• Garbage Collection
• Over Provisioning
• I/O Pattern – Random/Sequential
• Data Compression / De-dupe
Space Utilization & Drive Endurance
Drive Endurance / Write Endurance
Prolonged drive usage (writes) affects the life of the drive; this is referred to as Drive or Write Endurance.
Typically quantified in terms of Device/Drive Writes Per Day (DWPD).
How does an SSD handle endurance?
− Bad blocks
Over time, erase slows down with P/E cycles. If a NAND block fails to erase, it reports back and the drive controller will use another block instead (the block is remapped to another block).
No data is lost – a failed NAND block is not a problem (as long as there is enough spare capacity to remap that block).
− Write data errors
Due to prolonged usage, blocks may encounter write data errors.
RBER (raw bit error rate) – soft errors are usually corrected by ECC. RBER often gradually increases with P/E cycles (hardware errors).
UBER (uncorrectable bit error rate) – usually very low (< 1 error out of every 10^15 to 10^16 accesses); an uncorrectable error usually results in a block remap.
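To put those UBER figures in perspective, a rough back-of-the-envelope calculation (treating the 10^15–10^16 figure as bits read, which is how UBER is normally quoted, and assuming a hypothetical 1 PB of reads):

```python
# Expected uncorrectable errors for 1 PB of reads (1 PB = 8e15 bits),
# at the UBER range quoted on the slide. Purely illustrative arithmetic.
bits_read = 1e15 * 8
for uber in (1e-15, 1e-16):
    print(f"UBER {uber:.0e}: ~{bits_read * uber:.1f} uncorrectable errors expected")
```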
Space Utilization & Drive Endurance
Drive Endurance / Write Endurance
The device choice (SLC, eMLC, cMLC etc.) can be arrived at based on the required Write Endurance, which in turn can be derived from the 'Write' workload.
The Write Endurance is typically specified in units of device-writes-per-day (DWPD). It is defined as the amount of writes (PetaBytesWritten) that can be sustained over the entire product lifetime (DaysPerLife) normalized to the drive’s capacity (Capacity):
Hence DWPD = (PetaBytesWritten / DaysPerLife) / Capacity.
This theoretical formula derives how much data can be written per day per drive based on information given in the drive manufacturer's datasheet. Looking at a Hitachi datasheet, they specify that a 400GB eMLC SSD drive can endure 7.3 PB of writes over its lifetime. Assuming the writes are sustained over 24 hours x 365 days x 5 years (1825 days), the DWPD for this drive works out to:
DWPD = (7.3 / 1825) / 0.0004 = 10 (with the drive capacity converted to PB for the calculation). This means on a 400GB SSD drive you could write 4TB per day over its life of 5 years.
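The same arithmetic as a minimal Python sketch (the 7.3 PB and 400 GB figures are from the Hitachi datasheet example above):

```python
def dwpd(petabytes_written, days_per_life, capacity_pb):
    """DWPD = (PetaBytesWritten / DaysPerLife) / Capacity, with everything in PB."""
    return (petabytes_written / days_per_life) / capacity_pb

pbw = 7.3             # lifetime write rating from the datasheet example
days = 5 * 365        # 1825 days of sustained use
cap_pb = 400 / 1e6    # 400 GB expressed in PB

print(f"DWPD       = {dwpd(pbw, days, cap_pb):.1f}")   # -> 10.0
print(f"Writes/day = {pbw / days * 1000:.0f} TB")       # -> 4 TB
```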
Caution: this is for a single drive and hence does not factor in the additional overhead imposed by RAID protection. Also please read the additional notes below.
Space Utilization & Drive Endurance
Calculating the Required DWPD for sample Real World workloads
Let's take a Core Banking System example delivering 100 TPS (financial transactions per second).
No. of TPS: 100
No. of DB Transactions per Fin. Transaction: 20 – assumption based on a Finacle workload
No. of I/Os per DB Transaction: 5
Total Host IOPS: 10,000
Read Percentage: 70
Host Read IOPS: 7,000
Host Write IOPS: 3,000
Average Working Time (hours): 12 – assuming a 12 hour sustained workload window
Block Size (KB): 8
Estimated Host Write MB/s: 23.44 – write load at the logical volume level (write_logical_MBps)
Estimated Back-end MB/s (physical disk): 46.88 – RAID overhead of 2 write I/Os for every host write; applicable for both RAID 1 and RAID 5
No. of SSD Drives Used: 8 – since the minimum recommendation is 8 drives in a CPG
Estimated Back-end MB/s per Physical Drive: 5.86 (write_physical_MBps)
Average Work Time in Seconds: 43,200 – 12 hours x 3600 (seconds_perday)
Physical Drive Capacity Used (GB): 480 (Capacity)
Required DWPD: 0.53
While the theoretical DWPD on the 400GB drive is 10, in a real world scenario the requirement could be a lot lower.
If we know the workload, the required DWPD can be calculated:
Required DWPD = (write_physical_MBps * seconds_perday) / (Capacity * 1000)
This example assumes a single application load on the SSD drives. If multiple workloads share the same SSD CPG, then the writes of all those applications have to be aggregated before calculating the required DWPD; a sketch of the calculation follows.
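The worked example above as a minimal Python sketch (the function and parameter names are illustrative; the values are those from the table):

```python
def required_dwpd(tps, db_tx_per_fin_tx, ios_per_db_tx, read_pct,
                  block_kb, hours_per_day, raid_write_multiplier,
                  num_drives, drive_capacity_gb):
    """Back-of-the-envelope required DWPD for a shared SSD tier,
    following the worked example on this slide."""
    total_iops = tps * db_tx_per_fin_tx * ios_per_db_tx
    write_iops = total_iops * (100 - read_pct) / 100
    host_write_mbps = write_iops * block_kb / 1024            # logical MB/s
    backend_mbps = host_write_mbps * raid_write_multiplier    # physical MB/s
    per_drive_mbps = backend_mbps / num_drives
    seconds_per_day = hours_per_day * 3600
    return (per_drive_mbps * seconds_per_day) / (drive_capacity_gb * 1000)

# Values from the slide's Core Banking example
print(round(required_dwpd(tps=100, db_tx_per_fin_tx=20, ios_per_db_tx=5,
                          read_pct=70, block_kb=8, hours_per_day=12,
                          raid_write_multiplier=2, num_drives=8,
                          drive_capacity_gb=480), 2))   # -> 0.53
```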
At this level, even a cMLC drive can comfortably sustain the workload over 5 years.