asic/merchant chip-based flash controllers · si ngl e key-quati on wi th 4 stage pi pel i ne...
TRANSCRIPT
ASIC/Merchant Chip-Based Flash
Controllers
Jeff Yang
Siliconmotion
Flash Memory Summit 2016
Santa Clara, CA
1
Outline
Flash Memory Summit 2016
Santa Clara, CA
2
Basic controller architecture.
The challenge on the Merchant Chip-based flash controller.
The Flash selection combination and the performance requirement.
Flash write channel and the read channel throughput analysis.
Hard-decoding only BCH based controller for SATA application throughput requirement.
From the hard-decoding only to soft-decoding controller.
Correction capability.
3D vs. 2D’s NAND architecture.
Vth tracking and the Data-retention issue.
RAID
Basic architecture
Flash Memory Summit 2016
Santa Clara, CA
3
Buffer for DATA CPU Buffer
for others
DRAMHost-Interface
SLC region
TLC region
Write channel
Read channel
Application combinations on TLC
Flash Memory Summit 2016
Santa Clara, CA
4
Buffer for DATA CPU Buffer
for others
DRAMHost-Interface
SLC region
TLC region
Write channel
Read channel
1. SLC caching.
Data always write into SLC. A Fixed portion of SLC regard as a cache buffer.
Performance boost on SLC caching.
Background GC to TLC block.
2. TLC direct.
Only a very small portion for system usage and small random write data.
A stable sustained write performance.
3. Dynamic SLC.
Less than 1/3 capacity threshold, using SLC.
Maximizing performance boosting period.
Background GC to TLC block.
1. Full size DRAM
For host data write caching. Full lookup table.
2. Non-DRAM
Extremely low cost.
Optimize for user experience.
More system info access from SLC blocks.
3. Small DRAM.
Full lookup table on external buffer
No host data buffer.
© Copyright Silicon Motion, Inc., 2012. All Rights Reserved.
2D/3D
One-pass
Two-pass
Multi-pass
SLC usage
SLC caching
TLC direct
SLC/TLC dynamic
External buffer
Full DRAM
Non-DRAM
Partial DRAM
Flash Memory Summit 2016
Santa Clara, CA
5
Challenge: Support all combinations
and cost efficiency
Traditional Write/Read channel with BCH
A: 1024B
B: 1024B randomized.
C: 1024B + 126B-parity
D: C +error from flash
randomizer Encoder
detectorKey-
equation
Chien-
searchFlash
channel
DE-
randomizer
System
Data
buffer
Corrector
RMW
Host
DMA
A B C
DEFG
A’
A’: 1024B with error bit.
E: syndrome (126B)
F: error-polynomial (128B)
G: error location and err-mag
Multi-channel Read in SATA controller
Share the decoder’s hardware with multi-channel.
Each channel will not encode and decoder at the same time. Share the encoder with Detector.
The decoder’s output should satisfy the host maximum read throughput.
7
SE
Q0
SE
Q1
SE
Q7
Decoder
Buffer for DATA
Host-Interface
Det0 Det1 Det3
Arbitrator
Key-equation
Chien-search
Data temp buffer Error location buffer
4-stage pipe-line BCH
BCH 72bit mode, 72bit error, chunks size is 1024B + 126B = 1150B
DMA is 100MHz parallel 16 576cycles (200MB/sec per channel)
Chien-search is operated at 330MHz with parallel16 circuit. 576cycles.
Chien-search throughput is 1024/(576x3ns) = 592MB/sec.
Key-equation cycle is proportional to error bit. (throughput, power consumption bottleneck)
Key-equation’s execution cycle should under 576 cycles • It will need a very high parallelism Key-equation on its hardware.
DMA-sec1 DMA-sec2
Key-
sec1
DMA-sec3 DMA-sec4 DMA-sec5 DMA-sec6 DMA-sec7 DMA-sec8
chien-sec1 chien-sec2 chien-sec3 chien-sec4 chien-sec5 chien-sec6 chien-sec7 chien-sec8
single key-quation with 4 stage pipeline
T=72bit mode and error bit=72
DMA-sec1 DMA-sec2 DMA-sec3
chien-sec1
Key-
sec1
Key-
sec1
Key-
sec1
Key-
sec1
Key-
sec1
Key-
sec1
Key-
sec1
Key-
sec1
cor
Correction the data in memory.
Diff clock domain handshake
cor cor cor cor cor cor cor
Key-equation operation efficiency.
An very efficiency BM, simplified and inversion free algorithm has been used as an original.
The further reduction provide much better efficiency. • 1KB 10bits error, 288 42 cycles. ~85% improvement. (BOL)
• 2KB 20bits error, 654 87 cycles. ~87% improvement. (BOL)
• 1KB 28bits error, 200 127cycles. ~55% improvement. (EOL)
• 2KB 71bits error, 912 414 cycles. ~55% improvement. (EOL) 9
0
200
400
600
800
1000
1200
0 20 40 60 80 100 120 140 160
Exe
cuti
on
cyc
les
Error bit number per chunk
Key-equation execution cycles vs. error bitwith differnt Chunk size 1KB/2KB
1KB_ori
2KB_ori
1KB_reduction
2KB_reduction
1KB + 126B with 72bit protection.
Cover range to UBER~1e-15
RBER = 3.1e-3
Average error bit = 28bits per chunk
2KB + 252B with 134bit protection
Cover range to UBER ~1e-15
RBER = 3.9e-3
Average error bit = 71bits per chunk
Soft-information interface
SMI Confidential 10
In order to provide better decoder’s correction capability, using the soft-info to get more reliability bits.
NAND interface support .
• Traditional read/retry interface.
• Direct soft-info interface.
Buffer for DATACPU
Sub-system
Buffer for
others
DRAMHost-Interface
Write channel
DS
P0
DS
P1
DS
P7
Decoder
SE
Q0
SE
Q1
SE
Q7
DSP engine’s buffer size
11
The buffer size is the capability to contain the number of chunks soft-bit.
Access addition soft-info from NAND may need additional read busy time.
Read the soft-bit under the same busy time will have higher efficiency, but buffer size
requirement is huge.
Buffer 1 for Sign
Buffer 2 for Soft-bit 0
Buffer 2 for Soft-bit 1
CompressionLLR
Mapping
Soft-decoding throughput limitation
12
One Transfer time = 2.5ns/1B x 18432B = 46us (400MTs)
Assume DSP-buffer size 16KB.
• 9 tR time + 12 transfer-time = 9x(100us) + 12 x (46us) = 1452us
• Throughput = 64KB/1452us = 44MB/sec
• Assume DSP-buffer size 64KB.
• 3tR time + 12 transfer-time = 3x(100us) + 8 x(46us) = 668us.
• Throughput = 64KB/668us = 95MB/sec
In Client SSD applications,
Soft-decoding will regard as the ERROR-Recovery flow.
We will not ask the throughput under recovery mode.
But we will take care the recovery mode trigger rate.
High throughput SOC bottleneck PCIE3x4
Flash Memory Summit 2016
Santa Clara, CA
13
Buffer for DATA CPU Buffer
for others
DRAMHost-Interface
SLC region
TLC region
Write channel
Read channel
1. Write channel.
Support larger than 4GB/sec throughput. (5.28GB/sec)
2. Read channel.
Support 4.4GB/sec throughput. .
3. DRAM to Buffer.
Support larger than 4GB/sec throughput.
4. Host-interface to DRAM
Host write data will store in DRAM first.
This scheme will serve the host behavior better.
Larger than 4GB/sec throughput.
ECC Chunk
• Fixed code rate: around 0.9, ECC chunk size: 1KB/ 2KB/ 4KB
• Hard-decoding is based on BCH, and soft-decoding is based on LDPC with less than 3-bit
channel reliability values.
Correction Performance: 4KB better than 1KB
Decoding Latency: 1KB better than 4KB
14
1.0E-17
1.0E-15
1.0E-13
1.0E-11
1.0E-09
1.0E-07
1.0E-05
1.0E-03
1.0E-01
1.E-03 1.E-02 1.E-01
UB
ER
RBER
1KB-hard
2KB-hard
4KB-hard
1KB-soft
2KB-soft
4KB-soft
Failure range from 2D to 3D.
15
Block
WL 0
Pair-Page 0
WL 1
Pair-Page 0
L
M
U
L
M
U
WL 2
Pair-Page 0
L
M
U
WL 3
Pair-Page 0
L
M
U
WL M
Pair-Page 0
L
M
U
2D BLOCK
0
1
2
3
4
5
6
7
8
Program order
WL 4
Pair-Page 0
L
M
U
WL 5
Pair-Page 0
L
M
U
WL 6
Pair-Page 0
L
M
U
Damage range of
Program fail
Damage range of Word-line open
Damage range of two Word-line short
Block
WL 0
Pair-Page 0
Pair-Page 1
Pair-Page N
WL 1
Pair-Page 0
Pair-Page 1
Pair-Page N
L
M
U
L
M
U
L
M
U
L
M
U
L
M
U
WL M
Pair-Page 0
LMU
L
M
U
3D BLOCK
0
1
M
Damage range of Program fail / Word-line Open
Damage range of
Word-line short
Program order
Retention issue on 2D/3D
• Both the 2D and 3D will have the data retention problem. • 1Znm MLC need 6~10 read-retry tables, But TLC need 40~45 tables with less
endurance and retention.
• 3D will have more severe Data retention issue.
16
[ref]: E.S. Choi, S.K. Park, “Device Considerations for High Density and
Highly reliable 3D NAND Flash Cell in Near Future”. IEDM 2012
DSP algorithm for the Vth-tracking
17
+1 readHard-
decoding
Correctable?+3 read
Soft-decoding1-sign, 1-soft
Correctable?Soft-decoding
+5 read1-sign, 2-soft
1st 2nd 3rd
1st 2nd 3rd
4th 5th
Correctable?Soft-decoding
+7 read1-sign, 2-soft
Correctable?
Smart prediction the right center sensing point
Keep in +7 readAdjust the decoding
parameter
1st 2nd 3rd
4th 5th
6th 7th
Read fail(go to next error recovery stage)
Get a correctable code word.Gather the tracking result
Yes
Yes
Yes
Yes
Yes
No
No
No
No
No Uncorrectable && ~max-loop
Over-hill detection will prevent the
Vth-tracking into wrong direction.
The parameter include:
1. Sensing step size.
2. Decoder internal coef.
3. Dynamic tuning the
LLR
4. ……
© Copyright Silicon Motion, Inc., 2012. All Rights Reserved.
The flash trend and ECC correction
2D 3D EDURANCE After cycling: Keep the same Error
distribution in LOW RBER
After cycling: Keep the same Error
distribution in LOW RBER
Data RETENTION The RBER become worse, the Vth
also shifting
Only the Vth-shifting, but RBER is
still good.
The 3D flash is good!! What are we waiting for?
COST, COST, COST!!!
HDD Trend: RS LDPC NB-LDPC
SSD Trend: HM RS BCH LDPC
Target RBER requirement
Normal
operation
(Base-line)
Extreme low power.
Keep the host throughput RBER = 3e-3
Reliability
extension
Vth-tracking to lowering the
RBER.
Soft-info to have stronger
correction
RBER = 1e-2 ~ 1.2e-2
WHY target RBER= 3e-3?
BCH 72bit/1KB will provide UBER< 1e-
15 with RBER = 3e-3
Need a ECC stronger then BCH
Provide the most-cost efficient way to
satisfy the reliability
ECC design loop related to NAND characteristics.
We already have 6th generation LDPC decoder.
Keep improving the LDPC performance.
For higher throughput ~8GB/sec, we may go back to step1.
After 28nm process, the design iteration depth will from code-construction to trial APR.
EX: Find the Routing congestion issue in step 4, it may need to solve from step1.
19
•Throughput requirement trade-off.
•Power consumption trade-off.
•Correction capability trade-off.
•Complexity trade-off.
•Data transfer efficiency(data-path overhead)
• APR utilization from different process node.
•Firmware enabling error recovery
•Encoder scheme. Configurable and power consumption.
•Decoder algorithm prove and evaluate with NAND model.
•FPGA emulation for LDPC decoding algorithm.
•NAND error behavior. Soft/hard.
•NAND Page size and reserved spare area.
•ECC chunk size
Code Construction
Encoding and decoding algorithm
Hardware architecture performance
trade-off
SOC integration and
Firmware control flow
Before the RAID protect flow…..
20
DRAM/
DRAM-less/
Small DRAM
SLC-first/
TLC-direct write/
Dynamic SLC
One-pass /
Multi-pass/
Pair-page
mapping
Capacity
(RAID overhead)
Binary/arbitrary
Program failure
DRAM-backup/
Flash-cache/
RAID recover/
WL open
Failure range
WL to WL short
Failure range
Recovery latency All the issues combine together!
BUT THE SAME CONCEPT IS…….
Don’t put eggs in the same basket.
Different flash will have different failure range.
Different application will have different RAID encoding flow.
Reserve the maximum flexibility for all controller to support all kinds of 2D/3D
NAND.
If you don’t want to use RAID, What alternative you still have? • Read-back check after program.
21
LSB
CSB
MSB
RAID-Group 1
LSB
CSB
MSB
RAID-Group 2
LSB
CSB
MSB
RAID-Group X
… …
Put the pages in the same failure range
into different RAID-Group
Block
WL 0
Pair-Page 0
Pair-Page 1
Pair-Page N
WL 1
Pair-Page 0
Pair-Page 1
Pair-Page N
L
M
U
L
M
U
L
M
U
L
M
U
L
M
U
WL M
Pair-Page 0
LMU
L
M
U
3D BLOCK
0
1
M
Damage range of Program fail / Word-line Open
Damage range of
Word-line short
Program order