asic/merchant chip-based flash controllers · si ngl e key-quati on wi th 4 stage pi pel i ne...

ASIC/Merchant Chip-Based Flash

Controllers

Jeff Yang

Siliconmotion

Flash Memory Summit 2016

Santa Clara, CA

1

Outline


Santa Clara, CA

2

Basic controller architecture.

The challenge on the Merchant Chip-based flash controller.

The Flash selection combination and the performance requirement.

Flash write channel and the read channel throughput analysis.

Hard-decoding only BCH based controller for SATA application throughput requirement.

From the hard-decoding only to soft-decoding controller.

Correction capability.

3D vs. 2D’s NAND architecture.

Vth tracking and the Data-retention issue.

RAID

Basic architecture


Santa Clara, CA

3

Buffer for DATA CPU Buffer

for others

DRAMHost-Interface

SLC region

TLC region

Write channel

Read channel

Application combinations on TLC


Santa Clara, CA

4


for others

DRAMHost-Interface

SLC region

TLC region

Write channel

Read channel

1. SLC caching.

Data always write into SLC. A Fixed portion of SLC regard as a cache buffer.

Performance boost on SLC caching.

Background GC to TLC block.

2. TLC direct.

Only a very small portion for system usage and small random write data.

A stable sustained write performance.

3. Dynamic SLC.

Less than 1/3 capacity threshold, using SLC.

Maximizing performance boosting period.

Background GC to TLC block.

1. Full size DRAM

For host data write caching. Full lookup table.

2. Non-DRAM

Extremely low cost.

Optimize for user experience.

More system info access from SLC blocks.

3. Small DRAM.

Full lookup table on external buffer

No host data buffer.

© Copyright Silicon Motion, Inc., 2012. All Rights Reserved.

2D/3D

One-pass

Two-pass

Multi-pass

SLC usage

SLC caching

TLC direct

SLC/TLC dynamic

External buffer

Full DRAM

Non-DRAM

Partial DRAM


Santa Clara, CA

5

Challenge: Support all combinations

and cost efficiency

Traditional Write/Read channel with BCH

A: 1024B

B: 1024B randomized.

C: 1024B + 126B-parity

D: C +error from flash

randomizer Encoder

detectorKey-

equation

Chien-

searchFlash

channel

DE-

randomizer

System

Data

buffer

Corrector

RMW

Host

DMA

A B C

DEFG

A’

A’: 1024B with error bit.

E: syndrome (126B)

F: error-polynomial (128B)

G: error location and err-mag

Multi-channel Read in SATA controller

Share the decoder’s hardware with multi-channel.

Each channel will not encode and decoder at the same time. Share the encoder with Detector.

The decoder’s output should satisfy the host maximum read throughput.

7

SE

Q0

SE

Q1

SE

Q7

Decoder

Buffer for DATA

Host-Interface

Det0 Det1 Det3

Arbitrator

Key-equation

Chien-search

Data temp buffer Error location buffer

4-stage pipe-line BCH

BCH 72bit mode, 72bit error, chunks size is 1024B + 126B = 1150B

DMA is 100MHz parallel 16 576cycles (200MB/sec per channel)

Chien-search is operated at 330MHz with parallel16 circuit. 576cycles.

Chien-search throughput is 1024/(576x3ns) = 592MB/sec.

Key-equation cycle is proportional to error bit. (throughput, power consumption bottleneck)

Key-equation’s execution cycle should under 576 cycles • It will need a very high parallelism Key-equation on its hardware.

DMA-sec1 DMA-sec2

Key-

sec1

DMA-sec3 DMA-sec4 DMA-sec5 DMA-sec6 DMA-sec7 DMA-sec8

chien-sec1 chien-sec2 chien-sec3 chien-sec4 chien-sec5 chien-sec6 chien-sec7 chien-sec8

single key-quation with 4 stage pipeline

T=72bit mode and error bit=72

DMA-sec1 DMA-sec2 DMA-sec3

chien-sec1

Key-

sec1

Key-

sec1

Key-

sec1

Key-

sec1

Key-

sec1

Key-

sec1

Key-

sec1

Key-

sec1

cor

Correction the data in memory.

Diff clock domain handshake

cor cor cor cor cor cor cor

Key-equation operation efficiency.

An very efficiency BM, simplified and inversion free algorithm has been used as an original.

The further reduction provide much better efficiency. • 1KB 10bits error, 288 42 cycles. ~85% improvement. (BOL)

• 2KB 20bits error, 654 87 cycles. ~87% improvement. (BOL)

• 1KB 28bits error, 200 127cycles. ~55% improvement. (EOL)

• 2KB 71bits error, 912 414 cycles. ~55% improvement. (EOL) 9

0

200

400

600

800

1000

1200

0 20 40 60 80 100 120 140 160

Exe

cuti

on

cyc

les

Error bit number per chunk

Key-equation execution cycles vs. error bitwith differnt Chunk size 1KB/2KB

1KB_ori

2KB_ori

1KB_reduction

2KB_reduction

1KB + 126B with 72bit protection.

Cover range to UBER~1e-15

RBER = 3.1e-3

Average error bit = 28bits per chunk

2KB + 252B with 134bit protection

Cover range to UBER ~1e-15

RBER = 3.9e-3

Average error bit = 71bits per chunk

Soft-information interface

SMI Confidential 10

In order to provide better decoder’s correction capability, using the soft-info to get more reliability bits.

NAND interface support .

• Traditional read/retry interface.

• Direct soft-info interface.

Buffer for DATACPU

Sub-system

Buffer for

others

DRAMHost-Interface

Write channel

DS

P0

DS

P1

DS

P7

Decoder

SE

Q0

SE

Q1

SE

Q7

DSP engine’s buffer size

11

The buffer size is the capability to contain the number of chunks soft-bit.

Access addition soft-info from NAND may need additional read busy time.

Read the soft-bit under the same busy time will have higher efficiency, but buffer size

requirement is huge.

Buffer 1 for Sign

Buffer 2 for Soft-bit 0

Buffer 2 for Soft-bit 1

CompressionLLR

Mapping

Soft-decoding throughput limitation

12

One Transfer time = 2.5ns/1B x 18432B = 46us (400MTs)

Assume DSP-buffer size 16KB.

• 9 tR time + 12 transfer-time = 9x(100us) + 12 x (46us) = 1452us

• Throughput = 64KB/1452us = 44MB/sec

• Assume DSP-buffer size 64KB.

• 3tR time + 12 transfer-time = 3x(100us) + 8 x(46us) = 668us.

• Throughput = 64KB/668us = 95MB/sec

In Client SSD applications,

Soft-decoding will regard as the ERROR-Recovery flow.

We will not ask the throughput under recovery mode.

But we will take care the recovery mode trigger rate.

High throughput SOC bottleneck PCIE3x4


Santa Clara, CA

13


for others

DRAMHost-Interface

SLC region

TLC region

Write channel

Read channel

1. Write channel.

Support larger than 4GB/sec throughput. (5.28GB/sec)

2. Read channel.

Support 4.4GB/sec throughput. .

3. DRAM to Buffer.

Support larger than 4GB/sec throughput.

4. Host-interface to DRAM

Host write data will store in DRAM first.

This scheme will serve the host behavior better.

Larger than 4GB/sec throughput.

ECC Chunk

• Fixed code rate: around 0.9, ECC chunk size: 1KB/ 2KB/ 4KB

• Hard-decoding is based on BCH, and soft-decoding is based on LDPC with less than 3-bit

channel reliability values.

Correction Performance: 4KB better than 1KB

Decoding Latency: 1KB better than 4KB

14

1.0E-17

1.0E-15

1.0E-13

1.0E-11

1.0E-09

1.0E-07

1.0E-05

1.0E-03

1.0E-01

1.E-03 1.E-02 1.E-01

UB

ER

RBER

1KB-hard

2KB-hard

4KB-hard

1KB-soft

2KB-soft

4KB-soft

Failure range from 2D to 3D.

15

Block

WL 0

Pair-Page 0

WL 1

Pair-Page 0

L

M

U

L

M

U

WL 2

Pair-Page 0

L

M

U

WL 3

Pair-Page 0

L

M

U

WL M

Pair-Page 0

L

M

U

2D BLOCK

0

1

2

3

4

5

6

7

8

Program order

WL 4

Pair-Page 0

L

M

U

WL 5

Pair-Page 0

L

M

U

WL 6

Pair-Page 0

L

M

U

Damage range of

Program fail

Damage range of Word-line open

Damage range of two Word-line short

Block

WL 0

Pair-Page 0

Pair-Page 1

Pair-Page N

WL 1

Pair-Page 0

Pair-Page 1

Pair-Page N

L

M

U

L

M

U

L

M

U

L

M

U

L

M

U

WL M

Pair-Page 0

LMU

L

M

U

3D BLOCK

0

1

M

Damage range of Program fail / Word-line Open

Damage range of

Word-line short

Program order

Retention issue on 2D/3D

• Both the 2D and 3D will have the data retention problem. • 1Znm MLC need 6~10 read-retry tables, But TLC need 40~45 tables with less

endurance and retention.

• 3D will have more severe Data retention issue.

16

[ref]: E.S. Choi, S.K. Park, “Device Considerations for High Density and

Highly reliable 3D NAND Flash Cell in Near Future”. IEDM 2012

DSP algorithm for the Vth-tracking

17

+1 readHard-

decoding

Correctable?+3 read

Soft-decoding1-sign, 1-soft

Correctable?Soft-decoding

+5 read1-sign, 2-soft

1st 2nd 3rd

1st 2nd 3rd

4th 5th

Correctable?Soft-decoding

+7 read1-sign, 2-soft

Correctable?

Smart prediction the right center sensing point

Keep in +7 readAdjust the decoding

parameter

1st 2nd 3rd

4th 5th

6th 7th

Read fail(go to next error recovery stage)

Get a correctable code word.Gather the tracking result

Yes

Yes

Yes

Yes

Yes

No

No

No

No

No Uncorrectable && ~max-loop

Over-hill detection will prevent the

Vth-tracking into wrong direction.

The parameter include:

1. Sensing step size.

2. Decoder internal coef.

3. Dynamic tuning the

LLR

4. ……

© Copyright Silicon Motion, Inc., 2012. All Rights Reserved.

The flash trend and ECC correction

2D 3D EDURANCE After cycling: Keep the same Error

distribution in LOW RBER

After cycling: Keep the same Error

distribution in LOW RBER

Data RETENTION The RBER become worse, the Vth

also shifting

Only the Vth-shifting, but RBER is

still good.

The 3D flash is good!! What are we waiting for?

COST, COST, COST!!!

HDD Trend: RS LDPC NB-LDPC

SSD Trend: HM RS BCH LDPC

Target RBER requirement

Normal

operation

(Base-line)

Extreme low power.

Keep the host throughput RBER = 3e-3

Reliability

extension

Vth-tracking to lowering the

RBER.

Soft-info to have stronger

correction

RBER = 1e-2 ~ 1.2e-2

WHY target RBER= 3e-3?

BCH 72bit/1KB will provide UBER< 1e-

15 with RBER = 3e-3

Need a ECC stronger then BCH

Provide the most-cost efficient way to

satisfy the reliability

ECC design loop related to NAND characteristics.

We already have 6th generation LDPC decoder.

Keep improving the LDPC performance.

For higher throughput ~8GB/sec, we may go back to step1.

After 28nm process, the design iteration depth will from code-construction to trial APR.

EX: Find the Routing congestion issue in step 4, it may need to solve from step1.

19

•Throughput requirement trade-off.

•Power consumption trade-off.

•Correction capability trade-off.

•Complexity trade-off.

•Data transfer efficiency(data-path overhead)

• APR utilization from different process node.

•Firmware enabling error recovery

•Encoder scheme. Configurable and power consumption.

•Decoder algorithm prove and evaluate with NAND model.

•FPGA emulation for LDPC decoding algorithm.

•NAND error behavior. Soft/hard.

•NAND Page size and reserved spare area.

•ECC chunk size

Code Construction

Encoding and decoding algorithm

Hardware architecture performance

trade-off

SOC integration and

Firmware control flow

Before the RAID protect flow…..

20

DRAM/

DRAM-less/

Small DRAM

SLC-first/

TLC-direct write/

Dynamic SLC

One-pass /

Multi-pass/

Pair-page

mapping

Capacity

(RAID overhead)

Binary/arbitrary

Program failure

DRAM-backup/

Flash-cache/

RAID recover/

WL open

Failure range

WL to WL short

Failure range

Recovery latency All the issues combine together!

BUT THE SAME CONCEPT IS…….

Don’t put eggs in the same basket.

Different flash will have different failure range.

Different application will have different RAID encoding flow.

Reserve the maximum flexibility for all controller to support all kinds of 2D/3D

NAND.

If you don’t want to use RAID, What alternative you still have? • Read-back check after program.

21

LSB

CSB

MSB

RAID-Group 1

LSB

CSB

MSB

RAID-Group 2

LSB

CSB

MSB

RAID-Group X

… …

Put the pages in the same failure range

into different RAID-Group

Block

WL 0

Pair-Page 0

Pair-Page 1

Pair-Page N

WL 1

Pair-Page 0

Pair-Page 1

Pair-Page N

L

M

U

L

M

U

L

M

U

L

M

U

L

M

U

WL M

Pair-Page 0

LMU

L

M

U

3D BLOCK

0

1

M

Damage range of Program fail / Word-line Open

Damage range of

Word-line short

Program order

asic/merchant chip-based flash controllers · si ngl e key-quati on wi th 4 stage pi pel i ne...

Documents