External Use · QorIQ T4240 Communications Processor Deep Dive · FTF-NET-F0031 · April 2014 · Sam Siu & Feras Hamdan




    Agenda

    • QorIQ T4240 Communications Processor Overview

    • e6500 Core Enhancement

    • Memory Subsystem and MMU Enhancement

    • QorIQ Power Management features

    • HiGig Interface

    • Interlaken Interface

    • PCI Express® Gen 3 Interfaces (SR-IOV)

    • Serial RapidIO® Manager (RMAN)

    • Data Path Acceleration Architecture Enhancements

    − mEMAC

    − Offline Ports and Use Case

    − Storage Profiling

    − Data Center Bridging (FMAN and QMAN)

    − Accelerators: SEC, DCE, PME

    • Debug


    QorIQ T4240 Communications Processor

[SoC block diagram: three clusters of four dual-threaded Power™ e6500 cores (32 KB I/D caches each) with a 2 MB banked L2 per cluster; three 64-bit DDR3/3L memory controllers; 3x 512 KB CoreNet platform cache; CoreNet™ coherency fabric with PAMUs (peripheral access management units); two Frame Managers (parse/classify/distribute; 2x 1/10G and 6x 1G each); Queue Manager and Buffer Manager; SEC 5.0, PME 2.0, DCE 1.0, RMan; Interlaken-LA; security fuse processor and security monitor; 2x USB 2.0 w/PHY, IFC, SD/MMC, 2x DUART, 2x I2C, SPI, GPIO; power management; two 16-lane 10 GHz SerDes blocks.]

Processor

• 12x e6500, 64-bit, up to 1.8 GHz

• Dual-threaded, with 128-bit AltiVec engine

• Arranged as 3 clusters of 4 CPUs, with 2 MB L2 per cluster; 256 KB per thread

Memory Subsystem

• 1.5 MB CoreNet platform cache w/ECC

• 3x DDR3 controllers up to 1.87 GHz

• Each with up to 1 TB addressability (40-bit physical addressing)

CoreNet Switch Fabric

High-Speed Serial IO

• 4 PCIe controllers, with Gen3

• SR-IOV support

• 2 sRIO controllers

• Type 9 and 11 messaging

• Interworking to DPAA via RMan

• 1 Interlaken Look-Aside at up to 10 GHz

• 2 SATA 2.0 3 Gb/s

• 2 USB 2.0 with PHY

Network IO

• 2 Frame Managers, each with:

• Up to 25 Gbps parse/classify/distribute

• 2x 10GE, 6x 1GE

• HiGig, Data Center Bridging support

• SGMII, QSGMII, XAUI, XFI

[Diagram detail: HiGig/DCB support on both FMans; debug blocks (watchpoint cross trigger, performance monitor, CoreNet trace, Aurora real-time debug); 2x SATA 2.0, 4x PCIe, 2x sRIO, 3x DMA.]

• Device

− TSMC 28 nm HPM process

− 1932-pin BGA package

− 42.5 x 42.5 mm, 1.0 mm pitch

• Power targets

− ~54 W thermal max at 1.8 GHz

− ~42 W thermal max at 1.5 GHz

• Data Path Acceleration

− SEC: crypto acceleration, 40 Gbps

− PME: regex pattern matcher, 10 Gbps

− DCE: data compression engine, 20 Gbps


    e6500 Core Enhancement


    e6500 Core Complex

High Performance

• 64-bit Power Architecture® technology

• Up to 1.8 GHz operation

• Two threads per core

• Dual load/store units, one per thread

• 40-bit real address

− 1 TB physical address space

• Hardware table walk

• L2 in cluster of 4 cores

− Supports sharing across the cluster

− Supports L2 memory allocation to core or thread

Energy-Efficient Power Management

− Drowsy: core, cluster, AltiVec engine

− Wait-on-reservation instruction

− Traditional modes

• AltiVec SIMD unit (128-bit)

− 8/16/32-bit signed/unsigned integer

− 32-bit floating point, 173 GFLOPS (1.8 GHz)

− 8/16/32-bit Boolean

• Improved productivity with core virtualization

− Hypervisor

− Logical-to-Real Address Translation (LRAT) mechanism for improved hypervisor performance

[Cluster diagram: four e6500 cores, each with two threads, 32 KB L1 caches, an AltiVec unit, and a PMC, sharing a 2 MB 16-way L2 cache in 4 banks; CoreNet interface with a 40-bit address bus, 256-bit read & write data buses, and a double-data processor port.]

CoreMark          P4080 (1.5 GHz)   T4240 (1.8 GHz)   Improvement from P4080
Single Thread     4708              7828              1.7x
Core (dual T)     4708              15,656            3.3x
SoC               37,654            187,873           5.0x
DMIPS/Watt (typ)  2.4               5.1               2.1x


    General Core Enhancements

• Improved branch prediction and additional link stack entries

• Pipeline improvements:

− LR, CTR, mfocrf optimization (LR and CTR are renamed)

− 16-entry rename/completion buffer

• New debug features:

− Ability to allocate individual debug events between the internal and external debuggers

− More IAC events

• Performance monitor:

− Many more events, six counters per thread

− Guest performance monitor interrupt

• Private vs. shared registers and other architected state:

− Shared between threads: there is only one copy of the register or architected state; a change in one thread affects the other thread if the other thread reads it

− Private to the thread (replicated per thread): there is one copy per thread of the register or architected state; a change in one thread does not affect the other thread if the other thread reads its private copy


CoreNet Enhancements in QorIQ T4240

• CoreNet Coherency Fabric

− 40-bit real address

− Higher address bandwidth and more active transactions: 1.2 Tbps read, 0.6 Tbps write

− 2x bandwidth increase for core, MMU, and peripherals

− Improved configuration architecture

• Platform Cache

− Increased write bandwidth (>600 Gbps)

− Increased buffering for improved throughput

− Improved data ownership tracking for performance enhancement

• Data Prefetch

− Tracks CPC misses

− Prefetches from multiple memory regions with configurable sizes

− Selective tracking based on requesting device, transaction type, and data/instruction access

− Conservative prefetch requests to avoid overloading the system with prefetches

− "Confidence"-based algorithm with a feedback mechanism

− Performance monitor events to evaluate the performance of prefetch in the system

[Chart: 0–100% on the y-axis over x = 0–24 for the "IP Mark" and "TCP Mark" benchmarks.]


    Cache and Memory Subsystem

    Enhancements


    Shared L2 Cache

• Clusters of cores share a 2 MB, 4-bank, 16-way set-associative shared L2 cache.

• In addition, there is also support for a 1.5 MB CoreNet platform cache.

• Advantages

− The L2 cache is shared among 4 cores, allowing lines to be allocated among the 4 cores as required; some cores will need more lines and some fewer, depending on workloads

− Faster sharing among cores in the cluster (sharing a line between cores in the cluster does not require the data to travel on CoreNet)

− Flexible partitioning of the L2 cache based on application cluster group

• Trade-offs

− Longer latency to DRAM and other parts of the system outside the cluster

− Longer latency to the L2 cache due to increased cache size and eLink overhead



    Memory Subsystem Enhancements

• The e6500 core has a larger store queue than the e5500 core

• Additional registers are provided for L2 cache partitioning controls, similar to how partitioning is done in the CPC

• Cache locking is supported; however, if a line cannot be locked, that status is not posted. Cache-lock query instructions are provided for determining whether a line is locked

• The load/store unit contains store-gather buffers to collect stores to cache lines before sending them over eLink to the L2 cache

• There are no more Line Fill Buffers (LFBs) associated with the L1 data cache

− These are replaced with Load Miss Queue (LMQ) entries for each thread

− They function in a manner very similar to LFBs

• Note there are still LFBs for the L1 instruction cache
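The way-partitioning idea behind those L2 controls can be sketched abstractly. This is a minimal model, assuming partitions are expressed as bitmasks over the 16 ways; the function name and encoding are illustrative, not the actual partitioning register layout:

```python
# Illustrative model of way-based L2 partitioning (names and mask encoding
# are assumptions, not the actual L2 partitioning register format).
def way_mask(first_way: int, n_ways: int, total_ways: int = 16) -> int:
    """Bitmask selecting n_ways consecutive ways for one partition."""
    assert 0 <= first_way and first_way + n_ways <= total_ways
    return ((1 << n_ways) - 1) << first_way

# Example: give cores 0-1 ten ways and cores 2-3 six ways of a 16-way cache.
mask_a = way_mask(0, 10)
mask_b = way_mask(10, 6)
assert mask_a & mask_b == 0              # partitions do not overlap
assert mask_a | mask_b == (1 << 16) - 1  # together they cover all 16 ways
```

The same mask arithmetic is how one would check that a proposed partitioning neither overlaps nor leaves ways unassigned.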


    MMU Enhancements


    MMU – TLB Enhancements

• The e6500 core implements MMU architecture version 2 (V2)

− MMU architecture V2 is denoted by bits in the MMUCFG register

• Translation look-aside buffer TLB1:

− Variable-size pages; supports power-of-two page sizes (previous cores used power-of-four page sizes)

− 4 KB to 1 TB page sizes

• Translation look-aside buffer TLB0 increased to 1024 entries

− 8-way associativity (from 512 entries, 4-way)

− Supports HES (hardware entry select) when written with tlbwe

• PID register increased to 14 bits (from 8 bits)

− The operating system can now have 16K simultaneous contexts

• Real address increased to 40 bits (from 36 bits)

• In general, it is backward compatible with MMU operations of the e5500 core, except:

− Some of the configuration registers have a different organization (TLBnCFG, for example)

− There are new config registers for TLB page size (TLBnPS) and LRAT page size (LRATPS)

− tlbwe can be executed by the guest supervisor (but can be turned off with an EPCR bit)

[Diagram: a 64-bit effective address (effective page number, 0–52 bits, plus a 12–32-bit byte address), qualified by the 14-bit LPID, GS/AS (MSR GS: 0 = hypervisor, 1 = guest), and the 14-bit PID, translates to a 40-bit real address (real page number, 0–28 bits, plus a 12–40-bit byte address).]


    MMU – Virtualization Enhancements (LRAT)

• The e6500 core contains an LRAT (logical-to-real address translation)

− The LRAT converts logical addresses (addresses the guest operating system believes are real) into true real addresses

− Translation occurs when the guest executes tlbwe and tries to write TLB0, or during a hardware tablewalk for a guest translation

− Does not require the hypervisor to intervene unless the LRAT incurs a miss (the hypervisor writes entries into the LRAT)

− 8-entry, fully associative, supporting variable page sizes from 4 KB to 1 TB (in powers of two)

• Prior to the LRAT, the hypervisor had to intervene each time the guest tried to write a TLB entry

[Diagram: an application instruction takes an MMU page fault; the guest OS translates VA → guest RA and writes the TLB; the guest RA → RA step, formerly a trap to the hypervisor that wrote the TLB, is implemented in hardware with the LRAT.]


    QorIQ Power Management

    Features


Dynamic T4 Family Energy/Power Total Cost of Ownership

[Chart: a cyclical valued workload cycling full → mid → light → standby → light-to-mid → full load, compared under today's always-on energy strategy vs. T4 advanced power management; the energy savings come from core/cascaded/cluster drowsy, dual-cluster drowsy + Tj, dynamic clock gating, and SoC sleep.]


Cascaded Power Management

• Today: all CPUs in the pool channel dequeue until all FQs are empty; a notification is broadcast to every CPU when work arrives.

• DPAA uses task-queue thresholds to inform CPUs they are not needed; CPUs are selectively awakened as needed.

[Diagram: a QMan task queue (tasks T1–T5) with thresholds 1 and 2 feeding two clusters (cores C0–C3 with a shared L2 each); cores beyond the thresholds stay drowsy.]

[Chart: number of active CPUs (1–12) tracking load across day, night, and burst periods, trading power for performance.]

• CPUs run software that drops into a polling loop when the DPAA is not sending them work.

• The polling loop should include a wait-with-drowsy instruction that puts the core into the drowsy state.
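The threshold mechanism can be sketched as a small policy function. This is an assumption-laden model ("one more CPU wakes per threshold crossed"); the real QMan dequeue policy and threshold values are configuration-dependent:

```python
# Toy model of cascaded power management: the deeper the task queue, the
# more CPUs are kept awake. Policy and thresholds are illustrative only.
def active_cpus(queue_depth: int, thresholds, max_cpus: int = 12) -> int:
    """One CPU always polls; one more wakes per threshold crossed."""
    woken = sum(1 for t in sorted(thresholds) if queue_depth >= t)
    return min(1 + woken, max_cpus)

assert active_cpus(0, [8, 32]) == 1    # night-time load: one CPU polls
assert active_cpus(10, [8, 32]) == 2   # threshold 1 crossed: wake a CPU
assert active_cpus(100, [8, 32]) == 3  # threshold 2 crossed: wake another
```

CPUs not counted as active would sit in the wait-with-drowsy loop described above until QMan's thresholds call them back.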


    e6500 Core Intelligent Power Management

State          | Run    | Doze      | Nap            | Global clk stop (NEW) | Nap, pwr gated (NEW) | Core glb clk stop (NEW)
Cluster state  | PCL00  | PCL00     | PCL00          | PCL00                 | PCL00                | PCL10
Core state     | PH00   | PH10/PW10 | PH15           | PW20                  | PH20                 | PH20
Cluster clock  | On     | On        | On             | On                    | On                   | Off
Core clock     | On     | On        | Off            | Off                   | Off                  | Off
L2 cache       |        |           |                |                       |                      | SW flushed
L1 cache       |        |           | SW invalidated | HW invalidated        | SW invalidated       | SW invalidated
Wakeup time    | Active | Immediate | < 30 ns        | < 200 ns              | < 600 ns             | < 1 us

Power-management features:

• Core: Run, Doze, Nap, Wait; AltiVec drowsy (auto- and SW-controlled, state maintained); core drowsy (auto- and SW-controlled, state maintained); dynamic clock gating

• Cluster: Run, Nap; dynamic frequency scaling (DFS) of the cluster (cores and L2); drowsy cluster (cores); dynamic clock gating

• SoC sleep with state retention

• SoC sleep with reset

• Cascade power management

• Energy Efficient Ethernet (EEE)


    HiGig Interface Support


HiGig™/HiGig+™/HiGig2™ Interface Support

• The 10-Gigabit HiGig™/HiGig+™/HiGig2™ MAC interfaces interconnect standard Ethernet devices to switch HiGig ports.

• Networking customers can add features like quality of service (QoS), port trunking, mirroring across devices, and link aggregation at the MAC layer.

• The physical signaling across the interface is XAUI: four differential pairs for receive and transmit (SerDes), each operating at 3.125 Gbit/s. HiGig+ is a higher-rate version of HiGig.

[Frame formats:

Regular Ethernet frames (bytes 1–32): Preamble | MAC_DA | MAC_SA | Type | Packet Data | FCS

Ethernet frames with HiGig+ header (bytes 1–34): Preamble | HiGig+ Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*

Ethernet frames with HiGig2 header (bytes 1–38): Preamble | HiGig2 Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*]


    QorIQ T4240 Processor HiGig Interface

• The T4240 FMan supports the HiGig/HiGig+/HiGig2 protocols

• In the T4240 processor, the 10G mEMACs can be configured as HiGig interfaces. In this configuration, two of the 1G mEMACs are used as the HiGig message interface


    SERDES Configuration for HiGig Interface

• Networking protocols (SerDes 1 and SerDes 2)

• HiGig notation: HiGig[2]m.n means HiGig[2] (4 lanes @ 3.125 or 3.75 Gbps)

− "m" indicates which Frame Manager (FM1 or FM2)

− "n" indicates which MAC on the Frame Manager

− E.g., "HiGig[2]1.10" indicates HiGig[2] using FM1's MAC 10

• When a SerDes protocol is selected with dual HiGigs in one SerDes, both HiGigs must be configured with the same protocol (for example, both with 12-byte headers or both with 16-byte headers)
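The m.n notation above can be decoded mechanically. The parser below just follows the slide's convention (the literal "[2]" marks HiGig2); it is a convenience sketch, not part of any Freescale API:

```python
import re

def parse_higig(name: str):
    """Parse the slide notation 'HiGig[2]m.n' -> (variant, fman, mac).
    The optional literal '[2]' marks HiGig2; m is the Frame Manager
    (FM1 or FM2) and n is the MAC number on that Frame Manager."""
    m = re.fullmatch(r"HiGig(\[2\])?(\d)\.(\d+)", name)
    if not m:
        raise ValueError(f"not HiGig[2]m.n notation: {name!r}")
    variant = 2 if m.group(1) else 1
    return variant, int(m.group(2)), int(m.group(3))

assert parse_higig("HiGig[2]1.10") == (2, 1, 10)  # HiGig2 on FM1, MAC 10
assert parse_higig("HiGig2.9") == (1, 2, 9)       # HiGig/HiGig+ on FM2, MAC 9
```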


    HiGig/HiGig2 Control and Configuration

Name Description

LLM_MODE Toggle between HiGig2 link-level messages on the physical link or HiGig2 link-level messages on the logical link (SAFC)

LLM_IGNORE Ignore HiGig2 link-level message quanta

LLM_FWD Terminate/forward received HiGig2 link-level messages

IMG[0:7] Inter-Message Gap; spacing between HiGig2 messages

NOPRMP Toggle preemptive transmission of HiGig2 messages

MCRC_FWD Strip/forward the HiGig2 message CRC of received messages

FER Discard/forward HiGig2 receive messages with CRC errors

FIMT Forward or discard messages with an illegal MSG_TYP

IGNIMG Ignore IMG on the receive path

TCM TC (traffic classes) mapping

[Bit layout (1–32): LLM, LLI, LLF, IMG, NPPR, MCRC, FER, FIMT, IGNIM, TCM]

HiGig/HiGig2 Control and Configuration Register (HG_CONFIG)


    Interlaken Interface


    Interlaken Look-Aside Interface

• Use case: the T4240 processor as a data-path processor requiring millions of look-ups per second, an expected requirement in edge routers.

• Interlaken Look-Aside is a new high-speed serial standard for connecting TCAMs ("network search engines", "knowledge-based processors") to host CPUs and NPUs. It replaces the Quad Data Rate (QDR) SRAM interface.

• Like Interlaken streaming interfaces (a channelized SerDes link replacing SPI-4.2), Interlaken Look-Aside supports a configurable number of SerDes lanes (1–32, at single-lane granularity) with linearly increasing bandwidth. Freescale supports x4 and x8, up to 10 GHz.

• For lowest latency, each vCPU (thread) in the T4240 processor has a portal into the Interlaken controller, allowing multiple search requests and results to be outstanding concurrently.

• Interlaken Look-Aside is expected to gain traction as the interface to other low-latency/minimal-data-exchange co-processors, such as traffic managers. PCIe and sRIO are better suited to higher-latency/high-bandwidth applications.

• Lane striping

[Diagram: T4240 connected to a TCAM over a 4-lane (4x 10G) Interlaken Look-Aside link.]


T4240 Look-Aside Controller (LAC) Features:

• Supports the Interlaken Look-Aside protocol definition, rev. 1.1

• Supports 24 partitioned software portals

• Supports in-band per-channel flow control options, with simple xon/xoff semantics

• Supports a wide range of SerDes speeds (6.25 and 10.3125 Gbps)

• Ability to disable the connection to individual SerDes lanes

• A continuous metaframe of programmable frequency to guarantee lane alignment, synchronize the scrambler, perform clock compensation, and indicate lane health

• 64B/67B data encoding and scrambling

• Programmable BURSTSHORT parameter of 8 or 16 bytes

• Error detection for illegal burst sizes, bad 64/67 word types, and CRC-24 errors

• Error detection on transmit command programming errors

• Built-in statistics counters and error counters

• Dynamic power-down of each software portal
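The 64B/67B encoding bullet implies a fixed framing overhead, so a back-of-the-envelope payload rate for a 4-lane, 10.3125 Gbps configuration is easy to work out (this ignores metaframe, BURSTSHORT, and scrambler-state overheads, so it is an upper bound):

```python
# Rough Interlaken-LA payload bandwidth: raw lane rate reduced by the
# 64B/67B encoding overhead (64 payload bits per 67 wire bits, ~4.5%).
lanes, lane_gbps = 4, 10.3125
raw = lanes * lane_gbps        # 41.25 Gbps on the wire
payload = raw * 64 / 67        # ~39.4 Gbps best-case payload
assert abs(payload - 39.4) < 0.01
```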


    Look-Aside Controller Block Diagram


    Modes of Operation

• The T4240 LA controller can be in either stashing or non-stashing mode.

• The LAC programming model is big-endian, meaning byte 0 is the most significant byte.

• In non-stashing mode, software has to issue dcbf each time it reads SWPnRSR and the RDY bit is not set.
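The non-stashing rule exists because the core may keep serving a stale cached copy of SWPnRSR. The toy model below (plain Python, not real device-access code; names and the RDY bit value are assumptions) shows why the flush-then-reread sequence is needed:

```python
# Toy model of non-stashing polling: before re-reading SWPnRSR, software
# flushes its cached copy (dcbf) so the next read sees the device again.
RDY = 0x1

class Portal:
    def __init__(self):
        self.rsr = 0         # device-side status register
        self._cached = None  # core's cached copy of the register

    def read_rsr(self):
        if self._cached is None:   # cache miss: fetch from the device
            self._cached = self.rsr
        return self._cached

    def dcbf(self):
        self._cached = None        # flush: discard the stale copy

p = Portal()
assert p.read_rsr() & RDY == 0   # result not ready yet
p.rsr = RDY                      # device completes the lookup
assert p.read_rsr() & RDY == 0   # stale cached copy is still seen...
p.dcbf()                         # ...until software flushes it
assert p.read_rsr() & RDY == RDY
```

In stashing mode the hardware pushes the updated status toward the core itself, which is why the explicit flush is only required in non-stashing mode.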


    Interlaken LA Controller Configuration Registers

• 4 KB hypervisor space: 0x0000–0x0FFF

• 4 KB managing-core space: 0x1000–0x1FFF

• In compliance with the trust architecture, LSRER, LBARE, LBAR, and LLIODNRn are accessed exclusively in hypervisor mode and are reserved in managing-core mode.

• Statistics, lane mapping, interrupt, rate, metaframe, burst, FIFO, calendar, debug, pattern, error, and capture registers

• LAC software portal memory, n = 0, 1, 2, 3, …, 23:

− SWPnTCR/SWPnRCR: software portal n transmit/receive command register

− SWPnTER/SWPnRER: software portal n transmit/receive error register

− SWPnTDR0–3/SWPnRDR0–3: software portal n transmit/receive data registers 0–3

− SWPnRSR: software portal n receive status register


    TCAM Usage in Routing Example


    Interlaken Look-Aside TCAM Board

[Board diagram: Renesas 5 Mb Interlaken-LA TCAM with an I2C EEPROM; 4x IL-LA SerDes lanes; 156.25 MHz REFCLK and 125 MHz SYSCLK; SMBus plus reset/JTAG; 3.3 V/12 V config; supplies VDDC 0.85 V @ 6 A, VDDA 0.85 V @ 2 A, and VCC_1.8V @ 2 A filtered (via 0-ohm links) into VDDHA 1.80 V 0.5 A, VDDO 1.80 V 1.0 A, and VPLL 1.80 V 0.25 A.]


    PCI Express® Gen 3 Interfaces


    PCI Express® Gen 3 Interfaces

• Two PCIe Gen 3 controllers can run at the same time with the same SerDes reference clock source

• PCIe Gen 3 bit rates are supported

− When running more than one PCIe controller at Gen 3 rates, the associated SerDes reference clocks must be driven by the same source on the board

16-lane SerDes PCIe configurations (across PCIe1–PCIe4):

− x4 Gen3, x4 Gen2, x8 Gen2

− x8 Gen2, x8 Gen2

− x4 Gen2, x4 Gen2, x4 Gen3, x4 Gen2

[Diagram: four PCIe controllers on the OCN over 51G links: PCIe1 x4 Gen2/3 RC/EP acting as an SR-IOV EP (2 PF / 64 VF, 8x MSI-X per VF/PF), PCIe2 x4 Gen2/3 RC/EP, PCIe3 x8 Gen2 or x4 Gen3, PCIe4 x8 Gen2 or x4 Gen3 RC/EP; 16 lanes total.]


    Single Root I/O Virtualization (SR-IOV) End Point

• With SR-IOV supported in the EP, different devices or different software tasks can share IO resources, such as Gigabit Ethernet controllers.

− The T4240 supports the SR-IOV 1.1 spec with 2 PFs and 64 VFs per PF

− SR-IOV supports native IOV in existing single-root-complex PCI Express topologies

− Address Translation Services (ATS) supports native IOV across PCI Express via address translation

− A single management physical or virtual machine on the host handles end-point configuration

• E.g., the T4240 processor as a converged network adapter: each virtual machine running on the host thinks it has a private version of the services card

[Diagram: host VMs 1–N sharing the T4240 end point through a translation agent; the T4240 features a single controller (up to x4 Gen 3) with 1 PF and 64 VFs.]


    PCI Express Configuration Address Register

• The PCI Express configuration address register contains address information for accesses to PCI Express internal and external configuration registers for an End Point (EP) with SR-IOV

[Bit layout (1–32): EN, Type, EXTREGN, VFN, PFN, REGN]

PCI Express Address Offset Register

Name Description

EN Enable; allows a PCI Express configuration access when PEX_CONFIG_DATA is accessed

TYPE 01: configuration register accesses to PF registers for an EP with SR-IOV; 11: configuration register accesses to VF registers for an EP with SR-IOV

EXTREGN Extended register number; allows access to the extended PCI Express configuration space

VFN Virtual Function number minus 1; 64–255 is reserved

PFN Physical Function number minus 1; 2–15 is reserved

REGN Register number; the 32-bit register to access within the specified device
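Packing those fields into one 32-bit word is plain bit arithmetic. The shifts and widths below are illustrative assumptions, not the documented PEX_CONFIG_ADDR layout (consult the reference manual for the real bit positions); the sketch only shows the pack/unpack pattern:

```python
# Illustrative field packing for a PCIe config-address-style register.
# Shift/width values are assumptions chosen so the fields don't overlap.
FIELDS = {              # name: (shift, width)
    "EN":      (31, 1),
    "TYPE":    (29, 2),
    "EXTREGN": (20, 4),
    "VFN":     (12, 8),
    "PFN":     (8, 4),
    "REGN":    (0, 8),
}

def pack(values: dict) -> int:
    word = 0
    for name, (shift, width) in FIELDS.items():
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word |= v << shift
    return word

def unpack(word: int) -> dict:
    return {n: (word >> s) & ((1 << w) - 1) for n, (s, w) in FIELDS.items()}

# VF access (TYPE=11) to register 0x10 of VF number 6 (VFN = number - 1).
cfg = {"EN": 1, "TYPE": 0b11, "VFN": 5, "PFN": 0, "REGN": 0x10}
assert unpack(pack(cfg))["VFN"] == 5
assert unpack(pack(cfg))["TYPE"] == 0b11
```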


    Message Signaled Interrupts (MSI-X) Support

• MSI-X allows the EP device to send message interrupts to the RC device independently for different physical or virtual functions, as supported by EP SR-IOV.

• Each PF or VF has eight MSI-X vectors allocated, with a total of 256 MSI-X vectors supported

− Supports MSI-X for PF/VF with 8 MSI-X vectors per PF or VF

− Supports MSI-X trap operation

− To access an MSI-X PBA structure, the PF, VF, IDX, and EIDX are concatenated to form the 4-byte-aligned address of the register within the MSI-X PBA structure. That is, the register address is:

PF || VF || IDX || EIDX || 0b00

[Bit layout (1–32): Type, PF, VF, IDX, EIDX, M]

PCI Express Address Offset Register

Name Description

TYPE Access to the PF or VF MSI-X vector table for an EP with SR-IOV

PF Physical Function

VF Virtual Function

IDX MSI-X entry index in each VF

EIDX Extended index; selects which 4-byte entity within the MSI-X PBA structure to access

M Mode = 11
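The concatenation rule above can be sketched directly. The field widths used here are assumptions inferred from the slide's counts (64 VFs → 6 bits, 8 vectors → 3 bits, a 2-bit EIDX), not documented values:

```python
# Sketch of the 'PF || VF || IDX || EIDX || 0b00' address concatenation.
# Field widths are assumptions, not the documented register layout.
def pba_reg_addr(pf, vf, idx, eidx, vf_bits=6, idx_bits=3, eidx_bits=2):
    """Concatenate the fields MSB-first, then append the trailing 0b00,
    which makes every resulting byte address 4-byte aligned."""
    addr = pf
    for val, bits in ((vf, vf_bits), (idx, idx_bits), (eidx, eidx_bits)):
        assert 0 <= val < (1 << bits)
        addr = (addr << bits) | val
    return addr << 2   # '|| 0b00'

assert pba_reg_addr(0, 0, 0, 0) == 0
assert pba_reg_addr(1, 5, 3, 0) % 4 == 0   # always 4-byte aligned
```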


    Serial RapidIO® Manager (RMAN)


    RapidIO Message Manager (RMan)

• RMan supports both inline switching and look-aside forwarding operation.

[Diagram: RapidIO inbound traffic passes through RMan's classification units (inbound rule matching), reassembly contexts, and reassembly units into QMan work queues (WQ0–7 on HW and pool channels) serving e6500 cores, the Frame Manager (1GE/10GE), SEC, and PME via DCP and SW portals; outbound traffic passes through disassembly contexts and segmentation units back to RapidIO. RapidIO PDU: Ftype | Target ID | Src ID | Address | Packet Data Unit | CRC.]


    RMan: Greater Performance and Functionality

    • Many queues allow multiple inbound/outbound queues per core

    − Hardware queue management via QorIQ Data Path Architecture (DPAA)

    • Supports all messaging-style transaction types

    − Type 11 Messaging

    − Type 10 Doorbells

    − Type 9 Data Streaming

    • Enables low overhead direct core-to-core communication

[Diagram: two QorIQ or DSP devices, four cores each, linked by 10G sRIO; Type 9 user PDUs provide a channelized CPU-to-CPU transport, and MSG user PDUs provide a device-to-device transport.]


    Data Path Acceleration

    Architecture (DPAA)


Data Path Acceleration Architecture (DPAA) Philosophy

• DPAA is designed to balance the performance of multiple CPUs and accelerators with seamless integration

− ANY packet to ANY core to ANY accelerator or network interface, efficiently, WITHOUT locks or semaphores

• "Infrastructure" components

− Queue Manager (QMan)

− Buffer Manager (BMan)

• "Accelerator" components

− Cores

− Frame Manager (FMan)

− RapidIO Message Manager (RMan)

− Cryptographic accelerator (SEC)

− Pattern matching engine (PME)

− Decompression/Compression Engine (DCE)

− DCB (Data Center Bridging)

− RAID Engine (RE)

• CoreNet

− Provides the interconnect between the cores and the DPAA infrastructure, as well as access to memory

[Diagram: DPAA on P series (e500mc cores, SEC 4.x, PME 2, RMan, RE) vs. T series (e6500 cores, plus DCE and DCB): cores on the CoreNet™ coherency fabric with the Buffer Manager and Queue Manager, and Frame Managers with parse/classify/distribute (PCD) and 1G/10G MACs.]


DPAA Building Block: Frame Descriptor (FD)

[Diagram: a simple frame FD (format 000; fields D, PID, BPID, Address, Offset, Length, Status/Cmd) points directly at a single data buffer. A multi-buffer (scatter/gather) FD (format 100) points at an S/G list; each S/G entry carries Address, Length, BPID, and Offset, intermediate entries are marked 00 (with offset 0 after the first) and the final entry 01, and the entries' buffers together form the packet.]

    0 1 2 3 4 5 6 7 8 9 1

    0

    1

    1

    1

    2

    1

    3

    1

    4

    1

    5

    1

    6

    1

    7

    1

    8

    1

    9

    2

    0

    2

    1

    2

    2

    2

    3

    2

    4

    2

    5

    2

    6

    2

    7

    2

    8

    2

    9

    3

    0

    3

    1

    D

    D

    LIODN

    offset

    BPID ELIO

    DN

    offset

    - - - - addr

    addr (cont)

    Fmt Offset Length

    STATUS/CMD
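As a rough illustration of the Fmt/Offset/Length word (word 2 of the FD), the fields can be packed and unpacked with plain shifts. The field widths used here (3/9/20 bits) are assumed from public DPAA documentation, not stated on this slide — treat them as illustrative:

```python
# Sketch of FD word 2 (Fmt | Offset | Length), widths assumed 3/9/20 bits.
FMT_SIMPLE = 0b000  # simple (single-buffer) frame
FMT_SG = 0b100      # scatter/gather (multi-buffer) frame

def pack_fd_word2(fmt, offset, length):
    # Fmt occupies the 3 most-significant bits, then Offset, then Length
    assert fmt < (1 << 3) and offset < (1 << 9) and length < (1 << 20)
    return (fmt << 29) | (offset << 20) | length

def unpack_fd_word2(word):
    return (word >> 29) & 0x7, (word >> 20) & 0x1FF, word & 0xFFFFF

w = pack_fd_word2(FMT_SIMPLE, 64, 1500)
assert unpack_fd_word2(w) == (FMT_SIMPLE, 64, 1500)
```

The same shift-and-mask pattern applies to the other FD words (DD, LIODN offset, BPID, address).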

  • TM

    External Use 42

    Frame Descriptor Status/Command Word (FMAN Status)

    [Bit-field figure, bits 1–32, MSB to LSB: – – – DCL4C – – – DME MS – – – FPE FSE DIS – EOF NSS KSO – FCL IPP FLM PTE ISP PHE FRDR BLE L4CV – –]

    Name Description

    DCL4C L4 (IP/TCP/UDP) Checksum validation Enable/Disable

    DME DMA error

    MS MACSEC Frame. This bit is valid on P1023

    FPE Frame Physical Error

    FSE Frame Size Error

    DIS Discard. This bit is set only for frames that are supposed to be discarded, but are enqueued in an error queue for debug purposes.

    EOF Extract Out of Frame Error

    NSS No Scheme Selection for KeyGen

    KSO Key Size Overflow Error

    FCL Frame color as determined by the Policer. 00=green, 01=yellow, 10=red, 11=no reject

    IPP Illegal Policer Profile error

    FLM Frame Length Mismatch

    PTE Parser Time-out

    ISP Invalid Soft Parser instruction Error

    PHE Header Error

    FRDR Frame Drop

    BLE Block limit is exceeded

    L4CV L4 Checksum Validation

  • TM

    External Use 43

    DPAA: mEMAC Controller

  • TM

    External Use 44

    Multirate Ethernet MAC (mEMAC) Controller

    [Figure: mEMAC block diagram — Rx and Tx interfaces with Rx/Tx FIFOs and a reconciliation sublayer, 1588 time stamping, Tx/Rx control, flow control, configuration control and statistics, plus an MDIO master for PHY management, all behind the Frame Manager interface.]

    • The multirate Ethernet MAC (mEMAC) controller supports 100 Mbps/1G/2.5G/10G operation:

    − Supports HiGig/HiGig+/HiGig2 protocols

    − Dynamic configuration for NIC (Network Interface Card) applications or Switching/Bridging applications to support 10Gbps or below.

    − Designed to comply with IEEE Std 802.3®, IEEE 802.3u, IEEE 802.3x, IEEE 802.3z, IEEE 802.3ac, IEEE 802.3ab, IEEE 1588v2 (clock synchronization over Ethernet), IEEE 802.3az, and IEEE 802.1Qbb.

    − RMON statistics

    − CRC-32 generation and append on transmit or forwarding of user application provided FCS selectable on a per-frame basis.

    − 8 MAC address comparison on receive and one MAC address overwrite on transmit for NIC applications.

    − Selectable promiscuous frame receive mode and transparent MAC address forwarding on transmit

    − Multicast address filtering with 64-bin hash code lookup table on receive reducing processing load on higher layers

    − Support for VLAN tagged frames and double VLAN Tags (Stacked VLANs)

    − Dynamic inter packet gap (IPG) calculation for WAN applications
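A minimal sketch of the 64-bin multicast hash filter idea described above. The actual mEMAC hash function is device-specific; a CRC-32 over the destination MAC is used here purely for illustration:

```python
import binascii

def hash_bin(mac_bytes):
    # Illustrative: take 6 bits of a CRC-32 over the destination MAC
    # to select one of 64 bins. The real mEMAC polynomial and bit
    # selection may differ -- consult the reference manual.
    return binascii.crc32(mac_bytes) & 0x3F

class MulticastFilter:
    def __init__(self):
        self.bins = 0  # 64-bit bin-enable bitmap

    def add_group(self, mac):
        self.bins |= 1 << hash_bin(mac)

    def accept(self, mac):
        # A hash hit passes the frame up; higher layers must still do an
        # exact match, since multiple addresses can share a bin.
        return bool(self.bins & (1 << hash_bin(mac)))

f = MulticastFilter()
mcast = bytes.fromhex("01005e000001")
f.add_group(mcast)
assert f.accept(mcast)
```

The filter only reduces the load on software: bin collisions mean a hash hit is necessary but not sufficient for group membership.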

    [Figure: the separate 10GMAC and dTSEC blocks of the QorIQ P series are unified as the mEMAC on the QorIQ T4240.]

  • TM

    External Use 45

    DPAA: FMAN

  • TM

    External Use 46

    FMAN

    [Figure: FMan block — Parse, Classify, Distribute logic with muRAM, serving two 1/10G ports and six 1G ports.]

    FMan Enhancements

    • Storage Profile selection (up to 32 profiles per port) based on classification − Up to four buffer pools per Storage Profile

    • Customer Edge Egress Traffic Management (Egress Shaping)

    • Data Center Bridging − PFC and ETS

    • IEEE802.3az (Energy Efficient Ethernet)

    • IEEE802.3bf (Time sync)

    • IP Frag & Re-assembly Offload

    • HiGig, HiGig2

    • TX confirmation/error queue enhancements − Ability to configure separate FQIDs for normal confirmations vs errors

    − Separate FD status for Overflow and physical error

    • Option to disable S/G on ingress

  • TM

    External Use 47

    Offline Ports

  • TM

    External Use 48

    FMAN Ports Types

    • Ethernet receive (Rx) and transmit (Tx) − 1 Gbps/2.5 Gbps/10 Gbps

    − In FMan_v3, some ports can be configured as HiGig

    − Jumbo frames of up to 9.6 KB (add uboot bootargs "fsl_fm_max_frm=9600" )

    • Offline (O/H) − FMan_v3: 3.75 Mpps (vs 1.5 Mpps on the P series)

    − Supports the Parse Classify Distribute (PCD) function on frame descriptors (FDs) dequeued from the QMan

    − Supports frame copy or move from one storage profile to another

    − Able to dequeue and enqueue from/to a QMan queue. The FMan applies a Parse Classify Distribute (PCD) flow and (if configured to do so) enqueues the frame back to a QMan queue. In FMan_v3, the FMan can also copy the frame into new buffers before enqueueing it back to the QMan.

    − Use case: IP fragmentation and reassembly

    • Host command − Able to dequeue host commands from a QMan queue. The FMan executes the host command (such as a table update) and enqueues a response to the QMan. Host commands require a dedicated PortID (one of the O/H ports)

    − The registers for Offline and Host commands are named O/H port registers

  • TM

    External Use 49

    IP Reassembly T4240 Processor Flow

    [Flow diagram: the BMI allocates a buffer and writes the frame and internal context (IC); the Parser parses the frame and identifies fragments; KeyGen calculates a hash. Non-fragments go through FMan Controller coarse classification and are enqueued. Fragments* are linked by the FMan Controller to the right reassembly context (starting reassembly for a first fragment); a non-completed reassembly terminates in the BMI, while a completed reassembly goes back through KeyGen hash calculation and coarse classification, the BMI writes the IC, and the reassembled frame is enqueued.

    Regular frame: Storage Profile is chosen according to frame header classification. Reassembled frame: Storage Profile is chosen according to MAC and IP header classification only. *Buffer allocation is done according to the fragment header only.]

  • TM

    External Use 50

    IP Reassembly FMAN Memory Usage

    • FMAN Memory: 386 KBytes

    • Assumption: MTU = 1500 Bytes

    • Port FMAN Memory consumption:

    − Each 10G Port = 40 Kbytes

    − Each 1G Port = 25 Kbytes

    − Each Offline Port = 10 Kbytes

    • Coarse Classification tables memory consumption:

    − 100 Kbytes for all ports

    • IP Reassembly:

    − IP Reassembly overhead: 8 Kbytes

    − Each flow: 10 Bytes

    • Example:

    − Usecase with: 2x10G ports + 2x1G port + 1xOffline Ports.

    − Port configuration: 2x40 + 2x25 + 10 = 140 Kbytes

    − Coarse Classification : 100 Kbytes

    − IP reassembly 10K flows: 10K x 10B + 8KB = 108 Kbytes

    − Total = 140KB + 108KB + 100KB = 348 KBytes
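The budget above can be reproduced with a few lines; all constants are taken straight from this slide (the slide treats 10K flows × 10 B as 100 KB, i.e. K = 1024):

```python
# FMan memory budget, per-port costs in KB from the slide
PORT_KB = {"10g": 40, "1g": 25, "offline": 10}

def fman_budget(n10g, n1g, noffline, flows, coarse_kb=100,
                reassembly_overhead_kb=8, bytes_per_flow=10):
    ports = (n10g * PORT_KB["10g"] + n1g * PORT_KB["1g"]
             + noffline * PORT_KB["offline"])
    reassembly = reassembly_overhead_kb + (flows * bytes_per_flow) // 1024
    return ports + coarse_kb + reassembly

# 2x10G + 2x1G + 1 offline port, 10K reassembly flows
assert fman_budget(2, 2, 1, 10 * 1024) == 140 + 100 + 108  # 348 KB
```

Keeping the total under the FMan memory size is the designer's responsibility when mixing ports, classification tables, and reassembly flows.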

  • TM

    External Use 51

    Storage Profile

  • TM

    External Use 52

    Virtual Storage Profiling For Rx and Offline Ports

    • Storage profiles enable each partition and virtual interface to have dedicated buffer pools.

    • Storage profile selection after distribution function evaluation or after custom classifier

    • The same Storage Profile ID (SPID) value from classification on different physical ports may yield different storage profile selections.

    • Up to 64 storage profiles per port are supported. − 32 storage profiles for FMan_v3L

    • Storage profile contains

    − LIODN offset

    − Up to four buffer pools per Storage Profile

    − Buffer Start margin/End margin configuration

    − S/G disable

    − Flow control configuration

  • TM

    External Use 53

    Data Center Bridging

  • TM

    External Use 54

    Policing and Shaping

    • Policing puts a cap on network usage and guarantees bandwidth

    • Shaping smoothes out the egress traffic

    − May require extra memory to store the shaped traffic.

    • DCB can be used in:

    − Between data center network nodes

    − LAN/network traffic

    − Storage Area Network (SAN)

    − IPC traffic (e.g. Infiniband (low latency))


  • TM

    External Use 55

    Support Priority-based Flow Control (802.1Qbb)

    • Enables lossless behavior for each class of service

    • PAUSE sent per virtual lane when buffers limit exceeded

    − FQ congestion groups state (on/off) from QMan

    Priority vector (8 bits) is assigned to each FQ congestion group

    FQ congestion group(s) are assigned to each port

    Upon receipt of a congestion group state “on” message, for each Rx port associated with this congestion group, a PFC Pause frame is transmitted with priority level(s) configured for that group

    − Buffer pool depletion

    Priority level configured on per port (shared by all buffer pools used on that port)

    − Near FMan Rx FIFO full

    There is a single Rx FIFO per port for all priorities, the PFC Pause frame is sent on all priorities

    • PFC Pause frame reception

    − QMan provides the ability to flow control 8 different traffic classes; in CEETM each of the 16 class queues within a class queue channel can be mapped to one of the 8 traffic classes & this mapping applies to all channels assigned to the link

    [Figure: PFC between transmit queues and receive buffers over an Ethernet link — eight virtual lanes (priorities zero through seven). When the receive buffer for lane three fills, a PAUSE for that priority stops only lane three's transmit queue; the other lanes keep flowing.]

  • TM

    External Use 56

    Support Bandwidth Management 802.1Qaz

    [Figure: 802.1Qaz example on a 10GE link over intervals t1–t3, comparing offered traffic with realized utilization — HPC, storage, and LAN traffic classes share the link, and bandwidth left unused by one class (e.g. HPC dropping from 3G/s to 2G/s) is reallocated to classes with demand (LAN growing from 3G/s toward 6G/s).]

    • Supports 32 channels available for allocation across a single FMan

    − e.g. for two 10G links, could allocate 16 channels (virtual links) per link

    − Supports weighted bandwidth fairness amongst channels

    − Shaping is supported on a per-channel basis

    • Hierarchical port scheduling defines the class-of-service (CoS) properties of output queues, mapped to IEEE 802.1p priorities

    • QMan CEETM enables Enhanced Transmission Selection (ETS, 802.1Qaz) with intelligent sharing of bandwidth between traffic classes − Strict priority scheduling of the 8 independent classes; weighted bandwidth fairness within the 8 grouped classes

    − Priority of the class group can be independently configured to be immediately below any of the independent classes

    • Meets performance requirement for ETS: bandwidth granularity of 1% and +/-10% accuracy

  • TM

    External Use 57

    QMAN CEETM

  • TM

    External Use 58

    CEETM Scheduling Hierarchy (QMAN 1.2)

    • Logics

    − Green denotes logic units and signal paths that relate to the request and fulfillment of Committed Rate (CR) packet transmission opportunities

    − Yellow denotes the same for Excess Rate (ER)

    − Black denotes logic units and signal paths that are used for unshaped opportunities or that operate consistently whether used for CR or ER opportunities

    • Scheduler

    − Channel Scheduler: channels are selected to send frame from Class Queues

    − Class scheduler: frames are selected from Class Queues; Class 0 has the highest priority

    • Algorithm

    − Strict Priority (SP)

    − Weighted Scheduling

    − Shaped Aware Fair Scheduling (SAFS)

    − Weighted Bandwidth Fair Scheduling (WBFS)

    [Figure: CEETM scheduling hierarchy — class queues CQ0–CQ7 (independent, strict priority) and CQ8–CQ15 (grouped, WBFS) feed per-channel class schedulers, e.g. an unshaped channel with 8 independent + 8 grouped classes, and shaped channels with 3 independent + 7 grouped or 2 independent + 8 grouped classes. Shaped channels pass through token bucket shapers for Committed Rate and Excess Rate (shape-aware fair scheduling) before the channel scheduler for the LNI selects toward the network interface.]
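A minimal sketch of the dual token-bucket idea behind the Committed Rate and Excess Rate shapers. The rates, bucket limits, and refill granularity here are illustrative values, not CEETM register semantics:

```python
class TokenBucket:
    def __init__(self, rate_bps, limit_bytes):
        self.rate = rate_bps / 8.0   # refill rate in bytes per second
        self.limit = limit_bytes     # bucket depth (burst size)
        self.tokens = limit_bytes

    def refill(self, dt_seconds):
        self.tokens = min(self.limit, self.tokens + self.rate * dt_seconds)

    def try_send(self, nbytes):
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

# A shaped channel gets a CR bucket and an ER bucket: a frame is charged
# against committed-rate tokens first, then against excess-rate tokens.
cr = TokenBucket(rate_bps=1_000_000, limit_bytes=4000)
er = TokenBucket(rate_bps=500_000, limit_bytes=4000)

def send(nbytes):
    if cr.try_send(nbytes):
        return "CR"
    if er.try_send(nbytes):
        return "ER"
    return None  # hold the frame until a bucket refills

assert send(1500) == "CR"
```

Frames admitted on CR tokens correspond to the green (committed) opportunities in the figure; ER tokens fund the yellow (excess) opportunities.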

  • TM

    External Use 59

    Weighted Bandwidth Fair Scheduling (WBFS)

    • Weighted Bandwidth Fair Scheduling (WBFS) is used to schedule packets from queues within a priority group such that each gets a “fair” amount of bandwidth made available to that priority group

    • The premise of fairness for the algorithm is: − available bandwidth is divided and offered equally to all classes

    − offered bandwidth in excess of a class’s demand is to be re-offered equally to classes with unmet demand

                             Initial       First          Second        Total BW
                             Distribution  Redistribution Redistribution Attained
    BW available             10G           1.5G           .2G            0G
    Classes with unmet demand 5            3              2
    BW offered to each class 2G            .5G            .1G

              Demand  Offered&  Unmet   Offered&  Unmet   Offered&   Total
                      Retained  Demand  Retained  Demand  Retained
    Class 0   .5G     .5G       0                                    .5G
    Class 1   2G      2G        0                                    2G
    Class 2   2.3G    2G        .3G     .3G       0                  2.3G
    Class 3   3G      2G        1G      .5G       .5G     .1G        2.6G
    Class 4   4G      2G        2G      .5G       1.5G    .1G        2.6G
    Total     11.8G   8.5G              1.3G              .2G        10G
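The redistribution example above can be reproduced with a short loop. This is a sketch of the WBFS premise (equal offer, re-offer of excess), not the hardware algorithm:

```python
def wbfs(demands, available, eps=1e-9):
    """Offer bandwidth equally to all classes; re-offer each class's
    excess to classes with unmet demand until demand is met or the
    bandwidth is exhausted."""
    attained = [0.0] * len(demands)
    while available > eps:
        unmet = [i for i, d in enumerate(demands) if d - attained[i] > eps]
        if not unmet:
            break
        share = available / len(unmet)
        for i in unmet:
            take = min(share, demands[i] - attained[i])
            attained[i] += take
            available -= take
    return attained

# Demands from the table (in Gbps), 10G of link bandwidth available
result = wbfs([0.5, 2.0, 2.3, 3.0, 4.0], 10.0)
assert [round(x, 3) for x in result] == [0.5, 2.0, 2.3, 2.6, 2.6]
```

Each loop iteration corresponds to one column of the table: 2G per class initially, then .5G to the three unmet classes, then .1G to the last two.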

  • TM

    External Use 60

    DPAA: SEC Engine

  • TM

    External Use 61

    Security Engine

    • Black Keys

    − In addition to protecting against external bus snooping, Black Keys cryptographically protect against key snooping between security domains

    • Blobs

    − Blobs protect data confidentiality and integrity across power cycles, but do not protect against unauthorized decapsulation or substitution of another user’s blobs

    − In addition to protecting data confidentiality and integrity across power cycles, Blobs cryptographically protect against blob snooping/substitution between security domains

    • Trusted Descriptors

    − Trusted Descriptors protect descriptor integrity, but do not distinguish between Trusted Descriptors created by different users

    − In addition to protecting Trusted Descriptor integrity, Trusted Descriptors now cryptographically distinguish between Trusted Descriptors created in different security domains

    • DECO Request Source Register

    − Register added

  • TM

    External Use 62

    QorIQ T4240 Processor SEC 5.0 Features Header & Trailer off-load for the following Security Protocols:

    − IPSec, SSL/TLS, 3G RLC, PDCP, SRTP, 802.11i, 802.16e, 802.1ae

    (3) Public Key Hardware Accelerator (PKHA)

    − RSA and Diffie-Hellman (to 4096b)

    − Elliptic curve cryptography (1024b)

    − Supports Run Time Equalization

    (1) Random Number Generators (RNG4)

    − NIST Certified

    (4) Snow 3G Hardware Accelerators (STHA)

    − Implements SNOW 3G

    − Two for Encryption (F8), two for Integrity (F9)

    (4) ZUC Hardware Accelerators (ZHA)

    − Two for Encryption, two for Integrity

    (2) ARC Four Hardware Accelerators (AFHA)

    − Compatible with RC4 algorithm

    (8) Kasumi F8/F9 Hardware Accelerators (KFHA)

    − F8 , F9 as required for 3GPP

    − A5/3 for GSM and EDGE

    − GEA-3 for GPRS

    (8) Message Digest Hardware Accelerators (MDHA)

    − SHA-1, SHA-2 256,384,512-bit digests

    − MD5 128-bit digest

    − HMAC with all algorithms

    (8) Advanced Encryption Standard Accelerators (AESA)

    − Key lengths of 128-, 192-, and 256-bit

    − ECB, CBC, CTR, CCM, GCM, CMAC, OFB, CFB, and XTS

    (8) Data Encryption Standard Accelerators (DESA)

    − DES, 3DES (2K, 3K)

    − ECB, CBC, OFB modes

    (8) CRC Unit

    − CRC32, CRC32C, 802.16e OFDMA CRC

    [Figure: SEC 5.0 block diagram — the Queue Interface and Job Ring I/F feed the Job Queue Controller and Descriptor Controllers, supported by DMA and RTIC; the CHA pool contains DESA, AESA, MDHA, AFHA, PKHA, STHA, KFHA, ZHA, and RNG4.]

  • TM

    External Use 63

    [Figure: CHA arbitration — per-type arbiters share the AFHA, RNG4, and PKHA units and the STHA f8/f9 pairs, while each half of the pool has its own MDHA, CRCA, AESA, KFHA, DESA, and ZUC encryption/authentication (ZUCE/ZUEA) units.]

    Life of a Job Descriptor

    • QI has room for more work, issues dequeue request for 1 or 3 FDs

    • Qman selects FQ and provides 1 FD along with FQ Information

    • QI creates [internal] Job Descriptor and if necessary, obtains output buffers

    • QI transfers completed Job Descriptor into one of the Holding Tanks

    • Job Queue Controller finds an available DECO, transfers JD1 to it

    • DECO initiates DMA of Shared Descriptor from system memory, places it in Descriptor Buffer with JD from Holding Tank

    • DECO executes descriptor commands, loading registers and FIFOs in its CCB

    • CCB obtains and controls CHA(s) to process the data per DECO commands

    • DECO commands DMA to store results and any updated context to system memory

    • As input buffers are being emptied, DECO tells QI, which may release them back to BMan

    • Upon completion of all processing through CCB, DECO resets CCB

    • DECO informs QI that JD1 has completed with status code X, data of length Y has been written to address Z

    • QI creates outbound FD, enqueues to Qman using FQID from Ctx B field

    [Figure: SEC job flow — FDs dequeued from the Queue Manager enter the Queue Interface job prep logic (with per-FQ SP status and FQID list) and the Holding Tank pool; the Job Queue Controller dispatches job descriptors from the Job Queues and Job Rings (JR0–JR3) to the DECO pool (DECO 0–7), each DECO having a Descriptor Buffer and CCB; DMA moves shared descriptors and frames to/from DDR/CoreNet, and emptied buffers are released to the Buffer Manager.]

  • TM

    External Use 64

    DPAA: DCE

  • TM

    External Use 65

    DPAA Interaction: Frame Descriptor Status/CMD

    • The Status/Command word in the dequeued FD allows software to modify the processing of individual frames while retaining the performance advantages of enqueuing to a FQ for flow based processing

    • The three most significant bits of the Command /Status field of the Frame Descriptor have the following meaning:

    FD layout (four 32-bit words, bits numbered 0–31):

    − Word 0: DD | LIODN offset | BPID | ELIODN offset | – – – – | addr (upper bits)

    − Word 1: addr (cont)

    − Word 2: Format | Offset | Length

    − Word 3: Status/Cmd

    CMD Token: Pass through data that is echoed with the returned Frame.

    3 MSB Description

    000 Process Command Command Encoding

    001 Reserved

    010 Reserved

    011 Reserved

    100 Context Invalidate Command Token

    101 Reserved

    110 Reserved

    111 NOP Command Token
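Decoding the 3 MSBs of a 32-bit Status/Command word is a simple shift (a sketch; the slide numbers bits big-endian, so bit 0 is the MSB):

```python
# Command encodings from the table above; all other values are reserved
DCE_CMDS = {0b000: "PROCESS", 0b100: "CONTEXT_INVALIDATE", 0b111: "NOP"}

def dce_cmd(status_cmd_word):
    # The 3 most-significant bits of the 32-bit word select the command
    return DCE_CMDS.get(status_cmd_word >> 29, "RESERVED")

assert dce_cmd(0xE0000000) == "NOP"        # b111
assert dce_cmd(0x00000012) == "PROCESS"    # b000, low bits are other fields
assert dce_cmd(0x20000000) == "RESERVED"   # b001
```

The remaining 29 bits carry the per-frame control flags and, on returned frames, the status.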

    [Bit-field figure, bits 0–31 of the Status/Cmd word: CMD (3 MSB) followed by per-frame control flags (OO, Z, Flush, SCRF, R, I, RB, B64, CF, CE, UHC, USPC, USDC, SCUS); on the returned (output) Frame the word carries Status.]

  • TM

    External Use 66

    DCE Inputs

    • SW enqueues work to DCE via Frame Queues. FQs define the flow for stateful processing

    • FQ initialization creates a location for the DCE to use when storing flow stream context

    • Each work item within the flow is defined by a Frame Descriptor, which includes length, pointer, offsets, and commands

    • DCE has separate channels for compress and decompress

    [Figure: DCE inputs — software enqueues FDs (PID/BPID/Addr, Offset, Length, Status/Cmd, each pointing at data buffers) through a DCP portal onto the work queues (WQ0–WQ7) of separate compress and decompress channels; each channel's command FQs locate per-flow stream context via Context_A.]

  • TM

    External Use 67

    DCE Outputs

    • DCE enqueues results to SW via Frame Queues as defined by the FQ Context_B field. When buffers are obtained from BMan, the buffer pool ID is defined by the input FQ

    • Each result is defined by a Frame Descriptor, which includes a Status field

    • DCE updates flow stream context located at Context_A as needed

    [Figure: DCE outputs — results leave the compress and decompress channels through the DCP portal as FDs (PID/BPID/Addr, Offset, Length, Status/Cmd) pointing at data buffers, enqueued to the FQs named by Context_B; the flow stream context at Context_A is updated as needed.]

  • TM

    External Use 68

    PME

  • TM

    External Use 69

    Frame Descriptor: STATUS/CMD Treatment

    • PME Frame Descriptor Commands

    − b111 NOP NOP Command

    − b101 FCR Flow Context Read Command

    − b100 FCW Flow Context Write Command

    − b001 PMTCC Table Configuration Command

    − b000 SCAN Scan Command

    FD layout (four 32-bit words, bits numbered 0–31):

    − Word 0: DD | LIODN offset | BPID | ELIODN offset | – – – – | addr (upper bits)

    − Word 1: addr (cont)

    − Word 2: Format | Offset | Length

    − Word 3: Status/CMD

    Scan command Status/CMD fields: Scan (b000) | SRVM | F | S/R | E | SET | Subset

  • TM

    External Use 70


    Life of a Packet inside Pattern Matching Engine

    • Combined hash/NFA technology

    • 9.6 Gbps raw performance

    • Max 32K patterns of up to 128B length

    • Patterns

    − Patt1 /free/ tag=0x0001

    − Patt2 /freescale/ tag=0x0002

    • KES

    − Compares the hash value of incoming data (frames) against all patterns

    • DXE

    − Retrieves the pattern with a matched hash value for a final comparison

    • SRE

    − Optionally post-processes the match result before sending the report to the CPU
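The KES→DXE two-stage flow can be sketched in a few lines. The real engines use hardware hash tables over pattern prefixes; the hash function and the prefix length below are purely illustrative:

```python
PREFIX = 4  # illustrative prefix length for the hash stage

patterns = {b"free": 0x0001, b"freescale": 0x0002}

# Build the "hash table": bucket patterns by a hash of their first PREFIX bytes
buckets = {}
for pat, tag in patterns.items():
    buckets.setdefault(hash(pat[:PREFIX]), []).append((pat, tag))

def scan(data):
    reports = []
    for i in range(len(data)):
        # KES stage: cheap hash lookup on a fixed-size window
        for pat, tag in buckets.get(hash(data[i:i + PREFIX]), []):
            # DXE stage: retrieve the candidate pattern and compare exactly
            if data[i:i + len(pat)] == pat:
                reports.append((i, tag))
    return reports

# Both /free/ and /freescale/ match at the same offset
assert scan(b"I want to search freescale") == [(17, 0x0001), (17, 0x0002)]
```

The hardware additionally keeps state in the SRE so a pattern like /freescale/ can match even when it straddles two frames of the same flow, as in the figure.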

    [Figure: PME block diagram — the Pattern Matcher Frame Agent (PMFA) pulls frames via the BMan/QMan on-chip system bus interface over CoreNet; the Key Element Scanning Engine (KES) checks hash tables, the Data Examination Engine (DXE) accesses pattern descriptors for the exact comparison, and the Stateful Rule Engine (SRE) emits user-definable reports. Example: TCP flow A (192.168.1.1:80 → 10.10.10.100:16734) carries "I want to search free " in FD1 and "scale FTF 2014 event schedule" in FD2; Patt1 /free/ tag=0x0001 matches in FD1.]

  • TM

    External Use 71

    Debug

  • TM

    External Use 72

    Core Debug in Multi-Thread Environment

    • Almost all resources are private. Internal debug works as if they are separate cores

    • External debug is private per thread. An option exists to halt both threads when one thread halts

    − While threads can be debug-halted individually, this is generally not very useful if the debug session cares about the contents of the MMU and caches

    − Halting both threads prevents the other thread from continuing to compute and essentially cleaning the L1 caches and the MMU of the state of the thread which initiated the debug halt

  • TM

    External Use 73

    DPAA Debug trace

    • During packet processing, FMan can trace the packet processing flow through each of the FMan modules and trap a packet.

    FD layout (four 32-bit words, bits numbered 0–31):

    − Word 0: DD | LIODN offset | BPID | ELIODN offset | – – – – | addr (upper bits)

    − Word 1: addr (cont)

    − Word 2: Fmt | Offset | Length

    − Word 3: STATUS/CMD

  • TM

    External Use 74

    Summary

  • TM

    External Use 75

    QorIQ T4 Series Advanced Features Summary Feature Benefit

    High perf/watt • 188k CoreMark in 55W = 3.4 CM/W

    • Compare to Intel E5-2650: 146k CM in 95W = 1.5 CM/W;

    • Or: Intel E5-2687W: 200k CM in 150W = 1.3 CM/W

    • T4 is more than 2x better than E5

    • 2x perf/watt compared to P4080, FSL’s previous flagship

    Highly integrated SoC — Integration of 4x 10GE interfaces, local bus, Interlaken, and SRIO means fewer chips (it takes at least four chips with Intel) and higher performance density

    Sophisticated

    PCIe capability

    • SR-IOV for showing VMs a virtual NIC, 128 VFs (Virtual Functions)

    • Four ports with ability to be root complex or endpoint for flexible configurations

    Advanced

    Ethernet

    • Data Center Bridging for lossless Ethernet and QoS

    • 10GBase-KR for backplane connections

    Secure Boot Prevents code theft, system hacking, and reverse engineering

    Altivec On-board SIMD engine – sonar/radar and imaging

    Power

    Management

    • Thread, core, and cluster deep sleep modes

    • Automatic deep sleep of unused resources

    Advanced

    virtualization

    • Hypervisor privilege level enables safe guest OS at high performance

    • IOMMU ensures memory accesses are restricted to correct area

    • Virtualization of I/O blocks

    Hardware offload • Packet handling to 50Gb/s

    • Security engine to 40Gb/s

    • Data compression and decompression to 20Gb/s

    • Pattern matching to 10Gb/s

    3x Scalability • 1-, 2-, and 3-cluster solutions span a 3x performance range from T4080 to T4240

    • Enables customers to develop multiple SKUs from one PCB

  • TM

    External Use 76

    Other Sessions And Useful Information

    • FTF2014 Sessions for QorIQ T4 Devices

    − FTF-NET-F0070_QorIQ Platforms Trust Arch Overview

    − FTF-NET-F0139_AltiVec_Programming

    − FTF-NET-F0146_Introduction_to_DPAA

    − FTF-NET-F0147-DPAAusage

    − FTF-NET-F0148_DPAA_Debug

    − FTF-NET-F0157_QorIQ Platforms Trust Arch Demo & Deep Dive

    • T4240 Product Website

    − http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240

    • Online Training

    − http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240&tab=Design_Support_Tab


  • TM

    External Use 77

    Introducing The

    QorIQ LS2 Family

    Breakthrough,

    software-defined

    approach to advance

    the world’s new

    virtualized networks

    New, high-performance architecture built with ease-of-use in mind Groundbreaking, flexible architecture that abstracts hardware complexity and

    enables customers to focus their resources on innovation at the application level

    Optimized for software-defined networking applications Balanced integration of CPU performance with network I/O and C-programmable

    datapath acceleration that is right-sized (power/performance/cost) to deliver

    advanced SoC technology for the SDN era

    Extending the industry’s broadest portfolio of 64-bit multicore SoCs Built on the ARM® Cortex®-A57 architecture with integrated L2 switch enabling

    interconnect and peripherals to provide a complete system-on-chip solution

  • TM

    External Use 78

    QorIQ LS2 Family Key Features

    Unprecedented performance and

    ease of use for smarter, more

    capable networks

    High performance cores with leading

    interconnect and memory bandwidth

    • 8x ARM Cortex-A57 cores, 2.0GHz, 4MB L2

    cache, w Neon SIMD

    • 1MB L3 platform cache w/ECC

    • 2x 64b DDR4 up to 2.4GT/s

    A high performance datapath designed

    with software developers in mind

    • New datapath hardware and abstracted

    acceleration that is called via standard Linux

    objects

    • 40 Gbps Packet processing performance with

    20Gbps acceleration (crypto, Pattern

    Match/RegEx, Data Compression)

    • Management complex provides all

    init/setup/teardown tasks

    Leading network I/O integration

    • 8x1/10GbE + 8x1G, MACSec on up to 4x 1/10GbE

    • Integrated L2 switching capability for cost savings

    • 4 PCIe Gen3 controllers, 1 with SR-IOV support

    • 2 x SATA 3.0, 2 x USB 3.0 with PHY

    SDN/NFV

    Switching

    Data

    Center

    Wireless

    Access

  • TM

    External Use 79

    See the LS2 Family First in the Tech Lab!

    4 new demos built on QorIQ LS2 processors:

    Performance Analysis Made Easy

    Leave the Packet Processing To Us

    Combining Ease of Use with Performance

    Tools for Every Step of Your Design

  • TM

    © 2014 Freescale Semiconductor, Inc. | External Use

    www.Freescale.com


  • TM

    External Use 81

    QorIQ T4240 SerDes Options Total of four x8 banks

    High speed serial

    • 2.5 , 5, 8 GHz for PCIe

    • 2.5, 3.125, and 5 GHz for sRIO

    • 3.125, 6.25, and 10.3125 GHz for

    Interlaken

    • 1.5, 3.0 GHz for SATA

    • 1.25, 2.5, 3.125, and 5 GHz for

    debug

    Ethernet options:

    • 10Gbps Ethernet MACs with XAUI

    or XFI

    • 1Gbps Ethernet MACs with SGMII

    (1 lane at 1.25 GHz with 3.125

    GHz option for 2.5Gbps Ethernet)

    • 2 MACs can be used with

    RGMII

    • 4 x1Gbps Ethernet MACs can be

    supported using a single lane at 5

    GHz (QSGMII)

    • HiGig is supported with 4 lanes at 3.125 GHz or 3.75 GHz (HiGig+)

  • TM

    External Use 82

    Decompression Compression Engine

    • Zlib: As specified in RFC1950

    • Deflate: As specified as in RFC1951

    • GZIP: As specified in RFC1952

    • Encoding

    − supports Base 64 encoding and decoding (RFC4648)

    • ZLIB, GZIP and DEFLATE header insertion

    • ZLIB and GZIP CRC computation and insertion

    • 4 modes of compression

    − No compression (just add DEFLATE header)

    − Encode only using static/dynamic Huffman codes

    − Compress and encode using static OR dynamic Huffman codes

    − at least 2.5:1 compression ratio on the Calgary Corpus

    • All standard modes of decompression

    − No compression

    − Static Huffman codes

    − Dynamic Huffman codes

    • Provides option to return original compressed Frame along with the uncompressed Frame or release the buffers to BMAN
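The three container formats handled by the DCE can be exercised from software with Python's standard library — a host-side illustration of the RFCs, not the DCE driver API:

```python
import gzip
import zlib

data = b"FTF 2014 " * 100  # repetitive payload compresses well

z = zlib.compress(data)              # RFC 1950: DEFLATE in a zlib wrapper
g = gzip.compress(data)              # RFC 1952: DEFLATE in a gzip wrapper
co = zlib.compressobj(wbits=-15)     # RFC 1951: raw DEFLATE, no wrapper
raw = co.compress(data) + co.flush()

assert zlib.decompress(z) == data
assert gzip.decompress(g) == data
assert zlib.decompress(raw, wbits=-15) == data

# zlib adds a 2-byte header plus a 4-byte Adler-32 trailer; gzip adds a
# larger header plus CRC-32 and length trailers -- hence the size ordering
assert len(raw) < len(z) < len(g)
```

The header insertion and CRC/Adler computation that the DCE performs in hardware correspond exactly to the wrapper bytes these library calls add.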

    [Figure: DCE block diagram — QMan and BMan portals connect via QMan I/F, BMan I/F, and a bus I/F to CoreNet; a frame agent feeds the compressor and decompressor engines, which use 4KB and 32KB history buffers respectively.]
