
Page 1: Optimizing PCIe® Performance in PCs & Embedded Systems

Copyright © 2009, PCI-SIG, All Rights Reserved

Mike Alford, Gennum

Page 2: Disclaimer

Presentation Disclaimer: All opinions, judgments, recommendations, etc. that are presented herein are the opinions of the presenter of the material and do not necessarily reflect the opinions of the PCI-SIG®.

Some information in this presentation refers to a specification still in the development process.

Page 3: Agenda

Latency
– Link layer
– Packet layer
– Driver/SW
– System level

DMA engine architecture
– Conventional
– PCIe® optimized

Root Complex Characteristics
– Measured vs. theoretical

Avoiding hazards and race conditions

Interrupt controller design

Page 4: Overview

The objective of this presentation is to explore some of the important elements of system performance with respect to endpoint HW and SW design:
– How should they work and perform?
– How do they perform in actual implementations?

Best practices in endpoint design (FPGA and ASIC)

Page 5: Latency

There are many forms or layers of latency in systems. From the link outward to the application (system level, end-to-end latency):
– Link level latency (power management, etc.): 10's of ns
– Packet level latency (DMA, PCIe core, switch, RC, endpoint, memory): 100's of ns
– Driver level latency (driver SW, interrupt service latency): 10's to 100's of µs
– Application latency (application SW): > ms

Page 6: Latency Impact of Power Management

[Diagram: two power-management scenarios and the additional latency each adds]
– PM Scenario 1: Negotiate down the number of lanes. A packet that was striped over 4 lanes is instead sent over a single lane, adding latency.
– PM Scenario 2: Aggregate packets. Traffic on Link A and Link B is batched together, adding latency.

Page 7: Packet Latency: Pre-PCI

Prior to PCI, the cost of polling IO was about the same as polling memory: on the order of several hundred ns. This assumption was accepted in the architecture of IO devices and driver software.

[Diagram: a 486-class CPU with RAM and IO sharing a 32-bit/33 MHz local bus]

Page 8: Packet Latency: PCI

With PCI, system aggregate bandwidth improved while IO latency degraded.

Memory access has lower latency cost compared to IO

Especially if IO is below multiple layers of bridges

Page 9: Packet Latency: PCI Express®

PCI Express provides even more options for system expansion that can increase IO latency.

[Diagram: CPU and DRAM behind a Root Complex, with GPU and IO endpoints reached through PCIe switches, PCIe repeaters, and a PCIe cable]

PCIe cable delay = ~43ns/10m

Switch latency ~200ns or more

Page 10: Polling Latency Across the System

[Bar chart: average read latency (ns), log scale, for a cache read, a memory read, and a PCIe endpoint read on two PC systems. Measured values: roughly 1.5 to 2.5 ns for a cache read, 67 to 71 ns for a memory read, and 920 to 1541 ns for a PCIe endpoint read.]

Data Measured from 2 Different PC Systems

Page 11: Interrupt Latency Factors

More cache layers (L1/L2/L3)

Cache miss probability decreases but penalty for a miss increases

– Results in less predictable interrupt latency (larger max/min ratio)

– Will tend to get worse in the future

Deeper processor pipelines result in longer interrupt latencies due to the larger amount of context information that must be flushed or stored

Page 12: Interrupt Latency Experiment

Use a test endpoint card to generate an interrupt under SW control. Repeatedly measure the time interval between assertion of the interrupt and its de-assertion from within the interrupt handler to generate a histogram.

Vary system loading to observe the effect on interrupt latency

[Timing diagram: the measured interval T runs from interrupt assertion by HW to interrupt de-assertion by the ISR]
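A minimal C sketch of how such a measurement might be wired up, assuming a hypothetical test endpoint with a free-running timestamp counter and registers named INT_FIRE, INT_CLEAR and INT_LATENCY; none of these names or offsets come from the presentation.

    #include <stdint.h>

    /* Hypothetical register offsets for the test endpoint; a real card's
     * register map will differ. */
    #define INT_FIRE     0x00   /* write 1: assert interrupt, latch start time   */
    #define INT_CLEAR    0x04   /* write 1: de-assert interrupt, latch stop time */
    #define INT_LATENCY  0x08   /* read: (stop - start) in nanoseconds           */

    #define NBINS 1000                        /* 10 us bins, up to 10 ms  */
    static unsigned long histogram[NBINS];
    static volatile uint32_t *regs;           /* BAR0 mapping of the card */

    /* Interrupt handler: de-assert the interrupt, then bin the measured T. */
    void test_isr(void)
    {
        regs[INT_CLEAR / 4] = 1;              /* stops the HW timestamp counter */
        uint32_t ns = regs[INT_LATENCY / 4];  /* latency of this event          */
        unsigned bin = ns / 10000;            /* 10 us per bin                  */
        if (bin >= NBINS)
            bin = NBINS - 1;
        histogram[bin]++;
    }

    /* Driver side: fire interrupts repeatedly while system load is varied
     * (in a real test, wait for each sample before firing the next one). */
    void run_experiment(int iterations)
    {
        for (int i = 0; i < iterations; i++)
            regs[INT_FIRE / 4] = 1;           /* assert interrupt, start timer */
    }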

Page 13: Interrupt Latency Example

[Histogram: number of occurrences (log scale, 1 to 100,000) vs. interrupt latency (10 to 1000 microseconds), for three load conditions]
– Latency (system idle): Max = 33 µs
– Latency (>90% CPU load): Max = 18.2 ms
– Latency (20% CPU load): Max = 23.4 ms

Data Measured from PC System Running Windows XP

Note: “Max” values are the largest latencies measured under those conditions.

Page 14: Simple DMA Service Scenario

[Sequence diagram: the host repeatedly sets up a DMA transfer and the peripheral executes it; the service latency is the time from the peripheral needing the next transfer until the host has set it up]

If this latency is too long, then the peripheral will starve (example: dropped video frames, audio breakup)

Page 15: Service Latency vs. Buffer Size

[Chart: required buffer size (KB, 0 to 500) vs. throughput (MB/s, 0 to 500), with one curve per tolerable service latency: 1000 µs, 750 µs, 500 µs, 375 µs, 250 µs, 100 µs, 50 µs and 25 µs]

Example: 1 stream of 1080p60 video
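A rough sizing sketch of the relationship the chart plots: the endpoint must buffer at least throughput times the worst-case service latency. The 1080p60 numbers below assume 8-bit 4:2:2 video (2 bytes per pixel), which is an assumption for illustration, not a figure from the slide.

    #include <stdio.h>

    /* Minimum buffering needed to ride out a given service latency. */
    static double buffer_bytes(double throughput_mb_s, double latency_us)
    {
        return throughput_mb_s * 1e6 * (latency_us * 1e-6);   /* bytes */
    }

    int main(void)
    {
        /* Assumed stream: 1920x1080, 60 fps, 2 bytes/pixel (8-bit 4:2:2). */
        double stream_mb_s = 1920.0 * 1080.0 * 2.0 * 60.0 / 1e6;  /* ~249 MB/s */

        for (double lat_us = 100; lat_us <= 1000; lat_us += 300)
            printf("%.0f MB/s stream, %4.0f us latency -> %6.0f KB of buffer\n",
                   stream_mb_s, lat_us, buffer_bytes(stream_mb_s, lat_us) / 1024.0);
        return 0;
    }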

Page 16: DMA Engine Architecture

With PCIe, DMA can take advantage of the full duplex nature of the link. DMA can consist of multiple upstream DMA threads and multiple downstream DMA threads.

However, only one thread in each direction can be active on the link at any one instant

Page 17: Conventional Multi-channel DMA Approach

[Block diagram: multiple upstream DMA channels (Up DMA 1..m) and downstream DMA channels (Down DMA 1..n) share a single AHB bus through an AHB arbiter and AHB controller into the PCIe core (TL/LL/PHY)]

Interconnects like AHB are a poor choice because they don’t provide full duplex data transfer

Page 18: PCIe Optimized DMA Approach

[Block diagram: separate upstream and downstream data paths into the PCIe core; a MUX merges the upstream DMA channels and a DECODE block steers downstream traffic to the appropriate channel]

Page 19: Scatter/Gather Controller Example

[Block diagram of the SG controller: an SG engine with sequencer registers grouped as program control (DPTR, RA, RB), DMA control (SYS_ADDR_H, SYS_ADDR_L, XFER_CTL), event control/status (EVENT_SET, EVENT_CLR, EVENT_EN, EVENT) and sequencer control/status (CSR); a descriptor RAM with host access (data/address); and an instruction decoder with external conditional inputs for JMP condition select. The SG engine drives an upstream DMA master and a downstream DMA master. Per-channel FIFOs connect the application to the DMA masters, with FIFO status fed back to the sequencer, a MUX performing upstream channel select and a DECODE block performing downstream channel select. The host also has access to the SG registers, the application interacts with the EVENT mechanism, and an interrupt output is provided.]

Page 20: Example SG List Entry

Each entry consists of three words: XFER_CTL, SYS_ADDR_H and SYS_ADDR_L.
– XFER_CTL specifies xfer count, direction, stream ID, etc.
– SYS_ADDR_H / SYS_ADDR_L form the 64-bit host memory address
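A minimal C sketch of such an entry, assuming 32-bit words and an illustrative bit layout for XFER_CTL; only the three-word structure and field purposes come from the slide, the field positions and widths are hypothetical.

    #include <stdint.h>

    /* One scatter/gather list entry: a transfer-control word plus a 64-bit
     * host address split into high and low halves. */
    struct sg_entry {
        uint32_t xfer_ctl;     /* transfer count, direction, stream ID, ... */
        uint32_t sys_addr_h;   /* host address bits [63:32] */
        uint32_t sys_addr_l;   /* host address bits [31:0]  */
    };

    /* Hypothetical XFER_CTL layout, purely for illustration. */
    #define XFER_COUNT(bytes)   ((uint32_t)(bytes) & 0x00FFFFFF)  /* [23:0]  */
    #define XFER_DIR_UPSTREAM   (1u << 24)                        /* [24]    */
    #define XFER_STREAM_ID(id)  (((uint32_t)(id) & 0x7) << 25)    /* [27:25] */

    static inline struct sg_entry sg_make(uint64_t host_addr, uint32_t bytes,
                                          int upstream, unsigned stream_id)
    {
        struct sg_entry e = {
            .xfer_ctl   = XFER_COUNT(bytes) |
                          (upstream ? XFER_DIR_UPSTREAM : 0) |
                          XFER_STREAM_ID(stream_id),
            .sys_addr_h = (uint32_t)(host_addr >> 32),
            .sys_addr_l = (uint32_t)host_addr,
        };
        return e;
    }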

Page 21: SG Engine Instruction Set

Load, Store, Add System Address
– Used to manipulate the system address register
– The system address register specifies the host source/destination address for a DMA xfer

Load XFER_CTL
– Pushes a command into either the upstream or downstream DMA controller

Load, Store, Add RA/RB Registers
– Used to manipulate the indexing/counting registers RA and RB

Conditional Jump
– Used for polling FIFO status and for looping

"Event" assertion
– Used as a semaphore mechanism and to signal interrupts
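To make the instruction set concrete, here is a hypothetical encoding of a small descriptor-RAM program in C. The opcode names mirror the instructions listed above, but the encoding, operand format and the example program are invented for illustration.

    #include <stdint.h>

    /* Hypothetical opcodes mirroring the SG engine instruction set above. */
    enum sg_opcode {
        SG_LOAD_SYS_ADDR,   /* load the 64-bit system address register        */
        SG_ADD_SYS_ADDR,    /* advance the system address by an offset        */
        SG_LOAD_XFER_CTL,   /* push a command to the up-/downstream DMA ctrl  */
        SG_LOAD_RA,         /* load indexing/counting register RA             */
        SG_ADD_RA,          /* add to RA (e.g. decrement a loop counter)      */
        SG_JUMP_COND,       /* conditional jump on FIFO status, RA != 0, ...  */
        SG_EVENT_SET,       /* assert an event: semaphore and/or interrupt    */
    };

    struct sg_insn {
        enum sg_opcode op;
        uint64_t       operand;   /* address, count, jump target or event number */
    };

    /* Move 16 buffers of 64 KB upstream, then signal event 0 to the driver. */
    static const struct sg_insn demo_program[] = {
        /* 0 */ { SG_LOAD_RA,       16             },  /* 16 buffers to move   */
        /* 1 */ { SG_LOAD_SYS_ADDR, 0x100000000ULL },  /* first host buffer    */
        /* 2 */ { SG_JUMP_COND,     2              },  /* wait here while the
                                                          source FIFO is empty */
        /* 3 */ { SG_LOAD_XFER_CTL, 0x10000        },  /* 64 KB, upstream      */
        /* 4 */ { SG_ADD_SYS_ADDR,  0x10000        },  /* next host buffer     */
        /* 5 */ { SG_ADD_RA,        (uint64_t)-1   },  /* one buffer done      */
        /* 6 */ { SG_JUMP_COND,     2              },  /* loop while RA != 0   */
        /* 7 */ { SG_EVENT_SET,     0              },  /* done: raise event 0  */
    };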

Page 22: Simplified DMA Main Control Sequence

[Flowchart: after initialization, the sequencer loops round-robin over the channels. For each channel 0 .. n-1 it asks "Channel ready for servicing?"; if yes it services that channel, if no it moves on to the next, then wraps back to channel 0.]
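A minimal C rendering of that main loop, assuming hypothetical helpers channel_ready() and service_channel() that stand in for the FIFO-status test and the per-channel SG program.

    #define NUM_CHANNELS 8   /* illustrative channel count */

    /* Placeholder hooks; in the real engine these map to FIFO status checks
     * and to running the channel's scatter/gather program. */
    int  channel_ready(int ch);
    void service_channel(int ch);
    void initialize(void);

    void dma_main_sequence(void)
    {
        initialize();

        /* Round-robin forever: poll each channel, service it if it has work. */
        for (;;) {
            for (int ch = 0; ch < NUM_CHANNELS; ch++) {
                if (channel_ready(ch))
                    service_channel(ch);
            }
        }
    }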

Page 23: Root Complex Characteristics

Max Payload Size
– 128B on most systems
– 256B on some server-class chipsets and newer desktops
– 512B seen in some of the latest systems at Plugfest
– Spec allows up to 4KB

Max read request size supported
– Typical is 512B
– Spec allows up to 4KB

Typical read completion packet size
– Most systems use a cache-line based fetching mechanism, resulting in 64B cache-aligned packets
– Some RC chipsets provide a read combining feature that will opportunistically combine multiple sequential 64B packets
– Spec allows up to 4KB

Virtual channels
– Typical RCs/switches/FPGA cores support only the default VC0
– Spec allows for up to 8 hardware VCs
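As a practical aside not on the slide: the negotiated Max Payload Size and Max Read Request Size live in the Device Control register of the PCIe Capability structure, each as a 3-bit field encoding 128 << n bytes. A small decoding sketch (how devctl is obtained, e.g. a config-space read, is left out):

    #include <stdint.h>
    #include <stdio.h>

    /* Decode Max Payload Size and Max Read Request Size from the 16-bit
     * Device Control register (PCIe Capability structure).
     * Per the PCIe spec, field value n means 128 << n bytes. */
    static unsigned mps_bytes(uint16_t devctl)  { return 128u << ((devctl >> 5)  & 0x7); }
    static unsigned mrrs_bytes(uint16_t devctl) { return 128u << ((devctl >> 12) & 0x7); }

    int main(void)
    {
        uint16_t devctl = 0x2010;   /* example value: MPS = 128B, MRRS = 512B */
        printf("MPS  = %u bytes\n", mps_bytes(devctl));
        printf("MRRS = %u bytes\n", mrrs_bytes(devctl));
        return 0;
    }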

Page 24: Outstanding Reads vs. Performance

[Bar chart: throughput (MB/s, 0 to 800) vs. number of outstanding reads (1 to 4), one series per system (System 1 through System 4)]

Measured results for endpoint DMA reads to 4 different RCs (PCIe 1.x x4 link)
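The shape of that chart follows from a simple latency-bandwidth product: until the link itself saturates, read throughput is roughly (outstanding requests x request size) / completion round-trip time. A back-of-envelope sketch; the 512B request size, 2 µs round trip and 800 MB/s cap are illustrative assumptions, not measured values from the slide.

    #include <stdio.h>

    /* Approximate sustained read throughput with N outstanding requests,
     * capped by the usable link bandwidth. */
    static double read_throughput_mb_s(int outstanding, double request_bytes,
                                       double round_trip_us, double link_cap_mb_s)
    {
        double mb_s = outstanding * request_bytes / round_trip_us;  /* B/us == MB/s */
        return mb_s < link_cap_mb_s ? mb_s : link_cap_mb_s;
    }

    int main(void)
    {
        /* Illustrative numbers: 512B read requests, 2 us completion round trip,
         * ~800 MB/s usable on a PCIe 1.x x4 link. */
        for (int n = 1; n <= 4; n++)
            printf("%d outstanding -> ~%.0f MB/s\n",
                   n, read_throughput_mb_s(n, 512.0, 2.0, 800.0));
        return 0;
    }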

Page 25: Completion Ordering

[Diagram only; no text content recovered from this slide]

Page 26: Completion Ordering Summary

FIFO order
– Lowest latency for single-stream traffic
– Fewer outstanding requests needed to sustain throughput

Out of order
– Lowest latency for multi-stream traffic
– Better when you have multiple streams with small FIFOs
• Can use one outstanding request per FIFO and thus avoid re-ordering logic

Actual systems
– Some use FIFO order, some OoO
– Endpoint generally needs to support both unless the RC is always known
– Typical PCIe IP cores do not re-order for you; this requires additional logic
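One way to tolerate either ordering without re-order logic, along the lines the slide suggests, is to allow a single outstanding read per FIFO and steer each completion by its tag. A hedged C sketch with a hypothetical tag-to-FIFO mapping:

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_FIFOS 4

    struct fifo;                                   /* per-stream receive FIFO   */
    void fifo_push(struct fifo *f, const void *data, size_t len);
    int  send_read_request(uint8_t tag, uint64_t host_addr, uint32_t len);

    static struct fifo *fifo_by_tag[NUM_FIFOS];    /* tag -> FIFO, set at init  */
    static int          tag_busy[NUM_FIFOS];       /* 1 outstanding read / FIFO */

    /* Issue a read for a FIFO only if its tag is free. */
    int try_issue_read(uint8_t tag, uint64_t host_addr, uint32_t len)
    {
        if (tag_busy[tag])
            return 0;                    /* previous read still in flight */
        tag_busy[tag] = 1;
        return send_read_request(tag, host_addr, len);
    }

    /* Completion handler: the tag alone identifies the destination FIFO,
     * so completions may arrive in any order with no re-ordering logic. */
    void on_read_completion(uint8_t tag, const void *payload, size_t len,
                            int last_completion_for_request)
    {
        fifo_push(fifo_by_tag[tag], payload, len);
        if (last_completion_for_request)
            tag_busy[tag] = 0;           /* the FIFO may issue its next read */
    }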

Page 27: Link Efficiency

The specified link rate of 2.5 GT/s (PCIe 1.x), 5 GT/s (PCIe 2.x), 8 GT/s (PCIe 3.0) is not all usable.

Example: x1 link at PCIe 1.x
– 312.5 MB/s of raw bandwidth
– Subtract 8b/10b encoding = 250 MB/s
– Subtract link layer traffic (ack/nack, replay, FC updates, etc.)
– Subtract packet overhead

Packet overhead (per TLP):
– STP (PHY/PCS): 1 byte
– Sequence Number (DLL): 2 bytes
– Header (TLP): 12 bytes (32-bit requests and completions) or 16 bytes (64-bit requests)
– Data Payload (TLP): between 4 bytes and MAX_PAYLOAD size
– LCRC (DLL): 4 bytes
– END (PHY/PCS): 1 byte
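To connect the overhead list to the efficiency curve on the next page: with a 12-byte header, each TLP carries 20 bytes of framing, so a 128-byte payload yields 128 / 148, or roughly 86%, link efficiency. A small sketch of that calculation:

    #include <stdio.h>

    /* TLP link efficiency: payload / (payload + framing overhead).
     * Overhead = STP(1) + SeqNum(2) + Header(12 or 16) + LCRC(4) + END(1). */
    static double link_efficiency(unsigned payload_bytes, unsigned header_bytes)
    {
        unsigned overhead = 1 + 2 + header_bytes + 4 + 1;
        return (double)payload_bytes / (payload_bytes + overhead);
    }

    int main(void)
    {
        for (unsigned p = 4; p <= 4096; p *= 2)
            printf("payload %4u B: %5.1f%% (12B hdr), %5.1f%% (16B hdr)\n",
                   p, 100.0 * link_efficiency(p, 12), 100.0 * link_efficiency(p, 16));
        return 0;
    }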

Page 28: Link Efficiency vs. Payload Size

[Chart: link efficiency (0% to 100%) vs. payload size (4 bytes to 4 KB), with one curve for a 12-byte header and one for a 16-byte header]

Page 29: RC/Endpoint Performance Under System Load

Experiment
– x4 endpoint doing 600-700 MB/s DMA in each direction to host memory via the RC while the system is at ~2% CPU load (including the DMA driver)
– What happens to endpoint throughput when the host is stressed?
– Scenarios: 100% CPU load, memory stress test, high IO traffic, GPU-to-host-memory traffic
– Test results show a worst case degradation of only ~6% on a variety of PC motherboards

Conclusion
– Typical PC RC memory contention is minimal
– Rule of thumb for sustainable throughput: 150 MB/s per PCIe lane per direction (double for PCIe 2.x)

Page 30: Avoiding Hazards and Race Conditions

The definition of endpoint control/status registers needs to be multi-core/multi-thread friendly.

Think like a driver/OS programmer (or at least have them review your spec)
– Avoid registers that cause a state change on a read
• Can be a problem for bridges/processors that do caching/prefetching
– Where reads cause state changes, avoid packing multiple bit fields into the same naturally aligned DW (32 bits)
• Example: 8-bit read FIFOs from 2 different UARTs packed into the same DW
• Why? Byte lane selection is not available for block operations and prefetching, and some processors don't provide byte lane information for reads
– Use IOV-like constructs, such as providing multiple views of the register space to different processors/processes

Impact on performance
– Poor control mechanisms can result in ugly SW workarounds that can seriously impact performance
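A register-map sketch of the UART example above: the hazard-prone layout packs two read-to-pop FIFO bytes into one DW, while the safer layout gives each read-sensitive register its own naturally aligned DW. Names and widths are illustrative, not from the slide.

    #include <stdint.h>

    /* Hazard-prone layout: two read-to-pop UART FIFOs share one 32-bit DW.
     * A 32-bit read (or a prefetch) pops BOTH FIFOs even if the driver only
     * wanted one byte, and byte-lane information may not reach the device. */
    struct uart_regs_packed {
        volatile uint32_t uart0_and_uart1_rx;   /* [7:0] UART0 RX, [15:8] UART1 RX */
    };

    /* Safer layout: every register with a read side effect gets its own
     * naturally aligned DW, so a read touches exactly one FIFO. */
    struct uart_regs_safe {
        volatile uint32_t uart0_rx;      /* read pops one byte from UART0 FIFO */
        volatile uint32_t uart1_rx;      /* read pops one byte from UART1 FIFO */
        volatile uint32_t uart0_status;  /* read-only, no side effects         */
        volatile uint32_t uart1_status;  /* read-only, no side effects         */
    };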

Page 31: Interrupt Controller Design

Problem
– Use of IP blocks can result in a poorly thought out interrupt controller that SW engineers will hate you for
• Example: lack of a single register to determine the source of a shared interrupt
• Common practice is to "OR" together all on-chip interrupt sources and use that to generate INTx or MSI/MSI-X

Best practices
– Single read-only status register where the status bits of all IP blocks are always readable (even if interrupt forwarding/generation is disabled)
• No need to poll multiple registers to determine the source of an interrupt
• You always have the option of polling a single register rather than employing interrupts
– When multiple interrupts (hard or messaged) are to be generated, have an enable (or mask) register per interrupt output
– Make sure it is possible to clear each interrupt source separately
• Clearing one interrupt source will not cause another to be cleared unintentionally
– Make sure you support INTx (legacy interrupt mode) in addition to MSI or MSI-X
• Windows XP and earlier do not support MSI/MSI-X
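A driver-side sketch of why the single status register matters: one MMIO read identifies every pending source, each source is cleared individually, and the same register can be polled when interrupt generation is disabled. The register names and the write-1-to-clear convention are assumptions for illustration.

    #include <stdint.h>

    /* Hypothetical register word indices into a BAR mapping. */
    #define INT_STAT  0   /* read-only: one bit per on-chip interrupt source */
    #define INT_CLR   1   /* write-1-to-clear, one bit per source (assumed)  */
    #define INT_EN0   2   /* enable/mask register for interrupt output 0     */

    static volatile uint32_t *regs;      /* BAR0 mapping of the endpoint */

    void handle_source(unsigned src);    /* per-source service routine   */

    /* Shared interrupt handler: one read finds all pending sources. */
    void endpoint_isr(void)
    {
        uint32_t pending = regs[INT_STAT] & regs[INT_EN0];

        while (pending) {
            unsigned src = __builtin_ctz(pending);  /* lowest pending source  */
            regs[INT_CLR] = 1u << src;              /* clear only this source */
            handle_source(src);
            pending &= pending - 1;                 /* next pending bit       */
        }
    }

    /* The same status register supports a polled mode with interrupts masked. */
    int poll_once(void)
    {
        uint32_t pending = regs[INT_STAT];
        if (!pending)
            return 0;
        regs[INT_CLR] = pending;         /* clear what is about to be serviced */
        /* service each set bit ... */
        return 1;
    }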

Page 32: Example Interrupt Controller

[Block diagram: on-chip interrupt sources and a GPIO controller (7 programmable I/O pins with GPIO control and output registers) feed an interrupt controller with inputs INT0 through INT7. A single interrupt status register, INT_STAT, makes all source bits readable, and one interrupt configuration register per interrupt output (INT_CFG0, ...) controls forwarding. The outputs drive MSI generation and/or pins.]

Page 33: Tips for Scalable Endpoint Design

Assume that packet latency and context switching latency will increase in future systems
– Don't rely on fast interrupt handling to keep your data pipes filled
– Avoid interrupts altogether if possible; use low-frequency polling
– Rely on DMA with large scatter/gather lists that don't have to be updated very often

Assume that throughput AND latency will increase in future systems
– Have the host driver poll host-memory based semaphores rather than polling the IO subsystem
– Have the IO subsystem write semaphores into host memory
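A sketch of that host-memory semaphore pattern: the endpoint DMA-writes a completion counter into a coherent host buffer, and the driver polls plain memory (tens of ns per read) instead of issuing MMIO reads to the endpoint (hundreds of ns to about a microsecond each). The structure and names below are illustrative assumptions.

    #include <stdint.h>

    /* Completion area DMA'd by the endpoint into coherent host memory.
     * The endpoint advances write_index after each buffer it finishes. */
    struct completion_area {
        volatile uint32_t write_index;   /* advanced by the device via DMA */
    };

    static struct completion_area *comp;   /* e.g. from a coherent DMA allocation */
    static uint32_t read_index;            /* last entry the driver consumed      */

    void consume_buffer(uint32_t index);   /* driver-side processing hook         */

    /* Low-frequency polling loop: reads host memory rather than the
     * endpoint's registers across the PCIe link. */
    void poll_completions(void)
    {
        uint32_t wr = comp->write_index;   /* single cheap memory read */

        while (read_index != wr) {
            consume_buffer(read_index);
            read_index++;
        }
    }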

Page 34: Summary

SW/HW interaction should be designed to be relatively insensitive to packet level latency
– High sensitivity to packet level latency may be a sign of poor HW/SW interaction

Assume that interrupt latency will be wildly variable
– For isochronous data (example: video), rely on large SG lists so that the endpoint can operate for a long period without interrupt servicing

Take advantage of the bidirectional nature of the PCIe link
– Avoid internal busses that are not bidirectional, or have separate upstream/downstream buses to feed the transaction layer

Page 35: Thank You

Thank you for attending the PCI-SIG Developers Conference 2009

For more information please go to www.pcisig.com