
Linux on z Systems Disk I/O Performance

Martin Kammerer, Linux on z Systems Performance Evaluation

2017-10-09, Session I016653

IBM Systems Technical Events – ibm.com/training/events

Technical University Munich 2017

IBM SYSTEMS / Linux on z Systems Disk I/O Performance / Oct. 09, 2017 / © 2017 IBM Corporation


Motivation and Agenda

● In many Linux systems the disk attachment is not set up for best throughput

● This material can be used as a checklist to review existing or future server configurations

● Describes tuning knobs on all layers

– Application program

– Linux kernel / device drivers

– Interconnects

– Storage server

● Considerations for

– Linux guests on z/VM and KVM

– KVM host

● z14 benefits


Linux Kernel Components Involved In Disk I/O

● Application program – issues reads and writes

● Virtual File System (VFS) – dispatches requests to the different devices and translates to sector addressing

● Page cache – holds all file I/O data; direct I/O bypasses the page cache

● Logical Volume Manager (LVM) – defines the physical-to-logical device relation

● Device mapper (dm) / dm-multipath – dm holds the generic mapping of all block devices and performs the 1:n mapping for logical volumes and/or multipath; dm-multipath sets the multipath policies

● I/O scheduler / block device layer – I/O schedulers merge, order and queue requests and start the device drivers via the (un)plug device

● Device drivers – handle the data transfer: Direct Access Storage Device (dasd) driver, Small Computer System Interface (SCSI) driver, z Fibre Channel Protocol (zFCP) driver, Queued Direct I/O (qdio) driver
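The layering can be inspected on a running system. A minimal sketch, assuming the s390-tools, multipath-tools and LVM packages are installed (device names will of course differ per system):

    lsblk                 # block devices, partitions, LVM and multipath stacking
    lsdasd                # FICON/ECKD DASD devices known to the dasd driver
    multipath -ll         # FCP/SCSI multipath maps, path groups and path states
    dmsetup table         # device-mapper targets (linear, striped, multipath) per device
    lvs -o +devices       # logical volumes and the physical volumes they reside on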


Considerations For Application Programs (1)

● direct I/O:

– is enabled for reads and writes if a user program opens the file with the O_DIRECT flag

– bypasses the page cache

– provides reliable constant throughput values

– is recommended if high I/O rates are expected for a file

– is a good choice in all cases where the application does not want Linux to economize I/O and/or where the application itself buffers larger amounts of file contents

● async I/O:

– is exploited if a user program issues aio_read or aio_write system calls

– prevents the application from being blocked in the I/O system call until the I/O completes

– allows read merging by Linux in case of using page cache

– is recommended if high I/O rates are required in a user program

– can be combined with direct I/O (a common performance hint of database products); see the sketch below
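As an illustration of the combined effect, the following sketch uses the fio benchmark tool (not part of the presentation and installed separately); file name, size and queue depth are placeholders:

    # Random reads with direct I/O (bypassing the page cache) and asynchronous I/O
    # via libaio, keeping 16 requests in flight:
    fio --name=randread --filename=/mnt/data/testfile --rw=randread --bs=4k \
        --size=1G --direct=1 --ioengine=libaio --iodepth=16
    # The same run with --direct=0 goes through the page cache and typically shows
    # less constant throughput values.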


Considerations For Application Programs (2)

● Temporary files:

– should not reside on real disks

– should be in a ram disk or tmpfs to allow fastest access to these files

– don't need to survive a crash and therefore don't require a journaling file system

● File system:

– use the default file system of the Linux distribution

– select a well-suited journaling mode

● the “heaviest” mode can be the worst from a performance point of view

– to reduce contention in the file system directory, omit recording of the access time for heavily used files if possible (noatime or relatime)

● if not already set by default (as with xfs)
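A minimal sketch of both points, with hypothetical mount points, devices and sizes:

    # Temporary files in memory (tmpfs): fastest access, no journaling, no real disk I/O
    mount -t tmpfs -o size=2g tmpfs /srv/app/tmp
    # Suppress access-time updates on a heavily used file system
    mount -o remount,noatime /data
    # Persistent variant in /etc/fstab:
    # tmpfs        /srv/app/tmp  tmpfs  size=2g           0 0
    # /dev/dasdb1  /data         ext4   defaults,noatime  0 0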


Logical Volumes

● Linear logical volumes

– allow an easy extension of the file system

– don't increase performance

● Striped logical volumes

– provide the capability to perform simultaneous I/O on different stripes

– allow load balancing

– are extendable

– may lead to smaller I/O request sizes than linear logical volumes or plain physical volumes (if an I/O request contains more data than the stripe size)

● Don't use logical volumes for “/” or “/usr”

– If the logical volume gets corrupted your system is gone

● Logical volumes require more CPU cycles than physical disks

– Consumption increases with the number of physical disks used for the logical volume

● LVM / Logical Volume Manager is the tool to manage logical volumes
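A minimal sketch of a striped logical volume over four DASD partitions (device names, stripe size and volume size are examples only):

    pvcreate /dev/dasdb1 /dev/dasdc1 /dev/dasdd1 /dev/dasde1
    vgcreate datavg /dev/dasdb1 /dev/dasdc1 /dev/dasdd1 /dev/dasde1
    # one stripe per physical volume, 64 KiB stripe size
    lvcreate --stripes 4 --stripesize 64k --size 200G --name datalv datavg
    mkfs.ext4 /dev/datavg/datalv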


Linux Multipath (High Availability Feature For SCSI Volumes)

● Connects a volume over multiple independent paths and / or ports (path_grouping_policy)

● Multibus mode

– Uses all paths in a priority group alternately

– Determines the path by estimating the shortest service time from actual measurements

● the default policy is service-time in multipath.conf

● check the value actually used by the device mapper with “dmsetup table”

– To overcome bandwidth limitations of a single path

– Increases throughput for SCSI volumes in all Linux distributions

– For load balancing

● Failover mode

– Keeps the volume accessible in case of a single failure in the connecting hardware

– Should one path fail, the operating system will route I/O through one of the remaining paths, with no changes visible to the applications

– Will not improve performance
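A minimal sketch for checking and configuring the policy (the multipath.conf excerpt shows one possible multibus-style setup, not a recommendation for every environment):

    multipath -ll        # multipath maps, path groups and path states
    dmsetup table        # shows the path selector actually in use, e.g. "service-time"
    # Excerpt of /etc/multipath.conf:
    # defaults {
    #     path_grouping_policy  multibus
    #     path_selector         "service-time 0"
    # }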


Page Cache Considerations

● Pros

– Page cache helps Linux to economize I/O

– Read requests can be made faster by adding a read-ahead quantity, depending on the historical behavior of file accesses by applications

– Write requests are delayed and data in the page cache can have multiple updates before being written to disk

– Write requests in the page cache can be merged into larger I/O requests

● Cons

– Requires Linux memory pages

– Not useful when the cached data is not exploited, for example when

● the data is only needed once

● the application buffers the data itself

– Linux does not know which data the application really needs next, it only makes a guess

● There is no alternative to the page cache if the application cannot handle direct I/O


Page Cache Administration

● /proc/sys/vm/swappiness

– Is one of several parameters to influence the swap tendency

– Can be set individually with percentage values from 0 to 100

– The higher the swappiness value, the more the system will swap

– A tuning knob to let the system prefer to shrink the Linux page cache instead of swapping processes

– Different kernel levels may react differently on the same swappiness value

– Setting swappiness to 0 can have unexpected effects

● [flush-<major>:<minor>]

– These threads write data out of the page cache per device

– /proc/sys/vm/dirty_background_ratio

● controls whether writes go to disk more continuously or more in batches

– /proc/sys/vm/dirty_ratio

● if this limit is reached, writing to disk must happen immediately

– Both can be set individually with percentage values from 0 to 100
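A minimal sketch with hypothetical starting values; the right numbers depend on the workload and should be verified by measurement:

    sysctl -w vm.swappiness=10              # prefer shrinking the page cache over swapping
    sysctl -w vm.dirty_background_ratio=5   # start background writeback earlier
    sysctl -w vm.dirty_ratio=20             # limit at which writers must write out immediately
    # make persistent via /etc/sysctl.conf or a file in /etc/sysctl.d/, verify with:
    sysctl vm.swappiness vm.dirty_background_ratio vm.dirty_ratio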


I/O Scheduler

● Three different I/O schedulers are available

– noop scheduler

● only last-request merging

– deadline scheduler

● plus full search merge; avoids read request starvation

– completely fair queuing (cfq) scheduler

● plus all users of a particular drive are able to execute about the same number of I/O requests over a given time

● The default scheduler in current distributions is deadline

● Do not change the default scheduler / default settings without a reason

– Changing it could reduce throughput or increase processor consumption; the sketch below shows how to check and change the scheduler per device
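A minimal sketch for checking and temporarily changing the scheduler of one device; the device name dasda is an example, and the change does not survive a reboot:

    cat /sys/block/dasda/queue/scheduler    # current selection shown in brackets,
                                            # e.g.: noop [deadline] cfq
    echo noop > /sys/block/dasda/queue/scheduler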


I/O Scheduler Performance

● deadline scheduler

– throughput and processor consumption moderate

● noop scheduler

– showed throughput similar to the deadline scheduler

– may save processor cycles

● cfq scheduler

– showed no advantage in any of our measurements

– tends toward high processor consumption

[Chart: scheduler complexity versus processor consumption – noop (trivial, low consumption), deadline, cfq (complex, high consumption)]


General Layout For FICON/ECKD

● The dasd driver starts the I/O on a subchannel

● Each subchannel connects to all channel paths in the path group

● Each channel connects via a switch to a host bus adapter

● A host bus adapter connects to both servers

● Each server connects to its extent pools / ranks

[Diagram: application program → page cache → VFS → LVM / dm (dm-multipath) → I/O scheduler → block device layer → dasd driver → channel subsystem (chpid 1–4) → switch → HBA 1–4 → storage servers 0 and 1 → device adapters → ranks 1, 3, 5, 7]


FICON/ECKD dasd I/O To A Single Volume With Alias Devices

● VFS sees one device

● The dasd driver sees the real device and the alias devices

● Load balancing with alias devices is done by the dasd driver

[Diagram: same I/O stack without LVM/dm; the dasd driver drives the real device (a) and its alias devices (b, c, d) over chpid 1–4, the switch and HBA 1–4 to servers 0/1 and ranks 1, 3, 5, 7]


FICON/ECKD dasd I/O To A Logical Volume

● VFS sees one device (logical volume)

● The device mapper sees the logical volume and the physical volumes

● Load balancing in the logical volume is done by the device mapper

[Diagram: LVM/dm maps the logical volume onto physical volumes a–d; each is driven by the dasd driver over chpid 1–4, the switch and HBA 1–4 to servers 0/1 and ranks 1, 3, 5, 7]


FICON/ECKD dasd I/O To A Logical Volume Combined With PAV Devices

● VFS sees one device (logical volume)

● The device mapper sees the logical volume and the physical volumes

● Load balancing in the logical volume is done by the device mapper

● The dasd driver sees the real device and all alias devices

● Load balancing with alias devices is done by the dasd driver

[Diagram: LVM/dm maps the logical volume onto physical volumes a–d; for each of them the dasd driver additionally uses alias devices (e–p) over chpid 1–4, the switch and HBA 1–4 to servers 0/1 and ranks 1, 3, 5, 7]


I/O Processing Characteristics For FICON/ECKD

● 1:1 mapping host subchannel:dasd

● Serialization of I/Os per subchannel

● I/O request queue in Linux

● Disk blocks are 4 KiB

● High availability by FICON path groups

● Load balancing by FICON path groups and Parallel Access Volumes (HyperPAV, virtual PAV as z/VM guest)

● Linux dasd driver optimizes throughput and controls when to use

– Parallel Access Volumes (HyperPAV, boosts I/O)

– High Performance FICON (zHPF)
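A minimal sketch for verifying that HyperPAV alias devices are visible to the dasd driver (the bus ID is a placeholder):

    lsdasd                     # base devices and alias devices with their status
    chccwdev -e 0.0.4711       # bring an additional base or alias device online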


General Layout For FCP/SCSI

● The SCSI driver finalizes the I/O requests

● The zFCP driver adds the FCP protocol to the requests

● The qdio driver transfers the I/O to the channel

● A host bus adapter connects to both servers

● Each server connects to its extent pools / ranks

[Diagram: application program → page cache → VFS → LVM / dm (dm-multipath) → I/O scheduler → block device layer → SCSI driver → zFCP driver → qdio driver → chpid 5–8 → switch → HBA 5–8 → storage servers 0 and 1 → device adapters → ranks 2, 4, 6, 8]


FCP/SCSI LUN I/O To A Multipath Volume

● VFS sees one device

● The device mapper sees the multipath alternatives to the same volume

● Load balancing for the multipath volume is done by the device mapper

[Diagram: dm-multipath presents one volume; the same LUN is reachable over several paths (chpid 5–8, switch, HBA 5–8) to servers 0/1 and ranks 2, 4, 6, 8, and the device mapper balances the load across the paths]


FCP/SCSI LUN I/O To A Logical Volume Combined With Multipath Volumes

● VFS sees one device

● The device mapper sees the logical volume and the physical volumes and the multipath alternatives to the same volume

● Load balancing in the logical volume and for the multipath volumes is done by the device mapper

[Diagram: LVM/dm stripes the logical volume over several multipath volumes; each multipath volume is reachable over chpid 5–8, the switch and HBA 5–8 to servers 0/1 and ranks 2, 4, 6, 8]


I/O Processing Characteristics For FCP/SCSI

● Several I/Os can be issued against a LUN immediately

● Queuing in the FICON Express card and/or in the storage server

● Additional I/O request queue in Linux

● Disk blocks are 512 Bytes for compatibility reasons

● High availability by Linux multipathing, type failover or multibus

● Load balancing and performance by Linux multipathing, type multibus

● The FCP/SCSI datarouter technique improves throughput

– Kernel parmline zfcp.datarouter=1 (default in recent Linux distributions)
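A minimal check, assuming the zfcp driver exposes the setting as a module parameter under the usual sysfs path:

    cat /sys/module/zfcp/parameters/datarouter    # Y = data router enabled
    # if it is off, add zfcp.datarouter=1 to the kernel parameter line
    # (e.g. in /etc/zipl.conf) and re-run zipl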


Channel Paths

● High availability requires more than 1 channel path between host and storage server

● Throughput scales with the number of channel paths

● Switches should support the maximum channel speed


Channel Paths – Don’t Do This!!

● A small number of same-speed switch-to-switch interconnects will become the bottleneck and degrade your carefully configured hosts and storage servers


Storage server

● Provide a big cache to avoid real disk I/O

● Provide a Non-Volatile Storage (NVS) to ensure that cached data that has not yet been destaged to the disk devices is not lost in case of a power outage

● Provide algorithms to optimize disk access and cache usage, optimized either for sequential or for random I/O

● Have the biggest challenge with heavy random writes, because these are limited by the destage speed once the NVS is full


DS8000 Storage Server - Ranks

● The smallest unit is an array of disk drives

● A RAID 5 or RAID 10 technique is used to increase the reliability of the disk array; this defines a rank

● The volumes for use by the host systems are configured on the rank

● The volume has stripes on each disk drive

● All configured volumes will share the same disk drives


DS8000 Storage Server – Components

● A rank

– represents a raid array of disk drives

– is connected to the servers by device adapters

– belongs to one server for primary access

● A device adapter connects ranks to servers

● Each server controls half of the storage server's disk drives, read cache and NVS

[Diagram: servers 0 and 1, each with its own read cache and NVS, connect through device adapters to ranks 1–8 built from disk drives; the volumes are configured on the ranks]


DS8000 Storage Server – Extent Pool

● Extent pools

– consist of one or several ranks

– can be configured for one type of volumes only: either ECKD dasds or SCSI LUNs

– are assigned to one server

● Some possible extent pool definitions for ECKD and SCSI are shown in the example

[Diagram: example extent pool layout – ranks 1–8, attached via device adapters, grouped into ECKD and SCSI extent pools that are assigned to server 0 or server 1]


DS8000 Storage Pool Striped Volume

● A storage pool striped volume (rotate extents) is defined on an extent pool consisting of several ranks

● It is striped over the ranks in stripes of 1 GiB

● As shown in the example

– The 1 GiB stripes are rotated over all ranks

– The first storage pool striped volume starts in rank 1

[Diagram: a storage pool striped volume whose 1 GiB extents rotate over ranks 1–8, starting in rank 1]


DS8000 Storage Pool Striped Volume

● A storage pool striped volume

– uses disk drives of multiple ranks

– uses several device adapters

– is bound to one server side

[Diagram: a storage pool striped volume spans the disk drives of several extent pools / ranks through several device adapters, all belonging to one server side]


Storage Server Performance Considerations To Speed Up I/O

● Volume configuration in the storage server

– Configure the volumes as storage pool striped volumes in extent pools larger than one rank, to use simultaneously

● more physical disks

● more cache and NVS

● more device adapters

– Provide alias devices (HyperPAV) for FICON/ECKD volumes

● Striping in Storage Server or in Linux?

– Storage server striping

● is simple and requires only some administration effort when the storage server is configured

● the striping effort comes for free and is not accounted as host CPU cycles

– Linux striping

● can spread over all storage server extent pools

● allows arbitrary stripe sizes to match workload characteristics


Best Performance With …

● The use of the newest FICON card technology and appropriate switches is key to best throughput

● Throughput scales with the number of channel paths

● If the volumes are configured with storage pool striping a Linux striped logical volume may not be necessary

● For FICON/ECKD the use of HyperPAV devices is most important for high throughput

● If I/O rate is low

– Using the default (Linux page cache I/O) is the best choice

● If I/O rate is high

– Choose direct I/O, or better yet direct I/O combined with async I/O (both must be supported by the application program)

● While both FCP/SCSI and FICON/ECKD provide sufficient I/O throughput, FCP/SCSI is recommended if maximum performance is really needed

– Especially for random I/O (e.g. databases) FCP/SCSI is the best choice


Linux on z Systems Disk I/O Performance as z/VM Guest


Linux Guests With z/VM dasd Or edev Minidisks

● VFS sees one device

● Request serialization at the single device to z/VM

● Load balancing with alias devices is done by the dasd driver

[Diagram: the guest's I/O stack ends at the dasd driver, which issues I/O to a single minidisk device; z/VM maps it to the real volume and drives chpid 1–4, the switch and HBA 1–4 to servers 0/1 and ranks 1, 3, 5, 7]


Linux Guests With z/VM dasd Fullpack Minidisks With Virtual PAV

● VFS sees one device

● The dasd driver sees the real device and the alias devices

● Load balancing with alias devices is done by the dasd driver

[Diagram: as before, but the guest's dasd driver additionally drives virtual PAV alias devices; z/VM maps base and alias devices onto the real volume over chpid 1–4, the switch and HBA 1–4 to servers 0/1 and ranks 1, 3, 5, 7]


Linux Guests With z/VM Attached SCSI Ports

● VFS sees one device

● The device mapper sees the multipath alternatives to the same volume

● Load balancing for the multipath volume is done by the device mapper

[Diagram: the guest runs the full SCSI / zFCP / qdio stack with dm-multipath over z/VM attached FCP subchannels (chpid 5–8, switch, HBA 5–8) to servers 0/1 and ranks 2, 4, 6, 8]


Linux Guests With z/VM Disk I/O Performance Considerations

● All applies as already described for Linux

– Logical volumes

– Multipathing

– HyperPAV

– I/O paths to storage server

– Storage server configuration

– Storage pool striped volumes

● Use z/VM dasd or edev minidisks for performance uncritical purposes


KVM for z Systems Disk I/O Performance


Linux Kernel Components Involved In Disk I/O (KVM guest and host)

● Guest

– Application program – issues reads and writes

– Virtual File System (VFS) – dispatches requests to the different devices and translates to sector addressing

– Page cache – holds all file I/O data; direct I/O bypasses the page cache

– Block device layer / virtio_blk – transfers the I/O to the host

● Host

– virtio_blk / iothreads – transfer to and from the guest

– Page cache – holds all file I/O data; direct I/O bypasses the page cache

– Logical Volume Manager (LVM) – defines the physical-to-logical device relation; dm-multipath sets the multipath policies; the device mapper (dm) holds the generic mapping of all block devices and performs the 1:n mapping for logical volumes and/or multipath

– I/O scheduler / block device layer – I/O schedulers merge, order and queue requests and start the device drivers via the (un)plug device

– Device drivers – handle the data transfer: Direct Access Storage Device (dasd) driver, Small Computer System Interface (SCSI) driver, z Fibre Channel Protocol (zFCP) driver, Queued Direct I/O (qdio) driver


KVM Host Disk I/O Performance Considerations

● All applies as already described for Linux

– Logical volumes

– Multipathing

– HyperPAV

– I/O paths to storage server

– Storage server configuration

– Storage pool striped volumes


Disk Setup - Native Block Device

● The disk devices are 'owned' by the host as DASD or FCP device, but not mounted!

● The disk devices are propagated to the guest as independent virtio-blk data-plane devices.

● In the guest, each resulting virtio-blk device is partitioned and formatted with a file system and mounted.

[Diagram: disks 1…n are native devices on the host and appear in the guest as virtio-blk devices with guest file systems mounted at /mnt1 … /mntn]


Disk Setup - Image File

● The disk devices are 'owned' by the host as DASD or FCP devices

● Partitions are formatted and mounted on the host

[Diagram: each host disk carries a partition with an XFS file system mounted at /mnt/disk1 … /mnt/diskn; an image file such as /mnt/disk1/file.img backs one guest disk, which the guest mounts at /mnt1 … /mntn]

● The image file resides in the host file system

● The image files are propagated to the guest as independent virtio-blk data-plane devices

● In the guest, each resulting virtio-blk device is partitioned and formatted with a file system and mounted
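A minimal sketch of the two disk definitions in the guest domain XML (device paths, image file names and target names are placeholders):

    <!-- native block device, propagated as virtio-blk (performance critical) -->
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/mapper/mpatha'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <!-- image file in the host file system (performance uncritical) -->
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/mnt/disk1/file.img'/>
      <target dev='vdb' bus='virtio'/>
    </disk>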


KVM Host Disk I/O, Virtual Devices (1)

● Para-virtualized devices hide the real storage devices for the guests

● Devices presented to the guest could be

– Block devices of DASD or SCSI volumes

● for performance critical purposes

– Disk image files

● for performance uncritical purposes

● sparse files occupy only the written space (over-commit of disk space)

– Tradeoff: space consumption versus performance

change in           affects       increases
guest domain XML    this guest    throughput of this guest


KVM Host Disk I/O, Virtual Devices (2)

● KVM host and guest operating system maintain a page cache

– Set cache=none in the device section of the guest configuration (recommended)

● disables the host's page cache for this device

● Specify a quantity of iothreads in the guest configuration to enable simultaneous I/O

– Performance uncritical I/O devices can share the same iothread

– Performance critical I/O devices should have an individual iothread each

– The total number of iothreads should not be much bigger than the number of virtual processors of the guest

change in           affects                 increases
guest domain XML    this guest, KVM host    throughput of this guest
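A minimal guest domain XML sketch combining both recommendations (the number of iothreads, the device path and the target name are examples):

    <iothreads>2</iothreads>
    <devices>
      <!-- performance critical disk: own iothread, host page cache disabled -->
      <disk type='block' device='disk'>
        <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
        <source dev='/dev/mapper/mpatha'/>
        <target dev='vda' bus='virtio'/>
      </disk>
    </devices>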


Linux on z Systems Disk I/O on z14


Disk I/O on z14

● z14 FICON Express 16S+ versus z13 FICON Express 16S

– FICON Express 16S+ is designed to allow far more IOPS than its predecessor

– With both FCP/SCSI and FICON/ECKD, the expected tripling of IOPS on a single host channel could be achieved

– Higher IOPS rates can improve throughput in situations where many I/Os (most likely short requests) are issued

● Database workloads could benefit from this

– I/O benchmarks showed throughput increase of up to 30% in all disciplines (sequential read and write, random read and write)

– No changes required to get improvements


Thank You!

Martin Kammerer
Manager Linux on z Systems Performance Evaluation

Research & Development
Schönaicher Strasse 220
71032 Böblingen, Germany

[email protected]

Linux on System z – Tuning hints and tips http://www.ibm.com/developerworks/linux/linux390/perf/index.html

Live Virtual Classes for z/VM and Linux http://www.vm.ibm.com/education/lvc/

Mainframe Linux blog http://linuxmain.blogspot.com


Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other product and service names might be trademarks of IBM or other companies.


© Copyright IBM Corporation 2017. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.

IBM Systems Technical Events – ibm.com/training/events


ibm.com/training/systems

I016653