SO MUCH TO TALK ABOUT Chris Weeden CIUK 2016
• 1M lines of code
• 100+ tunables
• Custom code for specific areas
• Video surveillance
• Archiving (SMR)
• Synchronized RAID
• HPC …
• Uses >50 basic elements
• MTBF – 2M hours
… and a disk drive is just a disk drive ???
•The head flies 10 atoms above the media at 7.2K RPM
A380 flying at 500x speed of sound at ~1mm off the ground
•Writes and reads information every 15 nm
A380 counting every blade of grass
•Corrected error rate of 10
Making an irretrievable error on less than 10 blades of grass in an area the size of Ireland
If the Transducer were an Airbus A380 then the Disk is the Earth…
The Complete Storage Partner
Seagate Systems
Seagate
ClusterStor
Seagate
OneStor
Seagate
RealStor
Key
Customers
Key
Partners
Key
Differentiation
Direct strategic partnership Direct strategic partnership
DEEP storage systems engineering & IP combined with
DEEP vertical storage integration = UNIQUE VALUE PROPOSITION
Key
Products
Lustre IBM Spectrum
Scale
Archive Object
Storage Tier
Modular
Storage
Enclosures
Application
Controllers
Chassis RAID
Controllers
RAID & Storage
Software
RealQuick
RealSpan
Award Winning ClusterStor Architecture
Improved observations, science and
modelling, will deliver better forecasts
and advice to support UK business, the
public and government. It will help make
the UK more resilient to high impact
weather and other environmental risks.
Rob Varley
Met Office Chief Executive
We live in a weather sensitive
environment, and people and businesses
increasingly rely on us for accurate
environmental forecasting. Our new Cray
supercomputers will be a valuable
resource for us to meet our strategic,
operational and research objectives.
Kyung Heoun Lee
Director of the National Center for
Meteorological Supercomputing at KMA
These new systems are a key
component of our strategy of making
sure the DOD's scientists and engineers
have access to the most modern,
capable, and usable computational tools
available.
John West, director of the DOD's High
Performance Computing (HPC)
Modernization Program
The ClusterStor solution provided the
best performance density and,
therefore, was the most efficient high-
performance storage offering for our
environment.
Professor Thomas Ludwig Director DKRZ
and Research Team Leader
Powers the World's Fastest HPC Sites
5 of the 6 1TB/sec+ Storage File Systems Are Seagate ClusterStor
Seagate received SIX HPCwire Reader’s Choice Awards
› 24 x 8 TB drives
› Dual Controllers – 12G SAS
ClusterStor Product Line Overview
Vertically Integrated Like No Other: From the RAW media to the fastest systems in the world
Spectrum Scale
› Up to 100 GB/s per rack
› IBM SS 4.2
IEEL Lustre
› Up to 360 GB/s per rack
› Lustre 2.5 / 2.7
A200 - Object Store
› Tiered Archive
› More than 5 PB per rack
Seagate Software
GridRAID (PD-RAID)
Advanced telemetry
End to end Management
Guided repair
HSM data mover
Policy Engine enhancement
Advanced disk monitoring
Etc ...
SP-3424
Lustre Secure
› Up to 60 GB/s per rack
› Lustre 2.5 on SE-Linux
CP-3584
› Up to 84 x 8 TB drives
› Dual Controllers – 12G SAS
› 24 x 2.5” drives or SSDs
› Dual Controllers – 12G SAS
SP-3224
SAS SATA SSD
› HPC Drive
› 4TB
› 10K RPM
› SMR Drive
› 8TB
› SAS SSD
› 1.3 TB
› Up to 60 TB
› NL SAS
› 8 TB
› 7.2K RPM
Flash accelerators
› NVMe
› 1.3 TB
› PCIe x 16
› NVMe
› 10 GB/s
Fastest solution*
on the planet !!
* File system performance (GB/s) per [HDD, RU, Enclosure, Rack ….]
Smartest solution
on the planet !!
Fastest flash on
the planet !!
Fastest HDD on
the planet !!
Largest SSD on
the planet !!
Highest density HA Storage
Server on the planet !!
© 2016 Seagate, Inc. All Rights Reserved.
The Concept: Fully integrated, fully balanced, no bottlenecks …
ClusterStor Scalable Storage Unit
• Haswell/Broadwell CPUs
• EDR, OPA & 2x40 GbE, all SAS infrastructure
• SBB v3 Form Factor, PCIe Gen-3
• Embedded RAID & Lustre/ISS support
Enterprise Lustre 2.7 /
IBM Spectrum Scale 4.2.x
ClusterStor Manager
Data Protection Layer (PD-RAID/Grid-RAID)
Linux OS
Unified System Management (GEM-USM)
ClusterStor Engineered End-to-End Solution
Providing Productivity Critical Efficiency and Reliability
Integrated Software Complete Solution
Modular Building Blocks Performance + Capacity
Integrated Storage Servers and
Storage Capacity featuring Nytro
Flash Cache Accelerator & De-
Clustered RAID
The MOST complete End to End vendor
of HPC Storage Solutions!
High density disk enclosures
Integrated HA Object Storage Servers
Dedicated software development,
integration, test & support team
CLI, GUI and API Integration Administration
Manufacturing integration & test validation
Maximum performance from each disk drive
Makes Lustre and Spectrum Scale easier
to manage
Support and Professional services
Efficiency Matters – Engineered Solution We Build This
High Availability Meta Data Server
Performance for Thousands of Compute Clients
High Availability Management Servers
Health and Performance Management
High Availability
Capacity Expansion
Lower Price per PB +
Performance
High Availability Data Network
Provides Compute Clients an
Alternative Data Path to the
Storage Servers & HDD/SSD
High Availability System
Management Network
Eliminates Single Points
of Failure
Factory Integration and Test
Faster Time to Acceptance and
Production
File System & Linux OS
Integration
100’s of SW Improvements +
Test Validation
De-clustered Parity RAID
Faster Time to Recovery &
Higher Data Resiliency
Modular Performance +
Capacity Building Blocks
Balanced I/O up to 1.6TB/sec
Disk Health Monitoring
& Management
Faster Time to Repair
High Availability Storage Servers
Failover / Failback Monitoring
and Management
Custom SAS Performance
Accelerator
Provides additional HDD
performance of up to 16GB/sec
per enclosure module
• Seagate enclosures
– OneStor 2U24 – 12G SAS
• Laguna Seca EAMs
– Single Socket CPUs (Haswell)
– 4 DIMMs per CPU
– EDR/OmniPath support
• Next gen SAS SSDs
– Capacity up to 15.4 TB
– DWPD ~1 to 3 (10 optional)
– Up to 20 GB/s per enclosure
(benchmark performance)
Flash tier – Seagate All Flash Array
RAS Service Overview
CRU Guided Repair
Events
Seagate Service Service Call and Telemetry data Email
Performance Data Trained Engineer
ClusterStor L300
HPC Disk Drive
• 4TB, 10K RPM, 5D, 3.5” FF HDD
• Performance increases across the board vs. 7200 RPM
– Large block & small block
– Random & sequential
– Reads & writes
• 2M hr MTBF and 750 TB/yr workload ratings
• Targeting ~13W max. typical operating power
– PowerBalance™ setting available for ~2W lower power
• Configuration: 4Kn with 12Gb/s SAS SED
– Seeding market with initial product offering
• Available with Seagate ClusterStor NOW
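To put the 2M hr MTBF and 750 TB/yr ratings above into operational terms, here is a minimal sketch. It assumes a constant failure rate (exponential model) and continuous 24x7 operation; the function names and the 10,000-drive fleet size are illustrative choices, not figures from the slides.

```python
import math

HOURS_PER_YEAR = 8760            # 24 h * 365 days
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that a drive fails within one year of continuous use,
    assuming a constant failure rate (an assumption, not a Seagate spec)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(num_drives: int, mtbf_hours: float) -> float:
    """Expected number of drive failures per year in a fleet of num_drives."""
    return num_drives * annualized_failure_rate(mtbf_hours)


def average_rate_mb_s(workload_tb_per_year: float) -> float:
    """Average transfer rate (MB/s) that uses up the yearly workload rating."""
    return workload_tb_per_year * 1e12 / SECONDS_PER_YEAR / 1e6


afr = annualized_failure_rate(2_000_000)
print(f"AFR at 2M hr MTBF: {afr:.2%}")
print(f"Expected failures/year in 10,000 drives: "
      f"{expected_failures_per_year(10_000, 2_000_000):.1f}")
print(f"750 TB/yr as an average rate: {average_rate_mb_s(750):.1f} MB/s")
```

Under these assumptions a 2M hour MTBF corresponds to an annualized failure rate of a little under half a percent, so a large system should still plan for a steady trickle of drive replacements.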
High level product description
Enterprise Performance 3.5 HDD
ClusterStor L300 HPC 4TB SAS HDD
HPC Industry First; Best Mixed Application Workload Value
[Figure: bar chart, scale 0 to 600, with three groups: Random writes (4K IOPS, WCD), Random reads (4K Q16 IOPS), Sequential data rate (MB/s)]
Performance Leader: World-beating performance over other 3.5in HDDs: Speeding data ingest, extraction and access
Capacity Strong: 4TB of storage for big data applications
Reliable Workhorse: 2M hour MTBF and 750TB/year ratings for reliability under the toughest workloads your users throw at it
Power Efficient: Seagate’s PowerBalance feature provides significant power benefits for minimal performance tradeoffs
Chart series: CS HPC HDD, NL 7.2K RPM HDD
HPC Storage: Performance Efficiency & Value
Seagate has not only delivered the fastest but also the most efficient HPC storage systems in the world
[Figure: Realized HDD Performance in HPC Storage Systems. Y axis: Performance: Raw Drive Sustained Perf (OD/MBs), 50MB/s to 300MB/s. X axis: Time, 2011 to 2016. An "Inefficiency GAP" separates the RAW HDD Performance Capability (OD) from the realized figures below.]
ORNL Spider, 2011: <25MB/s (per useable HDD)
Bluewaters, 2013: 69MB/s (per useable HDD)
Kaust, 2014: 90MB/s (per useable HDD)
DKRZ, 2015: 112MB/s (per useable HDD)
Seagate has Pioneered 10x performance improvement in HPC Storage in the last 5 years
Seagate HPC Drive
Drive Write Performance -- iostat
Introducing
ClusterStor Nytro
All Flash Storage Array: Very High Mixed I/O Workload Efficiency, Very High Price / PB
SSD + HDD Array: High Mixed I/O Workload Efficiency, Low Price / PB
All HDD Storage Array: Medium Mixed I/O Workload Efficiency, Lowest Price / PB
Nytro Delivers:
Up to 10 times higher performance for small or random I/O
Seagate ClusterStor: Any Workload, Any Time
Nytro Intelligent I/O Manager Seagate Nytro XD Cache Management Software
- Linux Filter Driver per OSS
- Monitors Write Block Stripe Size
- Admin-Definable Threshold
E.g., Block Stripe Size of 32 KB or less goes to SSD
- Small Blocks Write to SSDs
Data Flush/Writes to HDDs
- Large Blocks Write to HDDs
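The routing rule in the list above can be sketched as a toy model. This is only an illustration of threshold-based write steering with a later flush to HDD; the real Nytro XD software is a Linux filter driver, and the class and method names here (`TieredWriter`, `write`, `flush`) are hypothetical. The 32 KB default mirrors the example threshold on the slide.

```python
from dataclasses import dataclass, field

DEFAULT_THRESHOLD = 32 * 1024  # example threshold from the slide (admin-definable)


@dataclass
class TieredWriter:
    """Toy model: small writes land on SSD first, large writes go straight to HDD."""
    threshold: int = DEFAULT_THRESHOLD
    ssd_cache: dict = field(default_factory=dict)  # offset -> data awaiting flush
    hdd: dict = field(default_factory=dict)        # offset -> data on the HDD pool

    def write(self, offset: int, data: bytes) -> str:
        """Route one write by size and return the tier that absorbed it."""
        if len(data) <= self.threshold:
            self.ssd_cache[offset] = data
            return "ssd"
        # A large write supersedes any cached small write at the same offset.
        self.ssd_cache.pop(offset, None)
        self.hdd[offset] = data
        return "hdd"

    def flush(self) -> int:
        """Destage all cached small writes to HDD; return how many were moved."""
        moved = len(self.ssd_cache)
        self.hdd.update(self.ssd_cache)
        self.ssd_cache.clear()
        return moved


writer = TieredWriter()
print(writer.write(0, b"x" * 4096))           # 4 KB write
print(writer.write(1 << 20, b"x" * (1 << 20)))  # 1 MB write
print("flushed:", writer.flush())
```

Keying the cache by offset keeps the sketch short; a real block cache would track extents, overlaps and ordering.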
› Small Block Sizes are Written to the
GridRaid HDD storage pool
› The Last Accessed Small Block Stripe
is Written to the HDD OST in a
Continuous “Cache Flush” Cycle
ClusterStor Scalable Storage Unit
Object Storage Server #1
SSD Disk Pools are Configured as 1+1 /
RAID 10 w/OSS High Availability
Small Block Stripe
Sizes are Cached to a
SSD Disk Pool
Object Storage Server #2
Small Block Stripe
Sizes are Cached to a
SSD Disk Pool
Large Block Stripe Sizes
are written to HDD
Nytro Intelligent I/O Manager architecture
Nytro Intelligent I/O Manager (NIIOM)
– Filter driver and OS dependent functions
implemented as device mapper target driver
– Core caching library compiled as a Linux
kernel module with well defined APIs
– Works at the block layer, so it is transparent to the file
system and applications
– Core caching function is implemented as an
OS-agnostic portable library with well
defined interfaces
– Filter driver in the OS stack intercepts I/O and
routes it through the Cache Management Library
for caching functions
Linux Driver Architecture
Block Layer
PCIe
Nytro Low-Level
Driver
PCIe
SCSI Mid-Layer
sg sd st
Nytro XD DM
Target Filter
Driver
Device Mapper / Mapping Target
Interface
File Systems
Virtual File System
System Call Interface
Nytro XD
Caching Library
300N Nytro Preliminary Benchmarks (IOPS; chart panels: Random Read, Random Write)

              4      8     16     32     64    128    256
GridRAID   1662   1547   1502   1207    881    586    411
NytroXD   22949  22625  13168   5790   3823   1334    859

              4      8     16     32     64    128    256
GridRAID   2108   1980   1880   1567   1228    899    651
NytroXD    5403   4994   4281   3506   2617   1704   1194
IEEL 2.7 client, default perf tuning
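As a quick way to read the benchmark numbers above, the sketch below computes the NytroXD to GridRAID IOPS ratio at each x-axis value. The data are copied from the two charts. The text does not say which chart is the read panel and which is the write panel, nor the x-axis unit, so the series are simply labelled "first" and "second" here.

```python
X_VALUES = [4, 8, 16, 32, 64, 128, 256]  # x-axis values as printed; unit not stated

FIRST_CHART = {
    "GridRAID": [1662, 1547, 1502, 1207, 881, 586, 411],
    "NytroXD": [22949, 22625, 13168, 5790, 3823, 1334, 859],
}
SECOND_CHART = {
    "GridRAID": [2108, 1980, 1880, 1567, 1228, 899, 651],
    "NytroXD": [5403, 4994, 4281, 3506, 2617, 1704, 1194],
}


def speedups(chart: dict) -> list:
    """NytroXD IOPS divided by GridRAID IOPS at each x-axis value."""
    return [n / g for g, n in zip(chart["GridRAID"], chart["NytroXD"])]


for name, chart in (("first", FIRST_CHART), ("second", SECOND_CHART)):
    row = ", ".join(f"{x}: {s:.1f}x" for x, s in zip(X_VALUES, speedups(chart)))
    print(f"{name} chart -> {row}")
```

In both charts the advantage is largest at the small end of the x axis and narrows toward 256.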
ClusterStor
Lustre 2.7.x
Spectrum Scale 4.1
ClusterStor A200 Active Archive Product Overview
Combined with ClusterStor HSM or TSM to provide
automatic policy-driven data migration & retrieval
Unlimited scalability (file system size up to 2214 bytes)
High density storage up to 3.6PB* usable per rack
Utilizes network erasure coding to provide high
levels of data availability and data durability
No single points of failure, resilient across single
maintenance events
Dual 10Gb Ethernet node connectivity
IB as an option
HSM
Packaged as upgrade
to ClusterStor
CS A200
ClusterStor A200
Object API & portfolio of network based interfaces
(POSIX, pNFS, CIFS, S3, HDF5, non-POSIX …)
* moving to 5+ PB/rack in late 2016
© 2016 Seagate, Inc. Under NDA with Atos.
HAMR (BATTLING PHYSICS)
Economic Benefits of SMR drives
Backed by Seagate Object store
Read Head
Write Head
Updates destroy portion
of next track
Shingled Technology increases capacity
of a platter by 30-40%
› Write tracks are overlapped by up to 50%
of write width
› Read head is much smaller & can reliably read
narrower tracks
SMR Drives are optimal for object stores as
most data is static/WORM
› Updates require special intelligence and may
be expensive in terms of performance
› Wide tracks in each band are often reserved
for updates
CS A200 manages SMR Drives directly to
optimize workflow & caching
› A200 avoids the "Read-Update-Write" problem
by using Copy-On-Write !!
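To see why copy-on-write suits shingled bands, here is a toy cost model. It assumes that rewriting a track damages the next one, so an in-place update forces every later track in the band to be rewritten as well, while copy-on-write appends the new version at the band's write pointer and updates an index. The classes `ShingledBand` and `CowStore` are hypothetical illustrations, not the A200's actual design.

```python
class ShingledBand:
    """A band of overlapping tracks that can only be written cheaply in order."""

    def __init__(self, n_tracks: int):
        self.tracks = [None] * n_tracks
        self.write_pointer = 0      # next free track
        self.tracks_written = 0     # running cost counter

    def append(self, data) -> int:
        """Sequential write at the write pointer: costs one track write."""
        if self.write_pointer >= len(self.tracks):
            raise RuntimeError("band full")
        idx = self.write_pointer
        self.tracks[idx] = data
        self.write_pointer += 1
        self.tracks_written += 1
        return idx

    def update_in_place(self, idx: int, data) -> None:
        """Read-update-write: tracks after idx must be read back and rewritten."""
        tail = self.tracks[idx + 1:self.write_pointer]
        self.tracks[idx] = data
        self.tracks[idx + 1:self.write_pointer] = tail  # rewritten unchanged
        self.tracks_written += 1 + len(tail)


class CowStore:
    """Copy-on-write object index on top of a band: updates are appends."""

    def __init__(self, band: ShingledBand):
        self.band = band
        self.index = {}             # key -> track holding the latest version

    def put(self, key, data) -> None:
        self.index[key] = self.band.append(data)  # old version becomes garbage

    def get(self, key):
        return self.band.tracks[self.index[key]]


in_place = ShingledBand(100)
for i in range(50):
    in_place.append(f"v1-{i}")
before = in_place.tracks_written
in_place.update_in_place(0, "v2-0")
print("in-place update cost:", in_place.tracks_written - before, "track writes")

cow = CowStore(ShingledBand(100))
for i in range(50):
    cow.put(i, f"v1-{i}")
before = cow.band.tracks_written
cow.put(0, "v2-0")
print("copy-on-write update cost:", cow.band.tracks_written - before, "track write")
```

The trade-off is that superseded versions occupy space until the band is cleaned, which the sketch does not model.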
SMR Drives
HAMR Head and Laser Assembly
At the very tip of the actuator arm in a
hard drive is the slider
The slider flies only a few nanometers
above the disk
For HAMR heads the laser and laser
assembly are attached to the slider
Every head has its own laser
The laser is as long as a grain of salt
but about 1/3 as wide
HAMR Head and Laser Assembly (more pictures)
Slider
Recording Head
Laser
Laser Carrier
Writing and Reading Magnetic Data
In a HDD the media is comprised of trillions of tiny
magnetic grains that can be oriented either up or
down relative to the surface of the disk
Data is encoded by the presence or absence of a
magnetic transition in the clock window
If we see a transition, we call that a “1”
If there is not a transition, we call that a “0”
Our data channel has the ability to detect these
transitions and convert them back into binary user
data
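The "transition means 1, no transition means 0" rule described above is the NRZI convention. The sketch below is a toy model of it: grains are represented as +1 (up) or -1 (down) per clock window, and the function names are illustrative. Real read channels add further coding and error correction on top, which is not modelled here.

```python
def encode(bits, start=+1):
    """Return the magnetization (+1 up, -1 down) written in each clock window.

    A 1 flips the magnetization relative to the previous window (a transition);
    a 0 leaves it unchanged (no transition).
    """
    state = start
    magnetization = []
    for bit in bits:
        if bit:
            state = -state
        magnetization.append(state)
    return magnetization


def decode(magnetization, start=+1):
    """Recover bits by checking each clock window for a transition."""
    bits = []
    previous = start
    for current in magnetization:
        bits.append(1 if current != previous else 0)
        previous = current
    return bits


user_bits = [1, 0, 1, 1, 0, 0, 1]
written = encode(user_bits)
print("magnetization:", written)
print("read back:    ", decode(written))
```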
Clock Windows
Superparamagnetic Limit
The Néel-Arrhenius relation, τ = (1/f0)·exp(KuV/kT), gives τ,
the mean time it takes for a grain to flip due
to thermal agitation. It depends on the attempt
frequency, f0, the energy holding the grain in
place, KuV, and the thermal energy agitating the
grain, kT
The volume of the grains, V, needs to decrease
so we can continue to increase areal density.
This means that either Ku needs to increase or
we need to start scrubbing the data to refresh
thermal stability
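A rough numerical feel for this limit comes from the standard Néel-Arrhenius relation, τ = (1/f0)·exp(KuV/kT). The sketch below evaluates it for a few values of the stability ratio KuV/kT. The attempt frequency of 1e9 Hz and the example ratios are assumptions chosen for illustration, not values from the slides.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600
ASSUMED_ATTEMPT_FREQUENCY_HZ = 1e9  # assumed order of magnitude for f0


def flip_time_seconds(stability_ratio: float,
                      attempt_frequency_hz: float = ASSUMED_ATTEMPT_FREQUENCY_HZ) -> float:
    """Mean time for a grain to flip, tau = exp(KuV / kT) / f0."""
    return math.exp(stability_ratio) / attempt_frequency_hz


for ratio in (60, 40, 30):
    tau = flip_time_seconds(ratio)
    print(f"KuV/kT = {ratio}: tau = {tau:.3g} s = {tau / SECONDS_PER_YEAR:.3g} years")
```

Because the ratio sits in an exponent, halving the grain volume V at fixed Ku (for example from a ratio of 60 to 30) collapses the retention time by many orders of magnitude, which is why Ku has to rise as grains shrink.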
We’ll talk about strategies to make thermally
stable grains in the next few slides
HAMR
In a HAMR system we temporarily heat
the media during recording. This
magnetically softens the disk allowing us
to write
The heating and cooling happens very
quickly and it only takes a nanosecond for
the media to get hot and cool back down
We have media today that is very
thermally stable at room temperature but
we need HAMR to write on it
Reader
Reader Shields
Writer
Written
Data NFT
Heated Spot on
Media
Optical Waveguide
[Figure: scope trace of signal (V, -0.2 to 0.2) versus position (0 to 9000 nm), with the magnetic write gate, the laser write gate and a servo burst marked. Annotations: ACSN = 4.56 dB; ACSN = 23.80 dB; Signal@10MHz = 47 KFCI; Head Current = 13.9 mA; Laser Current = 40.1 mA.]
This is a scope trace of some HAMR data
from 2001.
A magnetic field is applied to the disk but
recording only happens when the laser is on.
And it Works!
HOW NFTS WORK
Near Field Transducers
We’ve known for over 100 years that diffraction
limits the minimum optical spot size of focused
propagating light waves in the far field.
In a Blu-ray drive, λ = 405 nm and sin θ = 0.85.
This gives a spot size of 238 nm. By today’s
definition of track pitch in a HDD, this is huge.
Even if we played an optical trick and used near-
field recording techniques, we still would not be
able to focus light smaller than ~100 nm.
We need another solution.
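The spot-size figure can be checked with the diffraction-limit relation on this slide, d_FWHM = 0.5 λ / NA. The sketch below uses 405 nm, the standard Blu-ray laser wavelength, together with the slide's NA of 0.85; with those inputs it reproduces the 238 nm quoted above. The function name is an illustrative choice.

```python
def fwhm_spot_size_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Diffraction-limited spot size (full width at half maximum), 0.5 * lambda / NA."""
    return 0.5 * wavelength_nm / numerical_aperture


blu_ray_spot = fwhm_spot_size_nm(405.0, 0.85)
print(f"Blu-ray spot size: {blu_ray_spot:.0f} nm")
```

Since the spot scales linearly with wavelength, shortening the wavelength alone cannot shrink a far-field spot by the order of magnitude that HDD track pitches would need.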
The Diffraction Limit
Abbe, Archiv f. Mikroskop. Anat., 9 (1873) 413.
Lord Rayleigh, Phil. Mag., 5 (1896) 167.
$$d_{\mathrm{FWHM}} = \frac{0.5\,\lambda}{\sin\theta} = \frac{0.5\,\lambda}{\mathrm{NA}}$$
The simple solution is to just block the light with
a screen and poke a small hole in it.
Hans Bethe (fresh off of calculating
thermonuclear blast yields for the Manhattan
Project) calculated in 1944 that the amount of
light T, of wavelength λ, you can get through a
circular hole of radius r, scales as the ratio of the
hole diameter to the wavelength to the 4th power.
This means if I make the hole smaller by a factor
of 2, I need 16x more power to get the same
amount of light through.
Small holes are just not practical since you
waste so much power at high areal densities.
Small Apertures
$$T \approx \frac{64}{27\pi^{2}}\left(\frac{r}{\lambda}\right)^{4}$$
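The fourth-power scaling can be made concrete with a short sketch. It evaluates the slide's expression, T ≈ 64/(27π²)·(r/λ)⁴, taking the prefactor as it appears on the slide; the 800 nm wavelength and the hole radii are arbitrary example values, and the function name is illustrative.

```python
import math


def small_hole_transmission(radius_nm: float, wavelength_nm: float) -> float:
    """Transmission through a sub-wavelength circular hole, per the slide's formula."""
    return 64.0 / (27.0 * math.pi ** 2) * (radius_nm / wavelength_nm) ** 4


WAVELENGTH_NM = 800.0  # example value only
big = small_hole_transmission(50.0, WAVELENGTH_NM)
small = small_hole_transmission(25.0, WAVELENGTH_NM)
print(f"r = 50 nm: T = {big:.3e}")
print(f"r = 25 nm: T = {small:.3e}")
print(f"power penalty for halving the hole: {big / small:.0f}x")
```

The ratio is independent of the wavelength and prefactor chosen, so the 16x penalty for halving the hole holds for any values plugged in.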
In 1998 Thomas Ebbesen, while working at NEC,
discovered that you can get more light through a
small hole if you use surface plasmons
He found that even though the hole area would
have only predicted a 5% throughput he was able
to measure a 9% throughput
Discovery of Extraordinary Transmission
200 nm Ag film
150 nm hole diameter
900 nm array period
~9% transmittance
~5% hole area
We’ve known about surface plasmons since the
1950s but Ebbesen showed us that we can do
something useful with it.
We can take light, convert it to a surface plasmon,
pass it through a tiny hole, and convert it back into
a photon with the net result being that we got more
light through than we would have otherwise
expected.
From Wikipedia:
Schematic representation of an electron density
wave propagating along a metal – dielectric
interface. The charge density oscillations and
associated electromagnetic fields are called
surface plasmon-polariton waves. The
exponential dependence of the electromagnetic
field intensity on the distance away from the
interface is shown on the right. These waves
can be excited very efficiently with light in the
visible range of the electromagnetic spectrum.
Surface Plasmons
Putting it all together
Light hits the gold disk and is converted into a surface
plasmon
These surface plasmons travel along the edge of the
disk and down the peg
The surface plasmons interact with the recording media
and heat the disk
The heated area in the media is roughly the size of the
peg
By using a NFT like this, we are able to locally heat a
very small area of the disk. Much smaller than what we
could have achieved just by focusing the light
The round disk in this picture is only 300 nm in
diameter. NFTs are really small.
[Figure: alternating + and - charges on the NFT disk and peg; labels: Surface Plasmons, Media Heating]
This is a HAMR drive with a clear cover
This drive is reading and writing data
You can see the laser light on the end of the
recording head
We have made more than 10,000 HAMR
drives during the last few years as we
develop this new technology
HAMR Drive Working
BIT PATTERNED MEDIA
Bit-Patterned Media
In conventional recording, to increase areal density the
grains need to shrink, and they become unstable
In a BPM system we have one grain per bit which is very
stable
By combining BPM and HAMR we can get an even higher
areal density than HAMR or BPM alone. We call this
heated dot magnetic recording (HDMR)
Challenges Remaining:
Developing a high volume, cost effective way to
manufacture the media is a challenge. Since each grain
is a bit, they need to be positioned precisely on the disk
There are still engineering challenges surrounding BPM
[Figure: track width and bit length compared for Bit-Patterned Media and Conventional Media]
Estimating Future Capacity from ASTC 2015 Roadmap
From the product manual for Seagate’s
ST8000NM0045 3.5” 7200 RPM
Enterprise drive, a 6 disk/12 head drive
has a capacity of 8TB with an areal
density of 802 GBPSI.
We can use the ASTC roadmap to scale
what kind of drive capacities are possible
in the future for a product like this
Technologies like helium filled drives will
allow more heads and disks to be used in
the same form factor. This will increase
the capacity of these drives further.
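The scaling argument above can be sketched as simple proportionality. The baseline (8 TB, 802 GBPSI, 6 disks) comes from the slide; the assumption that capacity scales linearly with areal density and with disk count, and the function names, are mine. The roadmap's own areal-density values are not in the text, so the sketch works backwards from the capacities shown on the chart (15, 48 and 97 TB).

```python
BASE_CAPACITY_TB = 8.0
BASE_AREAL_DENSITY_GBPSI = 802.0
BASE_DISKS = 6


def scaled_capacity_tb(areal_density_gbpsi: float, disks: int = BASE_DISKS) -> float:
    """Capacity if areal density and disk count scale linearly from the 8 TB baseline."""
    return (BASE_CAPACITY_TB * areal_density_gbpsi / BASE_AREAL_DENSITY_GBPSI
            * disks / BASE_DISKS)


def required_areal_density_gbpsi(target_tb: float, disks: int = BASE_DISKS) -> float:
    """Areal density needed to reach target_tb with the given number of disks."""
    return (BASE_AREAL_DENSITY_GBPSI * target_tb / BASE_CAPACITY_TB
            * BASE_DISKS / disks)


for target in (15, 48, 97):
    print(f"{target} TB on 6 disks needs about "
          f"{required_areal_density_gbpsi(target):.0f} GBPSI")

# Effect of adding disks, e.g. in a helium-filled drive (8 disks is an example value).
print(f"802 GBPSI on 8 disks: {scaled_capacity_tb(802.0, disks=8):.1f} TB")
```

This ignores changes in format efficiency and platter geometry, so it is only a first-order estimate.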
[Figure: 3.5” Drive Capacity roadmap; labelled capacities: 8 TB, 15 TB, 48 TB, 97 TB]
Seagate Confidential
Thank You!