TRANSCRIPT
Experiences with moving intelligence closer to storage
Pankaj Mehra, HP Labs
Patents Pending
Outline
• Storage intelligence
• Real examples from …
  – Transaction processing
  – Business intelligence
  – Content management
Storage is perceived as intelligent if it …
• Is application-aware
  – Knows about objects and/or metadata
• Embeds higher-layer functions
  – Packs in an index or runtime env
or
• Is smart about its low-level functionality
  – Can predict access pattern or can guarantee QoS
Selected examples
• Examples from database side
  – Stock exchange
  – Data warehouse
• An intelligent data access manager employing
  – Persistent memory
  – Multidimensional index
  – Embedded query processing
• An example from the content side
  – Document archive
• Smart Cells = storage nodes with embedded
  – Containers
  – Content index
  – Hash index
Why have we spent 3 years rearchitecting database I/O?
• Current database I/O technologies can deliver high throughput, but …
  – Techniques that improve throughput hurt response time
  – In real-world systems, response time must be bounded
• Persistent Memory is the way to provide higher throughput with faster response time
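The tension between throughput and response time can be made concrete with a small model of group commit. This is only a sketch under stated assumptions: the flush time and arrival gap below are illustrative numbers, not measurements from the talk, and the function names are hypothetical.

```python
FLUSH_US = 5000.0        # assumed time for one waited log flush to disk
ARRIVAL_GAP_US = 100.0   # assumed fixed gap between arriving transactions


def avg_response_us(batch_size: int) -> float:
    """Mean response time when batch_size transactions share one flush."""
    # Transaction i of the batch waits for the remaining (batch_size - 1 - i)
    # arrivals, then for the flush; the mean wait is gap * (batch_size - 1) / 2.
    return FLUSH_US + ARRIVAL_GAP_US * (batch_size - 1) / 2


def max_throughput_per_s(batch_size: int) -> float:
    """Upper bound on commits per second if flushes run back to back."""
    return batch_size * 1_000_000 / FLUSH_US


for k in (1, 8, 32):
    print(k, avg_response_us(k), max_throughput_per_s(k))
```

In this toy model, larger batches raise the throughput ceiling while every transaction in the batch waits longer, which is the trade-off the slide describes.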
Does faster response time matter? Yes
• Response-time-critical apps
  – Stock exchanges
  – Hot stocks – dependent trades
  – Minibatches or group commit
    • Increases RT
    • High transaction abort cost
• Real-time enterprise information directors
  – Telco, retail, supply chain
  – Publish-subscribe workflow – response time extends throughout
• Mixed-workload apps
  – Lot of small, unrelated transactions
  – High response time – more pressure for system resources, locks, etc.
[Figure: stock exchange processing. Labels: front-end processes; safe-store processes; order log files; matching processes; order books and trade results log; back-end processes; Stock Segment 1 … Stock Segment n]
The ‘long pole’ in the commit path
Traditional I/O vs. RDMA I/O
• Traditional SCSI-like I/O has inherently high latency
  – 100s of microseconds
  – Target-initiated DMA
  – High-overhead software path
• RDMA I/O is much faster
  – 10s of microseconds of latency or less
  – Can be host-initiated
  – Very thin software path; hardware does most of the work
[Figure: SCSI sequence between host and target. Labels: initiate command; send command; received command, initiate DMA; DMA (read or write) to/from host; DMA complete, send ack; ack; completion interrupt]
[Figure: RDMA sequence between host and target. Labels: initiate RDMA write; DMA write to target; completion (interrupt or polled); RDMA target is not actively involved]
Persistent Memory (PM)
Fast
• Hardware accelerated
• Can be used synchronously
  – Simpler protocol stack
  – Almost entirely in hardware
Reliable
• As durable as disk
  – Nonvolatile
  – Single fault tolerant: mirrored
• Independent fault zone
  – Not in a processor’s fault domain
  – Survives faults of other system components
Byte-grained
• No read-modify-write
• Structure friendly
• Byte-grained locking/sharing
  – Better concurrency
  – No false sharing
[Figure: PM Unit (PMU) block diagram. Labels: clients; PMM; SAN; SAN interface; nonvolatile memory (regions & permissions + region contents); side RAM; password; read/write PM region]
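The "no read-modify-write" point can be illustrated by counting bytes moved for a small update. This is a minimal sketch, assuming a 4 KiB block size and modeling both stores as in-memory byte arrays; `BlockDevice` and `ByteRegion` are hypothetical names, not part of any HP interface.

```python
BLOCK_SIZE = 4096  # assumed block size for the block device model


class BlockDevice:
    """Block-grained store: every update moves whole blocks."""

    def __init__(self, num_blocks):
        self.data = bytearray(num_blocks * BLOCK_SIZE)
        self.bytes_transferred = 0

    def read_block(self, n):
        self.bytes_transferred += BLOCK_SIZE
        return bytearray(self.data[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE])

    def write_block(self, n, block):
        assert len(block) == BLOCK_SIZE
        self.bytes_transferred += BLOCK_SIZE
        self.data[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE] = block

    def update(self, offset, payload):
        """Read-modify-write the single block containing the update."""
        n, start = divmod(offset, BLOCK_SIZE)
        assert start + len(payload) <= BLOCK_SIZE, "sketch handles one block only"
        block = self.read_block(n)
        block[start:start + len(payload)] = payload
        self.write_block(n, block)


class ByteRegion:
    """Byte-grained region: an update moves only the bytes that changed."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.bytes_transferred = 0

    def update(self, offset, payload):
        self.bytes_transferred += len(payload)
        self.data[offset:offset + len(payload)] = payload


disk = BlockDevice(num_blocks=4)
pm = ByteRegion(size=4 * BLOCK_SIZE)
record = b"commit:0042"  # an 11-byte record
disk.update(5000, record)
pm.update(5000, record)
print(disk.bytes_transferred, pm.bytes_transferred)
```

Both stores end up with identical contents; the block device moves one full block in each direction, while the byte-grained region moves only the record itself.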
A write-aside buffer in PM
• Persistent Memory used to buffer disk writes
  – Critical synchronous writes complete in fast PM
  – Larger asynchronous disk writes for higher throughput
  – Audit log volume is always flushed
• Better scaling with concurrent log writers
[Figure: database and log writer with a write-aside NPMU. Labels: database; log writer; log record; NPMU; LogVol; DataVol 1 … DataVol n; SCSI/iSCSI/FC; remote copy]
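The write-aside idea can be sketched as a small class: a commit completes once its record is in PM, and disk writes are issued later in larger batches. This is an illustrative model only; `WriteAsideLog`, its batch threshold, and the use of Python lists to stand in for the PM region and the log volume are all assumptions, not the NonStop implementation.

```python
class WriteAsideLog:
    """Toy write-aside buffer: commits complete in PM, disk writes are batched."""

    def __init__(self, batch_bytes):
        self.batch_bytes = batch_bytes  # assumed threshold for one large disk write
        self.pm = []          # stands in for a persistent memory region
        self.disk = []        # stands in for the log volume
        self.disk_writes = 0  # number of (large) disk I/Os issued

    def append(self, record: bytes) -> None:
        """Synchronous part of a commit: one small write to PM, then return."""
        self.pm.append(record)

    def drain(self, force=False) -> int:
        """Asynchronous part: move buffered records to disk in one large write."""
        pending = sum(len(r) for r in self.pm)
        if pending == 0 or (pending < self.batch_bytes and not force):
            return 0
        self.disk.extend(self.pm)
        self.disk_writes += 1
        moved = len(self.pm)
        self.pm.clear()  # PM space is reclaimed only after the disk write
        return moved

    def recover(self):
        """After a crash, the log is the disk contents plus whatever is still in PM."""
        return self.disk + self.pm


log = WriteAsideLog(batch_bytes=64)
for i in range(10):
    log.append(f"rec-{i:04d}".encode())  # 8 bytes per record
    log.drain()
print(log.disk_writes, len(log.disk), len(log.pm))
```

Ten small commits result in a single large disk write, and the two records still held in PM remain recoverable. A forced drain models the point that the audit log volume is always flushed eventually.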
Unified write-aside buffer and stable communication buffer in PM
• PM also used as a communication buffer
  – Provides a shared end-to-end persistence medium for the commit path
  – Log writing is now completely off of the critical path
  – Fast RDMA writes replace slow communication round trips
  – PM is still utilized as a buffer for log writes
PMU Implementations
• Software PMU prototypes for HP-UX and NonStop
• Hardware PMU for NonStop
• Advanced PMUs in design for next-generation servers
[Figure: PMU implementation diagram. Labels: client; ServerNet/InfiniBand; PMP; PMM; allocated memory; map/unmap memory; read/write metadata; read/write PM region; management commands]
Performance of prototype PMUs
Network                                        Read latency   Read bandwidth   Write latency   Write bandwidth
InfiniBand 4x (rx5670, HP-UX, Software PMU)    14.7 µs        337 MB/s         9.9 µs          337 MB/s
ServerNet 2 (S86000, NonStop, Hardware PMU)    14.5 µs        26.5 MB/s        14.2 µs         32.8 MB/s
The original commit path of an INSERT transaction in NonStop SQL
[Figure: commit message flow. Recoverable step labels: B. Checkpoint/ack; D. Checkpoint/ack; F. Checkpoint/ack]
Critical path:
• Cluster data copies
• 2 waited disk I/Os to audit volume
• 11 waited round trips on messages
Insert transaction commit flow with an audit write-aside checkpointing buffer
[Figure: commit message flow. Recoverable step labels: B. Checkpoint/ack; D,H. Checkpoint Deltas to PM; F. Checkpoint/ack; J. Save Commit Record in PM]
Critical path:
• Cluster data copies
• 2 waited writes to persistent memory, not disk
• 6 waited round trips on messages
End-to-end persistence eliminates entire processes and messages from critical path
Critical path:
• Just 2 cluster data copies
• 2 waited writes to persistent memory
• Merely 2 waited round trips on messages
[Figure: commit message flow. Components: client; TMF lib; DP2pri; ADPpri; TMPpri; DataVol; AuditVol; PMU. Recoverable step labels: A. Flush Changes; E. All Flushed; F,I. Checkpoint Commit Record to PM; H. Deltas; J. Commit Records; J. Release locks; Updated Records]
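The three "Critical path" summaries can be compared with a back-of-the-envelope cost model. The operation counts come from the slides; the per-operation latencies are assumptions chosen for illustration (the PM write figure is in the range of the prototype latencies reported in this deck, while the round-trip and disk numbers are guesses). Data copies and CPU time are ignored, and `CommitPath` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitPath:
    name: str
    round_trips: int   # waited message round trips on the critical path
    disk_ios: int      # waited disk I/Os on the critical path
    pm_writes: int     # waited persistent memory writes on the critical path


# Counts taken from the three "Critical path" slides.
PATHS = [
    CommitPath("original", round_trips=11, disk_ios=2, pm_writes=0),
    CommitPath("write-aside buffer", round_trips=6, disk_ios=0, pm_writes=2),
    CommitPath("end-to-end persistence", round_trips=2, disk_ios=0, pm_writes=2),
]

# Assumed per-operation latencies in microseconds; illustrative, not measured.
ROUND_TRIP_US = 100
DISK_IO_US = 5000
PM_WRITE_US = 15


def critical_path_us(p: CommitPath) -> int:
    """Sum the waited operations on the critical path of one commit."""
    return (p.round_trips * ROUND_TRIP_US
            + p.disk_ios * DISK_IO_US
            + p.pm_writes * PM_WRITE_US)


for p in PATHS:
    print(f"{p.name:24s} {critical_path_us(p):6d} us")
```

Under these assumed latencies the waited disk I/Os dominate the original path, and once they are replaced by PM writes the remaining cost is mostly message round trips, which is why removing round trips is the next step in the talk.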
Significantly better throughput on benchmark hot stocks
[Figure: throughput (4k inserts/sec, axis 0 to 2500) versus transaction size (32k, 64k, 128k; larger size = more boxcarring). Series: 1, 2 and 3 drivers without PM; 1, 2 and 3 drivers with PM]
What would persistent-memory-enabled storage controllers do?
[Figure: commit message flow. Recoverable step label: B, C. Checkpoint Changes to PM]
• Just 2 data copies total (optimal)
• ADP takes over data volume update function from DP2
  – Drains data from PMUs to disks and RDF peers off the critical path
Critical path:
• Just 2 data copies
• 2 waited writes to persistent memory
• Merely 2 waited round trips on message system
Completing the picture …
[Figure: commit message flow. Recoverable step label: B, C. Checkpoint Changes to PM]
• Shared state in persistent memory eliminates many TMFLib PIO messages
• Faster TMP releases locks based on shared transaction state
Critical path:
• Just 2 data copies
• 2 waited writes to persistent memory
• Zero waited round trips on message system before releasing locks
Other applications
• NAS filers
  – Directory and inode updates
  – Changes to actual file
• Local filesystems (VxFS)
  – Journal logs
  – Metadata changes
• Other database systems (Oracle)
  – Transaction logs
• iSCSI servers
  – Writes of disk blocks
Pushing function closer to data: DP2 and Business Intelligence workloads
Parallel query plan
Parallel query execution
[Figure: application process, ESP and DP2 process boxes. Caption: parallel group-by of a 4-way partitioned table]
DP2 is a database-aware volume manager
• Embeds a B-Tree index with awareness of database rows
  – B-tree based multidimensional access method
    • See “Multi Dimensional Access Method: An Efficient Search Method for Multidimensional B-Trees,” by Leslie et al.
• Has an embedded database runtime
  – Technically, can run any plan
  – In most practical situations, runs scans (including filters) and partial groupings
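Running scans, filters and partial groupings next to the data can be sketched as follows. This is not DP2 code: the table, the `storage_node_scan` and `merge_partials` functions, and the choice of a SUM aggregate are all hypothetical, picked to mirror the 4-way partitioned group-by shown earlier.

```python
from collections import Counter

# Hypothetical 4-way partitioned table of (region, amount) rows.
PARTITIONS = [
    [("east", 10), ("west", 5), ("east", 7)],
    [("west", 3), ("north", 8)],
    [("east", 1), ("north", 2), ("north", 4)],
    [("west", 9), ("east", 6)],
]


def storage_node_scan(rows, min_amount):
    """Runs next to the data: scan, filter, and partially group one partition."""
    partial = Counter()
    for region, amount in rows:
        if amount >= min_amount:          # filter pushed down to the node
            partial[region] += amount     # partial SUM(amount) GROUP BY region
    return partial


def merge_partials(partials):
    """Runs in the upper layer: combine per-partition partial sums."""
    total = Counter()
    for p in partials:
        total.update(p)   # Counter.update adds the counts together
    return dict(total)


result = merge_partials(storage_node_scan(p, min_amount=3) for p in PARTITIONS)
print(result)
```

Only one small partial result per partition crosses to the upper layer, rather than every qualifying row, which is the benefit of pushing the filter and the partial grouping down.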
Pushing content-awareness closer to data in HP StorageWorks™ Grid
StorageWorks Grid Architecture
Smart Cells
• Scalable distributed system of self-contained, all-inclusive data repositories
Principles
• Scale-out
• Federation
• Intelligence close to data
• Pluggable platforms supporting HP and 3rd-party storage services
Example
• HP RISS™ platform for Information Lifecycle Management services
[Figure: grid of Smart Cells. Labels: Smart Query Fabric; Storage: block, file & object; content indexing; attribute indexing; supported protocols and APIs]
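A Smart Cell, described earlier as a storage node with an embedded container, content index and hash index, can be sketched as a toy class. Everything below is an assumption made for illustration: the class and function names, the use of SHA-256, the whitespace tokenizer, and the use of the hash index for duplicate detection are not taken from the RISS implementation.

```python
import hashlib
from collections import defaultdict


class SmartCell:
    """Toy storage node with a container, a hash index and a content index."""

    def __init__(self):
        self.container = {}                     # object id -> bytes
        self.hash_index = {}                    # SHA-256 digest -> object id
        self.content_index = defaultdict(set)   # word -> object ids

    def store(self, obj_id, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.hash_index:           # duplicate content: keep one copy
            return self.hash_index[digest]
        self.container[obj_id] = data
        self.hash_index[digest] = obj_id
        for word in data.decode("utf-8", errors="ignore").lower().split():
            self.content_index[word].add(obj_id)
        return obj_id

    def query(self, word):
        """Answered locally from the cell's own index."""
        return set(self.content_index.get(word.lower(), ()))


def grid_query(cells, word):
    """Fan a query out to every cell and union the local answers."""
    hits = set()
    for cell in cells:
        hits |= cell.query(word)
    return hits


cells = [SmartCell(), SmartCell()]
cells[0].store("doc-1", b"quarterly trade report")
cells[0].store("doc-2", b"quarterly trade report")   # duplicate of doc-1
cells[1].store("doc-3", b"Trade confirmation email")
print(grid_query(cells, "trade"))
```

Each cell indexes and answers queries over its own data, so adding cells adds both capacity and query processing, in the spirit of the scale-out and "intelligence close to data" principles above.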
NASA/IEEE MSST 2005
HP Reference Information Storage Server (RISS): Principles of Storage Service Integration
[Figure: HP RISS Platform™ integration diagram. Labels: application realm; integration realm; appliance realm; applications; file system; mail server; database server; HTTP server; harvester; mailbox crawler; GRAU file/doc loader; fault handler; email/document shortcut; virtual file system; database; protocol handler; protocol plug-in; SMTP/IMAP; HTTPS (WebDAV, SOAP); DICOM; CIFS/NFS]
HP RISS platform uses “Grid” principles for scalability and performance today
RISS Scope
• Manage the semantics of application data
• Provide unified view of computing and storage resources
HP ProLiant Servers
• Off-the-shelf server or blade technologies
• Leverages advancements in hardware technology
Lifecycle Managed
• Secure
• Protected
• Retention
• Access Controlled
• Highly Available
• Tamperproof
[Figure: RISS platform overview. Labels: Email; Doc Mgmt; Third Party Apps; Stores (SOAP, SMTP); Queries (HTTP); StorageWorks™ RISS APIs; grid of SC nodes]
3rd-party storage services can integrate with HP StorageWorks Grid RISS 1.5 at even deeper level
[Figure: RISS 1.5 Platform layer diagram.
 Layers: Clients & Agentware; Shallow Integration Layer; Deep Integration Layer; Core Platform
 Labels: Exchange integration; Lotus Notes integration; file discovery & classification; basic metadata management; file migration; backup; dynamic metadata management; DB discovery & archiving; D2D agent; XAM client; GDS; IMA (RISS); Partner Q; BIBO API; Partner X; Partner A; chunking; container; versioning; replication; duplicate elimination; compression; query service; content indexing service; account services; auditing; install & config; monitor & control; policy; security; notification; XAM API; basic Web services; SMTP; CIFS, NFS; Web binding; DICOM; ECM/CRM; firewall / load balancing]
Final Comments
• The new face of intelligent storage
  – Memory semantic access with durability
  – Transaction aware
  – Index enabled
  – Sophisticated search and query substrate embedded
• SCSI is evil and having that as the only standardized transport for OSD command set is pathetic! (personal opinion)
  – Higher level functions demand higher level protocols and APIs
• SNIA efforts
  – XAM over OSD presents a thin ray of hope