datacenter.ai - predictive analytics for modern data … a predictive analytics framework for modern...
TRANSCRIPT
DataCenter.aiAPredictiveAnalyticsFrameworkforModernDataCedtner,
andACaseStudyforDiskFailurePrediction
Yuming MaArchitect,CiscoCloudServicesEricChenCEO,ProphetStor
March21,2017
DataCenterComponentFailure
• Datacenter componentfailureisnormal andregular• Unmanagedfailurecancause cascadingdisaster• Datacentercomponentshastonsofdatagenerated• Howtousethedataforbettermanagementandoperation?• Datacanbeusedtobuildanalytics• Analyticalintelligenceimprovesmanagementandoperation
DataCenterUsage
• AMcKinseystudyin2008peggingdata-centerutilizationatroughly6%.• AnAccenturepaperfrom2011samplingasmallnumberonAmazonEC2machinesfinding7percentutilizationoverthecourseofaweek• AGartnerreportfrom2012puttingindustrywideutilizationrateat12%.• The2014DataCenterEfficiencyAssessmentfromtheNRDChasfoundthatUSdatacenterutilizationat12to18%.• Howtoimprovedatacenterefficiency?
4
The challenges of the data center operation that DataCenter.ai aims to solve
Disks
Computing(Server&Storage)
Workload(Container)
DataCenter
Disksuddenfailure
Diskperformancedegradation
Workloadresourceallocationwaste
Disksizebloat
Over-committedcomputingresources
Over-heatedcomponents
Componentsuddenfailure
StorageClusterMonitoringandAnalytics
• Threetypesofdata:events,metrics,logs
• Datacollectedfromeachnode• Datapushedtomonitoringportals
• In-flightanalyticsforrun-timeRCA• Predictiveanalyticsforproactivealert
• Plugintosynthesizedataforclusterlevelmetricsandstatus
DataSources
Storageresources
……
Event
Metrics
Log
Computingresources … …
DataCenters
Workload …
Metadatacollection
8
Networkresources …
PredictiveAnalytics
storage Hyper-convergedApplicationServer
Machinelearning
Predictfailureandresourceconsumption
MetricsLogsEvents
Provideresourceallocationand
replacementplan
10
AI/MachineLearningAlgorithms
Time-seriesprediction
AnomalydetectionMonitor Alert PrescriptionPredict
Real-timealerting
Resourceallocationplan
Workloadoperationplan
Hardwarereplacementplan
11
DiskFailureScenarioinCeph
• Unmanageddiskfailure:• Diskfailed• OSDdaemondies• OSDisdownandout• AnewOSDwasselectedfordatareplica• CopyobjectsfromexistingreplicastonewOSD• FindandReplacethefaileddisk• Copyobjectsagain
• Manageddiskreplacement:• Setosd noout• MarkOSDout• ShutdownOSD• ReplacediskandstartOSD• Copyobjectsfromexistingreplicastonewdisk
ApacheSpark
InfluxDB
MachineLearning
PredictiveAnalysis
FuzzyLogic
Neo4J
DiskProphet
DeepLearning
15
SMARTdata
Diskperformance
data
ETL Pre-process Analytics
Failurediskprediction
Slowdiskprediction
Dashboard
Interaction
InputData
Metricsdatastore
Abandondiskprediction
Hostmetrics
Output
RESTAPI
Diskreplacementplan
DiskProphet providesmonitoring,prediction,andprescription
PredictionProcess
17
Agent Agent…
ApacheSparkworker
ApacheSparkworker
DiskProphetSupervisoractor
Life-expectancypredictionactor
Near-failurepredictionactor
InfluxDB
ApacheSparkworker
Performancepredictionactor
MultilayerPerceptron(MLP)isusedtopredictlifeexpectancyofdisks
Model life expectancy
model
SSD #42 ’s life expectancy is #4 bin(i.e. 360 ~ 450 days)
Normalize data and binning data
#1: 0 ~ 90 days#2: 90 ~ 180 days…
#12: 1080~ days
Day rangeBin
S.M.A.R.T data
Performance data
Size (available size and total capacity data)
Device metadata (vendor, model)
Predict life expectancy
Collect data from disks
Periodically update model with updated
data set
18
SupportVectorMachine(SVM)isusedtopredictnear-failurelikelihood
Historical data of good disks
Historical data of failure disks
Model near-failure likelihood
There is an 84% chance SSD #53 will
be failure in the next 14 days
=> Likely
Predict near-failure likelihood
S.M.A.R.T data
Performance data
Size (available size and total capacity data)
Device metadata (vendor, model)
Collect data from disks
Periodically update model with updated
data set
19
DiskProphet predictslifeexpectancyandnear-failurelikelihoodofdisks
Indexing
Binning
Normalization
Filter
Interpolation
InputSourcedata Transformdata
Disk Near-Failurelikelihood
LifeExpectancy
HD3215 Very likely 0 ~30days
WD3015 Likely 30 ~60days
ST400D5 Unlikely 180~210 days
Multilayerperceptron
Life-ExpectancyPrediction
SupportVectorMachine
Near-failurePrediction
DiskFailurePrediction
20
DiskProphet providesthereplacementplanbycombingthediskfailureandperformanceprediction
DiskPerformance
SMART
HostPerformance
InputdataFailure likelihood of SSD #2 within the next 7 days is “very likely”
Plan the replacement of SSD #2 at 2am
2am
#1 #2 #3
Diskfailureprediction
IOPSprediction
Predictivelowestloadingofnext24hours
21
PredictionTesting
2016/6/1~2016/9/30
Backlaze Testingdata
DiskProphetDiskFailure
PredictionReportDisk Near-Failure
likelihoodLifeExpectancy
HD3215 Very likely 0 ~30days
WD3015 Likely 30 ~60days
ST400D5 Unlikely 180~210days
25
94
94.5
95
95.5
96
96.5
97
97.5
98
1/1/16 1/11/16 1/21/16 1/31/16 2/10/16 2/20/16 3/1/16 3/11/16 3/21/16 3/31/16
DiskFailurePredictionAccuracy
Accuracy Average
Dailypredictionaccuracy
26
OnBackblaze dataset:1/1/16~3/31/16Averageaccuracyrateis96.1%
Summary
27
Failureprediction
Usageprediction
AITechnology
MonitoringConsole
Support Solution
Root-causeanalysis
DataCenter.aiautomatestheoperationofthesoftware-defineddatacenter