datacenter.ai - predictive analytics for modern data … a predictive analytics framework for modern...

28
DataCenter.ai A Predictive Analytics Framework for Modern Data Cedtner, and A Case Study for Disk Failure Prediction Yuming Ma Architect, Cisco Cloud Services Eric Chen CEO, ProphetStor March 21, 2017

Upload: dinhdieu

Post on 28-May-2018

212 views

Category:

Documents


0 download

TRANSCRIPT

DataCenter.aiAPredictiveAnalyticsFrameworkforModernDataCedtner,

andACaseStudyforDiskFailurePrediction

Yuming MaArchitect,CiscoCloudServicesEricChenCEO,ProphetStor

March21,2017

TrendofDataCenterManagement

DataCenterComponentFailure

• Datacenter componentfailureisnormal andregular• Unmanagedfailurecancause cascadingdisaster• Datacentercomponentshastonsofdatagenerated• Howtousethedataforbettermanagementandoperation?• Datacanbeusedtobuildanalytics• Analyticalintelligenceimprovesmanagementandoperation

DataCenterUsage

• AMcKinseystudyin2008peggingdata-centerutilizationatroughly6%.• AnAccenturepaperfrom2011samplingasmallnumberonAmazonEC2machinesfinding7percentutilizationoverthecourseofaweek• AGartnerreportfrom2012puttingindustrywideutilizationrateat12%.• The2014DataCenterEfficiencyAssessmentfromtheNRDChasfoundthatUSdatacenterutilizationat12to18%.• Howtoimprovedatacenterefficiency?

4

The challenges of the data center operation that DataCenter.ai aims to solve

Disks

Computing(Server&Storage)

Workload(Container)

DataCenter

Disksuddenfailure

Diskperformancedegradation

Workloadresourceallocationwaste

Disksizebloat

Over-committedcomputingresources

Over-heatedcomponents

Componentsuddenfailure

Howcanwedobetter?

StorageClusterMonitoringandAnalytics

• Threetypesofdata:events,metrics,logs

• Datacollectedfromeachnode• Datapushedtomonitoringportals

• In-flightanalyticsforrun-timeRCA• Predictiveanalyticsforproactivealert

• Plugintosynthesizedataforclusterlevelmetricsandstatus

DataSources

Storageresources

……

Event

Metrics

Log

Computingresources … …

DataCenters

Workload …

Metadatacollection

8

Networkresources …

Stream Analytics

9

StormTopology StormTopology

PredictiveAnalytics

storage Hyper-convergedApplicationServer

Machinelearning

Predictfailureandresourceconsumption

MetricsLogsEvents

Provideresourceallocationand

replacementplan

10

AI/MachineLearningAlgorithms

Time-seriesprediction

AnomalydetectionMonitor Alert PrescriptionPredict

Real-timealerting

Resourceallocationplan

Workloadoperationplan

Hardwarereplacementplan

11

Datacenter.ai Framework

12

AStorageCaseStudyforDiskFailurePrediction

13

DiskFailureScenarioinCeph

• Unmanageddiskfailure:• Diskfailed• OSDdaemondies• OSDisdownandout• AnewOSDwasselectedfordatareplica• CopyobjectsfromexistingreplicastonewOSD• FindandReplacethefaileddisk• Copyobjectsagain

• Manageddiskreplacement:• Setosd noout• MarkOSDout• ShutdownOSD• ReplacediskandstartOSD• Copyobjectsfromexistingreplicastonewdisk

ApacheSpark

InfluxDB

MachineLearning

PredictiveAnalysis

FuzzyLogic

Neo4J

DiskProphet

DeepLearning

15

SMARTdata

Diskperformance

data

ETL Pre-process Analytics

Failurediskprediction

Slowdiskprediction

Dashboard

Interaction

InputData

Metricsdatastore

Abandondiskprediction

Hostmetrics

Output

RESTAPI

Diskreplacementplan

DiskProphet providesmonitoring,prediction,andprescription

SaaSandon-Prem Model

16

Agent Agent DiskProphet Server

DiskProphet Cloud

PredictionProcess

17

Agent Agent…

ApacheSparkworker

ApacheSparkworker

DiskProphetSupervisoractor

Life-expectancypredictionactor

Near-failurepredictionactor

InfluxDB

ApacheSparkworker

Performancepredictionactor

MultilayerPerceptron(MLP)isusedtopredictlifeexpectancyofdisks

Model life expectancy

model

SSD #42 ’s life expectancy is #4 bin(i.e. 360 ~ 450 days)

Normalize data and binning data

#1: 0 ~ 90 days#2: 90 ~ 180 days…

#12: 1080~ days

Day rangeBin

S.M.A.R.T data

Performance data

Size (available size and total capacity data)

Device metadata (vendor, model)

Predict life expectancy

Collect data from disks

Periodically update model with updated

data set

18

SupportVectorMachine(SVM)isusedtopredictnear-failurelikelihood

Historical data of good disks

Historical data of failure disks

Model near-failure likelihood

There is an 84% chance SSD #53 will

be failure in the next 14 days

=> Likely

Predict near-failure likelihood

S.M.A.R.T data

Performance data

Size (available size and total capacity data)

Device metadata (vendor, model)

Collect data from disks

Periodically update model with updated

data set

19

DiskProphet predictslifeexpectancyandnear-failurelikelihoodofdisks

Indexing

Binning

Normalization

Filter

Interpolation

InputSourcedata Transformdata

Disk Near-Failurelikelihood

LifeExpectancy

HD3215 Very likely 0 ~30days

WD3015 Likely 30 ~60days

ST400D5 Unlikely 180~210 days

Multilayerperceptron

Life-ExpectancyPrediction

SupportVectorMachine

Near-failurePrediction

DiskFailurePrediction

20

DiskProphet providesthereplacementplanbycombingthediskfailureandperformanceprediction

DiskPerformance

SMART

HostPerformance

InputdataFailure likelihood of SSD #2 within the next 7 days is “very likely”

Plan the replacement of SSD #2 at 2am

2am

#1 #2 #3

Diskfailureprediction

IOPSprediction

Predictivelowestloadingofnext24hours

21

DashBoard

Testresults

23

AITraining

2013/4/10~2015/12/31

Backblaze TrainingdataDiskProphet DiskFailure

PredictionModel

24

PredictionTesting

2016/6/1~2016/9/30

Backlaze Testingdata

DiskProphetDiskFailure

PredictionReportDisk Near-Failure

likelihoodLifeExpectancy

HD3215 Very likely 0 ~30days

WD3015 Likely 30 ~60days

ST400D5 Unlikely 180~210days

25

94

94.5

95

95.5

96

96.5

97

97.5

98

1/1/16 1/11/16 1/21/16 1/31/16 2/10/16 2/20/16 3/1/16 3/11/16 3/21/16 3/31/16

DiskFailurePredictionAccuracy

Accuracy Average

Dailypredictionaccuracy

26

OnBackblaze dataset:1/1/16~3/31/16Averageaccuracyrateis96.1%

Summary

27

Failureprediction

Usageprediction

AITechnology

MonitoringConsole

Support Solution

Root-causeanalysis

DataCenter.aiautomatestheoperationofthesoftware-defineddatacenter

Thank You

YumingMa:[email protected]:[email protected]