June 20, 2005
Innovation and information technology
Research Efforts toward Non-Stop Services in High End and Enterprise Computing
Box Leangsuksun,
Associate Professor, Computer Science
Director, eXtreme Computing Research (XCR)
HA-OSCAR: unleashing HA Beowulf
Research Collaborators
– National, Academic and Industry Labs
• ORNL
• Intel, Dell, Ericsson
• Lucent, CRAY
• IU, NCSA, OSU, NCSU, UNM, TTU
• Systran
• OSDL (Linus is here)
• ANL, LLNL
Service Unavailability Impacts
• No performance and no functionality
• Losses of $195K - $58M for 3.5 hrs of downtime (Meta Group report, 2000) – (enterprise)
• Enterprise/shared major computing resources: 24/7/365 (enterprise/HPC-HEC)
• Critical HPC apps such as national security (homeland defense) (HPC-HEC)
• Service provider regulation/mandate – FCC mandate (Class 5 local switch = 5 9’s)
• Lost time and opportunities
• Life-threatening
RASS Definitions
• Reliability (MTTF, mean time to failure) – How fast does it fail?
• Availability – What is the total uptime?
– Availability = MTTF / (MTTF + MTTR), where MTTR is the mean time to repair
• Serviceability – How fast can the system be built, managed, and upgraded?
– Planned outages – 60% of total outages
• Security will impact Availability
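The availability formula above can be tried out with a few lines of Python. This is a minimal sketch: the function names and the sample MTTF/MTTR values are illustrative assumptions, not figures from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # assumes a 365-day year


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


# Hypothetical system: fails every 999 hours on average, takes 1 hour to repair.
a = availability(mttf_hours=999.0, mttr_hours=1.0)
print(f"availability = {a:.4f}")
print(f"downtime     = {downtime_minutes_per_year(a):.1f} min/year")

# "Five 9's" (99.999%) leaves only a few minutes of downtime per year.
print(f"five 9's     = {downtime_minutes_per_year(0.99999):.2f} min/year")
```

Shortening MTTR raises availability just as effectively as lengthening MTTF, which is the motivation for fast automatic failover.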
High Availability Open Source Cluster Application Resources (HA-OSCAR)
HA-OSCAR: unleashing HA Beowulf
HA-OSCAR overview
• Production-quality open source Linux cluster project
• HA and HPC clustering techniques to enable critical HPC infrastructure
• Self-configuring multi-head Beowulf system
• HA-enabled HPC services: Active/Hot Standby
• Self-healing with 3-5 sec automatic failover time
• The first known field-grade open source HA Beowulf cluster release
Monitoring & Self-healing cores
• ServiceMonitor: PBS, MAUI, NFS, and HTTP services are monitored
• ResourceMonitor: load_average, disk_usage, and free_memory are monitored
• Healthchannel Monitor: eth0 and eth0:1 interfaces are monitored
• Self-Healing Daemon
Monitoring and recovery
• Enhancements based on the kernel.org MON, IPMI, and net-SNMP frameworks
• Recovery
– Associative Response
• Local recovery, e.g. restart, checkpoint
• Failover (simple or impersonate/clone)
• Admin-defined actions
– Adaptive Response
• Previous state and number of retries
• Acceleration (time-series)
• E.g. maui dies, restart. After 3 retries within 3 mins, failover
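The adaptive response rule (restart a failed service, but fail over after 3 retries within 3 minutes) can be sketched as a small policy object. This is an illustration only, not HA-OSCAR's implementation; the class name, method name, and returned action strings are hypothetical.

```python
from collections import deque


class AdaptiveResponse:
    """Restart a failed service; fail over once retries pile up in a time window."""

    def __init__(self, max_retries: int = 3, window_s: float = 180.0):
        self.max_retries = max_retries  # retries allowed inside the window
        self.window_s = window_s        # window length in seconds (3 minutes)
        self._restarts = deque()        # timestamps of recent restarts

    def on_failure(self, now: float) -> str:
        """Return the action to take for a failure observed at time `now` (seconds)."""
        # Forget restarts that fell out of the sliding window.
        while self._restarts and now - self._restarts[0] > self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_retries:
            self._restarts.clear()
            return "failover"
        self._restarts.append(now)
        return "restart"


policy = AdaptiveResponse()
# A service that dies every 30 seconds exhausts its retries and triggers failover.
for t in (0, 30, 60, 90):
    print(t, policy.on_failure(t))
```

Keeping the history in a sliding window means an occasional crash is simply restarted, while a burst of crashes escalates to failover.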
Featured on the front cover of two major Linux magazines, and in various technical papers and research exhibitions.
Web site: http://xcr.cenit.latech.edu/ha-oscar
HA-OSCAR beta was released to the open source community in March 2004
Ongoing R&D work (lab-grade enhancements)
Reliability Modeling for Dummies
[Figure: state-transition diagram; states are labeled with 4-tuples, starting from 1,1,2,2 and degrading to states such as 0,0,1,1]
UML-based Approach: Semantic Mapping and Dependability Modeling
• UML representation of system architecture
• XMI representation with embedded dependability information
• Extracting dependability parameters and building logical representation
• Results showing reliability and availability of system
An example of UML tools
Examples in UML diagrams
Example of HA-OSCAR
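A single head cluster

| Component | λ | µ |
|---|---|---|
| Node | 21E-05 | 12E-04 |
| Switch | 32E-05 | 1E-03 |
| Client1 | 67E-05 | 32E-04 |
| Client2 | 89E-05 | 15E-04 |
| Client3 | 76E-05 | 16E-04 |
| Client4 | 54E-05 | 19E-04 |

| System Unreliability | MTTF (days) | System Instantaneous Unavailability | Availability Percentage | System Downtime Per Year |
|---|---|---|---|---|
| 1.7804E-01 | 300 | 2.87771E-03 | 99.71 | 25.2 hrs |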
HA-OSCAR

| Component | λ | µ |
|---|---|---|
| Node 1 | 3.4E-05 | 2E-05 |
| Node 2 | 8.6E-05 | 12E-04 |
| Switch 1 | 1E-05 | 2E-04 |
| Switch 2 | 1.3E-05 | 2.1E-04 |
| Client 1 | 2.5E-05 | 32E-04 |
| Client 2 | 9.8E-05 | 4E-04 |
| Client 3 | 6.7E-05 | 5E-04 |
| Client 4 | 3.5E-05 | 21E-05 |

| System Unreliability | MTTF (days) | System Instantaneous Unavailability | Availability Percentage | System Downtime Per Year |
|---|---|---|---|---|
| 92.1138E-03 | 331 | 2.10727E-05 | 99.997 | 11 min |
<RELIABILITY BLOCK DIAGRAM>
  <component> <name> Node1 <lambda> 3.4E-5 </lambda> <mu> 2.0E-5 </mu> </name> </component>
  <component> <name> Node2 <lambda> 8.6E-5 </lambda> <mu> 0.0012 </mu> </name> </component>
  <component> <name> Switch1 <lambda> 1.0E-5 </lambda> <mu> 2.0E-4 </mu> </name> </component>
  <component> <name> Switch2 <lambda> 1.3E-5 </lambda> <mu> 2.1E-4 </mu> </name> </component>
  <component> <name> Client4 <lambda> 3.5E-5 </lambda> <mu> 2.1E-4 </mu> </name> </component>
  <Series id=0> Node1 Switch1 Client1 </Block0> </Series>
  <Series id=1> Node1 Switch2 Client1 </Block1> </Series>
  <Series id=2> Node1 Switch1 Client2 </Block2> </Series>
  <Series id=3> Node1 Switch2 Client2 </Block3> </Series>
  <Series id=4> Node1 Switch1 Client3 </Block4> </Series>
  <Series id=5> Node1 Switch2 Client3 </Block5> </Series>
  <Series id=6> Node1 Switch1 Client4 </Block6> </Series>
  <Series id=7> Node1 Switch2 Client4 </Block7> </Series>
  <Series id=8> Node2 Switch1 Client1 </Block8> </Series>
  <Series id=9> Node2 Switch2 Client1 </Block9> </Series>
  <Series id=10> Node2 Switch1 Client2 </Block10> </Series>
  <Series id=11> Node2 Switch2 Client2 </Block11> </Series>
  <Series id=12> Node2 Switch1 Client3 </Block12> </Series>
  <Series id=13> Node2 Switch2 Client3 </Block13> </Series>
  <Series id=14> Node2 Switch1 Client4 </Block14> </Series>
  <Series id=15> Node2 Switch2 Client4 </Block15> </Series>
  <Parallel> id=0 id=1 id=2 id=3 id=4 id=5 id=6 id=7 id=8 id=9 id=10 id=11 id=12 id=13 id=14 id=15 </Parallel>
  <System Unreliability> 9.211E-02 </System Unreliability>
  <Mean Time to Failure> <days> 331 </days> </Mean Time to Failure>
  <System Instantaneous Availability per year> 99.997 </System Instantaneous Availability per year>
  <System DownTime per year> <min> 11 </min> </System DownTime per year>
</RELIABILITY BLOCK DIAGRAM>
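As a consistency check on the reported results, the availability percentage and yearly downtime follow directly from the system instantaneous unavailability figures (2.87771E-03 for the single head cluster, 2.10727E-05 for HA-OSCAR). The sketch below only redoes that arithmetic and assumes a 365-day year. It does not reproduce the dependability model that produced the unavailability values, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def summarize(unavailability: float) -> tuple[float, float]:
    """Return (availability in percent, downtime in minutes per year)."""
    availability_pct = 100.0 * (1.0 - unavailability)
    downtime_min = unavailability * MINUTES_PER_YEAR
    return availability_pct, downtime_min


single_head = summarize(2.87771e-03)  # unavailability reported for a single head cluster
ha_oscar = summarize(2.10727e-05)     # unavailability reported for HA-OSCAR

print(f"single head: {single_head[0]:.2f}% available, {single_head[1] / 60:.1f} hrs down per year")
print(f"HA-OSCAR:    {ha_oscar[0]:.4f}% available, {ha_oscar[1]:.1f} min down per year")
```

Both rows agree with the reported downtime: roughly 25 hours per year for the single head cluster against roughly 11 minutes per year for HA-OSCAR.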
Policy-based Fault Prediction, Hardware Management abstraction
Hardware Management abstraction
• Ability to access and control detailed status for better management (CPU temp, baseboard, power status, system ID, up/down, etc.)
• IPMI (Intelligent Platform Management Interface)
• OpenIPMI and OpenHPI (SA Forum)
• HW abstraction hides vendor-specific details
– CPU
– Power
– Memory
– Baseboard
– Fan (cooling)
Our early observations
01/25/2004 | 00:31:19 | Sys Fan 1 | critical
01/25/2004 | 00:31:19 | Sys Fan 3 | critical
01/25/2004 | 00:31:19 | Sys Fan 4 | critical
01/25/2004 | 00:31:19 | Processor 1 Fan | ok
01/25/2004 | 00:31:20 | Processor 2 Fan | ok
• Can set thresholds in managed elements to trigger events with severity levels
• Automatic failure trend analysis -> prediction
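Sensor readings in the pipe-separated form shown on this slide are easy to turn into structured events, which is the first step toward any automatic trend analysis. The sketch below assumes the four-field layout `date | time | sensor | status` seen in the sample; the function names are hypothetical and this is not the HA-OSCAR code.

```python
from datetime import datetime

SAMPLE_LOG = """\
01/25/2004 | 00:31:19 | Sys Fan 1 | critical
01/25/2004 | 00:31:19 | Sys Fan 3 | critical
01/25/2004 | 00:31:19 | Sys Fan 4 | critical
01/25/2004 | 00:31:19 | Processor 1 Fan | ok
01/25/2004 | 00:31:20 | Processor 2 Fan | ok
"""


def parse_sensor_events(text: str) -> list[tuple[datetime, str, str]]:
    """Parse 'date | time | sensor | status' lines into (timestamp, sensor, status)."""
    events = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        date, time, sensor, status = parts
        stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        events.append((stamp, sensor, status))
    return events


def sensors_with_status(events, status: str) -> list[str]:
    """Names of sensors reporting the given severity level."""
    return [sensor for _, sensor, s in events if s == status]


events = parse_sensor_events(SAMPLE_LOG)
print(sensors_with_status(events, "critical"))
```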
A failure prediction & policy-based recovery Cluster management
• Detection: the damage is already done!
• Prediction
– Trend analysis
– Anticipate imminent failures
– Better handling
– More difficult when correlating multiple events/nodes
• Example of IPMI events and trend analysis
– E.g. CPU temp rising too fast within 5 min -> prepare to checkpoint, failover and restart
– Memory bit error detected -> take the node out
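The two example policies can be sketched as a simple rule check over recent temperature samples. This is a minimal illustration under stated assumptions: the 10 °C threshold, the function names, and the action strings are hypothetical, and "rising too fast" is reduced to the temperature difference across the last 5 minutes.

```python
def temperature_rise(samples, window_s: float = 300.0) -> float:
    """Rise in temperature over the last `window_s` seconds.

    `samples` is a time-ordered list of (t_seconds, temp_celsius) pairs.
    """
    if not samples:
        return 0.0
    t_last, temp_last = samples[-1]
    in_window = [(t, c) for t, c in samples if t_last - t <= window_s]
    return temp_last - in_window[0][1]


def policy_action(samples, memory_bit_error: bool = False, max_rise_c: float = 10.0) -> str:
    """Map observed events to a recovery action (threshold is an assumed value)."""
    if memory_bit_error:
        return "take_node_out"
    if temperature_rise(samples) > max_rise_c:
        return "checkpoint_failover_restart"
    return "none"


heating = [(0, 50.0), (100, 55.0), (200, 62.0), (300, 66.0)]
print(policy_action(heating))
print(policy_action([(0, 50.0), (300, 52.0)]))
print(policy_action([], memory_bit_error=True))
```

Acting on the trend rather than on a hard failure is what leaves time to checkpoint before the node goes down.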
HA-OSCAR monitoring, Fault prediction and recovery Restructure
Cluster Power Management (IPMI)
Reliability-aware Runtime
Reliability-Aware Runtime
• Programming paradigm and scalability impact “Reliability”, especially in HPC environments
• “AND Survivability” analysis, based on:
– All participating nodes (10, 100, 1000) have to survive
– Each node MTTF at 5000 hours
– N=10, MTTF = 492.424242
– N=100, MTTF = 49.9902931
– N=1000, MTTF = 4.99999003
– N=10000, MTTF = ½ hour
• Reliability and availability info enables better job execution (checkpointing, resource management)
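The scaling trend in these figures can be approximated with a standard textbook model. If node failures are independent and exponentially distributed, and the job fails as soon as any node fails, the system MTTF is the node MTTF divided by the number of nodes. That model is an assumption made here for illustration, and the function name is hypothetical. It gives 500, 50, 5 and 0.5 hours, close to but not identical to the slide's values, so the authors' model evidently differs in detail.

```python
def system_mttf_hours(node_mttf_hours: float, n_nodes: int) -> float:
    """System MTTF when all n nodes must survive (independent exponential failures)."""
    # Failure rates add up across nodes: lambda_system = n * lambda_node.
    return node_mttf_hours / n_nodes


NODE_MTTF_HOURS = 5000.0  # per-node MTTF used on the slide

for n in (10, 100, 1000, 10000):
    print(f"N={n:>5}: system MTTF ~ {system_mttf_hours(NODE_MTTF_HOURS, n):g} hours")
```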
MTTF 1000-5000
The more processors, and the faster they are, the higher the failure rate
[Figure: System reliability (MTTF) for k-of-n AND Survivability (k=n), parallel execution model. x-axis: number of participating nodes (10 to 5000); y-axis: total system MTTF (hrs, 0 to 800); series: node MTTF of 1000, 3000, 5000, and 7000 hrs]
e.g. each nodal failure rate 2/year
N=10, MTTF = 492.424242
N=100, MTTF = 49.9902931
N=1000, MTTF = 4.99999003
Reliability-aware Checkpointing
– Consideration of scalability vs. reliability in the runtime
– MTTF vs. application execution time
– HA-OSCAR monitoring -> failure prediction and detection
– System-initiated (transparent) and reliability-aware checkpointing in MPI environments
– Developed smart checkpointing based on the above
– Reduces unnecessary overhead while remaining reliability-aware
– Detailed reports in HAPCW2004 and submitted to IEEE Cluster 2005
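To make the trade-off between MTTF and checkpoint overhead concrete, the sketch below uses Young's first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTTF). This is a well-known rule of thumb swapped in for illustration. It is not the smart checkpointing scheme described in the talk, and the checkpoint cost and job length are assumed values.

```python
import math


def young_interval_hours(checkpoint_cost_hours: float, system_mttf_hours: float) -> float:
    """Young's approximation of the checkpoint interval: sqrt(2 * C * MTTF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * system_mttf_hours)


def needs_checkpointing(job_hours: float, interval_hours: float) -> bool:
    """A job shorter than one interval gains nothing from checkpointing."""
    return job_hours > interval_hours


# Assumed values: a 5-minute checkpoint on a system with a 5-hour MTTF.
interval = young_interval_hours(checkpoint_cost_hours=5 / 60, system_mttf_hours=5.0)
print(f"checkpoint roughly every {interval:.2f} hours")
print(needs_checkpointing(job_hours=24.0, interval_hours=interval))
```

The interval shrinks as system MTTF drops, so larger node counts call for more frequent checkpoints, while short jobs can skip the overhead entirely.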
Federated System Architecture (DOE fastOS)
Summary
• Problems in large-scale computing are similar to those in wireless sensor networks
– Computing node = SN (sensor node)
– Head node = gateway
• Reliability issues are similar
– Depends on applications
• Self-config, self-awareness, self-healing
• Routing algorithm = location-aware