Monitoring Large-scale Cloud Infrastructures with OpenNebula
DESCRIPTION
Efficient monitoring is crucial when managing your Cloud infrastructure. The metrics collected by OpenNebula can be used to trigger automatic scaling, or to quickly detect failures and automatically restart virtual machines. During this talk, I will show how OpenNebula can be used to efficiently monitor thousands of virtual machines at a sub-1 minute interval. I will show how OpenNebula can be enhanced and optimized, and how different metrics collection tools such as Ganglia and Host-sFlow can be used with OpenNebula to monitor large-scale Cloud infrastructures.
TRANSCRIPT
![Page 1: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/1.jpg)
Monitoring Large-scale Cloud Infrastructures with OpenNebula
Simon Boulet
OpenNebula Consultant
Co-founder of the Cloudnorth.com
[email protected]
![Page 2: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/2.jpg)
Goals
1. Show how to configure OpenNebula to achieve a sub-1 minute monitoring interval
2. Demonstrate the use of OpenNebula in large-scale cloud infrastructures
3. Suggest enhancements to OpenNebula performance and monitoring
![Page 3: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/3.jpg)
How Big Exactly is Large-scale?
How many hosts?
1,000? 2,000? 10,000 VMs?
![Page 4: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/4.jpg)
Monitoring in OpenNebula
● Detects when a VM or host changes status (Running, Stopped, etc.)
● Built-in metrics: CPU, memory and network usage
● You can add as many metrics as you like by customizing the driver
● Can be used to perform various tasks (auto scaling, high-availability redeployment, etc.)
![Page 5: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/5.jpg)
Don't Expect the Default Configuration to Perform Optimally
● Database: Use MySQL database backend, not the default SQLite
● Logs: Use Syslog log system, and disable debug logging (debug_level=1)
● Number of threads: Adjust the number of driver threads (see the -t option in your *MAD config options)
![Page 6: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/6.jpg)
Use OpenNebula >= 4.0
Prior versions did monitoring in two phases:
1. The IM Monitor action monitored Hosts
2. The VMM Poll action monitored VMs
(100 Hosts + 1,000 VMs) at a 15-second interval = 1,100 actions × 4 per minute = 4,400 actions per minute
Since OpenNebula 4.0, the IM Monitor action is capable of returning the information of VMs running on the monitored host
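The action count on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name and the `combined` flag are made up for illustration, and the 4.0+ figure assumes that a single IM Monitor action per host per cycle is enough once it also returns VM information.

```python
def actions_per_minute(hosts, vms, interval_seconds, combined):
    """Driver actions per minute for a given monitoring interval.

    combined=False models the two-phase scheme (one IM Monitor action per
    host plus one VMM Poll action per VM). combined=True models the
    assumption that one host action also covers the VMs on that host.
    Assumes the interval divides 60 evenly.
    """
    cycles_per_minute = 60 // interval_seconds
    actions_per_cycle = hosts if combined else hosts + vms
    return actions_per_cycle * cycles_per_minute


print(actions_per_minute(100, 1000, 15, combined=False))  # 4400
print(actions_per_minute(100, 1000, 15, combined=True))   # 400
```

Under that assumption, the number of actions scales with the number of hosts rather than the number of VMs.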
![Page 7: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/7.jpg)
Monitoring History
By default OpenNebula keeps 24h of monitoring history
15-second interval × 24h = 5,760 records per VM
Average record size: 4KB
23MB of monitoring history per VM
100 VMs = 2.3GB
10,000 VMs = 230GB
Controlled by the HOST_MONITORING_EXPIRATION_TIME and VM_MONITORING_EXPIRATION_TIME config options
![Page 8: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/8.jpg)
Monitoring History (continued)
● Reduce history to 30 minutes (1800 seconds)
● Use MySQL MEMORY storage engine for vm_monitoring and host_monitoring tables
It's OK to lose monitoring history when MySQL is restarted
Most recent monitoring values are stored in the VM template
Set MySQL max_heap_table_size large enough to hold all your monitoring history
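To get a rough idea of how large max_heap_table_size needs to be, the sizing arithmetic from the Monitoring History slides can be applied to both retention windows. The sketch below uses the 4KB average record size quoted on the slide; the function names are illustrative, and the 30-minute figures are derived here rather than taken from the talk.

```python
RECORD_SIZE_KB = 4  # average record size quoted on the slide


def history_records(interval_seconds, retention_seconds):
    """Number of monitoring records kept per VM."""
    return retention_seconds // interval_seconds


def history_size_mb(vms, interval_seconds, retention_seconds):
    """Approximate history size in decimal MB for all VMs."""
    records = history_records(interval_seconds, retention_seconds)
    return vms * records * RECORD_SIZE_KB / 1000


DAY = 24 * 60 * 60
HALF_HOUR = 30 * 60

print(history_records(15, DAY))                  # 5760
print(history_size_mb(1, 15, DAY))               # 23.04
print(history_size_mb(10_000, 15, DAY))          # 230400.0
print(history_records(15, HALF_HOUR))            # 120
print(history_size_mb(10_000, 15, HALF_HOUR))    # 4800.0
```

Treat the result as a rough lower bound: MySQL MEMORY tables store rows in a fixed-length format, so the real footprint depends on the declared column widths rather than on the average record size.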
![Page 9: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/9.jpg)
Watch your Load Average
As of 4.2, the maximum number of simultaneous XML-RPC API connections is limited to 15
Overloaded OpenNebula = Slow XML-RPC API response = API Limit / Timeout
● Reduce load at deployment time by adjusting the number of VMs simultaneously deployed by the scheduler
● Watch next release (4.4) for XML-RPC API concurrency enhancements
![Page 10: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/10.jpg)
Local Caching Nameserver
OpenNebula uses DNS names to monitor hosts (unless you named your hosts using their IP address instead of a name)
● Use a local caching nameserver to speed up DNS lookup (such as dnsmasq).
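One way to check whether name resolution is worth caching is to time a lookup for each host name before and after putting a caching nameserver in place. The helper below is a minimal sketch using only the Python standard library; the function name and the `resolve` parameter are illustrative, and the host list is a placeholder you would replace with your own host names.

```python
import socket
import time


def time_lookups(hostnames, resolve=socket.gethostbyname):
    """Return {hostname: seconds} for one lookup of each name.

    A failed lookup is recorded as None instead of raising.
    """
    timings = {}
    for name in hostnames:
        start = time.perf_counter()
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            timings[name] = None
            continue
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Placeholder host list: substitute the names of your monitored hosts.
    for name, seconds in time_lookups(["localhost"]).items():
        print(name, seconds)
```

Running it twice in a row gives a feel for the difference between an uncached and a cached lookup on the machine that runs OpenNebula.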
![Page 11: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/11.jpg)
Beware of SSH Transport
Most OpenNebula drivers (KVM, Xen, etc.) use SSH connections to perform actions
OK for deploying new VMs, but expensive for VM monitoring
![Page 12: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/12.jpg)
Meet Ganglia
<< Ganglia is a scalable distributed system monitor tool for high-performance computing systems such as clusters and grids. >> - Wikipedia
OpenNebula has built-in support for Ganglia
By default Ganglia and OpenNebula must run on the same machine
Set GANGLIA_HOST in /var/lib/one/remotes/im/ganglia.d/ganglia_probe and /var/lib/one/remotes/vmm/kvm/poll_ganglia
![Page 13: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/13.jpg)
Meet Ganglia (continued)
![Page 14: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/14.jpg)
Ganglia Driver Limitations
1. Currently only 1 Ganglia Collector is supported
2. Need to run a script on each host to export the OpenNebula-specific metric (OPENNEBULA_VMS_INFORMATION)
3. Ganglia has a maximum length of 1392 bytes for string metrics
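The 1392-byte limit matters because all VM information for a host has to fit into one string metric. The sketch below checks a payload against that limit and, as a purely hypothetical workaround, splits an oversized payload into several chunks. The function names are illustrative, and the chunking is not something the OpenNebula Ganglia driver is known to do.

```python
GANGLIA_MAX_STRING_BYTES = 1392  # limit quoted on the slide


def fits_in_string_metric(value):
    """True if the UTF-8 encoded value fits in one Ganglia string metric."""
    return len(value.encode("utf-8")) <= GANGLIA_MAX_STRING_BYTES


def split_for_ganglia(value, limit=GANGLIA_MAX_STRING_BYTES):
    """Split the encoded value into byte chunks of at most `limit` bytes.

    Hypothetical workaround: each chunk would have to be exported as its
    own metric and reassembled by the consumer.
    """
    data = value.encode("utf-8")
    return [data[i:i + limit] for i in range(0, len(data), limit)]


payload = "x" * 3000  # stand-in for an encoded per-host VM report
print(fits_in_string_metric(payload))                 # False
print([len(c) for c in split_for_ganglia(payload)])   # [1392, 1392, 216]
```

With roughly 40 VMs per host, only a few dozen bytes per VM are available in a single metric, which is why the limit shows up as a driver limitation.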
![Page 15: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/15.jpg)
Host sFlow
<< The Host sFlow agent exports physical and virtual server performance metrics using the sFlow protocol. The agent provides scalable, multi-vendor, multi-OS performance monitoring with minimal impact on the systems being monitored. >> - http://host-sflow.sourceforge.net/
Exports a standard set of hypervisor and VM metrics
Official support for Xen, KVM and Hyper-V, but uses Libvirt to gather metrics (and Libvirt supports LXC, OpenVZ, VMware, etc.)
![Page 16: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/16.jpg)
Host sFlow (continued)
Source: http://blog.sflow.com/2012/02/ganglia-33-released.html
![Page 17: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/17.jpg)
Host sFlow (continued)
Sample Metrics
Hosts Metrics
● vnode_mem_total: Hypervisor Total Memory
● vnode_domains: Hypervisor VM Count
VMs Metrics
● <VM ID>.vcpu_state: VM State (Running, Stopped, etc.)
● <VM ID>.vmem_util: VM Memory Utilization
● <VM ID>.vdisk_free: VM Free Disk Space
Not currently supported in OpenNebula. Contact me if you're interested.
![Page 18: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/18.jpg)
4,000 VMs at Sub-1 Minute Interval
OpenNebula 4.2 + xml-rpc patch (upcoming in 4.4)
Experimental Host sFlow Driver
1 OpenNebula Core (EC2 High-CPU XLarge instance)
1 Sunstone Web Server (EC2 Standard Medium instance)
1 Ganglia Collector (EC2 Standard Medium instance)
100 Hosts (EC2 High-CPU Medium instances)
~40 VMs per Host
~4,000 VMs (OpenVZ)
15 - 60 second monitoring interval
![Page 19: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/19.jpg)
4,000 VMs at Sub-1 Minute Interval
![Page 20: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/20.jpg)
4,000 VMs at Sub-1 Minute Interval
![Page 21: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/21.jpg)
4,000 VMs at Sub-1 Minute Interval
![Page 22: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/22.jpg)
Looking Forward
There’s room for optimizations
● The command line tools can get very slow when returning very large result sets (but not the API…)
● Distributed driver, for example using ZeroMQ for distributing tasks to multiple workers
● Investigate PoolSQL locks being held for long periods and blocking other threads (discussed in bug #1818)
● Gather metrics about OpenNebula internals: lock wait times, effective monitoring interval, memory footprint, etc.
● Investigate very large Sunstone memory usage
![Page 23: Monitoring Large-scale Cloud Infrastructures with OpenNebula](https://reader033.vdocument.in/reader033/viewer/2022042700/554be436b4c9056b348b48a2/html5/thumbnails/23.jpg)
Thank you!
Questions?
“OpenNebula captured my interest for several technical reasons besides the fact that it is truly open. It's architecture is very elegant; it has C++ bones, ruby muscles and bash tendons. It's extensible and understandable. It has no peer as far as I can tell.”
Christopher Barry, Infrastructure Engineer, RJMetrics, September 2012
http://opennebula.org/users:testimonials