D. Koop, CIS 602-01, Fall 2016
CIS 602-01: Computational Reproducibility
Virtual Machines and the Cloud
Dr. David Koop
Virtual Machines
• Software Abstraction
  - Behaves like hardware
  - Encapsulates all OS and application state
• Virtualization Layer
  - Extra level of indirection
  - Decouples hardware, OS
  - Enforces isolation
  - Multiplexes physical hardware across VMs
[via E. de Lara]
Virtualization Properties
• Isolation
  - Fault isolation
  - Performance isolation
• Encapsulation
  - Cleanly capture all VM state
  - Enables VM snapshots, clones
• Portability
  - Independent of physical hardware
  - Enables migration of live, running VMs
• Interposition
  - Transformations on instructions, memory, I/O
  - Enables transparent resource overcommitment, encryption, compression, replication…
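The encapsulation property can be illustrated with a toy model. This is only a sketch: `ToyVM` and its fields are hypothetical names, and a real hypervisor captures CPU, memory, and device state rather than a Python object. The point it shows is that once all state lives in one place, a snapshot or clone is just a complete copy.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a VM: all state lives in one object."""
    registers: dict = field(default_factory=dict)
    memory: bytearray = field(default_factory=lambda: bytearray(16))
    disk: dict = field(default_factory=dict)

    def snapshot(self) -> "ToyVM":
        # Encapsulation: a deep copy captures the complete state.
        return copy.deepcopy(self)


vm = ToyVM()
vm.registers["pc"] = 0x100
vm.disk["/results.csv"] = "run,score\n1,0.93\n"

saved = vm.snapshot()         # point-in-time snapshot
clone = saved.snapshot()      # independent clone made from the snapshot

vm.memory[0] = 0xFF           # the original keeps running and changes
vm.disk["/results.csv"] = ""  # ...and even loses its results

# The snapshot and the clone are unaffected by later changes to the original.
snapshot_intact = saved.disk["/results.csv"].startswith("run,score")
clone_independent = clone.memory[0] == 0 and clone is not saved
```

For reproducibility, the snapshot plays the role of the archived environment: it still holds the results file even after the running machine has changed.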
Process vs. System Virtualization
• Process Virtualization
  - Language-level: Java, .NET, Smalltalk
  - OS-level: processes, Solaris Zones, BSD Jails, Docker Containers
  - Cross-ISA emulation: Apple 68K-PPC-x86
• System Virtualization
  - VMware Workstation, Microsoft VPC, Parallels
  - VMware ESX, Xen, Microsoft Hyper-V
[via E. de Lara]
Types of Virtualization
• Native/Bare metal (Type 1)
  - Higher performance
  - ESX, Xen, Hyper-V
• Hosted (Type 2)
  - Easier to install
  - Leverage host’s device drivers
  - VMware Workstation, Parallels
[http://itechthoughts.wordpress.com/tag/full-virtualization/ via E. de Lara]
Types of Virtualization
• Full virtualization
  - Unmodified OS, virtualization is transparent to OS
• Paravirtualization
  - OS modified to be virtualized
[http://forums.techarena.in/guides-tutorials/1104460.htm via E. de Lara]
What is a Virtual Machine Monitor?
• Classic Definition (Popek and Goldberg ’74)
• VMM Properties
  - Equivalent execution: Programs running in the virtualized environment run identically to running natively.
  - Performance: A statistically dominant subset of the instructions must be executed directly on the CPU.
  - Safety and isolation: A VMM must control system resources.
[via E. de Lara]
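The three VMM properties are commonly realized by a trap-and-emulate loop, which can be sketched in a few lines. This is a toy model under stated assumptions: the instruction names, `cpu_execute`, and `ToyVMM` are all hypothetical, and a real VMM works on machine instructions and hardware traps rather than Python tuples and exceptions.

```python
class PrivilegedInstruction(Exception):
    """Raised when deprivileged guest code attempts a privileged operation."""


PRIVILEGED = {"out", "set_page_table"}  # hypothetical instruction names


def cpu_execute(instr, operand, state):
    """Toy 'hardware': runs unprivileged instructions directly, traps on the rest."""
    if instr in PRIVILEGED:
        raise PrivilegedInstruction(instr)
    if instr == "add":
        state["acc"] += operand
    elif instr == "mul":
        state["acc"] *= operand
    else:
        raise ValueError(f"unknown instruction: {instr}")


class ToyVMM:
    def __init__(self):
        self.guest_state = {"acc": 0}
        self.virtual_page_table = None  # the VMM, not the guest, owns resources
        self.virtual_console = []
        self.direct = 0                 # instructions run directly on the 'CPU'
        self.emulated = 0               # instructions trapped and emulated

    def run(self, program):
        for instr, operand in program:
            try:
                cpu_execute(instr, operand, self.guest_state)
                self.direct += 1
            except PrivilegedInstruction:
                self.emulate(instr, operand)
                self.emulated += 1

    def emulate(self, instr, operand):
        # Safety and isolation: privileged effects land on virtual resources only.
        if instr == "out":
            self.virtual_console.append(operand)
        elif instr == "set_page_table":
            self.virtual_page_table = operand


vmm = ToyVMM()
vmm.run([
    ("add", 2),
    ("mul", 21),
    ("set_page_table", 0x1000),
    ("out", "done"),
    ("add", 0),
])
```

In this run three of five instructions execute directly and two trap to the VMM, which is the shape the performance property asks for: the common case runs on the CPU, and only the sensitive operations pay for emulation.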
What Needs to be Virtualized?
• Processor
• Memory
• I/O
[Figure: Guest OS + Applications (unprivileged) running above the Virtual Machine Monitor (privileged). Labels: Page Fault, Undef Instr, vIRQ; MMU Emulation, CPU Emulation, I/O Emulation.]
[via E. de Lara]
Hypervisor
[via E. de Lara]
Xen Design Principles
1. Support for unmodified application binaries is essential, or users will not transition to Xen. Hence we must virtualize all architectural features required by existing standard ABIs.
2. Supporting full multi-application operating systems is important, as this allows complex server configurations to be virtualized within a single guest OS instance.
3. Paravirtualization is necessary to obtain high performance and strong resource isolation on uncooperative machine architectures such as x86.
4. Even on cooperative machine architectures, completely hiding the effects of resource virtualization from guest OSes risks both correctness and performance.
[Barham et al., 2003]
Xen Architecture
[Xen Project Software Overview]
Virtualization Properties and Reproducibility?
• Isolation
  - Fault isolation
  - Performance isolation
• Encapsulation
  - Cleanly capture all VM state
  - Enables VM snapshots, clones
• Portability
  - Independent of physical hardware
  - Enables migration of live, running VMs
• Interposition
  - Transformations on instructions, memory, I/O
  - Enables transparent resource overcommitment, encryption, compression, replication…
Project
• Find some papers that you may be interested in reproducing
• Do a survey of the material that is available for each paper:
  - Code?
    • Is the code under version control?
  - Data?
    • Is it clear how to process or understand the data?
    • Is there metadata?
  - Virtual machine or container?
    • Does the hardware/software that deals with these still work?
  - Provenance?
    • Do we have a record of the steps taken in producing a result?
    • How complete is it?
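One possible way to record the survey for each paper is a small checklist record. This is a sketch only: `PaperSurvey` and its field names are hypothetical and simply mirror the questions on the slide, and the paper title in the example is made up.

```python
from dataclasses import dataclass, fields


@dataclass
class PaperSurvey:
    """Hypothetical record of what material is available for one paper."""
    title: str
    has_code: bool = False
    code_under_version_control: bool = False
    has_data: bool = False
    data_documented: bool = False
    has_metadata: bool = False
    has_vm_or_container: bool = False
    vm_tooling_still_works: bool = False
    has_provenance: bool = False
    provenance_notes: str = ""  # free text: how complete is the record?

    def available(self):
        """Names of the checklist items that were found."""
        return [f.name for f in fields(self) if getattr(self, f.name) is True]

    def missing(self):
        """Names of the checklist items that were not found."""
        return [f.name for f in fields(self) if getattr(self, f.name) is False]


survey = PaperSurvey(
    title="Example paper (hypothetical)",
    has_code=True,
    code_under_version_control=True,
    has_data=True,
)
found = survey.available()
gaps = survey.missing()
```

Keeping one record per candidate paper makes it easy to compare papers by what is missing before choosing which one to reproduce.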
Project
• If you are interested in a topic that aligns with reproducibility, please email me/talk to me about your ideas
• For example, if you are working on a research project that could incorporate reproducibility
• Formal Specification and Initial Deadline Soon
Midterm
• http://www.cis.umassd.edu/~dkoop/cis602/midterm.html
• Tuesday, October 25, Dion 101, 3:30-4:45pm
• Format: Multiple Choice + Free Response
• Focus on the topics we have covered and the associated papers
  - Reproducibility Themes
  - Scientific Writing
  - Version Control
  - Data Sharing, Citation, Repositories
  - Virtual Machines
• May include material not in the assigned papers but discussed in class (e.g. the Computer Systems Reproducibility work)
• See sample questions on the course web site
Virtual Machine Uses
• Software Testing: Test multiple configurations on one computer
• Migration: if a server fails, move the virtual machine elsewhere
• Cross-environment work: Windows on Linux
• Enterprise support: upgrade via image
• Education: concentrate on math/programming rather than install
• Custom prototypes: try-before-you-buy
Stein Quote on Genome Informatics
• "Cloud computing ... creates a new niche in the ecosystem for genome software developers to package their work in the form of virtual machines. For example, many genome annotation groups have developed pipelines for identifying and classifying genes and other functional elements. Although many of these pipelines are open source, packaging and distributing them for use by other groups has been challenging given their many software dependencies and site-specific configuration options. In a cloud computing environment these pipelines can be packaged into virtual machine images and stored in a way that lets anyone copy them, run them and customize them for their own needs, thus avoiding the software installation and configuration complexities."
  - L. Stein
[Figure 1.1, from "Reproducibility, Virtual Appliances, and Cloud Computing", p. 5: two panels place four approaches (virtual machines, controlled environments, raw code and data, extensive documentation) by the effort required by the experimenter (low to high) against the effort required by those who only reproduce the experiments, and by those who reuse and extend the results (low to high).]
FIGURE 1.1 These four approaches to disseminating science software vary in the effort required by the original experimenter, those who wish to directly reproduce the results, and those who wish to reuse and extend the software for other purposes. Virtual machines (VMs) incur very little overhead for the original experimenter and support direct reproducibility, but are not sufficient for long-term extensibility. For extensibility, complete documentation is generally required, though some scientific workflow systems and other controlled environments offer a possible solution.
extend the software and adapt it for their own projects; there is no “shortcut” for software reuse. For the extenders, this documentation significantly reduces the effort required. Controlled environments also require some up-front effort, but can significantly reduce the effort required by both reproducers and extenders. Finally, virtual machines impose very little overhead on the experimenter, and direct reproducibility of results is straightforward, but an undocumented VM with all software pre-installed does very little to support long-term reusability and extensibility, perhaps offering only a small improvement over providing the raw code and data. These approaches are not mutually exclusive; releasing a VM demonstrating particular results along with complete documentation for the requisite software is an appropriate strategy [8].
1.3.1 Other Uses of Virtual Machines
Besides reproducibility, the creation and exchange of virtual machines has other benefits for scientific knowledge sharing. First, VMs are also increasingly used to distribute software for educational purposes. Sorin Mitran at the University of Washington uses virtual machines in classes ranging from non-technical first-year seminars to graduate classes to package the software environment for teaching scientific computing.[1] He finds that “using VMs allows a class to concentrate on the math and programming as opposed to installing all the utilities that come together to solve a problem.” Second, VMs can be used to deliver custom prototypes and proofs of concept. Paradigm 4, the company that develops and distributes the SciDB database engine [33], routinely uses Amazon Machine Images for customer projects. They develop a prototype on behalf of a customer and deliver it as an AMI, allowing the customer to reproduce results by running the scripts developed by P4. This approach provides a “try before you buy” mechanism that allows customers to experiment with the system without investing IT resources to install the software locally.
[1] http://mitran.web.unc.edu/teaching/
Approaches to disseminating software
Improving Reproducibility
• Capturing more variables
• Fewer constraints on research methods
• On-Demand Backups
• Virtual Machines as Citable Publications
• Code, Data, Environment + Resources
• Automatic Upgrades
• Competitive, Elastic Pricing
• Reproducibility for Complex Architectures
• Unfettered Collaborative Experiments
• Data-intensive Computing
• Cost Sharing
• A Foundation for Single-Payer Funding
• Compatibility with Other Approaches
Non-challenges
• Security
• Licensing
• Vendor Lock-In and Long-Term Preservation
Virtual Appliances, Cloud Computing, and Reproducible Research
B. Howe
Virtual machines considered harmful for reproducibility
• Titus Brown: http://ivory.idyll.org/blog/vms-considered-harmful.html
• "In essence, providing a gigantic black box of custom installed code that was installed, set up, and executed by experts just isn't very useful to many people."
• "[R]eleasing shoddy VMs is easy to do, but it doesn't help you learn how to do a better job of reproducibility along the way. Releasing software pipelines, however crappy, is on the path towards better reproducibility."
• "[T]he distinction between a user and a maker. A user merely wants to take your software and run with it; a maker wants to probe, remix, and mash up your software. To maximize the benefit of our scientific software, we should be enabling makers, not users."