D. Koop, CIS 602-01, Fall 2016

CIS 602-01: Computational Reproducibility

Virtual Machines and the Cloud

Dr. David Koop

Virtual Machines

• Software Abstraction
  - Behaves like hardware
  - Encapsulates all OS and application state
• Virtualization Layer
  - Extra level of indirection
  - Decouples hardware, OS
  - Enforces isolation
  - Multiplexes physical hardware across VMs

[via E. de Lara]

Virtualization Properties

• Isolation
  - Fault isolation
  - Performance isolation
• Encapsulation
  - Cleanly capture all VM state
  - Enables VM snapshots, clones
• Portability
  - Independent of physical hardware
  - Enables migration of live, running VMs
• Interposition
  - Transformations on instructions, memory, I/O
  - Enables transparent resource overcommitment, encryption, compression, replication…
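To make the encapsulation property concrete, here is a minimal Python sketch of snapshots and clones. It is a toy model only: `ToyVM` and its `memory` and `disk` fields are hypothetical names invented for illustration, and a real hypervisor captures far more state (CPU registers, device state, and so on).

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a VM: all of its state lives in one object."""
    name: str
    memory: dict = field(default_factory=dict)
    disk: dict = field(default_factory=dict)

    def snapshot(self) -> dict:
        # Encapsulation: the whole state can be captured as one value.
        return copy.deepcopy({"memory": self.memory, "disk": self.disk})

    def restore(self, snap: dict) -> None:
        # Roll back to a previously captured state.
        self.memory = copy.deepcopy(snap["memory"])
        self.disk = copy.deepcopy(snap["disk"])

    def clone(self, name: str) -> "ToyVM":
        # A clone starts from the same state but evolves independently.
        snap = self.snapshot()
        return ToyVM(name, snap["memory"], snap["disk"])


vm = ToyVM("analysis")
vm.disk["results.csv"] = "run 1"
snap = vm.snapshot()

vm.disk["results.csv"] = "run 2 (overwritten)"
twin = vm.clone("analysis-copy")
twin.memory["x"] = 42           # does not affect vm

vm.restore(snap)
print(vm.disk["results.csv"])   # run 1
print("x" in vm.memory)         # False
```

Because the state is captured as a single value, restoring the snapshot brings back the "run 1" results even after they were overwritten, which is the behavior that makes snapshots attractive for reproducing an analysis.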

Process vs. System Virtualization

• Process Virtualization
  - Language-level: Java, .NET, Smalltalk
  - OS-level: processes, Solaris Zones, BSD Jails, Docker Containers
  - Cross-ISA emulation: Apple 68K-PPC-x86
• System Virtualization
  - VMware Workstation, Microsoft VPC, Parallels
  - VMware ESX, Xen, Microsoft Hyper-V

[via E. de Lara]

Types of Virtualization

• Native/Bare metal (Type 1)
  - Higher performance
  - ESX, Xen, Hyper-V
• Hosted (Type 2)
  - Easier to install
  - Leverage host’s device drivers
  - VMware Workstation, Parallels

[http://itechthoughts.wordpress.com/tag/full-virtualization/ via E. de Lara]

Types of Virtualization

• Full virtualization
  - Unmodified OS, virtualization is transparent to OS
• Paravirtualization
  - OS modified to be virtualized

[http://forums.techarena.in/guides-tutorials/1104460.htm via E. de Lara]

What is a Virtual Machine Monitor?

• Classic Definition (Popek and Goldberg ’74)
• VMM Properties
  - Equivalent execution: Programs running in the virtualized environment run identically to running natively.
  - Performance: A statistically dominant subset of the instructions must be executed directly on the CPU.
  - Safety and isolation: A VMM must control system resources.

[via E. de Lara]

What Needs to be Virtualized?

• Processor
• Memory
• I/O

[Figure: Guest OS + Applications (unprivileged) above the Virtual Machine Monitor (privileged). Labels: Page Fault, Undef Instr, vIRQ; MMU Emulation, CPU Emulation, I/O Emulation]

[via E. de Lara]
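The split between direct execution and emulation can be sketched with a toy trap-and-emulate loop in Python. Everything here is an assumption made for illustration: the three-instruction set (`ADD`, `OUT`, `SETPT`), the `ToyVMM` class, and the `run_guest` function are hypothetical and do not correspond to any real hypervisor's interface.

```python
from dataclasses import dataclass, field

# Hypothetical instruction set for this sketch: ADD is unprivileged,
# OUT (device I/O) and SETPT (change page table) are privileged.
PRIVILEGED = {"OUT", "SETPT"}


@dataclass
class ToyVMM:
    """Owns the virtual hardware state; the guest never touches it directly."""
    virtual_page_table: int = 0
    virtual_device_log: list = field(default_factory=list)
    traps: int = 0

    def emulate(self, op: str, arg: int) -> None:
        # Safety: every privileged operation goes through the monitor.
        self.traps += 1
        if op == "OUT":
            self.virtual_device_log.append(arg)   # I/O emulation
        elif op == "SETPT":
            self.virtual_page_table = arg         # MMU emulation


def run_guest(program, vmm: ToyVMM):
    """Run (op, arg) pairs; returns (accumulator, directly executed count)."""
    acc, direct = 0, 0
    for op, arg in program:
        if op in PRIVILEGED:
            vmm.emulate(op, arg)      # trap into the VMM
        elif op == "ADD":
            acc += arg                # runs "directly", no VMM involvement
            direct += 1
        else:
            raise ValueError(f"unknown instruction: {op}")
    return acc, direct


program = [("ADD", 1), ("ADD", 2), ("SETPT", 7), ("ADD", 3), ("OUT", 99)]
vmm = ToyVMM()
acc, direct = run_guest(program, vmm)
print(acc, direct, vmm.traps)   # 6 3 2
```

In this run three of the five instructions execute without involving the monitor, while the two privileged ones are intercepted and applied to virtual state, mirroring the performance and safety properties listed above.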

Hypervisor

[via E. de Lara]

Xen Design Principles

1. Support for unmodified application binaries is essential, or users will not transition to Xen. Hence we must virtualize all architectural features required by existing standard ABIs.
2. Supporting full multi-application operating systems is important, as this allows complex server configurations to be virtualized within a single guest OS instance.
3. Paravirtualization is necessary to obtain high performance and strong resource isolation on uncooperative machine architectures such as x86.
4. Even on cooperative machine architectures, completely hiding the effects of resource virtualization from guest OSes risks both correctness and performance.

[Barham et al., 2003]

Xen Architecture

[Xen Project Software Overview]

Virtualization Properties and Reproducibility?

• Isolation
  - Fault isolation
  - Performance isolation
• Encapsulation
  - Cleanly capture all VM state
  - Enables VM snapshots, clones
• Portability
  - Independent of physical hardware
  - Enables migration of live, running VMs
• Interposition
  - Transformations on instructions, memory, I/O
  - Enables transparent resource overcommitment, encryption, compression, replication…

Project

• Find some papers that you may be interested in reproducing
• Do a survey of the material that is available for each paper:
  - Code?
    • Is the code under version control?
  - Data?
    • Is it clear how to process or understand the data?
    • Is there metadata?
  - Virtual machine or container?
    • Does the hardware/software that deals with these still work?
  - Provenance?
    • Do we have a record of the steps taken in producing a result?
    • How complete is it?
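One possible way to record the survey is a small checklist structure per paper. The Python sketch below is only a suggestion: the `PaperSurvey` class and its field names are hypothetical, chosen to mirror the questions above rather than any required format.

```python
from dataclasses import dataclass


@dataclass
class PaperSurvey:
    """One row of the survey; field names are illustrative, not prescribed."""
    title: str
    has_code: bool = False
    code_under_version_control: bool = False
    has_data: bool = False
    data_has_metadata: bool = False
    has_vm_or_container: bool = False
    vm_still_runs: bool = False
    has_provenance: bool = False

    def missing(self) -> list:
        # List the checklist items that are not satisfied.
        checks = {k: v for k, v in vars(self).items() if isinstance(v, bool)}
        return [name for name, ok in checks.items() if not ok]


survey = PaperSurvey("Example paper (hypothetical)", has_code=True,
                     code_under_version_control=True, has_data=True)
print(survey.missing())
```

Free-text answers such as "How complete is it?" do not fit a yes/no field, so a notes column alongside this checklist would still be needed.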

Project

• If you are interested in a topic that aligns with reproducibility, please email me/talk to me about your ideas
• For example, if you are working on a research project that could incorporate reproducibility
• Formal Specification and Initial Deadline Soon

Midterm

• http://www.cis.umassd.edu/~dkoop/cis602/midterm.html
• Tuesday, October 25, Dion 101, 3:30-4:45pm
• Format: Multiple Choice + Free Response
• Focus on the topics we have covered and the associated papers
  - Reproducibility Themes
  - Scientific Writing
  - Version Control
  - Data Sharing, Citation, Repositories
  - Virtual Machines
• May include material not in the assigned papers but discussed in class (e.g. the Computer Systems Reproducibility work)
• See sample questions on the course web site

Reproducibility, Virtual Appliances, and Cloud Computing

B. Howe

Virtual Machine Uses

• Software Testing: Test multiple configurations on one computer
• Migration: if a server fails, move the virtual machine elsewhere
• Cross-environment work: Windows on Linux
• Enterprise support: upgrade via image
• Education: concentrate on math/programming rather than install
• Custom prototypes: try-before-you-buy

Stein Quote on Genome Informatics

• "Cloud computing ... creates a new niche in the ecosystem for genome software developers to package their work in the form of virtual machines. For example, many genome annotation groups have developed pipelines for identifying and classifying genes and other functional elements. Although many of these pipelines are open source, packaging and distributing them for use by other groups has been challenging given their many software dependencies and site-specific configuration options. In a cloud computing environment these pipelines can be packaged into virtual machine images and stored in a way that lets anyone copy them, run them and customize them for their own needs, thus avoiding the software installation and configuration complexities."
  -L. Stein

Approaches to disseminating software

[Excerpt: Reproducibility, Virtual Appliances, and Cloud Computing, p. 5]

[Figure 1.1: two panels placing four approaches (virtual machines, controlled environments, raw code and data, extensive documentation) by effort required by the experimenter (low to high) against effort required by those who only reproduce the experiments (low to high) and effort required by those who reuse and extend the results (low to high).]

FIGURE 1.1 These four approaches to disseminating science software vary in the effort required by the original experimenter, those who wish to directly reproduce the results, and those who wish to reuse and extend the software for other purposes. Virtual machines (VMs) incur very little overhead for the original experimenter and support direct reproducibility, but are not sufficient for long-term extensibility. For extensibility, complete documentation is generally required, though some scientific workflow systems and other controlled environments offer a possible solution.

extend the software and adapt it for their own projects; there is no “shortcut” for software reuse. For the extenders, this documentation significantly reduces the effort required. Controlled environments also require some up-front effort, but can significantly reduce the effort required by both reproducers and extenders. Finally, virtual machines impose very little overhead on the experimenter, and direct reproducibility of results is straightforward, but an undocumented VM with all software pre-installed does very little to support long-term reusability and extensibility, perhaps offering only a small improvement over providing the raw code and data. These approaches are not mutually exclusive; releasing a VM demonstrating particular results along with complete documentation for the requisite software is an appropriate strategy [8].

1.3.1 Other Uses of Virtual Machines

Besides reproducibility, the creation and exchange of virtual machines has other benefits for scientific knowledge sharing. First, VMs are also increasingly used to distribute software for educational purposes. Sorin Mitran at the University of Washington uses virtual machines in classes ranging from non-technical first-year seminars to graduate classes to package the software environment for teaching scientific computing.¹ He finds that “using VMs allows a class to concentrate on the math and programming as opposed to installing all the utilities that come together to solve a problem.” Second, VMs can be used to deliver custom prototypes and proofs of concept. Paradigm 4, the company that develops and distributes the SciDB database engine [33], routinely uses Amazon Machine Images for customer projects. They develop a prototype on behalf of a customer and deliver it as an AMI, allowing the customer to reproduce results by running the scripts developed by P4. This approach provides a “try before you buy” mechanism that allows customers to experiment with the system without investing IT resources to install the software locally.

¹ http://mitran.web.unc.edu/teaching/

Improving Reproducibility

• Capturing more variables
• Fewer constraints on research methods
• On-Demand Backups
• Virtual Machines as Citable Publications
• Code, Data, Environment + Resources
• Automatic Upgrades
• Competitive, Elastic Pricing
• Reproducibility for Complex Architectures
• Unfettered Collaborative Experiments
• Data-intensive Computing
• Cost Sharing
• A Foundation for Single-Payer Funding
• Compatibility with Other Approaches

Remaining Challenges

• Cost
• Culture
• Provenance
• Reuse

Non-challenges

• Security
• Licensing
• Vendor Lock-In and Long-Term Preservation

Virtual Appliances, Cloud Computing, and Reproducible Research

B. Howe

Virtual machines considered harmful for reproducibility

• Titus Brown: http://ivory.idyll.org/blog/vms-considered-harmful.html
• "In essence, providing a gigantic black box of custom installed code that was installed, set up, and executed by experts just isn't very useful to many people."
• "[R]eleasing shoddy VMs is easy to do, but it doesn't help you learn how to do a better job of reproducibility along the way. Releasing software pipelines, however crappy, is on the path towards better reproducibility."
• "[T]he distinction between a user and a maker. A user merely wants to take your software and run with it; a maker wants to probe, remix, and mash up your software. To maximize the benefit of our scientific software, we should be enabling makers, not users."

Greg Wilson's Tweet

• "VM images are just PDFs you can run. If we use 'em for reproducibility today, ppl will have to hack tools to mine their content tomorrow."