Cloud computing
Hugh Shanahan, Department of Computer Science, Royal Holloway, University of London
CCC 2011, Huazhong Agricultural University, Wuhan
20 Sep 2011
The amount of biological data is exploding
• The raw data for a human genome corresponds to hundreds of Gbytes.
• Cost of sequencing a human genome has fallen from $100 M to ~$3000 (July 2011)
• Main bottleneck is now reconstructing the genome from the data generated.
• Much of the original Central Dogma (DNA to RNA to protein) is now seen as a simplification
• RNA now seen to play a fundamental role
• miRNA
• How DNA is stored is also crucial
Greater exploration
• Thousands of genomes being scanned
• Different species
• Cancer genomics
• Methylome
• RNA-seq
• Have not got time to talk about metabolomics/proteomics ...
Caveat to sequence data
• There are many different companies building equipment to perform sequencing.
• They all have their own biases and sources of systematic error.
• The data generated is discrete in nature, which tends to make people think it is accurate.
• Sequence data could be as susceptible to systematic biases as microarray data are.
• Interpretation and analysis is the real bottleneck.
The era of Big Data
• Biological data - Petabytes now, expected to reach an Exabyte (a million Terabytes) by 2020.
• High Energy Physics - Large Hadron Collider producing Petabytes of data per year
• Square Kilometre Array (full operation 2024) - one Exabyte a day
• Haven’t even mentioned Google or Bing yet....
Problems - Solutions - Cloud Computing ?
• Data sets this size cannot be moved about on the Internet.
• Data must be analysed, not just retrieved.
• Many people want access to this data, many of whom
• are not computational scientists
• may not have the financial resources to buy powerful computers
• may want access to the best software, best practices etc.
• need the data to be updated in a timely fashion
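A back-of-the-envelope sketch of why data sets of this size cannot simply be moved about on the Internet. The link speed of 1 Gbit/s is an assumed figure chosen for illustration, not one taken from the talk, and the function name is hypothetical.

```python
# Rough transfer-time estimate. The link speed is an assumption for illustration.
BYTES_PER_TERABYTE = 10**12
BYTES_PER_PETABYTE = 10**15
BYTES_PER_EXABYTE = 10**18  # a million Terabytes

SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to move size_bytes over a link running flat out."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


ASSUMED_LINK = 10**9  # hypothetical sustained rate of 1 Gbit/s

petabyte_days = transfer_days(BYTES_PER_PETABYTE, ASSUMED_LINK)
exabyte_years = transfer_days(BYTES_PER_EXABYTE, ASSUMED_LINK) / 365

print(f"1 Pbyte at 1 Gbit/s: about {petabyte_days:.0f} days")
print(f"1 Ebyte at 1 Gbit/s: about {exabyte_years:.0f} years")
```

Under these assumptions a single Petabyte takes roughly three months to transfer and an Exabyte roughly 250 years, which is the argument for bringing the computation to the data.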
Solutions - Cloud Computing ?
• Cloud computing may be the solution.
• Data centre for cloud co-located with data generation.
• Processing as well as data retrieval done at data generation centre.
Cloud Computing Definition - “If it looks like a duck”
• Features of cloud computing are
• Computing is mostly done at a data centre provided by a vendor
• Client-side computing is minimal
• Servers at data centre make heavy use of virtualisation (as opposed to Grids)
• Client can select the number of VM instances and the amount of data stored
• Client pays on a per-use basis - “Somebody’s Credit Card is being used”
• The computing is treated as a utility rather than a resource.
Cloud providers
• Amazon Web Services (AWS)
• Provide Linux or Windows VM
• You get a command line.
• Microsoft - Azure
• More complicated method of submission
• Open Source - Eucalyptus (stability ?)
• Other providers out there ...
Advantages
• Data centre can be located where the data is generated and accessed from anywhere (in theory).
• Data could be kept up to date.
• Analysis tools could be kept up to date
• Services can be developed which go significantly beyond a simple command line interface (Azure works along these lines).
• Scalability - if you want 1 or 100 VMs you can get them.
Disadvantages
• At present vendors do not provide a tailored environment for scientific clients.
• A VM (regardless of OS) is effectively a blank canvas, so you have to upload all the binaries, libraries and data that you need.
• Data may not be in the correct configuration - storage/compute tradeoff.
• Like any utility, usage has to be watched carefully!
• Vendor lock-in.
• Security.
• Legal issues - licensing, nationality of vendor and data centre.
Show me the money
• Commercial clouds charge on a per use basis.
• Disk space
• CPU time
• Amazon and Microsoft charge for the time a VM is deployed
• Google tries to charge per CPU cycle.
• Move from once-off payment model to rolling costs.
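The per-use charging model above can be sketched as a simple bill made up of VM time deployed plus disk space. Every rate and name below is a made-up placeholder for illustration, not any vendor's actual price list or API.

```python
# Pay-per-use cost sketch. All rates are hypothetical placeholders,
# not real prices from any cloud vendor.
HOURLY_RATE_PER_VM = 0.10         # assumed: currency units per VM-hour deployed
STORAGE_RATE_PER_GB_MONTH = 0.15  # assumed: currency units per Gbyte per month


def monthly_cost(n_vms, hours_deployed, storage_gb):
    """Bill for one month: time the VMs are deployed plus disk space held."""
    compute = n_vms * hours_deployed * HOURLY_RATE_PER_VM
    storage = storage_gb * STORAGE_RATE_PER_GB_MONTH
    return compute + storage


HOURS_IN_MONTH = 24 * 30

one_vm = monthly_cost(n_vms=1, hours_deployed=HOURS_IN_MONTH, storage_gb=500)
hundred_vms = monthly_cost(n_vms=100, hours_deployed=HOURS_IN_MONTH, storage_gb=500)

print(f"1 VM for a month:    {one_vm:.2f}")
print(f"100 VMs for a month: {hundred_vms:.2f}")
```

Because the charge in this sketch follows the time a VM is deployed rather than the work it does, an idle VM left running costs the same as a busy one, and the bill recurs every month instead of being paid once.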
Big Data - a new discipline ?
• Diagram relating Big Data to: Machine Learning / Pattern Recognition, Hardware, Quality Control, Finance / Accounting
Conclusions
• Microarray data gives us a first insight into the dynamic cell.
• Sequence and other omic data sets are expanding into Petabytes.
• Big Data is upon us.
• Cloud computing is not a panacea.
• Cloud computing may democratise access to Big Data.