2013 nas-ehs-data-integration-dc

1. Integrating large, fast-moving, and heterogeneous data sets in biology.
C. Titus Brown
Asst Prof, CSE and Microbiology; BEACON NSF STC
Michigan State
[email protected]

2. Introduction
Background:
    Modeling & data analysis undergrad =>
    Open source software development + software engineering + developmental biology + genomics PhD =>
    Bio + computer science faculty =>
    Data driven biology
Currently working with next-gen sequencing data (mRNAseq, metagenomics, difficult genomes).
Thinking hard about how to do data-driven modeling & model-driven data analysis.

3. Goal & outline
Address challenges and opportunities of heterogeneous data integration: 1000 ft view.
Outline:
    What types of analysis and discovery do we want to enable?
    What are the technical challenges, common solutions, and common failure points?
    Where might we look for success stories, and what lessons can we port to biology?
    My conclusions.

4. Specific types of questions
I have a known chemical/gene interaction; do I see it in this other data set?
I have a known chemical/gene interaction; what other gene expression is affected?
What does chemical X do to overall phenotype, effect on gene expression, altered protein localization, and patterns of histone modification?
More complex/combinatorial interactions:
    What does this chemical do in this genetic background?
    What kind of additional gene expression changes are generated by the combination of these two chemicals?
    What are common effects of this class of chemicals?

5. What general behavior do we want to enable?
Reuse of data by groups that did not/could not produce it.
Publication of reusable/forkable data analysis pipelines and models.
Integration of data and models.
Serendipitous uses and cross-referencing of data sets (mashups).
Rapid scientific exploration and hypothesis generation in data space.

6. (Executable papers & data reuse)
ENCODE
    All data is available; all processing scripts for papers are available on a virtual machine.
QIIME (microbial ecology)
    Amazon virtual machine containing software and data for: Collaborative cloud-enabled tools allow rapid, reproducible biological insights.
    (pmid 23096404)
Digital normalization paper
    Amazon virtual machine, again: http://arxiv.org/abs/1203.4802

7. Executable papers can support easy replication & reuse of code, data.
(IPython Notebook; also see RStudio)
http://ged.msu.edu/papers/2012-diginorm/notebook/

8. What general behavior do we want to enable?
Reuse of data by groups that did not/could not produce it.
Publication of reusable/forkable data analysis pipelines and models.
Integration of data and models.
Serendipitous uses and cross-referencing of data sets (mashups).
Rapid scientific exploration and hypothesis generation in data space.

9. An entertaining digression
A mashup of Facebook top 10 books by college and per-college SAT rank
http://booksthatmakeyoudumb.virgil.gr/

10. Technical obstacles
Syntactic incompatibility
    The first 90% of bioinformatics: your IDs are different from my IDs.
Semantic incompatibility
    The second 90% of bioinformatics: what does "gene" mean in your database?
Impedance mismatch
    SQL is notoriously bad at representing intervals and hierarchies.
    Genomes consist of intervals; ontologies consist of hierarchies!
    SQL databases dominate (vs graph or object DBs).
Data volume & velocity
    Large & expanding data sets just make everything harder.
Unstructured data
    aka publications; most scientific knowledge is locked

11. Typical solutions
Entity resolution
    Accession numbers or other common identifiers; requires global naming system OR translators.
Top down imposition of structure
    Centralized DB; "Here is the schema you will all use"; limits flexibility, prevents use of unstructured data, heavyweight.
Ontologies to enable correct communication
    Centrally coordinated vocabulary; slow, hard to get right, doesn't solve unstructured data problem.
    Balancing theoretical rigor with practical applicability is particularly hard.
Ad hoc entity resolution ("winging it")
    Common solution; doesn't work that well.

12. Are better standards the solution?
http://xkcd.com/927/

13. Rephrasing technical goals
How can we best provide a platform or platforms to support flexible data integration and data investigation across a wide range of data sets and data types in biology?
My interests:
    Avoid master data manager and centralization
    Support federated roll-out of new data and functionality
    Provide flexible extensibility of ontologies and hierarchies
    Support diverse ecology of databases,

14. Success stories outside of biology?
Look for domains:
    with really large amounts of heterogeneous data,
    that are continually increasing in size,
    are being effectively mined on an ongoing basis,
    have widely used programmatic interfaces that support mashups and other cross-database stuff,
    and are intentional, with principles that we can steal or adapt.

15. Success stories outside of biology?
Look for domains:
    with really large amounts of heterogeneous data,
    that are continually increasing in size,
    are being effectively mined on an ongoing basis,
    have widely used programmatic interfaces that support mashups and other cross-database stuff,
    and are intentional, with principles that we can steal or adapt.
Amazon.

16. Amazon:
> 50 million users, > 1 million product partners, billions of reviews, dozens of compute services
Continually changing/updating data sets.
Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.
    For example, the amazon.com Web site is itself built from over 150 independent services
Amazon routinely deploys new services and functionality.

17. Sources:
The Platform Rant (Steve Yegge) -- in which he compares the Google and Amazon approaches:
    https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
A summary at HighScalability.com:
    http://highscalability.com/amazon-architecture
(They are both long and tech-y, note, but the first is especially entertaining.)

18. A brief summary of core principles
Mandates from the CEO:
    1. All teams must expose data and functionality solely through a service interface.
    2. All communication between teams happens through that service interface.
    3. All service interfaces must be designed so that they can be exposed to the outside world.

19. More colloquially:
You should eat your own dogfood.
Design and implement the database and database functionality to meet your own needs; and only use the functionality you've explicitly made available to everyone.
To adapt to research: database functionality should be designed in tight integration with researchers who are using it, both at a user interface level and programmatically.
(Genome databases have done a really good job of this, albeit generally in a centralized model.)

20. If the customers aren't integrated into the development loop:

21. A platform view?
[Figure: platform diagram. Recoverable labels: Diff'n gene expression query; Data exploration WWW; Metabolic model; Gene ID translator; Chemical relationships; Isoform resolution/comparison; Expression normalization; Expression data (tiling); Expression data (microarray); Expression data (mRNAseq); Expression data II (mRNAseq).]

22. A few points
Open source and agile software development approaches can be surprisingly effective and inexpensive.
Developing services in small groups that include customer-facing developers helps ensure utility.
Implementing services in the cloud (e.g. virtual machines, or on top of infrastructure as a service services) gives developers flexibility in tools, approaches, implementation; also enables scaling and reusability.

23. Combining modelling with data
Data-driven modeling: connections and parameters can be, to some extent, determined from data.
Model-driven data investigation: data that doesn't fit the known model is particularly interesting.
The second approach is essentially how particle physicists work with accelerator data: build a model & then interpret the data using the model.
(In biology, models are less constraining, though; more unknowns.)

24. Using developmental models
Davidson et al., http://sugp.caltech.edu/endomes

25. Using developmental models
Models can contain useful abstractions of specific processes; here, the direct effects of blocking nuclearization of β-catenin can be predicted by following the connections.
Models provide a common language for (dis)agreement in a community.

26. Using developmental models
Davidson et al., http://sugp.caltech.edu/endomes

27. Social obstacles
Training of biologically aware software developers is lacking.
Molecular biologists are still very much of a computationally naive mindset: "give me the answer so I can do the real work"
Incentives for data sharing, much less useful data sharing, are not yet very strong.
    Pubs, grants, respect...
Patterns for useful data sharing are still not well understood, in general.

28. Other places to look
NEON and other NSF centers (e.g. NCEAS) are collecting vast heterogeneous data sets, and are explicitly tackling the data management/use/integration/reuse problem.
SBML (Systems Biology Markup Language) is a modeling descriptive language that enables interoperability of modeling software.
Software Carpentry runs free workshops on effective use of computation for science.

29. My conclusions
We need a platform mentality to make the most use of our data, even if we don't completely embrace loose coupling and distribution.
Agile and end-user focused software development methodologies have worked well in other areas; much of the hard technical space has already been explored in Internet companies (and probably social networking companies, too).
Data is most useful in the context of an explicit model; models can be generated from data, and models can feed back into data gathering.

30. Things I didn't discuss
Database maintenance and active curation is incredibly important.
Most data only makes sense in the context of other data (think: controls; wild type vs knockout; other backgrounds; etc.) so we will need lots more data to interpret the data we already have.
Deep learning is a promising field for extracting correlations from multiple large data sets.
All of these technical problems are easier to solve than the social problems (incentives; training).

31. Thanks
This talk and ancillary notes will be available on my blog ~soon: http://ivory.idyll.org/blog/
Please do contact me at [email protected] if you have questions or comments.
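
Appendix sketch (not from the talk): the "translators" option under entity resolution on slide 11, and the "Gene ID translator" box on slide 21, could look roughly like the Python below. This is a minimal illustration only. The class name IdTranslator, the naming schemes "lab_a" and "lab_b", and every identifier are hypothetical, and a real translator would sit behind a service interface and also have to handle retired and ambiguous IDs.

```python
from collections import defaultdict


class IdTranslator:
    """Hypothetical in-memory translator between local ID schemes.

    Each (scheme, local_id) pair maps to one shared canonical ID, so
    "your IDs" and "my IDs" can be compared without a global schema.
    """

    def __init__(self):
        self._canonical = {}               # (scheme, local_id) -> canonical ID
        self._aliases = defaultdict(set)   # canonical ID -> {(scheme, local_id)}

    def register(self, canonical_id, scheme, local_id):
        """Record that local_id in scheme refers to canonical_id."""
        key = (scheme, local_id)
        existing = self._canonical.get(key)
        if existing is not None and existing != canonical_id:
            # Refuse silent re-mapping; conflicts need a human decision.
            raise ValueError(f"{key} already maps to {existing}")
        self._canonical[key] = canonical_id
        self._aliases[canonical_id].add(key)

    def resolve(self, scheme, local_id):
        """Return the canonical ID, or None if the local ID is unknown."""
        return self._canonical.get((scheme, local_id))

    def translate(self, scheme, local_id, target_scheme):
        """Return all IDs in target_scheme for the same entity (sorted)."""
        canonical_id = self.resolve(scheme, local_id)
        if canonical_id is None:
            return []
        return sorted(
            lid for s, lid in self._aliases[canonical_id] if s == target_scheme
        )


# Made-up example data: two labs naming the same gene differently.
translator = IdTranslator()
translator.register("gene:0001", "lab_a", "A-17")
translator.register("gene:0001", "lab_b", "b_0042")
translator.register("gene:0002", "lab_a", "A-18")

print(translator.translate("lab_a", "A-17", "lab_b"))
```

Returning a list rather than a single ID is a deliberate choice in this sketch: one entity can have several IDs in the target scheme, or none, and the caller should see that instead of a guess.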
