Big Data. Kirk Borne, George Mason University. LSST All Hands Meeting, August 13 - 17, 2012

TRANSCRIPT

  • Slide 1

Big Data
Kirk Borne, George Mason University
LSST All Hands Meeting, August 13 - 17, 2012

  • Slide 2

Characteristics of Big Data 1a
Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, gov, social networks, etc.

  • Slide 3

Characteristics of Big Data 1b
Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, gov, social networks, etc.

  • Slide 4

Characteristics of Big Data 1c
Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, gov, social networks, etc.

  • Slide 5

Characteristics of Big Data 2
Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, gov, social networks, etc.
But what do we mean by "big"?
- Gigabytes? Terabytes? Petabytes? Exabytes?
- The meaning of "big" is domain-specific and resource-dependent (data storage, I/O throughput, computation cycles, communication costs)
- I say we all are dealing with our own "tonnabytes"

  • Slide 6

Characteristics of Big Data 3
There are 4 dimensions to the Big Data challenge:
1. Volume (tonnabytes data challenge)
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flowing at us)
4. Verification (verifying inference-based models from data)
Therefore, we need something better to cope with the tonnabytes.

  • Slide 7

Data Science: Informatics & Statistics

  • Slide 8

This graphic says it all (graphic provided by S. G. Djorgovski, Caltech)
- Clustering: examine the data and find the data clusters (clouds), without considering what the items are = Characterization!
- Classification: for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify!
- Outlier Detection: identify those data items that don't fit into the known classes or clusters = Surprise!
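The three tasks named on Slide 8 can be illustrated with a toy example. The following is only a minimal sketch, not anything from the talk: the 2-D points, the starting centroids, the distance threshold, and the helper names (`kmeans`, `classify`, `is_outlier`) are all assumptions made for illustration, and plain k-means with a nearest-centroid rule stands in for whatever algorithms a real survey pipeline would use.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def nearest(p, centroids):
    """Index of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Clustering: group unlabeled points around k centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def classify(p, centroids):
    """Classification: assign a new item to the closest known cluster."""
    return nearest(p, centroids)

def is_outlier(p, centroids, threshold):
    """Outlier detection: flag items far from every known cluster."""
    return min(math.dist(p, c) for c in centroids) > threshold

# Hypothetical toy data: two well-separated clouds in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids = kmeans(points, centroids=[points[0], points[3]])

label_a = classify((1.1, 0.9), centroids)                     # near the first cloud
label_b = classify((4.9, 5.1), centroids)                     # near the second cloud
surprise = is_outlier((10.0, 0.0), centroids, threshold=2.0)  # far from both
typical = is_outlier((1.1, 0.9), centroids, threshold=2.0)
print(centroids, label_a, label_b, surprise, typical)
```

The order of operations mirrors the slide: the clusters are found first without labels, new items are then placed into those clusters, and anything too far from every cluster is set aside as a candidate surprise. The brute-force distance loops here cost time proportional to the number of points times the number of clusters, which is fine for six points but is not a method for catalog-scale data.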
  • Slide 9

Data-Enabled Science: Scientific KDD (Knowledge Discovery from Data)
- Characterize the new (clustering, unsupervised learning)
- Assign the known (classification, supervised learning)
- Discover the unknown (outlier detection, semi-supervised learning)
The two major benefits of BIG DATA:
1. best statistical analysis of "typical" events
2. automated search for "rare" events
(Graphic from S. G. Djorgovski)

  • Slide 10

Basic Astronomical Knowledge Problems 1
The clustering problem:
- Finding clusters of objects within a data set
- What is the significance of the clusters (statistically and scientifically)?
- What is the optimal algorithm for finding friends-of-friends or nearest neighbors?
  - N is > 10^10, so what is the most efficient way to sort?
  - Number of dimensions ~ 1000; therefore, we have an enormous subspace search problem
- Are there pair-wise (2-point) or higher-order (N-way) correlations?
  - N is > 10^10, so what is the most efficient way to do an N-point correlation?
  - algorithms that scale as N^2 log N won't get us there

  • Slide 11

Basic Astronomical Knowledge Problems 2
Outlier detection: (unknown unknowns)
- Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
- These may be real scientific discoveries or garbage
- Outlier detection is therefore useful for:
  - Novelty Discovery: is my Nobel prize waiting?
  - Anomaly Detection: is the detector system working?
  - Data Quality Assurance: is the data pipeline working?
- How does one optimally find outliers in 10^3-D parameter space? Or in interesting subspaces (in lower dimensions)?
- How do we measure their "interestingness"?

  • Slide 12

Basic Astronomical Knowledge Problems 3
The dimension reduction problem:
- Finding correlations and "fundamental planes" of parameters
- Number of attributes can be hundreds or thousands
  - The Curse of High Dimensionality!
- Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
- Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?

  • Slide 13

Basic Astronomical Knowledge Problems 4
The superposition / decomposition problem:
- Finding the defining features that separate different classes of objects that overlap in simple parameter spaces
- What if there are 10^10 objects that overlap in a 10^3-D parameter space?
- What is the optimal way to separate and extract the different unique classes of objects?

  • Slide 14

The LSST Big Data Manifesto
- More data is not just more data: more is different!
- Discover the unknown unknowns.
- Massive Data-to-Knowledge challenge.

  • Slide 15

The LSST Big Data Challenges
1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-Petabyte database: more than 20 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for ~2,000,000 events each night.
Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.
Look at these in more detail...

  • Slide 16

LSST big data challenges #1, 2
- Each night for 10 years LSST will obtain roughly the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey
- Our grad students will be asked to mine these data (~20 TB each night, i.e., 40,000 CDs filled with data):
  - A truckload of CDs each and every day for 10 yrs
  - Cumulatively, a football stadium full of 100 million CDs after 10 yrs
- The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data.
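The CD analogy on this slide can be checked with a little arithmetic. The sketch below uses only the figures quoted on the slide (about 20 TB per night, 40,000 CDs per night, 100 million CDs after 10 years); the choice of decimal units and the 365-nights-per-year upper bound are assumptions added for illustration, not numbers from the talk.

```python
# Back-of-envelope check of the CD analogy, in decimal units (assumption).
TB = 10**12
MB = 10**6

nightly_bytes = 20 * TB     # ~20 TB per night (slide figure)
cds_per_night = 40_000      # slide figure
total_cds = 100_000_000     # "100 million CDs after 10 yrs" (slide figure)
years = 10

# Disc capacity implied by the two nightly figures.
implied_cd_mb = nightly_bytes / cds_per_night / MB

# Number of observing nights implied by the cumulative figure.
implied_nights = total_cds / cds_per_night
implied_nights_per_year = implied_nights / years

# Upper bound if every calendar night were observed (assumption).
max_cds = cds_per_night * 365 * years

print(f"implied capacity per CD: {implied_cd_mb:.0f} MB")
print(f"implied observing nights per year: {implied_nights_per_year:.0f}")
print(f"CDs if observing all 365 nights/yr for {years} yrs: {max_cds:,}")
```

The nightly figures correspond to about 500 MB per disc, somewhat below the roughly 700 MB a standard CD holds, and the cumulative 100 million figure corresponds to about 250 observing nights per year. Both are round numbers of the right order of magnitude, which is all the analogy needs.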
Yes, more is most definitely different!

  • Slide 17

LSST big data challenge #3
- Approximately 2,000,000 times each night for 10 years LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events.
- What will we do with all of these events?
[Figure: plot of flux versus time]
Characterize first! (Unsupervised Learning) Classify later.

  • Slide 18

Characterization includes:
- Feature Detection and Extraction:
  - Identifying and describing features in the data via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science)
  - Extracting feature descriptors from the data
  - Curating these features for search, re-use, & discovery
- Finding other parameters and features from other archives, other databases, other information sources, and using those to help characterize (ultimately classify) each new event.
  - hence, coping with a highly multivariate parameter space

  • Slide 19

Data-driven Discovery (Unsupervised Learning)
i.e., What can I do with characterizations?
1. Class Discovery: Clustering
2. Principal Component Analysis: Dimension Reduction
3. Outlier Detection: Surprise / Anomaly / Deviation / Novelty Discovery
4. Link Analysis: Association Analysis, Network Analysis
5. ...and more.

  • Slide 20

Addressing the D2K (Data-to-Knowledge) Challenge
- Complete end-to-end application of Data Science: data management, metadata management, data search, information extraction, data mining, knowledge discovery
- Applies to any discipline (not just science disciplines)
- Skilled workforce needed to take data to knowledge

  • Slide 21

Informatics in Education and An Education in Informatics

  • Slide 22

Data Science Education: Two Perspectives

Informatics in Education: working with data in all learning settings
- Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning.
- Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied.
- http://serc.carleton.edu/usingdata/ (Using Data in the Classroom)
- Example: CSI The Cosmos

An Education in Informatics: students are specifically trained:
- to access large distributed data repositories
- to conduct meaningful inquiries into the data
- to mine, visualize, and analyze the data
- to make objective data-driven inferences, discoveries, and decisions
Numerous Data Science programs now exist at several universities (GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, Indiana U., ...)
http://cds.gmu.edu/ (Computational & Data Sciences @ GMU)

  • Slide 23

Responses to Big Data 1
2.5 approaches to dealing with Big Data:
- Data Science = Informatics & Statistics (and data-intensive computing)
- Citizen Science = Human Computation
- Or else (where possible) combine these two: use the very effective human cognitive skills of pattern recognition and anomaly detection to generate training sets of relevant features (characterizations) to improve the machine algorithms.

  • Slide 24

Responses to Big Data 2
- LSST Informatics & Statistics Science Collaboration: breakout @ 11am in TB-A
- New Journal: Astronomy & Computing
  - Poster and flyers available in hallway
  - http://www.journals.elsevier.com/astronomy-and-computing/
- New AAS Working Group on Astroinformatics & Astrostatistics
  - Members: Bill Zeljko Ivezic (chair), Kirk Borne, George Djorgovski, Eric Feigelson, Eric Ford, Alyssa Goodman, Aneta Siemiginowska, Alex Szalay, Rick White.
- Visit https://www.facebook.com/AstroInformatics

  • Slide 25

LSST Informatics & Statistics Breakout Session
Brief lightning talks by 7 team members:
- Jogesh Babu: Statistical Resources
- Kirk Borne: Outlier Detection for Surprise Discovery in Big Data
- Matthew Graham: Characterizing and Classifying CRTS
- Joseph Richards: Time-Domain Discovery and Classification
- Sam Schmidt: Upcoming Challenges for Photometric Redshifts
- Lior Shamir: Automatic Analysis of Galaxy Morphology
- John Wallin: Citizen Science and Machine Learning
Open Discussion:
- LSST Publication Reviews: informatics & statistics participation
- LSST Science Book chapter
- Research Roadmap document
11:00am-12:30pm today, Tortolita Ballroom A