source: alex szalay. example: sloan digital sky survey the sdss telescope array is systematically...

source: Alex Szalay

Upload: jeffrey-martin

Post on 20-Jan-2016


Page 1: Source: Alex Szalay. Example: Sloan Digital Sky Survey The SDSS telescope array is systematically mapping ¼ of the entire sky Discoveries are made by

source: Alex Szalay


Example: Sloan Digital Sky Survey

The SDSS telescope array is systematically mapping ¼ of the entire sky

Discoveries are made by querying the database, not through a zero-sum wrestling match for telescope time

Managed by an RDBMS (MS SQL Server), equipped with a hierarchical triangular mesh index, among other customizations

15 TB in the final release in 2007

818 GB in the RDBMS (13.6B tuples)
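To make "discovery by querying the database" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `objects`, its columns, and the sample rows are all hypothetical and are not the real SDSS schema. A plain coordinate-box filter stands in for the hierarchical triangular mesh index, which is not reproduced here.

```python
import sqlite3

# Hypothetical miniature catalog: (obj_id, ra_deg, dec_deg, r_mag).
# Names and values are illustrative only, not taken from SDSS.
rows = [
    (1, 180.10, 2.50, 17.2),
    (2, 180.40, 2.55, 21.9),
    (3, 181.00, 3.10, 16.8),
    (4, 200.00, -1.00, 15.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects ("
    "obj_id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, r_mag REAL)"
)
conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)


def bright_objects_in_box(conn, ra_min, ra_max, dec_min, dec_max, max_mag):
    """Return ids of objects inside a sky box that are brighter than max_mag.

    Smaller magnitude means brighter, so the filter is r_mag < max_mag.
    """
    cursor = conn.execute(
        "SELECT obj_id FROM objects "
        "WHERE ra_deg BETWEEN ? AND ? "
        "AND dec_deg BETWEEN ? AND ? "
        "AND r_mag < ? "
        "ORDER BY obj_id",
        (ra_min, ra_max, dec_min, dec_max, max_mag),
    )
    return [row[0] for row in cursor]


result = bright_objects_in_box(conn, 180.0, 181.5, 2.0, 3.5, 18.0)
print(result)  # [1, 3]
```

The question ("which bright objects lie in this patch of sky?") is answered from the archive alone, with no new observing time requested.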


source: Alex Szalay


Drowning in data; starving for information

Acquisition eventually outpaces analysis
• Medicine: Online publishing, digital charts
• Astronomy: Big telescopes (more in a bit)
• Genetics: PCR, Shotgun Sequencing
• Oceanography: ??
• Marine Microbiology: ??

Empirical X Analytical X Computational X X-informatics

“Increase Data Collection Exponentially in Less Time, with FlowCAM”


Cyber-Observatories

Arctic Observing Network (AON)
Ocean Observatories Initiative (OOI)
National Ecological Observatory Network (NEON)
The Waters Network
The Long-Term Ecological Research (LTER) network
The Geosciences Network (GEON)
Earthscope/Incorporated Research Institutions for Seismology (IRIS)
Virtual Solar-Terrestrial Observatory (VSTO)
Linked Environments for Atmospheric Discovery (LEAD)


source: Alex Szalay


source: Jim Gray


Relational Databases (In Codd we Trust…)

At IBM Almaden in the 1960s and 70s, Ted Codd worked out a formal basis for tabular data representation, organization, and access [1].

The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did.

Key Idea: Programs that manipulate tabular data exhibit an algebraic structure. Codd proposed a relational algebra to manipulate these data sets in their logical form, independently of their physical representation.

[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp. 377-387, 1970

physical data independence

logical data independence
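The algebraic structure can be illustrated with a toy sketch in Python. This is not how any RDBMS is implemented; relations are plain lists of dicts, and the `stations` and `samples` data are hypothetical. The point is that two different expressions (filter after the join, or filter before it) give the same logical result, which is what lets a database choose the evaluation order without the programmer's involvement.

```python
def select(relation, predicate):
    """Selection: keep the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    seen = set()
    out = []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out


def natural_join(left, right):
    """Natural join: pair rows that agree on every shared column name."""
    out = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[c] == r_row[c] for c in shared):
                out.append({**l_row, **r_row})
    return out


# Hypothetical data, for illustration only.
stations = [
    {"station": "A", "depth_m": 5},
    {"station": "B", "depth_m": 40},
]
samples = [
    {"sample": 1, "station": "A"},
    {"sample": 2, "station": "B"},
    {"sample": 3, "station": "B"},
]


def is_deep(row):
    return row["depth_m"] > 10


# Plan 1: join everything, then filter.
plan_late_filter = project(
    select(natural_join(samples, stations), is_deep), ["sample"]
)

# Plan 2: filter the stations first, then join (selection pushed down).
plan_early_filter = project(
    natural_join(samples, select(stations, is_deep)), ["sample"]
)

print(plan_late_filter)   # [{'sample': 2}, {'sample': 3}]
print(plan_early_filter)  # [{'sample': 2}, {'sample': 3}]
```

Neither plan mentions files, indexes, or storage layout, which is the sense in which the logical form is independent of the physical representation.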


source: Raghu Ramakrishnan


Characteristics of Cloud Computing

• Virtual – Physical location and underlying infrastructure details are transparent to users

• Scalable – Able to break complex workloads into pieces to be served across an incrementally expandable infrastructure

• Efficient – “Services Oriented Architecture” for dynamic provisioning of shared compute resources

• Flexible – Can serve a variety of workload types – both consumer and commercial


Cloud Computing as Hosted Data Management Services

Yahoo
• Yahoo Distributed Hash Tables: Key/value pairs
• Yahoo Distributed Ordered Tables: Ordered ranges
• PNUTS: Relational-style storage, indexing and query

Amazon
• S3: Simple Storage
• SimpleDB: Quasi-Relational features

Google
• APIs for: Storage, Visualization, Document processing, Images, Mail

Microsoft
• CloudDB: Relational-style features
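The difference between a key/value hash table and an ordered table can be sketched in a few lines of Python. This is an in-memory illustration of the two access patterns only, not a description of how any of the hosted services above are built. The class names `HashPartitionedStore` and `OrderedTable`, the partition count, and the sample keys are all assumptions made for the example.

```python
import bisect
import hashlib


class HashPartitionedStore:
    """Key/value store: a hash of the key picks the partition.

    Point lookups are cheap, but keys that are close together can land in
    different partitions, so range scans are not supported.
    """

    def __init__(self, num_partitions=4):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        digest = hashlib.md5(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.partitions)
        return self.partitions[index]

    def put(self, key, value):
        self._partition_for(key)[key] = value

    def get(self, key, default=None):
        return self._partition_for(key).get(key, default)


class OrderedTable:
    """Ordered store: keys are kept sorted, so ranges are contiguous."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, low, high):
        """Return (key, value) pairs with low <= key < high, in key order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_left(self._keys, high)
        return [(k, self._values[k]) for k in self._keys[start:end]]


kv = HashPartitionedStore()
kv.put("user:42", "alice")
lookup = kv.get("user:42")

table = OrderedTable()
for day, event in [
    ("2016-01-03", "c"),
    ("2016-01-01", "a"),
    ("2016-02-01", "d"),
    ("2016-01-02", "b"),
]:
    table.put(day, event)
january = table.scan("2016-01-01", "2016-02-01")

print(lookup)   # alice
print(january)  # [('2016-01-01', 'a'), ('2016-01-02', 'b'), ('2016-01-03', 'c')]
```

The trade-off shown is the usual one: hashing spreads load evenly for single-key access, while ordering keeps neighbouring keys together so a range can be read in one pass.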


Workflow at CMOP

[Workflow diagram. Recoverable stage labels, with their outputs in parentheses: Cloning/cDNA/…; Sequencing (plates); Inspection (FASTA files); Cleaning (e.g., trim bad reads at the end); BLAST (FASTA files); Post processing (Hit tables); Link; Analyze; Shared Knowledge; synopsis; Cloud (Hit tables + metadata). Sites shown: OHSU, Washington University, PNNL.]
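As an illustration of the cleaning stage ("trim bad reads at the end"), here is a minimal Python sketch. The slide does not say how bad bases are identified, so this example assumes they appear as trailing ambiguous `N` calls in the FASTA sequence. The function names and sample reads are hypothetical and do not come from the CMOP pipeline.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def trim_trailing_ambiguous(sequence, bad="Nn"):
    """Drop ambiguous base calls from the end of a read only.

    Assumption: "bad" bases are encoded as N. Interior N calls are kept.
    """
    return sequence.rstrip(bad)


# Hypothetical input, for illustration only.
raw = """>read1
ACGTNN
NN
>read2
ANCGTN
"""

cleaned = [
    (header, trim_trailing_ambiguous(seq)) for header, seq in parse_fasta(raw)
]
print(cleaned)  # [('read1', 'ACGT'), ('read2', 'ANCGT')]
```

A pipeline working from quality scores rather than `N` calls would need a different trimming rule; only the overall shape (parse, trim the tail, pass on) is meant to carry over.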