Digital Worlds (applications)
❑ VEC (Enterprise Scale)
  • 1,300 source databases
  • 10+ million views (via data integration)
❑ US Healthcare (National Scale)
  • Scale
    o Health care and social assistance offices: 784,626, including
      • Doctors' offices: 220,131
      • Dentists: 127,057
      • Hospitals: 6,505
      • Clinics: ~5,000
      ≈ SME, say 100 databases
    o Patients: 100-300+ million
    o Databases: ~32 million
  • Scope
    o Comprehensive medical events, methods, analysis, …
      • E.g., Alice (62) in Emergency Room with liver failure
    o Insurance, payments, …
    o New metric: healthcare quality
  • Examples
    o SHRINE (2009): 3 hospitals; uses 2,381,883 distinct concepts (ontologies)
    o HHS CTO (Todd Park): Open Health Data Initiative
    o US (PCAST, White House) vision
Observations
❑ Data Sources
  • Massive
    o Number
    o Heterogeneity
    o Distribution (data at source)
    o Constant change: data, model, ontology, business rules, …
  • Constrained
    o Governance: privacy, confidentiality, legal, …
    o Quality, correctness, precision, …
    o Competition
❑ Critical Requirement: meaningful
  • Human lives
  • Health of individuals, communities, nation
  • Economic impact: $ trillions / year
  • Political: meaningless debates
Trends
❑ Digital Universe
❑ Holistic Views
  • Information Ecosystems: data
  • Ecosystems: Processes over services
❑ Big Data: massive
    o Number
    o Distribution
    o Heterogeneity
      • Semantics
      • Structure: relational databases, X databases, web, deep web
      • Technology: databases, data warehouses, files, …
❑ New Models: problem solving, data, …
  • Data-driven
  • Social computing: data as social artifacts
  • Science: Wolfram Alpha
  • Pragmatics: Driven by healthcare quality improvement
Databases and AI: The Twain Just Met
❑ Database World
  • Engineering (RDBMSs) @ scale
  • Reasoning: Relational model (FoL)
❑ AI World
  • Reasoning: more powerful & expressive
  • Engineering: in the small
❑ Digital Universe, e.g., Web
  • Reasoning: beyond the RDM & AI?
  • Engineering: way beyond RDBMS
❑ Information ecosystems
  • Databases: join
  • Web: link
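The "Databases: join / Web: link" contrast can be illustrated with a minimal Python sketch. Everything in it is hypothetical sample data (the tables, the URIs, and the helper names `join` and `follow` are assumptions, not from the slides): a join combines rows by matching key values, while a link is an explicit pointer that is followed from one resource to the next.

```python
# Hypothetical data: two relational tables and a small web of linked resources.
patients = [
    {"patient_id": 1, "name": "Alice"},
    {"patient_id": 2, "name": "Bob"},
]
visits = [
    {"patient_id": 1, "ward": "Emergency Room"},
    {"patient_id": 1, "ward": "Hepatology"},
    {"patient_id": 2, "ward": "Cardiology"},
]


def join(left, right, key):
    """Equi-join: pair up rows whose values for `key` match."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Web style: each resource names its neighbours explicitly via links.
web = {
    "/patient/alice": {"name": "Alice", "links": ["/visit/1", "/visit/2"]},
    "/visit/1": {"ward": "Emergency Room", "links": []},
    "/visit/2": {"ward": "Hepatology", "links": []},
}


def follow(web, start):
    """Return the resources reachable by following links from `start`."""
    seen, stack = [], [start]
    while stack:
        uri = stack.pop()
        if uri in seen or uri not in web:
            continue
        seen.append(uri)
        # Push in reverse so links are visited in their listed order.
        stack.extend(reversed(web[uri]["links"]))
    return seen


joined = join(patients, visits, "patient_id")
reachable = follow(web, "/patient/alice")
```

In the join, the relationship is computed from shared values at query time; in the linked version, the relationship is stored with the data and discovered by traversal.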
Power Law of Data
The value of a data element is proportional to the number of its meaningful uses.
What Underlies the Digital Universe
[Figure labels: Languages; Modelling; Semantics; Data Models; Problem Solving; Computation; Algorithms; Execution; Semantics; DBMS Engines]
What Underlies the Data Universe
[Figure labels: Relational Data Model; Problem Solving; Computation; Semantics; Semantics; RDBMS; Data Independence]
Relational Database Improvements
❑ Pre-Relational
  • Hierarchical
  • Network
❑ Relational
  • Row store
  • OLAP / Data Warehouse
❑ Post-Relational
  • RDF store
  • Column store
  • Bare bones relational
  • Stream / complex event processing
❑ Push Down
  • Database / data warehouse appliances (20+ on the market)
  • In-database analytics, … (10+ on the market)
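As a rough illustration of the row store versus column store distinction, and of why data independence matters, here is a minimal Python sketch. The classes `RowStore` and `ColumnStore`, the function `average_cost`, and the sample relation are all hypothetical; real engines are far more involved. The point is only that the same logical query runs unchanged over two physical layouts.

```python
# Logical relation (hypothetical sample data): (patient_id, age, cost)
rows = [(1, 62, 1200.0), (2, 45, 300.0), (3, 71, 950.0)]


class RowStore:
    """Physical layout: one tuple per record, fields stored together."""

    def __init__(self, rows):
        self.rows = list(rows)

    def column(self, i):
        # Must touch every record to pull out a single field.
        return [r[i] for r in self.rows]


class ColumnStore:
    """Physical layout: one list per attribute, values stored together."""

    def __init__(self, rows):
        self.cols = [list(c) for c in zip(*rows)]

    def column(self, i):
        # The requested attribute is already contiguous.
        return self.cols[i]


def average_cost(store):
    """Logical query: written once, independent of the physical layout."""
    costs = store.column(2)
    return sum(costs) / len(costs)


row_result = average_cost(RowStore(rows))
col_result = average_cost(ColumnStore(rows))
```

Both stores return the same answer; only the access pattern differs, which is why column layouts tend to suit analytic scans over a few attributes.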
Data Models For New Domains Must Honor Data Independence
❑ Array (Matrix)-store (SciDB) [Linear algebra]
❑ XML databases: structured content, information exchange
❑ Content management: e.g., SharePoint
❑ Graph/network store: social networking (Facebook), link analysis
❑ Protein store: protein folding, drug discovery, …
❑ Geospatial / map store: location-based applications
❑ Time series: signal processing, statistical and financial analysis
❑ Cloud / Mesh data (NoSQL) stores: web scale applications
❑ and they just keep coming …
Data Universe
[Figure, repeated on two consecutive slides. Labels: Data Universe; Database Universe (DBU); Relational Data Universe; RDM; Scientific Data Model; Time Series Data Model; Geo-Spatial Data Model; Digital Media Data Model; Document Data Model; Graph-Network Data Model; etc.]
Data Integration Solution Space:
                                                       Computation                              Problem Solving
Semantic Technologies (AI)  Knowledge Representation   Minimal                                  Powerful
                            Ontologies                 Minimal                                  Powerful
                            Semantic Web               Modest / emerging                        Modest / emerging
                            Semantic Data Management   Emerging                                 Emerging
Architectural               Information-As-A-Service   Emerging                                 Emerging
                            Cloud                      Emerging                                 N/A
Databases                   Relational                 Optimal for homogeneous relational data  Optimal for pure relational data
                            Domain-specific            Emerging                                 Emerging
Data Independence Required
Databases vs. Semantic Web
Mathematical Logic
Discrete Worlds
Probabilistic / Eventual Reasoning
Single Versions of Truth
Databases Semantic Web
What Logic ?
Heterogeneous Worlds
Common Sense Reasoning?
Multiple Truths
1,000s of databases
Data Models (Databases) vs. LOD Models? (Semantic Web)
DI: Relational Join (Databases) vs. DI: Evidence Gathering (Semantic Web)
Databases vs. Web
[Figure labels: Scale; Data Management; Semantically Homogeneous Databases; Data Warehouses; Semantically Heterogeneous Views; Web; Single versions of truth; Multiple versions of truth; Analysis / BI; Exploration; Evidence Gathering]
Data Integration
❑ Query: define the result
  • Entity
  • Computation
❑ Find candidate data sets: search
❑ Extract, Transform, and Load (ETL): engineering
❑ Data Integration
  • Entity resolution
  • Integration computation
[Slide annotations: Harder; Hard]
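The last two steps, entity resolution and integration computation, can be sketched in a few lines of Python. This is a toy under stated assumptions, not a method from the talk: the two source extracts, their field names, and the helpers `entity_key` and `integrate` are hypothetical, and matching on name tokens plus birth year is a deliberately naive resolution rule.

```python
import re
from collections import defaultdict

# Hypothetical extracts from two sources with different schemas.
hospital = [
    {"full_name": "Alice  Smith", "born": 1949, "event": "ER admission"},
    {"full_name": "Bob Jones", "born": 1980, "event": "X-ray"},
]
insurer = [
    {"member": "SMITH, ALICE", "birth_year": 1949, "claim": 1200.0},
    {"member": "JONES, BOB", "birth_year": 1980, "claim": 150.0},
    {"member": "SMITH, ALICE", "birth_year": 1949, "claim": 300.0},
]


def entity_key(name, birth_year):
    """Naive entity resolution: sorted lowercase name tokens plus birth year."""
    tokens = sorted(re.findall(r"[a-z]+", name.lower()))
    return (" ".join(tokens), birth_year)


def integrate(hospital, insurer):
    """Map both schemas onto one entity key, then compute per-entity results."""
    merged = defaultdict(lambda: {"events": [], "total_claims": 0.0})
    for rec in hospital:
        key = entity_key(rec["full_name"], rec["born"])
        merged[key]["events"].append(rec["event"])
    for rec in insurer:
        key = entity_key(rec["member"], rec["birth_year"])
        merged[key]["total_claims"] += rec["claim"]
    return dict(merged)


result = integrate(hospital, insurer)
```

Even in this toy, the resolution rule carries most of the risk: two different people with the same name and birth year would be merged, which hints at why these steps are harder at the scale of thousands of sources.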
Managing Data @ Scale I
❑ Introduction
  • Michael L. Brodie
❑ Global Data Integration and Global Data Mining
  • Chris Bizer
❑ DB vs RDF: structure vs correlation
  • Peter Boncz