accumulo summit 2014: addressing big data challenges through innovative architecture, databases and...
DESCRIPTION
Collecting and analyzing large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges posed by big data volume, velocity and variety. The Massachusetts Institute of Technology Lincoln Laboratory (MIT LL) is not immune to these challenges and has developed a set of tools that address many of them. Big data volume stresses the storage, memory, and compute capacity of a computing system and requires access to a computing cloud. Choosing the right cloud is problem specific. Currently, four multi-billion dollar ecosystems dominate the cloud computing environment: enterprise clouds, big data clouds, SQL database clouds, and supercomputing clouds. Each cloud ecosystem has its own hardware, software, conferences, and business markets. The broad nature of business big data challenges makes it unlikely that one cloud ecosystem can meet every need, and solutions are likely to require tools and techniques from more than one cloud ecosystem. The MIT SuperCloud was developed to address this challenge. To our knowledge, the MIT SuperCloud is the only deployed cloud system that allows all four ecosystems to co-exist without sacrificing performance or functionality. Big data velocity stresses the rate at which data can be absorbed and meaningful answers produced. Led by the NSA, a Common Big Data Architecture (CBDA) was developed for the U.S. government based on the Google Bigtable NoSQL approach and is now in wide use. MIT/LL played a leading role in developing the CBDA and is a leader in adapting the CBDA to a variety of big data challenges. Big data variety may present the largest challenge and the greatest opportunities. The promise of big data is the ability to correlate diverse and heterogeneous data to form new insights.
The centerpieces of the CBDA are the NSA-developed Apache Accumulo database (capable of millions of entries/second) and the MIT/LL-developed D4M schema. These technologies allow vast quantities of highly diverse data (text, computer logs, social media data, etc.) to be automatically ingested into a common schema that enables rapid query and correlation of every element. The talk will concentrate on how we utilize these technologies in our mission to apply advanced technology to problems of national security.
TRANSCRIPT
Addressing Big Data Challenges through Innovative Architecture,
Databases and Software
UNCLASSIFIED
Dr. Vijay [email protected]
Accumulo Summit, College Park, MD
June 12, 2014
This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
Accumulo SummitVNG - 2
Acknowledgements
• Bill Arcand
• Bill Bergeron
• David Bestor
• Chansup Byun
• Matt Hubbell
• Jeremy Kepner
• Pete Michaleas
• Julie Mullen
• Andy Prout
• Albert Reuther
• Tony Rosa
• Charles Yee
And many more …
Outline
• Introduction
• Cloud Computing and Challenges
• Innovative Architecture: MIT SuperCloud
• Innovative Databases: Apache Accumulo
• Innovative Software: D4M
• R&D Examples
• Conclusions
Introduction to MIT Lincoln Laboratory
Established 1951
Lincoln Laboratory is a Department of Defense FFRDC operated by MIT
Technology in Support of National Security
MIT Lincoln Laboratory
Purpose: Integrated Sensing and Decision Support (Secure – Countermeasure Resistant)
Core Work Areas: Sensors, Information Extraction, Communications
Current Mission Areas:
• Space Control
• Intelligence, Surveillance, and Reconnaissance Systems and Technology
• Tactical Systems
• Air and Missile Defense Technology
• Homeland Protection
• Air Traffic Control
• Communication Systems
• Advanced Technology
• Cyber Security and Information Sciences
• Engineering
MIT Lincoln Laboratory - Focus Areas -
Rapid Prototyping
Trusted Government Advisor
University Affiliations
System Analysis
• Highly instrumented
• Field / operational testing
• Capabilities against existing & future threats
• Rapid development
• Operationally relevant
• Validated by testing
Methodology
Outputs
Testing
Technology
Prototyping
Assessments to Senior Leadership
Architects and Requirements Definition
Advanced Technology Prototypes
Broad Multi-Mission Technology Strength
Architecture Analysis and Test
Conferences, Workshops, Outreach
Common Big Data Challenge
Users: Warfighters, Operators, Analysts
Data: Maritime, Ground, Space, C2, Cyber, Text and Social Media, Air, HUMINT, Weather
Gap between Data and Users
[Figure: World Total Information Stored (MB) and World Total Computing Capacity (Millions of Instructions per Second, MIPS) by Year, 1986 to 2013; annotated values 3 × 10^14 MB and 7 × 10^12 MIPS]
Source: M. Hilbert and P. López, Science, Vol. 332 (2011) and associated online material
Rapidly increasing
- Data volume
- Data velocity
- Data variety
Common Big Data Architecture
[Figure: architecture diagram. Users: Warfighters, Operators, Analysts. Data: Maritime, Ground, Space, C2, Cyber, OSINT, Air, HUMINT, Weather. Components: Analytics, Computing, Web, Files, Scheduler, Ingest & Enrichment, Databases]
Common Big Data Architecture - Data Volume: Cloud Computing -
MIT SuperCloud merges four clouds: Enterprise Cloud, Big Data Cloud, Database Cloud, Compute Cloud
LLSuperCloud: Sharing HPC Systems for Diverse Rapid Prototyping, Reuther et al., IEEE HPEC 2013
Common Big Data Architecture - Data Velocity: Accumulo Database -
Lincoln benchmarking validated Accumulo performance
Common Big Data Architecture - Data Variety: D4M Schema -
D4M demonstrated a universal approach to diverse data
[Figure: schema tables labeled rows, columns, Σ, raw]
Intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, … all handled by a common 4 table schema
D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., IEEE HPEC 2013
The Cloud within the Common Big Data Architecture
[Figure: Common Big Data Architecture diagram, annotated: The “Cloud”]
Four Ecosystems Dominate Large Scale Cloud Computing
• Each cloud ecosystem supports many multi-$B industries
• Each cloud ecosystem uses different software and hardware
Enterprise
- Interactive
- On-demand
- Virtualization
Supercomputing
- High performance
- Scientific computing
- Batch jobs
Big Data
- Java
- Distributed
- Easy admin
Database
- Indexing
- Search
- Atomic
MIT SuperCloud
• MIT SuperCloud adds virtual machines and security
• Combines all four ecosystems (Enterprise, Big Data, Database, Supercomputing) without sacrificing performance
Core Technologies
• VMware is the main enterprise computing virtualization technology
• Message Passing Interface (MPI) is the primary supercomputing API
• Structured Query Language (SQL) is the primary database API
• Hadoop & Accumulo & D4M are core to government big data clouds
MIT SuperCloud: Enterprise (VMware), Big Data (Hadoop), Supercomputing (MPI), Database (SQL)
D4M = Dynamic Distributed Dimensional Data Model
MIT SuperCloud
• Developed to address the challenges associated with big data volume
• Cloud system allows all four ecosystems of the cloud to exist within the same computational architecture
• Key Innovations:
– Shared HPC cloud capabilities
– High performance
– Reliable
• Brings the power of cloud computing to the HPC community
Cloud Ecosystems
• Allows different architectures to be dynamically combined and tested
MIT SuperCloud: Enterprise (VMware), Big Data (Hadoop), Supercomputing (MPI), Database (SQL)
MIT SuperCloud
[Figure: TX-E1 system diagram. Components: Network Storage, Scheduler, Monitoring System, Compute Nodes, Service Nodes, Cluster Switch, LAN Switch, Project Data. Jobs: Interactive Compute Job, Interactive VM Job, Interactive Database Job]
Cloud Computing @ MIT
• The cloud computing infrastructure at Lincoln Laboratory is based on the MIT SuperCloud infrastructure, which allows the different cloud ecosystems to coexist
• MIT SuperCloud architecture addresses the issues of big data volume
• Centerpiece of MIT SuperCloud: Accumulo database
Apache Accumulo
• Highest performance open source database
• Contributed to Apache project by the NSA in 2011
• Used extensively for government applications
• Requires a schema for storing and organizing data to obtain full benefits
Accumulo and the MIT SuperCloud
• Apache Accumulo is a high performance database used for a variety of purposes
– Helps address the big data velocity challenge
• Accumulo is the centerpiece of the Common Big Data Architecture developed by MIT Lincoln Laboratory
• Key features:
– Open Source
– High Performance
– Widely adopted
– Vibrant Developer Community
• MIT Lincoln Laboratory has developed a set of tools, D4M, to help researchers use Accumulo for novel research
High Level Language: D4M
http://www.mit.edu/~kepner/D4M
[Figure: a query on Alice, Bob, Cathy, David, Earl; labels: Accumulo Distributed Database, D4M Dynamic Distributed Dimensional Data Model, Associative Arrays, Numerical Computing Environment; graph nodes labeled A to E]
A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
D4M
• The Dynamic Distributed Dimensional Data Model
– Supports database and computation systems that deal with Big Data
– Developed at Lincoln Laboratory
• Key Features:
– Applies linear algebra and signal processing techniques to databases through associative arrays
– D4M data schema offers a one-stop solution for most types of data source for any type of database
– Low barrier to entry: API accessible even to those with minimal database and/or big-data background
Associative Arrays
• Extends associative arrays to 2D and mixed data types
A('#a1','#b2') = 'same_tweet'
• Key innovation: 2D is 1-to-1 with triple store
('#a1','#b2','same_tweet')
• Enables composable mathematical operations
A + B   A - B   A & B   A|B   A*B
• Enables composable query operations via array indexing
A('#a1 b2',:)   A('#a1,',:)   A('#a* ',:)
A('#a1: b2',:)   A(1:2,:)   A == #b2
[Figure: associative array with row #a1, column #b2, value same_tweet]
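The 1-to-1 correspondence between a 2D associative array and a triple store can be illustrated with a minimal Python sketch. This is not the D4M API: the helper names `to_triples` and `from_triples` are hypothetical, and a plain dictionary keyed by (row, column) stands in for the associative array.

```python
# Illustrative only: a dict keyed by (row, column) stands in for a 2D
# associative array; this is not the D4M library.

def to_triples(assoc):
    """Flatten a {(row, col): value} mapping into sorted (row, col, value) triples."""
    return sorted((r, c, v) for (r, c), v in assoc.items())


def from_triples(triples):
    """Rebuild the {(row, col): value} mapping from (row, col, value) triples."""
    return {(r, c): v for r, c, v in triples}


A = {("#a1", "#b2"): "same_tweet"}
triples = to_triples(A)
print(triples)                     # [('#a1', '#b2', 'same_tweet')]
print(from_triples(triples) == A)  # True
```

Because the round trip loses nothing, any database that stores triples can in principle hold such an array.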
Data Schema
• A structure described in a language supported by the database management system
• Accumulo supports triples
– How can we represent heterogeneous data types in a common data schema?
– Use the D4M schema
• D4M schema converts structured or unstructured raw data to the 3-tuple representation supported by Accumulo: (row, column) value
– row is a unique identifier (often some variation of a time stamp)
– column is a unique representation of the data
– value is typically just ‘1’
• Usually use a 4 table representation
– The Edge Table, the Transpose Table, the Degree Table, the Raw Table
Exploded Table
Raw table (use row_num values as row indices; create columns for each unique type/value pair):
row_num  col1      col2      col3
001      row1col1  row1col2  word1 word2 word3
002      row2col1  row2col2  word2 word3
003      …         …         word1 word3
Tedge
             col1|row1col1  col1|row2col1  col2|row1col2  col2|row2col2  col3|word1     col3|word2     col3|word3
row_num|001  1                             1                             1              1              1
row_num|002                 1                             1                             1              1
row_num|003                                                              1                             1
TedgeDeg
             col1|row1col1  col1|row2col1  col2|row1col2  col2|row2col2  col3|word1     col3|word2     col3|word3
Degree       1              1              1              1              2              2              3
TedgeT
               row_num|001  row_num|002  row_num|003
col1|row1col1  1
col1|row2col1               1
col2|row1col2  1
col2|row2col2               1
col3|word1     1                         1
col3|word2     1            1
col3|word3     1            1            1
TedgeTxt
             text
row_num|001  word1 word2 word3
row_num|002  word2 word3
row_num|003  word1 word3
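The exploded schema can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the D4M implementation: the function `explode` and the `records` input are hypothetical, plain dictionaries stand in for database tables, and row 003 carries only its text field because its other fields are elided on the slide.

```python
from collections import defaultdict

# Hypothetical input mirroring the slide's example table.
records = {
    "001": {"col1": "row1col1", "col2": "row1col2", "col3": "word1 word2 word3"},
    "002": {"col1": "row2col1", "col2": "row2col2", "col3": "word2 word3"},
    "003": {"col3": "word1 word3"},
}


def explode(records, text_field="col3"):
    """Build the four tables as plain dicts (a stand-in for database tables)."""
    tedge = defaultdict(dict)     # row key -> {column key: 1}
    tedge_t = defaultdict(dict)   # column key -> {row key: 1}
    tedge_deg = defaultdict(int)  # column key -> number of rows containing it
    tedge_txt = {}                # row key -> raw text
    for row_num, fields in records.items():
        row_key = f"row_num|{row_num}"
        for name, value in fields.items():
            # The text field is split into words; other fields stay whole.
            parts = value.split() if name == text_field else [value]
            for part in dict.fromkeys(parts):  # de-duplicate, keep order
                col_key = f"{name}|{part}"
                tedge[row_key][col_key] = 1
                tedge_t[col_key][row_key] = 1
                tedge_deg[col_key] += 1
        tedge_txt[row_key] = fields.get(text_field, "")
    return dict(tedge), dict(tedge_t), dict(tedge_deg), tedge_txt


tedge, tedge_t, tedge_deg, tedge_txt = explode(records)
print(tedge_deg["col3|word3"])        # 3
print(sorted(tedge_t["col3|word1"]))  # ['row_num|001', 'row_num|003']
```

Keeping both the edge table and its transpose means a lookup by row key or by column key is always a row lookup in one of the two tables, and the degree table tells a query how many matches to expect before it runs.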
Composable Associative Arrays
• Key innovation: mathematical closure
– All associative array operations return associative arrays
• Enables composable mathematical operations
A + B   A - B   A & B   A|B   A*B
• Enables composable query operations via array indexing
A('alice bob ',:)   A('alice ',:)   A('al* ',:)
A('alice : bob ',:)   A(1:2,:)   A == 47.0
• Simple to implement in a library (~3500 lines) in programming environments with: 1st class support of 2D arrays, operator overloading, sparse linear algebra
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
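Mathematical closure is what makes the operations chain: every operation returns another associative array, so the result can be fed straight into the next one. The toy Python class below illustrates the idea with operator overloading. The class name `Assoc`, its `rows` method, and the sample data are assumptions made for this sketch and do not reproduce the D4M library.

```python
class Assoc:
    """Toy sparse 2D associative array: {(row, col): numeric value}."""

    def __init__(self, data=None):
        # Drop explicit zeros so the array stays sparse.
        self.data = {k: v for k, v in (data or {}).items() if v != 0}

    def __add__(self, other):
        # Element-wise sum over the union of keys.
        keys = self.data.keys() | other.data.keys()
        return Assoc({k: self.data.get(k, 0) + other.data.get(k, 0) for k in keys})

    def __and__(self, other):
        # Keep only keys present in both arrays (value set to 1).
        return Assoc({k: 1 for k in self.data.keys() & other.data.keys()})

    def rows(self, prefix):
        # Row selection by key prefix, loosely mirroring A('al* ',:).
        return Assoc({k: v for k, v in self.data.items() if k[0].startswith(prefix)})


A = Assoc({("alice", "x"): 1, ("bob", "x"): 2})
B = Assoc({("alice", "x"): 3, ("carl", "y"): 4})

# Each step returns an Assoc, so operations compose freely.
C = (A + B).rows("al")
print(C.data)        # {('alice', 'x'): 4}
print((A & B).data)  # {('alice', 'x'): 1}
```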
Using D4M for Advanced Analytics
• D4M allows researchers to harness the versatility of the MIT SuperCloud architecture and the speed of Apache Accumulo through the familiarity of high level languages such as MATLAB or GNU Octave.
• D4M schema provides an approach to mitigate challenges associated with big data variety
• D4M is used for a variety of applications across the Department of Defense and Intelligence Community
Supporting National Security - Rapid Solution Prototyping -
Sample data:
336592592584179712 2013-05-20 21:21:42 20798128 kiefpief web 3b77caf94bfc81fe I am sending love to Oklahoma. And actually -- to everyone who may need it. You are loved. And you are not alone. Promise. #PrayforOklahoma
336600956710027264 2013-05-20 21:54:56 35.99894978 -78.90660222 -8783842.7781526 4300476.86376416 22435220 RyanBLeslie Twitter for iPad348803787 bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe: The devastation in Oklahoma is …
Step 1: Start an instance of Accumulo and Ingest Data
Step 2: Find all tweets with keyword:
>>A = Tedge(Row(Tedge(:, 'word|#prayforoklahoma,')),:);
Step 3: Filter tweets by location:
>>B = A(:, 'latlon|+-003934,:,latlon|+-003979,');
Step 4: Visualize results:
>>Assoc2KML(B);
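Steps 2 and 3 amount to a row selection followed by a column-key range filter. The Python sketch below mimics that logic on an in-memory dictionary. It is an illustration only: the table contents, the row keys `t1` to `t3`, and the helper names `rows_with` and `cols_in_range` are hypothetical, and no Accumulo or D4M calls are involved.

```python
# Hypothetical edge table: row key -> set of column keys (values are all 1).
tedge = {
    "t1": {"word|#prayforoklahoma", "latlon|+-003950"},
    "t2": {"word|#prayforoklahoma", "latlon|+-004100"},
    "t3": {"word|weather", "latlon|+-003960"},
}


def rows_with(table, col_key):
    """Step 2: sub-table of all rows that contain the given column key."""
    return {r: cols for r, cols in table.items() if col_key in cols}


def cols_in_range(table, lo, hi):
    """Step 3: keep only column keys in the lexicographic range [lo, hi]."""
    out = {}
    for r, cols in table.items():
        kept = {c for c in cols if lo <= c <= hi}
        if kept:
            out[r] = kept
    return out


A = rows_with(tedge, "word|#prayforoklahoma")
B = cols_in_range(A, "latlon|+-003934", "latlon|+-003979")
print(B)  # {'t1': {'latlon|+-003950'}}
```

The range filter relies on column keys sorting lexicographically, which is why the location values in the keys are fixed-width strings.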
Promoting big data discovery - Domain Agnostic Analytics -
Big Data Filtering and Sampling: Detecting Subgraphs of Interest from Large Graphs
Goal: Find subgraph of interest using background model to identify noise
Model Background Data to Extract Signal from Observations
Observed Data (Signal & Noise) - Background Model of Data (Noise) = Residual Data (Signal)
Example background model: Power Law Graph
[Figure labels: NOISE, SIGNAL, N-D SPACE, filter, pass, dmax]
Securing the Cloud - The Lincoln Secure and Resilient Cloud -
[Figure: Common Big Data Architecture components (Analytics, Computing, Web, Files, Scheduler, Ingest & Enrichment, Databases) annotated with Secure and Resilient Communication + Provenance, Secure and Resilient Storage, Secure and Resilient Processing]
• Big Data systems are vulnerable to a variety of attacks
• Improve security of cloud systems by researching:
– Security in Communication and Provenance
– Security in Data Storage
– Security in Processing
– Security in the underlying architecture
Ensuring Privacy - Computing On Masked Data -
Big Data Veracity
Challenges: Remote Code Injection, Hypervisor Privilege Escalation, Cross VM Side Channels, Data Loss / Exfiltration, Data Integrity Attack
Current Approaches: Encrypted link, Encrypted storage
Vision: Compute on Encrypted Data
Step 1: Mask data and ingest into database
>>put(Tedge, Mask(Aedge, maskcode));
Step 2: Query DB for results with masked queries
>>Aedge_mt = Tedge(Row(Tedge(:,StrMask('word|bieber ', maskcode))),:);
>>Atxt_mt = TedgeTxt(Row(Tedge(:,StrMask('word|bieber ', maskcode))),:);
Step 3: Unmask Results
>>Aedge = Unmask(Aedge_mt, maskcode);
>>Atxt = Unmask(Atxt_mt, maskcode);
Use D4M and CMD (Computing on Masked Data) to protect the 4th V of Big Data: Veracity
• Big Data systems are vulnerable to a variety of attacks
• Currently encrypt data at rest but data in flight is in the clear
• Compute on Encrypted Data: Data is always protected by encryption through the system.
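The mask, query, unmask flow above can be imitated in Python to show why equality queries still work on masked keys. This sketch does not reproduce the masking scheme behind `Mask`, `StrMask`, and `Unmask`, which the slides do not specify. It swaps in a keyed hash (HMAC-SHA256) as a deterministic mask and keeps a client-side lookup table for unmasking; the key, the table contents, and all names are assumptions.

```python
import hashlib
import hmac


def mask(token, key):
    """Deterministically mask a string with HMAC-SHA256 (hex digest)."""
    return hmac.new(key, token.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"demo-mask-key"  # hypothetical secret held only by the client

# Client side: mask row and column keys before ingest and remember how to
# unmask them (HMAC is one-way, so the client keeps a lookup table).
plain_edges = {
    "tweet|1": {"word|bieber", "word|concert"},
    "tweet|2": {"word|weather"},
}
unmask_table = {}
masked_edges = {}
for row, cols in plain_edges.items():
    for name in (row, *cols):
        unmask_table[mask(name, key)] = name
    masked_edges[mask(row, key)] = {mask(c, key) for c in cols}

# Server side: sees only masked keys, yet equality matching still works
# because the same input always masks to the same output.
query = mask("word|bieber", key)
hits = [r for r, cols in masked_edges.items() if query in cols]

# Client side: unmask the results.
print([unmask_table[r] for r in hits])  # ['tweet|1']
```

A deterministic mask supports exact-match queries only; range queries such as the location filter shown earlier would need an order-preserving scheme, which this sketch does not attempt.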
Summary
• Lincoln Laboratory missions collect and process vast amounts of data from many sources
• MIT Lincoln Laboratory makes use of innovations in system architecture (MIT SuperCloud), database technologies (Apache Accumulo) and software (D4M) to develop technology in support of national security
Mission Areas: Air and Missile Defense, Homeland Protection, Air Traffic Control, Communication Systems, Advanced Technology, Space Control, ISR Systems and Technology, Tactical Systems, Cyber Security, Engineering
Data Sources: Maritime, Ground, Space, C2, Cyber, OSINT, Air, HUMINT, Weather
Lincoln Laboratory is always interested in technical exchange with the big data community!
Backup
Cyber Security and Information Sciences
• Cyber Situational Awareness: Correlation and visualization of cyber alert data makes it possible to detect and understand attacks on large, enterprise networks.
• Cyber Testing and Range Development: Lincoln Laboratory builds, supports, and uses cyber ranges to evaluate the performance of cyber security technology.
• Cyber Security Metrics: Metrics are defined and measured to estimate the defensive posture of enterprise-class networks.
• Anti-Tamper Hardware: Physically unclonable functions are used to embed cryptographic key material in a coating around a computing module, permitting detection of tampering.
• Net-Centric Operations: Research and prototyping of Service-Oriented Architectures that enable the dynamic composition of systems involving complex sensors, processing and decision-support elements.
• Human Language Technology: Algorithms are developed and implemented for speech and biometric applications, including language/speaker identification, machine translation, and face comparison.