Towards Large-Scale Incident Response and Interactive Network Forensics
TRANSCRIPT
Towards Large-Scale Incident Response and Interactive Network Forensics
Matthias Vallentin
UC Berkeley / ICSI
Dissertation Proposal
UC Berkeley
December 14, 2011
April 21, 2009: Bad News for UC Berkeley
2 / 63
Blind SQL Injection (Havij)
..?deploy_id=799+and+ascii(substring((database()),1,1))<79    31
..?deploy_id=799+and+ascii(substring((database()),1,1))<103   11582
..?deploy_id=799+and+ascii(substring((database()),1,1))<91    31
..?deploy_id=799+and+ascii(substring((database()),1,1))<97    31
..?deploy_id=799+and+ascii(substring((database()),1,1))<100   11582
..?deploy_id=799+and+ascii(substring((database()),1,1))=99    11582
..?deploy_id=799+and+ascii(substring((database()),2,1))<79    31
..?deploy_id=799+and+ascii(substring((database()),2,1))<103   31
..?deploy_id=799+and+ascii(substring((database()),2,1))<115   11582
..?deploy_id=799+and+ascii(substring((database()),2,1))<109   11582
..?deploy_id=799+and+ascii(substring((database()),2,1))<106   11582
..?deploy_id=799+and+ascii(substring((database()),2,1))=105   11582
..?deploy_id=799+and+ascii(substring((database()),3,1))<79    31
..?deploy_id=799+and+ascii(substring((database()),3,1))<103   11582
..?deploy_id=799+and+ascii(substring((database()),3,1))<91    31
Database name: ci...
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727) Havij
3 / 63
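The injected comparisons above perform a binary search over each character of the database name: a large response (11582 bytes) means the comparison held, a small one (31 bytes) means it failed. Below is a minimal Python sketch of how an analyst could replay the logged equality probes to recover the extracted name; the log excerpt is copied from the slide, while the parsing regex and the size threshold are illustrative assumptions, not part of the original material.

import re

# Log lines from the excerpt above: (injected predicate, response size in bytes).
# A large response (11582) means the predicate held; a small one (31) means it failed.
LOG = [
    ("ascii(substring((database()),1,1))<79", 31),
    ("ascii(substring((database()),1,1))<103", 11582),
    ("ascii(substring((database()),1,1))<91", 31),
    ("ascii(substring((database()),1,1))<97", 31),
    ("ascii(substring((database()),1,1))<100", 11582),
    ("ascii(substring((database()),1,1))=99", 11582),
    ("ascii(substring((database()),2,1))<79", 31),
    ("ascii(substring((database()),2,1))<103", 31),
    ("ascii(substring((database()),2,1))<115", 11582),
    ("ascii(substring((database()),2,1))<109", 11582),
    ("ascii(substring((database()),2,1))<106", 11582),
    ("ascii(substring((database()),2,1))=105", 11582),
]

TRUE_SIZE = 1000  # assumed threshold separating "true" from "false" responses

def recover_name(log):
    """Replay the attacker's equality probes to rebuild the extracted string."""
    probe = re.compile(r"substring\(\(database\(\)\),(\d+),1\)\)=(\d+)")
    chars = {}
    for predicate, size in log:
        m = probe.search(predicate)
        if m and size > TRUE_SIZE:            # an equality probe that came back true
            chars[int(m.group(1))] = chr(int(m.group(2)))
    return "".join(chars[i] for i in sorted(chars))

print(recover_name(LOG))  # -> "ci", matching "Database name: ci..." above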
Example: Debugging an APT Incident

Advanced Persistent Threat (APT)
Severe security breaches manifest over large time periods:
1. Initial compromise: stealthy and inconspicuous
2. Maintenance: periodic access checks
3. Sudden strike: quick and devastating
or
3. Continuous leakage: piecemeal exfiltration under the radar

Analyst questions
- How did the attacker get in?
- How long did the attacker stay under the radar?
- What is the damage?
- Was an insider involved?
- How to detect similar attacks in the future?
- How do we describe the attack?
4 / 63
Incident Response Challenges and the Sobering Reality
Challenges
- Volume: machine-generated data exceeds our analysis capacities
- Heterogeneity: multitude of data and log formats
- Procedure: unsystematic investigations

Reality
- Reliance on incomplete context
- Manual ad-hoc analysis
- UNIX tools (awk, grep, uniq)
- Expert islands
How do we tackle this situation?
5 / 63
Thesis Statement

Hypothesis
Key operational networking tasks, such as incident response and forensic investigations, base their decisions on descriptions of activity that are fragmented across space and time:
- Space: heterogeneous data formats from disparate sources
- Time: discrepancy in expressing past and future activity

Statement
We can design and build a system to attain a unified view across space and time.
[Figure: timelines spanning past, present, and future activity]
6 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
7 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
8 / 63
Basic Network Monitoring
[Figure: passive tap between the Internet and the local network, feeding the monitor]
- Passive tap splits traffic
  - Optical
  - Copper
  - Switch span port
- Monitor receives the full packet stream
  → Challenge: do not fall behind processing packets!
9 / 63
High-Performance Network Monitoring: The NIDS Cluster [VSL+07]
[Figure: the tap feeds a frontend that distributes packets across workers; workers exchange state and send logs to a manager, which the user queries]
10 / 63
The NIDS Cluster
- Contributions
  - Design, prototype, and evaluation of the cluster architecture
  - Bro scripting language enhancements
- Now runs in production at large sites with a 10 Gbps uplink:
  - UC Berkeley (26 workers), 50,000 hosts
  - LBNL (15 workers), 12,000 hosts
  - NCSA (10 × 4-core workers), 10,000 hosts
- Generates follow-up challenges
  - How to archive and process the output of the cluster?
  - How to efficiently support incident response and network forensics?
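For context, the frontend's job is to spread the packet stream over the workers while keeping both directions of a connection on the same node, so each worker sees complete flows. Below is a minimal sketch of such a symmetric flow-hashing scheme; the hash function and worker count are illustrative assumptions, not the cluster's actual dispatch logic.

import hashlib

NUM_WORKERS = 26  # e.g., the UC Berkeley deployment mentioned above

def worker_for(src_ip, src_port, dst_ip, dst_port, proto):
    """Map a packet to a worker such that both directions of a flow agree."""
    # Sort the endpoints so (A -> B) and (B -> A) hash identically.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}|{b[0]}:{b[1]}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_WORKERS

# Both directions of the same connection land on the same worker:
assert worker_for("1.2.3.4", 5555, "5.6.7.8", 80, "tcp") == \
       worker_for("5.6.7.8", 80, "1.2.3.4", 5555, "tcp")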
11 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
12 / 63
Use Case #1: Classic Incident Response
- Goal: quickly isolate scope and impact of a security breach
- Often begins with a piece of intelligence
  - "IP X serves malware over HTTP"
  - "This MD5 hash is malware"
  - "Connections to 128.11.5.0/27 at port 42000 are malicious"
- Analysis style: ad-hoc, interactive, several refinements/adaptations
- Typical operations
  - Filter: project, select
  - Aggregate: mean, sum, quantile, min/max, histogram, top-k, unique
⇒ Bottom-up: concrete starting point, then widen scope
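A minimal sketch of such a bottom-up query over a flat connection log: filter on the concrete intelligence (the subnet and port), then aggregate the matches. The records and field layout are invented for illustration.

from collections import Counter
from ipaddress import ip_address, ip_network

# Hypothetical connection records: (timestamp, orig_ip, resp_ip, resp_port, bytes).
conns = [
    ("2011-10-17T14:35:02", "10.0.0.5", "128.11.5.3", 42000, 1200),
    ("2011-10-17T14:36:10", "10.0.0.7", "128.11.5.9", 42000, 880),
    ("2011-10-17T14:37:55", "10.0.0.5", "192.0.2.1", 80, 500),
]

bad_net = ip_network("128.11.5.0/27")  # "connections to 128.11.5.0/27 at port 42000"

# Filter: select the matching connections, project the fields of interest.
hits = [(ts, orig) for ts, orig, resp, port, _ in conns
        if port == 42000 and ip_address(resp) in bad_net]

# Aggregate: which internal hosts talked to the bad subnet, and how often?
print(Counter(orig for _, orig in hits).most_common())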
13 / 63
Use Case #2: Network Troubleshooting
- Goal: find the root cause of a component failure
- Often no specific hint, merely symptomatic feedback
  - "Email does not work :-/"
- Typical operations
  - Zoom: slice activity at different granularities
    - Time: seconds, minutes, days, ...
    - Space: layer 2/3/4/7, protocol, host, subnet, domain, URL, ...
  - Study time series data of activity aggregates
  - Find abnormal activity
    - "A sudden huge spike in DNS traffic"
  - Use past behavior to determine present impact [KMV+09] and predict the future [HZC+11]
  - Judicious machine learning [SP10]
⇒ Top-down: start broadly, then narrow scope incrementally
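A minimal sketch of the top-down style: aggregate activity into a per-minute time series and flag bins that deviate sharply from the recent baseline before zooming in. The counts, window size, and threshold are illustrative assumptions.

from statistics import median

# Hypothetical per-minute counts of DNS requests seen at the border.
dns_per_minute = [180, 175, 190, 182, 171, 2450, 188, 176]

def spikes(series, window=4, factor=3):
    """Flag bins that exceed `factor` times the median of the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        baseline = median(series[i - window:i])
        if series[i] > factor * baseline:
            flagged.append((i, series[i], baseline))
    return flagged

print(spikes(dns_per_minute))  # -> [(5, 2450, 178.5)]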
14 / 63
Use Case #3: Combating Insider Abuse
- Goal: uncover policy violations by personnel
- Insider attack:
  - Chain of authorized actions, hard to detect individually
  - E.g., data exfiltration:
    1. User logs in to an internal machine
    2. Copies a sensitive document to the local machine
    3. Sends the document to a third party via email
- Analysis procedure: connect the dots
  - Identify the first action: gather and compare activity profiles
    - "Vern accessed 10x more files on our servers today" [SS11]
    - "Ion usually does not log in to our backup machine at 3am"
  - Identify the last action:
    - Filter fingerprints of sensitive documents at the border
    - Reinspect past activity under the new bias
⇒ Relate temporally distant events
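A minimal sketch of the "identify the first action" step: compare today's per-user activity against a historical profile and flag large deviations. The counts and the 10x threshold echo the example above but are otherwise invented.

# Hypothetical daily file-access counts per user on the file servers.
history = {"vern": [52, 48, 61, 55, 49], "ion": [20, 22, 18, 25, 21]}
today = {"vern": 540, "ion": 23}

def unusual(history, today, factor=10):
    """Flag users whose activity today exceeds `factor` times their average."""
    flagged = []
    for user, counts in history.items():
        baseline = sum(counts) / len(counts)
        if today.get(user, 0) > factor * baseline:
            flagged.append((user, today[user], baseline))
    return flagged

print(unusual(history, today))  # -> [('vern', 540, 53.0)]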
15 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
16 / 63
Descriptions of Activity: Bro Event Trace
- Use Bro event trace → descriptions of activity
- Instrumentation: meta events
  - Timestamp
  - Name
  - Size
- Generate from real UCB traffic

Trace Details
- October 17, 2011, 2:35pm, 10 min
- 219 GB
- 284,638,230 packets
- 6,585,571 connections

[Figure: Bro pipeline: packets enter the event engine, events flow to the script interpreter, which emits logs and notifications; the event stream constitutes the workload]
17 / 63
Event Workload of one node (1/26)
[Figure: events per second over the 10-minute trace (left) and ECDF of the per-second event rate (right)]
Estimator   Events/sec   MB/sec
Median      10,760       13.4
Mean        10,370       14.2
Peak        22,460       35
→ Need to support peaks of 10^6 events/sec and 1 GB/sec (roughly the per-node peaks above scaled across all 26 worker nodes)
18 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
19 / 63
Requirements
- Interactivity
  - Security-related incidents are time-critical
- Scalability
  - Distributed system to handle high ingestion rates
  - Aging: graceful roll-up of older data
- Expressiveness
  - Represent arbitrary activity
- Result Fidelity
  - Trade latency for result correctness
- Analytics & Streaming
  - A unified approach to querying historical and live data
20 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
21 / 63
Traditional Views
- Database Management Systems (DBMS)
  - Store first, query later
  + Generic
  – Monolithic
- Data Stream Management Systems (DSMS)
  - Process and discard
  + High throughput
  – No persistence
- Online Transaction Processing (OLTP)
  - Small transactional inserts/updates/deletes
  + Consistency
  – Overhead
- Online Analytical Processing (OLAP)
  - Aggregation over many dimensions
  + Speed
  – Batch loads
22 / 63
Newer Movements
- NoSQL
  + Scalability
  – Flexibility
- MapReduce
  + Expressive
  – Batch processing
- In-memory Cluster Computing
  + Speed
  – Streaming data, initial load
23 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
24 / 63
VAST: Visibility Across Space and Time
VAST
- Visibility
  - Realize interactive data explorations
- Across space:
  - Unify heterogeneous data formats
- Across time:
  - Express past and future behavior uniformly

[Figure: unified timeline spanning past, present, and future]
25 / 63
Bro’s Data Model: Declaration and Instantiation
- Rich-typed: first-class networking types (addr, port, subnet, ...)
- Semi-structured: nested data with container types
Event declaration (simplified)
type connection: record { orig: addr, resp: addr, ... }
event connection_established(c: connection)
event http_request(c: connection, method: string, URI: string)
event http_reply(c: connection, status: string, data: string)
Event instantiation
connection_established({127.0.0.1, 128.32.244.172, ... })
http_request({127.0.0.1, 128.32.244.172, ..}, "GET", "/index.html")
http_reply({127.0.0.1, 128.32.244.172, ..}, "200", "<!DOCTYPE ht..")
http_request({127.0.0.1, 128.32.244.172, ..}, "GET", "/favicon.ico")
http_reply({127.0.0.1, 128.32.244.172, ..}, "200", "\xBE\xEF\x..")
connection_established({127.0.0.1, 128.32.112.224, ... })
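A minimal sketch of how such richly typed, nested events could be represented when ingesting them into an external store; the Python classes are illustrative stand-ins, not VAST's actual data structures.

from dataclasses import dataclass
from ipaddress import IPv4Address

@dataclass
class Connection:          # stand-in for the simplified "connection" record
    orig: IPv4Address
    resp: IPv4Address

@dataclass
class Event:               # a named event with typed arguments
    name: str
    args: tuple

c = Connection(IPv4Address("127.0.0.1"), IPv4Address("128.32.244.172"))
trace = [
    Event("connection_established", (c,)),
    Event("http_request", (c, "GET", "/index.html")),
    Event("http_reply", (c, "200", "<!DOCTYPE ht..")),
]
for e in trace:
    print(e.name, e.args)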
26 / 63
Network-Wide Unified Representation of Activity
27 / 63
Expressing Behavior Between Events [VHM+11]
28 / 63
VAST: Architecture Overview
- Distributed architecture
  - Elasticity via MQ middle layer
  - Exchangeability of components
- DFS: fault-tolerance, replication
- Archive: key-value store
  - Contains serialized events
- Store
  - Partitioned in-memory column-store
  - Cache semantics (e.g., LRU)
  - Indexing via compressed bitmaps

[Figure: ingest and query paths operating on the Store and Archive, which persist to the DFS]
29 / 63
Software Reuse
- Don’t build from scratch unless necessary
- Reuse?
  - Streaming: Spark Streaming
  - Archive: Spark, memcached
  - Query engine: Shark
  - DFS: HDFS, KFS
- Build
  - Store
  - Glue for unified data model
30 / 63
Distributed Ingestion
31 / 63
Ingest
1. Events arrive at the Event Router
   1.1 Assign UUID x
   1.2 Put (x, event) in the archive
   1.3 Forward event to the Indexer
   1.4 Forward event to the Stream Manager
2. Indexer writes the event into a tablet
   - Group related activity
3. Tablet Manager flushes “ripe” tablets based on
   - Reached capacity (bytes or events)
   - Last access
   - Age

[Figure: the Event Router puts events into the Archive and forwards them to the Indexer and Stream Manager; the Indexer writes tablets, and the Tablet Manager flushes ripe tablets to the DFS]
32 / 63
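A minimal sketch of the ingest path described on the slide above: assign a UUID, put the event into the archive, index it into the current tablet, and flush tablets once they are ripe. Component names and the capacity threshold are simplified assumptions.

import uuid

archive = {}          # key-value store: UUID -> serialized event
tablets = []          # in-memory tablets awaiting flush
TABLET_CAPACITY = 2   # assumed flush threshold (events per tablet)

def ingest(event):
    """Route one event: archive it, then index it into the current tablet."""
    x = uuid.uuid4()
    archive[x] = event                        # 1.2: put (x, event) in the archive
    if not tablets or len(tablets[-1]) >= TABLET_CAPACITY:
        tablets.append([])                    # start a fresh tablet
    tablets[-1].append((x, event))            # 1.3: forward to the indexer
    flush_ripe()

def flush_ripe():
    """Flush full ("ripe") tablets; a stand-in for writing them to the DFS."""
    for tablet in tablets[:-1]:
        if tablet:
            print(f"flushing tablet with {len(tablet)} events to DFS")
            tablet.clear()

for ev in ["conn_established", "http_request", "http_reply"]:
    ingest(ev)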
Query
1. User or NIDS issues a query
2. Query Manager
   - Distributes the query to data nodes
   - Spins up new nodes
3. Proxy hits the tablet index
   a. Generates a direct result (as a tablet)
   b. Returns a set of UUIDs → archive lookup
4. Tablet Manager
   - Flushes and loads tablets

[Figure: the Query Manager dispatches the query to the Proxy, which consults in-memory tablets and the Archive; the Tablet Manager flushes, evicts, and loads tablets from the DFS]
33 / 63
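A minimal sketch of the query path on the slide above: consult the tablet index first and fall back to an archive lookup for UUIDs that are not resident in memory. The in-memory structures are illustrative stand-ins for the components in the figure.

# Illustrative stand-ins for the components in the figure above.
archive = {1: {"resp_port": 42000, "resp": "128.11.5.3"},
           2: {"resp_port": 80, "resp": "192.0.2.1"}}
tablet_index = {42000: {1}, 80: {2}}   # resp_port -> set of event UUIDs
resident_tablets = {}                  # UUID -> event, currently in memory

def query(resp_port):
    """Resolve a query via the tablet index, loading misses from the archive."""
    hits = []
    for x in tablet_index.get(resp_port, set()):
        event = resident_tablets.get(x)   # 3a: direct result from an in-memory tablet
        if event is None:
            event = archive[x]            # 3b: UUID -> archive lookup
        hits.append(event)
    return hits

print(query(42000))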
Meeting the Requirements
- Interactivity
  → In-memory cache of tablets
  → (Bitmap) indexing
- Scalability
  → Messaging middle layer (MQ)
  → Distributed architecture
- Expressiveness
  → Data: Bro’s event model
  → Query: rich inter-event relationships
- Result Fidelity
  → Sampling & bootstrapping
- Analytics & Streaming
  → Historical queries: tablet-based storage + archive
  → Live queries: stream processing engine
34 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
35 / 63
Roadmap
1. Identify Building Blocks
   - What existing systems to leverage?
   - What to build? What not?
   → Time Estimate: 1 month
2. Infrastructure
   - Core data structures to represent activity
   - Message-passing middle layer
   → Time Estimate: 1-2 months
3. Ingestion
   - How to handle the high-volume event stream?
   - Provide circular buffer semantics: recent activity in-memory
   → Time Estimate: 2 months
36 / 63
Roadmap
4. Query
   - Express and implement data queries: historical & live
   - Express and implement behavior models
   - Bounding errors: trading accuracy for latency
   → Time Estimate: 3-4 months
5. Testing Usability
   - Bring in early adopters: ICSI, LBNL, NCSA
   - Deploy-measure-tweak cycle: integrate feedback, fix bugs
   → Time Estimate: 1 month
6. Real-World Evaluation
   - Use the system in production for real incidents
   - Learn how effectively it supports incident response & forensics
   → Time Estimate: 2 months
7. Tuning
   - Time: build bitmap indexing on top of the tablet store
   - Space: elevate old activity into higher-level abstractions (aging)
   - Address the lessons learned from the evaluation
   → Time Estimate: 3-4 months
37 / 63
Outline
1. Prior Work: Building a NIDS Cluster
2. Use Cases
3. Workload Characterization
4. Requirements
5. Related Work
6. Architecture
7. Roadmap
8. Summary
38 / 63
Recapitulation
1. Large-scale operational network analysis is ill-supported today
   - No homogeneous representation of activity
   - Dealing with past activity differs from expressing future events
2. We need an integrated platform to better support these tasks
3. Derived requirements based on workload characterization
4. Presented an architecture draft
39 / 63
Thank You
FIN
40 / 63
References I
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems (TOCS), 26(2):1-26, 2008.
A. Colantonio and R. Di Pietro. Concise: Compressed 'n' Composable Integer Set. Information Processing Letters, 110(16):644-650, 2010.
Francesco Fusco, Marc Ph. Stoecklin, and Michail Vlachos. NET-FLi: On-the-fly Compression, Archiving and Indexing of Streaming Network Traffic. Proceedings of the VLDB Endowment, 3:1382-1393, September 2010.
41 / 63
References II
Amir Houmansadr, Ali Zand, Casey Cipriano, Giovanni Vigna, and Christopher Kruegel. Nexat: A History-Based Approach to Predict Attacker Actions. In Proceedings of the 27th Annual Computer Security Applications Conference, ACSAC '11, Orlando, Florida, December 2011.
Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. Detailed Diagnosis in Enterprise Networks. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM '09, pages 243-254, New York, NY, USA, 2009. ACM.
Andrew Lamb. Building Blocks for Large Analytic Systems. In 5th Extremely Large Databases Conference, XLDB '11, Menlo Park, California, October 2011.
42 / 63
References III
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive Analysis of Web-Scale Datasets. Proceedings of the VLDB Endowment, 3(1-2):330-339, September 2010.
Robert Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming, 13(4):277-298, 2005.
Robin Sommer and Vern Paxson. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, SP '10, pages 305-316, Washington, DC, USA, 2010. IEEE Computer Society.
43 / 63
References IV
Malek Ben Salem and Salvatore J. Stolfo. Modeling User Search Behavior for Masquerade Detection. In Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection, RAID '11, Menlo Park, CA, 2011.
Arun Viswanathan, Alefiya Hussain, Jelena Mirkovic, Stephen Schwab, and John Wroclawski. A Semantic Framework for Data Analysis in Networked Systems. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, Boston, MA, 2011. USENIX Association.
44 / 63
References V
Matthias Vallentin, Robin Sommer, Jason Lee, Craig Leres, Vern Paxson, and Brian Tierney. The NIDS Cluster: Scalably Stateful Network Intrusion Detection on Commodity Hardware. In Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection, RAID '07, pages 107-126, Gold Coast, Australia, September 2007. Springer.
Kesheng Wu, Ekow J. Otoo, Arie Shoshani, and Henrik Nordberg. Notes on Design and Implementation of Compressed Bit Vectors. Technical Report LBNL-3161, Lawrence Berkeley National Laboratory, Berkeley, CA, USA, 2001.
Kesheng Wu. FastBit: An Efficient Indexing Technology for Accelerating Data-Intensive Science. Journal of Physics: Conference Series, 16:556-560, 2005.
45 / 63
References VI
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud '10, pages 10-10, Berkeley, CA, USA, 2010. USENIX Association.
46 / 63
Backup Slides
47 / 63
Illustrating Bottom-Up Data Navigation
48 / 63
Illustrating Top-Down Data Navigation
49 / 63
Illustrating Insider Abuse Data Navigation
50 / 63
The Bro Network Security Monitor
- Fundamentally different from other IDS
  - Network analysis platform
  - Policy-neutral at the core
  - Highly stateful

Key components
1. Event engine (core)
   - TCP stream reassembly
   - Protocol analysis
2. Script interpreter
   - “Domain-specific Python”
   - Generates extensive logs

[Figure: packets from the network enter the event engine; events flow to the script interpreter, which produces logs and notifications for the user interface]
51 / 63
Generating and Receiving Bro Events
Broccoli
- C library
- Send/receive Bro events
- Language bindings
  - Ruby
  - Python
  - Perl
→ Anyone can generate/receive events

[Figure: a 3rd-party application uses Broccoli to exchange events with Bro's event engine over the communication layer]
(Broccoli = Bro client communications library)
52 / 63
Publish/Subscribe in the Bro Event Model
[Figure: Bro exchanges events with Broccoli-enabled applications such as Apache, OpenSSH, and other 3rd-party programs]
53 / 63
Expressing Behavior [VHM+11]
- Requirements
  - Analysis over multi-type, multi-variate, timestamped data
  - Analysis over higher-level abstractions
  - Composition of abstractions
  - A wide variety of relationships
- Relationships
  - Causality
  - Partial or total ordering
  - Dynamic changes over time
  - Concurrency
  - Polymorphism
  - Synchronous and asynchronous operations
  - Eventual operations
  - Value dependencies
  - Invariants
  - Basic relations: boolean operators, loops, etc.
54 / 63
Expressing Behavior Between Events [VHM+11]
55 / 63
Inspirations
1. Dremel [MGL+10]
   - In-situ data access
   - Columnar storage
   - Nested data model
2. Bigtable [CDG+08]
   - Sharding: distributed tablets
3. Sawzall [PDGQ05]
   - Aggregators: sample, sum, maximum, quantile, top-k, unique
4. Spark [ZCF+10]
   - In-memory computation
   - Iterative processing
5. FastBit [Wu05]
   - Bitmap indexes
56 / 63
Design Philosophy Touchstones [Lam11]

Storage
- Keep data sorted → reduce seeks, easy random entry
- Shard with access locality → minimize involved nodes
- Store data in columns → don’t waste I/O
- Use append-only disk format → avoid expensive index updates

Compute
- Use disk appropriately → large sequential reads
- Trade CPU for I/O → type-specific, aggressive compression
- Use pipelined parallelism → hide latency
- Ship compute to data → aggregation serving tree

Query
- Make it user-friendly → declarative query interface
- Provide query hooks → support complex analysis
57 / 63
Taxonomy Grammar
taxonomy  ::= typedef* | event+
typedef   ::= name, type
event     ::= name, argument*, attribute*
argument  ::= name, type, attribute*
attribute ::= key, value?
type      ::= basic | domain | complex
basic     ::= bool | count | int | double | string
domain    ::= addr | port | subnet | time | interval
complex   ::= enum | vector | set | map | record
record    ::= argument+
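A minimal sketch of one event description that follows the grammar above, written as plain Python data; the concrete attribute names and values are invented for illustration.

# One event described according to the grammar:
#   event ::= name, argument*, attribute*    argument ::= name, type, attribute*
http_request = {
    "name": "http_request",
    "arguments": [
        {"name": "c", "type": "record", "attributes": []},
        {"name": "method", "type": "string", "attributes": []},
        {"name": "URI", "type": "string", "attributes": [("index", None)]},
    ],
    "attributes": [("priority", "high")],    # attribute ::= key, value?
}

taxonomy = {"typedefs": [], "events": [http_request]}
print(taxonomy["events"][0]["name"])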
58 / 63
Event Reordering in NET-FLi via Locality-Sensitive Hashing (oLSH) [FSV10]
59 / 63
Effect of LSH [FSV10]
60 / 63
Bitmap Indexes
- Column cardinality: # distinct values
- One bitmap b_i for each value i
- Sparse, but compressible
  - WAH [WOSN01]
  - COMPAX [FSV10]
  - Concise [CDP10]
- Can operate on compressed bitmaps
  - No need to decompress
[Figure: example column with values 2, 1, 2, 0, 0, 1, 3, 0, 0, 0 and its bitmap index, one bitmap b0..b3 per distinct value]
61 / 63
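A minimal sketch of an uncompressed bitmap index over the example column from the figure above: one bitmap per distinct value, with queries answered by bitwise operations instead of scanning the data. Compression in the style of WAH or COMPAX is omitted here.

# Column of values as in the figure above: cardinality 4 (values 0..3).
data = [2, 1, 2, 0, 0, 1, 3, 0, 0, 0]

def build_index(column):
    """One bitmap per distinct value; a Python int serves as the bit vector."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

index = build_index(data)

# "value == 1 or value == 3": OR the two bitmaps, then read off the matching rows.
result = index[1] | index[3]
print([row for row in range(len(data)) if result >> row & 1])  # -> [1, 5, 6]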
Bitmap Index Operations
[Figure: combining bitmaps with bitwise AND: b1 & b2 & b3 = r]
62 / 63
Bitmap Index Retrieval
[Figure: the Query Manager resolves index hits to a set of event IDs {i1, i2, i3, ...} and retrieves them from the Archive]
63 / 63