Digital Forensics
Cyber Crimes: The Transformation of Crime in the Information Age
Yong Guan
Associate Director for Research, Information Assurance Center
Associate Professor of Electrical and Computer Engineering
Iowa State University
July 10, 2014
S2ERC Ames 2014
Our Efforts
Cyber Crimes: A painful side effect of innovations in computer and Internet technologies
Increasing criminal activities online
Almost all physical crimes involve digital evidence
Low percentage of cases reported to law enforcement
Our Research Foci:
Digital Forensics: Investigation and Building Accountability
Network and System Security
Assurance Modeling and Risk Analysis
One of the first 7 NSA-designated IA CAEs
Increased data size & urgency
Data centers, Storage, Time
Increased use of encryption
Child pornography case
Increased complexity
ISP, device diversity
Anti-Forensics
Anonymity
Steganography
“Co-Space”
Growing sophistication and stealthiness of cyber criminals!
Massive base of installed infrastructure with insufficient support for security
Paradigm Shift of Incident Response
Identify the cause of the problem vs. fix the problem: which takes priority?
Legal Influence
Evidence presented, examined, and challenged by the jury and the judges in the courtroom
Social Impacts
Concern of negative publicity
Low percentage of cases reported to Law Enforcement
The Uses of Digital Forensic Solutions
Understand the root causes and impacts of incidents and misbehaviors
Internal and 3rd party auditing
Incident response (IT, insurance, healthcare, e-business)
Electronic evidence discovery and recovery
Network and security monitoring
Attack and insider threat attribution
Data analytics
Physical security (device fingerprinting, biometric security)
Countering Anti-Forensics
Evidence principles (legal, social, technical)
Hidden, Low-Profile
Coordinated
Geographically-distributed
Financial loss and societal impacts
Botnets
Malware
Click Frauds
DDoS
Spam/Phishing
Privacy-violation
… …
Recent Trends - Cyber Attacks & Crimes
Problem Definition
Abnormal/malicious activities, and patterns thereof, are often meaningful signs of many security problems.
7/11/2014 7
Super-spreaders
DDoS attack
Spam emails
Worm spreading
Botnet takeover
[Figure: logs and sketches kept at monitoring points]
Process each packet at wire speed
In-memory sketches capture traffic status
Error-bounded measurements enable low-profile attack detection
A time-decaying window model is used to detect ongoing attacks
Scalable to network-wide measurements
Mergeable across multiple monitoring points
Reversible, for identification of problem sources
Research Problems on Security Monitoring and Attribution
Requirements of the Designed Solutions
Network-wide traffic view
Duplicate removal
Mergeable measurements
Super-spreader identification
Space & time limitation
Our Idea - Sketch Design
Group Testing
Cardinality Estimation
Error-correcting Code
Our Sketch
• Sketches: give (ε,δ)-approximations of the cardinalities of super-spreaders in each data stream using limited space and time.
– Mergeable: merging two sketches is equivalent to sketching the merged data streams.
– Reversible: the identities of the super-spreaders can be recovered from the sketch.
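The mergeable property can be illustrated with a much simpler structure than the sketch on these slides. The following is a minimal sketch, assuming a linear-counting bitmap as a stand-in cardinality estimator; the bitmap size `M`, the SHA-256 based slot function, and all function names are illustrative assumptions, not part of the original design.

```python
import hashlib
import math

M = 1024  # bitmap size in bits; an illustrative choice


def _slot(item):
    """Map an item to a bitmap slot using a stable hash."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") % M


def add(bitmap, item):
    """Set the item's bit; a duplicate hits the same bit and changes nothing."""
    return bitmap | (1 << _slot(item))


def merge(b1, b2):
    """Merging two bitmaps is a bitwise OR."""
    return b1 | b2


def estimate(bitmap):
    """Linear-counting estimate: -M * ln(fraction of zero bits)."""
    zeros = M - bin(bitmap).count("1")
    if zeros == 0:
        return float("inf")  # bitmap is saturated
    return -M * math.log(zeros / M)


def sketch_of(stream):
    """Build a bitmap (stored in a Python int) from a stream of items."""
    bitmap = 0
    for item in stream:
        bitmap = add(bitmap, item)
    return bitmap


# Two monitoring points see overlapping (source, destination) pairs.
stream_a = [("10.0.0.1", d) for d in range(100)]
stream_b = [("10.0.0.1", d) for d in range(50, 200)]
merged = merge(sketch_of(stream_a), sketch_of(stream_b))
print(round(estimate(merged)))  # close to the 200 distinct pairs
```

Because the merge is a bitwise OR, the merged bitmap is identical to the bitmap built from the concatenated streams, and repeated packets do not inflate the estimate.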
L layers (groups).
η subgroups for group testing, + 1 subgroup for false-positive (FP) removal.
Counters used in cardinality estimation for each subgroup.
1. Each packet is independently hashed into multiple groups according to the source s. The hash functions are based on the quotient and remainder of s divided by L.
2. In each group, (s,d) is mapped into multiple subgroups according to the 1-bits of the quotient q of s divided by L. An error-correcting code is used to encode q to w(q) before mapping.
3. Each subgroup that (s,d) is mapped to updates its cardinality using the destination d.
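The three update steps can be sketched in a few lines. This is a toy illustration under loud assumptions: the hash family `(i*q + r) mod L`, the values of `L`, `Q_BITS`, and `NUM_HASHES`, and the use of raw bits in place of a real error-correcting code are all hypothetical. Exact Python sets stand in for the cardinality counters, and the extra false-positive-removal subgroup is omitted.

```python
L = 11          # number of layers (groups); a small prime chosen for illustration
Q_BITS = 4      # codeword length; assumes sources satisfy s < L * 2**Q_BITS
NUM_HASHES = 3  # hypothetical number of hash functions


def encode(q):
    """Stand-in for the error-correcting code w(q): raw bits of q, low bit first."""
    return [(q >> b) & 1 for b in range(Q_BITS)]


def layer_of(i, s):
    """Hypothetical i-th hash, built from the quotient and remainder of s divided by L."""
    q, r = divmod(s, L)
    return (i * q + r) % L


def new_sketch():
    """C[a][b] is the set of destinations seen in subgroup b of layer a."""
    return [[set() for _ in range(Q_BITS)] for _ in range(L)]


def update(C, s, d):
    """Apply steps 1-3 to one packet (s, d)."""
    w = encode(s // L)
    for i in range(1, NUM_HASHES + 1):   # step 1: one layer per hash function
        a = layer_of(i, s)
        for b, bit in enumerate(w):      # step 2: subgroups at the 1-bits of w(q)
            if bit:
                C[a][b].add(d)           # step 3: update the cardinality with d
    return C


# Source 27 has quotient 2 and remainder 5, so it lands in layers 7, 9 and 0,
# and in subgroup 1 of each (the only 1-bit of q = 2).
C = new_sketch()
for d in [1, 2, 3, 4, 5, 5]:             # the repeated destination is counted once
    update(C, 27, d)
print(len(C[7][1]), len(C[9][1]), len(C[0][1]))
```

A real deployment would replace the sets with small probabilistic counters so that the memory per subgroup stays fixed.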
Proposed Approach
Decoding example (8th layer):
Bt[*,*] =
1 1 0 0 0 1 1 0 0
0 1 0 0 0 0 0 0 1
0 0 1 0 1 0 0 0 0
0 0 1 0 0 0 0 0 0
1 0 0 1 1 0 1 0 0
1 1 0 1 1 1 1 0 0
0 0 0 1 1 0 0 0 0
0 0 0 0 0 1 0 1 0
0 0 1 1 0 0 0 0 0
W(y) = 000001010 (the 8th row of Bt)
y = 0010. y is the quotient of the super-spreader in this group with high probability.
a = 1000. The layer number is also used to recover the super-spreader's ID.
Try each of the hash functions on the decoded y to obtain the super-spreader candidate x.
[Figure: layers (groups) by subgroups]
Create a 2D binary matrix from C[*,*,*]: test each subgroup C[a,b,*] in each layer/group to see if its cardinality is larger than the threshold. If yes, set B[a,b]=1; otherwise set B[a,b]=0.
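The thresholding step can be written directly. This is a minimal sketch, assuming the per-subgroup cardinality estimates have already been computed from the counters; `C_est`, its values, and the threshold are hypothetical.

```python
def binary_matrix(C_est, threshold):
    """Return B with B[a][b] = 1 where the estimated cardinality exceeds the threshold."""
    return [[1 if c > threshold else 0 for c in row] for row in C_est]


# Hypothetical cardinality estimates: 3 layers with 4 subgroups each.
C_est = [
    [12, 450, 8, 3],
    [510, 7, 2, 495],
    [1, 0, 9, 4],
]
B = binary_matrix(C_est, threshold=100)
rows_as_bits = ["".join(map(str, row)) for row in B]
print(rows_as_bits)  # ['0100', '1001', '0000']
```

Each non-zero row of B is then treated as a received codeword and decoded to recover a candidate quotient.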
A Snapshot of Our Recent and On-going Efforts
Reversible Sketches
Coding Theory
Dynamic Membership Query
Time-decaying Window
Hash Functions: Bloom Filters, Hash Tables
Super-spreader Detection
PCA-based Traffic Anomaly
Click frauds
Traffic Activity Graph Analysis
Using the low-rank properties
Low-rank Matrix Approximation
Persistent attacks
Botnet C&C Communication
Heavy-Change Detection
Entropy and Distribution Property Tests
Linear Algebra for Matrix Approximation
Social Graph Analysis
E-evidence Imaging and Recovery
File and FS analysis
Imaging (SSD, mobile platforms)
File Carving
Family & Friends, Businesses, Activists, Media, Military & Law Enforcement
Anonymity <--> Accountability
Anonymous systems: the ring of Gyges in the cyber world.
Well-known online services
Tor
Anonymizer
Use of Tor
Wikileaks
Threatening emails/phone calls
German Child Porn, 2006
Darknet – Silk Road, summer of 2011
The Design of Accountable Anonymity
A More Recent Effort
Security & Privacy, and Forensics of Medical Devices
Shelby Kobes
Challenges in Security Monitoring and Forensic Analytics
Internet/cellular service providers have to measure and analyze network traffic:
Maintenance: Equipment Failures, Vendor Implementation Errors, Software Bugs
Usage monitoring: Flash Crowds, Large File Transfers, Terms-of-service Abuse
Security: Online Fraud Activities, Malware Spreading, DDoS Attacks
Network-wide Traffic Anomaly
Our algorithm provides theoretical bounds for the PCA-based traffic anomaly detection.
The space requirement, the communication cost, and other resources can be optimized over a distributed network monitoring environment.
Yang LIU, Linfeng ZHANG, and Yong GUAN. A Distributed Data Streaming Algorithm for Network-wide Traffic Anomaly Detection. SIGMETRICS Perform. Eval. Rev. 37, 2, July 2009.
Yang LIU, Linfeng ZHANG, and Yong GUAN. Sketch-based Streaming PCA Algorithm for Network-wide Traffic Anomaly Detection. ICDCS 2010.
Super Spreader (malware spreading)
A new reversible sketch to aggregate traffic information for super-spreader detection
Running time for the sketch updating is near-optimal
The number of aggregated flows achieves the lower bound.
Yang LIU, Wenji CHEN, and Yong GUAN. A Fast Sketch for Aggregate Queries over High-Speed Network Traffic. INFOCOM 2012.
Long Duration Flow of Botnet C&Cs
We propose a data streaming algorithm for tracking long duration flows (LDFs) in a high-speed network, which detects LDFs with only a few false negatives and no false positives.
Our algorithm provides the strongest error bound for flow duration estimation, which is optimal for this problem.
The running time to process each packet in our algorithm is constant, regardless of the error bound.
Yang LIU, Wenji CHEN, and Yong GUAN. False Positive or False Negative: Data Streaming Algorithms for Tracking Long Duration Flows. Submitted to IEEE Transactions on Parallel and Distributed Systems (TPDS).
Duplicate Detection for DoS and Attack Pattern Analysis
We propose a novel data structure using Cuckoo hashing in a time-decaying window model.
We introduce a new algorithm to maintain time information for each item.
Our data structure is near-optimal in both space and running time.
Yang LIU, Wenji CHEN, and Yong GUAN. Near-optimal Approximate Membership Query over Time-decaying Windows. INFOCOM 2013.
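The problem being solved here (has this item already appeared within the recent window?) can be stated with an exact baseline. The class below is a minimal sketch, assuming a plain dictionary of timestamps rather than the Cuckoo-hashing structure from the paper; the class name, method names, and window length are illustrative assumptions, and the linear-time expiry scan is exactly the cost a near-optimal structure avoids.

```python
class WindowedSeen:
    """Exact duplicate detector over a sliding time window (illustrative baseline)."""

    def __init__(self, window):
        self.window = window   # window length, in the same units as the timestamps
        self.last_seen = {}    # item -> timestamp of its most recent arrival

    def _expire(self, now):
        """Drop items whose most recent arrival has fallen out of the window."""
        stale = [k for k, t in self.last_seen.items() if now - t >= self.window]
        for k in stale:
            del self.last_seen[k]

    def observe(self, item, now):
        """Return True if item was already seen within the window, then record it."""
        self._expire(now)
        duplicate = item in self.last_seen
        self.last_seen[item] = now
        return duplicate


det = WindowedSeen(window=10)
results = [
    det.observe("pkt-A", 0),   # first arrival
    det.observe("pkt-A", 5),   # repeat inside the window
    det.observe("pkt-B", 6),   # different item
    det.observe("pkt-A", 20),  # earlier arrivals have expired
]
print(results)  # [False, True, False, False]
```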
Reversible Sketches
Motivation: Change Detection, Super Spreaders, etc.
Problem: Aggregate Queries, but difficult to identify the root causes of the alarm
[Figure: packet stream (f1, s1), (f2, s2), …, (fi, si) feeding a sketch with dimensions ℓ and 1+log(n/ℓ); each updated counter is incremented by si]
Each packet is hashed into multiple rows.
At each row, we update multiple counters to maintain enough information to recover keys later.
Running time for the sketch updating is near-optimal.
The number of aggregated flows achieves the lower bound of the heavy-change problem.
Can be combined with other aggregate queries (e.g., super-spreader detection) to improve their efficiency and reliability.
Yang LIU, Wenji CHEN, and Yong GUAN. A Fast Sketch for Aggregate Queries over High-Speed Network Traffic. INFOCOM 2012.
Yang LIU, Wenji CHEN, and Yong GUAN. Identifying High-Cardinality Hosts from Network-wide Traffic Measurements. IEEE CNS 2013.
f({e2}) = 25, KPI: 93, 0.16
f({e2,e4}) = 3, KPI: 62, 1
f({e2,e9}) = 8, KPI: 88, 1
f({e2,e11}) = 75, KPI: 91, 1
f({e3,e11}) = 12, KPI: 80, 1
f({e2,e3,e9}) = 7, KPI: 79, 0.47
f({e2,e3,e11}) = 13, KPI: 78, 1
f({e2,e3,e9,e11}) = 8, KPI: 74, 1
f({e2,e3}) = 15, KPI: 81, 0.35
f({e3}) = 0, 0
f({e3,e9}) = 0, 0
f({e9,e11}) = 0, 0
f({e2,e9,e11}) = 0, 0
f({e3,e9,e11}) = 0, 0
Node labels: D2, F1, F2, G1
Forecast for a subset of events. Suppose one event has been
observed, and you are interested in knowing if another event will
occur by the end of the process instance. Suppose you have
observed that event e2 has occurred. What is the probability that
event e3 will occur by the end of the process instance?
P(e3 is in end event log | e2 is in end event log) = (15+7+13+8) / (25+3+43+8+75) = 43/154 ≈ 0.28
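The same forecast can be computed mechanically from the end-event-set counts. This is a minimal sketch; the dictionary simply transcribes the non-zero f(...) values listed earlier, and the function name `conditional` is an illustrative assumption.

```python
from fractions import Fraction

# Non-zero end-event-set counts f(...), transcribed from the values listed earlier.
f = {
    frozenset({"e2"}): 25,
    frozenset({"e2", "e4"}): 3,
    frozenset({"e2", "e9"}): 8,
    frozenset({"e2", "e11"}): 75,
    frozenset({"e3", "e11"}): 12,
    frozenset({"e2", "e3"}): 15,
    frozenset({"e2", "e3", "e9"}): 7,
    frozenset({"e2", "e3", "e11"}): 13,
    frozenset({"e2", "e3", "e9", "e11"}): 8,
}


def conditional(f, target, given):
    """P(target in end event set | given in end event set), from raw counts."""
    with_given = sum(n for events, n in f.items() if given in events)
    with_both = sum(
        n for events, n in f.items() if given in events and target in events
    )
    return Fraction(with_both, with_given)


p = conditional(f, "e3", "e2")
print(p, float(p))  # 43/154, roughly 0.28
```

The denominator sums every end event set containing e2 (154 instances); the numerator keeps only those that also contain e3 (43 instances).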
A manager might ask herself: what is the probability that a car is
made with a bad weld knowing that robot access control was
blocked? The manager may have a second production line
model based on the KPI of dollar cost, and knows that a bad
weld is very expensive to fix after the car is produced, and is
curious to know how blocked robot access is associated with
bad welds. She can check the probability that e3 occurs with
other events as well to determine any high-level probabilities,
and make adjustments in the production line that reduce her
dollar cost in the second model.
P(D2 is final event set) = 15 / (15+7+13+8)
P(F1 is final event set) = 7 / (15+7+13+8)
P(F2 is final event set) = 13 / (15+7+13+8)
P(G1 is final event set) = 8 / (15+7+13+8)
POMM. The POMM allows us to model
the conditional probabilities on the
nodes. We give a short example here.
One property of POMMs is that local
conditional probabilities can be multiplied
to produce other conditional probabilities.
This can reduce computation time for
computing marginal probabilities.
P(F1|D2)=(7+8) / (15+7+8+13)
P(G1|D2)=8 / (15+7+8+13)
P(G1|F1)=8 / (7+8)
P(G1|F1)*P(F1|D2)=P(G1|D2).
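The multiplication property can be checked with exact fractions. This is a minimal sketch; the end counts come from the example, while the edge structure (D2 leading to F1 and F2, both leading to G1) and the helper names are assumptions made for illustration.

```python
from fractions import Fraction

# Process instances ending at each node, from the example.
end_count = {"D2": 15, "F1": 7, "F2": 13, "G1": 8}

# Assumed partial order: D2 -> F1 -> G1 and D2 -> F2 -> G1.
descendants = {
    "D2": {"F1", "F2", "G1"},
    "F1": {"G1"},
    "F2": {"G1"},
    "G1": set(),
}


def through(node):
    """Instances that pass through node: those ending there or at a descendant."""
    return end_count[node] + sum(end_count[d] for d in descendants[node])


def p_reach(child, parent):
    """P(child | parent) = instances through child / instances through parent."""
    return Fraction(through(child), through(parent))


lhs = p_reach("G1", "F1") * p_reach("F1", "D2")  # (8/15) * (15/43)
rhs = p_reach("G1", "D2")                        # 8/43
print(lhs, rhs, lhs == rhs)
```

The intermediate count for F1 cancels in the product, which is why the local conditional probabilities compose without recounting instances.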
Attack Impact Analysis and Assurance Modeling
Given G(V,E) with |V| = n, |E| = m, m >> n.
Edge Sparsification: Create an approximation G’ of G such that:
G’ has fewer edges than G (sparsity),
G’ preserves certain properties of G,
and computing on G’ is much cheaper than computing on G.
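As a rough illustration of the idea, the simplest sparsifier keeps each edge independently with probability p and rescales the surviving weights by 1/p, so that every cut value is preserved in expectation. This is a minimal sketch of that uniform-sampling scheme only; it is not claimed to be the construction used in this work, it carries no worst-case guarantee, and the function names, seed, and toy graph are assumptions.

```python
import random


def sparsify(edges, p, seed=0):
    """Keep each weighted edge (u, v, w) with probability p, rescaling kept weights by 1/p."""
    rng = random.Random(seed)
    return [(u, v, w / p) for (u, v, w) in edges if rng.random() < p]


def cut_value(edges, S):
    """Sum of weights of edges with exactly one endpoint in S."""
    return sum(w for (u, v, w) in edges if (u in S) != (v in S))


# A dense toy graph: the complete graph on 40 nodes with unit weights (780 edges).
n = 40
G = [(u, v, 1.0) for u in range(n) for v in range(u + 1, n)]
G_prime = sparsify(G, p=0.25)

S = set(range(n // 2))
print(len(G), len(G_prime))
print(cut_value(G, S), cut_value(G_prime, S))  # exact value vs. sampled estimate
```

Practical sparsifiers sample edges non-uniformly (for example, by edge importance) to obtain guarantees for all cuts at once.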
G
cut-value
max-flow
K-connectivity
conductance
shortest paths
Spectrum of Laplacian
Component-based structural prop.
distributions
Cuts: Cut-value C = sum of weights of edges with one end in S and another in V\S.
Conductance: For a cut (S,T), T=V\S,
For the graph G,
Conductance of S measures how hard it is to leave S when taking a random walk on the graph.
Spectrum of Laplacian matrix: Normalized Laplacian matrix ℒ = D^{-1/2} L D^{-1/2}, where L = D − A. Let the eigenvalues (spectrum) of ℒ be λ1, λ2, …, λn; then 0 = λ1 ≤ λ2 ≤ … ≤ λn.
Multiplicity of eigenvalue zero = # of connected components
When G is connected, λ2 > 0.
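Two of these properties are easy to compute on a small graph. The following is a minimal sketch that assumes the common definition of conductance, cut(S, T) / min(vol(S), vol(T)) with vol the sum of weighted degrees, which may differ from the definition intended on the slide. It counts connected components by graph search rather than by eigen-decomposition, relying on the stated fact that this count equals the multiplicity of eigenvalue zero. All names and the toy graph are illustrative.

```python
from collections import defaultdict


def build(edges):
    """Build an undirected weighted adjacency map from (u, v, w) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return dict(adj)


def volume(adj, S):
    """Sum of weighted degrees of the nodes in S."""
    return sum(sum(adj[u].values()) for u in S)


def conductance(adj, S):
    """Assumed definition: cut(S, T) / min(vol(S), vol(T)), with T = V minus S."""
    S = set(S)
    T = set(adj) - S
    cut = sum(w for u in S for v, w in adj[u].items() if v in T)
    return cut / min(volume(adj, S), volume(adj, T))


def num_components(adj):
    """Count connected components with an iterative depth-first search."""
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(v for v in adj[u] if v not in seen)
    return count


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
adj = build(edges)
print(conductance(adj, {0, 1, 2}))        # 1/7: one bridge edge, volume 7 on each side
print(num_components(adj))                # 1
print(num_components(build(edges[:-1])))  # 2 once the bridge is removed
```

The low conductance of one triangle reflects that a random walk started inside it rarely crosses the single bridge.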
Analyzing G’ instead of G saves a lot of time and allows run-time analysis.
Graph Sparsification/Streaming for Complex System Analysis
What information is at risk in medical records and devices
Some of the data stored in hospital computers that could be used by a hacker:
Surgical history
Obstetric history
Medications
Allergies
Family history
Social history
Habits
Immunization history
Prescriptions
Test results
Current Health
Device Information
Insurance Information
Medical X-rays
Personal Device Function