dlp systems: models, architecture and algorithms
Post on 22-Jan-2018
Copyright 2011 Trend Micro Inc.Classification 8/2/2013 1
DLP Systems: Models, Architecture and Algorithms
Liwei Ren, Ph.D, Sr. Architect
Data Security Research, Trend Micro™
May, 2013, UCSC, Santa Cruz, CA
Background:
• Liwei Ren, Data Security Research, Trend Micro™
– Research interests:
• DLP, differential compression, data de-duplication, file transfer protocols, database security, and practical algorithms.
– Education:
• MS/BS in mathematics, Tsinghua University, Beijing
• Ph.D in mathematics, MS in information science, University of Pittsburgh
– Relevant works for this talk:
• Provilla, Inc.: a startup focused on endpoint-based DLP products and solutions. It was co-founded by Liwei and acquired by Trend Micro a few years ago.
• Patents: Liwei holds 10+ patents for DLP, mostly for DLP content inspection techniques.
• Trend Micro™
– Global security software company with headquarters in Tokyo, and R&D centers in Nanjing, Taipei and Silicon Valley.
– One of the top 3 anti-malware vendors
– Pioneer in cloud security
– DLP vendor via Provilla™ acquisition
Agenda
• What is Data Loss Prevention (DLP) ?
• Concepts, Models, Architecture
• Content Inspection Problems
• Practical Algorithms for DLP
• Summary
• References
• Q&A
What Is Data Loss Prevention?
• What is Data Loss Prevention?
– Data loss prevention (DLP) is a data security technology that detects data breach incidents in a timely manner and prevents them by monitoring data in-use (endpoints), in-motion (network traffic), and at-rest (data storage) in an organization’s network.
– Also known as Data Leak Prevention (DLP), Information Leak Prevention (ILP), or Information Leak Detection and Prevention (ILDP).
What Is Data Loss Prevention?
• A Few Elements of a DLP system:
– WHAT data to protect?
– WHO leaks data?
– HOW is the data leaked?
– WHERE to protect data?
– WHAT actions to take?
Concepts, Models and Architecture
• WHAT data to protect?
• WHO causes data leaks?
External Hackers
Concepts, Models and Architecture
Three Data States:
Concepts, Models and Architecture
• Data-in-use:
• Data-in-motion:
Concepts, Models and Architecture
• Data-at-rest at risk:
Concepts, Models and Architecture
• DLP for data-in-use and data-in-motion:
• A conceptual view!
Concepts, Models and Architecture
• DLP for data-in-use and data-in-motion:
• A technical view!
Concepts, Models and Architecture
• DLP Model for data-in-use and data-in-motion:
– If DATA flows from SOURCE to DESTINATION via CHANNEL, the system takes ACTIONs
– DATA specifies what the confidential data is
– SOURCE can be a user, an endpoint, an email address, or a group of them
– DESTINATION can be an endpoint, an email address, a group of them, or simply the external world
– CHANNEL indicates the data leak channel, such as USB, email, or network protocols
– ACTION is the action that the DLP system takes when an incident occurs
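The rule model above can be sketched as a small data structure plus a matcher. This is only an illustration: the class names `FlowRule` and `FlowEvent`, the use of glob patterns for SOURCE and DESTINATION, and the sample labels and addresses are all assumptions, not part of the talk.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class FlowRule:
    """One rule: if DATA flows from SOURCE to DESTINATION via CHANNEL, take ACTIONs."""
    data: str          # label of a confidential-data definition (hypothetical)
    source: str        # glob pattern for a user, endpoint, or email address
    destination: str   # glob pattern; "*" stands for the external world
    channel: str       # e.g. "usb", "email", "http"
    actions: tuple     # e.g. ("block", "alert")


@dataclass(frozen=True)
class FlowEvent:
    """An observed data movement, after content inspection has labeled the data."""
    data_labels: frozenset
    source: str
    destination: str
    channel: str


def actions_for(event, rules):
    """Collect the actions of every rule that the event matches."""
    triggered = []
    for rule in rules:
        if (rule.data in event.data_labels
                and fnmatch(event.source, rule.source)
                and fnmatch(event.destination, rule.destination)
                and event.channel == rule.channel):
            triggered.extend(rule.actions)
    return triggered


rules = [FlowRule("customer-ssn", "*@corp.example", "*", "email", ("block", "alert"))]
event = FlowEvent(frozenset({"customer-ssn"}), "alice@corp.example",
                  "bob@mail.example", "email")
print(actions_for(event, rules))
```

The content inspection problems discussed later decide which labels end up in `data_labels`; the rule itself only combines that result with where the data is going.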
Concepts, Models and Architecture
• DLP for data-at-rest:
Concepts, Models and Architecture
• DLP Model for data-at-rest:
– If DATA resides at SOURCE, the system takes ACTIONs
– DATA specifies what the sensitive data (which has the potential for leakage) is
– SOURCE can be an endpoint, a storage server, or a group of them
– ACTION is the action that the DLP system takes when confidential data is identified at rest.
Concepts, Models and Architecture
• Typical DLP systems:
– DLP Management Console
– DLP Endpoint Agent
– DLP Network Gateway
– Data Discovery Agent (or Appliance)
Concepts, Models and Architecture
• Typical DLP system architecture:
Agenda
• What is Data Loss Prevention (DLP) ?
• Concepts, Models, Architecture
• Content Inspection Problems
• Practical Algorithms for DLP
• Summary
• References
• Q&A
Content Inspection Problems
• Two fundamental problems for a DLP system:
• It is a pair of problems that always come together:
• One determines data sensitivity based on what has been defined.
Content Inspection Problems
• Four typical approaches for <defining, determining> sensitive data in a DLP system:
1. Document fingerprinting
2. Database record fingerprinting
3. Multiple Keyword matching
4. Regular expression matching
Content Inspection Problems
• Document fingerprinting:
• A technique for identifying modified versions of known documents
• Problem Definition (Model 1):
– Let S = {T1, T2, …, Tn} be a set of known texts
– Given a query text T, one needs to determine whether there exists at least one document t ∈ S such that T and t share a significant amount of common textual content. When multiple documents are returned, they are ranked by how much common content they share.
Content Inspection Problems
• An alternative model (Model 2):
– Let S = {T1, T2, …, Tn} be a set of known texts
– Given a query text T and a threshold X%, one needs to determine whether there exists at least one text t ∈ S such that SIM(T,t) ≥ X%, where SIM() is a function that measures the similarity between two texts.
• Multiple documents are ranked by their similarity percentages.
Content Inspection Problems
• Database record fingerprinting:
– A technique for identifying sensitive data records within a text.
– Also known as Exact Match in the DLP field
• Use Case:
– Several personal data records of <SSN, Phone#, address> are included in a text; we want to extract all of these records from the text to determine the sensitivity of the file.
Content Inspection Problems
An example: a text contains a few data records:
Hhhhhdds ghghg 178-76-6754 ggkjkfddfdkkkk879-45-6785kjkjjk 43 Atword Street, Pittsburgh, PA 15260 kllkll 412-876-6789 kjkjjkj 76 Parkview Ave, Sunnyvale, CA 94086 hhsjskkdhjhjhj 408-780-8876hjhjkjkjjj 159-87-8965 hjhjhjhjmnnmnxcbls w243 54y45 wefddewdddw3n nn xxxxxxxxxx
SSN          Phone #       Address
178-76-6754  412-876-6789  43 Atword Street, Pittsburgh, PA 15260
159-87-8965  408-780-8876  76 Parkview Ave, Sunnyvale, CA 94086
……           ……            ……
Content Inspection Problems
• Problem Definition (Model 3):
– Let S = {R1, R2, …, Rn} be a set of known data records from the same table.
– Given any text T, one needs to extract all records or sub-records from T, where the record cells may appear anywhere within the text, in any order.
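To make Model 3 concrete, here is a deliberately naive sketch that checks every cell of every known record against the text. The function name `extract_records` and the `min_cells` threshold for reporting sub-records are assumptions for illustration; a production system would need an indexed, scalable approach rather than this linear scan.

```python
def extract_records(text, records, min_cells=2):
    """Return (record, matched_cells) for each known record that has
    at least `min_cells` of its cells somewhere in `text`."""
    hits = []
    for record in records:
        # A cell counts as found wherever it appears; order does not matter.
        matched = tuple(cell for cell in record if cell in text)
        if len(matched) >= min_cells:
            hits.append((record, matched))
    return hits


TEXT = ("Hhhhhdds ghghg 178-76-6754 ggkjkfddfdkkkk879-45-6785kjkjjk "
        "43 Atword Street, Pittsburgh, PA 15260 kllkll 412-876-6789 kjkjjkj "
        "76 Parkview Ave, Sunnyvale, CA 94086 hhsjskkdhjhjhj 408-780-8876"
        "hjhjkjkjjj 159-87-8965 hjhjhjhjmnnmnxcbls")

RECORDS = [
    ("178-76-6754", "412-876-6789", "43 Atword Street, Pittsburgh, PA 15260"),
    ("159-87-8965", "408-780-8876", "76 Parkview Ave, Sunnyvale, CA 94086"),
]

for record, matched in extract_records(TEXT, RECORDS):
    print(len(matched), "of", len(record), "cells found for SSN", record[0])
```

Requiring more than one matched cell is one simple way to avoid flagging a text just because a single value, such as a phone number, happens to appear in it.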
Content Inspection Problems
• Problem Definition for Keyword Match:
– Let S = {K1, K2, …, Kn} be a dictionary of keywords.
– Given any text T, one needs to identify all keyword occurrences in T.
• Problem Definition for RegEx Match:
– Let S = {P1, P2, …, Pm} be a set of RegEx patterns.
– Given any text T, one needs to identify all pattern instances in T.
Easy problems?
– Not at all! For large n and m, one will have performance issues.
– That’s the problem of scalability.
– Scalable algorithms must be provided.
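The scalability concern is easiest to see in a baseline. The sketch below (function names and the sample SSN pattern are illustrative assumptions) scans the whole text once per keyword and once per pattern, so the work grows linearly with n and m:

```python
import re


def find_keywords(text, keywords):
    """Naive keyword match: one full scan of the text per keyword."""
    hits = []
    for kw in keywords:
        if not kw:
            continue
        start = text.find(kw)
        while start != -1:
            hits.append((start, kw))
            start = text.find(kw, start + 1)
    return sorted(hits)


def find_patterns(text, patterns):
    """Naive RegEx match: one full scan of the text per named pattern."""
    hits = []
    for name, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            hits.append((match.start(), name, match.group()))
    return sorted(hits)


print(find_keywords("top secret: secret plan", ["secret", "plan"]))
print(find_patterns("SSN 178-76-6754 and 159-87-8965",
                    {"ssn": r"\b\d{3}-\d{2}-\d{4}\b"}))
```

With a dictionary of thousands of keywords, this per-keyword rescanning is exactly what the multi-pattern algorithms later in the talk avoid.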
Agenda
• What is Data Loss Prevention (DLP) ?
• Concepts, Models, Architecture
• Content Inspection Problems
• Practical Algorithms for DLP
• Summary
• References
• Q&A
Practical Algorithms for DLP
• We investigate some algorithms for 2 problems:
1. Document fingerprinting
2. Multiple keyword matching
Assumption: without loss of generality, a text T is a sequence of UTF-8 characters.
Document Fingerprinting Algorithms
• Let's investigate algorithmic solutions for Model 2 (document fingerprinting).
• Analysis for Solution:
1. We need to construct the function SIM(T,t). For example:
– SIM(T,t) = |T ∩ t| / Min(|T|,|t|) based on common sub-strings.
2. An Obvious Challenge:
– If n is large, say, on the scale of millions, we cannot compute SIM(T, Tk) one by one to find the t that satisfies SIM(T,t) ≥ X%
– We need an approach that can identify possible candidates quickly.
3. General search engines like Google use keywords to index/identify documents. Should we? No: there are too many keywords, and keywords are language-dependent.
4. So, which features can we use for indexing/searching?
– One answer is document fingerprints.
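As a rough illustration of points 1 and 2, the sketch below approximates "common sub-strings" with sets of fixed-length character k-grams and then answers a Model 2 query by brute force. The k-gram approximation, the value k = 8, and the function names are assumptions; the slide does not specify how |T ∩ t| is computed.

```python
def shingles(text, k=8):
    """Set of all length-k substrings of text (a stand-in for common sub-strings)."""
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def sim(a, b, k=8):
    """SIM(T,t) = |T ∩ t| / Min(|T|,|t|), measured over k-gram sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def query(text, known_texts, threshold, k=8):
    """Brute-force Model 2: score every known text, keep those at or above
    the threshold, ranked by similarity (highest first)."""
    scored = [(sim(text, t, k), idx) for idx, t in enumerate(known_texts)]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


known = ["the quick brown fox jumps over the lazy dog",
         "lorem ipsum dolor sit amet consectetur"]
print(query("a quick brown fox jumps over the lazy dog today", known, 0.5))
```

The `query` function computes SIM against every known text, which is precisely the one-by-one scan that does not scale to millions of documents; the fingerprint index is meant to narrow the candidates first.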
Document Fingerprinting Algorithms
• What are document fingerprints?
– A fingerprint is a hash value
– One text has multiple fingerprints
– Uniqueness: two unrelated texts do not share any fingerprints.
– Robustness: fingerprints can survive moderate textual changes.
Document Fingerprinting Algorithms
• How to extract fingerprints from a text?
– Anchoring point:
• A point in the text that can endure moderate changes.
• Its neighborhood (of fixed size) is unique to the text
– We select a few anchoring points to generate fingerprints:
• We generate hash values from their neighborhoods.
• These hash values are the fingerprints
• Samples of anchoring points and their neighborhoods:
ThereareabundantliteraturesonhowtogeneratedifferencebetweentwofilesBasicallytherearetwofundamentalapproachestoattackthisgenericproblemLCSmodelwhereLCSstandsforlargestcommonsubsequenceCalculatethelargestcommonsubsequenceoftwostringFindasequenceofeditoperationsbasedontheLCSsothatonecanapplytheeditoperationstothereferencefiletoconstructthetargetfileBlockmovemodel
Document Fingerprinting Algorithms
• Conclusion : we have a solution that consists of two algorithms and one search technology:
– An algorithm for computing SIM(T,t)
– An algorithm for fingerprint generator FPGEN(T)
– Fingerprint search engine
Document Fingerprinting Algorithms
• Fingerprint generation algorithm 1:
– INPUT: String T
• Select the top M candidate characters based on a score function
– Character frequency n
– Character positions in the text T: P(1), …, P(n)
– $\mathrm{SCORE}(c) = \sqrt{n} \cdot [P(n) - P(1)] / \sqrt{D}$
» Where $D = [P(2) - P(1)]^2 + [P(3) - P(2)]^2 + \cdots + [P(n) - P(n-1)]^2$
• For each selected character c
– Create a hash around the neighborhood at each occurrence
– Sort these hashes
– Select the top N hashes
– These N hashes are fingerprints
– OUTPUT: M*N fingerprints
Note 1: M and N are pre-defined.
Note 2: Two keys of this algorithm are (a) the score function; (b) sorting the hashes.
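The steps of algorithm 1 might be sketched as follows. Several details are assumptions made only so the example runs: the neighborhood is taken as `radius` characters on each side of an occurrence, the hash is a truncated SHA-1, and "top N" is read as the first N hashes in sorted order. None of these choices is stated on the slide.

```python
import hashlib
import math
from collections import defaultdict


def score(positions):
    """SCORE(c) = sqrt(n) * [P(n) - P(1)] / sqrt(D) for sorted occurrence positions."""
    n = len(positions)
    if n < 2:
        return 0.0
    d = sum((b - a) ** 2 for a, b in zip(positions, positions[1:]))
    return math.sqrt(n) * (positions[-1] - positions[0]) / math.sqrt(d)


def fpgen(text, m=3, n_hashes=4, radius=8):
    """Return up to m * n_hashes fingerprints for text."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    # Top M characters by score; ties are broken by the character itself.
    top_chars = sorted(positions, key=lambda c: (-score(positions[c]), c))[:m]
    fingerprints = set()
    for c in top_chars:
        hashes = sorted(
            hashlib.sha1(
                text[max(0, p - radius):p + radius + 1].encode("utf-8")
            ).hexdigest()[:16]
            for p in positions[c]
        )
        # Sorting makes the selection depend on content, not on position.
        fingerprints.update(hashes[:n_hashes])
    return fingerprints


sample = "Data loss prevention detects data breach incidents in a timely manner."
print(sorted(fpgen(sample)))
```

Because the selected hashes are chosen by sorted value rather than by where they occur, an edit elsewhere in the text leaves a fingerprint unchanged as long as its neighborhood and its character's ranking survive.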
Document Fingerprinting Algorithms
• About the score function:
– Why $\sqrt{n}$?
• It measures the frequency of the given character
• The larger the value, the more stable the character is
– Why $[P(n) - P(1)] / \sqrt{D}$?
• It measures the distribution of the given character
• The larger the value, the more evenly distributed and the more stable the character is
• WHY? Think about a constrained optimization problem:
– $\min f(X_1, X_2, \ldots, X_m) = X_1^2 + X_2^2 + \cdots + X_m^2$
» subject to
» $X_1 + X_2 + \cdots + X_m = c$ AND
» $X_k \ge 0,\ k = 1, 2, \ldots, m$
Note: The solution of the optimization problem is $X_k = c/m$, $k = 1, 2, \ldots, m$
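One way to connect the optimization problem back to the score function is to read the variables as the gaps between consecutive occurrences of the character. That identification is an interpretation rather than something stated on the slide; under it, the Cauchy-Schwarz inequality gives the bound directly:

```latex
% Assumption: X_k = P(k+1) - P(k), m = n - 1, c = P(n) - P(1), so D = \sum_k X_k^2.
\[
c^2 = \Bigl(\sum_{k=1}^{m} X_k\Bigr)^2 \le m \sum_{k=1}^{m} X_k^2 = m\,D
\quad\Longrightarrow\quad
\frac{P(n) - P(1)}{\sqrt{D}} = \frac{c}{\sqrt{D}} \le \sqrt{m} = \sqrt{n-1},
\]
% with equality exactly when X_1 = \dots = X_m = c/m, i.e. evenly spaced occurrences.
```

So for a fixed span and frequency, the second factor of the score is largest when the character is spread evenly through the text.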
Document Fingerprinting Algorithms
There are alternative algorithms to construct a fingerprint generation function.
We recently constructed algorithm 2:
– A novel approach based on a rolling hash function H(x);
– It selects anchoring points with a first filter, H(x) = 0 mod p;
– It further selects anchoring points with a heuristic second filter;
– It also employs an asymmetric architecture for fingerprint matching.
Note 1: The anchoring points are better distributed across the text.
Note 2: Two keys of this algorithm are (a) the rolling hash; (b) the asymmetric use of two filters.
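Only the first filter is specified well enough to sketch. The example below slides a polynomial rolling hash over the UTF-8 bytes of the text and keeps the window positions where H(x) = 0 mod p. The window size, base, modulus, and p = 16 are illustrative assumptions, and the heuristic second filter and the asymmetric matching are not reproduced because the slide does not describe them.

```python
def anchor_points(text, window=8, p=16, base=257, mod=(1 << 61) - 1):
    """Return byte offsets of windows whose rolling hash satisfies H(x) = 0 mod p."""
    data = text.encode("utf-8")
    if len(data) < window:
        return []
    high = pow(base, window - 1, mod)   # weight of the byte that leaves the window
    h = 0
    for byte in data[:window]:
        h = (h * base + byte) % mod
    anchors = []
    for start in range(len(data) - window + 1):
        if start > 0:
            # Slide by one byte: drop data[start - 1], append the new last byte.
            h = ((h - data[start - 1] * high) * base + data[start + window - 1]) % mod
        if h % p == 0:
            anchors.append(start)
    return anchors


doc = "There are abundant literatures on how to generate difference between two files."
print(anchor_points(doc))
print(anchor_points("XYZ" + doc))
```

Since each decision depends only on the bytes inside the window, inserting text before a window shifts its anchor but does not remove it, which is the robustness property anchoring points are meant to have.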
Multiple Keyword Match
Essentially, it is a multi-pattern string match problem.
Problem Definition:
– Let S = {P1, P2, …, Pk} be multiple short strings as patterns;
– Given any string T, one needs to identify all pattern occurrences in T.
Multiple Keyword Match
Existing string match algorithms:
Algorithm                    Type
Naïve string match           One pattern
Knuth–Morris–Pratt           One pattern
Boyer-Moore                  One pattern
Boyer-Moore-Horspool         One pattern
Boyer-Moore-Horspool-Raita   One pattern
Rabin-Karp                   Multi-patterns
Aho-Corasick                 Multi-patterns
Wu-Manber                    Multi-patterns
Multiple Keyword Match
Boyer-Moore-Horspool (BMH) Algorithm
Key elements of the algorithm:
– Character comparison can be made from right to left, starting from the end of the pattern.
– Ending Character Heuristics
• Consider that we are pointing to character R[i] and try to compare it with the ending character of P
• Bad character
– If R[i] ≠ P[m] and R[i] is not included in P’s alphabet, then it is safe for the pointer to skip m positions, arriving at R[i+m].
– If R[i] ≠ P[m], R[i] is included in P’s alphabet, and R[i]’s last occurrence within P has distance q from the end of P, then it is safe for the pointer to skip q positions, arriving at R[i+q].
• Good character
– If R[i] = P[m], P is not matched, and R[i] has no other occurrences within P, then it is safe for the pointer to skip m positions, arriving at R[i+m].
– If R[i] = P[m], P is not matched, and R[i]’s last occurrence other than P[m] has distance q from the end of P, then it is safe for the pointer to skip q positions, arriving at R[i+q].
• Matched instance
– If R[i] = P[m] and P is matched, then save the instance.
– It is almost safe to move the pointer to skip m positions, arriving at R[i+m] (only occurrences that overlap the matched instance could be missed).
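The heuristics above collapse into a single shift table in the usual formulation of Horspool's algorithm, sketched below with 0-based indexing. One deliberate difference from the slide: after a match this version still shifts by the table value instead of m, so overlapping occurrences are also reported.

```python
def bmh_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character in pattern[:-1], the distance from its last
    # occurrence to the end of the pattern. Characters absent from the
    # table (including one that only appears as the final character)
    # allow a full shift of m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    i = m - 1                      # text index aligned with the pattern's last char
    while i < n:
        j = 0                      # compare right to left
        while j < m and text[i - j] == pattern[m - 1 - j]:
            j += 1
        if j == m:
            matches.append(i - m + 1)
        i += shift.get(text[i], m)
    return matches


print(bmh_search("abracadabra", "abra"))
```

The shift always depends on the text character under the pattern's last position, which covers both the "bad character" and "good character" cases in one lookup.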
Multiple Keyword Match
• Rabin-Karp Algorithm: hash-based string match
• Rabin-Karp hash function H(S):
– For a given string $S = x_1 x_2 \ldots x_m$ of length m, a hash function can be constructed as:
• $H(S) = (x_1 b^{m-1} + x_2 b^{m-2} + \cdots + x_{m-1} b + x_m) \bmod q$
• Where b is a base number (usually we take b = 256) and q is a big prime number.
– For pattern P, $H(P) = (p_1 b^{m-1} + p_2 b^{m-2} + \cdots + p_{m-1} b + p_m) \bmod q$
– If we denote $R_k = R[k, k+m-1]$, we can derive $H(R_{k+1})$ from $H(R_k)$ at relatively small cost
– $H(R_{k+1}) = ([H(R_k) - r_k b^{m-1}]\, b + r_{k+m}) \bmod q$
– This is an iterative formula, which is a common practice for algorithm optimization
Multiple Keyword Match
• Rabin-Karp hash function:
– The quantity $b^{m-1} \bmod q$ can be pre-calculated to save CPU time.
– For each iteration, we only need 5 arithmetic operations.
• It can be further reduced to 4
• One considers the number $r_k b^{m-1}$
– Horner’s rule
• $H(S) = ((\cdots((x_1 b + x_2) b + x_3) b + \cdots + x_{m-1}) b + x_m) \bmod q$
• Yet another formula for performance tuning
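A short sketch can confirm that the iterative formula and Horner's rule agree. It uses b = 256 as on the slide; the prime q = 1,000,000,007 and the function names are assumptions, and indices are 0-based here rather than 1-based.

```python
B = 256                # base, as suggested on the slide
Q = 1_000_000_007      # a large prime (an illustrative choice)


def horner_hash(s: bytes) -> int:
    """H(S) computed with Horner's rule."""
    h = 0
    for x in s:
        h = (h * B + x) % Q
    return h


def rolling_hashes(r: bytes, m: int):
    """Yield H(R_k) for every length-m window of r, updating iteratively."""
    if m <= 0 or m > len(r):
        return
    high = pow(B, m - 1, Q)            # b^(m-1) mod q, pre-calculated once
    h = horner_hash(r[:m])
    yield h
    for k in range(len(r) - m):
        # H(R_{k+1}) = ([H(R_k) - r_k * b^(m-1)] * b + r_{k+m}) mod q
        h = ((h - r[k] * high) * B + r[k + m]) % Q
        yield h


text = b"abracadabra"
rolled = list(rolling_hashes(text, 4))
direct = [horner_hash(text[k:k + 4]) for k in range(len(text) - 4 + 1)]
print(rolled == direct)
```

Each update costs a constant number of operations regardless of m, whereas recomputing every window with Horner's rule would cost m multiplications per position.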
Multiple Keyword Match
• Rabin-Karp algorithm for multiple patterns:
– Input:
• String R, multiple patterns {P1,…,Pk},
• n = Length(R), mj = Length(Pj), q, b
– Procedure:
• Step 0:
– Let m = Min(mj), j = 1,…,k
– Calculate the number $b^{m-1} \bmod q$
– Calculate all H(Pj[1,…,m]) (j = 1,…,k) and H(R1) by Horner’s rule
• Step 1: Let i = 1
• Step 2: If there exists j in {1,2,…,k} such that H(Pj[1,…,m]) = H(Ri) and Pj = R[i,…,mj+i-1], it is a match; output the instance
• Step 3: i = i + 1
• Step 4: If i > n-m+1, stop
• Step 5: Calculate H(Ri) from H(Ri-1) using the iterative formula
• Step 6: Go to step 2
– Output: All matched instances
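The procedure above can be sketched as follows, with 0-based indexing. One implementation choice goes beyond the slide: the prefix hashes are stored in a dictionary so that Step 2 is a single lookup instead of a scan over all k patterns. The function name and the default q are assumptions.

```python
from collections import defaultdict


def rabin_karp_multi(text: bytes, patterns, b=256, q=1_000_000_007):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    patterns = [p for p in patterns if p]
    if not patterns:
        return []
    m = min(len(p) for p in patterns)      # hash only the first m bytes of each pattern
    n = len(text)
    if n < m:
        return []

    def horner(s):
        h = 0
        for x in s:
            h = (h * b + x) % q
        return h

    by_hash = defaultdict(list)            # prefix hash -> patterns with that prefix hash
    for p in patterns:
        by_hash[horner(p[:m])].append(p)

    high = pow(b, m - 1, q)
    h = horner(text[:m])
    matches = []
    for i in range(n - m + 1):
        if i > 0:
            h = ((h - text[i - 1] * high) * b + text[i + m - 1]) % q
        for p in by_hash.get(h, ()):
            # Verify the full pattern: hashes can collide, and p may be longer than m.
            if text.startswith(p, i):
                matches.append((i, p))
    return matches


print(rabin_karp_multi(b"abracadabra", [b"abra", b"cad", b"ra"]))
```

Because only the first m bytes of each pattern are hashed, a very short minimum pattern length means many windows share a hash with some pattern and trigger full comparisons.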
Multiple Keyword Match
A practical hybrid method:
– BMH or Rabin-Karp
– If k < Magic-number,
• Use BMH k times,
• Otherwise, use Rabin-Karp
– Magic-number = 100 is my practice in DLP products.
Rabin-Karp has its weakness:
• When Min({Length(Pi) | i = 1,2,…,k}) is small, say, less than 4, we have trouble.
• We need to introduce an efficient multiple-pattern match for short patterns.
Multiple Keyword Match
We have a complementary solution to the RK algorithm for handling multiple short patterns
– This is the reverse-trie matching algorithm.
A reverse trie represents a set of keywords; it is especially good for CJK languages in UTF-8 encoding:
root
├─ c ─ b ─ a            (abc)
└─ d ─ c ─┬─ b ─ a      (abcd)
          └─ a          (acd)
The keyword set: {abc,abcd,acd}
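The structure for this keyword set can be built and searched with nested dictionaries, as sketched below. The slide shows only the trie itself, so the matching loop (walking backward from each end position) is an assumption about how such a trie would be used, and the function names are illustrative.

```python
def build_reverse_trie(keywords):
    """Insert each keyword back to front; the empty-string key marks a complete keyword."""
    root = {}
    for kw in keywords:
        node = root
        for ch in reversed(kw):
            node = node.setdefault(ch, {})
        node[""] = kw
    return root


def find_all(text, root):
    """Return (start, keyword) for every keyword occurrence in text."""
    matches = []
    for end in range(len(text)):
        node = root
        i = end
        # Walk backward from position `end` while the trie has a matching edge.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            if "" in node:
                matches.append((i, node[""]))
            i -= 1
    return matches


trie = build_reverse_trie(["abc", "abcd", "acd"])
print(sorted(k for k in trie))          # first-level edges are the last characters
print(find_all("xabcdacd", trie))
```

Each position is examined for at most as many steps as the longest keyword, so very short patterns, which are a weak spot for Rabin-Karp, cost little here.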
Agenda
• What is Data Loss Prevention (DLP) ?
• Concepts, Models, Architecture
• Content Inspection Problems
• Practical Algorithms for DLP
• Summary
• References
• Q&A
Summary
• What DLP is.
• DLP Security Model
• Architecture of a DLP System
• Four Content Inspection Problems
• Two Algorithms for DLP Content Inspection
– Document Fingerprinting
– Multi-keyword matching
References
• Liwei Ren et al., Document fingerprinting with asymmetric selection of anchor points, US patent 8359472
• Liwei Ren et al., Two tiered architecture of named entity recognition engine, US patent 8321434.
• Yingqiang Lin et al., Scalable document signature search engine, US patent 8266150
• Liwei Ren et al., Fingerprint based entity extraction, US patent 7950062
• Liwei Ren et al., Document match engine using asymmetric signature generation, US patent 7860853
• Liwei Ren et al., Match engine for querying relevant documents, US patent 7747642
• Liwei Ren et al., Matching engine with signature generation, US patent 7516130
Q&A
Any questions?
Thank You!
Innovation is not a part time job, and it is not even a full-time job. It’s a life style.