DLP Systems: Models, Architecture and Algorithms

Liwei Ren, Ph.D, Sr. Architect

Data Security Research, Trend Micro™

May, 2013, UCSC, Santa Cruz, CA

Background:

• Liwei Ren, Data Security Research, Trend Micro™

– Research interests:

• DLP, differential compression, data de-duplication, file transfer protocols, database security, and practical algorithms.

– Education:

• MS/BS in mathematics, Tsinghua University, Beijing

• Ph.D in mathematics, MS in information science, University of Pittsburgh

– Relevant works for this talk:

• Provilla, Inc.: a startup focusing on endpoint-based DLP products and solutions. It was co-founded by Liwei and acquired by Trend Micro a few years ago.

• Patents: Liwei holds 10+ patents for DLP, mostly for DLP content inspection techniques.

• Trend Micro™

– Global security software company with headquarters in Tokyo and R&D centers in Nanjing, Taipei and Silicon Valley.

– One of the top 3 anti-malware vendors

– Pioneer in cloud security

– DLP vendor via the Provilla™ acquisition

Agenda

• What is Data Loss Prevention (DLP) ?

• Concepts, Models, Architecture

• Content Inspection Problems

• Practical Algorithms for DLP

• Summary

• References

• Q&A

What Is Data Loss Prevention?

• What is Data Loss Prevention?

– Data loss prevention (a.k.a. DLP) is a data security technology that detects data breach incidents in a timely manner and prevents them by monitoring data in use (endpoints), in motion (network traffic), and at rest (data storage) in an organization’s network.

– A.k.a. Data Leak Prevention (DLP), Information Leak Prevention (ILP) or Information Leak Detection and Prevention (ILDP).

What Is Data Loss Prevention?

• A Few Elements of a DLP system:

– WHAT data to protect?

– WHO leaks data?

– HOW is the data leaked?

– WHERE to protect data?

– WHAT actions to take?

Concepts, Models and Architecture

• WHAT data to protect?

• WHO causes data leaks?

External Hackers

Three Data States:

• Data-in-use:

• Data-in-motion:

• Data-at-rest at risk:

• DLP for data-in-use and data-in-motion:

• A conceptual view!

• DLP for data-in-use and data-in-motion:

• A technical view!

• DLP Model for data-in-use and data-in-motion:

– If DATA flows from SOURCE to DESTINATION via CHANNEL, the system takes ACTIONs

– DATA specifies what the confidential data is

– SOURCE can be a user, an endpoint, an email address, or a group of them

– DESTINATION can be an endpoint, an email address, a group of them, or simply the external world

– CHANNEL indicates the data leak channel, such as USB, email, network protocols, etc.

– ACTION is the action that needs to be taken by the DLP system when an incident occurs
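The model above reads naturally as a rule evaluated against each observed data flow. The following is a minimal sketch under stated assumptions: the names `Event`, `Rule` and `evaluate` are hypothetical, an empty set is taken to mean "any", and the DATA element is reduced to a plain predicate standing in for real content inspection.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    """One observed data flow (hypothetical structure for illustration)."""
    source: str        # a user or an endpoint
    destination: str   # an endpoint, an email address, or "external"
    channel: str       # e.g. "usb", "email", "http"
    content: str       # the data being moved


@dataclass(frozen=True)
class Rule:
    """IF DATA flows from SOURCE to DESTINATION via CHANNEL THEN ACTIONs."""
    is_sensitive: Callable[[str], bool]  # DATA: content inspection predicate
    sources: frozenset                   # SOURCE; empty set means "any"
    destinations: frozenset              # DESTINATION; empty set means "any"
    channels: frozenset                  # CHANNEL; empty set means "any"
    actions: tuple                       # ACTIONs, e.g. ("block", "log")

    def matches(self, event: Event) -> bool:
        def allowed(value, members):
            return not members or value in members

        return (allowed(event.source, self.sources)
                and allowed(event.destination, self.destinations)
                and allowed(event.channel, self.channels)
                and self.is_sensitive(event.content))


def evaluate(rules, event):
    """Collect the actions of every rule triggered by the event, in rule order."""
    actions = []
    for rule in rules:
        if rule.matches(event):
            actions.extend(rule.actions)
    return actions


rule = Rule(
    is_sensitive=lambda text: "CONFIDENTIAL" in text,
    sources=frozenset(),
    destinations=frozenset({"external"}),
    channels=frozenset({"usb", "email"}),
    actions=("block", "log"),
)
leak = Event("alice-laptop", "external", "email", "CONFIDENTIAL roadmap")
print(evaluate([rule], leak))
```

Keeping DATA as a predicate lets any of the content inspection techniques discussed later be plugged in without changing the rule structure.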

• DLP for data-at-rest:

• DLP Model for data-at-rest:

– If DATA resides at SOURCE, the system takes ACTIONs

– DATA specifies what the sensitive data (which has potential for leakage) is

– SOURCE can be an endpoint, a storage server or a group of them

– ACTION is the action that needs to be taken by the DLP system when confidential data is identified at rest.
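For data-at-rest, the same idea becomes a discovery scan. The sketch below assumes SOURCE is a local directory tree and that files can be read as UTF-8 text; `discover`, `is_sensitive` and `action` are hypothetical names, not part of any product API.

```python
from pathlib import Path


def discover(root, is_sensitive, action):
    """Walk SOURCE (a directory tree) and apply ACTION to each file holding DATA."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it in this sketch
        if is_sensitive(text):
            action(path)
            hits.append(path)
    return hits
```

A real discovery agent would also need file-type extraction (office documents, archives) before inspection; that step is outside this sketch.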

• Typical DLP systems:– DLP Management Console

– DLP Endpoint Agent

– DLP Network Gateway

– Data Discovery Agent (or Appliance)

• Typical DLP system architecture:

Agenda

• Content Inspection Problems

• Practical Algorithms for DLP

• Summary

• References

• Q&A

Content Inspection Problems

• Two fundamental problems for a DLP system:

• It is a pair of problems that always come together:

• One determines data sensitivity based on what has been defined.

• Four typical approaches for <defining, determining> sensitive data in a DLP system:

1. Document fingerprinting

2. Database record fingerprinting

3. Multiple Keyword matching

4. Regular expression matching

• Document fingerprinting:

• A technique for identifying modified versions of known documents

• Problem Definition (Model 1):

– Let S = {T1, T2, …, Tn} be a set of known texts

– Given a query text T, one needs to determine whether there exists at least one document t ϵ S such that T and t share a significant amount of common textual content. If multiple documents are returned, they are ranked by how much common content is shared.

• An alternative model (Model 2):

– Let S = {T1, T2, …, Tn} be a set of known texts

– Given a query text T and X%, one needs to determine whether there exists at least one text t ϵ S such that SIM(T,t) ≥ X%, where SIM() is a function that measures the similarity between two texts.

• Multiple documents are ranked by their similarity percentages.

• Database record fingerprinting:

– A technique for identifying sensitive data records within a text.

– A.k.a. Exact Match in the DLP field

• Use Case:

– We have several personal data records of <SSN, Phone#, address> that are included in a text. We want to extract all records from the text to determine the sensitivity of the file.

An example: a text contains a few data records:

Hhhhhdds ghghg 178-76-6754 ggkjkfddfdkkkk879-45-6785kjkjjk 43 Atword Street, Pittsburgh, PA 15260 kllkll 412-876-6789 kjkjjkj 76 Parkview Ave, Sunnyvale, CA 94086 hhsjskkdhjhjhj 408-780-8876hjhjkjkjjj 159-87-8965 hjhjhjhjmnnmnxcbls w243 54y45 wefddewdddw3n nn xxxxxxxxxx

| SSN | Phone # | Address |
|---|---|---|
| 178-76-6754 | 412-876-6789 | 43 Atword Street, Pittsburgh, PA 15260 |
| 159-87-8965 | 408-780-8876 | 76 Parkview Ave, Sunnyvale, CA 94086 |
| …… | …… | …… |

• Problem Definition (Model 3):

– Let S = {R1, R2, …, Rn} be a set of known data records from the same table.

– Given any text T, one needs to extract all records or sub-records from T, where the record cells may appear anywhere within the text and in any order.
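Model 3 is stated as a problem only, so the following is one straightforward reading rather than the method used in any product: fingerprint (hash) every cell of the known records, then slide windows of the known cell lengths over the text and look each window's hash up in the index. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def _fingerprint(cell):
    return hashlib.sha256(cell.encode("utf-8")).hexdigest()


def build_index(records):
    """Map each cell fingerprint to the (row, column) positions it came from."""
    index = defaultdict(list)
    lengths = set()
    for row, record in enumerate(records):
        for col, cell in enumerate(record):
            index[_fingerprint(cell)].append((row, col))
            lengths.add(len(cell))
    return index, sorted(lengths)


def extract(text, index, lengths):
    """Return {row: set of matched columns} for known cells found anywhere in text."""
    found = defaultdict(set)
    for n in lengths:
        for i in range(len(text) - n + 1):
            for row, col in index.get(_fingerprint(text[i:i + n]), ()):
                found[row].add(col)
    return dict(found)


records = [
    ("178-76-6754", "412-876-6789", "43 Atword Street, Pittsburgh, PA 15260"),
    ("159-87-8965", "408-780-8876", "76 Parkview Ave, Sunnyvale, CA 94086"),
]
index, lengths = build_index(records)
print(extract("xx 178-76-6754 yy 412-876-6789 zz 159-87-8965", index, lengths))
```

Because only hashes are stored, the index does not hold the sensitive values in clear text. The scan costs one hash per text position per distinct cell length, which is the kind of scalability concern raised later in the deck.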

• Problem Definition for Keyword Match:

– Let S = {K1, K2, …, Kn} be a dictionary of keywords.

– Given any text T, one needs to identify all keyword occurrences in T.

• Problem Definition for RegEx Match:

– Let S = {P1, P2, …, Pm} be a set of RegEx patterns.

– Given any text T, one needs to identify all pattern instances from T.
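As a small illustration of the RegEx problem, the sketch below runs a set of patterns over a text with Python's standard `re` module. The two patterns are simplified, hypothetical examples shaped like US SSNs and phone numbers; they are not validated formats.

```python
import re

# Illustrative patterns only; digit lookarounds avoid matching inside longer digit runs.
PATTERNS = {
    "ssn": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
    "phone": re.compile(r"(?<!\d)\d{3}-\d{3}-\d{4}(?!\d)"),
}


def find_all(text, patterns):
    """Return (offset, pattern name, matched text) for every pattern instance."""
    hits = []
    for name, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), name, match.group()))
    return sorted(hits)


print(find_all("kkkk879-45-6785kjk tel 412-876-6789", PATTERNS))
```

This loop scans the text once per pattern, so its cost grows linearly with m.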

Easy problems?

– Not at all! For large n and m, one will have performance issues.

– That’s the problem of scalability.

– Scalable algorithms must be provided.

Agenda

• Practical Algorithms for DLP

• Summary

• References

• Q&A

Practical Algorithms for DLP

• We investigate some algorithms for 2 problems:

1. Document fingerprinting

2. Multiple keyword matching

Assumption: without loss of generality, a text T is a sequence of UTF-8 characters.

Document Fingerprinting Algorithms

• Let's investigate algorithmic solutions for Model 2 (document fingerprinting).

• Analysis for Solution:

1. We need to construct the function SIM(T,t). For example:

– SIM(T,t) = |T ∩ t| / Min(|T|,|t|), based on common sub-strings.

2. An obvious challenge:

– If n is large, say, on the scale of millions, we cannot compute SIM(T, Tk) one by one to find the t that satisfies SIM(T,t) ≥ X%.

– We need an approach that can identify possible candidates quickly.

3. General search engines like Google use keywords to index/identify documents. Should we? The answer is NO: there are too many keywords, and keywords are language dependent.

4. So, which features can we use for indexing/searching?

– One answer is document fingerprints.
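To make SIM(T,t) concrete, here is a minimal sketch that interprets |T ∩ t| as the number of shared length-k substrings (shingles). That interpretation, the value k = 8, and the function names are assumptions; the slides only say the measure is based on common sub-strings. The `query` function is the brute-force baseline that item 2 warns against for large n.

```python
def shingles(text, k=8):
    """Set of all length-k substrings of text (k is an illustrative choice)."""
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def sim(a, b, k=8):
    """SIM(T,t) = |T ∩ t| / Min(|T|, |t|), measured over shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def query(known_texts, text, threshold, k=8):
    """Brute-force Model 2: (similarity, index) pairs at or above threshold, best first."""
    scored = ((sim(text, t, k), i) for i, t in enumerate(known_texts))
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


known = [
    "the quick brown fox jumps over the lazy dog",
    "completely different content here",
]
print(query(known, "quick brown fox jumps", 0.5))
```

Dividing by the smaller set means an excerpt copied out of a long document still scores 100%, which suits leak detection. The linear scan over all known texts is what the fingerprint index is meant to replace.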

Document Fingerprinting Algorithms

• What are document fingerprints?

– A fingerprint is a hash value.

– One text has multiple fingerprints.

– Uniqueness: two unrelated texts do not share any fingerprints.

– Robustness: fingerprints can survive moderate textual changes.

Document Fingerprinting Algorithms

• How to extract fingerprints from a text?

– Anchoring point:

• A point in the text that can endure moderate changes.

• Its neighborhood (of fixed size) is unique to the text.

– We select a few anchoring points to generate fingerprints:

• Generate hash values from their neighborhoods.

• These hash values are the fingerprints.

• Samples of anchoring points and their neighborhood:

ThereareabundantliteraturesonhowtogeneratedifferencebetweentwofilesBasicallytherearetwofundamentalapproachestoattackthisgenericproblemLCSmodelwhereLCSstandsforlargestcommonsubsequenceCalculatethelargestcommonsubsequenceoftwostringFindasequenceofeditoperationsbasedontheLCSsothatonecanapplytheeditoperationstothereferencefiletoconstructthetargetfileBlockmovemodel

• Conclusion : we have a solution that consists of two algorithms and one search technology:

– An algorithm for computing SIM(T,t)

– An algorithm for fingerprint generator FPGEN(T)

– Fingerprint search engine

• Fingerprint generation algorithm 1:

– INPUT: String T

• Select the top M candidate characters based on a score function. For a character c:

– Character frequency: n

– Character positions in the text T: P(1), …, P(n)

– $\mathrm{SCORE}(c) = \sqrt{n} \cdot \dfrac{P(n) - P(1)}{\sqrt{D}}$

» where $D = [P(2)-P(1)]^2 + [P(3)-P(2)]^2 + \dots + [P(n)-P(n-1)]^2$

• For each selected character c:

– Create a hash around the neighborhood at each occurrence

– Sort these hashes

– Select the top N hashes

– These N hashes are fingerprints

– OUTPUT: M*N fingerprints

Note 1: M and N are pre-defined.

Note 2: Two keys of this algorithm are (a) the score function; (b) sorting the hashes.
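The steps of algorithm 1 can be sketched as follows. Several details are not specified on the slide and are assumptions here: the neighborhood is a window of `radius` characters on each side, the hash is a truncated SHA-1, "top N" is taken to mean the N smallest hash values, and occurrences too close to the text boundary are skipped.

```python
import hashlib
import math
from collections import defaultdict


def score(positions):
    """SCORE(c) = sqrt(n) * (P(n) - P(1)) / sqrt(D) for one character's positions."""
    n = len(positions)
    if n < 2:
        return 0.0  # D is undefined for a single occurrence
    d = sum((b - a) ** 2 for a, b in zip(positions, positions[1:]))
    return math.sqrt(n) * (positions[-1] - positions[0]) / math.sqrt(d)


def fpgen(text, m=3, n=4, radius=8):
    """Return up to m * n fingerprints of text (m characters, n hashes each)."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    # Top-m characters by score; ties are broken by character for determinism.
    ranked = sorted(positions, key=lambda c: (-score(positions[c]), c))[:m]
    fingerprints = set()
    for c in ranked:
        hashes = set()
        for p in positions[c]:
            if p - radius < 0 or p + radius >= len(text):
                continue  # no full neighborhood at this occurrence
            window = text[p - radius:p + radius + 1]
            hashes.add(hashlib.sha1(window.encode("utf-8")).hexdigest()[:16])
        fingerprints.update(sorted(hashes)[:n])  # sorting makes the choice stable
    return fingerprints
```

Sorting the hashes gives a selection rule that depends only on neighborhood content, so the same neighborhoods tend to be chosen again in a modified copy of the text.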

• About the score function:

– Why $\sqrt{n}$?

• It measures the frequency of the given character.

• The larger the value, the more stable the character is.

– Why $[P(n)-P(1)] / \sqrt{D}$?

• It measures the distribution of the given character.

• The larger the value, the more evenly distributed the character is, and the more stable it is.

• WHY? Think about a constrained optimization problem:

– $\min f(X_1, X_2, \dots, X_m) = X_1^2 + X_2^2 + \dots + X_m^2$

» subject to

» $X_1 + X_2 + \dots + X_m = c$ AND

» $X_k \ge 0,\ k = 1, 2, \dots, m$

Note: The solution of the optimization problem is $X_k = c/m,\ k = 1, 2, \dots, m$.
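The link between this optimization problem and the score function can be spelled out by taking the X_k to be the gaps between consecutive occurrences of the character. The identification of X_k, m and c below is an interpretation of the slide; the inequality itself is the standard Cauchy-Schwarz bound.

```latex
\begin{aligned}
X_k &= P(k+1) - P(k), \quad k = 1, \dots, m, \quad m = n - 1, \\
c &= \sum_{k=1}^{m} X_k = P(n) - P(1), \qquad D = \sum_{k=1}^{m} X_k^2, \\
c^2 &= \Bigl(\sum_{k=1}^{m} 1 \cdot X_k\Bigr)^2 \le m \sum_{k=1}^{m} X_k^2 = m D
\quad \text{(equality iff all } X_k = c/m\text{)}, \\
\frac{P(n) - P(1)}{\sqrt{D}} &\le \sqrt{n - 1}.
\end{aligned}
```

So for a fixed span P(n) - P(1), D is smallest and the ratio is largest exactly when the occurrences are evenly spaced.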

There are alternative algorithms to construct a fingerprint generation function.

We recently constructed algorithm 2:

– A novel approach based on a rolling hash function.

– It selects anchoring points with a first filter, H(x) = 0 mod p.

– It further selects anchoring points with a heuristic second filter.

– It also employs an asymmetric architecture for fingerprint matching.

Note 1: The anchoring points have a better distribution across the text.

Note 2: Two keys of this algorithm are (a) the rolling hash; (b) the asymmetric use of the two filters.
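Only the first filter is described concretely enough to sketch. The code below marks a position as a candidate anchoring point when the rolling hash of the window starting there is 0 mod p. The window size, base, modulus and value of p are illustrative assumptions, and the heuristic second filter and the asymmetric matching are not covered because the slides do not describe them.

```python
def anchor_points(text, window=8, p=16, base=257, modulus=(1 << 61) - 1):
    """Byte offsets whose window hash satisfies H(x) = 0 mod p (first filter only)."""
    data = text.encode("utf-8")
    if len(data) < window:
        return []
    top = pow(base, window - 1, modulus)  # weight of the byte leaving the window
    h = 0
    for byte in data[:window]:
        h = (h * base + byte) % modulus
    anchors = [0] if h % p == 0 else []
    for i in range(1, len(data) - window + 1):
        # Roll: drop data[i - 1], append data[i + window - 1].
        h = ((h - data[i - 1] * top) * base + data[i + window - 1]) % modulus
        if h % p == 0:
            anchors.append(i)
    return anchors
```

Because each decision depends only on the bytes inside one window, inserting or deleting text elsewhere shifts the anchors in unchanged regions but does not remove them.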

Multiple Keyword Match

Essentially, it is a multi-pattern string match problem.

Problem Definition:

– Let S = {P1, P2, …, Pk} be multiple short strings as patterns;

– Given any string T, one needs to identify all pattern occurrences in T.

Existing string match algorithms:

| Algorithm | Type |
|---|---|
| Naïve string match | One pattern |
| Knuth–Morris–Pratt | One pattern |
| Boyer-Moore | One pattern |
| Boyer-Moore-Horspool | One pattern |
| Boyer-Moore-Horspool-Raita | One pattern |
| Rabin-Karp | Multi-patterns |
| Aho-Corasick | Multi-patterns |
| Wu-Manber | Multi-patterns |

Boyer-Moore-Horspool (BMH) Algorithm

Key elements of the algorithm:

– Character comparison can be made from right to left, starting from the end of the pattern.

– Ending Character Heuristics

• Suppose the pointer is at character R[i], and we compare it with the ending character P[m] of the pattern P.

• Bad character

– If R[i] ≠ P[m] and R[i] is not included in P’s alphabet, then it is safe for the pointer to skip m positions, arriving at R[i+m].

– If R[i] ≠ P[m], R[i] is included in P’s alphabet, and R[i]’s last occurrence within P has distance q from the end of P, then it is safe for the pointer to skip q positions, arriving at R[i+q].

• Good character

– If R[i] = P[m], P is not matched, and R[i] has no other occurrences within P, then it is safe for the pointer to skip m positions, arriving at R[i+m].

– If R[i] = P[m], P is not matched, and R[i]’s last occurrence other than P[m] has distance q from the end of P, then it is safe for the pointer to skip q positions, arriving at R[i+q].

• Matched instance

– If R[i] = P[m] and P is matched, then save the instance.

– It is almost safe to move the pointer to skip m positions, arriving at R[i+m].
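The heuristics above collapse into a single shift table keyed by the text character aligned with the end of the pattern. The sketch below is a standard Horspool search written with 0-based indexing. One deliberate difference from the slide: after a match it advances by the table shift rather than by m, so overlapping occurrences are not missed.

```python
def bmh_search(text, pattern):
    """Return the start offsets of all occurrences of pattern in text (Horspool)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the last occurrence of each character (excluding the final
    # pattern position) to the end of the pattern; absent characters shift by m.
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    hits = []
    i = m - 1  # text index aligned with the last pattern character
    while i < n:
        j = 0
        while j < m and text[i - j] == pattern[m - 1 - j]:
            j += 1  # compare right to left
        if j == m:
            hits.append(i - m + 1)
        i += shift.get(text[i], m)
    return hits


print(bmh_search("abracadabra", "abra"))
```

The table covers both the bad-character and good-character cases, since in each case the skip depends only on R[i] and its last occurrence in P before the final position.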

• Rabin-Karp Algorithm

– Hash-based string match

• Rabin-Karp hash function H(S):

– For a given string $S = x_1 x_2 \dots x_m$ of length m, a hash function can be constructed as:

• $H(S) = (x_1 b^{m-1} + x_2 b^{m-2} + \dots + x_{m-1} b + x_m) \bmod q$

• where b is a base number (usually we take b = 256) and q is a big prime number.

– For pattern P, $H(P) = (p_1 b^{m-1} + p_2 b^{m-2} + \dots + p_{m-1} b + p_m) \bmod q$

– If we denote $R_k = R[k, k+m-1]$, we can derive $H(R_{k+1})$ from $H(R_k)$ at relatively small cost:

– $H(R_{k+1}) = \bigl([H(R_k) - r_k b^{m-1}]\, b + r_{k+m}\bigr) \bmod q$

– This is an iterative formula, which is a common practice for algorithm optimization.
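The iterative formula follows directly from the definition of H. Written out modulo q for the window R_k = r_k r_{k+1} … r_{k+m-1}:

```latex
\begin{aligned}
H(R_k) &\equiv r_k b^{m-1} + r_{k+1} b^{m-2} + \dots + r_{k+m-1} \pmod q, \\
H(R_k) - r_k b^{m-1} &\equiv r_{k+1} b^{m-2} + \dots + r_{k+m-1} \pmod q, \\
\bigl[H(R_k) - r_k b^{m-1}\bigr]\, b + r_{k+m}
  &\equiv r_{k+1} b^{m-1} + \dots + r_{k+m-1} b + r_{k+m}
  \equiv H(R_{k+1}) \pmod q.
\end{aligned}
```

Removing the leading term, multiplying by b, and adding the incoming character therefore costs a constant number of operations per position, independent of m.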

• Rabin-Karp hash function:

– The quantity $b^{m-1} \bmod q$ can be pre-calculated to save CPU time.

– For each iteration, we only need 5 arithmetic operations.

• It can be further reduced to 4

• One considers the number $r_k b^{m-1}$

– Horner’s rule

• $H(S) = \bigl((\dots((x_1 b + x_2)\, b + x_3)\, b + \dots + x_{m-1})\, b + x_m\bigr) \bmod q$

• Yet another formula for performance tuning

• Rabin-Karp algorithm for multiple patterns:

– Input:

• String R, multiple patterns $\{P_1, \dots, P_k\}$

• $n = \mathrm{Length}(R)$, $m_j = \mathrm{Length}(P_j)$, q, b

– Procedure:

• Step 0:

– Let $m = \min(m_1, \dots, m_k)$

– Calculate the number $b^{m-1} \bmod q$

– Calculate all $H(P_j[1,\dots,m])$ (j = 1,…,k) and $H(R_1)$ by Horner’s rule

• Step 1: Let i = 1

• Step 2: If there exists j in {1,2,…,k} such that $H(P_j[1,\dots,m]) = H(R_i)$ and $P_j = R[i,\dots,m_j+i-1]$ (which requires $m_j + i - 1 \le n$), it is a match; output the instance

• Step 3: If i ≥ n-m+1, stop

• Step 4: Calculate $H(R_{i+1})$ using the iterative formula

• Step 5: i = i + 1

• Step 6: Go to step 2

– Output: All matched instances
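The procedure can be sketched in Python as below. It works on UTF-8 bytes, so reported offsets are byte offsets. The default modulus and the dictionary from prefix hash to pattern list are implementation choices made for this sketch, not details from the slides.

```python
def rabin_karp_multi(text, patterns, b=256, q=1_000_000_007):
    """Return (byte offset, pattern) for every occurrence of any pattern in text."""
    data = text.encode("utf-8")
    pats = [(p, p.encode("utf-8")) for p in patterns]
    if not pats or any(len(raw) == 0 for _, raw in pats):
        raise ValueError("patterns must be non-empty strings")
    m = min(len(raw) for _, raw in pats)  # Step 0: shortest pattern length
    n = len(data)
    if n < m:
        return []

    def horner(chunk):
        h = 0
        for x in chunk:
            h = (h * b + x) % q
        return h

    # Hash of each pattern's first m bytes -> patterns sharing that prefix hash.
    table = {}
    for p, raw in pats:
        table.setdefault(horner(raw[:m]), []).append((p, raw))

    top = pow(b, m - 1, q)  # b^(m-1) mod q, pre-calculated once
    h = horner(data[:m])    # hash of the first window
    hits = []
    for i in range(n - m + 1):
        for p, raw in table.get(h, ()):
            # A hash hit is only a candidate: verify the full pattern.
            if data[i:i + len(raw)] == raw:
                hits.append((i, p))
        if i + m < n:
            h = ((h - data[i] * top) * b + data[i + m]) % q  # iterative formula
    return hits


print(rabin_karp_multi("abcdabcacd", ["abc", "abcd", "acd"]))
```

The cost per text position is one dictionary lookup plus verification of the few candidates sharing the prefix hash, which is why the method scales to large k.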

A practical hybrid method:

– BMH or Rabin-Karp

– If k < Magic-number,

• Use BMH k times,

• Otherwise, use Rabin-Karp

– Magic-number = 100 is the value I have used in practice in DLP products.

Rabin-Karp has its weakness:

• When Min({Length(Pi) | i = 1,2,…,k}) is small, say, less than 4, we have trouble.

• We need to introduce efficient multiple pattern match for short patterns.

We have a complementary solution to the RK algorithm when handling multiple short patterns:

– This is the reverse-trie matching algorithm.

A reverse trie represents a set of keywords; it is especially good for CJK languages in UTF-8 encoding:

The keyword set: {abc,abcd,acd}
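The slides do not spell out the matching procedure, so the following is an assumed reading of a reverse trie: each keyword is inserted from its last character to its first, and matching walks backwards from every position of the text. The function names and the `"$end"` marker key are hypothetical.

```python
END = "$end"  # marker key; cannot collide with single-character edges


def build_reverse_trie(keywords):
    """Insert each keyword from its last character to its first."""
    root = {}
    for word in keywords:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def reverse_trie_match(text, root):
    """Return (start offset, keyword) for every keyword occurrence in text."""
    hits = []
    for end in range(len(text)):
        node = root
        i = end
        # Walk backwards from position `end` while the trie has a matching edge.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            if END in node:
                hits.append((i, node[END]))
            i -= 1
    return hits


trie = build_reverse_trie({"abc", "abcd", "acd"})
print(reverse_trie_match("xabcdacd", trie))
```

Each text position costs at most as many steps as the longest keyword, with no dependence on a minimum pattern length, which is the case where Rabin-Karp struggles.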

Agenda

• Practical Algorithms for DLP

• Summary

• References

• Q&A

Summary

• What DLP is.

• DLP Security Model

• Architecture of a DLP System

• Four Content Inspection Problems

• Two Algorithms for DLP Content Inspection

– Document Fingerprinting

– Multi-keyword matching

References

• Liwei Ren et al., Document fingerprinting with asymmetric selection of anchor points, US patent 8359472

• Liwei Ren et al., Two tiered architecture of named entity recognition engine, US patent 8321434.

• Yingqiang Lin et al., Scalable document signature search engine, US patent 8266150

• Liwei Ren et al., Fingerprint based entity extraction, US patent 7950062

• Liwei Ren et al., Document match engine using asymmetric signature generation, US patent 7860853

• Liwei Ren et al., Match engine for querying relevant documents, US patent 7747642

• Liwei Ren et al., Matching engine with signature generation, US patent 7516130

Any questions?

Thank You!

Innovation is not a part time job, and it is not even a full-time job. It’s a life style.
