FAUST Technology for Clustering (includes Anomaly Detection) and Classification (Where are we now?)


Upload: sharlene-stafford

Post on 18-Jan-2018



TRANSCRIPT

Page 1: FAUST Technology for Clustering (includes Anomaly Detection) and Classification (Where are we now?)

FAUST technology for classification/clustering is built for speed improvements so that big data can be mined in human time. Oblique FAUST is generalized to CC FAUST, which places cuts at all large Count Changes (CCs). A CC almost always reveals a cluster boundary: a large Count Decrease (CD) occurs iff we are exiting a cluster somewhere on the cut hyperplane, and a large Count Increase (CI) occurs iff we are entering a cluster.

CC FAUST makes a cut at each CC in the y∘d values. (A gap is a LCD followed by a LCI, so LCC includes Oblique FAUST.) CC FAUST is Divisive Hierarchical Clustering which, if continued to singleton sub-clusters, builds a complete dendrogram.

If the problem at hand is outlier [anomaly] detection, any singleton sub-cluster separated by sufficient gaps is an outlier.
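The count-change rule above can be sketched in ordinary NumPy as an illustrative stand-in for the pTree implementation: project the points onto a unit vector d, histogram the projections, and cut wherever the bin-to-bin count change is large. The function name, bin width, and "large change" threshold are assumptions, not the slide's actual parameters.

```python
import numpy as np

def cc_cuts(points, d, bin_width=1.0, big_change=2):
    """Return cut positions where the projection histogram count jumps."""
    proj = points @ d                          # the y.d values
    edges = np.arange(proj.min(), proj.max() + 2 * bin_width, bin_width)
    counts, _ = np.histogram(proj, bins=edges)
    cuts = []
    for i in range(1, len(counts)):
        if abs(int(counts[i]) - int(counts[i - 1])) >= big_change:
            cuts.append(float(edges[i]))       # large Count Change -> cut here
    return cuts

# Two clumps along x: cuts should land where we exit one and enter the other.
pts = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [3, 2],
                [14, 1], [14, 2], [15, 1], [15, 2], [15, 3]], float)
d = np.array([1.0, 0.0])
print(cc_cuts(pts, d))                         # → [4.0, 14.0]
```

A singleton sub-cluster bracketed by two such cuts with no other points between them is exactly the gap-separated outlier case described above.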

CC FAUST will scale up, because entering and leaving a cluster "smoothly" (meaning without noticeable count change) is no more likely for large datasets than for small (it's a measure=0 phenomenon). Do we need BARREL FAUST at all now?

BARREL CC FAUST is still useful for estimating the diameter of a set as SQRT((dot_product_width onto a d-line)² + (max barrel radius from that d-line)²).
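A minimal sketch of that diameter estimate, assuming "width" means the spread of the projections along the d-line and "barrel radius" means the largest perpendicular distance from the line (the function and parameter names are illustrative):

```python
import numpy as np

def barrel_diameter_estimate(points, d, anchor):
    """Estimate the diameter of `points` using the line through `anchor` along unit vector `d`."""
    rel = points - anchor
    proj = rel @ d                                # signed positions along the d-line
    width = proj.max() - proj.min()               # dot-product width onto the d-line
    perp = rel - np.outer(proj, d)                # components perpendicular to the line
    radius = np.linalg.norm(perp, axis=1).max()   # max barrel radius from the d-line
    return float(np.sqrt(width ** 2 + radius ** 2))

# 3-4-5 check: width 4 along x, max perpendicular distance 3.
pts = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
print(barrel_diameter_estimate(pts, np.array([1.0, 0.0]), np.array([0.0, 0.0])))  # → 5.0
```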

Density Uniformity (DU) of a sub-cluster might be defined as the reciprocal of the variance of the counts. A sub-cluster dendrogram should have a Density label (DE) and a Density Uniformity label (DU) on every edge (sub-cluster). We can end a dendrogram branch as soon as DE and DU are high enough (> thresholds, DET and DUT) to save time.

How can we [quickly] estimate DE and DU? DU is easy - just calculate the variance of the point counts. Density DE = count/volume = count/(c_n·r^n). We have the count and n; c_n is a known constant (c_1=2, c_2=π, c_3=4π/3, ...). We have the volume once we have the radius, and Barrel CC FAUST gives us a good radius estimate.
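A small sketch of both estimates under the assumed definitions above (DE = count / (c_n·r^n) with c_n the unit n-ball volume, DU = reciprocal of the variance of the counts):

```python
import math

def unit_ball_volume(n):
    """Volume constant c_n of the unit n-ball: c_1=2, c_2=pi, c_3=4*pi/3, ..."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def density_DE(count, radius, n):
    """DE = count / volume of the enclosing n-ball of the given radius."""
    return count / (unit_ball_volume(n) * radius ** n)

def uniformity_DU(counts):
    """DU = 1 / variance of the per-interval point counts."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return float('inf') if var == 0 else 1.0 / var
```

Perfectly uniform counts give infinite DU, so any finite DUT threshold accepts them; a single dominant spike drives DU toward zero.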

In advance, we decide on a density threshold, DET, and a Density Uniformity Threshold, DUT.

To choose the "best" clustering (partitioning of the set into sub-clusters) we proceed depth first across the dendrogram, left-most branch to right-most branch, going down until the DET and DUT thresholds are exceeded. (See next slides.)
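The depth-first selection rule can be sketched as follows, assuming each dendrogram node carries its DEL and DUL labels (the Node class is an illustrative structure, not Treeminer's):

```python
class Node:
    """A dendrogram node labeled with density (DEL) and density uniformity (DUL)."""
    def __init__(self, name, DEL, DUL, children=()):
        self.name, self.DEL, self.DUL = name, DEL, DUL
        self.children = list(children)

def choose_clustering(node, DET, DUT):
    """Depth first, left to right: accept a node once both labels exceed the thresholds."""
    if (node.DEL > DET and node.DUL > DUT) or not node.children:
        return [node.name]                 # dense and uniform enough: stop descending
    chosen = []
    for child in node.children:            # otherwise keep splitting
        chosen += choose_clustering(child, DET, DUT)
    return chosen

# Hypothetical labels (not the slide's figure): root fails, so we descend.
tree = Node("Y", .1, .2, [Node("A", .6, .9),
                          Node("B", .2, .3, [Node("B1", .7, .8),
                                             Node("B2", .5, .6)])])
print(choose_clustering(tree, .4, .5))     # → ['A', 'B1', 'B2']
```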

Page 2:

Oblique FAUST Code Layering? A layer (or object, or black box, or procedure) in the code called the CUTTER:
INPUTS:
I1. An SPTS (Scalar PTreeSet = bitsliced column of numbers, presumably coming from a dot-product functional).
I2. The method: Cut_at?
 I2a. p%_Count_Change (e.g., p=25%),
 I2b. Other, non-uniform count-change thresholds?
 I2c. Centers of gaps only.
I3. Whether the 1-counts of the sub-cluster mask pTrees should be computed and returned (Y/N), since it is an expensive step.
OUTPUTS:
O1. A pointer to a mask pTree for each new "sub-cluster" (i.e., identifying each set of points separated by consecutive cuts).
O2. The 1-count of each mask.
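A plain-Python sketch of that CUTTER interface, with lists standing in for pTrees (the bitsliced representation and mask-pTree pointers are omitted; all names mirror I1-I3/O1-O2 but are illustrative assumptions, not Treeminer's actual API):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CutterResult:
    masks: List[List[bool]]     # O1: one membership mask per sub-cluster
    one_counts: List[int]       # O2: 1-count of each mask (if requested)

def cutter(spts: List[float], cuts: List[float], want_counts: bool = True) -> CutterResult:
    """Split the column of numbers `spts` (the SPTS, I1) at the given cut values."""
    bounds = [float("-inf")] + sorted(cuts) + [float("inf")]
    masks = []
    for lo, hi in zip(bounds, bounds[1:]):
        masks.append([lo <= v < hi for v in spts])     # points between consecutive cuts
    counts = [sum(m) for m in masks] if want_counts else []   # I3: optional, expensive step
    return CutterResult(masks, counts)

print(cutter([1.0, 2.0, 3.0, 14.0, 15.0], [8.0]).one_counts)   # → [3, 2]
```

How the cut positions themselves are chosen (I2a-I2c) is left to the caller, matching the layering idea: the CUTTER only applies cuts and materializes masks.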

the GRAMMER:
INPUTS:
I1. An existing labeled dendrogram (labeled with, e.g., the unit vector that produced it, the density of each edge sub-cluster, ...), including the tree of pointers to the mask pTrees for each node (incl. the root, which need not be all of the original set).
I2. The new threshold levels (if, e.g., the density threshold is lower than that of the existing one, GRAMMER prunes the dendrogram).
OUTPUTS:
O1. The new labeled dendrogram.
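A minimal sketch of the pruning behavior described for the GRAMMER, assuming nodes are (name, DEL, children) tuples (an illustrative structure: once a node already meets the new density threshold, its subtree is no longer needed):

```python
def prune(node, DET):
    """Return a copy of the labeled dendrogram with subtrees under passing nodes cut off."""
    name, DEL, children = node
    if DEL > DET:                      # dense enough: this node's split is superfluous
        return (name, DEL, [])
    return (name, DEL, [prune(c, DET) for c in children])

tree = ("Y", 0.15, [("A", 0.37, [("A1", 0.63, [])]),
                    ("B", 0.07, [])])
print(prune(tree, 0.3))    # → ('Y', 0.15, [('A', 0.37, []), ('B', 0.07, [])])
```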

I like the idea of building a custom dendrogram for the user according to specifications. Then the user can examine it while we churn out the next level (as done in the next 2 slides, i.e., the next higher density threshold). The reason is that the full dendrogram down to singletons is impossibly large, and the information gain with each new level rises from zero up to a maximum and then falls steadily to zero again at the singleton level (the bottom of the full dendrogram is huge but worthless).

A thought on the sub-cluster dendrogram in general: the root should be labeled with the PTreeSet of the table involved. The sub-root level should be labeled with the particular SPTS of this branch (the D-line or unit vector, d, of the dot product...). Each sub-level after that should be labeled as above.

Hadoop Treeminer principles? Never discard a derived pTree, and never, never discard a computed count (makes catalogue mgmt a serious undertaking?). OR: pTree hoarding is good.

Page 3:

Choosing a clustering from a DEL- and DUL-labeled dendrogram

The algorithm for choosing the optimal clustering from a labeled dendrogram is as follows: let DET=.4 and DUT=½.

[Dendrogram figure: nodes A B C D E F G, each edge carrying a DEL/DUL label. The legible labels are DEL=.1 DUL=1/6, DEL=.2 DUL=1/8, DEL=.4 DUL=1, DEL=.5 DUL=½, and DEL=.3 DUL=½; the remaining DEL/DUL values did not survive extraction.]

Page 4:

[Scatter-plot figure: the 15 Spaeth points y1..yf on a 16x16 grid; coordinates are listed on Page 5. An "MA cut at 7 and 11" gives interval counts 1 3 1 0 2 0 6 2.]

APPLYING CC FAUST TO SPAETH

Density (Count/r²) labeled dendrogram for LCC FAUST on Spaeth with D=AvgMedian, DET=.3:
Y(.15)
   {y1,y2,y3,y4,y5}(.37)
   {y6,yf}(.08)  ->  {y6}(), {yf}()
   {y7,y8,y9,ya,yb,yc,yd,ye}(.07)  ->  {y7,y8,y9,ya}(.39), {yb,yc,yd,ye}(1.01)
With D=AM, DET=.5: {y1,y2,y3,y4}(.63), {y5}(); {y7,y8,y9}(1.27), {ya}(). With D=AM, DET=1: {y1,y2,y3}(2.54), {y4}().

Density (Count/r²) labeled dendrogram for LCC FAUST on Spaeth with D = cycles thru the diagonals nnxx, nxxn, ..., DET=.3:
Y(.15)
   {y1,y2,y3,y4,y5}(.37)
   {y6,y7,y8,y9,ya,yb,yc,yd,ye,yf}(.09)
      {y6,y7,y8,y9,ya}(.17)  ->  {y7,y8,y9,ya}(.39), {y6}()
      {yb,yc,yd,ye,yf}(.25)  ->  {yb,yc,yd,ye}(1.01), {yf}()

D-line labeled dendrogram for LCC FAUST on Spaeth with D=furthestAvg, DET=.3:
Y(.15)
   {y1,y2,y3,y4,y5}(.37)
   {y6,yf}(.08)  ->  {y6}(), {yf}()
   {y7,y8,y9,ya,yb,yc,yd,ye}(.07)  ->  {y7,y8,y9,ya}(.39), {yb,yc,yd,ye}(1.01)

Page 5:

UDR, the Univariate Distribution Revealer (on Spaeth):

The Spaeth table Y (15 points, coordinates (y1, y2)):
y1 (1,1)   y2 (3,1)   y3 (2,2)    y4 (3,3)    y5 (6,2)
y6 (9,3)   y7 (15,1)  y8 (14,2)   y9 (15,3)   ya (13,4)
yb (10,9)  yc (11,10) yd (9,11)   ye (11,11)  yf (7,8)

yofM, the column of functional values F(y), one per point:
11 27 23 34 53 80 118 114 125 114 110 121 109 125 83

Bit slices of yofM (p6 = high-order bit, ..., p0 = low-order bit; pk' denotes the complement of pk):
p6  0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
p5  0 0 0 1 1 0 1 1 1 1 1 1 1 1 0
p4  0 1 1 0 1 1 1 1 1 1 0 1 0 1 1
p3  1 1 0 0 0 0 0 0 1 0 1 1 1 1 0
p2  0 0 1 0 1 0 1 0 1 0 1 0 1 1 0
p1  1 1 1 1 0 0 1 1 0 1 1 0 0 0 1
p0  1 1 1 0 1 0 0 0 1 0 0 1 1 1 1

Interval 1-counts, obtained by ANDing the top bit slices (pk or pk' as appropriate) and counting 1s:
width 64:  [0,64)→5   [64,128)→10
width 32:  [0,32)→3   [32,64)→2   [64,96)→2   [96,128)→8
width 16:  [0,16)→1   [16,32)→2   [32,48)→1   [48,64)→1   [64,80)→0   [80,96)→2   [96,112)→2   [112,128)→6
width 8:   [0,8)→0   [8,16)→1   [16,24)→1   [24,32)→1   [32,40)→1   [40,48)→0   [48,56)→1   [56,64)→0   [64,72)→0   [72,80)→0   [80,88)→2   [88,96)→0   [96,104)→0   [104,112)→2   [112,120)→3   [120,128)→3

Pre-compute and enter into the ToC all DT(Yk), plus those for selected Linear Functionals (e.g., d=main diagonals, ModeVector, ...). Suggestion: in our pTree-base, every pTree (basic, mask, ...) should be referenced in ToC(pTree, pTreeLocationPointer, pTreeOneCount), and these OneCts should be repeated everywhere (e.g., in every DT). The reason is that these OneCts help us in selecting the pertinent pTrees to access - and in fact are often all we need to know about the pTree to get the answers we are after.

Applied to S, a column of numbers in bitslice format (an SpTS), UDR produces the DistributionTree of S, DT(S).

depth(DT(S)) = b ≡ BitWidth(S); h = depth of a node, k = node offset. Node_h,k has a ptr to pTree{x ∈ S | F(x) ∈ [k·2^(b-h), (k+1)·2^(b-h))} and its 1-count.

DT(S) for the Spaeth yofM column (b=7, so the root covers [0,128)):
depth=h=0:  15
depth=h=1:  5  10
depth=h=2:  3  2  2  8
depth=h=3:  1  2  1  1  0  2  2  6
depth=h=4:  0  1  1  1  1  0  1  0  0  0  2  0  0  2  3  3
(e.g., node_2,3 covers [96,128) and has 1-count 8)
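The Node_h,k construction can be sketched as follows, using a plain-Python stand-in: a real pTree base would obtain each 1-count by ANDing the top h bit slices (complemented or not according to k's bits) rather than by looping over values, which is equivalent to bucketing each value by its top h bits.

```python
def distribution_tree_counts(values, bitwidth):
    """Return {h: [1-count of each width-2^(b-h) interval]}, i.e. DT(S) counts."""
    tree = {}
    for h in range(bitwidth + 1):
        width = 2 ** (bitwidth - h)        # interval width at depth h
        counts = [0] * (2 ** h)            # one node per offset k
        for v in values:
            counts[v // width] += 1        # k = top h bits of v
        tree[h] = counts
    return tree

# The Spaeth yofM column from the slide, with b = 7.
yofM = [11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83]
dt = distribution_tree_counts(yofM, 7)
print(dt[1])   # → [5, 10]
print(dt[2])   # → [3, 2, 2, 8]
```

These match the slide's depth-1 and depth-2 rows, and dt[2][3] = 8 is the node_2,3 1-count for [96,128).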