in the recently concluded openworld conf larry ellison of oracle announced a new in-memory offering...


Upload: clemence-payne

Post on 19-Jan-2016


Page 1: In the recently concluded OpenWorld Conf Larry Ellison of Oracle announced a new in-memory offering to company’s latest 12c database platform.Oracle This

At the recently concluded OpenWorld conference, Larry Ellison of Oracle announced a new in-memory option for the company's latest 12c database platform.

This much-awaited in-memory option from Oracle comes almost three years after the launch of HANA, a similar product from competitor SAP.

The in-memory platform uses a computer's main memory instead of its disk drive to process queries faster, with fewer interactions with the CPU. Such a platform improves the query processing rate many-fold and also improves the performance of the core CPU in the long run.

The new in-memory option for Oracle databases is expected to see huge demand from customers. Data is stored in both row and column formats in the database, with faster transactional operations in the row format and faster analytical operations in the column format.

Oracle expects the new platform to increase query rates by 100 times and improve processing rates by three times. [1]

In the three years since the launch of HANA, SAP has seen tremendous growth, with more than 2,500 customers shifting to the hybrid cloud-based platform. [2]

With about 100,000 customers for its database business alone, Oracle's decision to offer an in-memory option for its new cloud-based database looks promising.

This offering could provide some resistance to SAP HANA, and we therefore expect demand for the cloud-based in-memory platform to be robust.

Check out our complete analysis for Oracle

Cloud Computing Bubble Signals Transition From On-Premise Software

Streamlined processes and increased demand for productivity from businesses have brought various cloud options into the software market. Cloud-based services provide faster provisioning, on-demand access and agile resource scheduling by using countless virtualized servers in the cloud to cope with increased processing requirements.

A survey by Oracle shows that 65% would shift from an on-premise database system to a Database-as-a-Service (DBaaS) system because DBaaS systems are quicker. [3]

Although continuously growing data requirements will keep driving growth in the database market as a whole, cloud-based databases are on the rise. The adoption of cloud-based services by businesses is growing at a rapid pace because the DBaaS model costs less than on-premise software.

An increase in virtualized offerings from cloud players and weakness in IT spending have resulted in slow growth in on-premise database software. Currently, the cloud-based DBaaS market is estimated to be worth $150 million. However, the market is estimated to grow at an annualized rate of 86%, to reach $1.8 billion by 2016. In comparison, the market for on-premise deployments is expected to grow at 33% annually until 2016. [4]

With new products like the 12c database, coupled with its efforts to make inroads into the cloud market, we expect the company to benefit from this growth opportunity in cloud services. The latest in-memory offering for the company's first cloud-ready database offers seamless transitioning from older databases and no challenges in data migration for customers. [1] We believe this could be the start of a transition from on-premise database management services to DBaaS for the company, and we expect this business to be a big growth opportunity for market leader Oracle.

[1] Oracle Announces Enhanced In-Memory Applications with New Oracle Database In-Memory Option, Oracle PressRoom, September 2013

[2] Seven More Questions for SAP's Co-CEO Bill McDermott, AllThingsD, January 2013

[3] Delivering Database as a Service (DBaaS) using Oracle Enterprise Manager 12c, oracle.com, March 2013

[4] DBaaS poised to drive next-generation database growth, 451Research, August 2013

From another source: SAP's HANA has one columnar store for both transactions and analytics. Oracle's approach keeps 2 redundant copies of the data, one for transactions in row format and another for analytics in column format.

Other players in the in-memory market: ScaleOut Software announced hServerV2, which brings real-time analytics to Map/Reduce in Hadoop.

JavaOne has released the Hazelcast in-memory datastore.


From: Arjun Roy
Sent: Tuesday, November 19, 2013 1:47 PM
To: Perrizo, William
Subject: Barrel Clustering

Q: Is it right that in high dimensions, considering 'd' in any direction would lead to no linear gaps before proceeding to radial gaps?

A: Yes, but maybe we should say "is highly likely to lead to few linear gaps" rather than "would lead to no linear gaps". Remote outliers should be isolated by a gap on one extreme of the projection line for some d. It may be unlikely that we'd see any "interior" gaps on the projection line. "Interior" means other than gaps at one end of the projection line separating out a singleton or doubleton outlier from the rest of the set.

Q: I would like to test 2 things. a) Reducing the dimension (Han's book talks about information gain on specific attributes): just use the attributes which give the highest info gains and then use Functional Gap Analysis. Just wondering if reduced dimensions will preserve the essence of the actual data.

A: Yes, that is a good step. In general terms it is called attribute selection (selecting out the "important" attributes and throwing out the "unimportant"). Important means those that retain the essence of the information we are trying to data mine out. Info gain is one way to filter out the "important" attributes.

Q: b) Reducing dimensions to, say, 10%, e.g., consider 5 attributes out of 50. So it's going to be a vector rather than a functional value. Not sure how to tackle this situation.

A: Is this something different from attribute selection (selecting the most important 5 out of 50)?

Q: I have found sequential FAUST (~2011) gave good results, comparable to and even better than the functional gap approach.

A: There was FAUST Unsupervised in which we did classification, but I don't recall a method which we called FAUST which did clustering by other means than dot product gap analysis. Possibly the sequential you are thinking of was when we used unit vectors, dk, such that d1 = e1 = (1,0,…,0), d2 = e2 = (0,1,0,…,0), etc.? The nice part about that is that the dot product values take no computation since they are given as the original columns of numbers already. So in an n-dimensional space, starting with d1,d2,…,dn makes good sense and often finishes the job. But if it doesn't, then we go to d's that are linear combinations of the dk's (i.e., diagonal and midpoint-to-diagonal d's, etc.). Note that the d's we called the midline d's are exactly the dimensional dk = ek cases. So you have reminded me that they are certainly the first ones we should consider because the dot product step can be skipped.

Thanks for forcing me to think about this. First, my analysis last Saturday was not only wrong but also incomplete. There are many more d's to be considered in a comprehensive set than I listed. In 2D I had them all, but even in 3D I left out the 4 d's that run from a side midpoint to an adjacent edge midpoint. I will try to get it right for next Saturday ;-)

So considering gaps in the columns themselves (dk=ek, k=1..n) should be the first step. Even for large datasets, at least one of these might find an "interior" gap too. Once one interior gap is found, the spaces considered in the next round are ½ the size. So there is twice the potential for a gap in the next round, etc.? I think Mark is finding that mostly even large datasets reveal gaps under some d.

Q: How to approach barrel?

A: The thing to remember is that there are really two functionals:
1. The Linear dot product with a d: (y-p) dot d
2. The Spherical: (y-p) dot (y-p)
Barrel is then just [ (y-p) dot (y-p) ] - [ (y-p) dot d ]^2
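The three functionals named in the reply can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the pTree-based implementation the slides refer to; the function names `linear`, `spherical` and `barrel` are hypothetical, and d is assumed to be a unit vector.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear(y, p, d):
    """L(y) = (y - p) . d, the signed position of y along the d-line through p."""
    return dot([yi - pi for yi, pi in zip(y, p)], d)

def spherical(y, p):
    """S(y) = (y - p) . (y - p), the squared distance from p."""
    diff = [yi - pi for yi, pi in zip(y, p)]
    return dot(diff, diff)

def barrel(y, p, d):
    """B(y) = S(y) - L(y)^2, the squared distance from y to the d-line through p."""
    return spherical(y, p) - linear(y, p, d) ** 2

# Illustrative values only (not taken from the slides).
p = [0.0, 0.0]
d = [1.0, 0.0]   # assumed to be a unit vector
y = [3.0, 4.0]
print(linear(y, p, d), spherical(y, p), barrel(y, p, d))
```

Because B is a squared radius, any gap threshold applied to it is naturally a squared quantity as well.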


Oblique FAUST with comprehensive initial linear step

Ways to handle negative numbers generated in the dot product projection step:

1. Only pair p with d if (y-p)od ≥ 0.

2. Use a sign mask to separate into 2 parts (positive part, negative part), then analyze whether there's a gap at 0.

3. After computing the SpTS which represents the dot product result and is expressed in 2's complement, before converting to regular decimal bitslice pTrees, compute the minimum and subtract it from the result. Then convert to a decimal bitslice PTreeSet.

Min_{i1..ik} ≡ vector with minY_h at position h for h = i1..ik and maxY_h elsewhere (MinVec = Min_{1..n}, MaxVec = Min_∅).

For d = e_k, k=1..n, use p = MinVec.

For d = Diagonal_{i1..ik} ≡ (q-p)/|q-p|, where p = Min_{i1..ik} and q = Min_{comp{i1..ik}}, use p.

In the Barrel method it is paramount to locate the barrel carefully. Find the mode(s) of every column (use the pTree Gap Finder, but watch for denseness, not sparseness (next slide)). For the column, k, with the maximum density at the mode, let d = e_k.

Let p = VecMod ≡ (modeY_1,...,modeY_n).

Let's take our pulse wrt the first linear FAUST step (done prior to any barrel-based reach limitation masking):

1. Exhaustive search for a good unit vector, d (which produces good linear gaps): The idea is to sequence through a comprehensive collection of d's (pairing each with a starting point, p, s.t. (Y-p)od ≥ 0, or to deal with negatives by analyzing for gaps in 3 steps: in the negative range, around 0, and in the positive range). At a rough coverage, this is easy: Take d ∈ {e_1..e_n} and p = MinVec. Take d ∈ {Diag_c | c ⊆ {1..n}}, where c is any subset of {1..n}. Diag_c is the unit diagonal from the corner, p, with p_k = MinY_k for k ∈ c, else p_k = MaxY_k, to the corner, q, with q_k = MaxY_k for k ∈ c, else q_k = MinY_k. Here negatives never appear. When AvgPoints of sides and hypersides are used for p and/or q, care must be taken in picking p since negatives can appear.

2. Selective choices of d: ModeVec seems very valuable (along with MedVec and Mean).
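The "rough coverage" collection of (p, d) pairs described in step 1 can be enumerated directly. The sketch below is only an illustration on plain Python lists: the helper names (`min_max`, `corner`, `candidate_pairs`) are hypothetical, and it enumerates all 2^n corner diagonals, which is only practical for small n.

```python
from itertools import combinations

def min_max(points):
    """Column-wise minima and maxima (MinVec and MaxVec)."""
    cols = list(zip(*points))
    return [min(c) for c in cols], [max(c) for c in cols]

def corner(mins, maxs, c):
    """Corner with the minimum in the coordinates listed in c and the maximum elsewhere."""
    return [mins[k] if k in c else maxs[k] for k in range(len(mins))]

def candidate_pairs(points):
    """Return (p, d) pairs: axis directions e_k with p = MinVec, then corner diagonals."""
    mins, maxs = min_max(points)
    n = len(mins)
    pairs = [(list(mins), [1.0 if j == k else 0.0 for j in range(n)])
             for k in range(n)]
    for size in range(n + 1):
        for c in combinations(range(n), size):
            p = corner(mins, maxs, set(c))
            q = corner(mins, maxs, set(range(n)) - set(c))
            diff = [qk - pk for qk, pk in zip(q, p)]
            norm = sum(x * x for x in diff) ** 0.5
            if norm > 0:  # skip the degenerate case where p and q coincide
                pairs.append((p, [x / norm for x in diff]))
    return pairs

# Illustrative 2-D data (not taken from the slides).
points = [(1, 1), (3, 1), (2, 2), (9, 3), (15, 1), (10, 9), (7, 8)]
for p, d in candidate_pairs(points):
    lowest = min(sum((yk - pk) * dk for yk, pk, dk in zip(y, p, d)) for y in points)
    print(p, [round(x, 3) for x in d], "min projection:", round(lowest, 3))
```

Every pair produced this way yields non-negative projections, which matches the remark that negatives never appear for corner-to-corner diagonals.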


X    x1  x2  xofM
p1    1   1    11
p2    3   1    27
p3    2   2    23
p4    3   3    34
p5    6   2    53
p6    9   3    80
p7   15   1   118
p8   14   2   114
p9   15   3   125
pa   13   4   114
pb   10   9   110
pc   11  10   121
pd    9  11   109
pe   11  11   125
pf    7   8    83

No zero counts yet (=gaps)

Bit slices of xofM (one bit per point, in the order p1..pf; primed rows are the complements):

p6   0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
p5   0 0 0 1 1 0 1 1 1 1 1 1 1 1 0
p4   0 1 1 0 1 1 1 1 1 1 0 1 0 1 1
p3   1 1 0 0 0 0 0 0 1 0 1 1 1 1 0
p2   0 0 1 0 1 0 1 0 1 0 1 0 1 1 0
p1   1 1 1 1 0 0 1 1 0 1 1 0 0 0 1
p0   1 1 1 0 1 0 0 0 1 0 0 1 1 1 1

p6'  1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
p5'  1 1 1 0 0 1 0 0 0 0 0 0 0 0 1
p4'  1 0 0 1 0 0 0 0 0 0 1 0 1 0 0
p3'  0 0 1 1 1 1 1 1 0 1 0 0 0 0 1
p2'  1 1 0 1 0 1 0 1 0 1 0 1 0 0 1
p1'  0 0 0 0 1 1 0 0 1 0 0 1 1 1 0
p0'  0 0 0 1 0 1 1 1 0 1 1 0 0 0 0


f=p1 and xofM-GT=2^3. First round of finding Lp gaps:

width = 2^4 = 16, gap: [100 0000, 100 1111] = [64,80)

width = 2^3 = 8, gap: [010 1000, 010 1111] = [40,48)

width = 2^3 = 8, gap: [011 1000, 011 1111] = [56,64)

width = 2^4 = 16, gap: [101 1000, 110 0111] = [88,104)

width = 2^3 = 8, gap: [000 0000, 000 0111] = [0,8)

OR between gap 1 & 2 for cluster C1={p1,p3,p2,p4}

OR between gap 2 and 3 for cluster C2={p5}

OR between gap 3 and 4 for cluster C3={p6,pf}

OR for cluster C4={p7,p8,p9,pa,pb,pc,pd,pe}

The pTree Gap Finder can also be used to find the mode(s) of the distribution of F values! Instead of watching for sparseness (and ultimately a zero count), we watch for large (the largest?) counts.
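The gap search on this slide can be reproduced with a short sketch. It counts how many xofM values fall into each aligned cell of width 2^3 by looking only at the high-order bits, merges adjacent empty cells into gaps, and groups the points between gaps. Counting with plain integers is a stand-in for the pTree AND/count operations, and the function names are hypothetical.

```python
def empty_cells(values, width_bits, total_bits):
    """Indices j of aligned cells [j*2^w, (j+1)*2^w) that contain no value."""
    counts = [0] * (1 << (total_bits - width_bits))
    for v in values:
        counts[v >> width_bits] += 1  # the high-order bits select the cell
    return [j for j, c in enumerate(counts) if c == 0]

def merge_cells(cells, width_bits):
    """Merge runs of adjacent empty cells into half-open gaps [lo, hi)."""
    w = 1 << width_bits
    gaps = []
    for j in cells:
        lo, hi = j * w, (j + 1) * w
        if gaps and gaps[-1][1] == lo:
            gaps[-1] = (gaps[-1][0], hi)
        else:
            gaps.append((lo, hi))
    return gaps

def clusters_between(named_values, gaps):
    """Group names by how many gaps lie entirely below their value."""
    groups = {}
    for name, v in named_values.items():
        key = sum(1 for lo, hi in gaps if hi <= v)
        groups.setdefault(key, []).append(name)
    return [sorted(groups[k]) for k in sorted(groups)]

# xofM values from the table on this slide.
xofm = {"p1": 11, "p2": 27, "p3": 23, "p4": 34, "p5": 53, "p6": 80,
        "p7": 118, "p8": 114, "p9": 125, "pa": 114, "pb": 110,
        "pc": 121, "pd": 109, "pe": 125, "pf": 83}

gaps = merge_cells(empty_cells(xofm.values(), 3, 7), 3)
print(gaps)
print(clusters_between(xofm, gaps))
```

With a cell width of 8, the merged gaps are [0,8), [40,48), [56,80) and [88,104), and the four groups agree with C1 to C4 above.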


Oblique FAUST Pipe 1

[Figure: pipe around the d-line through p, showing the points b1 and b2 on the line, the revised center p1, the radius r1 and the stave mean q1.]

1. Linearly project onto the d-line from within a pipe. If there are no good gaps, start over with a new p, d; else if good linear gapped region(s) appear,

2. Look for good radial gaps and use them as in OFP0. If there are none, then for each linear gapped region [xod=a1, xod=a2] (the corresponding points on the pd-line are bk = p + (ak - pod)d, k=1,2), do: in a narrowed region around p1 = avg(b1,b2), increase the radius until the first gap or thinning appears (at r1). Let q1 = mean(post-thinning barrel stave ring (radius from r1 to r1 + delta)) and let d1 = (q1-p1)/|q1-p1|.

3. Start over with p1, d1 (more likely to be down the middle of the round cluster and therefore to produce barrel gaps (linear and radial)).

Alternatively, one could keep finding points on the sphere (like b1 and b2) until one has n-1 of them (n = dimension of the space). Then there is a formula for the center of the circle through those points (at least in low dimensions???). However, even if there is a formula in high dimensions, it would be a nasty one and it would take lots of rounds of the above to produce the n-1 points (~ n/2 rounds). If n=17,000, for instance, that makes it infeasible.


Oblique FAUST Pipe 2

[Figure: pipe around the d-line through p, showing the means m1 and m2, the revised point p', and the radii r1, r2, r3 and r4.]

1. Linearly project onto the d-line from within a pipe. If there are no good gaps, reset p, d.

2. For a linear gapped region (mean = m1), increase the radius until there appears a region between gaps or thinnings (at r1 and at r2). Let m2 = mean(2nd pre-thinning barrel stave). Let r3 = (r2+r1)/2, r4 = r2 - r3. Reset the d-line through p' = (r3*m2 + r4*m1)/2.

3. Look again for a radial gap. If none, goto 1.

4. If there is a radial gap, restrict to the full barrel and look for linear gaps. If none, goto 1.

5. If there is a linear gap, declare a subcluster, mask it off and goto 1.


Q&A

f = distance-dominated functional. avgGap = (fmax-fmin)/|f(X)| may be a good measurement for setting thresholds, e.g., x is an outlier (anomaly) if the gap around {x} > 3*avgGap?

If d and t are trained over DocumentTerm (DT), Gradient(F) = G = (Gd, Gt). Instead of a LineSearch using F(s) = F(f + sG), always use a 2D-RectangleSearch, F(sd,st) = F(f + sd*Gd + st*Gt). Set ∂F/∂sd = 0 and ∂F/∂st = 0.

It may be a better approach to find dense cells (sphere, barrel, cone) and then fuse them, because it's difficult to position them around clusters (due to bumps, protrusions, etc.). (Not true for outlier clusters (singleton/doubleton).)

An alg: Start with a line and a small-radius barrel around it. Find dense regions between 2 consecutive gaps in this pipe. This should identify a portion of a dense cluster. Lots of ways to go from there:
a. Use the centroid of the dense pipe piece as the sphere|barrel center.
b. Move to a better centroid for that cluster by a gradient ascent/descent process.
c. In a "GA mutation" fashion, jump to a nearby centroid, governed by some fitness function (e.g., count in dense pipe piece).

If the minimum barrel radius >> 0, we have chosen a d-line far from the data. It may be advisable to pick p to be an actual data point.

Here are the formulas from the spreadsheet:

G = (B12-B$6)*B$9 + (C12-C$6)*C$9 + (D12-D$6)*D$9 + (E12-E$6)*E$9
H = G12-$G$9, i.e., L = (x-p)od - min
I = (B12-B$6)^2 + (C12-C$6)^2 + (D12-D$6)^2 + (E12-E$6)^2
J = @SQRT(I12-G12^2), i.e., B = SQRT[(x-p)o(x-p) - ((x-p)od)^2]

Note we don't round, so we are calculating pTree bitslices by truncating. We don't even need to do that! For fixed point, here are the bitslice formulas:

@MOD(@INT(F/2^6),2)
@MOD(@INT(F/2^5),2)
@MOD(@INT(F/2^4),2)
@MOD(@INT(F/2^3),2)
@MOD(@INT(F/2^2),2)
@MOD(@INT(F/2^1),2)
@MOD(@INT(F/2^0),2)

Keep going (take bitslices to the right of the decimal point):

@MOD(@INT(F/2^-1),2)
@MOD(@INT(F/2^-2),2)
...

Floating point? Bitslice the mantissa. The exponent shifts the slice name. E.g.,

mantissa  exponent  value
.1011     2^5       10110.
.0010     2^4       10.
.1010     2^-1      .01010

slice   bits (one per value, in the order above)
2^4     1 0 0
2^3     0 0 0
2^2     1 0 0
2^1     1 1 0
2^0     0 0 0
2^-1    0 0 0
2^-2    0 0 1
2^-3    0 0 0
2^-4    0 0 1
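The spreadsheet bitslice formula, @MOD(@INT(F/2^k),2), carries over directly to code. The sketch below assumes non-negative values and uses a hypothetical helper name, `bitslice`; it reproduces the slice table for the three example values, whose binary forms 10110., 10. and .01010 are 22, 2 and 0.3125 in decimal.

```python
import math

def bitslice(values, k):
    """Bit of weight 2^k for each value: floor(F / 2^k) mod 2 (non-negative F assumed)."""
    return [int(math.floor(v / 2.0 ** k)) % 2 for v in values]

# The three example values from the slide: binary 10110., 10. and .01010
values = [22.0, 2.0, 0.3125]

for k in range(4, -5, -1):
    print(f"2^{k}:", bitslice(values, k))
```

Negative exponents give the slices to the right of the binary point, so the same expression serves both halves of the table.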

Gap analytic tools: L(x) = (x-p)od, S(x) = (x-p)o(x-p), and then from those, B(x) = S(x) - L^2(x).

(If T is the minimum gap threshold, use T^2 for S and B.)

Oblique FAUST, Barrel (OFLB): Alternate Lpqx, Bpqx to get a cluster dendrogram (top-down). Take p = 1st_TR pt? d = vomavg.

Defining Avg Density? AvD = count / ∏k=1..dim (maxk - mink)? This is for choosing good thresholds.

MinGapThres = Tb,AvD ≡ b*(1/AvD)^(1/dim), where b = adjustable param.

If we're given a TrainingSet, TR, with K classes, is avgk=1..K vomk a better medoid than VoM? Is taking p=MinCorner, q=MaxCorner of the box circumscribing {VoMk}k=1..K better than taking them from the circumscribing box of TR?
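The average-density threshold can be worked through numerically. The sketch below reads AvD as the point count divided by the volume of the bounding box and the threshold as b times the dim-th root of 1/AvD; it assumes every column has a non-zero range, and the function names are hypothetical.

```python
import math

def average_density(points):
    """AvD = count / product over dimensions of (max_k - min_k)."""
    cols = list(zip(*points))
    volume = math.prod(max(c) - min(c) for c in cols)  # assumes no zero-width column
    return len(points) / volume

def min_gap_threshold(points, b=1.0):
    """T = b * (1 / AvD)^(1 / dim), with b an adjustable parameter."""
    dim = len(points[0])
    return b * (1.0 / average_density(points)) ** (1.0 / dim)

# Illustrative data: 4 points in a 4 x 4 box, so AvD = 0.25.
points = [(0, 0), (4, 0), (0, 4), (4, 4)]
print(average_density(points), min_gap_threshold(points, b=1.0))
```

Under this reading, (1/AvD)^(1/dim) is the side length of a cube holding one point on average, so b scales the threshold in units of typical point spacing.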

SSPTS = set of all SPTSs (columns of reals); V = n-dim vector space. Code operations on SSPTS (both 1-level and multi-level):

DPv (Dot Product with a fixed vector, v∈V)

SSPTS × SSPTS → SSPTS (binary algebraic operations), including: +, -, /, RWP = Row_Wise_Product

SSPTS → SSPTS (unary operations), including: SPc = Scalar_Product (multiply each SPTS row by the same constant, c). Use a const SPTS (all rows = c), then RWP? More efficient without forming the const SPTS? Use c's bit pattern only? (subset of previous with n = |SSPTS|?)

{SPTSk}k=1..n → SSPTS (unary ops; typically SPTSk = Vk), including: SDv (Square Distance from a fixed vector, v∈V)

Note, SSPTS includes SPTSs of all cardinalities (= depths = # of rows). It seems best to code on SSPTS rather than on SSPTSn (card(SPTS)=n). Of course, it is very important to know what the rows represent so as to avoid nonsense results; however, why restrict the operations themselves? When SPTS operands are of different depths, the result SPTS's depth = depth of the shallowest operand (operate from the top of each).

ERa = FP's EinRings (n=1, r∈R): the result masks rows s.t. row < a

SPTS → R includes AGa = YC's Aggregates and iceberg queries: count, sum, avg, max, min, median, rank_k, top_k, IceBergQueries.


p 140 192 807 3 T=MGW=12

d=x-n=.58 .15 .58 .53

CONCRETE ST CM WA FA AG 8 140 192 807 3 8 168 122 780 3 9 190 162 803 3 10 310 192 851 3 20 230 195 759 14 20 238 187 847 3 21 212 180 779 14 21 191 162 804 14 22 166 176 780 28 22 234 198 852 14 22 230 195 758 14 23 234 198 852 28 23 190 162 803 14 23 363 165 756 7 24 168 122 780 28 24 338 175 756 3 24 286 145 804 3 24 222 189 870 14 24 230 195 759 28 25 319 156 880 3 25 222 189 870 28 25 230 195 758 28 25 195 166 906 14 25 212 180 779 28 25 166 176 780 14 25 250 187 861 14 26 191 162 804 28 26 195 166 906 28 26 238 228 594 7 26 238 187 847 14 26 213 159 904 14 28 190 162 803 28 28 389 158 926 3 28 234 198 852 56 28 199 192 826 28 28 140 192 807 28 28 324 184 660 3 29 380 154 605 3 29 375 127 993 3 29 313 176 612 3 29 250 187 861 28 29 166 176 780 56 29 222 189 870 56 40 214 182 786 28 40 190 162 803 100 40 469 138 841 3 40 238 187 847 56 40 333 228 594 270 40 212 180 779 100 41 333 228 594 365 41 390 146 756 3 41 222 189 870 100 41 191 162 804 100 41 531 142 894 3 41 190 228 670 28 41 380 228 594 90 41 380 228 594 270 41 380 228 594 180 41 230 195 758 100 41 402 147 852 3 42 475 228 594 270 42 190 228 670 90 42 428 228 594 90 42 475 228 594 90 42 475 228 594 365 42 199 192 826 180 42 428 228 594 180 42 250 187 861 100 43 213 159 904 56 43 475 228 594 180 43 313 176 612 7

43 428 228 594 270 43 213 159 904 100 44 428 228 594 365 44 238 187 847 100 44 199 192 826 360 44 140 192 807 180 44 380 228 594 365 45 140 192 807 360 46 375 127 993 7 46 375 127 993 7 46 266 228 670 28 46 374 170 757 7 46 214 182 785 28 47 190 228 670 180 47 214 182 786 56 47 425 151 804 7 47 266 228 670 90 47 531 142 894 7 47 380 154 605 7 48 304 228 670 28 49 304 228 670 90 49 425 154 887 7 49 425 154 887 7 49 266 228 670 180 49 425 154 887 7 60 425 154 887 28 60 375 127 993 56 60 425 154 887 28 60 425 154 887 28 61 374 170 757 28 62 540 162 676 28 62 425 151 804 28 63 374 170 757 56 63 375 127 993 91 64 425 154 887 56 64 425 154 887 56 64 425 154 887 56 65 425 151 804 56 65 374 170 757 91 65 425 154 887 91 65 313 176 612 56 65 425 154 887 91 65 425 154 887 91 66 439 186 708 28 66 319 156 880 56 67 469 138 841 28 67 313 176 612 91 67 425 151 804 91 68 286 145 804 28 68 475 181 782 28 68 319 156 880 91 68 402 147 852 28 68 338 175 756 91 69 469 138 841 56 71 363 165 756 28 71 363 165 756 28 71 363 165 756 28 71 363 165 756 28 71 469 138 841 91 72 475 181 782 56 72 439 186 708 56 73 286 145 804 56 73 439 186 708 91 74 390 146 756 28 74 475 181 782 91 74 402 147 852 56 75 402 147 852 91 75 324 184 660 28 77 363 165 756 56 77 286 145 804 91 77 363 165 756 56 77 363 165 756 56 77 363 165 756 56 79 390 146 756 56 79 363 165 756 91 79 363 165 756 91 79 363 165 756 91 79 363 165 756 91 80 324 184 660 56 83 390 146 756 91

0 1 7 7 1 411 2 112 1 315 4 217 1 118 3 119 2 120 2 121 1 122 6 123 3 124 4 125 2 126 1 127 2 229 2 332 2 133 3 134 2 135 1 136 2 137 3 138 3 139 5 140 2 141 2 142 8 143 1 144 2 145 3 146 6 147 2 148 2 149 3 150 2 151 8 253 1 154 1 155 1 156 3 157 1 158 3 361 2 162 3 163 2 265 1 166 2 167 5 168 1 169 1 170 5 171 1 273 1 174 6 175 1 378 4 381 1 182 1 183 2 386 1

L1 M1

L2 M12 H 17 C3

OF LB...LB Clustering on Concrete (STrength, ConcreteMix, WAter, FineAggregate, AGgregate). Assess ST error with classes L < 40 ≤ M < 60 ≤ H

(x-p)od/4 Ct Gp3 C

if 1st B radius>>0, use p=min_radius_pt

L2 M1 C0

L20 M9 H4 C1

H4

M3 H1 C4 H1

L18 M26 H28 C2

Br/4 ct gp3 C4 40 1 33 73 1 1 74 1 42116 1

H1

M3

Br/4 ct gp3 C3 0 1 13 13 3 1 14 3 3 17 5 2 19 1 4 23 1 6 29 2 1 30 2 2 32 1 2 34 1 4 38 1 4 42 1 2 44 1 3 47 1 57104 1 4108 1 2110 1 3113 1 11124 3

L1

M3 H3 C31

L1 M1 H4 C32 H1

M1 H5 C33 H1

H2 H1

M1

M2 M1 M3

(x-p)od/4 gp3 C31 67 3 370 3

M3 . H3 (x-p)od/4 gp3 C32

0 2 3232 3 234 1

L1 M1 d=4 . H4

Br/4 gp3 C2

0 1 3 3 1 1 ... 8 1 513 1 316 2 118 3 321 1 1 ...26 2 329 1 736 1 2 ...43 2 346 2 1 ...48 2 351 1 960 1 262 3 1375 1 782 1 183 1 487 1 1 ...91 1 394 2

L1

L9 M1 C21

L4 M3 H1 C22 M1

L2 M4 H3 C23 M1

L2 M3 H16 C24

L2 M3 H4 C25 M1 M1 H3 C26 M1 M2

M3 H1 C27 M2

(x-p)od/4 gp3 C27

46 1 349 1 251 1 556 1 M3 . H1

(x-p)od/4 gp3 C26

46 1 349 1 251 1 556 1

M1 ' H3

(x-p)od/4 gp3 C25 38 1 442 1 345 1 449 1 554 1 155 1 358 1

M2

H2M1 H1C251

H1

Br/4 gp3 C251 0 1 2525 1

M1 H1(x-p)od/4 gp3 C24

36 1 339 1 2...53 1 558 1

L1

L1 M2 H16 C241 M1

Br/4 gp3 C241

0 1 2 ... 4 4 5 9 1 110 4 616 1 117 4 320 1 4161 1

L1 M1 H5 C2411

H5

H5M1

c (Cluster dendrogram w/o purity)

c0 c1 c2 c3 c4

c31 c32 c33c21 c22 c23 c24 c25 c26 c27

(x-p)od/4 gp3 C33

0 1 3030 4 232 1

M1 . H5

c251c241

(x-p)od/4 g3 C411

13 1 921 5

M1 . H5

c2411

(x-p)od/4 gp3 C23

30 3 434 1 337 1 441 1 445 1 550 1 151 1

M3L1

H1 H1 H1

L1 M1 C231

c231

Br/4 gp3 C231 0 1 3535 1

L1 M1

(x-p)od/4 gp3 C22

38 1 139 1 140 1 545 1 348 1 351 1 152 1 557 1

L3

M2 . H1

M1L1

(x-p)od/4 gp3 C21

35 1 136 1 137 2 138 1 139 1 342 3 143 1

L6L3 M1 C211

Br/4 gp3 C211

0 1 4 4 1 3 7 1 310 1

c211

L1L1

M1L1

Br/4 gp3 C1 0 1 5... 7 1 3...21 1 324 4 1943 1 447 1 4...53 1 3...59 1 7...68 1 10...79 1

L2L11 M3 C11

L1L4 M1 M2

L1 M2 H1 C12 H3 M1

L1

c11 c12

(x-p)od/4 gp3 C1220 1 323 1 427 2

L1 H1 M2

(x-p)od/4 gp3 C1115 1 217 1 320 1 121 1 122 3 123 1 124 2 125 1 429 3

L11 M3

Br/4 gp3 C019 2 4665 1

L1L1 M1d=4


Oblique FAUST (OF) Clustering: Linear (default) OFL, Spherical OFS, Barrel OFB, Conical OFC

Assume a real number table, T, converted to a PTreeSet.

Oblique FAUST Linear (OFL) clustering (enclose clusters between (n-1)-dimensional hyperplanar gaps): Lp,d: X→R, Lp,d(x) = (x-p)od. Find a1 < a2 such that ∅ = GapLower = {x | a1 < Lp,d(x) < a1+T} and ∅ = GapUpper = {x | a2 < Lp,d(x) < a2+T} and C = {x | a1+T < Lp,d(x) < a2}.

Oblique FAUST Spherical (OFS) (enclose clusters with spherical gaps): Sp(x) = (x-p)o(x-p). Search Sp for a spherical gap, {x | r^2 ≤ Sp(x) < (r+T)^2} = ∅, so that the interior of the r-sphere about p encloses a sub-cluster.

Oblique FAUST Barrel (OFB) (enclose clusters with barrel gaps): Bp,d(x) = (x-p)o(x-p) - ((x-p)od)^2. Search for GapLower > T, GapUpper > T and GapBarrel > T^2 (BR ≡ Barrel_Radius).

Oblique FAUST Cone (OFC) (enclose clusters with cone gaps): Cp,d(x) = (x-p)od / sqrt((x-p)o(x-p)).

Note: Bp,d(x) = Sp(x) - Lp,d(x)^2

Note: Cp,d(x)^2 = Lp,d(x)^2 / Sp(x)

[Figure: a cluster on the d-line through p, bounded by GapLower at a1, GapUpper at a2 and a barrel gap (GapBarrel) around the line. No gaps show on the red, blue or green projection lines.]

Each method uses a real-valued functional from X to R, and all methods are completely data parallel (data can be distributed over a cluster, processed in parallel (dot product), and the partial results sent home to be added).
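Once any of these functionals has been evaluated, the cluster search reduces to finding gaps wider than the threshold in a single column of real numbers. The sketch below is a simplified sort-and-scan version, not the pTree gap finder; the function name `split_on_gaps` is hypothetical. For S and B, which are squared quantities, the slides suggest a threshold of T^2.

```python
def split_on_gaps(values, T):
    """Split functional values into clusters wherever two consecutive
    sorted values differ by more than the gap threshold T."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > T:
            clusters.append([cur])   # gap found: start a new cluster
        else:
            clusters[-1].append(cur)
    return clusters

# Illustrative functional values (not taken from the slides).
L_values = [11, 2, 3, 10, 1, 30]
print(split_on_gaps(L_values, T=5))
```

Each returned group corresponds to a set C enclosed between a lower and an upper gap on the projection line.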


p=min q=maxClsAreaLnkeAcoeLnk) 1 15 6 2 5 1 15 6 1 5 ... 3 12 5 6 5

F<4.5 R Ct gp 0 5 1010 9 414 13 317 3 3

Oblique FAUST Pipe 0Clustering on SEEDS(

Thinning1 45 2 48

dp

0. Always start with a pd-line linear gap analysis, then:
1. Find gapped regions in the pipe: project inside the pd-pipe (small radius). For a full linear gapped region, analyze for an Initial Radial Gap (IRG).
2. If IRG, increase the linear region width until cap gaps appear.
3. Mask off that cluster.
4. GOTO 1, using a revised p, d at this point, or if either 2 or 3 has no gaps.

Notes:
1. OFP0 may not work well if the pipe runs through the edge of a round cluster. The philosophy is that, probabilistically, pipes are more likely to run through the center region of a round cluster since there is more of it. Next we try ways of adjusting p so that the pipe is more in the center of the subcluster.

2. I also tried Spherical when it appeared from the pipe analysis that we were at the center of a cluster. So far this didn't work out.
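The pipe projection in step 1 can be sketched as a mask on the barrel functional followed by a linear projection. This is only an illustration on plain Python tuples, with a hypothetical function name and d assumed to be a unit vector; the slides perform the same selection with pTree masks.

```python
def pipe_projection(points, p, d, radius):
    """Return the sorted (x-p).d values of the points whose distance to the
    d-line through p is at most radius (i.e., the points inside the pipe)."""
    kept = []
    for x in points:
        diff = [xi - pi for xi, pi in zip(x, p)]
        L = sum(a * b for a, b in zip(diff, d))   # linear functional
        S = sum(a * a for a in diff)              # spherical functional
        if S - L * L <= radius ** 2:              # barrel functional vs. radius^2
            kept.append(L)
    return sorted(kept)

# Illustrative data (not taken from the slides).
points = [(1, 0.5), (2, 3), (5, -0.2), (6, 0)]
print(pipe_projection(points, p=(0, 0), d=(1, 0), radius=1))
```

The returned values can then be scanned for linear gaps or thinnings, restricted to the points near the line.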

xod  Ct  gp
 0    3   1
 1   17   1
 2   39   1
 3   26   1
 4   10   1
 5   10   1
 6   17   1
 7   16   1
 8    6   1
 9    3   1
10    3

Thinning2 5 36 2

0 12 0

pipe

xod Ct gp in pipe2 8 13 14 14 51 region! Look for radial gaps

R   Ct  gp
 0   5  10
10   9   4
14  13   3
17   3   3
20   1   2
22   8   2
24   5   4
28   1   2
30   4   2
32   2   1
33   1   4
37  11   5
42   2   2
44  10   2
46   8   5
51   4   3
54   1   3
57   4   2
59   1   6
65   1   8
73   1

The alg only specifies looking at the first region, but it is interesting that other clusters are revealed. Next, linear gap analysis in the r=26 barrel.

40 2 2

5 0 3

0 0 31

0 0 12

xod Ct gp1 5 12 11 13 20 14 8 15 7 42 7 2

4.5<F<7.5 R Ct gp17 3 522 1 2... MinRad toohigh! reset p

4.5<F<7.5 R Ct gp 0 3 1010 3 414 5 620 4 222 7 628 1 230 9 232 6 1042 2 547 1 1360 2

4 34 0

1 2 2

xod Ct gp in pipe2 8 13 14 14 51 region! Look for radial gaps


Oblique FAUST Pipe 01 Clustering on SEEDS

1. Linearly project in a pdr0-pipe.2. For every very dense region (take just the most dense middle portion? Find first radial density falloff at r 1. If none GOTO 5.3. Linearly project pdr1 barrel to determine the linear extent of that dense region. If it fails to show up. GOTO 5.4. Mask off that cluster5. Revise p and d and GOTO 1.

p=nnnn q=xxxx, r0=1.5 pipe
R Ct gp: 0 8 1 | 1 50 1 | 2 38 1 | 3 20 1 | 4 18 1 | 5 7 1 | 6 8 1 | 7 1

L Ct gp: 2 2 1 | 3 7 1 | 4 5
No thinnings, so it's just one dense region. In the middle of the dense portion, [=3], find radial falloff, r1.

R Ct gp: 0 5 10 | 10 2 4 | 14 7 3 | 17 3 5 | 22 3 6 | 28 1 9 | 37 1 14 | 51 1 8 | 59 1 6 | 65 1 8 | 73
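The (value, Ct, gp) tables on these slides can be produced with very little code. The following is a minimal sketch, assuming the projected values (xod, L or R) are already available as a plain Python list; `gap_table` and `wide_gaps` are hypothetical helper names used only for illustration, and the gap threshold is a free parameter.

```python
from collections import Counter


def gap_table(values):
    """Return (value, count, gap) rows over the distinct values, sorted.

    gap is the distance to the next distinct value (None on the last row).
    """
    counts = Counter(values)
    keys = sorted(counts)
    rows = []
    for i, v in enumerate(keys):
        gap = keys[i + 1] - v if i + 1 < len(keys) else None
        rows.append((v, counts[v], gap))
    return rows


def wide_gaps(rows, threshold):
    """Candidate cut points: midpoints of all gaps at least `threshold` wide."""
    return [v + gap / 2 for v, _, gap in rows
            if gap is not None and gap >= threshold]
```

Masking off a cluster then amounts to keeping the rows whose projected value lies between two consecutive cut points.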

34 23 1


US ≡ Universe of Scalar pTreeSets. A ScalarpTreeSet (SpTS) is the complete set of pTrees for a column of real numbers (Complete: a pTree bit pos, - to

DPv := Dot Product with a fixed real vector, v∈V. Again, use v's bit pattern?

n-ary operations: US × ... × US → US

SPc = Scalar Product (multiplying every row of an SpTS by a constant, c). One can use * above or, possibly, use c's bit pattern to avoid constructing a constant SpTS.

ERa = EinRings = pTree mask of rows < a. Apply < above? Better, use a's bit pattern only?

AGa = YC's Aggregates: count, sum, avg, max, min, median, rank_k, top_k, IceBergQueries. (Here, the result is a number, but a number is a depth=1, width=1 SpTS.)

SDv = Square Distance from a fixed real vector, v∈V. Again, use v's bit pattern only?

add (row-wise): SpTS1 + ... + SpTSm → SpTSresult, where SpTSresult,k = SpTS1,k + ... + SpTSm,k

|SpTSresult| ≡ depth = min{|SpTS1|,..., |SpTSm|}

-, /, * are similarly defined row-wise operations.

=, >, <, ≤, ≥ are binary ops producing a mask pTree (i.e., bitwidth(SpTSresult)=1).
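As an illustration of the row-wise semantics just described, here is a minimal sketch that uses plain Python lists as stand-ins for SpTSs (an assumption; real SpTSs are bit-sliced and these operations would be done on the bit slices). The names `row_wise` and `compare_mask` are hypothetical.

```python
import operator


def row_wise(op, *columns):
    """Combine columns row by row; the result depth is the minimum input depth."""
    depth = min(len(c) for c in columns)
    result = list(columns[0][:depth])
    for col in columns[1:]:
        # zip stops at the shorter sequence, so depth is preserved
        result = [op(a, b) for a, b in zip(result, col)]
    return result


def compare_mask(pred, left, right):
    """Binary comparison producing a width-1 mask (one bit per row)."""
    return [1 if pred(a, b) else 0 for a, b in zip(left, right)]
```

For example, `row_wise(operator.add, s1, s2, s3)` plays the role of SpTS1 + SpTS2 + SpTS3, and `compare_mask(operator.lt, s1, s2)` plays the role of the mask pTree for s1 < s2.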

ToC entry for an SpTS in a RBaDB (Real Bitarray DataBase): predicate? level=? depth=? Min=? Rank¼=? Median=Rank½=? Rank¾=? Max=? Sum=?

bit pos        ptr|purity     1count
(n+1, ∞)       pure0          0
n              pointer_n      146374
n-1            pointer_n-1    9284
...
n'             pure1          depth
...
n"             pure0          0
...
-m             pointer_-m     48293847
(-∞, -m-1)     pure0

Notes:
0. pointerk = pointer to a bit vector.
1. 1st and last rows can be implied?
2. Other cols? (e.g., Identical Twins)
3. Need a separate ToC for PTreeSets (as sequences of same-depth SpTSs for tables of real numbers).
4. It's OK to black box SpTSs as real columns, since complex columns can then be defined at a higher ToC level via pointers to their real and imaginary parts.
5. To multiply by 2^k, shift the bit pos defs only.

So a cleaner ToC might be:

How do we define (black box) ScalarpTreeSets (SpTSs) in a ToC or VDB Catalog?

ToC for SpTSs in a RBaDB:
predicate? .. Sum?
Highest_NonPure0_Bit_Position=n
PointerArray=( ptr(n),...,ptr(-m) )
CountArray=( cnt(n),...,cnt(-m) )

Some red info is redundant. How much pre-computed (redundant?) info should be placed in the ToC? Rules of thumb: pre-compute everything that might be useful, and certainly pre-compute all info that might require Horizontal Processing of Vertical pTrees. I believe Min, Med, Max, Sum can be derived from ToC info (the counts) without accessing actual pTrees and thus should be stored in the ToC only if doing so saves significant processing time.

I.e., use offsets instead of keywords to implement the pTree pointer table.
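Of the aggregates mentioned, Sum is the one that follows most directly from the CountArray: each 1-bit in position k contributes 2^k, so Sum = Σ cnt(k)·2^k. The sketch below checks this on a small column. It is only an illustration under simplifying assumptions: non-negative integers (bit positions n-1 down to 0, no fractional positions down to -m), and a plain list of counts standing in for the ToC's CountArray. It does not attempt Min, Med or Max.

```python
def bit_slice_counts(column, n_bits):
    """Stand-in for CountArray: 1-bit counts per bit position, high to low."""
    return [sum((v >> k) & 1 for v in column)
            for k in range(n_bits - 1, -1, -1)]


def sum_from_counts(counts):
    """Column sum derived from the counts alone: sum of cnt(k) * 2^k."""
    n_bits = len(counts)
    return sum(cnt << (n_bits - 1 - i) for i, cnt in enumerate(counts))
```

With offsets (list positions) rather than keywords identifying the bit position, the ToC lookup is just an index into the array.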


Oblique FAUST with a comprehensive initial linear step (done in parallel?)

For table Y=(Y1,Y2,Y3,Y4), let n=minYk, x=maxYk, a=avgYk, m=medianYk, k=1|2|3|4.
For y∈Y, Lpq(y)=(y-p)o(q-p)/|q-p|, where p,q are any of the following.

p and q form diagonals:
0: nnnn - xxxx    8: xnnn - nxxx
1: nnnx - xxxn    9: xnnx - nxxn
2: nnxn - xxnx    a: xnxn - nxnx
3: nnxx - xxnn    b: xnxx - nxnn
4: nxnn - xnxx    c: xxnn - nnxx
5: nxnx - xnxn    d: xxnx - nnxn
6: nxxn - xnnx    e: xxxn - nnnx
7: nxxx - xnnn    f: xxxx - nnnn

8 thru f are the same diagonals as 7 thru 0, so we only need 2^(n-1) = 8 of them, 0 thru 7.
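The enumeration of distinct diagonals can be sketched as follows. This is an illustrative helper (the name `diagonals` is hypothetical): it walks the corner codes in the same order as the 0..f labelling and keeps only those whose first coordinate is n, since the remaining ones are the same lines traversed in the opposite direction.

```python
from itertools import product


def diagonals(n_dims):
    """Return (p, q) corner-code pairs, one per distinct diagonal."""
    pairs = []
    for corner in product("nx", repeat=n_dims):
        if corner[0] != "n":
            continue  # reversed copy of a diagonal already listed
        p = "".join(corner)
        q = "".join("x" if c == "n" else "n" for c in corner)  # opposite corner
        pairs.append((p, q))
    return pairs
```

For 4 columns this yields the 8 pairs labelled 0 thru 7; for d columns it yields 2^(d-1).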

or p and q form midlines, combo(n,n-1) of them = n!/((n-(n-1))!(n-1)!) = n = 4:
aaan - aaax, aana - aaxa, anaa - axaa, naaa - xaaa

or p and q run from a side-mid-pt to an opposite corner (n*2^(n-1) = 32):
aaan - xxxx xxnx xnxx xnnx nxxx nxnx nnxx nnnx
aana - xxxx xxxn xnxx xnxn nxxx nxxn nnxx nnxn
anaa - xxxx xxxn xxnx xxnn nxxx nxxn nxnx nxnn
naaa - xxxx xxxn xxnx xxnn xnxx xnxn xnnx xnnn

By substituting m for a (median for avg) we could get 32 more, but they seem likely to be essentially the same lines as given by a?

Possibility: Always use m instead of a as a center?

The 32 mdpt-corner lines are more and more distinct from the diagonals as the dimension increases.

There are 2^(n-1) + n + n*2^(n-1) = 8 + 4 + 32 = 44 (p,q) pairs to consider.

I have not used column numbers, e.g., n=minimum means n is a number and it is the minimum of a column of numbers (indicated here by the position in which it appears, not by subscript (offset, not keyword identification of the column)).

Each of n,x,a,m are width=4 vectors or 4-tuples of numbers, so they are: (n1, n2, n3, n4) (x1, x2, x3, x4) (a1, a2, a3, a4) (m1, m2, m3, m4).

As on slide 1, these are precomputed and stored as ToC info.

Let y∈Y, where Y is the PTreeSet for a table of reals. For each of the 44 (p,q) pairs, in a first step dot with (q-p): do all 44 dot products and gap analyses! Lot of work? Yes, but there is computational parallelism. (Note there is no need to insist on unit vectors: we can dot with q-p and then realize that the gaps will be |q-p| times wider than they would have been had we used dpq instead, which affects the choice of threshold only.) So if p=(n1, x2, n3, x4) then q=(x1, n2, x3, n4), and then (y-p)o(q-p) expands into precomputable terms.

So if, given Y, we pre-compute all the scalar-times-SpS multiplications of Yk with the 16 pre-computed numbers nk, xk, ak and mk, then any of these dot product SpSs is an 8-sum of those we precomputed and is thus not much work (4*45=180 8-sums).

It may be possible to engineer the 8-summing process to cut time even further? Or to engineer an efficient way to do 8 scalar multiplications and the 8-summings in one efficient operation. Of course, it won't always be 8. For the Netflix Movie PTreeSet, there are 17,000 columns, not 4; and for the User PTreeSet there are 500,000.

If finding outliers (anomalies), since most anomalies occur at the outside boundaries of a set, this simple method might find all of them without further processing.

The other aspect of OF that needs attention is making this local, in the sense that clusters are revealed by local application of linear gap analysis even though, for the entire space, no gaps appear. This is the reason to introduce barrel gaps if the initial linear gap analysis fails.

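(y-p)o(q-p) = yoq - yop - poq + pop
 = (y1,y2,y3,y4)o(x1,n2,x3,n4) - (y1,y2,y3,y4)o(n1,x2,n3,x4) - poq + pop
 = (x1y1 + n2y2 + x3y3 + n4y4) - (n1y1 + x2y2 + n3y3 + x4y4) - poq + pop

The eight products xkyk and nkyk are precomputed; poq and pop are fixed numbers computed from the precomputed vectors.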
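The expansion can be checked numerically. The sketch below is an assumption-laden illustration: rows of Y are plain tuples rather than pTrees, only the n and x multiples are precomputed, and the names `precompute`, `corner` and `project` are hypothetical. For p=(n1,x2,n3,x4) and q=(x1,n2,x3,n4) it assembles (y-p)o(q-p) for every row from the precomputed scaled columns plus the constant pop - poq.

```python
def precompute(Y, n, x):
    """Scaled columns n_k*Y_k and x_k*Y_k for every column k of table Y."""
    cols = list(zip(*Y))
    return {
        "n": [[n[k] * v for v in cols[k]] for k in range(len(cols))],
        "x": [[x[k] * v for v in cols[k]] for k in range(len(cols))],
    }


def corner(code, n, x):
    """Turn a code such as 'nxnx' into the actual corner point."""
    return tuple(n[k] if c == "n" else x[k] for k, c in enumerate(code))


def project(pre, p_code, q_code, n, x):
    """(y-p)o(q-p) per row as yoq - yop - poq + pop, using only precomputed products."""
    p, q = corner(p_code, n, x), corner(q_code, n, x)
    const = sum(a * a for a in p) - sum(a * b for a, b in zip(p, q))  # pop - poq
    rows = len(pre["n"][0])
    out = []
    for r in range(rows):
        yoq = sum(pre[c][k][r] for k, c in enumerate(q_code))
        yop = sum(pre[c][k][r] for k, c in enumerate(p_code))
        out.append(yoq - yop + const)
    return out
```

Because the projection is not normalized by |q-p|, the gaps it shows are |q-p| times wider than those of the unit-vector projection, so only the gap threshold needs adjusting.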


Oblique FAUST with a comprehensive initial linear step using a central point, p

For Y=(Y1,Y2,Y3,Y4), n=minYk, x=maxYk, m=medYk, o=rank¼Yk, t=rank¾Yk, k=1|2|3|4.
For y∈Y, Lpq(y)=yo(q-p)/|q-p|, p=central pt=the 1st pt in the 1st ring around p in SVoM(Y).

q is any of (2^(n-1) of them):
xxxx xxxn xxnx xxnn xnxx xnxn xnnx xnnn

or q: m=median pt=Rank½ (n of them):
mmmx mmxm mxmm xmmm

or q: o=Rank¼, t=Rank¾ pt (n*2^(n-1) of them):
tttx ttox totx toox ottx otox ootx ooox
ttxt ttxo toxt toxo otxt otxo ooxt ooxo
txtt txto txot txoo oxtt oxto oxot oxoo
xttt xtto xtot xtoo xott xoto xoot xooo

[Figure: 2-D illustration of the central point p and the q points xx, xn, mx, xm, tx, ox, xt, xo]

What we are attempting to do here is get a full coverage of ~evenly spaced projection lines (pq-lines or d-lines). We are doing so by attempting to evenly space q on [half of] the sides. That seems easier than evenly spacing points on [half of] the sphere (using angles). If dim=N is high, there may be unevenness to this method? However, using rank(k/N), k=1..N-1, instead of length increments of (max-min)/N should ameliorate that? We can calculate the ranks using our logn pTree rank procedures.

Furthermore, once we move to barrel gapping to limit the radial reach (and therefore materialize gaps for large datasets that would not appear with just linear analysis), using an actual point in the set as p has definite advantages (it always centers the barrel on actual points in the space, so we get an r=0 radius every time).
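The difference between rank-based and length-based spacing of the q coordinates can be seen on a skewed column. The sketch below is illustrative only: it sorts the column instead of using the pTree rank procedures, and its definition of rank(k/N) as the element at position floor(k(len-1)/N) of the sorted column is an assumption. Both function names are hypothetical.

```python
def rank_points(column, N):
    """Coordinates at rank(k/N), k=1..N-1 (assumed positional definition)."""
    s = sorted(column)
    return [s[(k * (len(s) - 1)) // N] for k in range(1, N)]


def length_points(column, N):
    """Coordinates at equal length increments of (max-min)/N."""
    lo, hi = min(column), max(column)
    return [lo + k * (hi - lo) / N for k in range(1, N)]
```

On the column [0, 1, 2, 3, 100] with N=4, the rank-based points stay inside the bulk of the data while the length-based points are pulled toward the single large value.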

Oblique FAUST 2nd Barrel step (paralleling?)

1. L(q-p)/|q-p|(y) = (y-p)o(q-p)/|q-p|: find outliers and linearly gapped clusters. (1/|q-p| = a)
2. Then look for barrel gaps around all dense regions (between thinnings) using
   Sp(y) = (y-p)o(y-p) = yoy - 2yop + pop
   and
   B(q-p)/|q-p|(y) = Sp(y) - L(q-p)/|q-p|(y)^2 = yoy - 2yop + pop - a^2 (yoq - yop - poq + pop)^2

Every dot product of y with p or q in the above is a linear combination of SpSs of the form b*Yk for some b∈{n,x,a,o,m,t} (poq and pop are fixed numbers, and yoy is the sum of the Yk*Yk).

Thus, if we precompute those SpSs we can put together L and B projections efficiently.
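A small numeric check of the L and B formulas, written as a sketch on a single point rather than on SpSs (an assumption made for brevity; the function names are hypothetical). B is the squared radial distance of y from the pq-line, so a point lies in the barrel of radius r0 when B <= r0^2.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_and_barrel(y, p, q):
    """Return (L, B) for point y using only dot products of y, p and q."""
    d = [b - a for a, b in zip(p, q)]
    a2 = 1.0 / dot(d, d)                                     # a^2, with a = 1/|q-p|
    raw = dot(y, q) - dot(y, p) - dot(p, q) + dot(p, p)      # (y-p)o(q-p)
    L = raw * math.sqrt(a2)                                  # position along the line
    S = dot(y, y) - 2 * dot(y, p) + dot(p, p)                # (y-p)o(y-p)
    B = S - a2 * raw ** 2                                    # squared radial distance
    return L, B


def in_barrel(y, p, q, r0):
    """True when y is within radius r0 of the pq-line."""
    _, B = linear_and_barrel(y, p, q)
    return B <= r0 ** 2
```

With p=(0,0), q=(2,0) and y=(3,4), the point sits 3 units along the line and 4 units away from it, which is what the two formulas return (L=3, B=16).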