![Page 1: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/1.jpg)
Exploring and Exploiting Interestingness in Data Science
Kirk Borne (@KirkDBorne)
Principal Data Scientist
Booz Allen Hamilton
http://www.boozallen.com/datascience
[Figure: data histogram; x-axis from -8 to 8, counts from 0 to 15000]
![Page 2: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/2.jpg)
Astronomy + Data + Statistics = Long-time friends and acquaintances!
“Statistical Astronomy” (1953; 644 pages)
Now there are new fields of research and education in Astronomy: Astrostatistics and Astroinformatics (http://asaip.psu.edu/)
![Page 3: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/3.jpg)
Sniffing out cold cases with DOGs: Difference of Gaussians discovers the Field of Streams around the Milky Way galaxy
http://classic.sdss.org/includes/sideimages/fos_dr6_marked.html
![Page 4: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/4.jpg)
Interestingness in Data: Moving beyond Outlier Detection to Surprise Discovery!
• Outlier Detection 1.0 = Distance-based
• Outlier Detection 2.0 = Density-based
• Outlier Detection 3.0 = Pattern-based:
– Finding the interesting, unexpected pattern (trend, correlation, change-point, segment, precursor signal, association) in your data
– To facilitate more insightful ‘data-to-action’
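The difference between the first two generations listed above can be sketched in a few lines. This is a minimal illustration rather than the method used in the talk: the function names, the sample values, and the choice of k are all hypothetical. The distance-based score measures how far a value lies from the mean (in standard deviations); the density-based score is the mean distance to the k nearest neighbours, so it also flags a point sitting in a sparse gap between two dense groups.

```python
import statistics

def distance_scores(values):
    """Distance-based: |x - mean| in units of the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [abs(x - mu) / sigma for x in values]

def density_scores(values, k=2):
    """Density-based: mean distance to the k nearest neighbours (large = sparse)."""
    scores = []
    for i, x in enumerate(values):
        gaps = sorted(abs(x - y) for j, y in enumerate(values) if j != i)
        scores.append(sum(gaps[:k]) / k)
    return scores

# Hypothetical data: two dense groups and one point in the gap between them.
data = [1.0, 1.1, 0.9, 1.2, 5.0, 9.0, 9.1, 8.9, 9.2]

dist = distance_scores(data)
dens = density_scores(data)

# The gap point (5.0) lies near the overall mean, so the distance-based score
# ranks it as the least unusual value, while the density-based score ranks it
# as the most isolated one.
least_distant = data[dist.index(min(dist))]
most_isolated = data[dens.index(max(dens))]
print(least_distant, most_isolated)
```

The same point is invisible to one score and the top hit for the other, which is why the choice of outlier definition matters before any pattern-based step.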
![Page 5: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/5.jpg)
The Data Science Revolution = Moving from data to insight to action!
![Page 6: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/6.jpg)
All of the features in the data histogram convey valuable (actionable) information (the long tail, outliers, multi-modal peaks, …)
[Figure: data histogram; x-axis from -8 to 8, counts from 0 to 14000]
![Page 7: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/7.jpg)
Mixture Models = Statistical Clustering
Each of these data histograms can be represented by the mixture (i.e., sum) of several Gaussian normal distributions, such as the 3 Gaussian distributions shown in the lower right.
Each Gaussian statistically represents (characterizes) one “cluster” of data values within the full set of data values.
Comprehensive web resource for Mixture Models for clustering and unsupervised learning in Data Mining: http://www.csse.monash.edu.au/~dld/mixture.modelling.page.html
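A mixture of Gaussians as described above can be written down directly. The sketch below is illustrative only: the three components and their (weight, mean, standard deviation) values are hypothetical, not taken from the histograms in the slides. It evaluates the mixture density as a weighted sum and computes, for one data value, the probability that it belongs to each component, which is the statistical "cluster tag" for that value.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Mixture density: the weighted sum of the component Gaussians."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

def responsibilities(x, components):
    """Probability that x belongs to each component (its soft cluster tag)."""
    parts = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(parts)
    return [p / total for p in parts]

# Hypothetical 3-component mixture: (weight, mean, standard deviation).
# The weights sum to 1 so that the mixture is itself a probability density.
components = [(0.5, -3.0, 1.0), (0.3, 0.0, 0.5), (0.2, 4.0, 1.5)]

x = 0.2
resp = responsibilities(x, components)
cluster = resp.index(max(resp))  # index of the most likely component
print(mixture_pdf(x, components), resp, cluster)
```

Fitting the weights, means, and standard deviations to real data (for example with expectation-maximization) is a separate step that this sketch does not attempt.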
![Page 8: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/8.jpg)
Statistical Clustering tags (characterizes) the data, enabling discovery: making the data “smart”!
Each Gaussian in the mixture can be characterized by various parameters, such as the mean, the variance (or standard deviation), and the amplitude (i.e., the strength of that particular Gaussian component within the mixture).
These parameters can be plotted as a function of some independent (treatment) variable, to discover trends and correlations in the effects across the different segments of the population.
https://www.researchgate.net/publication/6200224_Conformational_entropy_in_molecular_recognition_by_proteins
![Page 9: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/9.jpg)
Massive data collections unlock deeper insights into hard problems and complex systems
… Be careful what you wish for!!!!
![Page 10: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/10.jpg)
4 Types of Machine Learning Discovery from Data:
(Graphic by S. G. Djorgovski, Caltech)
1) Class Discovery: Find the categories of objects (population segments), events, and behaviors in your data. + Learn the rules that constrain the class boundaries (that uniquely distinguish them).
2) Correlation (Predictive and Prescriptive Power) Discovery: (insights discovery) – Find trends, patterns, dependencies in data that reveal the governing principles or behavioral patterns (the object’s “DNA”).
3) Outlier / Anomaly / Novelty / Surprise Discovery: Find the new, surprising, unexpected one-in-a-[million / billion / trillion] object, event, or behavior.
4) Association (or Link) Discovery: (Graph and Network Analytics) – Find both the typical (usual) and the atypical (unusual, interesting) data associations / links / connections in your domain.
![Page 11: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/11.jpg)
Examples of Interestingness in Data
1) Outliers
2) Counting
3) Mapping
4) Associations
5) Linking
6) Clustering
7) Looking
![Page 12: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/12.jpg)
![Page 13: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/13.jpg)
Trend Lines in big data sets: Descriptive Analytics! It is tempting to over-fit every wiggle in the data.
[Figure: “92 Naturally Occurring Elements”; Boiling Point (y-axis, -1000 to 7000) vs. Melting Point (x-axis, 0 to 4500)]
![Page 14: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/14.jpg)
This is a better fit to the trend line… (generalization!) for use in Predictive Analytics!
[Figure: “92 Naturally Occurring Elements”; Boiling Point (y-axis, 0 to 7000) vs. Melting Point (x-axis, 0 to 4500)]
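One way to read the contrast above is that a model with few parameters cannot chase every wiggle. The sketch below assumes the generalizing fit is a straight line, which the transcript does not confirm, and uses hypothetical points rather than the element data. It computes an ordinary least-squares line and the residuals that remain around it.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical noisy points scattered around y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 2.8, 5.1, 6.9, 9.0]

slope, intercept = fit_line(xs, ys)

# Residuals are what the trend line leaves unexplained; large ones are
# candidate outliers worth a closer look rather than automatic noise.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(round(slope, 2), round(intercept, 2))
```

The two fitted numbers are the whole model, so predictions for new x values depend on the overall trend rather than on any single point.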
![Page 15: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/15.jpg)
Trend Line and Outliers: where is the real discovery?
Sometimes we are tempted to think that outliers are just noise or natural variance.
[Figure: “Boiling Points and Melting Points of the 92 Chemical Elements”; Boiling Point (y-axis, 0 to 7000) vs. Melting Point (x-axis, 0 to 5000)]
![Page 16: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/16.jpg)
![Page 17: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/17.jpg)
![Page 18: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/18.jpg)
Trend Line and Outliers: Add some context to the data!
… that diagonal line in the plot (where melting point = boiling point) … this provides some context (related to your prior knowledge)!
[Figure: “Boiling Points and Melting Points of the 92 Chemical Elements”; Boiling Point (y-axis, 0 to 7000) vs. Melting Point (x-axis, 0 to 5000), with a diagonal line where melting point = boiling point]
![Page 19: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/19.jpg)
![Page 20: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/20.jpg)
Trend Line and Outliers: What is that point below the line?
![Page 21: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/21.jpg)
Trend Line and Outliers: there’s the real discovery!
Arsenic: melts at 1089 K, boils at 889 K (its boiling point is lower than its melting point, so it falls below the diagonal)
[Figure: “Boiling Points and Melting Points of the 92 Chemical Elements”; Boiling Point (y-axis, 0 to 7000) vs. Melting Point (x-axis, 0 to 5000)]
![Page 22: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/22.jpg)
[Figure: the same plot with the outlier point labeled “Arsenic!”]
Novelty Discovery!
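The context check behind this discovery (is a point above or below the diagonal?) is simple to automate. In the sketch below, the arsenic values are the ones quoted on the slide; the other entries are rounded reference values in kelvin included only for illustration, and the function name is hypothetical.

```python
# Approximate melting and boiling points in kelvin. The arsenic values are the
# ones quoted on the slide; the others are rounded reference values included
# only for illustration.
elements = {
    "Nitrogen": (63, 77),
    "Oxygen": (54, 90),
    "Mercury": (234, 630),
    "Lead": (601, 2022),
    "Arsenic": (1089, 889),
    "Iron": (1811, 3134),
}

def below_diagonal(points):
    """Return the names whose boiling point is lower than their melting point."""
    return [name for name, (melt, boil) in points.items() if boil < melt]

surprises = below_diagonal(elements)
print(surprises)
```

The rule encodes prior knowledge (a substance normally melts before it boils), so a violation is flagged as a surprise regardless of how close the point sits to the trend line.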
![Page 23: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/23.jpg)
![Page 24: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/24.jpg)
Correlation / Trend / Association Discovery == Predictive and Prescriptive Power Discovery!
![Page 25: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/25.jpg)
![Page 26: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/26.jpg)
“What is going on in that neighborhood on Saturday evenings between 6pm and 8pm?”
Source for graphic: https://www.boozallen.com/s/insight/publication/field-guide-to-data-science.html
![Page 27: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/27.jpg)
![Page 28: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/28.jpg)
Association Discovery Example #1
Classic Textbook Example of Data Mining (Legend?): Data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time.
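An association of this kind is usually quantified with support, confidence, and lift. The sketch below computes all three for a single rule; the baskets are made up for illustration (they are not real store data) and the function name is hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_c = sum(1 for t in transactions if consequent in t)
    has_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = has_both / n            # how often the pair occurs at all
    confidence = has_both / has_a     # P(consequent | antecedent)
    lift = confidence / (has_c / n)   # > 1 means more often than by chance
    return support, confidence, lift

# Hypothetical shopping baskets (made up for illustration, not real store data).
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"bread", "chips"},
    {"diapers", "beer", "milk"},
    {"bread", "milk"},
]

support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(support, confidence, lift)
```

A lift above 1 is what makes a rule interesting: it says the two items co-occur more often than their individual popularity would predict.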
![Page 29: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/29.jpg)
Association Discovery Example #2
Wal-Mart studied product sales in their Florida stores in 2004 when several hurricanes passed through Florida. Wal-Mart found that, before the hurricanes arrived, people purchased 7 times as many of {one particular product} compared to everything else.
![Page 30: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/30.jpg)
Association Discovery Example #2
Wal-Mart studied product sales in their Florida stores in 2004 when several hurricanes passed through Florida. Wal-Mart found that, before the hurricanes arrived, people purchased 7 times as many strawberry pop tarts compared to everything else.
![Page 31: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/31.jpg)
Strawberry pop tarts???
http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html
http://www.hurricaneville.com/pop_tarts.html
http://bit.ly/1gHZddA
![Page 32: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/32.jpg)
Association Rule Discovery for Hurricane Intensification Forecasting
• Research by GMU geoscientists
• Predict the final strength of hurricane at landfall.
• Find co-occurrence of final hurricane strength with specific values of measured physical properties of the hurricane while it is still over the ocean.
• Result: the association rule discovery prediction is better than National Hurricane Center prediction!
• Research Paper by GMU scientists: https://ams.confex.com/ams/pdfpapers/84949.pdf
![Page 33: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/33.jpg)
![Page 34: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/34.jpg)
“All the World is a Graph” – Shakespeare?
The natural data structure of the world is not rows and columns, but a Graph!
![Page 35: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/35.jpg)
“Everything connects to everything else” - Leonardo da Vinci
![Page 36: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/36.jpg)
(Graphic by Cray, for Cray Graph Engine CGE)
http://www.cray.com/products/analytics/cray-graph-engine
![Page 37: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/37.jpg)
![Page 38: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/38.jpg)
Simple Example of the Power of Graph: Semi-Metric Space
• Entity {1} is linked to Entity {2} (small distance A)
• Entity {2} is linked to Entity {3} (small distance B)
• Entity {1} is *not* linked directly to Entity {3} (Similarity Distance C = infinite)
• Similarity Distances A, B, and C violate the triangle inequality (C > A + B)!
• The connection between black hat entities {1} and {3} never appears explicitly within a transactional database.
• Examples: (a) Medical Research Discoveries across disconnected journals, through linked semantic assertions; (b) Customer Journey modeling; (c) Safety Incident Causal Factor Analysis; (d) Marketing Attribution Analysis; (e) Fraud networks, Illegal goods trafficking networks, Money-Laundering networks.
[Diagram: {1} linked to {2}, {2} linked to {3}]
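The three-entity example above can be reproduced with a small graph. In this sketch the distance values, the helper names, and the use of the Floyd-Warshall all-pairs algorithm are illustrative assumptions; the point is only that the direct distance between {1} and {3} is infinite while the path through {2} is short.

```python
import math

def shortest_paths(nodes, direct):
    """Floyd-Warshall: shortest path length between every pair of nodes."""
    dist = {(a, b): (0.0 if a == b else direct.get((a, b), math.inf))
            for a in nodes for b in nodes}
    for mid in nodes:
        for a in nodes:
            for b in nodes:
                via = dist[(a, mid)] + dist[(mid, b)]
                if via < dist[(a, b)]:
                    dist[(a, b)] = via
    return dist

def symmetric(edges):
    """Make each undirected link available in both directions."""
    out = dict(edges)
    out.update({(b, a): d for (a, b), d in edges.items()})
    return out

# Hypothetical distances: A links {1}-{2}, B links {2}-{3}; {1}-{3} has no link.
A, B = 1.0, 1.5
direct = symmetric({(1, 2): A, (2, 3): B})
nodes = [1, 2, 3]

C = direct.get((1, 3), math.inf)  # direct similarity distance: infinite
violates = C > A + B              # the triangle inequality would need C <= A + B
indirect = shortest_paths(nodes, direct)[(1, 3)]
print(C, violates, indirect)
```

Pairs whose indirect distance is much smaller than their direct distance are the candidates for hidden connections that a row-and-column view of the transactions would not show.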
![Page 39: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/39.jpg)
analytics.gmu.edu | CDDA Spring 2014 Workshop
Research Example: Literature-Based Discovery (LBD)
References:
• https://www.sciencedirect.com/science/article/pii/S0950705116303860
• https://summerofhpc.prace-ri.eu/introducing-lbdream-and-literature-based-discovery/
![Page 40: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/40.jpg)
Research Example: Discovery in the NIH-NLM Semantic MEDLINE Database
Project Description: Conduct semantic graph mining of the NIH-NLM metadata repository from ~26 million medical research articles.
Graph Database: ~90 million RDF triples (predications; semantic assertions).
Research Project: (PhD dissertation at GMU) Novel subgraph discovery; Context-based discovery; New concept emergence in medical research; Story discovery in linked graph network; and Hidden knowledge discovery through semi-metrics.
https://skr3.nlm.nih.gov/SemMedDB/
![Page 41: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/41.jpg)
![Page 42: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/42.jpg)
Clustering = the process of partitioning a set of data into subsets (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging to that cluster than to data elements belonging to other clusters.
= Group together similar items + separate the dissimilar items
= Identify similar characteristics, patterns, or behaviors among subsets of the data elements.
Challenges:
#1) No prior knowledge of the number of clusters.
#2) No prior knowledge of semantic meaning of the clusters.
#3) Different clusters are possible from the same data set!
#4) Different clusters are possible using different similarity metrics.
![Page 43: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/43.jpg)
How to know if your clusters are good enough:
You know the clusters are good …
… if the clusters are compact relative to their separation
… if the clusters are well separated from one another
… if the “within cluster” errors are small (low variance within)
… if the number of clusters is small relative to the number of data points
Various measures of cluster compactness exist, including the Dunn index, the C-index, and the DBI (Davies-Bouldin Index).
Reference: http://www.biomedcentral.com/content/supplementary/1471-2105-9-90-S2.pdf
![Page 44: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/44.jpg)
Application of Davies-Bouldin Index
Assume K (the number of clusters) and assume other things (choice of clustering algorithm; the choice of clustering feature attributes; etc.)
Measure DBI
Test another set of values for the cluster input parameters (K, feature attributes, etc.)
Measure DBI
… continue iterating like this until you find the set of cluster input parameters that yields the best (minimum) value for DBI.
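The iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data are hypothetical 1-D values, a plain k-means with a deterministic start stands in for whatever clustering algorithm is chosen, only K is varied, and the function names are made up for this example.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain k-means on 1-D data with a deterministic, evenly spread start."""
    ordered = sorted(values)
    # Initial centers: k values spread evenly through the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def davies_bouldin(centers, clusters):
    """Davies-Bouldin index: mean over clusters of the worst (S_i + S_j) / M_ij."""
    pairs = [(m, c) for m, c in zip(centers, clusters) if c]
    # S_i: average distance of the members of cluster i to its center.
    scatter = [sum(abs(v - m) for v in c) / len(c) for m, c in pairs]
    total = 0.0
    for i, (mi, _) in enumerate(pairs):
        # M_ij: distance between the centers of clusters i and j.
        total += max((scatter[i] + scatter[j]) / abs(mi - mj)
                     for j, (mj, _) in enumerate(pairs) if j != i)
    return total / len(pairs)

# Hypothetical data with three visible groups.
data = [1.0, 1.2, 0.8, 1.1, 4.0, 4.3, 3.8, 4.1, 10.0, 10.2, 9.9, 10.3]

# Try several values of K, measure DBI for each, and keep the minimum.
scores = {}
for k in range(2, 5):
    centers, clusters = kmeans_1d(data, k)
    scores[k] = davies_bouldin(centers, clusters)

best_k = min(scores, key=scores.get)
print(scores, best_k)
```

In practice the same loop would also range over the other inputs mentioned above (feature attributes, clustering algorithm), with DBI as the single number being minimized.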
![Page 45: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/45.jpg)
Scientific Discovery from Cluster Analysis of data parameters from events on the Sun and around the Earth
![Page 46: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/46.jpg)
Cluster Analysis: Find the clusters, then Evaluate them
[Figure: “DBI for Dst_Vsw_Bz”; D-B Index (y-axis, 0.8 to 1.2) vs. Time Shift, the delay (hr) of Dst from Vsw and Bz (x-axis, 0 to 12); series: 2C DBI, 3C DBI, 4C DBI, Average]
Figure 10. Davies-Bouldin index for various time delays of Dst from Vsw and Bz for cases of 2 (blue), 3 (red), 4 (yellow) clusters, and the overall average (purple), indicating an optimal delay of ~2-3 hours for Dst.
Good Clusters = Small Size relative to Cluster Separation.
DISCOVERY! ... Solar wind events have the strongest association (i.e., the tightest clusters) with the space plasma events within the Earth’s magnetosphere about 2-4 hours after a major plasma outburst occurs on the Sun.
![Page 47: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/47.jpg)
![Page 48: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/48.jpg)
“You can see a lot by just looking”
(and you can see around corners!)
Cognitive, Contextual, Insightful, Forecastful
https://www.speedcafe.com/2017/07/12/f1-demo-take-place-london-streets/
![Page 49: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/49.jpg)
Final Thoughts
![Page 50: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/50.jpg)
In the Big Data era, Everything is Quantified and Monitored:
– Populations & Persons
– Smart Cities, Energy, Grids, Farms, Highways
– Environmental Sensors
– IoE = Internet of Everything!
Discovery through Machine Learning and Data Science:
– Class Discovery, Correlation Discovery, Novelty Discovery, and
– Association Discovery: Find interesting cases where condition X is associated with event Y with time shift Z.
Big Data: Sustainable Development Goals
17 SDGs are KPIs for the World! (currently, the SDGs have 229 Key Performance Indicators)
(SDG: Sustainable Development Goal)
Big Data + the IoT + Citizen Data Scientists = Partners in Sustainability
The Internet of Things (IoT): Knowing the knowable via deep, wide, and fast data from ubiquitous sensors!
http://www.unglobalpulse.org
![Page 51: 15000 10000 5000 0 -8 -6 -4 -2 0 2 4 6 8 Exploring and ... · (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging](https://reader033.vdocument.in/reader033/viewer/2022042222/5ec8940a0ed963761e148f06/html5/thumbnails/51.jpg)
Thank you!
Contact information, for further questions or inquiries:
Dr. Kirk Borne, Principal Data Scientist, Booz Allen Hamilton
Twitter: @KirkDBorne or Email: [email protected]
Get slides here: http://www.kirkborne.net/ASA2018/
Booz | Allen | Hamilton