8/6/2019 Data Mining in HEd UCSF
Data Mining System in Higher Education & Persistence Clustering and Prediction
Jing Luan, Ph.D., ITMC
Director, Planning and Research, Cabrillo College
October, 2001
-
Jing Luan, UCSF/SPSS, 2001
In 45 minutes
Tiered Knowledge Management Model (TKMM)
Data Mining Overview: concept and use
Demonstration of Clementine
Data Mining plan at your college
Data mining, statistics and OLAP
Q&A
-
Tiered Knowledge Management Model (TKMM)
[Diagram: tiers one, two, and three shown under two headings, Tacit Knowledge and Explicit Knowledge. Component labels: Portals, CRM, Data Warehouses, Enterprise Resource Planning (ERP), Middleware, OLAP, Data Mining, Collaborative Working Environment (CWE), Knowledge Base, Knowledge Workers, Knowledge Mapping]
-
TKMM: Explicit Knowledge Management
TIER ONE
Data Engines: SQL Server, Oracle, Informix, Sybase, UniData, DB2
Enterprise Resource Planning (ERP): PeopleSoft, Datatel, SAP, Oracle, Banner
TIER TWO
Querying: BrioQuery, Business Objects, PowerPlay, Access, FoxPro
Online Data Processing: ASP, JSP, iHTML, XML
TIER THREE
Mining: Clementine, Enterprise Miner, Statistica, Mineset, Darwin, SpotFire
Classical statistics: SPSS, SAS, BMDP, SYSTAT
Topography of Tiered Knowledge Management Model (TKMM) for explicit knowledge
Many data mining projects fail due to a lack of understanding of these three tiers, particularly of data (feature) extraction in Tier One.
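To make the point about data (feature) extraction in Tier One concrete, here is a minimal sketch that rolls term-level rows up into one feature row per student. The table name, column names, and values are hypothetical, and an in-memory SQLite database stands in for whichever data engine a college actually runs.

```python
import sqlite3

# Hypothetical term-level records: (student_id, term, units, gpa)
rows = [
    (1, "F00", 12.0, 3.2),
    (1, "S01", 9.0, 2.8),
    (2, "F00", 6.0, 3.9),
]

conn = sqlite3.connect(":memory:")  # stand-in for a Tier One data engine
conn.execute(
    "CREATE TABLE enrollment (student_id INTEGER, term TEXT, units REAL, gpa REAL)"
)
conn.executemany("INSERT INTO enrollment VALUES (?, ?, ?, ?)", rows)

# Aggregate to one row per student, the unit of analysis a mining model expects.
features = conn.execute(
    """
    SELECT student_id,
           COUNT(*)   AS n_terms,
           SUM(units) AS total_units,
           AVG(gpa)   AS mean_gpa
    FROM enrollment
    GROUP BY student_id
    ORDER BY student_id
    """
).fetchall()

for row in features:
    print(row)
conn.close()
```

The design choice worth noting is the grain: transactional systems store one row per student per term, while models such as persistence prediction usually need one row per student.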
-
Guiding Principles
LRM (Learner Relationship Management)
Student Life Cycle
Student Clustering, student types
Data source and quality
CRISP-DM (all about a system)
The One-Percent Doctrine
-
Data Mining in Higher Ed
Alumni
Institutional Effectiveness
Marketing
Enrollment Management
-
Data Mining in Higher Ed - institutional effectiveness
What do we know about our students?
What factors contribute to learning?
Who is likely to fail or drop out?
What courses provide high FTES and use space better?
What are the course-taking patterns?
-
Data Mining in Higher Ed - enrollment management
Which groups prefer what services?
Which student is likely to drop out?
Where do our students come from?
Who is likely to return?
-
Data Mining in Higher Ed - marketing
Who is likely to respond to our new marketing strategy?
What factor garners the highest response?
Which type of marketing works better?
-
Data Mining in Higher Ed - alumni
What different types of alumni are there?
Who is likely to pledge for which amount and when?
-
Lift Chart: Gain Chart
Hypothetical database marketing campaign
[Chart: lift curve with marked points at the 40th percentile and the 70th percentile. Labels: 0, 25%, 35%, quota, Lift, Savings ($)]
If every percentage point = $2,500, savings = (70 * $2,500) - (40 * $2,500) = $175,000 - $100,000 = $75,000
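The savings arithmetic on this slide can be checked with a short sketch. The $2,500 value per percentage point and the 40th and 70th percentiles come from the slide. Reading them as "the model reaches the quota after contacting 40% of the database instead of 70%" is an assumption, and the function name is hypothetical.

```python
VALUE_PER_POINT = 2_500  # dollars per percentage point, from the slide


def contact_cost(percent_contacted):
    """Cost of contacting the given percentage of the database."""
    return percent_contacted * VALUE_PER_POINT


without_model = contact_cost(70)  # assumed: quota reached at the 70th percentile
with_model = contact_cost(40)     # assumed: same quota reached at the 40th percentile
savings = without_model - with_model

print(f"${without_model:,.0f} - ${with_model:,.0f} = ${savings:,.0f}")
# prints "$175,000 - $100,000 = $75,000"
```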
-
Artificial Neural Networks (ANN)
Multi-layer perceptron (MLP): feed forward, back propagation
[Diagram: inputs x1 # of Terms, x2 GPA, x3 Demographics, x4 Courses, x5 Fin Aid, ... xjn; weights w1 ... w5n; outputs o1 Persist, o2 Not-persist]
$$o_j = f\left(\sum_{i=1}^{n} w_{ji}\, o_i\right)$$
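As a rough illustration of the feed-forward step, the sketch below computes each unit's output as an activation function applied to the weighted sum of the previous layer's outputs. The sigmoid activation, the layer sizes, the input scaling, and every weight value are assumptions made for the example. Back propagation (the training step) is omitted.

```python
import math


def sigmoid(z):
    """Logistic activation, used here as an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weights):
    """For each unit j, return f(sum_i w_ji * o_i) over the incoming values."""
    return [sigmoid(sum(w * o for w, o in zip(w_j, inputs))) for w_j in weights]


# Hypothetical, already-scaled inputs: # of terms, GPA, demographics, courses, fin aid
x = [0.5, 0.8, 0.0, 0.6, 1.0]
hidden_w = [[0.2, 0.4, -0.1, 0.3, 0.1], [-0.3, 0.1, 0.2, -0.2, 0.5]]
output_w = [[1.0, -1.0], [-1.0, 1.0]]  # rows: o1 Persist, o2 Not-persist

hidden = layer_output(x, hidden_w)
o1, o2 = layer_output(hidden, output_w)
print(f"persist={o1:.3f} not_persist={o2:.3f}")
```

With real data the weights would be learned from records whose persistence outcome is already known, not set by hand.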
-
Decision Trees Rule Induction
Rule 1:
If Income >= $55,000 and # of Children = 3, then multiple policies
Rule 2:
If Income < $55,000, and single, and Age < 30, then single policy
Information theorem:
$$H(N) = -\sum_{i=1}^{n} P(n_i) \log_2 P(n_i)$$
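The information measure behind rule induction can be sketched as Shannon entropy over the class labels at a node. The function name and the sample labels are hypothetical. Tree algorithms use a measure of this kind to prefer splits that leave purer groups.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1/p), i.e. -sum(p * log2(p))."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


mixed = ["multiple", "single", "multiple", "single"]
pure = ["single", "single", "single", "single"]

print(entropy(mixed))  # 1.0: an even two-class split is maximally uncertain
print(entropy(pure))   # 0.0: a pure node carries no uncertainty
```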
-
The Use of Clementine
Real-time demonstration
Student persistence prediction
-
Examining Data
-
Clustering using TwoStep
-
Building Models for Persistence in Streams
A node is being executed (notice the red arrows denoting the flow of data).
-
Output (Boosting/Reduction)
Because there are always fewer graduates than students overall, Clementine can balance the dataset first.
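One way to picture balancing by reduction is random downsampling of the larger class. This is a sketch under stated assumptions, not Clementine's own procedure: the function name, the record layout, and the 8-to-2 class ratio are all hypothetical. Boosting would instead repeat records of the rarer class.

```python
import random
from collections import Counter


def balance_by_reduction(records, label_key, seed=0):
    """Randomly downsample every class to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 2 graduates and 8 non-graduates
students = [{"id": i, "graduated": i < 2} for i in range(10)]
balanced = balance_by_reduction(students, "graduated")
print(Counter(rec["graduated"] for rec in balanced))
```

Without balancing, a model can score well simply by predicting the majority outcome for everyone.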
-
Seeing the Work of Neural Thinking
Graphic display showing an ANN learning the data.
-
Results of Neural Node
These are the outputs of the neural network: overall accuracy and significance of features (left), and predicted number of policies using fresh data vs. known data (above).
-
Examining C5.0
The control panel of the C5.0 node (Expert).
-
Results of C5.0 Node
View the prediction by individual records (PNXT vs. $C-PNXT).
View the overall prediction accuracy.
-
Comparing C&RT and C5.0
Use the Analysis node to examine the difference in accuracy for C&RT and C5.0. See next slide.
-
Which One is Better: C&RT or C5.0?
C5.0 has an accuracy rate of 66.3% and C&RT 63.7%. They agree 72% of the time.
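The two figures compared on this slide, per-model accuracy and between-model agreement, can be sketched as follows. The outcome and prediction lists are made-up toy data, not the presentation's results, and the function names are hypothetical.

```python
def accuracy(predicted, actual):
    """Share of records where the prediction matches the known outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def agreement(pred_a, pred_b):
    """Share of records where two models make the same prediction."""
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Hypothetical outcomes and predictions (1 = persisted, 0 = did not)
actual = [1, 0, 1, 1, 0, 1, 0, 0]
c50 = [1, 0, 1, 0, 1, 1, 1, 0]
cart = [1, 0, 1, 0, 1, 0, 1, 0]

print(accuracy(c50, actual), accuracy(cart, actual), agreement(c50, cart))
# prints "0.625 0.5 0.875"
```

Agreement can exceed both accuracy rates, as it does on the slide, because two models can be wrong on the same records.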
-
Scoring New Data
Moment of truth. The most powerful feature of data mining is using learned rules to predict (score) fresh data for business purposes. Shown here is the switch to a fresh dataset that Clementine has not seen before.
-
Using Models to Score New Data
Test Set Results Scored Results
Decision:
-
2 TYPES OF DATA MINING
SUPERVISED
Purpose: For classification and estimation
Models: C5.0, C&RT, ANN, etc.
UNSUPERVISED
Purpose: For clustering and association
Models: Kohonen, K-means, TwoStep, GRI, etc.
But pre-classified data means data without target.
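To show what an unsupervised model does with no target field, here is a minimal k-means sketch. The student attributes, the starting centroids, and the fixed iteration count are assumptions for the example, and this is plain textbook k-means, not Clementine's implementation.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda j: math.dist(p, centroids[j])
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical students as (GPA, units per term); no outcome label is supplied
students = [(2.0, 6.0), (2.2, 5.0), (1.8, 7.0), (3.6, 14.0), (3.8, 15.0), (3.4, 13.0)]
centroids, clusters = kmeans(students, centroids=[students[0], students[3]])
print(centroids)
```

A supervised model such as C5.0 would instead need a known outcome column to learn from. In practice the attributes would also be rescaled first so that no single one dominates the distance.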
-
Data Mining Tasks
Classification, Estimation: Predicting onto new data by using rules or patterns and behaviors
Segmentation: Understanding the groupings, trends, and characteristics of your customer
Description: Visualizing the Euclidean spatial relationships, trends, and patterns of your data
-
Statistics! But I Use OLAP for All My Work!
Statistics knowledge is very useful.
Data mining cannot replace statistics in a number of areas.
There are overlapping areas.
OLAP is the middle tier.
We must go beyond counting heads!
-
How Do Data Mining, Statistics and OLAP Compare
Data Mining | Statistics | OLAP
Neural Net | Regression, Structural Equation |
C5.0, C&RT | PCA, Factor Analysis |
Kohonen, K-means, TwoStep | Cluster Analysis, Probability Density | Cubes
Spatial Visualization | 2-3 dimension charts | 2-3 dimension charts
Machine Learning / Artificial Intelligence | Mathematics | ETL, SQL
Unsupervised | Descriptive | Temporal Trend
-
Evaluating Data Mining Software
Company stability and customer feedback
User Interface
Scalability (up and down)
Server/Client (real-time, KDD)
Modeling capacities
Learning Curve
Join a listserv, such as CLUG
Cost
-
Data Mining Plan at Your College
1. Determine business needs
2. Determine technology infrastructure and management support
3. Determine data source
4. Identify mining areas
5. Invite an expert to jump start
6. Pilot test mining results
7. CRISP-DM and real-time data mining, Knowledge Discovery in Databases (KDD)
-
Data Mining Skills Set
Driving Forces of DM:
Computer Storage
Algorithms
Knowledge Management
Translate to Skill-set:
Data domain expert
Familiar w/ models
System level view of decision making
-
Contact
Jing Luan, Ph.D., ITMC
Director, Planning and Research
Cabrillo College
Email: [email protected]