data mining in hed ucsf


1/35

Data Mining System in Higher Education & Persistence Clustering and Prediction

Jing Luan, Ph.D., ITMC

Director, Planning and Research, Cabrillo College

October 2001


2/35

Jing Luan, UCSF/SPSS, 2001 2

In 45 minutes

Tiered Knowledge Management Model (TKMM)

Data Mining Overview: concept and use

Demonstration of Clementine

Data Mining plan at your college

Data mining, statistics and OLAP

Q&A


3/35


Tiered Knowledge Management Model (TKMM)

[Diagram: Tacit Knowledge and Explicit Knowledge, organized into tiers one, two, and three. Labeled components: Portals, CRM, Data Warehouses, Enterprise Resource Planning (ERP), Middleware, OLAP, Data Mining, Collaborative Working Environment (CWE), Knowledge Base, Knowledge Workers, Knowledge Mapping.]


4/35


TKMM: Explicit Knowledge Management

TIER ONE

Data Engines: SQL Server, Oracle, Informix, Sybase, UniData, DB2

Enterprise Resource Planning (ERP): PeopleSoft, Datatel, SAP, Oracle, Banner

TIER TWO

Querying: BrioQuery, Business Objects, PowerPlay, Access, FoxPro

Online Data Processing: ASP, JSP, iHTML, XML

TIER THREE

Mining: Clementine, Enterprise Miner, Statistica, MineSet, Darwin, Spotfire

Classical statistics: SPSS, SAS, BMDP, SYSTAT

Topography of Tiered Knowledge Management Model (TKMM) for explicit knowledge

Many data mining projects fail due to lack of understanding of these three tiers, particularly in data (feature) extraction in Tier One.


5/35


Guiding Principles

LRM (Learner Relationship Management)

Student Life Cycle

Student Clustering, student types

Data source and quality

CRISP-DM (all about a system)

The One-Percent Doctrine


6/35


Data Mining in Higher Ed

Alumni

Institutional Effectiveness

Marketing

Enrollment Management


7/35


Data Mining in Higher Ed: institutional effectiveness

What do we know about our students?

What factors contribute to learning?

Who is likely to fail, drop out?

What courses provide high FTES, use space better?

What are the course-taking patterns?


8/35


Data Mining in Higher Ed: enrollment management

Which groups prefer what services?

Which student is likely to drop out?

Where do our students come from?

Who is likely to return?


9/35


Data Mining in Higher Ed: marketing

Who is likely to respond to our new marketing strategy?

What factor garners the highest response?

Which type of marketing works better?


10/35


Data Mining in Higher Ed: alumni

What different types of alumni are there?

Who is likely to pledge for which amount and when?


11/35


Lift Chart: Gain Chart

[Chart: hypothetical database marketing campaign. Labels: Lift, Savings ($), quota, 40th percentile, 70th percentile, 35%, 25%.]

If every percentage point = $2,500, savings = (70 points * $2,500) - (40 points * $2,500) = $175,000 - $100,000 = $75,000

Hypothetical database marketing campaign
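The savings arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the function and parameter names are illustrative, and reading the 70th and 40th percentiles as "share of the database contacted without and with a model" is an assumption, not something the slide states.

```python
def campaign_savings(points_without_model, points_with_model, dollars_per_point):
    """Dollar difference between two contact depths, in percentage points.

    Assumption: each percentage point of the database contacted costs
    the same fixed amount (dollars_per_point).
    """
    cost_without = points_without_model * dollars_per_point
    cost_with = points_with_model * dollars_per_point
    return cost_without - cost_with


# Figures taken from the slide: 70th vs. 40th percentile, $2,500 per point.
savings = campaign_savings(70, 40, 2_500)
print(f"${savings:,}")
```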


12/35


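Artificial Neural Networks (ANN)

Multi-layer perceptron (MLP): feed forward, back propagation

[Diagram: network with inputs x1 (# of Terms), x2 (GPA), x3 (Demographics), x4 (Courses), x5 (Fin Aid), ..., xjn; weights w1, ..., w5n; outputs o1 (Persist) and o2 (Not-persist).]

$$o_j = f\left(\sum_{i=1}^{n} w_{ji}\, o_i\right)$$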
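As a rough illustration of the feed-forward step, where each unit's output is an activation function applied to the weighted sum of the outputs feeding into it, here is a minimal pure-Python sketch. The sigmoid activation, the three hidden units, and all weights and input values are assumptions chosen for the example; back propagation (the training step) is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(weights, inputs, f=sigmoid):
    """For each unit j, return f(sum over i of w_ji * o_i)."""
    return [f(sum(w * o for w, o in zip(row, inputs))) for row in weights]


# Hypothetical, already-scaled inputs: terms, GPA, demographics, courses, fin aid.
x = [0.5, 0.8, 0.0, 0.3, 1.0]

# Hypothetical weights: 3 hidden units, then 2 output units.
hidden_w = [
    [0.2, -0.4, 0.1, 0.3, 0.5],
    [-0.3, 0.6, 0.2, -0.1, 0.4],
    [0.1, 0.1, -0.2, 0.7, -0.5],
]
output_w = [
    [0.5, -0.6, 0.3],   # o1: persist
    [-0.5, 0.6, -0.3],  # o2: not-persist
]

hidden = layer_output(hidden_w, x)
outputs = layer_output(output_w, hidden)
print(outputs)
```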


13/35


Decision Trees Rule Induction

Rule 1: If Income ≥ $55,000 and # of Children = 3, then multiple policies

Rule 2: If Income < $55,000, and single and Age < 30, then single policy

Information theorem:

$$H(N) = -\sum_{i=1}^{n} P(n_i)\,\log_2 P(n_i)$$
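The information measure that rule induction relies on is the Shannon entropy of the class labels: the negative sum, over classes, of each class proportion times its base-2 logarithm. A minimal sketch follows; the function name and the example labels are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# An evenly split node is maximally uncertain; a pure node has zero entropy.
mixed = ["multiple", "single", "multiple", "single"]
pure = ["single", "single", "single"]
print(entropy(mixed), entropy(pure))
```

A tree learner prefers the split that reduces this quantity the most across the resulting branches.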


14/35


The Use of Clementine

Real-time demonstration

Student persistence prediction


15/35


Examining Data


16/35


Clustering using TwoStep


17/35


Building Models for Persistence in Streams

A node is being executed (notice the red arrows denoting the flow of data).


18/35


Output (Boosting/Reduction)

Because there are always fewer graduates than students overall, Clementine can balance the dataset first.
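The idea of balancing by boosting or reduction can be sketched in plain Python. This is an illustration of the general technique, not Clementine's own balancing logic: `balance`, the `mode` values, and the `grad` field are all hypothetical names. Reduction randomly samples every class down to the smallest class size; boosting duplicates records at random until every class matches the largest.

```python
import random
from collections import defaultdict


def balance(records, target, mode="reduce", seed=0):
    """Return a class-balanced copy of records (a list of dicts)."""
    if mode not in ("reduce", "boost"):
        raise ValueError("mode must be 'reduce' or 'boost'")
    if not records:
        return []
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[target]].append(record)
    sizes = [len(g) for g in groups.values()]
    size = min(sizes) if mode == "reduce" else max(sizes)
    balanced = []
    for group in groups.values():
        if len(group) >= size:
            # Sample without replacement (keeps every record when sizes match).
            balanced.extend(rng.sample(group, size))
        else:
            # Boost: keep all records and add random duplicates.
            balanced.extend(group + rng.choices(group, k=size - len(group)))
    return balanced


# Hypothetical data: 8 non-graduates and 2 graduates.
students = [{"grad": 0}] * 8 + [{"grad": 1}] * 2
print(len(balance(students, "grad", "reduce")), len(balance(students, "grad", "boost")))
```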


19/35


Seeing the Work of Neural Thinking

Graphic display showing an ANN learning the data.


20/35


Results of Neural Node

These are the outputs of the Neural Network node: overall accuracy and significance of features (left), and predicted number of policies using fresh data vs. known data (above).


21/35


Examining C5.0

The control panel of the C5.0 node (Expert).


22/35


Results of C5.0 Node

View the prediction by individual records (PNXT vs. $C-PNXT).

View the overall prediction accuracy.


23/35


Comparing C&RT and C5.0

Use the Analysis node to examine the difference in accuracy for C&RT and C5.0. See next slide.


24/35


Which One is Better: C&RT & C5.0

C5.0 has an accuracy rate of 66.3% and C&RT 63.7%. They agree 72% of the time.
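Accuracy and agreement are two different measurements: accuracy compares each model's predictions with the known outcome, while agreement compares the two models with each other. The sketch below shows both on a small made-up dataset; the labels and predictions are toy values, not the data behind the percentages on this slide.

```python
def accuracy(predicted, actual):
    """Share of records where the prediction matches the known outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def agreement(pred_a, pred_b):
    """Share of records where two models make the same prediction."""
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Toy data: 1 = persisted, 0 = did not persist.
actual = [1, 1, 0, 0, 1, 0, 1, 0]
c50_pred = [1, 1, 0, 1, 1, 0, 0, 0]
cart_pred = [1, 0, 0, 1, 1, 0, 1, 1]

print(accuracy(c50_pred, actual))      # C5.0-style model
print(accuracy(cart_pred, actual))     # C&RT-style model
print(agreement(c50_pred, cart_pred))  # how often the two agree
```

Two models can agree with each other more often than either is correct, because they may share the same mistakes.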


25/35


Scoring New Data

Moment of truth. The most powerful feature of data mining is the use of learned rules to predict (score) fresh data for business purposes. Shown here is the switch to a fresh dataset that Clementine has not seen before.
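Scoring amounts to applying an already-learned model to records that carry no known outcome. The sketch below is hypothetical throughout: the rule, its thresholds, and the `gpa` and `terms` fields are invented for illustration and are not rules learned in the demonstration. The only detail borrowed from the slides is the convention of writing the prediction for target PNXT into a field named `$C-PNXT`.

```python
def learned_rule(record):
    """HYPOTHETICAL rule standing in for a trained model (1 = persists)."""
    if record["gpa"] >= 2.0 and record["terms"] >= 2:
        return 1
    return 0


def score(records, model, target="PNXT"):
    """Return copies of the records with a prediction field added."""
    field = "$C-" + target
    return [{**record, field: model(record)} for record in records]


# Fresh records: no PNXT value is known yet.
fresh = [
    {"id": 1, "gpa": 3.1, "terms": 4},
    {"id": 2, "gpa": 1.7, "terms": 3},
    {"id": 3, "gpa": 2.8, "terms": 1},
]

for row in score(fresh, learned_rule):
    print(row["id"], row["$C-PNXT"])
```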


26/35


Using Models to Score New Data

[Panels: Test Set Results; Scored Results]

Decision:


27/35


2 TYPES OF DATA MINING

SUPERVISED

Purpose: For classification and estimation

Models: C5.0, C&RT, ANN, etc.

UNSUPERVISED

Purpose: For clustering and association

Models: Kohonen, K-means, TwoStep, GRI, etc.

But pre-classified data means data without target.
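To make the unsupervised case concrete, here is a minimal K-means sketch in pure Python: it groups records using only their input values, with no target field. The choice of the first k points as starting centroids, the fixed iteration count, and the sample points are assumptions for illustration; production tools use more careful initialization and stopping rules.

```python
import math


def kmeans(points, k, iterations=10):
    """Cluster points (tuples of numbers) into k groups by Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # assumed init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignments, centroids


# Hypothetical two-feature records forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

A supervised model such as C5.0 would instead need a known outcome for every training record.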


28/35


Data Mining Tasks

Classification, Estimation: Predicting onto new data by using rules or patterns and behaviors

Segmentation: Understanding the groupings, trends, and characteristics of your customer

Description: Visualizing the Euclidean spatial relationships, trends, and patterns of your data


29/35


Statistics! But I Use OLAP For All My Work!

Statistics knowledge is very useful.

Data mining cannot replace statistics in a number of areas.

There are overlapping areas.

OLAP is the middle tier.

We must go beyond counting heads!


30/35


How Do Data Mining, Statistics and OLAP Compare

Data Mining | Statistics | OLAP
Neural Net | Regression, Structural Equation |
C5.0, C&RT | PCA, Factor Analysis |
Kohonen, K-means, TwoStep | Cluster Analysis, Probability Density | Cubes
Spatial Visualization | 2-3 dimension charts | 2-3 dimension charts
Machine Learning/Artificial Intelligence | Mathematics | ETL, SQL
Unsupervised | Descriptive | Temporal Trend


31/35


Evaluating Data Mining Software

Company stability and customer feedback

User Interface

Scalability (up and down)

Server/Client (real-time, KDD)

Modeling capacities

Learning Curve

Join a listserv, such as CLUG

Cost


32/35


Data Mining Plan at Your College

1. Determine business needs

2. Determine technology infrastructure and management support

3. Determine data source

4. Identify mining areas

5. Invite an expert to jump start

6. Pilot test mining results

7. CRISP-DM and Real-time data mining, Knowledge Discovery in Databases (KDD)


33/35


Data Mining Skills Set

Driving Forces of DM:

Computer Storage

Algorithms

Knowledge Management

Translate to Skill-set:

Data domain expert

Familiar w/ models

System level view of decision making


34/35


35/35


Contact

Jing Luan, Ph.D., ITMC

Director, Planning and Research

Cabrillo College

Email: [email protected]
