data mining in hed ucsf

  • 8/6/2019 Data Mining in HEd UCSF

    1/35

    Data Mining System in Higher Education & Persistence Clustering and Prediction

    Jing Luan, Ph.D., ITMC

    Director, Planning and Research, Cabrillo College

    October, 2001

  • 8/6/2019 Data Mining in HEd UCSF

    Jing Luan, UCSF/SPSS, 2001 2

    In 45 minutes

    Tiered Knowledge Management Model (TKMM)

    Data Mining Overview: concept and use

    Demonstration of Clementine

    Data Mining plan at your college

    Data mining, statistics and OLAP

    Q&A

  • 8/6/2019 Data Mining in HEd UCSF

    Tiered Knowledge Management Model (TKMM)

    [Figure: TKMM diagram showing tiers one, two and three for both Tacit Knowledge and Explicit Knowledge. Labels: Portals, CRM, Data Warehouses, Enterprise Resource Planning (ERP), Middleware, OLAP, Data Mining, Collaborative Working Environment (CWE), Knowledge Base, Knowledge Workers, Knowledge Mapping.]

  • 8/6/2019 Data Mining in HEd UCSF

    TKMM: Explicit Knowledge Management

    TIER ONE

    Data Engines

    SQL Server, Oracle, Informix, Sybase, UniData, DB2

    Enterprise Resource Planning (ERP)

    PeopleSoft, Datatel, SAP, Oracle, Banner

    TIER TWO

    Querying:

    BrioQuery, Business Objects, PowerPlay, Access, FoxPro

    Online Data Processing:

    ASP, JSP, iHTML, XML

    TIER THREE

    Mining:

    Clementine, Enterprise Miner, Statistica, Mineset, Darwin, SpotFire

    Classical statistics

    SPSS, SAS, BMDP, SYSTAT

    Topography of Tiered Knowledge Management Model (TKMM) for explicit knowledge

    Many data mining projects fail due to lack of understanding of these three tiers, particularly in data (feature) extraction in Tier One.
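
    The note on Tier One feature extraction can be made concrete with a small example. The sketch below is only an illustration: the `enrollment` table, its columns, and the `extract_features` helper are hypothetical and do not come from the slides. It collapses per-course rows into one feature row per student (number of terms, total units, GPA), which is the shape most mining tools expect.

```python
import sqlite3

def extract_features(rows):
    """Aggregate per-course enrollment rows into one feature row per student.

    rows: iterable of (student_id, term, units, grade_points) tuples, where
    grade_points = units * grade value. The schema is hypothetical; a real
    Tier One source would be a table in the ERP or data warehouse.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE enrollment "
        "(student_id TEXT, term TEXT, units REAL, grade_points REAL)"
    )
    con.executemany("INSERT INTO enrollment VALUES (?, ?, ?, ?)", rows)
    features = con.execute(
        """
        SELECT student_id,
               COUNT(DISTINCT term)           AS n_terms,
               SUM(units)                     AS total_units,
               SUM(grade_points) / SUM(units) AS gpa
        FROM enrollment
        GROUP BY student_id
        ORDER BY student_id
        """
    ).fetchall()
    con.close()
    return features

# Made-up sample rows for two students.
sample = [
    ("s1", "2000FA", 3.0, 12.0),
    ("s1", "2001SP", 4.0, 12.0),
    ("s2", "2000FA", 3.0, 6.0),
]
features = extract_features(sample)
for row in features:
    print(row)
```

    Doing the aggregation in SQL keeps the extraction step close to the data engine, which is where the slide places it.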

  • 8/6/2019 Data Mining in HEd UCSF

    Guiding Principles

    LRM (Learner Relationship Management)

    Student Life Cycle

    Student Clustering, student types

    Data source and quality

    CRISP-DM (all about a system)

    The One-Percent Doctrine

  • 8/6/2019 Data Mining in HEd UCSF

    Data Mining in Higher Ed

    Alumni

    Institutional Effectiveness

    Marketing

    Enrollment Management

  • 8/6/2019 Data Mining in HEd UCSF

    Data Mining in Higher Ed: institutional effectiveness

    What do we know about our students?

    What factors contribute to learning?

    Who is likely to fail or drop out?

    What courses provide high FTES and use space better?

    What are the course-taking patterns?

  • 8/6/2019 Data Mining in HEd UCSF

    Data Mining in Higher Ed: enrollment management

    Which groups prefer what services?

    Which student is likely to drop out?

    Where do our students come from?

    Who is likely to return?

  • 8/6/2019 Data Mining in HEd UCSF

    Data Mining in Higher Ed: marketing

    Who is likely to respond to our new marketing strategy?

    What factor garners the highest response?

    Which type of marketing works better?

  • 8/6/2019 Data Mining in HEd UCSF

    Data Mining in Higher Ed: alumni

    What different types of alumni are there?

    Who is likely to pledge for which amount and when?

  • 8/6/2019 Data Mining in HEd UCSF

    Lift Chart: Gain Chart

    [Figure: lift (gain) chart for a hypothetical database marketing campaign. Labels: Lift, Savings ($), quota, 0, 25%, 35%, 40th percentile, 70th percentile.]

    If every percentage point = $2,500, savings = (70% * $2,500) - (40% * $2,500) = $175,000 - $100,000 = $75,000
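
    The savings figure on this slide is simple arithmetic, sketched below. The sketch assumes the chart is read as reaching the quota at the 40th percentile with a model instead of the 70th percentile without one, and that each percentage point of the population contacted costs $2,500. The function and parameter names are illustrative only.

```python
def campaign_savings(baseline_percentile, model_percentile, dollars_per_point):
    """Savings from reaching the same quota after contacting fewer people.

    Percentiles are given in percentage points (70 means the 70th percentile).
    """
    baseline_cost = baseline_percentile * dollars_per_point  # 70 * 2,500
    model_cost = model_percentile * dollars_per_point        # 40 * 2,500
    return baseline_cost - model_cost

savings = campaign_savings(70, 40, 2500)
print(savings)  # 75000
```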

  • 8/6/2019 Data Mining in HEd UCSF

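    Artificial Neural Networks (ANN)

    Multi-layer perceptron (MLP): feed forward, back propagation

    [Figure: network diagram. Inputs: x1 # of Terms, x2 GPA, x3 Demographics, x4 Courses, x5 Fin Aid, ... xjn. Weights: w1 ... w5n. Outputs: o1 Persist, o2 Not-persist.]

    $o_j = f\left(\sum_{i=1}^{n} w_{ji}\, o_i\right)$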
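
    In a multi-layer perceptron, each unit applies an activation function f to the weighted sum of the outputs feeding into it. The sketch below shows only that feed-forward step, under several assumptions not stated on the slide: a sigmoid activation, no bias terms, one hidden layer, and made-up weights and input values. Back propagation (the training step) is not shown.

```python
import math

def sigmoid(z):
    """Assumed activation function f; the slide does not name one."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(inputs, weights):
    """Compute o_j = f(sum_i w_ji * o_i) for every unit j in a layer.

    weights[j][i] is the weight from input i to unit j.
    """
    return [
        sigmoid(sum(w * o for w, o in zip(unit_weights, inputs)))
        for unit_weights in weights
    ]

def forward(x, hidden_weights, output_weights):
    """Feed the inputs through one hidden layer and then the output layer."""
    hidden = layer_output(x, hidden_weights)
    return layer_output(hidden, output_weights)

# Five hypothetical, already-scaled inputs standing in for
# # of terms, GPA, demographics, courses, and financial aid.
x = [0.4, 0.8, 1.0, 0.5, 0.0]
hidden_weights = [
    [0.2, 0.5, -0.1, 0.3, 0.1],
    [-0.4, 0.1, 0.2, -0.2, 0.6],
]
output_weights = [
    [1.0, -1.0],   # o1: persist
    [-1.0, 1.0],   # o2: not-persist
]
outputs = forward(x, hidden_weights, output_weights)
print(outputs)
```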

  • 8/6/2019 Data Mining in HEd UCSF

    Decision Trees Rule Induction

    Rule 1:

    If Income ≥ $55,000 and # of Children = 3, then multiple policies

    Rule 2:

    If Income < $55,000, and single and Age < 30, then single policy

    Information theorem:

    $H(N) = -\sum_{i=1}^{n} P(n_i) \log_2 P(n_i)$
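
    Decision tree induction picks splits that reduce entropy, the information measure named on this slide. The sketch below computes entropy over a list of class labels and the information gain of a candidate split. It is a generic illustration rather than the C5.0 or C&RT implementation, and the labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, groups):
    """Entropy of the parent minus the size-weighted entropy of the groups.

    groups: non-empty lists of labels that together partition the parent.
    """
    weighted = sum(len(g) / len(parent) * entropy(g) for g in groups)
    return entropy(parent) - weighted

labels = ["persist", "persist", "drop", "drop"]
print(entropy(labels))  # 1.0
print(information_gain(labels, [["persist", "persist"], ["drop", "drop"]]))  # 1.0
```

    A 50/50 mix carries one bit of uncertainty, and a split that separates the classes perfectly recovers all of it.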

  • 8/6/2019 Data Mining in HEd UCSF

    The Use of Clementine

    Real-time demonstration: student persistence prediction

  • 8/6/2019 Data Mining in HEd UCSF

    Examining Data

  • 8/6/2019 Data Mining in HEd UCSF

    Clustering using TwoStep

  • 8/6/2019 Data Mining in HEd UCSF

    Building Models for Persistence in Streams

    A node is being executed (notice the red arrows denoting the flow of data).

  • 8/6/2019 Data Mining in HEd UCSF

    Output (Boosting/Reduction)

    Because graduates are always a minority of all students, Clementine can balance the dataset first.
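
    Balancing by boosting or reduction can be sketched as random oversampling of the smaller class or random undersampling of the larger one. The code below is a generic illustration, not Clementine's own balancing logic. The `balance` function, its parameters, and the sample records are all hypothetical.

```python
import random

def balance(records, label_of, mode="boost", seed=0):
    """Equalize class counts.

    mode="boost":  duplicate random records of smaller classes up to the largest.
    mode="reduce": randomly drop records of larger classes down to the smallest.
    """
    if mode not in ("boost", "reduce"):
        raise ValueError("mode must be 'boost' or 'reduce'")
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    sizes = [len(group) for group in by_class.values()]
    target = max(sizes) if mode == "boost" else min(sizes)
    balanced = []
    for group in by_class.values():
        if len(group) >= target:
            balanced.extend(rng.sample(group, target))
        else:
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Made-up data: 2 graduates and 8 non-graduates.
records = [("grad", i) for i in range(2)] + [("non-grad", i) for i in range(8)]
boosted = balance(records, lambda r: r[0], mode="boost")
reduced = balance(records, lambda r: r[0], mode="reduce")
print(len(boosted), len(reduced))  # 16 4
```

    Balancing should be applied to the training data only, so that accuracy measured on test data still reflects the real class mix.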

  • 8/6/2019 Data Mining in HEd UCSF

    Seeing the Work of Neural Thinking

    Graphic display showing an ANN learning the data.

  • 8/6/2019 Data Mining in HEd UCSF

    Results of Neural Node

    These are the outputs of the Neural Network node: overall accuracy and significance of features (left), and predicted number of policies using fresh data vs. known data (above).

  • 8/6/2019 Data Mining in HEd UCSF

    Examining C5.0

    The control panel of the C5.0 node (Expert).

  • 8/6/2019 Data Mining in HEd UCSF

    Results of C5.0 Node

    View the prediction by individual records (PNXT vs. $C-PNXT).

    View the overall prediction accuracy.

  • 8/6/2019 Data Mining in HEd UCSF

    Comparing C&RT and C5.0

    Use the Analysis node to examine the difference in accuracy for C&RT and C5.0. See next slide.

  • 8/6/2019 Data Mining in HEd UCSF

    Which One is Better: C&RT or C5.0

    C5.0 has an accuracy rate of 66.3% and C&RT 63.7%. They agree 72% of the time.
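
    Accuracy and agreement are two different comparisons: accuracy checks each model against the known outcome, while agreement checks the two models against each other. A minimal sketch with made-up predictions (not the data behind the percentages on this slide):

```python
def accuracy(actual, predicted):
    """Share of records where the prediction matches the known outcome."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def agreement(pred_a, pred_b):
    """Share of records where two models make the same prediction."""
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)

# Toy values: 1 = persist, 0 = not-persist.
actual    = [1, 1, 0, 0, 1]
c50_pred  = [1, 0, 0, 0, 1]
cart_pred = [1, 1, 1, 0, 0]

print(accuracy(actual, c50_pred))      # 0.8
print(accuracy(actual, cart_pred))     # 0.6
print(agreement(c50_pred, cart_pred))  # 0.4
```

    Two models can have similar accuracy and still disagree on many individual records, which is why the slide reports both figures.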

  • 8/6/2019 Data Mining in HEd UCSF

    Scoring New Data

    Moment of truth. The most powerful feature of data mining is to use learned rules to predict (score) using fresh data for business purposes. Shown here is the change of dataset to a fresh data set unseen by Clementine before now.
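
    Scoring means applying rules learned earlier to records the model has never seen. The sketch below uses two hypothetical rules and made-up student records. Neither the rules nor the field names come from the presentation; in practice the rules would be exported from a trained tree model.

```python
def score(record, rules, default="unknown"):
    """Return the prediction of the first rule whose condition matches."""
    for condition, prediction in rules:
        if condition(record):
            return prediction
    return default

# Hypothetical rules standing in for ones a decision tree might have learned.
rules = [
    (lambda r: r["gpa"] >= 2.5 and r["n_terms"] >= 2, "persist"),
    (lambda r: r["gpa"] < 2.0, "not-persist"),
]

# Fresh records with no known outcome.
fresh = [
    {"id": "s1", "gpa": 3.1, "n_terms": 3},
    {"id": "s2", "gpa": 1.7, "n_terms": 1},
    {"id": "s3", "gpa": 2.2, "n_terms": 4},
]
scored = [(r["id"], score(r, rules)) for r in fresh]
print(scored)
```

    Records that match no rule fall through to the default, which makes gaps in rule coverage visible instead of silently guessing.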

  • 8/6/2019 Data Mining in HEd UCSF

    Using Models to Score New Data

    [Figure panels: Test Set Results; Scored Results]

    Decision:

  • 8/6/2019 Data Mining in HEd UCSF

    2 TYPES OF DATA MINING

    SUPERVISED

    Purpose: For classification and estimation

    Models: C5.0, C&RT, ANN, etc.

    UNSUPERVISED

    Purpose: For clustering and association

    Models: Kohonen, K-means, TwoStep, GRI, etc.

    But pre-classified data means data without target.
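
    To illustrate the unsupervised side, here is a minimal K-means sketch: it groups records using only their features, with no target column. It is a bare-bones version with a naive deterministic start, not the Clementine K-means node, and the points are made-up two-feature records.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-means on tuples of numbers using Euclidean distance.

    Starts from the first k points as centroids (a naive, deterministic choice).
    Returns (centroids, assignment), where assignment[i] is the cluster of points[i].
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignment = kmeans(points, k=2)
print(centroids)   # [(0.0, 0.5), (10.0, 10.5)]
print(assignment)  # [0, 0, 1, 1]
```

    Note that no label is passed in anywhere: the clusters come from distances between records alone.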

  • 8/6/2019 Data Mining in HEd UCSF

    Data Mining Tasks

    Predicting onto new data by using rules or patterns and behaviors: Classification, Estimation

    Understanding the groupings, trends, and characteristics of your customer: Segmentation

    Visualizing the Euclidean spatial relationships, trends, and patterns of your data: Description

  • 8/6/2019 Data Mining in HEd UCSF

    Statistics! But I Use OLAP For All My Work!

    Statistics knowledge is very useful.

    Data mining cannot replace statistics in a number of areas.

    There are overlapping areas.

    OLAP is the middle tier.

    We must go beyond counting heads!

  • 8/6/2019 Data Mining in HEd UCSF

    How Do Data Mining, Statistics and OLAP Compare

    Data Mining                              | Statistics                            | OLAP
    Neural Net                               | Regression, Structural Equation       |
    C5.0, C&RT                               | PCA, Factor Analysis                  |
    Kohonen, K-means, TwoStep                | Cluster Analysis, Probability Density | Cubes
    Spatial Visualization                    | 2-3 dimension charts                  | 2-3 dimension charts
    Machine Learning/Artificial Intelligence | Mathematics                           | ETL, SQL
    Unsupervised                             | Descriptive                           | Temporal Trend

  • 8/6/2019 Data Mining in HEd UCSF

    Evaluating Data Mining Software

    Company stability and customer feedback

    User Interface

    Scalability (up and down)

    Server/Client (real-time, KDD)

    Modeling capacities

    Learning Curve

    Join a listserv, such as CLUG

    Cost

  • 8/6/2019 Data Mining in HEd UCSF

    Data Mining Plan at Your College

    1. Determine business needs

    2. Determine technology infrastructure and management support

    3. Determine data source

    4. Identify mining areas

    5. Invite an expert to jump start

    6. Pilot test mining results

    7. CRISP-DM and Real-time data mining, Knowledge Discovery in Databases (KDD)

  • 8/6/2019 Data Mining in HEd UCSF

    Data Mining Skills Set

    Driving Forces of DM:

    Computer Storage

    Algorithms

    Knowledge Management

    Translate to Skill-set:

    Data domain expert

    Familiar w/ models

    System level view of decision making

  • 8/6/2019 Data Mining in HEd UCSF

    Contact

    Jing Luan, Ph.D., ITMC

    Director, Planning and Research

    Cabrillo College

    Email: [email protected]