
A Data Mining Course for Computer Science: Primary Sources and Implementations

Dave Musicant

Saturday, March 4, 2006

Overview

What is data mining?
Why offer a course in data mining?
Why focus on research papers in an undergraduate class?
What topics do I cover?
What research papers do I use in class?
What assignments do I use?
Does it work?

What is data mining?

“The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data” (Fayyad et al)
Data Mining and Machine Learning are two sides of the same coin
  Data mining focuses more on larger datasets
  Machine learning focuses more on connections with artificial intelligence
  ... but there is much overlap in the two areas.
My course is titled “Machine Learning and Data Mining”
  boosts student enthusiasm

Why offer a course in data mining?

Interesting applied area of CS that uses theoretical techniques
Reinforces and introduces data structures and algorithms
  heaps, R-trees, graphs
Privacy and ethics
Personal ownership in assignments
  Students choose datasets in areas that interest them
New field, yet accessible
  Can be done with only Data Structures as a prereq
It’s my research area

Why research papers? Can it be done?

One approach to the course is to use data mining software
  Lopez & Ludwig, University of Minnesota-Morris
I wanted students to implement data mining algorithms
Textbook support w/ computer science focus is limited
  (I use Margaret Dunham’s text as a side reference)
Primary sources provide a rich experience
With proper selection, papers are accessible to undergraduates
Papers must be supplemented in classroom
  e.g. specific topics in linear algebra, statistics
  directs classroom activity toward filling gaps and interpreting papers instead of parroting reading

Topics, Papers, Assignments

Each topic consists of one or more papers that are assigned to the students to read before class discussion.
Students post to Caucus (electronic message board):
  something they didn’t understand, or something they found interesting
  potential exam question
Assignment follows class discussion
Detailed references for all papers and datasets can be found in the paper

Topic 0: What is Data Mining?

Paper: J. Friedman. “Data Mining and Statistics: What’s the Connection?”
  Entertaining and controversial
  Pokes fun at flaws on all sides
  Helps to ensure buy-in from computer science students (they haven’t been tricked into taking a stats course)
Assignment: For the “census-income” dataset, determine:
  Number of records and features
  How many features are continuous, how many are nominal
  For continuous features: average, median, minimum, maximum, standard deviation
  2-dimensional scatter plots of two features at a time
  Interesting patterns
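The per-feature summary in the Topic 0 assignment could be sketched as below. This is a minimal illustration, not the course's solution: the three column names and four records are made up for the example (the real census-income file layout is not shown in the slides), a column is treated as continuous when every value is numeric, and the sample standard deviation is an assumed choice.

```python
import statistics

def summarize(rows, names):
    """Label each column continuous or nominal; summarize the continuous ones."""
    summary = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        if all(isinstance(v, (int, float)) for v in column):
            summary[name] = {
                "type": "continuous",
                "mean": statistics.mean(column),
                "median": statistics.median(column),
                "min": min(column),
                "max": max(column),
                "stdev": statistics.stdev(column),  # sample standard deviation
            }
        else:
            summary[name] = {"type": "nominal", "values": sorted(set(column))}
    return summary

# Hypothetical records, for illustration only
names = ["age", "hours_per_week", "education"]
rows = [
    (25, 40, "Bachelors"),
    (38, 50, "HS-grad"),
    (52, 40, "Masters"),
    (41, 30, "HS-grad"),
]
summary = summarize(rows, names)
print(len(rows), "records,", len(names), "features")
for name, info in summary.items():
    print(name, info)
```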

Topic 1: Classification and Regression

Example: First Trimester Screening

Use this training set to learn how to classify patients where diagnosis is not known:

The input data is often easily obtained, whereas the classification is not.

Training Set

             Input Data                            Classification
Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
1            5             20           118          Positive
2            3             15           130          Negative
3            7             10           52           Negative
4            2             30           100          Positive

Testing Set

Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
101          4             16           95           ?
102          9             22           125          ?
103          1             14           80           ?

Technique: Nearest Neighbor

Envision each example as a point in n-dimensional space

Classify test point same as nearest training point

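tissue (cm)   Chemical 1   Diagnosis
5             20           Positive
3             15           Negative
7             10           Negative
2             30           Positive

[Figure: scatter plot of the four training points (tissue (cm) on the axis running 0 to 8, Chemical 1 on the axis running 0 to 35) with an unlabeled test point marked “What am I?”]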
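The nearest neighbor rule can be sketched in a few lines using the four training points from the table. The query point is patient 101 from the testing set, restricted to the same two features; using unscaled Euclidean distance is an assumption of this sketch.

```python
import math

# Training points from the slide: (tissue in cm, Chemical 1) -> diagnosis
training = [
    ((5, 20), "Positive"),
    ((3, 15), "Negative"),
    ((7, 10), "Negative"),
    ((2, 30), "Positive"),
]

def nearest_neighbor(training, query):
    """Return the label of the training point closest to the query point."""
    _, label = min(training, key=lambda example: math.dist(example[0], query))
    return label

# Patient 101 from the testing set: tissue = 4 cm, Chemical 1 = 16
prediction = nearest_neighbor(training, (4, 16))
print(prediction)  # prints "Negative": the closest training point is (3, 15)
```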

Topic 1: Classification and Regression

Focus on scalable nearest neighbor algorithms
Paper: Roussopoulos et al, “Nearest Neighbor Queries”
  How to do NN efficiently when data doesn’t fit in core
  Requires R-trees (I cover in class)
Assignment:
  Code up the traditional k-nearest neighbor algorithm, apply to census-income data
  Experiment with different distance metrics (1-norm, 2-norm, cosine)
  Experiment with different values of k
  Produce plots showing training and test set accuracies
  Interpret results
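One way the k-nearest neighbor assignment might be structured is sketched below, with the three distance metrics as interchangeable functions. The toy points and labels are invented for the example, cosine distance is taken as one minus cosine similarity (and assumes no all-zero vectors), and ties in the vote are broken by whichever label was seen first.

```python
import math
from collections import Counter

def manhattan(a, b):
    """1-norm distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """2-norm distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train, query, k, distance):
    """Majority vote among the k training examples closest to query."""
    neighbors = sorted(train, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k, distance):
    """Fraction of test examples whose predicted label matches the true one."""
    correct = sum(1 for features, label in test
                  if knn_predict(train, features, k, distance) == label)
    return correct / len(test)

# Hypothetical toy data: two well-separated groups
train = [((1.0, 1.0), "a"), ((1.0, 2.0), "a"), ((2.0, 1.0), "a"),
         ((6.0, 6.0), "b"), ((6.0, 7.0), "b"), ((7.0, 6.0), "b")]
test = [((1.5, 1.5), "a"), ((6.5, 6.5), "b")]

for k in (1, 3):
    for metric in (manhattan, euclidean, cosine_distance):
        print(k, metric.__name__, accuracy(train, test, k, metric))
```

Collecting the accuracies from this loop for both the training and test sets gives the numbers the assignment asks students to plot.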

Topic 2: Clustering

Sometimes referred to as unsupervised learning
Goal: find clusters of similar data
Less accurate than supervised learning, but quite useful when no training set is available
Where are the clusters below? How many are there?

[Figure: two scatter plots, one with axes chemical 1 and tissue (cm), one with axes chemical 2 and tissue (cm)]

Topic 2: Clustering

Assignment: Find dataset of interest from UCI Repository
  iris plant, letter recognition, liver disorders, Pima Indians diabetes, Congressional voting records, wine recognition, zoo
  this dataset is used for most remaining assignments
  if dataset has a class label, discard it for this assignment
Implement basic clustering algorithm (k-means)
  Try varying number of clusters
  Try two different techniques for initializing clusters
  Report and interpret results found
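A basic k-means loop might look like the sketch below. The slides do not say which two initialization techniques students tried; picking k random data points and averaging a random partition are assumed here purely as examples, and the six data points are made up.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def init_random_points(points, k, rng):
    """Initialization 1 (assumed): use k distinct data points as centroids."""
    return rng.sample(points, k)

def init_random_partition(points, k, rng):
    """Initialization 2 (assumed): shuffle, deal into k groups, average each."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    return [mean_point(shuffled[i::k]) for i in range(k)]

def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids)

rng = random.Random(0)
for init in (init_random_points, init_random_partition):
    print(init.__name__, kmeans(points, init(points, 2, rng))[0])
```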

Topic 2: Clustering

Paper: Bradley et al, “Scaling Clustering Algorithms to Large Databases”
  Describes “Scalable K-means” algorithm
  Class discussion around “data mining desiderata”
Paper: Guha et al, “CURE: An Efficient Clustering Algorithm for Large Databases”
  Agglomerative clustering algorithm
  completely different approach
  Requires use of a heap (as I pose the assignment)
Assignment: Implement stripped-down version of CURE
  Run on dataset, interpret results
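To show where a heap enters agglomerative clustering, here is a sketch that repeatedly merges the two closest clusters, using a heap of candidate pairs. It is not CURE and not the course's stripped-down version: each cluster is summarized by a single centroid, whereas CURE keeps several representative points per cluster. The function name, the stale-entry handling, and the sample points are all assumptions made for the illustration.

```python
import heapq
import itertools
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def agglomerate(points, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = {i: [p] for i, p in enumerate(points)}  # live clusters by id
    next_id = len(points)
    # Heap of (distance, id_a, id_b) for every initial pair of singletons
    heap = [(math.dist(points[i], points[j]), i, j)
            for i, j in itertools.combinations(range(len(points)), 2)]
    heapq.heapify(heap)
    while len(clusters) > k and heap:
        _, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # stale entry: one side was already merged away
        merged = clusters.pop(a) + clusters.pop(b)
        merged_center = centroid(merged)
        # Push distances from the new cluster to every surviving cluster
        for other_id, members in clusters.items():
            d = math.dist(merged_center, centroid(members))
            heapq.heappush(heap, (d, other_id, next_id))
        clusters[next_id] = merged
        next_id += 1  # ids are never reused, so stale entries are detectable
    return list(clusters.values())

# Hypothetical data: a group of three points and a group of two
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
result = agglomerate(points, 2)
print(result)
```

Leaving stale pairs in the heap and skipping them on removal avoids having to delete from the middle of the heap, at the cost of some extra entries.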

Topic 3: Association Rules

“Supermarket basket analysis”
What items do people tend to buy together at the same time?
Paper: Agrawal et al, “Fast Algorithms for Mining Association Rules”
  presents classic Apriori algorithm (skim other portions of paper)
Assignment: Implement Apriori algorithm and run it on own dataset
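The level-wise idea behind Apriori (every subset of a frequent itemset must itself be frequent) can be sketched as follows. This sketch finds frequent itemsets only and leaves out rule generation; the baskets are invented, and expressing minimum support as an absolute count is an assumption.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        # Join step: combine frequent itemsets into candidates one item larger
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step: drop candidates that have an infrequent subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical market baskets
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```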

Topic 4: Web Mining

How does Google rank importance of web pages?
Every page has a PageRank
  PageRank of a page is determined by the PageRank of the pages that link to it
  manifests itself as an eigenvalue problem
Paper: Page et al, “The PageRank Citation Ranking: Bringing Order to the Web”
  describes basic version of Google PageRank algorithm
  cover eigenvalues in class
  exposure to linear algebra, numerical analysis
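The eigenvalue view of PageRank can be approximated by power iteration: start with equal ranks and repeatedly pass each page's rank along its outgoing links until the values settle. The sketch below stores links in a dictionary, which keeps the representation sparse. The damping factor of 0.85, the even spreading of rank from pages with no outgoing links, and the tiny four-page site are assumptions of this example rather than details from the slides.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iters=200):
    """Power iteration over a sparse link map {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iters):
        # Rank held by pages with no outgoing links is spread over all pages
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical site: page -> pages it links to (no duplicate links)
links = {
    "home": ["courses", "faculty"],
    "courses": ["home"],
    "faculty": ["home", "courses"],
    "news": ["home"],
}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```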

Topic 4: Web Mining

Paper: Chakrabarti et al, “Mining the Link Structure of the World Wide Web”
  describes HITS algorithm for ranking web pages
  Google isn’t the only way to do it
  uses Latent Semantic Analysis, which requires singular value decomposition (cover in class)
Assignment: Implement PageRank algorithm
  try it on archive of department website
  crawling for an assignment is dangerous
  sparse data representation
  hashing or other form of map for efficiency
  interpret results

[Figure: hubs and authorities]
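For comparison with PageRank, the hub and authority scores of HITS can be sketched as a mutual-reinforcement loop: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The fixed iteration count, the Euclidean normalization, and the page names below are assumptions for illustration, and the sketch skips the step of selecting a query-specific subgraph.

```python
import math

def hits(links, iters=50):
    """Return (hub, authority) scores for a link map {page: [targets]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority update: sum of hub scores of pages that link in
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for t in targets:
                auth[t] += hub[page]
        # Hub update: sum of authority scores of pages linked to
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link structure: three portal pages pointing at two papers
links = {
    "portal_a": ["paper_x", "paper_y"],
    "portal_b": ["paper_x", "paper_y"],
    "portal_c": ["paper_x"],
}
hub, auth = hits(links)
print("top authority:", max(auth, key=auth.get))
print("hub scores:", {p: round(s, 3) for p, s in sorted(hub.items())})
```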

Topic 5: Collaborative Filtering

a.k.a. Recommender Systems
“I like Pink Floyd, Dream Theater, and Evanescence. Who should I be listening to?”
Amazon.com, Yahoo! Launchcast
Paper: Breese et al, “Empirical Analysis of Predictive Algorithms for Collaborative Filtering”
  Algorithms are nearest neighbor-like in flavor
  Involve averaging numerical scores
  Need to normalize for individual biases
Students already working on final project, so no assignment
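The three bullets about the Breese et al. paper (nearest neighbor-like, averaging scores, normalizing for individual bias) can be illustrated with a small memory-based predictor: each rating is taken relative to that user's own mean, and other users' deviations are averaged with correlation-style weights. This is a sketch in the spirit of that family of methods, not a reproduction of the paper's exact formulas; the users, the 1 to 5 rating scale, and the extra band "Rush" are made up.

```python
import math

def user_mean(ratings, user):
    """Average of all scores a user has given (their personal baseline)."""
    scores = ratings[user].values()
    return sum(scores) / len(scores)

def similarity(ratings, a, b):
    """Correlation-style weight computed over items both users rated."""
    common = set(ratings[a]) & set(ratings[b])
    mean_a, mean_b = user_mean(ratings, a), user_mean(ratings, b)
    num = sum((ratings[a][i] - mean_a) * (ratings[b][i] - mean_b)
              for i in common)
    den_a = math.sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common))
    den_b = math.sqrt(sum((ratings[b][i] - mean_b) ** 2 for i in common))
    if den_a == 0 or den_b == 0:
        return 0.0
    return num / (den_a * den_b)

def predict(ratings, user, item):
    """User's mean plus the weighted average of others' deviations on item."""
    base = user_mean(ratings, user)
    weighted_sum = 0.0
    weight_total = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        w = similarity(ratings, user, other)
        weighted_sum += w * (ratings[other][item] - user_mean(ratings, other))
        weight_total += abs(w)
    if weight_total == 0:
        return base  # no usable neighbors: fall back to the user's mean
    return base + weighted_sum / weight_total

# Hypothetical ratings on a 1-5 scale
ratings = {
    "ann": {"Pink Floyd": 5, "Dream Theater": 4, "Evanescence": 1},
    "bob": {"Pink Floyd": 5, "Dream Theater": 5, "Evanescence": 2, "Rush": 5},
    "cat": {"Pink Floyd": 1, "Dream Theater": 2, "Evanescence": 5, "Rush": 1},
}
print(round(predict(ratings, "ann", "Rush"), 2))
```

Because cat's tastes run opposite to ann's, cat's low score for the unseen item pushes ann's prediction up rather than down, which is the effect of allowing negative weights.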

Topic 6: Ethical Issues in Data Mining

Privacy concerns
Good vs. evil uses of data mining
Video: Ramakrishnan et al, “Data Mining: Good, Bad, or Just a Tool?”
  Panel discussion from KDD 2004
Before watching video, students post to Caucus:
  how data mining could be exploited
  how this could be prevented (if possible)
After watching video
  followup commentary

Pictures from conference website at http://www.acm.org/sigs/sigkdd/kdd2004/

Topic 6: Ethical Issues in Data Mining

Student response to video was more engaged than I expected
More problems than solutions are raised in video
  Students were frustrated that solutions weren’t clear
Many students interested in issue of accountability
  If someone’s privacy is violated, who is responsible?
  “Who do I sue?”
Lively class discussion

Final Project

“Do almost anything you want regarding data mining, so long as I approve it”
Find a paper and implement the algorithm within
Find a dataset of interest and study it completely, using Weka and/or their own code from throughout the term
  Quantitative association rules
  Poker association rules
  Collaborative filtering (music, art)
Attack KDD Cup problems
  KDD Cup 2005: identify categories for web search queries
  tried this once: tended to be too big for them in the time that I had
  could perhaps be done with right level of support

Conclusions

Papers are most memorable part of course
  Students speak very positively about this in evaluations
  Significant prep time for me to fill in gaps
Caucus motivates reading papers
  Students find this a pain, but are thankful afterwards in evals
  Important to set deadline for posting a few hours before class so I have time to read
Programming assignments work (mostly) well
  Allow students to work in pairs if they wish
  Grading is difficult: unspecified details in algorithms, differing datasets
All materials available on my website at http://www.mathcs.carleton.edu/faculty/dmusican/cs377s05
