A Data Mining Course for Computer Science: Primary Sources and Implementations
Dave Musicant
Saturday, March 4, 2006
Overview
  What is data mining?
  Why offer a course in data mining?
  Why focus on research papers in an undergraduate class?
  What topics do I cover?
  What research papers do I use in class?
  What assignments do I use?
  Does it work?
What is data mining?
  “The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data” (Fayyad et al.)
  Data mining and machine learning are two sides of the same coin
    Data mining focuses more on larger datasets
    Machine learning focuses more on connections with artificial intelligence
    ... but there is much overlap between the two areas
  My course is titled “Machine Learning and Data Mining”
    boosts student enthusiasm
Why offer a course in data mining?
  Interesting applied area of CS that uses theoretical techniques
  Reinforces and introduces data structures and algorithms
    heaps, R-trees, graphs
  Privacy and ethics
  Personal ownership in assignments
    Students choose datasets in areas that interest them
  New field, yet accessible
  Can be done with only Data Structures as a prereq
  It’s my research area
Why research papers? Can it be done?
  One approach to the course is to use data mining software
    Lopez & Ludwig, University of Minnesota-Morris
  I wanted students to implement data mining algorithms
  Textbook support with a computer science focus is limited
    (I use Margaret Dunham’s text as a side reference)
  Primary sources provide a rich experience
  With proper selection, papers are accessible to undergraduates
  Papers must be supplemented in the classroom
    e.g. specific topics in linear algebra, statistics
    directs classroom activity toward filling gaps and interpreting papers instead of parroting the reading
Topics, Papers, Assignments
  Each topic consists of one or more papers that are assigned to the students to read before class discussion
  Students post to Caucus (electronic message board):
    something they didn’t understand, or something they found interesting
    a potential exam question
  Assignment follows class discussion
  Detailed references for all papers and datasets can be found in the paper
Topic 0: What is Data Mining?
  Paper: J. Friedman, “Data Mining and Statistics: What’s the Connection?”
    Entertaining and controversial
    Pokes fun at flaws on all sides
    Helps to ensure buy-in from computer science students (they haven’t been tricked into taking a stats course)
  Assignment: For the “census-income” dataset, determine:
    Number of records and features
    How many features are continuous, how many are nominal
    For continuous features: average, median, minimum, maximum, standard deviation
    2-dimensional scatter plots of two features at a time
    Interesting patterns
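The per-feature statistics asked for in the Topic 0 assignment could be computed along these lines. This is only a sketch using Python's standard `statistics` module; the `summarize` helper and the sample `ages` column are hypothetical and are not taken from the census-income data.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for one continuous feature."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical "age" column, made up for illustration only.
ages = [25, 38, 28, 44, 18, 34]
summary = summarize(ages)
for name, value in summary.items():
    print(name, value)
```

Whether to report the sample or the population standard deviation (`statistics.pstdev`) is a choice the assignment leaves open.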
Topic 1: Classification and Regression
Example: First Trimester Screening
Use this training set to learn how to classify patients where diagnosis is not known:
The input data is often easily obtained, whereas the classification is not.
Training Set (input data: tissue, Chemical 1, Chemical 2; classification: Diagnosis)
Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
1            5             20           118          Positive
2            3             15           130          Negative
3            7             10           52           Negative
4            2             30           100          Positive
Testing Set
Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
101          4             16           95           ?
102          9             22           125          ?
103          1             14           80           ?
Technique: Nearest Neighbor
Envision each example as a point in n-dimensional space
Classify test point same as nearest training point
tissue (cm)   Chemical 1   Diagnosis
5             20           Positive
3             15           Negative
7             10           Negative
2             30           Positive
[Figure: scatter plot of the training points (axis ticks 0 to 8 and 0 to 35), with an unlabeled test point marked “What am I?”]
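A minimal sketch of the nearest neighbor rule on the four training points above. The helper name `nearest_neighbor` is hypothetical, and the sketch assumes plain Euclidean distance with no feature scaling, which the slides do not specify.

```python
import math

# Training points from the slide: (tissue in cm, Chemical 1) -> diagnosis.
training = [
    ((5, 20), "Positive"),
    ((3, 15), "Negative"),
    ((7, 10), "Negative"),
    ((2, 30), "Positive"),
]

def nearest_neighbor(point, examples):
    """Return the label of the training example closest to point."""
    _, label = min(examples, key=lambda ex: math.dist(point, ex[0]))
    return label

# Patient 101 from the testing set, restricted to the same two features.
prediction = nearest_neighbor((4, 16), training)
print(prediction)
```

Because Chemical 1 spans a wider range than tissue, it dominates the unscaled distance; normalizing features first is a common alternative.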
Topic 1: Classification and Regression
  Focus on scalable nearest neighbor algorithms
  Paper: Roussopoulos et al., “Nearest Neighbor Queries”
    How to do NN efficiently when data doesn’t fit in core
    Requires R-trees (I cover in class)
  Assignment: Code up the traditional k-nearest neighbor algorithm, apply to census-income data
    Experiment with different distance metrics (1-norm, 2-norm, cosine)
    Experiment with different values of k
    Produce plots showing training and test set accuracies
    Interpret results
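The k-nearest neighbor assignment, with its three distance metrics and varying k, might be approached roughly as follows. This is a sketch under assumptions: all function names and the toy `train` data are hypothetical, ties are broken arbitrarily, and no spatial index such as an R-tree is used.

```python
import math
from collections import Counter

def manhattan(a, b):
    """1-norm distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """2-norm distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b (nonzero vectors assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_classify(point, examples, k, distance):
    """Majority vote among the k training examples nearest to point."""
    neighbors = sorted(examples, key=lambda ex: distance(point, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(test_set, train_set, k, distance):
    """Fraction of test_set examples whose predicted label is correct."""
    correct = sum(knn_classify(x, train_set, k, distance) == y for x, y in test_set)
    return correct / len(test_set)

# Hypothetical toy data: two well-separated groups.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

for metric in (manhattan, euclidean, cosine_distance):
    print(metric.__name__, knn_classify((2, 2), train, 3, metric))
```

Cosine distance compares direction only and ignores magnitude, so it can disagree with the two norms on data like this, which is one of the things the experiment is meant to surface.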
Topic 2: Clustering
  Sometimes referred to as unsupervised learning
  Goal: find clusters of similar data
  Less accurate than supervised learning, but quite useful when no training set is available
  Where are the clusters below? How many are there?
[Figure: two scatter plots, chemical 1 vs. tissue (cm) and chemical 2 vs. tissue (cm)]
Topic 2: Clustering
  Assignment: Find dataset of interest from UCI Repository
    iris plant, letter recognition, liver disorders, Pima Indians diabetes, Congressional voting records, wine recognition, zoo
    this dataset is used for most remaining assignments
    if dataset has a class label, discard it for this assignment
  Implement basic clustering algorithm (k-means)
    Try varying number of clusters
    Try two different techniques for initializing clusters
    Report and interpret results found
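A basic k-means with two interchangeable initialization techniques could look something like the sketch below. The slides do not say which two techniques were intended; the two shown here (random data points and random partition) are assumptions, as are all function names and the toy `data`.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def init_random_points(data, k, rng):
    """Initialization 1: pick k distinct data points as starting centers."""
    return rng.sample(data, k)

def init_random_partition(data, k, rng):
    """Initialization 2: assign points to random groups, start from group means."""
    groups = [[] for _ in range(k)]
    for p in data:
        groups[rng.randrange(k)].append(p)
    # Fall back to a random data point if a group happens to be empty.
    return [mean_point(g) if g else rng.choice(data) for g in groups]

def kmeans(data, k, init, seed=0, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = init(data, k, rng)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data: two well-separated groups.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
for init in (init_random_points, init_random_partition):
    centers, clusters = kmeans(data, 2, init)
    print(init.__name__, centers)
```

Passing the initializer as a function makes it easy to compare techniques while keeping the main loop unchanged.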
Topic 2: Clustering
  Paper: Bradley et al., “Scaling Clustering Algorithms to Large Databases”
    Describes “Scalable K-means” algorithm
    Class discussion around “data mining desiderata”
  Paper: Guha et al., “CURE: An Efficient Clustering Algorithm for Large Databases”
    Agglomerative clustering algorithm: completely different approach
    Requires use of a heap (as I pose the assignment)
  Assignment: Implement stripped-down version of CURE
    Run on dataset, interpret results
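To show where a heap enters agglomerative clustering, here is a sketch of a much simpler centroid-based merge loop. It is not CURE and not the stripped-down version used in the course: it omits CURE's representative points, shrinking, and sampling. The function name `agglomerate` and the toy `points` are hypothetical.

```python
import heapq
import itertools
import math

def centroid(members):
    """Coordinate-wise mean of a cluster's points."""
    return tuple(sum(c) / len(members) for c in zip(*members))

def agglomerate(points, target_clusters):
    """Repeatedly merge the two clusters with the closest centroids."""
    clusters = {i: [p] for i, p in enumerate(points)}  # cluster id -> members
    next_id = len(points)

    # Heap of (distance, id_a, id_b) for every pair of starting clusters.
    heap = [
        (math.dist(centroid(clusters[a]), centroid(clusters[b])), a, b)
        for a, b in itertools.combinations(clusters, 2)
    ]
    heapq.heapify(heap)

    while len(clusters) > target_clusters and heap:
        _, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # stale entry: one side was already merged away
        merged = clusters.pop(a) + clusters.pop(b)
        center = centroid(merged)
        # Queue distances from the new cluster to every surviving cluster.
        for other, members in clusters.items():
            heapq.heappush(
                heap, (math.dist(center, centroid(members)), other, next_id)
            )
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

# Hypothetical toy data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
result = agglomerate(points, 2)
print(result)
```

Stale heap entries are skipped when popped rather than removed eagerly, which keeps the code short at the cost of a larger heap.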
Topic 3: Association Rules
  “Supermarket basket analysis”
  What items do people tend to buy together at the same time?
  Paper: Agrawal et al., “Fast Algorithms for Mining Association Rules”
    presents classic Apriori algorithm (skim other portions of paper)
  Assignment: Implement Apriori algorithm and run it on own dataset
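The level-wise frequent itemset search at the core of Apriori can be sketched as follows. This covers only frequent itemset generation, not the later step of deriving rules and their confidence. The function name, the use of an absolute support count, and the `baskets` data are all assumptions made for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        """Count each candidate's support and keep only the frequent ones."""
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    frequent = count({frozenset([item]) for item in items})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

# Hypothetical market baskets, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
for itemset, support in sorted(frequent_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is never counted against the data if any of its subsets is already known to be infrequent.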
Topic 4: Web Mining
  How does Google rank importance of web pages?
  Every page has a PageRank
    PageRank of a page is determined by the PageRank of the pages that link to it
    manifests itself as an eigenvalue problem
  Paper: Page et al., “The PageRank Citation Ranking: Bringing Order to the Web”
    describes basic version of Google PageRank algorithm
    cover eigenvalues in class
    exposure to linear algebra, numerical analysis
Topic 4: Web Mining
  Paper: Chakrabarti et al., “Mining the Link Structure of the World Wide Web”
    describes HITS algorithm for ranking web pages
    Google isn’t the only way to do it
    uses Latent Semantic Analysis, which requires singular value decomposition (cover in class)
  Assignment: Implement PageRank algorithm
    try it on archive of department website
      crawling for an assignment is dangerous
    sparse data representation
      hashing or other form of map for efficiency
    interpret results
[Figure: diagram with nodes labeled “hubs” and “authorities”]
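For the PageRank assignment, a sparse representation built on hash maps might look like the sketch below, which stores the link graph as a dictionary and runs power iteration. The damping factor of 0.85, the even redistribution of rank from pages with no outgoing links, and the tiny `site` graph are assumptions, not details from the slides.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a link graph stored as {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread evenly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes its rank in equal shares to the pages it links to.
        for p, targets in links.items():
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical four-page site, made up for illustration.
site = {
    "home": ["courses", "research"],
    "courses": ["home"],
    "research": ["home", "courses"],
    "contact": ["home"],
}
ranks = pagerank(site)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Only existing links are stored, so memory grows with the number of links rather than with the square of the number of pages.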
Topic 5: Collaborative Filtering
  a.k.a. Recommender Systems
  “I like Pink Floyd, Dream Theater, and Evanescence. Who should I be listening to?”
  Amazon.com, Yahoo! Launchcast
  Paper: Breese et al., “Empirical Analysis of Predictive Algorithms for Collaborative Filtering”
    Algorithms are nearest neighbor-like in flavor
    Involve averaging numerical scores
    Need to normalize for individual biases
  Students already working on final project, so no assignment
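The three points about the Breese et al. algorithms (neighbor-like, averaging scores, normalizing for individual bias) can be illustrated with a correlation-weighted predictor. This is a sketch in the spirit of the memory-based methods, not a reproduction of the paper's experiments; the users, the scores, and the extra band "Rush" are made up, and taking each user's mean over all of their ratings is one of several reasonable choices.

```python
import math

def mean(ratings):
    """Average score a user has given across everything they rated."""
    return sum(ratings.values()) / len(ratings)

def pearson(a, b):
    """Correlation between two users over the items both have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a, mean_b = mean(a), mean(b)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(
        sum((a[i] - mean_a) ** 2 for i in common)
        * sum((b[i] - mean_b) ** 2 for i in common)
    )
    return num / den if den else 0.0

def predict(user, item, all_ratings):
    """Predict user's score for item from other users' mean-centered scores."""
    target = all_ratings[user]
    weighted, total_weight = 0.0, 0.0
    for other, ratings in all_ratings.items():
        if other == user or item not in ratings:
            continue
        w = pearson(target, ratings)
        # Subtracting the other user's mean normalizes for individual bias.
        weighted += w * (ratings[item] - mean(ratings))
        total_weight += abs(w)
    if total_weight == 0:
        return mean(target)
    return mean(target) + weighted / total_weight

# Hypothetical 1-to-5 ratings, made up for illustration.
ratings = {
    "ann": {"Pink Floyd": 5, "Dream Theater": 4, "Evanescence": 1},
    "bob": {"Pink Floyd": 5, "Dream Theater": 4, "Evanescence": 1, "Rush": 5},
    "cat": {"Pink Floyd": 1, "Dream Theater": 2, "Evanescence": 5, "Rush": 1},
}
prediction = predict("ann", "Rush", ratings)
print(round(prediction, 2))
```

A user with opposite tastes still contributes: a negative correlation flips the sign of their deviation, so a low score from them counts as evidence for a high score.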
Topic 6: Ethical Issues in Data Mining
  Privacy concerns
  Good vs. evil uses of data mining
  Video: Ramakrishnan et al., “Data Mining: Good, Bad, or Just a Tool?”
    Panel discussion from KDD 2004
  Before watching video, students post to Caucus:
    how data mining could be exploited
    how this could be prevented (if possible)
  After watching video
    followup commentary
  Pictures from conference website at http://www.acm.org/sigs/sigkdd/kdd2004/
Topic 6: Ethical Issues in Data Mining
  Students’ response to video was more engaged than I expected
  More problems than solutions are raised in video
    Students were frustrated that solutions weren’t clear
  Many students interested in issue of accountability
    If someone’s privacy is violated, who is responsible?
    “Who do I sue?”
  Lively class discussion
Final Project
  “Do almost anything you want regarding data mining, so long as I approve it”
  Find a paper and implement the algorithm within
  Find a dataset of interest and study it completely, using Weka and/or their own code from throughout the term
    Quantitative association rules
    Poker association rules
    Collaborative filtering (music, art)
  Attack KDD Cup problems
    KDD Cup 2005: identify categories for web search queries
    tried this once: tended to be too big for them in the time that I had
    could perhaps be done with right level of support
Conclusions
  Papers are most memorable part of course
    Students speak very positively about this in evaluations
    Significant prep time for me to fill in gaps
  Caucus motivates reading papers
    Students find this a pain, but are thankful afterwards in evals
    Important to set deadline for posting a few hours before class so I have time to read
  Programming assignments work (mostly) well
    Allow students to work in pairs if they wish
    Grading is difficult: unspecified details in algorithms, differing datasets
  All materials available on my website at http://www.mathcs.carleton.edu/faculty/dmusican/cs377s05