1 acm kdd cup a survey: 1997-2011 qiang yang 杨强 (partly based on xinyue liu’s slides @sfu, and...
TRANSCRIPT
1
ACM KDD Cup A Survey: 1997-2011
Qiang Yang 杨强(partly based on Xinyue Liu’s slides @SFU,
and Nathan Liu’s slides @hkust)
Hong Kong University of Science and Technology香港科大
2
About KDD Cup (1997 – 2011)
Competition is a strong mover for Science and Engineering: ACM Programming
Contest World College level
Programming skills ROBOCUP
World Robotics Competition
3
About ACM KDDCUP ACM KDD: Premiere Conference in knowledge
discovery and data mining ACM KDDCUP:
Worldwide competition in conjunction with ACM KDD conferences.
It aims at: showcase the best methods for discovering higher-level
knowledge from data. Helping to close the gap between research and industry Stimulating further KDD research and development
4
Statistics
Participation in KDD Cup grew steadily
Average person-hours per submission: 204Max person-hours per submission: 910
Year 97 98 99 2000 2005 2011
Submissions 16 21 24 30 32 1000+
5
Algorithms (up to 2000)Algorithms Tried vs Submitted
0
2
4
6
8
10
12
14
16
18
20
Algorithm
Entri
es
Tried
Submitted
6
KDD Cup 97 A classification task –
to predict financial services industry (direct mail response)
Winners Charles Elkan, a Prof
from UC-San Diego with his Boosted Naive Bayesian (BNB)
Silicon Graphics, Inc with their software MineSet
Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System
7
MineSet (Silicon Graphics Inc.) A KDD tool that combines data access,
transformation, classification, and visualization.
8
KDD Cup 98: CRM Benchmark
URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
A classification task – to analyze fund raising mail responses to a non-profit organization
Winners Urban Science Application
s, Inc. with their software GainSmarts.
SAS Institute, Inc. with their software SAS Enterprise Miner ™
Quadstone Limited with their software Decisionhouse ™
KDDCUP 1998 Results
$-
$5,000
$10,000
$15,000
$20,000
$25,000
$30,000
$35,000
$40,000
$45,000
$50,000
$55,000
$60,000
$65,000
$70,000
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%Maximum Possible Profit Line($72,776 in profits with 4,873 mailed)
GainSmarts
SAS/Enterprise Miner
Quadstone/Decisionhouse
Mail to Everyone Solution ($10,560 in profits with 96,367 mailed)
10
ACM KDD Cup 1999 URL:
www.cse.ucsd.edu/users/elkan/kdresults.html
Problem To detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders
Data: from DoD Winners
SAS Institute Inc. with their software Enterprise Miner.
Amdocs with their Information Analysis Environment
11
KDDCUP 2000: Data Set and Goal:
Data collected from Gazelle.com, a legwear and legcare Web retailer Pre-processedTraining set: 2 months Test sets: one month Data collected includes:
Click streams Order information
The goal – to design models to support web-site personalization and to improve the profitability of the site by increasing customer response.
Questions - When given a set of page views,
characterize heavy spenders
characterize killer pages characterize which
product brand a visitor will view in the remainder of the session?
12
KDDCUP 2000: The Winners
Question 1 & 5 Winner: Amdocs
Question 2 & 3 Winner: Salford Systems
Question 4 Winner: e-steam
13
KDD Cup 2001 3 Bioinformatics Tasks
Dataset 1: Prediction of Molecular Bioactivity for Drug Design
half a gigabyte when uncompressed
Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3)
Dataset 2 is smaller and easier to understand
7 megabytes uncompressed
A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.
14
2001 Winners Task 1, Thrombin:
Jie Cheng (Canadian Imperial Bank of Commerce).
Bayesian network learner and classifier
Task 2, Function: Mark-A. Krogel (University of Magdeburg).
Inductive Logic programming Task 3, Localization:
Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo).
K nearest neighbor
Task 2: the genes of one
particular type of organism
A gene/protein can have more than one function, but only one localization.
15
molecular biology : Two tasks Task 1: Document
extraction from biological articles
Task 2: Classification of proteins based on gene deletion experiments
Winners: Task 1: ClearForest
and Celera, USA Yizhar Regev and
Michal Finkelstein Task 2: Telstra
Research Laboratories, Australia
Adam Kowalczyk and Bhavani Raskutti
16
2003 KDDCUP Information
Retrieval/Citation Mining of Scientific research papers
based on a very large archive of research papers
First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference
Second Task: a citation graph of a large subset of the archive from only the LaTex sources
Third Task: each paper's popularity will be estimated based on partial download logs
Last Task: devise their own questions
17
2003 KDDCUP: Results Task 1:
Claudia Perlich, Foster Provost, Sofus Kacskassy
New York University Task 2:
1st place: David Vogel AI Insight Inc.
Task 3: Janez Brank and Jure Leskovec Jozef Stefan Institute, Slovenija
Task 4: Amy McGovern, Lisa Friedland,
Michael Hay, Brian Gallagher, Andrew Fast,
Jennifer Neville, and David Jensen University of Massachusetts Amherst, USA
18
2004 Tasks and Results 粒子物理学和同调蛋白质预测( Particle physics; plus protein homology prediction )
两个子任务的冠军分别为:David S. Vogel, Eric Gottschalk, and Morgan C. Wang以及Bernhard Pfahringer, Yan Fu (付岩 ), RuiXiang Sun, Qiang Yang (杨强 ), Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao.
Past KDDCUP Overview: 2005-2010Year Host Task Technique Winner
2005 Microsoft Web query categorization
Feature Engineering, Ensemble
HKUST (沈抖,杨强,等)
2006 Siemens Pulmonary emboli detection
Multi-instance, Non-IID sample, Cost sensitive, Class Imbalance, Noisy data
AT&T, Budapest University of Technology & Economics
2007 Netflix Consumer recommendation
Collaborative Filtering, Time series, Ensemble
IBM Research, Hungarian Academy of Sciences
2008 Siemens Breast cancer detection from medical images
Ensemble, Class imbalance, Score calibration
IBM Research,National Taiwan University
2009 Orange Customer relationship prediction in telecom
Feature selection,Ensemble
IBM Research, University of Melbourne
2010 PSLC Data Shop
Student performance prediction in E-Learning
Feature engineering, Ensemble,Collaborative filtering
National Taiwan University ( CJ Lin, S. Lin, etc.)
KDDCUP’11 Dataset 11 years of data Rated items are
Tracks Albums Artists Genres
Items arranges in a taxonomy Two tasks
Track 1 Track 2
#ratings 263M 63M
#items 625K 296K
#users 1M 249K
Items in a Taxonomy
Track 1 Details
Track 1 Highlights Largest publicly available dataset Large number of items (50 times more
than Netflix) Extreme rating sparsity (20 times more
sparse than Netflix) Taxonomy can help in combating
sparsely rated items. Fine time stamps with both date and
time allow sophisticated temporal modeling.
Track 2 Details
Track 2 Highlights Performance metric focus on ranking/
classification, which differs from traditional collaborative filtering.
No validation data provided, need to self-construct binary labeled data from rating data.
Unlike track 1, track 2 removed time stamps to focus more than long term preference rather than short term behaviors.
Submission Stats
WinnersTrack 1 Track 2
1st place National Taiwan University National Taiwan University
2nd place Commendo (Netflix Prize Winnder)
Chinese Academy of Science,Hulu Labs
3rd place Hong Kong University of Science and Technology,Shanghai Jiaotong University
Commendo (Netflix Prize Winnder)
Chinese Teams at KDDCUP (NTU, CAS, HKUST)
Key Techniques Track 1:
Blending of multiple techniques Matrix factorization models Nearest neighbor models Restricted Bolzmann machines Temporal modelings
Track 2: Importance sampling of negative instances Taxonomical modelings Use of pairwise ranking objective functions
30
Summary To place on top of KDDCUP requires
Team work Expertise in domain knowledge as well as
mathematical tools Often done by world famous institutes and
companies Recent trends:
Dataset increasingly more realistic Participants increasingly more professional Tasks are increasingly more difficult
31
Summary
KDD Cup is an excellent source to learn the state-of-art KDD techniques
KDDCUP dataset often becomes the standard benchmark for future research, development and teaching
Top winners are highly regarded and respected
32
References
Elkan C. (1997). Boosting and Naive Bayesian Learning. Technical Report No. CS97-557, September 1997, UCSD.
Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze Miner Award. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/quadstone/index.html
Urbane Science (1998). Urbane Science wins the KDD-98 Cup. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98-release.html
Georges, J. & Milley, A. (1999). KDD’99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf
Rosset, S. & Inger A. (1999). KDD-Cup 99 : Knowledge Discovery In a Charitable Organization’s Donor Database. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/KDD2.doc
33
References (Cont.)
Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from http://bayesware.com/resources/tutorials/kddcup99/kddcup99.pdf
Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDD-Cup 2000: Question 1 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt
Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger A. (2000). KDD-Cup 2000: Question 5 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt
Salford System white papers: http://www.salford-systems.com/whitepaper.html
Summary talk presented at KDD (2000)http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt
34
References (cont) http://www.cs.wisc.edu/~dpage/kddcup2001/Cheng.pdf http://www.cs.wisc.edu/~dpage/kddcup2001/Krogel.pdf http://www.cs.wisc.edu/~dpage/kddcup2001/Hayashi.pdf