data mining for web personalizationpeterb/2480-012/datamining.pdf · 4/8/11 2 personalization goal...
TRANSCRIPT
4/8/11
1
Patrick Dudas
Data Mining for Web Personalization
Outline Personalization Data mining
Examples
Web mining MapReduce Data Preprocessing Knowledge Discovery Evaluation Information High
4/8/11
2
Personalization Goal of data mining approach is for “automatic
personalization” Automatic Personalization:
Content-based Collaborative Rule-based
Rule-based (Brief overview) Create decision rules
Implicitly/Explicitly
Highly domain dependent Rules nontransferable
Profiles are based on user input Biased Static Degrade over time
4/8/11
3
Content-based (Brief overview) Profile based on users past experiences and their interest
(ratings) Think Amazon, Pandora, eBay..
Vector similarities based on cosine similarity Bayesian classification Remember: Ratings = Profile = Recommendation
Collaborative (Brief overview) Creating groups of users based on ratings
Nearest neighbor approach
Once grouped, recommendation based on the other neighbors are presented
More users or items = more dimensions of data Dynamic or real-time not applicable
4/8/11
4
Data Mining Data rich descriptions Large volumes of data
reliable models
Automated data collection Evaluate results/make decisions Integration with existing data
sources
Examples of Large Datasets http://aws.amazon.com/datasets Featured data sets:
Illumina - Jay Flatley (CEO of Illumina) Human Genome Data Setcience 315(5814): 972. 350 GB
YRI Trio Dataset 700GB
Sloan Digital Sky Survey DR6 Subset 160 GB
Genome, survey data, Google Books n-gram corpuses, traffic statistics, OpenStreetMap dataset, Wikipedia traffic…
4/8/11
5
Data Mining Web Personalization Recommendations based on Web objects:
Items Pages Documents
Navigation by links
Web mining Pros: Personalization (duh.), real-time, more enriched datasets Cons: Privacy issues, building complex systems that
misrepresent the individual
Extend the Data Mining Paradigm
1) Data Preparation and Transformation
2) Pattern Discovery
3) Recommendation
4/8/11
6
Data Preparation and Transformation Web logs
Date/time usage Site information Resource requested (image, video, etc.)
Site files/meta-data The power of the cookie
Server-side cookies!
Data Preparation and Transformation (cont.) Pageview:
User actions (where they clicked and the path) User events (what they are trying to accomplish)
Session: Sequence of page views
4/8/11
7
MapReduce Google design Hoodop implemented C++, C#, Erlang, Java,
Ocaml, Perl, Python, Ruby, F#, R..
Example
4/8/11
8
Usage Data Pre-Processing
Pattern Discovery We have data! Now what?
Cluster Classification Association Rule Discovery Sequential pattern Discovery Markov Models Latent Variable Model
4/8/11
9
Clustering Partitioning
Split your data into groups K-means
Hierarchical Divisive (top-down)
Start with everything, find groups
Agglomerative (bottom-up) Start with a cluster and add additional information
Model-based Building a model for the data (best fit)
K-means
4/8/11
10
User-Based Clustering Start with the user profile Partition into k-groups of profiles
Based on similarity
Association Discovery
Support min(support)
Confidence min(confidence)
4/8/11
11
Evaluation (Personalization Model) Challenges:
Recommendation algorithms may require unique set of evaluation metrics
Personalization actions may be different Domain Intended application Data gathered
Check for overfitting data Training set ROC Curve
ROC Curve TPR = TP / (TP + FN) FPR = FP / (FP + TN)
"No" "Yes"
SN
N
.95 .05
.20 .80
4/8/11
12
Information High Information is addictive Information can be misleading Information ethics Information is power, sometimes too powerful
Personal Suggestions Develop a hypothesis Figure out what data is needed Make informed decisions
Don’t trust just your judgment Experts are experts for a reason! Develop a way to validate based on experience
Then get more data if needed
4/8/11
13
Sources Kohavi, R. and F. Provost (2001). "Applications of data mining to electronic
commerce." Data Mining and Knowledge Discovery 5(1): 5-10. Dean, J. and S. Ghemawat (2008). "MapReduce: Simplified data processing on
large clusters." Communications of the ACM 51(1): 107-113. Mobasher, B. (2007). "Data mining for web personalization." The adaptive web:
90-135. http://en.wikipedia.org/wiki/File:FrequentItems.png Zaharia, M., A. Konwinski, et al. (2008). Improving mapreduce performance in
heterogeneous environments, USENIX Association. Witten, I. H. and E. Frank (2002). "Data mining: practical machine learning
tools and techniques with Java implementations." ACM SIGMOD Record 31(1): 76-77.
Senthil kumar, A. (2011). Knowledge discovery practices and emerging applications of data mining : trends and new domains. Hershey, PA, Information Science Reference.
Thank you! Questions?