
4/8/11


Patrick Dudas

Data Mining for Web Personalization

Outline
- Personalization
- Data mining
  - Examples
- Web mining
- MapReduce
- Data Preprocessing
- Knowledge Discovery
- Evaluation
- Information High


Personalization
- The goal of the data mining approach is “automatic personalization”
- Automatic personalization:
  - Content-based
  - Collaborative
  - Rule-based

Rule-based (Brief overview)
- Create decision rules
  - Implicitly/Explicitly
- Highly domain dependent
  - Rules nontransferable
- Profiles are based on user input
  - Biased
  - Static
    - Degrade over time


Content-based (Brief overview)
- Profile based on users' past experiences and their interests (ratings)
  - Think Amazon, Pandora, eBay...
- Vector similarities based on cosine similarity
- Bayesian classification
- Remember: Ratings = Profile = Recommendation
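To make the cosine-similarity bullet concrete, here is a minimal sketch in Python. The profile and item vectors, and the names `user_profile` and `items`, are made up for illustration; a real system would build them from ratings or content features.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical data: one user profile and two item vectors over the same
# three features (for example, weights for three genres).
user_profile = [5.0, 0.0, 3.0]
items = {
    "item_a": [4.0, 0.0, 2.0],
    "item_b": [0.0, 5.0, 0.0],
}

# Rank items by how closely they point in the same direction as the profile.
ranked = sorted(
    items,
    key=lambda name: cosine_similarity(user_profile, items[name]),
    reverse=True,
)
print(ranked)
```

Cosine similarity ignores vector length, so a user who rates everything highly and one who rates everything modestly can still match the same items.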

Collaborative (Brief overview)
- Create groups of users based on ratings
  - Nearest neighbor approach
- Once users are grouped, recommendations are drawn from what their neighbors rated
- More users or items = more dimensions of data
  - Dynamic or real-time recommendation not applicable
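The nearest-neighbor idea can be sketched as follows. This is an illustrative toy, not a production recommender: the `ratings` table is invented, and measuring similarity as cosine over co-rated items is one assumed choice among several.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann":  {"book": 5, "film": 4, "song": 1},
    "bob":  {"book": 5, "film": 5, "game": 4},
    "cara": {"book": 1, "song": 5, "game": 2},
}

def similarity(u, v):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, k=1):
    """Suggest items the k nearest neighbors rated that `user` has not."""
    others = [v for v in ratings if v != user]
    neighbors = sorted(others, key=lambda v: similarity(user, v), reverse=True)[:k]
    scores = {}
    for v in neighbors:
        for item, rating in ratings[v].items():
            if item not in ratings[user]:
                # Keep the best rating any neighbor gave the unseen item.
                scores[item] = max(scores.get(item, 0), rating)
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ann"))
```

Every call compares the target user against all other users, which hints at why the approach gets expensive as users and items grow.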


Data Mining
- Data-rich descriptions
- Large volumes of data
  - Reliable models
- Automated data collection
- Evaluate results/make decisions
- Integration with existing data sources

Examples of Large Datasets
- http://aws.amazon.com/datasets
- Featured data sets:
  - Illumina - Jay Flatley (CEO of Illumina) Human Genome Data Set, Science 315(5814): 972
    - 350 GB
  - YRI Trio Dataset
    - 700 GB
  - Sloan Digital Sky Survey DR6 Subset
    - 160 GB
- Genome, survey data, Google Books n-gram corpora, traffic statistics, OpenStreetMap dataset, Wikipedia traffic…


Data Mining Web Personalization
- Recommendations based on Web objects:
  - Items
  - Pages
  - Documents
- Navigation by links
- Web mining
  - Pros: Personalization (duh.), real-time, more enriched datasets
  - Cons: Privacy issues, building complex systems that misrepresent the individual

Extend the Data Mining Paradigm

1) Data Preparation and Transformation

2) Pattern Discovery

3) Recommendation


Data Preparation and Transformation
- Web logs
  - Date/time usage
  - Site information
  - Resource requested (image, video, etc.)
- Site files/meta-data
- The power of the cookie
  - Server-side cookies!
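As a rough illustration of pulling the date/time and requested resource out of a web log, the sketch below parses a single line. It assumes the server writes the Common Log Format; the slides do not say which format is used, and the sample line and the names `LOG_PATTERN` and `parse_line` are invented.

```python
import re
from datetime import datetime

# Assumption: entries follow the Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the bracketed timestamp into a timezone-aware datetime.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry

# Made-up sample line for illustration.
sample = '192.0.2.1 - - [08/Apr/2011:10:15:32 -0400] "GET /images/logo.png HTTP/1.1" 200 2326'
entry = parse_line(sample)
print(entry["time"].isoformat(), entry["resource"], entry["status"])
```

Lines that do not match return `None`, so malformed entries can be counted and dropped during cleaning rather than crashing the job.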

Data Preparation and Transformation (cont.)
- Pageview:
  - User actions (where they clicked and the path)
  - User events (what they are trying to accomplish)
- Session:
  - Sequence of page views
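One common way to turn page views into sessions is to start a new session after a period of inactivity. The sketch below assumes a 30-minute timeout, which is a conventional choice and not something the slides specify; the `views` data is made up.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides

def sessionize(pageviews, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, page) tuples into per-user sessions.

    A new session starts whenever a user is idle for more than `gap` seconds.
    """
    by_user = defaultdict(list)
    for user, timestamp, page in sorted(pageviews, key=lambda pv: (pv[0], pv[1])):
        by_user[user].append((timestamp, page))

    sessions = []
    for user, views in by_user.items():
        current = [views[0][1]]
        for (prev_time, _), (time, page) in zip(views, views[1:]):
            if time - prev_time > gap:
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

# Hypothetical page views: (user id, seconds since midnight, page).
views = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u2", 30, "/home"),
    ("u1", 4000, "/home"),
    ("u1", 4100, "/cart"),
]
print(sessionize(views))
```

Here user `u1` ends up with two sessions because more than 30 minutes pass between the second and third page views.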


MapReduce
- Google design
- Implemented by Hadoop
- C++, C#, Erlang, Java, OCaml, Perl, Python, Ruby, F#, R...

Example
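A minimal sketch of the MapReduce idea, run in a single Python process rather than on a cluster. It counts requests per page; the function names (`map_phase`, `reduce_phase`, `map_reduce`) and the records are illustrative and do not correspond to the Hadoop API.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (page, 1) pair for the page requested in one record."""
    user, page = record
    yield page, 1

def reduce_phase(key, values):
    """Reduce: combine all counts emitted for one page."""
    return key, sum(values)

def map_reduce(records):
    # Shuffle: group intermediate values by key, as the framework would
    # do between the map and reduce steps.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical (user, page) records.
records = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]
print(map_reduce(records))
```

In a real deployment the map calls and the reduce calls each run in parallel on different machines; only the grouping by key has to bring related values together.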


Usage Data Pre-Processing

Pattern Discovery
- We have data! Now what?
  - Clustering
  - Classification
  - Association rule discovery
  - Sequential pattern discovery
  - Markov models
  - Latent variable models


Clustering
- Partitioning
  - Split your data into groups
  - K-means
- Hierarchical
  - Divisive (top-down)
    - Start with everything in one cluster, then split it into groups
  - Agglomerative (bottom-up)
    - Start with each item as its own cluster, then merge the closest clusters
- Model-based
  - Build a model for the data (best fit)

K-means


User-Based Clustering
- Start with the user profiles
- Partition into k groups of profiles
  - Based on similarity
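Partitioning profiles into k groups can be sketched with a small k-means loop. This version is a simplification under stated assumptions: profiles are numeric vectors, similarity is squared Euclidean distance, and the first k profiles seed the centroids so the run is repeatable. The `profiles` data is invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(profiles, k, iterations=10):
    """Partition profile vectors into k groups; returns one label per profile."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels

# Hypothetical profiles: visits to (sports, news) pages per user.
profiles = [[9, 1], [1, 8], [8, 2], [2, 9], [10, 0]]
print(k_means(profiles, k=2))
```

Real implementations usually pick the starting centroids at random and rerun several times, since k-means can settle on different groupings depending on where it starts.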

Association Discovery
- Support
  - min(support)
- Confidence
  - min(confidence)
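Support and confidence for a rule such as "visitors who view /products also view /cart" can be computed directly from sessions. The sessions and the two minimum thresholds below are assumed values chosen for illustration.

```python
def support(sessions, items):
    """Fraction of sessions that contain every page in `items`."""
    items = set(items)
    return sum(1 for s in sessions if items <= s) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Support of antecedent and consequent together, over support of the antecedent.

    Assumes the antecedent occurs in at least one session.
    """
    both = set(antecedent) | set(consequent)
    return support(sessions, both) / support(sessions, antecedent)

# Hypothetical sessions, each the set of pages one visitor viewed.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

MIN_SUPPORT = 0.5      # assumed thresholds, chosen for illustration
MIN_CONFIDENCE = 0.6

rule = ({"/products"}, {"/cart"})
s = support(sessions, rule[0] | rule[1])
c = confidence(sessions, *rule)
keep_rule = s >= MIN_SUPPORT and c >= MIN_CONFIDENCE
print(s, c, keep_rule)
```

A rule is kept only when it clears both thresholds: minimum support filters out rare combinations, and minimum confidence filters out weak implications.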


Evaluation (Personalization Model)
- Challenges:
  - Recommendation algorithms may require a unique set of evaluation metrics
  - Personalization actions may differ by:
    - Domain
    - Intended application
    - Data gathered
- Check for overfitting
  - Training set
  - ROC curve

ROC Curve
- TPR = TP / (TP + FN)
- FPR = FP / (FP + TN)

      "No"   "Yes"
SN    .95    .05
N     .20    .80
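The two rates can be computed from labeled predictions, and sweeping a decision threshold yields the points of a ROC curve. The labels, scores, and thresholds below are invented for illustration and are unrelated to the table on the slide.

```python
def rates(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical held-out data: 1 = the user liked the item, plus a model score.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Sweeping the threshold from strict to lenient traces the curve
# as (FPR, TPR) points from (0, 0) to (1, 1).
roc_points = []
for threshold in (1.0, 0.75, 0.5, 0.0):
    tpr, fpr = rates(labels, scores, threshold)
    roc_points.append((fpr, tpr))
print(roc_points)
```

A model whose points hug the top-left corner separates the classes well; points along the diagonal mean it does no better than guessing.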


Information High
- Information is addictive
- Information can be misleading
- Information ethics
- Information is power, sometimes too powerful

Personal Suggestions
- Develop a hypothesis
- Figure out what data is needed
- Make informed decisions
  - Don’t trust just your judgment
  - Experts are experts for a reason!
  - Develop a way to validate based on experience
- Then get more data if needed


Sources
- Kohavi, R. and F. Provost (2001). "Applications of data mining to electronic commerce." Data Mining and Knowledge Discovery 5(1): 5-10.
- Dean, J. and S. Ghemawat (2008). "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51(1): 107-113.
- Mobasher, B. (2007). "Data mining for web personalization." The adaptive web: 90-135.
- http://en.wikipedia.org/wiki/File:FrequentItems.png
- Zaharia, M., A. Konwinski, et al. (2008). Improving MapReduce performance in heterogeneous environments, USENIX Association.
- Witten, I. H. and E. Frank (2002). "Data mining: practical machine learning tools and techniques with Java implementations." ACM SIGMOD Record 31(1): 76-77.
- Senthil kumar, A. (2011). Knowledge discovery practices and emerging applications of data mining: trends and new domains. Hershey, PA, Information Science Reference.

Thank you!   Questions?