PAKDD'15 DATA MINING COMPETITION: GENDER PREDICTION BASED ON E-COMMERCE DATA

Team members: Maria Brbic, Dragan Gamberger, Jan Kralj, Matej Mihelcic, Matija Piskorec, Tomislav Smuc

OVERVIEW

Team name: gambi
Final score: 0.8102 (12th place)
Crucial part of the work: how to construct good features?
Feature construction process:
1. Features containing general information
2. Construction from the time slot information
3. Construction from the information about the viewed items
4. Construction from the dataset enriched using recommendation algorithms
5. Construction from the error analysis

SOLUTION WORKFLOW

GENERAL INFORMATION FEATURES

- Number of items viewed during the session
- Duration of the session
  - Some sessions had unreasonable length
  - Assumption: some users forgot to log out
  - Sessions that were too long got a missing value
- Number of items divided by duration
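
The three general features can be sketched as follows. This is a minimal illustration, not the team's code: the function name, the input layout (start time, end time, list of item IDs) and the cut-off `MAX_DURATION_S` are assumptions, since the slides do not say what length counted as unreasonable.

```python
from __future__ import annotations

from datetime import datetime
from typing import Optional

# Hypothetical cut-off: the slides do not state when a session is "too long".
MAX_DURATION_S = 4 * 60 * 60


def general_features(start: datetime, end: datetime, items: list[str]) -> dict[str, Optional[float]]:
    """Item count, session duration and items per second for one session."""
    n_items = len(items)
    duration: Optional[float] = (end - start).total_seconds()
    if duration > MAX_DURATION_S:
        duration = None  # overly long session: treat the duration as missing
    # The rate is missing when the duration is missing or zero.
    rate = n_items / duration if duration else None
    return {"n_items": n_items, "duration_s": duration, "items_per_s": rate}
```

Leaving the duration as `None` keeps the decision about how to handle missing values with the classifier or a later imputation step.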

TEMPORAL FEATURES

- Day of the week, month, year
  - 7 features for the day of the week
  - Extra binary features for Saturday and Sunday
  - Extra binary features for the beginning, middle and end of the month
- Month
  - Christmas time
    - Five or fewer days before Christmas
- Hour
  - Exact hour (24 features)
  - Binary features: working hours, morning, evening, night
- The percentage of males in a given time slot
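
A rough sketch of how such temporal features could be derived from a session start time. All boundaries (which days count as beginning, middle or end of the month, which hours count as morning, evening, night or working hours) and the definition of a time slot as a (weekday, hour) pair are assumptions made for illustration; the slides do not specify them.

```python
from __future__ import annotations

from collections import defaultdict
from datetime import datetime


def temporal_features(ts: datetime) -> dict[str, int]:
    """Binary calendar and clock features for one timestamp."""
    feats = {f"dow_{d}": int(ts.weekday() == d) for d in range(7)}  # Monday = 0
    feats.update({f"hour_{h}": int(ts.hour == h) for h in range(24)})
    feats["is_saturday"] = int(ts.weekday() == 5)
    feats["is_sunday"] = int(ts.weekday() == 6)
    # Assumed split of the month into three parts.
    feats["month_start"] = int(ts.day <= 10)
    feats["month_middle"] = int(10 < ts.day <= 20)
    feats["month_end"] = int(ts.day > 20)
    # Five or fewer days before Christmas, taken here as 20-25 December.
    feats["pre_christmas"] = int(ts.month == 12 and 20 <= ts.day <= 25)
    # Assumed hour ranges.
    feats["working_hours"] = int(ts.weekday() < 5 and 9 <= ts.hour < 17)
    feats["morning"] = int(6 <= ts.hour < 12)
    feats["evening"] = int(18 <= ts.hour < 23)
    feats["night"] = int(ts.hour >= 23 or ts.hour < 6)
    return feats


def male_share_by_slot(train: list[tuple[datetime, bool]]) -> dict[tuple[int, int], float]:
    """Fraction of male sessions per (weekday, hour) slot, from labelled sessions."""
    counts: dict[tuple[int, int], list[int]] = defaultdict(lambda: [0, 0])
    for ts, is_male in train:
        slot = (ts.weekday(), ts.hour)
        counts[slot][0] += int(is_male)
        counts[slot][1] += 1
    return {slot: males / total for slot, (males, total) in counts.items()}
```

The male share can only be estimated from labelled training sessions; test sessions would then look up the value for their own slot.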

VIEWED ITEMS FEATURES

- Single categories
  - One feature for each category ID appearing in at least 3 distinct sessions
- Category pairs and triplets
  - One feature for each pair/triplet of categories appearing in at least 3 distinct sessions, calculated as the number of joint appearances
- Ordered pairs of categories
  - Number of times category i precedes category j at distance k or less (for k = 1, 2, 3)
- First category
  - Category that was viewed first in the session, for the A, B, C hierarchy levels
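
The ordered-pair count can be illustrated with a short sketch. It assumes a session is given as the list of category IDs in viewing order; the function name and the example IDs are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def ordered_pair_counts(categories: list[str], max_dist: int) -> Counter:
    """Count how often category i precedes category j at distance <= max_dist."""
    counts: Counter = Counter()
    for pos, first in enumerate(categories):
        # Look at the next max_dist positions after the current one.
        for later in categories[pos + 1 : pos + 1 + max_dist]:
            counts[(first, later)] += 1
    return counts


session = ["A1", "A2", "A1", "A3"]
for k in (1, 2, 3):
    print(k, dict(ordered_pair_counts(session, k)))
```

Calling the function once for each k in 1, 2, 3 yields three groups of features per session; pairs seen in too few sessions could then be dropped, as is done for single categories.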

VIEWED ITEMS FEATURES

- Rare categories
  - Categories appearing in fewer than three instances
  - One feature for each hierarchy level belonging to the specific parent category, calculated as a sum of all rare categories
- Due to the sparsity of the generated features (0.15% of the elements were non-zero), we performed non-negative matrix factorization (NMF) on the merged train and test set
  - 100 factors from NMF were then used as features
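
To make the NMF step concrete, here is a small pure-Python sketch using the classic multiplicative update rules for the squared-error objective. It is only an illustration: the slides do not say which NMF solver was used, and for a matrix of realistic size a library implementation on sparse data would be the practical choice. The function names and default parameters are assumptions.

```python
from __future__ import annotations

import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def nmf(v: Matrix, n_factors: int, n_iter: int = 200, seed: int = 0, eps: float = 1e-9):
    """Approximate a non-negative matrix v (n x m) as w (n x r) times h (r x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() for _ in range(n_factors)] for _ in range(n)]
    h = [[rng.random() for _ in range(m)] for _ in range(n_factors)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(n_factors)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_factors)] for i in range(n)]
    return w, h
```

With the train and test rows stacked into one matrix `v`, each row of `w` gives the factor values for one session, which can then be split back into train and test features.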

DATA ENRICHMENT

- Idea: to enrich data by using recommendation algorithms
  - Weighted item k-NN with 10 nearest neighbours was used to predict new items
  - For each session we recommended 4 new items with top-ranked scores
  - Recommendations for sessions containing only items unique to the sessions were ignored
- New features constructed from this more abundant set of items
  - For each category, all appearances of distinct items that were viewed only by a predefined percentage of females/males were summed
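
The enrichment idea can be sketched with a simple item-based k-NN recommender. This is an assumption-laden illustration rather than the team's implementation: it uses cosine similarity over session co-occurrence and sums neighbour similarities as the score, whereas the slides only name the method ("weighted item k-NN") and its parameters (10 neighbours, 4 recommended items).

```python
from __future__ import annotations

from collections import defaultdict
from math import sqrt

Neighbours = dict[str, list[tuple[str, float]]]


def item_neighbours(sessions: list[set[str]], k: int) -> Neighbours:
    """For each item, its k most similar items by cosine similarity of co-occurrence."""
    item_sessions: dict[str, set[int]] = defaultdict(set)
    for sid, items in enumerate(sessions):
        for item in items:
            item_sessions[item].add(sid)
    neighbours: Neighbours = {}
    for item, sids in item_sessions.items():
        overlap: dict[str, int] = defaultdict(int)
        for sid in sids:
            for other in sessions[sid]:
                if other != item:
                    overlap[other] += 1
        sims = [
            (other, n / sqrt(len(sids) * len(item_sessions[other])))
            for other, n in overlap.items()
        ]
        sims.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by item ID
        neighbours[item] = sims[:k]
    return neighbours


def recommend(session: set[str], neighbours: Neighbours, n_new: int) -> list[str]:
    """Top n_new unseen items, scored by summed similarity to the session's items."""
    scores: dict[str, float] = defaultdict(float)
    for item in session:
        for other, sim in neighbours.get(item, []):
            if other not in session:
                scores[other] += sim
    return sorted(scores, key=lambda it: (-scores[it], it))[:n_new]
```

A session whose items appear in no other session has no neighbours and therefore receives no recommendations, which matches the slide's note that such sessions were ignored. With the slide's settings the calls would be `item_neighbours(sessions, k=10)` and `recommend(session, neighbours, n_new=4)`.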

ERROR ANALYSIS

- We extracted instances labelled as males that were misclassified in cross-validation
- The main source of errors: products in the A00002 category and its subcategories
- New feature for sessions with the A00002 category, calculated as the percentage of males that viewed the same B product from the A00002 category in the same time slot
  - If there were more B products from the A00002 category, only the first one was taken into account
  - The number of sessions in which the B product appeared in a given time slot had to be greater than one
  - Sessions without the A00002 category got a missing value for the feature
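
A minimal sketch of this feature, under several assumptions: a session is represented as a time slot plus an ordered list of (A, B) category pairs, the statistics are gathered from labelled training sessions, and each session contributes only its first A00002 B product to them. The slides do not settle these details, and all function names are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Hashable, Optional

TARGET_A = "A00002"
Stats = dict[tuple[Hashable, str], tuple[int, int]]


def first_b_in_target(categories: list[tuple[str, str]]) -> Optional[str]:
    """First B-level category viewed under A00002, or None if there is none."""
    for a, b in categories:
        if a == TARGET_A:
            return b
    return None


def fit_male_counts(train: list[tuple[Hashable, list[tuple[str, str]], bool]]) -> Stats:
    """Per (slot, B product): number of male sessions and number of all sessions."""
    counts: dict[tuple[Hashable, str], list[int]] = defaultdict(lambda: [0, 0])
    for slot, categories, is_male in train:
        b = first_b_in_target(categories)
        if b is not None:
            counts[(slot, b)][0] += int(is_male)
            counts[(slot, b)][1] += 1
    return {key: (males, total) for key, (males, total) in counts.items()}


def a00002_feature(slot: Hashable, categories: list[tuple[str, str]], stats: Stats) -> Optional[float]:
    """Male share for the session's first A00002 B product in its slot, else None."""
    b = first_b_in_target(categories)
    if b is None:
        return None  # session without the A00002 category
    males, total = stats.get((slot, b), (0, 0))
    # Require more than one supporting session, as stated on the slide.
    return males / total if total > 1 else None
```

Whether a training session's own label should be excluded from its feature value is not stated in the slides; this sketch does not exclude it.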

CLASSIFICATION

- We used 642 features in total
- Classification algorithm: Random Forest with 1000 trees
- After obtaining predictions, we optimized the threshold to predict more males accurately at the cost of misclassifying females
  - Using 10-fold cross-validation results, we checked where to put a threshold on the classifier confidence
  - The chosen threshold was then used on the test set predictions
- Cross-validation score: 0.822, 0.862 after optimization
- Preliminary evaluation score: 0.8235
- Final evaluation score: 0.8102
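
The threshold search can be sketched as below. The slides do not name the evaluation metric, so balanced accuracy (the mean of the recall on males and on females) is used here purely as an assumption; the candidate grid and function names are also hypothetical. The scores are meant to be out-of-fold classifier confidences for the male class.

```python
from __future__ import annotations


def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Mean of the recall on class 1 (male) and class 0 (female).

    Assumes both classes occur in y_true.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def best_threshold(y_true: list[int], male_scores: list[float], candidates: list[float]) -> float:
    """Candidate threshold with the highest balanced accuracy (first one on ties)."""
    def score(threshold: float) -> float:
        y_pred = [int(s >= threshold) for s in male_scores]
        return balanced_accuracy(y_true, y_pred)

    return max(candidates, key=score)
```

With imbalanced classes, a threshold below 0.5 labels more sessions as male, trading some female accuracy for male recall; the threshold picked on cross-validation predictions is then applied unchanged to the test set scores.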

CLASSIFICATION

We also analyzed the importance of each generated feature

The most important features are temporal and NMF features

THANK YOU FOR YOUR ATTENTION!