Internet Mathematics 2011
Yandex Relevance Prediction Challenge
Overview of the “CLL” team’s solution

R. Gareev¹, D. Kalyanov², A. Shaykhutdinova¹, N. Zhiltsov¹

¹ Kazan (Volga Region) Federal University
² 10tracks.ru

28 December 2011
Outline
1 Problem Statement
2 Features
3 Feature Extraction
4 Statistical analysis
5 Contest Results
6 Appendix A. References
7 Appendix B. R functions
Problem statement
- Predict document relevance from user behavior, a.k.a. «Implicit Relevance Feedback»
- See also http://imat-relpred.yandex.ru/en for more details
User session example

[Figure: a sample session from one region. Query Q1 returns documents 1–5 at T = 0; documents 3, 5 and 1 are clicked at T = 10, T = 35 and T = 100. Query Q2, issued at T = 130, returns documents 6–10; documents 6 and 9 are clicked at T = 150 and T = 170.]
Labeled data
Given judgements for some pairs of documents and queries:

- a document Dj is relevant for a query Qi from a region R, or
- a document Dj is not relevant for a query Qi from a region R
The problem
- Given a set Q of search queries, for each (q, R) ∈ Q, provide a list of documents D1, . . . , Dm sorted by relevance to q in the region R
- The target evaluation metric is the Area Under the ROC Curve (AUC), averaged over all the test query-region pairs
AUC score
- Consider a list of documents D1, . . . , Dm and its prefix D1, . . . , Di of length i
- (FPR(i), TPR(i)) gives a single point on the ROC curve
- AUC is the area under the ROC curve
- AUC equals the probability that a randomly chosen relevant document comes before a randomly chosen non-relevant document
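This pairwise reading of AUC can be checked directly: it is the fraction of (relevant, non-relevant) document pairs in which the relevant one receives the higher score. A minimal sketch in Python (the solution described in these slides was implemented in R; the scores and labels below are invented):

```python
def auc_pairwise(scores, labels):
    """AUC as P(score of a random relevant doc > score of a random
    non-relevant doc); ties count as 1/2, matching the ROC area."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 2 relevant and 3 non-relevant documents; 5 of the 6 pairs are ordered correctly.
print(auc_pairwise([0.9, 0.8, 0.7, 0.4, 0.2], [1, 0, 1, 0, 0]))  # 5/6 ≈ 0.833
```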
Our problem restatement
- We treat it as a machine learning task
- Using the relevance judgements, learn a classifier H(R, Q, D) that predicts whether a document D is relevant to a query Q from a region R
- Replace RegionID, QueryID and DocumentID with related features extracted from the click log
- Use the classifier H(R, Q, D) to compute a list, sorted w.r.t. the classifier’s certainty scores, for a query Q from a region R
Features
- A «feature» is a function of (Q, R, D)
- Each feature either is or is not associated with its related region

Types:
  • Document features
  • Query features
  • Time-concerned features
Document features
1. (Q, D) → Number of occurrences of a URL in the SERP list
2. (Q, D) → Number of clicks
3. (Q, D) → Click-through rate
4. (Q, D) → Average position in the click sequence
5. (Q, D) → Average rank in the SERP list
6. (Q, D) → Average rank in the SERP list when the URL is clicked
7. (Q, D) → Probability of being last clicked
8. (Q, D) → Probability of being first clicked
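For illustration, features 2 and 3 reduce to simple counters over the log. A toy Python sketch (the actual extraction ran over the Yandex click log in R; the counts below are invented):

```python
from collections import Counter

# Invented counts: SERP impressions and clicks per (QueryID, URLID) pair.
shows = Counter({(174, 1625): 10, (174, 2510): 8})
clicks = Counter({(174, 1625): 4, (174, 2510): 1})

def num_clicks(q, d):
    """Feature 2: number of clicks on document d for query q."""
    return clicks[(q, d)]

def ctr(q, d):
    """Feature 3: click-through rate = clicks / impressions."""
    return clicks[(q, d)] / shows[(q, d)]

print(num_clicks(174, 1625), ctr(174, 1625))  # 4 0.4
```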
Query features
1. (Q) → Average number of clicks in a subsession
2. (Q) → Probability of being rewritten (i.e., of not being the last query in a session)
3. (Q) → Probability of being resolved (i.e., of its results being last clicked)
Time-concerned features
1. (Q) → Average time to first click
2. (Q, D) → Average time spent reading a document D
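Feature 2 here is commonly approximated from click timestamps: the time spent reading a document is taken to be the gap between its click and the next click. A toy Python sketch under that assumption, with the click times from the session example:

```python
def dwell_times(click_times):
    """Time spent on each clicked document, approximated by the gap to the
    next click; the last click's dwell time is unknown and skipped."""
    return [t2 - t1 for t1, t2 in zip(click_times, click_times[1:])]

# Clicks at T = 10, 35, 100 give reading times of 25 and 65 for the
# first two documents.
print(dwell_times([10, 35, 100]))  # [25, 65]
```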
Two-phase extraction

1. Normalization
   • lookup filtering by the ’Important triples’ set
   • normalization is specific to each feature
2. Grouping and aggregating
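The two phases can be sketched with plain dictionaries: phase 1 emits (key, value) rows, possibly repeated, and phase 2 groups them by key and aggregates. A Python illustration (the actual pipeline was not in Python; keys mimic (QueryID, URLID, RegionID) triples and the values are invented):

```python
from collections import defaultdict
from statistics import mean

# Phase 1 (normalization) output: (key, value) rows, possibly repeated.
rows = [((174, 1625, 0), 1), ((174, 1625, 0), 3), ((1974, 17562, 0), 1)]

# Phase 2: group rows by key, then aggregate each group.
grouped = defaultdict(list)
for key, value in rows:
    grouped[key].append(value)

features = {key: mean(values) for key, values in grouped.items()}
print(features[(174, 1625, 0)])  # 2
```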
Normalization
- Convert click-log entries to a relational table with the following attributes:
  • feature domain attributes, e.g.:
    • (Q, R, U), (Q, U) for document features
    • (Q, R), (Q) for query features
  • the feature attribute value
- Sequential processing, session by session:
  • reject spam sessions
  • emit values (possibly repeated)
Normalization example (I)

Click log (with SessionID, TimePassed omitted):

Action  QueryID  RegionID  URLs
Q       174      0         1625 1627 1623 2510 2524
Q       1974     0         2091 17562 1626 1623 1627
C       17562
C       1627
C       1625
C       2510

Intermediate table for the ’Average click position’ feature:

QueryID  URLID  RegionID  ClickPosition
1974     17562  0         1
1974     1627   0         2
174      1625   0         1
174      2510   0         2
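The mapping from log to intermediate table can be sketched as follows; we assume, as the example suggests, that each click is attributed to the most recent query whose SERP lists the clicked URL. A Python sketch (the actual pipeline was not in Python):

```python
def click_positions(session):
    """For one session, emit (QueryID, URLID, position-in-click-sequence) rows.
    Each click goes to the most recent query whose SERP listed the URL."""
    serps = []    # (query_id, urls) in order of issue
    counts = {}   # clicks attributed so far, per query
    rows = []
    for action in session:
        if action[0] == 'Q':
            _, qid, urls = action
            serps.append((qid, set(urls)))
            counts[qid] = 0
        else:  # ('C', url)
            _, url = action
            for qid, urls in reversed(serps):
                if url in urls:
                    counts[qid] += 1
                    rows.append((qid, url, counts[qid]))
                    break
    return rows

# The session from the slide above.
session = [('Q', 174, [1625, 1627, 1623, 2510, 2524]),
           ('Q', 1974, [2091, 17562, 1626, 1623, 1627]),
           ('C', 17562), ('C', 1627), ('C', 1625), ('C', 2510)]
for row in click_positions(session):
    print(row)
```

Running this reproduces the four rows of the intermediate table.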
Normalization example (II)
Click log (with SessionID omitted):

Time  Action  QueryID  RegionID  URLs
0     Q       5        0         99 16 87 39
6     C       84
120   Q       558      0         84 5043 5041 5039
125   Q       8768     0         74672 74661 74674 74671
145   C       74661

Intermediate table for the ’Time to first click’ feature:

QueryID  RegionID  FirstClickTime
5        0         6
8768     0         20
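This second mapping can be sketched the same way; we attribute each click to the most recently issued query, and queries with no click (like 558 in the example) emit nothing. An illustrative Python sketch, not the original implementation:

```python
def time_to_first_click(log):
    """Emit (QueryID, time between query issue and its first click) rows."""
    rows, current, clicked = [], None, set()   # current = (query_id, issue_time)
    for t, action, *rest in log:
        if action == 'Q':
            current = (rest[0], t)
        elif current and current[0] not in clicked:
            clicked.add(current[0])
            rows.append((current[0], t - current[1]))
    return rows

# The log from the slide above (URL lists dropped; only times matter here).
log = [(0, 'Q', 5), (6, 'C', 84),
       (120, 'Q', 558), (125, 'Q', 8768), (145, 'C', 74661)]
print(time_to_first_click(log))  # [(5, 6), (8768, 20)]
```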
Our final ML-based solution in a nutshell

- Binary classification task for predicting assessors’ labels
- 26 features extracted from the click log
- Gradient Boosted Trees learning model (gbm R package)
- Tuning the model’s parameters w.r.t. AUC averaged over the given query-region pairs
- Ranking URLs according to the best model’s probability scores
Data Analysis Scheme

1. Given initial training and test sets
2. Partition the initial training set into two sets:
   • training set (3/4)
   • test set (1/4)
3. Consider the following models:
   • Gradient Boosted Trees (Bernoulli distribution; 0–1 loss function)
   • Gradient Boosted Trees (AdaBoost distribution; exponential loss function)
   • Logistic Regression
4. Learn and tune parameters w.r.t. the target metric (Area under the ROC curve) on the training set, using 3-fold cross-validation
5. Obtain the estimates for the target metric on the test set
6. Choose the optimal model, refit it on the whole initial training set and apply it to the initial test set
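Steps 2 and 4 of the scheme amount to a random 3/4–1/4 split plus a 3-fold partition of the training part. A stdlib Python sketch (the original work used R, e.g. caret's createFolds, as shown in Appendix B):

```python
import random

def split_train_test(items, train_frac=0.75, seed=0):
    """Step 2: hold out a quarter of the initial training set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

def cv_folds(items, k=3):
    """Step 4: a k-fold partition for cross-validated tuning."""
    return [list(items[i::k]) for i in range(k)]

train, test = split_train_test(range(100))
print(len(train), len(test), [len(f) for f in cv_folds(train)])  # 75 25 [25, 25, 25]
```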
Boosting [Schapire, 1990]

- Given a training set (x1, y1), . . . , (xN, yN), yi ∈ {−1, +1}
- For t = 1, . . . , T:
  • construct a distribution Dt on {1, . . . , N}
  • sample examples from it, concentrating on the “hardest” ones
  • learn a “weak classifier” ht : X → {−1, +1} (at least better than random) with error εt on Dt:

      εt = P_{i∼Dt}(ht(xi) ≠ yi)

- Output the final classifier H as a weighted majority vote of the ht
AdaBoost [Freund & Schapire, 1997]

- Constructing Dt:
  • D₁(i) = 1/N
  • given Dt and ht:

      D_{t+1}(i) = (Dt(i) / Zt) · e^{−αt}  if yi = ht(xi)
      D_{t+1}(i) = (Dt(i) / Zt) · e^{αt}   if yi ≠ ht(xi)

    where Zt is a normalization factor and

      αt = (1/2) · ln((1 − εt) / εt) > 0

- Final classifier:

      H(x) = sign(Σt αt ht(x))
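The update has a tidy property worth seeing once: after reweighting and normalizing, the examples the weak learner got wrong carry exactly half of the total weight. A small Python sketch of one round (illustrative only):

```python
import math

def adaboost_update(weights, correct):
    """One AdaBoost round: given current example weights and a boolean list
    marking which examples the weak learner classified correctly,
    compute alpha and the renormalized weights D_{t+1}."""
    eps = sum(w for w, c in zip(weights, correct) if not c)  # weighted error
    alpha = 0.5 * math.log((1 - eps) / eps)
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(new)  # normalization factor Z_t
    return alpha, [w / z for w in new]

# Four equally weighted examples; the weak learner errs on one (eps = 0.25).
alpha, w = adaboost_update([0.25] * 4, [True, True, True, False])
print(round(alpha, 4), [round(x, 4) for x in w])  # 0.5493 [0.1667, 0.1667, 0.1667, 0.5]
```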
Gradient boosted trees [Friedman, 2001]

- Stochastic gradient descent optimization of the loss function
- Decision trees as the weak classifier
- Does not require feature normalization
- No need to handle missing values specially
- Good performance reported in relevance prediction problems [Piwowarski et al., 2009], [Hassan et al., 2010] and [Gulin et al., 2011]
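To make the mechanics concrete, here is a self-contained toy boosting loop: squared loss, one-split regression stumps as the weak learners, and a shrinkage (learning-rate) factor. This is an illustrative Python sketch, not the gbm package's algorithm:

```python
def fit_stump(xs, residuals):
    """Weak learner: a one-split regression stump on a single feature."""
    best = None
    for split in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - (lmean if x <= split else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x, s=split, a=lmean, b=rmean: a if x <= s else b

def gbm_fit(xs, ys, n_trees=50, shrinkage=0.1):
    """Gradient boosting for squared loss: each stump fits the current
    residuals and is added scaled down by the shrinkage parameter."""
    pred = [sum(ys) / len(ys)] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        pred = [p + shrinkage * stump(x) for p, x in zip(pred, xs)]
    return pred

# Predictions creep toward the targets [1, 1, 3, 3] as trees are added.
print([round(p, 2) for p in gbm_fit([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])])
```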
Gradient boosted trees: the gbm R package implementation

- Two distributions are available for classification tasks: Bernoulli and AdaBoost
- Three basic parameters: interaction depth (depth of each tree), number of trees (iterations) and shrinkage (learning rate)
Logistic regression (glm, stats R package)

- Preprocess the initial training data: impute missing values with the help of bagged trees
- Fit the generalized linear model

      f(x) = 1 / (1 + e^{−z}),  where z = β0 + β1·x1 + · · · + βk·xk
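The model above is just a sigmoid over a linear score. A Python sketch with made-up coefficients (glm estimates the β by maximum likelihood; that fitting step is not shown here):

```python
import math

def predict_proba(x, beta0, betas):
    """f(x) = 1 / (1 + e^(-z)) with z = beta0 + beta1*x1 + ... + betak*xk."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

print(predict_proba([1.0, 2.0], beta0=0.0, betas=[0.0, 0.0]))  # 0.5
print(predict_proba([1.0], beta0=0.0, betas=[math.log(3)]))    # ≈ 0.75
```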
Tuning the gbm (Bernoulli) model

3-fold CV estimate of AUC for the optimal parameters: 0.6457435
Tuning the gbm (AdaBoost) model

3-fold CV estimate of AUC for the optimal parameters: 0.6455384
Comparative performance of the three optimal models (test error estimates)

| Model | Optimal parameter values | Test estimate of AUC |
|---|---|---|
| gbm (Bernoulli) | interaction.depth=2, n.trees=500, shrinkage=0.01 | 0.6324717 |
| gbm (AdaBoost) | interaction.depth=4, n.trees=700, shrinkage=0.01 | 0.6313393 |
| logistic regression | – | 0.618648 |
Contest statistics
- 101 participants, 84 of them eligible for prizes
- Two-stage evaluation procedure with a validation set and a test set (their sizes were unknown during the contest)
- Validation set size: ≈ 11,000 instances
- Test set size: ≈ 20,000 instances
Preliminary Results (validation set)

19th place (AUC = 0.650004)
Final Results (test set)

34th place (AUC = 0.643346)

| # | Team | AUC |
|---|---|---|
| 1 | cointegral* | 0.667362 |
| 2 | Evlampiy* | 0.66506 |
| 3 | alsafr* | 0.664527 |
| 4 | alexeigor* | 0.663169 |
| 5 | keinorhasen | 0.660982 |
| 6 | mmp | 0.659914 |
| 7 | Cutter* | 0.659452 |
| 8 | S-n-D | 0.658103 |
| … | … | … |
| 34 | CLL | 0.643346 |
| … | … | … |
Acknowledgements
We would like to thank:
- the organizers from Yandex for an exciting challenge
- E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev and other colleagues from Kazan Federal University for fruitful discussions and support
References I
[Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting // Journal of Computer and System Sciences. – V. 55. – No. 1. – 1997. – P. 119–139.

[Friedman, 2001] Friedman, J. Greedy Function Approximation: A Gradient Boosting Machine // Annals of Statistics. – V. 29. – No. 5. – 2001. – P. 1189–1232.

[Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning The Transfer Learning Track of Yahoo!’s Learning To Rank Challenge with YetiRank // JMLR: Workshop and Conference Proceedings. – 2011. – P. 63–76.

[Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG: User behavior as a predictor of a successful search // Proceedings of the Third ACM International Conference on Web Search and Data Mining. – ACM. – 2010. – P. 221–230.
References II
[Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining User Web Search Activity with Layered Bayesian Networks or How to Capture a Click in its Context // Proceedings of the Second ACM International Conference on Web Search and Data Mining. – ACM. – 2009. – P. 162–171.

[Schapire, 1990] Schapire, R. The strength of weak learnability // Machine Learning. – V. 5. – No. 2. – 1990. – P. 197–227.
Compute AUC for the gbm model

```r
# Compute per-(query, region) AUC for a gbm model and average it
ComputeAUC <- function(fit, ntrees, testSet) {
  require(ROCR)
  require(foreach)
  require(gbm)

  pureTestSet <- subset(testSet, select = -c(QueryID, RegionID, URLID, RelevanceLabel))
  queryRegions <- unique(subset(testSet, select = c(QueryID, RegionID)))
  count <- nrow(queryRegions)

  aucValues <- foreach(i = 1:count, .combine = "c") %do% {
    queryId <- queryRegions[i, "QueryID"]
    regionId <- queryRegions[i, "RegionID"]
    true.labels <- testSet[testSet$QueryID == queryId & testSet$RegionID == regionId, ]$RelevanceLabel
    m <- mean(true.labels)
    if (m == 0 | m == 1) {
      # AUC is undefined when all labels in the group agree; skip such groups
      curAUC <- NA
    } else {
      gbm.predictions <- predict.gbm(fit,
        pureTestSet[testSet$QueryID == queryId & testSet$RegionID == regionId, ],
        n.trees = ntrees, type = "response")
      pred <- prediction(gbm.predictions, true.labels)
      perf <- performance(pred, "auc")
      curAUC <- perf@y.values[[1]][1]
    }
    curAUC
  }
  return(mean(aucValues, na.rm = TRUE))
}
```
Tuning the gbm model w.r.t. AUC

```r
# Cross-validated tuning of the number of trees for a gbm model
TuningGbmFit <- function(trainSet, foldsNum = 3, interactionDepth = 4,
                         minNumTrees = 100, maxNumTrees = 1500, step = 100,
                         shrinkage = .01, distribution = "bernoulli",
                         aucfunction = ComputeAUC) {
  require(gbm)
  require(foreach)
  require(caret)
  require(sqldf)

  FUN <- match.fun(aucfunction)
  ntreesSeq <- seq(from = minNumTrees, to = maxNumTrees, by = step)
  folds <- createFolds(trainSet$QueryID, foldsNum, T, T)

  aucvalues <- foreach(i = 1:length(folds), .combine = "rbind") %do% {
    inTrain <- folds[[i]]
    cvTrainData <- trainSet[inTrain, ]
    cvTestData <- trainSet[-inTrain, ]
    pureCvTrainData <- subset(cvTrainData, select = -c(QueryID, RegionID, URLID))

    # Fit once with the maximum number of trees; shorter models are
    # evaluated by truncating the boosting sequence
    gbmFit <- gbm(formula = formula(pureCvTrainData), data = pureCvTrainData,
                  distribution = distribution, interaction.depth = interactionDepth,
                  n.trees = maxNumTrees, shrinkage = shrinkage)

    foreach(n = ntreesSeq, .combine = "rbind") %do% {
      auc <- FUN(gbmFit, n, cvTestData)
      c(n, auc)
    }
  }

  aucvalues <- as.data.frame(aucvalues)
  avgAuc <- sqldf("select V1 as ntrees, avg(V2) as AvgAUC from aucvalues group by V1")
  return(avgAuc)
}
```