Introduction of Feature Hashing
TRANSCRIPT
Introduction of Feature Hashing
Create a Model Matrix via Feature Hashing with a Formula Interface
Wush Wu, Taiwan R User Group
Wush Chi-Hsuan Wu
· Ph.D. Student
· Familiar tools: R, C++, Apache Spark
- Display Advertising
- Large Scale Machine Learning
Taiwan, a Mountainous Island
· Highest Mountain: Yushan (3952 m)
source: http://www.nbcnews.com
Taipei City and Taipei 101
source: https://c2.staticflickr.com/6/5208/5231215951_c0e0036b17_b.jpg
New Landmark of Taipei
An Example of FeatureHashing
Sentiment Analysis
· Provided by Lewis Crouch
· Shows the pros of using FeatureHashing
Dataset: IMDB
· Predict the rating of the movie according to the text in the review.
· Training Dataset: 25000 reviews
· Binary response: positive or negative.

## [1] 1

· Cleaned review:

## [1] "kurosawa is a proved humanitarian this movie is totally about people living in"
## [2] "poverty you will see nothing but angry in this movie it makes you feel bad but"
## [3] "still worth all those who s too comfortable with materialization should spend 2"
## [4] "5 hours with this movie"
Word Segmentation

## [1] ""                "kurosawa"        "is"
## [4] "a"               "proved"          "humanitarian"
## [7] ""                "this"            "movie"
## [10] "is"             "totally"         "about"
## [13] "people"         "living"          "in"
## [16] "poverty"        ""                "you"
## [19] "will"           "see"             "nothing"
## [22] "but"            "angry"           "in"
## [25] "this"           "movie"           ""
## [28] "it"             "makes"           "you"
## [31] "feel"           "bad"             "but"
## [34] "still"          "worth"           ""
## [37] "all"            "those"           "who"
## [40] "s"              "too"             "comfortable"
## [43] "with"           "materialization" "should"
## [46] "spend"          "2"               "5"
## [49] "hours"          "with"            "this"
## [52] "movie"          ""
Word Segmentation and Feature Extraction

ID        MESSAGE  HATE   BAD    LIKE   GRATUITOUS
"5814_8"  TRUE     TRUE   TRUE   TRUE   FALSE
"2381_9"  FALSE    FALSE  FALSE  TRUE   FALSE
"7759_3"  FALSE    FALSE  FALSE  TRUE   FALSE
"3630_4"  FALSE    FALSE  FALSE  FALSE  FALSE
"9495_8"  FALSE    FALSE  FALSE  FALSE  TRUE
"8196_8"  FALSE    FALSE  TRUE   TRUE   TRUE
FeatureHashing
· FeatureHashing implements split in the formula interface.

hash_size <- 2^16
m.train <- hashed.model.matrix(
  ~ split(review, delim = " ", type = "existence"),
  data = imdb.train, hash.size = hash_size, signed.hash = FALSE)
m.valid <- hashed.model.matrix(
  ~ split(review, delim = " ", type = "existence"),
  data = imdb.valid, hash.size = hash_size, signed.hash = FALSE)
Type of split
· existence
· count
· tf-idf, which is contributed by Michaël Benesty and will be announced in v0.9.1
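For illustration, the first two types can be compared on the IMDB data from the earlier slides. This is a sketch that reuses the deck's own hashed.model.matrix call; only the type argument changes:

```r
# "existence": a cell is 1 if the token occurs in the review at all
m_exist <- hashed.model.matrix(
  ~ split(review, delim = " ", type = "existence"),
  data = imdb.train, hash.size = 2^16, signed.hash = FALSE)

# "count": a cell is the number of times the token occurs
m_count <- hashed.model.matrix(
  ~ split(review, delim = " ", type = "count"),
  data = imdb.train, hash.size = 2^16, signed.hash = FALSE)
```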
Gradient Boosted Decision Tree with xgboost
dtrain <- xgb.DMatrix(m.train, label = imdb.train$sentiment)
dvalid <- xgb.DMatrix(m.valid, label = imdb.valid$sentiment)
watch <- list(train = dtrain, valid = dvalid)
g <- xgb.train(booster = "gblinear", nrounds = 100, eta = 0.0001,
               max.depth = 2, data = dtrain, objective = "binary:logistic",
               watchlist = watch, eval_metric = "auc")

[0] train-auc:0.969895 valid-auc:0.914488
[1] train-auc:0.969982 valid-auc:0.914621
[2] train-auc:0.970069 valid-auc:0.914766
...
[97] train-auc:0.975616 valid-auc:0.922895
[98] train-auc:0.975658 valid-auc:0.922952
[99] train-auc:0.975700 valid-auc:0.923014
Performance
· The AUC of g in the public leaderboard is 0.90110
- It outperforms the benchmark in Kaggle
The Purpose of FeatureHashing
· Make our life easier
Formula Interface in R
Algorithm in Textbook: Regression

y = Xβ + ε

Real Data

    SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH    SPECIES
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica

· How to convert the real data to X?
Feature Vectorization and Formula Interface
(INTERCEPT) SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIESVERSICOLOR SPECIESVIRGINICA
1 1 5.1 3.5 1.4 0.2 0 0
2 1 4.9 3.0 1.4 0.2 0 0
51 1 7.0 3.2 4.7 1.4 1 0
52 1 6.4 3.2 4.5 1.5 1 0
101 1 6.3 3.3 6.0 2.5 0 1
102 1 5.8 2.7 5.1 1.9 0 1
· X is usually constructed via model.matrix in R

model.matrix(~ ., iris.demo)
Formula Interface
· y ~ a + b
- y is the response
- a and b are predictors
Formula Interface: +
(INTERCEPT) A B
1 1 5.1 3.5
2 1 4.9 3.0
51 1 7.0 3.2
52 1 6.4 3.2
101 1 6.3 3.3
102 1 5.8 2.7
· + is the operator for combining linear predictors

model.matrix(~ a + b, data.demo)
Formula Interface: :
(INTERCEPT) A B A:B
1 1 5.1 3.5 17.85
2 1 4.9 3.0 14.70
51 1 7.0 3.2 22.40
52 1 6.4 3.2 20.48
101 1 6.3 3.3 20.79
102 1 5.8 2.7 15.66
· : is the interaction operator

model.matrix(~ a + b + a:b, data.demo)
Formula Interface: *
(INTERCEPT) A B A:B
1 1 5.1 3.5 17.85
2 1 4.9 3.0 14.70
51 1 7.0 3.2 22.40
52 1 6.4 3.2 20.48
101 1 6.3 3.3 20.79
102 1 5.8 2.7 15.66
· * is the cross-product operator

# a + b + a:b
model.matrix(~ a * b, data.demo)
Formula Interface: (
(INTERCEPT) A:C B:C
1 1 7.14 4.90
2 1 6.86 4.20
51 1 32.90 15.04
52 1 28.80 14.40
101 1 37.80 19.80
102 1 29.58 13.77
· : and * are distributive over +

# a:c + b:c
model.matrix(~ (a + b):c, data.demo)
Formula Interface: .
(INTERCEPT) SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIESVERSICOLOR SPECIESVIRGINICA
1 1 5.1 3.5 1.4 0.2 0 0
2 1 4.9 3.0 1.4 0.2 0 0
51 1 7.0 3.2 4.7 1.4 1 0
52 1 6.4 3.2 4.5 1.5 1 0
101 1 6.3 3.3 6.0 2.5 0 1
102 1 5.8 2.7 5.1 1.9 0 1
· . means all columns of the data

# ~ Sepal.Length + Sepal.Width + Petal.Length +
#   Petal.Width + Species
model.matrix(~ ., iris.demo)
· Please check ?formula
Categorical Features
Categorical Feature in R

Dummy Coding

· A categorical variable of K + 1 categories is transformed to a K-dimensional vector.
· There are many coding systems, and the most commonly used is Dummy Coding.
- The first category is transformed to 0⃗.

##            versicolor virginica
## setosa              0         0
## versicolor          1         0
## virginica           0         1
Categorical Feature in Machine Learning
· Predictive analysis
· Regularization
· A categorical variable of K categories is transformed to a K-dimensional vector.
- The missing data are transformed to 0⃗.

contr.treatment(levels(iris.demo$Species), contrasts = FALSE)

##            setosa versicolor virginica
## setosa          1          0         0
## versicolor      0          1         0
## virginica       0          0         1
Motivation of FeatureHashing
Kaggle: Display Advertising Challenge
· Given a user and the page he is visiting, what is the probability that he will click on a given ad?
The Size of the Dataset
· 13 integer features and 26 categorical features
· 7 × 10^7 instances
· Download the dataset via this link
Vectorize These Features in R
· After binning the integer features, I got 3 × 10^7 categories.
· A sparse matrix was required
- A dense matrix required 14 PB of memory...
- A sparse matrix required 40 GB of memory...
Estimating Computing Resources
· Dense matrix: nrow (7 × 10^7) × ncol (3 × 10^7) × 8 bytes ≈ 14 PB
· Sparse matrix: nrow (7 × 10^7) × (13 + 26) × (4 + 4 + 8) bytes ≈ 40 GB
Vectorize These Features in R
· sparse.model.matrix is similar to model.matrix but returns a sparse matrix.
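A minimal sketch (sparse.model.matrix lives in the Matrix package):

```r
library(Matrix)

# Same formula interface as model.matrix, but the result is a sparse
# dgCMatrix: zero cells, such as unused dummy columns, take no memory.
m <- sparse.model.matrix(~ ., iris)
class(m)
dim(m)
```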
Fit Model with Limited Memory
· A post, Beat the benchmark with less than 200MB of memory, describes how to fit a model with limited resources:
- online logistic regression
- feature hashing trick
- adaptive learning rate
Online Machine Learning
· In online learning, the model parameter is updated after the arrival of every new datapoint.
- Only the aggregated information and the incoming datapoint are used.
· In batch learning, the model parameter is updated after access to the entire dataset.
Memory Requirement of Online Learning
· Related to the number of model parameters
- Requires little memory even for a large number of instances.
Limitation of model.matrix
· The vectorization requires all categories.

contr.treatment(levels(iris$Species))

##            versicolor virginica
## setosa              0         0
## versicolor          1         0
## virginica           0         1
My Workaround
· Scan all data to retrieve all categories
· Vectorize features in an online fashion
· The overhead of exploring features increases
Observations
· Mapping features to {0, 1, 2, ..., K} is one method to vectorize features.
- setosa => e⃗1
- versicolor => e⃗2
- virginica => e⃗3

##            setosa versicolor virginica
## setosa          1          0         0
## versicolor      0          1         0
## virginica       0          0         1
Observations
· contr.treatment ranks all categories to map the feature to an integer.
· What if we do not know all categories?
- Digital features are integers.
- We use a function that maps ℤ to {0, 1, 2, ..., K}

charToRaw("setosa")

## [1] 73 65 74 6f 73 61
What is Feature Hashing?
· A method to do feature vectorization with a hash function
· For example, Mod (%%) is a family of hash functions.
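As a toy illustration (not FeatureHashing's actual hash), one member of the Mod family can be built by rolling the raw bytes of a category name into an integer and reducing modulo the hash size K:

```r
toy_hash <- function(x, K) {
  bytes <- as.integer(charToRaw(x))  # charToRaw("setosa") is 73 65 74 6f 73 61 in hex
  h <- 0
  for (b in bytes) h <- (h * 31 + b) %% K  # keep the rolling value below K
  h
}

toy_hash("setosa", 16)
toy_hash("virginica", 16)
```

No global list of categories is needed: any string maps to an index in 0..K-1 on its own.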
Feature Hashing
· Choose a hash function and use it to hash all the categorical features.
· The hash function does not require global information.
An Example of Feature Hashing of Criteo's Data

V15      V16      V17
68fd1e64 80e26c9b fb936136
68fd1e64 f0cf0024 6f67f7e5
287e684f 0a519c5c 02cf9876
68fd1e64 2c16a946 a9a87e68
8cf07265 ae46a29d c81688bb
05db9164 6c9c9cf3 2730ec9c
· The categorical variables have been hashed onto 32 bits for anonymization purposes.
· Let us use the last 4 bits as the hash result, i.e. the hash function is function(x) x %% 16
· The size of the vector is 2^4 = 16, which is called the hash size
An Example of Feature Hashing of Criteo's Data

V15      V16      V17
68fd1e64 80e26c9b fb936136
68fd1e64 f0cf0024 6f67f7e5
287e684f 0a519c5c 02cf9876
68fd1e64 2c16a946 a9a87e68
8cf07265 ae46a29d c81688bb
05db9164 6c9c9cf3 2730ec9c
· 68fd1e64, 80e26c9b, fb936136 => 4, b, 6
- (0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
· 68fd1e64, f0cf0024, 6f67f7e5 => 4, 4, 5
- (0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
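The mapping above can be sketched in R, following the slide's convention that hash value h fills the h-th coordinate of the vector:

```r
# Keep only the last hex digit of each 32-bit hash, i.e. its low 4 bits
last4bits <- function(x) strtoi(substring(x, nchar(x)), base = 16L)

feats <- c("68fd1e64", "80e26c9b", "fb936136")
idx <- last4bits(feats)      # the last hex digits: 4, 11 (b), 6
tabulate(idx, nbins = 16)    # the 16-dimensional hashed feature vector
```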
Hash Collision
· 68fd1e64, f0cf0024, 6f67f7e5 => 4, 4, 5
- (0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
· A hash function is a many-to-one mapping, so different features might be mapped to the same index. This is called a collision.
· From the perspective of statistics, collisions in the hash function make the effects of these features confounded.
Choosing a Hash Function
· Low collision rate
- Real features, such as text features, are not uniformly distributed in ℤ.
· High throughput
· FeatureHashing uses the MurmurHash3 algorithm implemented by digest
Pros of FeatureHashing
A Good Companion of Online Algorithms

library(FeatureHashing)
hash_size <- 2^16
w <- numeric(hash_size)
for (i in 1:1000) {
  data <- fread(paste0("criteo", i))
  X <- hashed.model.matrix(V1 ~ ., data, hash.size = hash_size)
  y <- data$V1
  update_w(w, X, y)
}
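The slide leaves update_w undefined; a minimal sketch, assuming it performs one stochastic gradient step of logistic regression on the mini-batch (a hypothetical helper, not part of FeatureHashing):

```r
# Hypothetical helper: one SGD step of logistic regression on a
# mini-batch (X, y); returns the updated weight vector.
update_w <- function(w, X, y, eta = 0.01) {
  p <- 1 / (1 + exp(-as.numeric(X %*% w)))        # predicted probabilities
  grad <- as.numeric(Matrix::crossprod(X, p - y)) # gradient of the log loss
  w - eta * grad
}
```

Because R functions do not modify their arguments in place, the loop body would read w <- update_w(w, X, y).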
Pros of FeatureHashing
A Good Companion of Distributed Algorithms

library(pbdMPI)
library(FeatureHashing)
hash_size <- 2^16
w <- numeric(hash_size)
i <- comm.rank()
data <- fread(paste0("criteo", i))
X <- hashed.model.matrix(V1 ~ ., data, hash.size = hash_size)
y <- data$V1
# ...
Pros of FeatureHashing
Simple Training and Testing

library(FeatureHashing)
model <- is_click ~ ad * (url + ip)
m_train <- hashed.model.matrix(model, data_train, hash_size)
m_test <- hashed.model.matrix(model, data_test, hash_size)
Cons of FeatureHashing

Hash Size
Cons of FeatureHashing
Lose Interpretation
· Collision makes the interpretation harder.
· It is inconvenient to reverse the indices to features.

m <- hashed.model.matrix(~ Species, iris, hash.size = 2^4,
                         create.mapping = TRUE)
hash.mapping(m) %% 2^4

##     Speciessetosa  Speciesvirginica Speciesversicolor
##                 7                13                 8
The Result of the Competition...
· No.1: Field-aware Factorization Machine
- The hashing trick did not contribute any improvement on the leaderboard. They applied the hashing trick only because it made it easier to generate features.
· No.3, No.4 and No.9 (we): Neural Network
Extending Formula Interface in R
· Formula is the most R-style interface
Tips
· terms.formula and its argument specials

tf <- terms.formula(~ Plant * Type + conc * split(Treatment),
                    specials = "split", data = CO2)
attr(tf, "factors")

##                  Plant Type conc split(Treatment) Plant:Type
## Plant                1    0    0                0          1
## Type                 0    1    0                0          1
## conc                 0    0    1                0          0
## split(Treatment)     0    0    0                1          0
##                  conc:split(Treatment)
## Plant                                0
## Type                                 0
## conc                                 1
## split(Treatment)                     1
Specials
· attr(tf, "specials") tells which rows of attr(tf, "factors") need to be parsed further

rownames(attr(tf, "factors"))

## [1] "Plant"            "Type"             "conc"
## [4] "split(Treatment)"

attr(tf, "specials")

## $split
## [1] 4
Parse
· parse extracts the information from the specials

options(keep.source = TRUE)
p <- parse(text = rownames(attr(tf, "factors"))[4])
getParseData(p)

##   line1 col1 line2 col2 id parent                token terminal      text
## 9     1    1     1   16  9      0                 expr    FALSE
## 1     1    1     1    5  1      3 SYMBOL_FUNCTION_CALL     TRUE     split
## 3     1    1     1    5  3      9                 expr    FALSE
## 2     1    6     1    6  2      9                  '('     TRUE         (
## 4     1    7     1   15  4      6               SYMBOL     TRUE Treatment
## 6     1    7     1   15  6      9                 expr    FALSE
## 5     1   16     1   16  5      9                  ')'     TRUE         )
Efficiency
· Rcpp
- The core functions are implemented in C++
· Invoking external C functions
- digest exposes the C function:
  - Register the C function
  - Add the helper header file
- FeatureHashing imports the C function:
  - Import the C function
Let's Discuss the Implementation Later
source: http://www.commpraxis.com/wp-content/uploads/2015/01/commpraxis-audience-confusion.png
Summary
When should I use FeatureHashing?
Short Answer: Predictive Analysis with a large number of categories.

· Pros of FeatureHashing
- Makes it easier to do feature vectorization for predictive analysis.
- Makes it easier to tokenize the text data.

· Cons of FeatureHashing
- Decreases the prediction accuracy if the hash size is too small.
- Interpretation becomes harder.