Treasure Data Summer Internship Final Report
![Page 1: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/1.jpg)
Summer Internship Final Report
Naoki Ishikawa (@NeokiStones)
2015/09/30 13:30-
![Page 2: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/2.jpg)
Who am I
• Naoki Ishikawa
• Waseda University, Information Science M1
• Research: Evolutionary Computation / Reinforcement Learning
• Laboratory: Sugawara Lab
• Laboratory theme: Artificial Intelligence
![Page 3: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/3.jpg)
Table of contents
• Implemented Algorithm
• Factorization Machine
• Latent Dirichlet Allocation
![Page 5: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/5.jpg)
Factorization Machine
• Algorithm for Recommendation
• Classification
• Regression
• Supervised Learning
• Needs Input/Output Data
• Suitable for Sparse Data
![Page 6: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/6.jpg)
Application
![Page 7: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/7.jpg)
Application
• Prediction of Movie Rating
• Task: Predict movie rating (real number)
• Regression
- Input: Self-designed Matrix
- Output: Rating Vector
![Page 8: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/8.jpg)
Prediction of Movie Rating
[Figure: Input matrix and Output rating vector]
![Page 9: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/9.jpg)
INPUT Details
• Identifier
- User Identifier: [0, 0, …, 0, 1, 0, …, 0]
- Movie Identifier: [0, 0, …, 0, 0, 1, 0, …, 0]
• Designed Feature
- Rating of Other Movie
- Time
- Last Movie rated
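The input layout above can be sketched as a concatenation of a one-hot user block, a one-hot movie block, and the designed features. This is a minimal illustration only: the helper names (`one_hot`, `build_feature_vector`), the block sizes, and the single extra feature are assumptions, not the code used in the internship.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_feature_vector(user_idx, movie_idx, n_users, n_movies, extra_features):
    """Concatenate user one-hot, movie one-hot, and designed features."""
    return one_hot(user_idx, n_users) + one_hot(movie_idx, n_movies) + list(extra_features)


# Hypothetical example: 3 users, 4 movies, one designed feature (a scaled timestamp).
x = build_feature_vector(user_idx=1, movie_idx=2, n_users=3, n_movies=4, extra_features=[0.5])
print(x)  # [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.5]
```

Most entries are zero, which is why the method's suitability for sparse data matters here.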
![Page 10: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/10.jpg)
Recommendation Algorithm
• Collaborative Filtering
• Association Analysis
• Bayesian Network
![Page 11: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/11.jpg)
Prediction of Movie Rating
• Hivemall
• Matrix Factorization
• Recommendation
![Page 12: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/12.jpg)
Difference from Matrix Factorization
• Data Structure
• Matrix Factorization
• User-Item Matrix
[Figure: Input and Learning Parameter. Image source: http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png]
![Page 13: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/13.jpg)
Difference from Matrix Factorization
• Factorization Machine
[Figure: Input and Learning Parameters W and V]
![Page 14: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/14.jpg)
Advantage of Factorization Machine
• Factorization Machine
• Considers
• context data
• interaction between variables
![Page 15: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/15.jpg)
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
![Page 16: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/16.jpg)
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
Global bias (mean)
Interaction
Factorization (Wkj)
Regression coefficient of k-th variable
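The prediction equation on this slide was an image and is not present in the transcript. For reference, the standard second-order (d=2) Factorization Machine model is usually written as follows. The symbol names are an assumption chosen to match the labels above: w_0 is the global bias, w_k is the regression coefficient of the k-th variable, and the interaction weight between variables k and j is factorized as the inner product of two latent vectors.

```latex
\hat{y}(\mathbf{x}) = w_0
  + \sum_{k=1}^{n} w_k x_k
  + \sum_{k=1}^{n} \sum_{j=k+1}^{n} \langle \mathbf{v}_k, \mathbf{v}_j \rangle \, x_k x_j,
\qquad
W_{kj} \approx \langle \mathbf{v}_k, \mathbf{v}_j \rangle = \sum_{f=1}^{K} v_{k,f} \, v_{j,f}
```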
![Page 17: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/17.jpg)
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
Learning Method: Stochastic Gradient Descent (SGD)
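A minimal sketch of d=2 prediction and one SGD step on squared error, assuming the standard Factorization Machine formulation. The function names, the learning rate `lr`, the regularization weight `reg`, and the toy data are all hypothetical; this is not the Hivemall implementation.

```python
import random


def fm_predict(x, w0, w, V):
    """Second-order FM prediction for a dense feature list x."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    interaction = 0.0
    for f in range(k):
        # Identity: sum over pairs = 0.5 * ((sum v*x)^2 - sum (v*x)^2)
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        interaction += 0.5 * (s * s - s_sq)
    return w0 + linear + interaction


def fm_sgd_step(x, y, w0, w, V, lr=0.01, reg=0.01):
    """One SGD update on squared error. Returns the new w0; w and V change in place."""
    n, k = len(x), len(V[0])
    err = fm_predict(x, w0, w, V) - y
    sums = [sum(V[i][f] * x[i] for i in range(n)) for f in range(k)]
    w0 -= lr * err
    for i in range(n):
        if x[i] == 0.0:
            continue  # sparse shortcut: zero features are skipped entirely
        w[i] -= lr * (err * x[i] + reg * w[i])
        for f in range(k):
            grad = x[i] * (sums[f] - V[i][f] * x[i])
            V[i][f] -= lr * (err * grad + reg * V[i][f])
    return w0


# Toy run: repeat one hypothetical example and watch the squared error shrink.
random.seed(0)
n, k = 4, 2
w0, w = 0.0, [0.0] * n
V = [[random.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n)]
x, y = [1.0, 0.0, 1.0, 0.0], 4.0
before = (fm_predict(x, w0, w, V) - y) ** 2
for _ in range(200):
    w0 = fm_sgd_step(x, y, w0, w, V)
after = (fm_predict(x, w0, w, V) - y) ** 2
```

The interaction term is computed in time linear in the number of features by using the sum-of-squares identity noted in the comment, instead of looping over all pairs.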
![Page 18: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/18.jpg)
Local Implementation
![Page 19: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/19.jpg)
Difference from Matrix Factorization
• d-way
• FM / MF
• assume K latent attributes
• Matrix Factorization: d = 2
• Factorization Machine: d ≥ 2
![Page 20: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/20.jpg)
HyperParameter
• K: the number of hidden factors
• η: the regularization parameter
![Page 21: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/21.jpg)
Implemented Model
• d = 2
• MapModel
• ArrayModel
![Page 22: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/22.jpg)
Implemented Model
• MapModel
• For unknown data
• Flexible
• Suitable for Online Learning
![Page 23: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/23.jpg)
Implemented Model
• ArrayModel
• For known data
• less overhead
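The trade-off between the two models can be illustrated with a small sketch. Only the names MapModel and ArrayModel come from the slides; the class bodies below are an assumption about what such models might look like, not the actual implementation.

```python
class ArrayModel:
    """Weights in a fixed-size list; the feature index range must be known up front."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features

    def get(self, index):
        return self.w[index]

    def set(self, index, value):
        self.w[index] = value


class MapModel:
    """Weights in a dict; features never seen before fall back to a default value."""

    def __init__(self, default=0.0):
        self.w = {}
        self.default = default

    def get(self, feature):
        return self.w.get(feature, self.default)

    def set(self, feature, value):
        self.w[feature] = value


# Hypothetical usage: the array needs its size in advance, the map does not.
arr = ArrayModel(3)
arr.set(1, 0.5)
m = MapModel()
m.set("user:42", 0.3)
print(arr.get(1), m.get("user:42"), m.get("movie:7"))  # 0.5 0.3 0.0
```

A dict lookup costs a hash per access, which is the overhead the array variant avoids when the feature set is known.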
![Page 24: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/24.jpg)
Other Use Case
• E-Commerce User-Item Recommendation
• Input Data
• Age
• Purchase time zone
• Previously purchased items
• Cluster ID
• Target Data
• Evaluation of an Item by User
![Page 25: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/25.jpg)
Table of contents
• Implemented Algorithm
• Factorization Machine
• Latent Dirichlet Allocation
![Page 26: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/26.jpg)
Latent Dirichlet Allocation
• Most popular topic model algorithm
• Mostly applied to text data
• Finds hidden structure in data
• Unsupervised Learning
• Needs input data only
• Generative Model
![Page 27: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/27.jpg)
Latent Dirichlet Allocation
• Generative Modelling in LDA
• Mimics how a document is written
• 1. Choose what you write about (the topic)
• 2. Choose a word from the topic
• 3. Write it
![Page 28: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/28.jpg)
Latent Dirichlet Allocation
• Input
• Text data (Documents)
• Output
• Topic-word distribution
• Document-Topic distribution
![Page 29: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/29.jpg)
Latent Dirichlet Allocation
Image sources:
https://www.vappingo.com/word-blog/wp-content/uploads/2011/01/paper2.jpg
https://wellecks.wordpress.com/2014/10/26/ldaoverflow-with-online-lda/
![Page 30: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/30.jpg)
Learning Method
• Define Generative model
• For documents
• Learn parameters to reproduce the document
![Page 31: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/31.jpg)
Learning Method
[Figure: K topics]
![Page 32: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/32.jpg)
Learning Method
Image source: http://heartruptcy.blog.fc2.com/blog-entry-124.html
![Page 33: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/33.jpg)
Graphical Model (Code)
• For k in Topic = {1, …, K}
    WordDistribution[k] ~ Dir(β)
• For d in Document = {1, …, D}
    TopicDistribution[d] ~ Dir(α)
    For n in Word = {1, …, numOfWord[d]}
        WordTopic[d][n] ~ TopicDistribution[d]
        Word[d][n] ~ WordDistribution[WordTopic[d][n]]
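The generative process above can be sketched as runnable code. This is an illustration under assumptions: symmetric Dirichlet priors, integer word ids, and the hypothetical function names `sample_dirichlet` and `generate_corpus`. It only generates synthetic documents and does not perform inference.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_topics, vocab_size, doc_lengths, alpha, beta, seed=0):
    """Generate documents as lists of word ids, following the LDA generative story."""
    rng = random.Random(seed)
    # One word distribution per topic: WordDistribution[k] ~ Dir(beta)
    word_dist = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    docs = []
    for length in doc_lengths:
        # One topic distribution per document: TopicDistribution[d] ~ Dir(alpha)
        topic_dist = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(length):
            topic = rng.choices(range(n_topics), weights=topic_dist)[0]
            word = rng.choices(range(vocab_size), weights=word_dist[topic])[0]
            words.append(word)
        docs.append(words)
    return docs


# Hypothetical sizes: 3 topics, 10 vocabulary entries, two short documents.
docs = generate_corpus(n_topics=3, vocab_size=10, doc_lengths=[5, 8], alpha=0.5, beta=0.5)
```

Learning runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions that would most plausibly have produced them.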
![Page 34: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/34.jpg)
Learning Method
• Variational Bayes
• Gibbs Sampling (MCMC)
• Particle Filtering
![Page 35: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/35.jpg)
Learning Method
• Variational Bayes (faster than Gibbs Sampling)
• Gibbs Sampling (MCMC)
• Particle Filtering
![Page 36: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/36.jpg)
Mini-batch Online LDA
• Faster than Batch Algorithm
• Less noise than pure Online LDA
[Figure: Batch Size axis, ordered Pure Online, Mini-batch Online, Batch]
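The three regimes differ only in how many documents are processed per update. A minimal sketch follows; the helper names are hypothetical, and the decaying step size uses the schedule from the online variational Bayes literature with assumed defaults `tau0=1.0` and `kappa=0.7`. Whether the internship code used these values is not stated in the slides.

```python
def minibatches(docs, batch_size):
    """Yield consecutive slices of `docs` with at most `batch_size` items each."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def learning_rate(t, tau0=1.0, kappa=0.7):
    """Step size rho_t = (tau0 + t) ** -kappa, which shrinks as more batches arrive."""
    return (tau0 + t) ** -kappa


docs = ["d0", "d1", "d2", "d3", "d4"]
pure_online = list(minibatches(docs, 1))         # 5 updates, one document each
mini_batch = list(minibatches(docs, 2))          # 3 updates
full_batch = list(minibatches(docs, len(docs)))  # 1 update over everything
rates = [learning_rate(t) for t in range(len(mini_batch))]
```

Larger batches average out per-document noise, while smaller batches let the model update more often per pass over the data.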
![Page 37: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/37.jpg)
Implemented Model
• Mini-Batch Map Model
• For unknown data
• Does not assume a Vocabulary List
• Mini-Batch Array Model (Other implementation)
• For known data
• Assumes a Vocabulary List
![Page 39: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/39.jpg)
Faced Implementation Problem
• Meaningless words
• LDA: clusters words by co-occurrence
• “a”, “the”, “I”, “He”, “is”, “in”, “on”
• Stop Word: Ignore them
• TF-IDF: “how important a word is to a document in a collection or dataset”
![Page 41: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/41.jpg)
Faced Implementation Problem
• TF-IDF
• can be calculated by Hivemall
• Input Data: (DocId, Words)
• https://github.com/myui/hivemall/wiki/TFIDF-calculation
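To make the (DocId, Words) input concrete, here is a small sketch of one common TF-IDF variant: term frequency times log(N / document frequency). It is not Hivemall's query, and Hivemall's exact weighting may differ. The function name and the toy documents are hypothetical, and documents are assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict of doc_id -> list of words. Returns doc_id -> {word: score}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each word once per document
    scores = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        scores[doc_id] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


docs = {1: ["justice", "law", "the"], 2: ["game", "team", "the"]}
scores = tf_idf(docs)
# "the" appears in every document, so its score is 0 and it can be filtered out.
```

Words that occur in every document score zero under this variant, which gives a data-driven alternative to a fixed stop-word list.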
![Page 42: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/42.jpg)
Faced Implementation Problem
• TF-IDF
• 1 ["justice:0.1641245850805637","found:0.06564983513276658","discussion:0.06564983513276658","law:0.06564983513276658","based:0.06564983513276658","religion:0.06564983513276658","viewpoints:0.03282491756638329","rationality:0.03282491756638329","including:0.03282491756638329","context:0.03282491756638329","concept:0.03282491756638329","rightness:0.03282491756638329","general:0.03282491756638329","many:0.03282491756638329","differing:0.03282491756638329","fairness:0.03282491756638329","social:0.03282491756638329","broadest:0.03282491756638329","equity:0.03282491756638329","includes:0.03282491756638329","theology:0.03282491756638329"]
![Page 43: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/43.jpg)
Faced Implementation Problem
• Vocabulary List Model
• Initialize lambda for all words at first
• If a word does not appear in the Doc:
• Lambda decreases at the same rate
• No initialization problem
![Page 44: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/44.jpg)
Faced Implementation Problem
• Online Map Model
• Initialize lambda when a new word is fetched
• Final lambda: depends on when the word first appeared
• Initialization problem
![Page 45: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/45.jpg)
Faced Implementation Problem
• Prepared Dummy Lambda
• Initialize dummy lambdas at first
• Apply the lambda update rule to the dummy lambdas
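One way to read the dummy-lambda idea is sketched below. This is an interpretation, not the author's code: it uses a single scalar lambda per word (one topic) for brevity, a common initial value for all words, and the blending update lambda = (1 - rho) * lambda + rho * lambda_hat from online variational LDA. The class name `MapLambda`, the prior `eta`, and the toy statistics are hypothetical.

```python
def decay(lam, rho, eta):
    """Update for a word absent from the mini-batch: its batch statistics are zero."""
    return (1.0 - rho) * lam + rho * eta


class MapLambda:
    def __init__(self, init_value, eta):
        self.eta = eta
        self.dummy = init_value  # stands in for every word not seen yet
        self.lam = {}

    def update(self, batch_stats, rho):
        """batch_stats: word -> statistic from this mini-batch (added on top of eta)."""
        for word in batch_stats:
            if word not in self.lam:
                # A new word starts from the dummy, as if it had existed from step 0.
                self.lam[word] = self.dummy
        for word in self.lam:
            target = self.eta + batch_stats.get(word, 0.0)
            self.lam[word] = (1.0 - rho) * self.lam[word] + rho * target
        self.dummy = decay(self.dummy, rho, self.eta)


model = MapLambda(init_value=1.0, eta=0.01)
model.update({"a": 2.0}, rho=0.5)
model.update({"b": 3.0}, rho=0.5)
```

Under these assumptions, word "b" ends with the same lambda it would have had in a vocabulary-list model that initialized it at step 0, so the final value no longer depends on when the word first appeared.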
![Page 46: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/46.jpg)
Faced Implementation Problem
• Implicit Φ Normalization
• Not written explicitly
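A sketch of what the implicit normalization may refer to, assuming the standard variational update in which Φ for one word is proportional to exp(E[log θ] + E[log β]) over the K topics. The proportionality sign hides a division by the sum over topics, and that division has to be coded by hand. The function name and inputs below are hypothetical; the expected-log terms are taken as given rather than computed.

```python
import math


def normalize_phi(elog_theta, elog_beta_w):
    """Return phi over K topics for one word, normalized to sum to 1."""
    logits = [t + b for t, b in zip(elog_theta, elog_beta_w)]
    shift = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(v - shift) for v in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy values for K = 2 topics.
phi = normalize_phi(
    [math.log(0.5), math.log(0.5)],
    [math.log(0.2), math.log(0.6)],
)
```

Forgetting the division leaves Φ unnormalized, and because Φ feeds the other updates, the error can be hard to trace.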
![Page 49: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/49.jpg)
Faced Implementation Problem
• Difficult Debugging
• Circular reference
[Figure: circular dependence among Φ, γ, and β]
![Page 50: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/50.jpg)
Result: Online LDA
• Data: 20News
• Topics: 6
• Iterations: 10
![Page 51: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/51.jpg)
Result: Online LDA
• Topic:1
• No.0 writes[6]: 0.007909349
• No.1 article[7]: 0.006535292
• No.2 apr[3]: 0.0034389505
• No.3 team[4]: 0.00340712
• No.4 game[4]: 0.0033219245
• No.5 year[4]: 0.0032751847
• No.6 good[4]: 0.0032546786
• No.7 time[4]: 0.0030503264
• No.8 play[4]: 0.00262638
• No.9 games[5]: 0.002433915
• No.10 season[6]: 0.0022433712
• No.11 ll[2]: 0.0020719478
• No.12 players[7]: 0.0020332362
• No.13 win[3]: 0.0019284738
• No.14 hockey[6]: 0.0018870989
• No.15 league[6]: 0.0018450991
• No.16 baseball[8]: 0.0018226414
• No.17 years[5]: 0.0017960512
• No.18 mail[4]: 0.0017936684
• No.19 people[6]: 0.0017642054
• No.20 teams[5]: 0.0016675185
• No.21 great[5]: 0.001642102
• No.22 ve[2]: 0.0015846819
• No.23 point[5]: 0.0015730233
• No.24 cs[2]: 0.0015609838
• No.25 didn[4]: 0.0015398773
• No.26 lot[3]: 0.0015123658
• No.27 mike[4]: 0.0014935194
• No.28 university[10]: 0.0014718652
• No.29 player[6]: 0.0014655796
![Page 52: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/52.jpg)
Result: Online LDA
• Topic:1 (same word list as the previous slide)
• Label: Sports
![Page 53: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/53.jpg)
Result: Online LDA
• Topic:3
• No.0 writes[6]: 0.0065424195
• No.1 article[7]: 0.005621346
• No.2 apr[3]: 0.002746017
• No.3 work[4]: 0.002731466
• No.4 good[4]: 0.00266331
• No.5 ve[2]: 0.0025969497
• No.6 time[4]: 0.0025880735
• No.7 system[6]: 0.0024449623
• No.8 problem[7]: 0.002349667
• No.9 mail[4]: 0.0023234019
• No.10 windows[7]: 0.0021310966
• No.11 people[6]: 0.0018598152
• No.12 find[4]: 0.0018072439
• No.13 computer[8]: 0.0017470584
• No.14 email[5]: 0.0017204053
• No.15 drive[5]: 0.0017121765
• No.16 bit[3]: 0.0016401116
• No.17 program[7]: 0.001636191
• No.18 software[8]: 0.0016341405
• No.19 university[10]: 0.0015907411
• No.20 ll[2]: 0.0015530549
• No.21 thing[5]: 0.0015159848
• No.22 card[4]: 0.0013826761
• No.23 doesn[5]: 0.0013809163
• No.24 phone[5]: 0.0013786326
• No.25 question[8]: 0.0013721529
• No.26 internet[8]: 0.001368883
• No.27 file[4]: 0.0013417117
• No.28 things[6]: 0.0013097903
• No.29 set[3]: 0.0013029057
![Page 54: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/54.jpg)
Result: Online LDA
• Topic:3 (same word list as the previous slide)
• Label: Computer
![Page 55: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/55.jpg)
Impressions of the Internship
• Machine Learning
• Implementing ML algorithms from scratch was fun
• Contributing to OSS was a precious experience for me
![Page 56: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/56.jpg)
Unfinished Business
• Documentation
• write entry for FM/Online LDA
• UDTF
• build the function into Hivemall
![Page 57: Treasure Data Summer Internship Final Report](https://reader034.vdocument.in/reader034/viewer/2022051318/5876d52a1a28ab1d238b55bb/html5/thumbnails/57.jpg)
• Thank you for listening