[smartnews] globally scalable web document classification using word2vec
TRANSCRIPT
![Page 1: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/1.jpg)
Globally Scalable Web Document Classification Using Word2Vec
Kohei Nakaji (SmartNews)
![Page 2: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/2.jpg)
![Page 3: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/3.jpg)
keyword: machine learning for discovery
![Page 4: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/4.jpg)
SmartNews Demo
![Page 5: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/5.jpg)
About SmartNews
Japan
Launched 2013
4M+ Monthly Active Users 50% DAU/MAU 100+ Publishers
2013 App of The Year
US
Launched Oct 2014
1M+ Monthly Active Users Same engagement
80+ Publishers Top News Category App
International
Launched Feb 2015
10M Downloads WW Same engagement
English beta Featured App
Funding: $50M
![Page 6: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/6.jpg)
Outline of our algorithm
Structure Analysis
Semantics Analysis
URLs Found
Importance Estimation
10 million/day
1000+/day
Diversification
Signals on the Internet
![Page 7: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/7.jpg)
Outline of our algorithm
Web Document Classification ⊂
![Page 8: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/8.jpg)
Web Document Classification
Task definition: When an arbitrary web document arrives, choose exactly one category from a predetermined category set.
Categories: WORLD, ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, …
![Page 9: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/9.jpg)
Web Document Classification
There are roughly two steps:
① Main Content Extraction
② Text Classification
Example: web document → ① → ② → ENTERTAINMENT
![Page 10: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/10.jpg)
![Page 11: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/11.jpg)
Main Content Extraction
Two approaches:
・Extract after rendering whole page: easier, but takes time
・Extract from HTML: difficult, but fast
![Page 12: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/12.jpg)
Main Content Extraction
Two approaches:
・Extract after rendering whole page: easier, but takes time
・Extract from HTML: difficult, but fast (Our Approach)
![Page 13: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/13.jpg)
Main Content Extraction from HTML
Example:
<html> <body><div>click <a>here</a> for </div> <div><a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.</p><a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p> </div> </body> </html>
main content: the two <p> paragraphs
not main content: the remaining link text ("click here for", "tweet", "share", "you also like this")
![Page 14: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/14.jpg)
Main Content Extraction from HTML
Rule-based extraction algorithm is possible.
English:
Rule1: a div which has text length > 200 and num of ‘a’ tags < 3 is Main Content
Rule2: a div which has text length < 100 and num of ‘p’ tags > 4 is Main Content
RuleN: …
![Page 15: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/15.jpg)
Main Content Extraction from HTML
Rule-based extraction algorithm is possible.
English:
Rule1: a div which has text length > 200 and num of ‘a’ tags < 3 is Main Content
Rule2: a div which has text length < 100 and num of ‘p’ tags > 4 is Main Content
RuleN: …
Japanese: …
But not scalable.
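The two English rules on this slide can be written down directly. The following is a minimal sketch, assuming that text length is measured in characters and that each div has already been summarized into a small record. The names `DivStats` and `is_main_content` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class DivStats:
    """Summary of one <div>; a hypothetical helper type for this sketch."""
    text_length: int  # assumed to be a character count
    num_a: int        # number of <a> tags inside the div
    num_p: int        # number of <p> tags inside the div


def is_main_content(div: DivStats) -> bool:
    """Apply Rule1 and Rule2 from the slide; either rule marks the div as main content."""
    rule1 = div.text_length > 200 and div.num_a < 3
    rule2 = div.text_length < 100 and div.num_p > 4
    return rule1 or rule2
```

Each threshold here is tuned by hand, so every new language or page layout would need its own rule set, which is the scalability problem the slide points at.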
![Page 16: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/16.jpg)
Main Content Extraction from HTML
We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
① training: block separation & feature extraction → block1: (features, main), block2: (features, not main), block3: (features, main), … → decision tree
② live data: block separation & feature extraction → block1: (features), block2: (features), block3: (features), … → decision tree
![Page 17: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/17.jpg)
![Page 18: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/18.jpg)
![Page 19: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/19.jpg)
![Page 20: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/20.jpg)
Feature Extraction from HTML
<html> <body><div>click <a>here</a> for </div><div><a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office. </p><a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html>
Step1: Separate HTML into ‘text blocks’
Step2: Extract local features for every text block
ex: word count = 36, num of <a> = 0
Step3: Define the feature of each text block as a combination of local features
ex: word count (current block): 36, num of <a> (current block): 0, word count (previous block): 4, num of <a> (previous block): 1, …
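The three steps on this slide can be sketched with Python's standard `html.parser`. This is an illustration under one assumption: every tag except `<a>` ends the current text block, which is a simplification of the block separation in the cited paper. The names `BlockSplitter`, `block_features`, and `SAMPLE_HTML` are hypothetical.

```python
from html.parser import HTMLParser

SAMPLE_HTML = (
    "<html> <body><div>click <a>here</a> for </div>"
    "<div><a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led "
    "an arrest for the Tulsa County Sheriff's Office. </p><a>you also like this</a> "
    "<p> So how did the 73-year-old insurance company CEO end up joining a sting operation "
    "this month that ended when he pulled out his handgun and killed suspect Eric Harris "
    "instead of stunning him with a Taser?</p></div> </body> </html>"
)


class BlockSplitter(HTMLParser):
    """Step1: split HTML into text blocks (assumption: any tag except <a> ends a block)."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks as (word_count, num_a)
        self._words = 0
        self._num_a = 0

    def flush(self):
        if self._words:
            self.blocks.append((self._words, self._num_a))
        self._words, self._num_a = 0, 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._num_a += 1
        else:
            self.flush()

    def handle_endtag(self, tag):
        if tag != "a":
            self.flush()

    def handle_data(self, data):
        self._words += len(data.split())


def block_features(html):
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()
    local = splitter.blocks  # Step2: local features of every block
    features = []
    for i, (words, num_a) in enumerate(local):
        prev_words, prev_a = local[i - 1] if i > 0 else (0, 0)
        # Step3: combine the block's own features with the previous block's
        features.append({"word_count": words, "num_a": num_a,
                         "prev_word_count": prev_words, "prev_num_a": prev_a})
    return features


features = block_features(SAMPLE_HTML)
```

On the sample HTML this yields five blocks, and the last block's combined feature is word count 36, zero `<a>` tags, with a previous block of 4 words and one `<a>` tag, matching the example values on the slide.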
![Page 21: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/21.jpg)
![Page 22: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/22.jpg)
![Page 23: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/23.jpg)
Making Main Content Using Decision Tree
block1: (features) → not main
block2: (features) → not main
block3: (features) → main
block4: (features) → not main
block5: (features) → main
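To make the labeling step concrete, here is a small pure-Python decision tree learner (Gini impurity, threshold splits). It is a sketch, not the tree used in the talk. The five training rows are hypothetical toy data in the form (word count, num of `<a>`, previous word count, previous num of `<a>`), and `build_tree` and `predict` are illustrative names.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted impurity, feature index, threshold) of the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    can_split = depth < max_depth and len(set(labels)) > 1
    split = best_split(rows, labels) if can_split else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    _, f, threshold = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= threshold]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))


def predict(tree, row):
    while isinstance(tree, tuple):  # inner node: (feature, threshold, left, right)
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


# Hypothetical toy training set: one row per text block.
rows = [(3, 1, 0, 0), (2, 2, 3, 1), (17, 0, 2, 2), (4, 1, 17, 0), (36, 0, 4, 1)]
labels = ["not main", "not main", "main", "not main", "main"]
tree = build_tree(rows, labels)
```

With this toy data the learned tree is a single split on word count. A real training set would contain many labeled pages and would likely produce a deeper tree.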
![Page 24: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/24.jpg)
![Page 25: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/25.jpg)
![Page 26: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/26.jpg)
Text Classification
Ordinary text classification architecture:
① training: feature extraction → (features, entertainment), (features, sports), (features, entertainment), (features, politics), … → training algorithm → classifier
② live data: unlabeled document (?) → feature extraction → (features) → classifier → sports
![Page 27: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/27.jpg)
![Page 28: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/28.jpg)
Feature Extraction in Text Classification
Will LeBron James deliver an NBA championship to Cleveland?
‘Bag-of-words’ is commonly used as a feature vector.
Bag of words: Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland
![Page 29: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/29.jpg)
Feature Extraction in Text Classification
Will LeBron James deliver an NBA championship to Cleveland?
‘Bag-of-words’ is commonly used as a feature vector, with some feature engineering:
・stop words
・sports players dictionary: LeBron James → NBA_PLAYER
・tf-idf
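A minimal sketch of this kind of feature engineering is shown below. The stop-word list and the player dictionary are tiny hypothetical stand-ins, and the tf-idf weighting (term frequency times log(N / document frequency)) is one common variant, not necessarily the one used at SmartNews.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"will", "an", "to", "a", "the"}      # hypothetical stop-word list
NBA_PLAYERS = {"lebron james": "NBA_PLAYER"}       # hypothetical players dictionary


def tokenize(text):
    """Lowercase, replace known player names with a tag, then drop stop words."""
    text = text.lower()
    for name, tag in NBA_PLAYERS.items():
        text = text.replace(name, tag)
    tokens = re.findall(r"[A-Za-z_]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} bag-of-words vector per document."""
    bags = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = Counter(term for bag in bags for term in bag)
    n_docs = len(docs)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({term: (count / total) * math.log(n_docs / doc_freq[term])
                        for term, count in bag.items()})
    return vectors


docs = [
    "Will LeBron James deliver an NBA championship to Cleveland?",
    "LeBron James scores again",
]
tokens = tokenize(docs[0])
vectors = tfidf_vectors(docs)
```

Here `NBA_PLAYER` appears in both toy documents, so its weight is zero, while a term such as `cleveland` that appears in only one document gets a positive weight. The dictionary and stop-word list are exactly the parts that have to be rebuilt for each language.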
![Page 30: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/30.jpg)
Feature Extraction in Text Classification
Similarly used in Japanese.
私は中路です。 よろしくお願いします。 ("I am Nakaji. Nice to meet you.")
Bag of words: 私, は, 中路, です, よろしく, お願い, し, ます
Feature engineering:
・stop words
・person dictionary: 中路 → PERSON
・tf-idf
![Page 31: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/31.jpg)
Another Option: Paragraph Vector
![Page 32: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/32.jpg)
Example:
私は中路です。 よろしくお願いします。 → [0.2, 0.3, …, 0.2]
Will LeBron James deliver an NBA championship to Cleveland? → [0.1, 0.4, …, 0.1]
Paragraph Vector (dimension ~ several hundred)
![Page 33: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/33.jpg)
Outline of Distributed Representation
・word2vec (https://code.google.com/p/word2vec/): every word is mapped to a unique word vector.
・paragraph vector (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053): every document is mapped to a unique vector.
![Page 34: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/34.jpg)
![Page 35: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/35.jpg)
Word Vector in word2vec Model
Every word is mapped to a unique word vector with good properties.
Words: Germany, Berlin, France, Paris, …
Example vectors: [0.1, 0.2, …, 0.2], [0.1, 0.1, …, -0.1], [0.3, 0.4, …, 0], [0.3, 0.3, …, 0.3]
“Germany - Berlin = France - Paris”, i.e. $v_{\mathrm{Germany}} - v_{\mathrm{Berlin}} \approx v_{\mathrm{France}} - v_{\mathrm{Paris}}$
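The "Germany - Berlin = France - Paris" property is usually exercised as an analogy query: take v_Berlin - v_Germany + v_France and look for the nearest word. The sketch below uses hand-made 4-dimensional toy vectors, not real word2vec output, chosen so that the relation holds exactly. `VECTORS` and `analogy` are hypothetical names.

```python
import math

# Hypothetical toy vectors, hand-made for illustration only.
VECTORS = {
    "Germany": [1.0, 0.0, 1.0, 0.0],
    "Berlin":  [0.0, 1.0, 1.0, 0.0],
    "France":  [1.0, 0.0, 0.0, 1.0],
    "Paris":   [0.0, 1.0, 0.0, 1.0],
    "Tokyo":   [0.0, 1.0, 0.0, 0.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest (by cosine) to v_b - v_a + v_c, excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))


answer = analogy("Germany", "Berlin", "France")  # Germany : Berlin = France : ?
```

With trained vectors the relation only holds approximately, which is why the lookup uses nearest-neighbor search instead of exact equality.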
![Page 36: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/36.jpg)
Procedure to Create Word Vectors
Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)
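Word vectors are trained so that they become good features for predicting surrounding words.
Training text (words numbered $w_1, \cdots, w_{220}, w_{221}, \cdots, w_T$): "A cat sat on the street. … I love cat very much. He comes from Japan. …"
[Figure: the vectors of the context words "cat", "sat", "the", "street" are used to predict "on"]
Objective Function (cbow-case):
$$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c})$$
Model (sum-case):
$$P(w_t \mid w_{t-c}, \cdots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_{W} \cdot v)}, \qquad v = \sum_{t' \neq t,\; t-c \le t' \le t+c} v_{w_{t'}}$$
Procedure
① Maximize $L$ for $u_w$ and $v_w$.
② $v_w$ is the word vector for $w$.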
![Page 37: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/37.jpg)
![Page 38: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/38.jpg)
![Page 39: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/39.jpg)
Procedure to Create Paragraph Vectors
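Add a vector to the model for each document.
Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)
Training text, split into documents doc_1, doc_2, …: "A cat sat on the street. … I love cat very much. He comes from Japan. …"
[Figure: the document vector of doc_1 is added to the context words "cat", "sat", "the", "street" to predict "on"]
Objective Function (dbow-case):
$$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}_i)$$
Model (sum-case):
$$P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_{W} \cdot v)}, \qquad v = \sum_{t' \neq t,\; t-c \le t' \le t+c} v_{w_{t'}} + d_i$$
Here $\mathrm{doc}_i$ is the document in which $w_t$ is included, and $d_i$ is its vector.
Procedure
① Maximize $L$ for $u_w$, $v_w$, and $d_i$.
② Preserve the trained $u_w$, $v_w$.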
![Page 40: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/40.jpg)
Procedure to Create Paragraph Vector
After training, we can get a good paragraph vector as a feature for a new document.
New document, doc: "We love SmartNews. … I love SmartNews very much."
Objective Function (dbow-case):
$$L_{\mathrm{doc}} = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc})$$
Model (sum-case):
$$P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_{W} \cdot v)}, \qquad v = \sum_{t' \neq t,\; t-c \le t' \le t+c} v_{w_{t'}} + d$$
Procedure
training:
① Maximize $L$ for $u_w$, $v_w$, and $d_i$.
② Preserve the trained $u_w$, $v_w$.
live data:
③ Maximize $L_{\mathrm{doc}}$ for $d$.
④ Use $d$ as a paragraph vector.
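Steps ③ and ④ amount to gradient ascent on the new document's vector while the word parameters stay fixed. The sketch below implements the sum-case model with a full softmax in pure Python. It is an illustration only: `U` and `V` are random stand-ins for trained parameters, and the vocabulary, dimension, learning rate, and function names are all assumptions.

```python
import math
import random


def softmax(U, v):
    """P(W | ...) = exp(u_W . v) / sum over W' of exp(u_W' . v), for every word W."""
    scores = [sum(a * b for a, b in zip(u, v)) for u in U]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(doc, t, V, d, c):
    """v = sum of v_w over the window around position t (excluding t) + d."""
    window = [doc[j] for j in range(max(0, t - c), min(len(doc), t + c + 1)) if j != t]
    return [d[k] + sum(V[w][k] for w in window) for k in range(len(d))]


def log_likelihood(doc, U, V, d, c=2):
    """L_doc = sum over t of log P(w_t | context, doc)."""
    return sum(math.log(softmax(U, context_vector(doc, t, V, d, c))[w])
               for t, w in enumerate(doc))


def infer_paragraph_vector(doc, U, V, c=2, lr=0.05, steps=100):
    """Step 3: gradient ascent on d only; U and V are never updated."""
    dim = len(U[0])
    d = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for t, w in enumerate(doc):
            p = softmax(U, context_vector(doc, t, V, d, c))
            for k in range(dim):
                # d(log P)/d(d_k) = u_{w_t}[k] - sum over W of P(W) * u_W[k]
                grad[k] += U[w][k] - sum(p[i] * U[i][k] for i in range(len(U)))
        d = [dk + lr * gk for dk, gk in zip(d, grad)]
    return d  # step 4: use d as the paragraph vector


# Stand-ins for trained parameters: random values, NOT real word2vec output.
vocab = ["we", "love", "smartnews", "i", "very", "much"]
rng = random.Random(0)
U = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in vocab]
V = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in vocab]

doc = [vocab.index(word) for word in "i love smartnews very much".split()]
before = log_likelihood(doc, U, V, [0.0] * 4)
d = infer_paragraph_vector(doc, U, V)
after = log_likelihood(doc, U, V, d)
```

Because only `d` changes, the same trained `U` and `V` can serve any number of incoming documents. Practical implementations typically replace the full softmax with hierarchical softmax or negative sampling to keep this step fast.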
![Page 41: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/41.jpg)
Procedure to Create Paragraph Vector
Feature Extractor: maximize $L$ to obtain $u_w$, $v_w$; then maximize $L_{\mathrm{doc}}$ to obtain $d$.
Paragraph Vector: $d$ = [0.2, 0.3, …, 0.2]
![Page 42: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/42.jpg)
Text Classification
Ordinary text classification architecture:
① training: feature extraction → ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier
② live data: unlabeled document (?) → feature extraction → ([0.1, -0.1, …]) → classifier → sports
![Page 43: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/43.jpg)
Benefits of Using Paragraph Vector
Good
・High Scalability: We don’t need to work hard on feature engineering in each language.
・High Precision in Text Classification: Several percent better than using Bag-of-Words with feature engineering on our Japanese/English data set (labeled: several tens of thousands of documents; unlabeled: ~100,000).
Bad
・Difficulty in analyzing errors: It is hard to understand the meaning of each component of a paragraph vector.
![Page 44: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/44.jpg)
Benefits of Using Paragraph Vector
It is important that Paragraph Vector has a different nature than Bag-of-Words
Reason: We can get a better classifier by combining two different types of classifiers.
![Page 45: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/45.jpg)
Our Use Case
Bag-of-Words-based classifier vs. Paragraph Vector-based classifier
・Validation: Use one to validate the other.
・Combination: Use the more reliable result of the two classifiers.
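Both uses can be sketched in a few lines. The sketch assumes each classifier returns a `(label, confidence)` pair and that "more reliable" means the higher confidence. Both the interface and that interpretation are assumptions, and `combine` and `disagree` are hypothetical names.

```python
def combine(bow_prediction, pv_prediction):
    """Combination: return the label of whichever prediction has the higher confidence."""
    label, _confidence = max(bow_prediction, pv_prediction, key=lambda p: p[1])
    return label


def disagree(bow_prediction, pv_prediction):
    """Validation: flag documents where the two classifiers choose different labels."""
    return bow_prediction[0] != pv_prediction[0]


bow = ("sports", 0.62)          # hypothetical Bag-of-Words classifier output
pv = ("entertainment", 0.91)    # hypothetical Paragraph Vector classifier output
final_label = combine(bow, pv)
needs_review = disagree(bow, pv)
```

Confidence scores from two different models are not automatically comparable, so a real system would likely calibrate them on held-out data before comparing.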
![Page 46: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/46.jpg)
Our Use Case (future)
In multilingual localization: Use only the Paragraph Vector-based classifier, without any feature engineering.
![Page 47: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/47.jpg)
![Page 48: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/48.jpg)
The Challenge
![Page 49: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/49.jpg)
The Challenge
News is about seeking uncertainty for long-term value.
Exploitation: What Big Data firms typically do: preference estimation and risk quantification
Exploration: What SmartNews does: uncertainty-seeking discovery
What if parents don't feed vegetables to children who only like meat?
What if you keep hearing only opinions that match yours?
![Page 50: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/50.jpg)
The Challenge
Searching for a form of exploration that is not optimal, but acceptable.
Why? Humans are not rational enough to simply accept the optimum. Without acceptance, users will never read SmartNews.
We are developing:
① For better Feature Vector of users and articles
・topic extraction
・image extraction
② For Human-Acceptable Exploration
・multi-armed bandit based scoring model
[Figure: user interests; feature vector for 10 million users × real-time feature vector for articles]
![Page 51: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/51.jpg)
We are building our engineering team in SF - please join us!
We are hiring:
・ML/NLP Engineer
・Data Science Engineer
…
![Page 53: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/53.jpg)
References
Main Content Extraction
・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl: Boilerplate Detection using Shallow Text Features
・BoilerPipe (GoogleCode)
Text Classification
・Quoc V. Le, Tomas Mikolov: Distributed Representations of Sentences and Documents
・Word2Vec (GoogleCode)
![Page 54: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec](https://reader030.vdocument.in/reader030/viewer/2022032616/55a5c3041a28ab58588b4588/html5/thumbnails/54.jpg)
References
About SmartNews
・About our Company SmartNews
Articles about SmartNews
・Japan’s SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S.
・SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S.
・Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M