Mapping Domain Names to Categories
Maya Rotmensch, Sorcha Gilroy, Corina Gurau
Academic Mentor: Cristina Garcia-Cardona
Industry Sponsor: Oversee.net (Kryztof Urban)
Institute of Pure and Applied Mathematics
Research in Industrial Projects
August 15, 2013
Institute for Pure & Applied Mathematics
University of California, Los Angeles
(Institute of Pure and Applied Mathematics) Mapping Domain Names to Categories August 15, 2013 1 / 41
Outline
1 Oversee.net
2 Problem Statement
    Why so complicated?
    ESA - Explicit Semantic Analysis
    How Oversee.net Does It
3 Our Project
    Our Focus
    Methodology
    Results
4 Concluding Remarks
Oversee.net’s Business Model
Person Website
Person looking for games A gaming website
Oversee.net’s Business Model
Person looking for games → Domain → A gaming website
Direct Navigation: when users navigate to a website by using the address bar instead of a search engine.
looking for a gaming website → navigates to ’addictinggamas.com’
Oversee.net’s Business Model
Domain parking + traffic matching → Oversee.net
Person → Domain → Category → Website
Monetized Domain Parking
- The registration of internet domain names without placing any content on the domain.
- Owners monetize traffic by displaying links and advertisements.
Oversee.net’s Business Model
Advertisers
- Partners of Oversee.net
- Choose the types of traffic they want from Oversee.net’s category tree
Oversee.net’s Business Model
Parked domains do not have any content
Mapping Domains to Categories is extremely difficult
- Oversee.net uses Keywords to describe Domains and Categories
Domain Keywords Keywords Category
Not enough: the domain keywords and the category keywords are not guaranteed to use the same wording!
So what’s the big deal?
Reasoning about concepts
Scarcity of input information
- Example 1 - Spelling error: cheapvacatins.com
- Example 2 - Ambiguous meaning: bigbearhuts.com (animals? huts? it’s supposed to be winter sports)
Text Categorization
Our problem can be thought of as a categorization problem: we need to assign a domain to one or more classes or categories.
- A natural choice is topic modeling
- However, unlike most text categorization problems, we don’t actually have documents to classify, as we are dealing with undeveloped domains
Topic Modeling
This method analyzes the relationships between documents in a corpus by isolating a set of topics from the documents
For meaningful results, one must work with a set of large texts
- Our data set consists of keywords, as our domains are undeveloped
This method results in organic generation of topics
- The categories we are attempting to map into are pre-defined
ESA - Explicit Semantic Analysis
Building a Semantic Interpreter
Using a Vector Space Model + an exogenous knowledge base → represent the meaning of text [1]
Number of articles ∼ 3.5 million
Number of terms ∼ 45 million
[1] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.
ESA - Explicit Semantic Analysis
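| | Government | Finance | Toys | Children | Bank | School | ... |
|---|---|---|---|---|---|---|---|
| Law | 0.2 | 0.3 | 0.8 | 0.9 | 0.2 | 0.7 | ... |
| Article2 | 0.8 | 0.9 | 0.1 | 0.3 | 0.7 | 0.5 | ... |
| Article3 | 0.5 | 0.2 | 0.3 | 0.6 | 0.4 | 0.8 | ... |
| Article4 | 0.1 | 0.2 | 0.1 | 0.3 | 0.4 | 0.2 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |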
Term frequency inverse document frequency:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
Logarithmically scaled term frequency:
$$\mathrm{tf}(t, d) = \log\bigl(f(t, d) + 1\bigr)$$
Inverse document frequency:
$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
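
The three formulas above can be sketched in a few lines of Python. This is only an illustration on a made-up three-document corpus, not the code used in the project; the function names, the toy corpus, and the choice to return 0 for a term that appears in no document are all assumptions.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Logarithmically scaled term frequency: log(f(t, d) + 1)."""
    return math.log(Counter(doc_tokens)[term] + 1)


def idf(term, corpus):
    """Inverse document frequency: log(|D| / |{d in D : t in d}|)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term found in no document gets weight 0
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["bank", "finance", "bank", "loan"],
    ["toys", "children", "school"],
    ["government", "law", "finance"],
]

for term in ("bank", "finance", "toys"):
    print(term, round(tfidf(term, corpus[0], corpus), 4))
```

A term that occurs often in one document and rarely elsewhere ("bank") scores higher than a term spread over several documents ("finance"); a term absent from the document scores 0.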
ESA - Explicit Semantic Analysis
Using a Semantic Interpreter
Cosine similarity measure
$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
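
A minimal sketch of the cosine similarity measure, assuming each text has already been turned into a weight vector over the same list of articles. The two example vectors are made up, and treating a zero vector as similarity 0 is an assumption.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical weight vectors of two texts over the same four articles.
text_a = [0.2, 0.8, 0.5, 0.1]
text_b = [0.3, 0.9, 0.2, 0.2]
print(round(cosine_similarity(text_a, text_b), 4))
```

Because the measure depends only on the angle between the vectors, a long text and a short text about the same articles can still score close to 1.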
How Oversee.net Does It
Instead of comparing two texts - compare two small sets of words!
Use keywords to describe domains and categories
Represent these keywords in terms of DBpedia articles
- A keyword is significantly related to an article if the TF-IDF is above a certain threshold
- The set of articles associated to a domain/category is the union of the sets of articles associated to its keywords
How Oversee.net Does It
Compare the two sets of articles (A - domains, B - categories) using the Jaccard Index:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Categories with the highest scores under this index are matched to a domain
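
The matching step described on the last two slides can be sketched as follows. This is an illustration under stated assumptions, not Oversee.net's implementation: the keyword-to-article scores, the article names, the threshold of 0.3, and the function names are all hypothetical.

```python
def articles_for_keywords(keywords, scores, threshold):
    """Union of the articles whose TF-IDF score for a keyword exceeds the threshold."""
    articles = set()
    for keyword in keywords:
        for article, score in scores.get(keyword, {}).items():
            if score > threshold:
                articles.add(article)
    return articles


def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B| (0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_categories(domain_keywords, category_keywords, scores, threshold):
    """Return (Jaccard index, category) pairs, best match first."""
    domain_articles = articles_for_keywords(domain_keywords, scores, threshold)
    ranked = [
        (jaccard(domain_articles,
                 articles_for_keywords(keywords, scores, threshold)), category)
        for category, keywords in category_keywords.items()
    ]
    return sorted(ranked, reverse=True)


# Hypothetical keyword -> {article: TF-IDF score} table.
scores = {
    "chase": {"Chase Bank": 0.9, "Pursuit": 0.6},
    "login": {"Login": 0.8, "Online banking": 0.5},
    "savings": {"Chase Bank": 0.4, "Online banking": 0.7, "Savings account": 0.9},
    "train": {"Rail transport": 0.9},
}
category_keywords = {"banking": ["savings"], "bus & rail": ["train"]}

ranked = rank_categories(["chase", "login"], category_keywords, scores, 0.3)
print(ranked)  # [(0.4, 'banking'), (0.0, 'bus & rail')]
```

In this toy case the domain and the banking category share 2 of 5 articles, which gives a Jaccard index of 0.4.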
Our Focus
Domain Keywords Keywords Category
Critical link: domains to keywords
Improve quality of keywords
- Click Through Rate
- String Similarity
- Semantic Analysis

| Keyword | CTR | String Similarity | Semantic Similarity |
|---|---|---|---|
| industrial | 20 | 80 | 0 |
| industriel | 20 | 89 | 0 |
| industrie | 20 | 100 | 0 |
| china manufacturer | 20 | 0 | 88 |
| industries | 20 | 80 | 98 |
| industrial companies | 20 | 0 | 86 |
Domain Keywords
Focusing on developing the link between domains and keywords, the two main questions we posed for our research were:
Could we use ESA to extend the number of meaningful keywords per domain?
Could we use the keywords obtained through Oversee.net’s in-house statistics as the basis of the new keywords?
Methodology
Extending the set of keywords:
When generating new keywords:
Only take top 3 articles
Only take top 2 terms
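
One possible reading of the two limits above, sketched in Python: for a given keyword, keep the 3 articles in which it has the highest weight, then take the 2 highest-weighted other terms from each of those articles as new keywords. The slides do not spell out the exact procedure, so this interpretation, the toy article-to-term weights, and the function name are assumptions.

```python
def expand_keyword(keyword, article_terms, top_articles=3, top_terms=2):
    """Generate new keywords from the articles most related to `keyword`."""
    # Rank the articles by the weight of the keyword in each article.
    scored = sorted(
        ((weights[keyword], name)
         for name, weights in article_terms.items()
         if weights.get(keyword, 0) > 0),
        reverse=True,
    )
    new_keywords = []
    for _, name in scored[:top_articles]:
        # Rank the terms of this article by weight, skipping the keyword itself.
        terms = sorted(article_terms[name].items(),
                       key=lambda item: item[1], reverse=True)
        picked = [term for term, _ in terms if term != keyword][:top_terms]
        for term in picked:
            if term not in new_keywords:
                new_keywords.append(term)
    return new_keywords


# Hypothetical article -> {term: weight} table.
article_terms = {
    "Bank": {"bank": 0.9, "finance": 0.8, "loan": 0.7, "river": 0.1},
    "Piggy bank": {"bank": 0.5, "children": 0.6, "toys": 0.4},
    "School": {"school": 0.9, "children": 0.7},
}

print(expand_keyword("bank", article_terms))
# ['finance', 'loan', 'children', 'toys']
```

Small limits keep the expansion focused: every extra article or term widens coverage but also lets in less related words.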
Methodology
Method 2 for extending the set of keywords:
Breaking up and correcting the domain name
Example: domain = ’chaselogon.com’
- Leading characters removed: haselogon, aselogon
- Split in two: cha selogon, chas elogon, chase logon, chasel ogon, chaselo gon
- Trailing characters removed: chaselog, chaselogo
If entire string matches a word in reference file then stop
If both parts of broken string are exact words then stop
If substring is an exact word then correct other part using edit distances
- Corrections used: deletions, transpositions, replacements, insertions
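
The three rules above can be sketched as follows. This is a simplified illustration, not the project's parser: it only tries two-way splits, corrects at edit distance 1 only, and uses a tiny hypothetical word set in place of the reference file.

```python
import string


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, vocabulary):
    """Return a vocabulary word within one edit of `word`, or None."""
    candidates = sorted(edits1(word) & vocabulary)
    return candidates[0] if candidates else None


def parse_domain(domain, vocabulary):
    name = domain.lower().split(".")[0]
    if name in vocabulary:  # rule 1: the entire string is a word
        return [name]
    for i in range(1, len(name)):  # rule 2: both parts are exact words
        left, right = name[:i], name[i:]
        if left in vocabulary and right in vocabulary:
            return [left, right]
    keywords = []
    for i in range(1, len(name)):  # rule 3: one exact part, correct the other
        left, right = name[:i], name[i:]
        if left in vocabulary:
            exact, fixed = left, correct(right, vocabulary)
        elif right in vocabulary:
            exact, fixed = right, correct(left, vocabulary)
        else:
            continue
        if fixed:
            for word in (exact, fixed):
                if word not in keywords:
                    keywords.append(word)
    return keywords


# Hypothetical stand-in for the reference file.
vocabulary = {"chase", "login", "addicting", "games"}
print(parse_domain("chaselogon.com", vocabulary))      # ['chase', 'login']
print(parse_domain("addictinggamas.com", vocabulary))  # ['addicting', 'games']
```

Running the exact-match rules over every split before attempting any correction keeps a clean split such as "chase" + "login" from being overridden by a corrected one.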
Methodology
Method 2 for extending the set of keywords:
Reference file made up of collections of text, to which we added more information:
- Company names
- Popular websites
- Brand and store names
- Countries and major cities

| Initial Keywords | Keywords after parsing |
|---|---|
| chameloeon | chas, chase, elson, login |
Methodology
Generating new keywords and mapping to categories
Example: domain = bankfianancial.com
Keywords: ncofinancialban, bankfinancial
Generated keywords: financial institutions, financial centre, lobsters, official personal, societies chairman, . . .

| Category | Category keywords | Jaccard Index |
|---|---|---|
| finance | retirement pension, debit card, tenant credit check, ... | 0.240492 |
| credit cards | debit card, credit applications, rewards program, ... | 0.348147 |
| banking | savings banking, checks, community bank, ... | 0.219457 |
Results: Comparing Their Keywords to Semantic
We were given a sample of 300 domains that had been matched by hand to a total of 500 categories

| | CTR & String Similarity | CTR, String Similarity & Semantic Analysis |
|---|---|---|
| Number of matches | 25 | 309 |
| Percentage of matches | 5% | 61.8% |
Results: Generating New Keywords
Using Method 1:

| | CTR & String Similarity | Method 1 | CTR & String Similarity & 7 Random |
|---|---|---|---|
| Number of matches | 25 | 21 | 24 |
| Percentage of matches | 5% | 4.2% | 4.8% |

Most of the time, the different methods yielded the same results
Cases where the new keywords improved the system:
- thhetrainline.com
Cases where the base case did better:
- inindustries.com
Results
thhetrainline.com, base keywords:
thetrainline
- Jaccard Index = 0.0001: microcars & city cars
- Jaccard Index = 0.0002: property management
thhetrainline.com, with new keywords:
thetrainline, strafe train, moving departing, train station, telecommunications, georgia, rain shine, . . .
- Jaccard Index = 0.1348: bus & rail
- Jaccard Index = 0.2255: libraries & museums
Results
inindustries.com, base keywords:
industrial, industrias, industriel, . . .
- Jaccard Index = 0.0786: manufacturing
inindustries.com, with new keywords:
industrial, industrias, industriel, . . . , ministry, quarterly garden/outdoor, filipino footballer, . . .
- Jaccard Index = 0.099: tourist destinations
- Jaccard Index = 0.1326: real estate
Results: Parsing the Domains
Using Method 1 & 2:

| | CTR & String Similarity | Method 1 & 2 | CTR & String Similarity & 15 Random |
|---|---|---|---|
| Number of matches | 25 | 93 | 23 |
| Percentage of matches | 5% | 18.6% | 4.6% |
Results - Parsing the Domains
chaselogon.com, base keywords:
chameloeon
- No category matched
chaselogon.com, with parsed and new keywords:
chameloeon, chas, chase, elson, login, password, journalists cyber, logins expensive, beatles, . . .
- Jaccard Index = 0.4637: credit cards
- Jaccard Index = 0.4637: banking
Results: Parsing the Domains
Using Method 2:

| | CTR & String Sim. | Method 1 & 2 | Method 2 |
|---|---|---|---|
| Number of matches | 25 | 97 | 77 out of 356 |
| Percentage of matches | 5% | 19.4% | ∼ 21.6% |

Initial results show that overall, just using parsing might be more beneficial → depends on the amount of noise.
Example with a lot of noise:
- mobilestorage.ca
Example with minimal noise:
- addictinggamas.com
Results - Amplification of noise
mobilestorage.ca, keywords:
gfilestorage, mobileshop, mobilestorage, age, investor, vilest, . . .
- Jaccard Index = 0.1011: mobile & wireless
- Jaccard Index = 0.0959: music & audio
mobilestorage.ca, keywords with generated additions:
gfilestorage, mobileshop, mobilestorage, age, investor, vilest, . . . , legal age, taylor, phone companies, mobil, . . .
- Jaccard Index = 0.0942: music & audio
- Jaccard Index = 0.0887: education
Results - Minimal noise
addictinggamas.com, keywords:
addictinggams, addictivegames, adictigegames, . . . , addict, addictinggames, ingram, . . .
- Jaccard Index = 0.0153: software
addictinggamas.com, keywords with generated additions:
addictinggams, addictivegames, adictigegames, . . . , addict, addictinggames, ingram, . . . , gameplay requires, game, impulsedriven flash, add ons, . . .
- Jaccard Index = 0.2019: computer & video games
- Jaccard Index = 0.1975: games
Results: Extended Matches
Using Extended Matches:
We extended possible matches to parent and root nodes of the category tree.
- We checked in how many cases the parent or root node of the categories we obtained matched the manual matching.

| | CTR & String Sim. | Method 1 | Method 1 & 2 | Method 2 |
|---|---|---|---|---|
| Number of matches | 25 | 21 | 97 | 77 out of 356 |
| Percentage of matches | 5% | 4.2% | 19.4% | ∼ 21.6% |
| Number of extended matches | 32 | 29 | 128 | 102 out of 356 |
| Percentage of extended matches | 6.4% | 5.8% | 25.6% | ∼ 28.7% |
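
A minimal sketch of the extended-match check, reading it literally: a predicted category counts if it equals the hand-assigned category, or if its parent or its root does. The category names come from the slides, but the tree that links them is hypothetical, and so are the function names.

```python
def ancestors(category, parent):
    """Chain from a category up to its root, starting with the category itself."""
    chain = [category]
    while category in parent:
        category = parent[category]
        chain.append(category)
    return chain


def is_extended_match(predicted, manual, parent):
    """True if the predicted category, its parent, or its root equals `manual`."""
    if predicted == manual:
        return True
    chain = ancestors(predicted, parent)
    direct_parent = chain[1] if len(chain) > 1 else None
    root = chain[-1]
    return manual in (direct_parent, root)


def match_rate(matches, total):
    """Percentage of matches, rounded to one decimal place."""
    return round(100 * matches / total, 1)


# Hypothetical child -> parent links; the real category tree is not shown.
parent = {"credit cards": "banking", "banking": "finance", "bus & rail": "travel"}

print(is_extended_match("credit cards", "banking", parent))  # True (parent)
print(is_extended_match("credit cards", "finance", parent))  # True (root)
print(is_extended_match("bus & rail", "finance", parent))    # False
print(match_rate(128, 500), match_rate(102, 356))            # 25.6 28.7
```

The last line reproduces two percentages from the table: 128 extended matches out of 500 and 102 out of 356.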
Conclusion
Implemented a program to match domains with categories
Created an ESA based method to amplify existing keywords
Adapted a domain name parsing and spell correcting method
Revisiting our research questions:
Could we use ESA to extend the number of meaningful keywords per domain? → Yes
Could we use the keywords obtained through Oversee.net’s in-house statistics as the basis of the new keywords? → No. Or at least further processing must be done.
getting better & more keywords → getting a few good keywords
Future Directions
Find out how many good initial keywords are required to use our method successfully
Explore a better way of ranking keywords and determine which are the most descriptive ones
- Click through rate and string similarity comparisons are not sufficiently descriptive; a better scoring method is needed
Have a reference of the most popular websites, so that the domains given could be compared to these
- Analyze content in websites to amplify domain to category mapping
Thank you!
Academic Mentor: Cristina Garcia-Cardona
Industry Sponsor: Kryztof Urban and Oversee.net
RIPS Director: Dr. Michael Raugh
Director of IPAM: Dr. Russ Caflisch
IPAM Staff: Dimi, Stacey, Stacy, Roland, Stephanie, and everyone that made RIPS possible
Questions?
Thank you for listening!