Smart Crawler: Using Committee Machines for Web Pages Continuous Classification
Luiz Henrique Zambom Santana, Prof. Dr. Ronaldo dos Santos Mello and Prof. Dr. Mauro Roisenberg
Federal University of Santa Catarina – Florianópolis/SC
WebMedia – Manaus, 2015
Agenda
• Goals
• Motivation
• Model
• Architecture
• Implementation
• Experiments
• Conclusions
Goals
• Idea:
  • If:
    • www.infomoney.com.br = Finance
    • www.lance.com.br = Football
    • www.4rodas.com.br = Cars
  • Then:
    • www.valor.com.br = Finance
    • placar.abril.com.br = Football
    • revistaautoesporte.globo.com = Cars
Motivation
• If we know the category of a page, then
  • We can parse it better
  • We can provide better search results
  • We can customize the user experience
• Classify web page contents to generate datasets
Motivation
• Using ML techniques seemed a good idea, but:
  • We need to scale, so Matlab was not an option
  • We need to collect and classify pages continuously, so we need to index the pages
• After finding the right tools, we had the following question:
  • What is the best ML technique to use? We tried:
    • Naive Bayes, but the degree of class overlap is not small in our case
    • SVM, but it can only classify between two classes
• We decided to create a committee machine of SVM models
  • Better generalization
  • Could be very slow
Model
Implementation
• Cloud-ready technologies
  • Apache Spark
  • Elasticsearch
• Java frameworks:
  • Crawler4J
  • Apache Lucene
  • Jsoup: parsing
Support vector machine (SVM)
• Non-probabilistic binary linear classifier
• The number of iterations can be parametrized
• Slow!
• “One vs. All” approach with a committee [1, 2]
• The model with the most votes is the winner
[1] e Silva, Sergio Roberto de Lima, and Mauro Roisenberg. "Continuous authentication by keystroke dynamics using committee machines." Intelligence and Security Informatics. Springer Berlin Heidelberg, 2006. 686-687.
[2] Sun, Bing-Yu, et al. "Support vector machine committee for classification." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 648-653.
Finance Vs. Sport Finance Vs. Movies Finance Vs. Cars
Sport Vs. Movies Sport Vs. Cars
Movies Vs. Cars
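The voting step over the pairwise models listed above can be sketched as follows. This is a minimal illustration, not the project's code: the `BinaryModel` interface, the `toyCommittee` helper and the toy feature layout are assumptions made for the example, and simple threshold rules stand in for trained SVMs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CommitteeVote {

    /** A committee member that can only choose between two classes (hypothetical interface). */
    interface BinaryModel {
        String predict(double[] features);
    }

    /**
     * Collects one vote per model and returns the class with the most votes.
     * Ties go to the class that received its first vote earliest.
     */
    static String vote(List<BinaryModel> committee, double[] features) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (BinaryModel model : committee) {
            votes.merge(model.predict(features), 1, Integer::sum);
        }
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                winner = entry.getKey();
            }
        }
        return winner; // null only when the committee is empty
    }

    /**
     * Builds one toy model per pair of classes. Each toy model compares the two
     * feature values at the positions of its classes; a real committee would
     * hold one trained SVM per pair instead (assumption for illustration).
     */
    static List<BinaryModel> toyCommittee(String[] classes) {
        List<BinaryModel> committee = new ArrayList<>();
        for (int a = 0; a < classes.length; a++) {
            for (int b = a + 1; b < classes.length; b++) {
                final int i = a;
                final int j = b;
                committee.add(f -> f[i] >= f[j] ? classes[i] : classes[j]);
            }
        }
        return committee;
    }

    public static void main(String[] args) {
        String[] classes = {"finance", "sport", "movies", "cars"};
        List<BinaryModel> committee = toyCommittee(classes);
        System.out.println(committee.size()); // 6 pairwise models for 4 classes
        System.out.println(vote(committee, new double[] {0.9, 0.1, 0.3, 0.2})); // finance
    }
}
```

With four classes the committee holds six pairwise models, which is one reason the approach can be slow: every page is scored by all of them before the votes are counted.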
Architecture
Implementation details - Training
1. Set of pages is used as input to the models

String[] pagesCars = {
    "http://g1.globo.com/carros/index.html",
    "http://quatrorodas.abril.com.br/"
};
String[] pagesFinance = {
    "http://www.valor.com.br/",
    "http://www.infomoney.com.br/",
    "http://exame.abril.com.br/"
};
String[] pagesSport = {
    "http://globoesporte.globo.com/",
    "http://oledobrasil.com.br/",
    "http://espn.uol.com.br"
};
String[] pagesMovies = {
    "http://www.imdb.com/list/ls002231878/",
    "http://www.adorocinema.com/",
    "http://www.filmeb.com.br/",
    "http://www.revistabula.com/3165-lista-dos-100-melhores-filmes-de-todos-os-tempos-segundo-hollywood/"
};
2. Set of pages is used as input to the models
Implementation details - Training
3. Clean the page and calculate the feature vector using HashingTF
• Get only the page text (i.e., exclude HTML tags)
• Use Lucene to remove stopwords, symbols, numbers and other meaningless parts
• Compute the term frequency and create a feature vector
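The cleaning and hashing steps can be illustrated in plain Java. This is a sketch of the idea only: it does not call Lucene or Spark, the stopword list is a tiny made-up sample, and the bucket index uses `String.hashCode`, which is not necessarily the hash function Spark's HashingTF applies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class HashedTermFrequency {

    // Tiny illustrative stopword list (assumption); the slides use Lucene for this step.
    static final Set<String> STOPWORDS = Set.of("the", "a", "of", "and", "to", "in");

    /** Lowercases the text, splits on anything that is not a letter, and drops stopwords. */
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(token -> !token.isEmpty() && !STOPWORDS.contains(token))
                .collect(Collectors.toList());
    }

    /**
     * Hashing trick: each term is mapped to a bucket by its hash code, and the
     * bucket counts how often its terms occur. Different terms may collide.
     */
    static double[] featureVector(String text, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String term : tokenize(text)) {
            int index = Math.floorMod(term.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        String text = "The price of the stock and the price of gold in 2015";
        System.out.println(tokenize(text)); // [price, stock, price, gold]
        System.out.println(Arrays.toString(featureVector(text, 16)));
    }
}
```

The vector length is fixed in advance, so pages of any size produce inputs of the same dimension for the SVM models, at the cost of occasional collisions between unrelated terms.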
Implementation details - Training
4. Test the data against the models

16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.filmeb.com.br/ the model predicts movies
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.valor.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.infomoney.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://exame.abril.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://oledobrasil.com.br/ the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://espn.uol.com.br the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://sportv.globo.com/site/ the model predicts movies
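A test run like the log above can be summarized by comparing each prediction with an expected label. The sketch below is illustrative: the class and method names are made up, the expected labels for the first six pages follow the training lists, and the expected label for sportv.globo.com (sport) is an assumption, since that page does not appear in the training lists.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PredictionTally {

    /** Fraction of pages whose predicted label equals the expected one. */
    static double accuracy(Map<String, String> expected, Map<String, String> predicted) {
        if (expected.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (Map.Entry<String, String> entry : expected.entrySet()) {
            if (entry.getValue().equals(predicted.get(entry.getKey()))) {
                correct++;
            }
        }
        return (double) correct / expected.size();
    }

    /** Expected labels; the sportv entry is an assumption, the rest follow the training lists. */
    static Map<String, String> expectedLabels() {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("http://www.filmeb.com.br/", "movies");
        labels.put("http://www.valor.com.br/", "finance");
        labels.put("http://www.infomoney.com.br/", "finance");
        labels.put("http://exame.abril.com.br/", "finance");
        labels.put("http://oledobrasil.com.br/", "sport");
        labels.put("http://espn.uol.com.br", "sport");
        labels.put("http://sportv.globo.com/site/", "sport"); // assumed label
        return labels;
    }

    /** Predictions copied from the log lines. */
    static Map<String, String> loggedPredictions() {
        Map<String, String> predictions = new LinkedHashMap<>(expectedLabels());
        predictions.put("http://sportv.globo.com/site/", "movies"); // as logged
        return predictions;
    }

    public static void main(String[] args) {
        double acc = accuracy(expectedLabels(), loggedPredictions());
        System.out.printf("accuracy = %.3f%n", acc); // 6 of 7 pages match under these assumptions
    }
}
```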
Experiments
• First dataset
  • Classes: Finance (Infomoney), Sports (Lance), Movies (IMDB), and Cars (4 Rodas)
• Second dataset
  • Classes: Lifestyle, Soap opera, Technology
• Most of the documents are correctly classified, but there was also a lot of ambiguity:
Other problems
• Templates in portals (headers and footers)
• Documents with little information (e.g., "assine já", i.e., "subscribe now")
• Documents with too much information (e.g., the main page)
Focused crawler
• With 100 labeled pages of each kind, we ran the focused crawler with Carreira, Mercados, Onde Investir and Negócios
• The page structure is easier to test and provides much better results:
• The errors are due to texts that belong to more than one category, for instance:
Performance experiments
• Three experiments:
  • 1: 200,000 classifications in 30 minutes, with 4 classes
  • 2: 180,000 classifications, with 8 classes
  • 3: Focused crawler, with 3 classes
Current version
• eCrawler
• Disambiguation
• Pipeline of methods:
Conclusions
• Cloud-ready technologies, such as Apache Spark and Elasticsearch, enable the Smart Crawler to scale according to the application's needs;
• Implementing SVM, a traditional machine learning method, as a committee machine can improve the generalization power of the classification components;
• The architecture is designed to be general-purpose, so it can be used to crawl different domains and make the content available to transformation, search, and retrieval operations;
• The source code is available at:
  • https://github.com/lhzsantana/smart-crawler
Smart Crawler: Using Committee Machines for Web Pages Continuous Classification
Luiz Henrique Zambom Santana, Prof. Dr. Ronaldo dos Santos Mello and Prof. Dr. Mauro Roisenberg
Thank you!
Federal University of Santa Catarina
WebMedia - 2015