Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Luiz Henrique Zambom Santana, Prof. Dr. Ronaldo dos Santos Mello and Prof. Dr. Mauro Roisenberg

Federal University of Santa Catarina – Florianópolis/SC

WebMedia – Manaus, 2015

Agenda
• Goals
• Motivation
• Model
• Architecture
• Implementation
• Experiments
• Conclusions

Goals
• Idea:
• If:
  • www.infomoney.com.br = Finance
  • www.lance.com.br = Football
  • www.4rodas.com.br = Cars
• Then:
  • www.valor.com.br = Finance
  • placar.abril.com.br = Football
  • revistaautoesporte.globo.com = Cars

Motivation
• If we know the category of a page, then:
  • We can parse it better
  • We can provide better search results
  • We can customize the user experience
• Classify web page contents to generate a dataset

Motivation
• Using ML techniques seemed a good idea, but:
  • We need to scale, so MATLAB was not an option
  • We need to collect and classify pages continuously, so we need to index the pages
• After finding the right tools, we had the following question:
  • What is the best ML technique to use? We tried:
    • Naive Bayes, but the degree of class overlap is not small in our case
    • SVM, but a single model can only separate two classes
• We decided to create a committee machine of SVM models
  • Better generalization
  • Could be very slow

Implementation
• Cloud-ready technologies
  • Apache Spark
  • Elasticsearch
• Java frameworks:
  • Crawler4J
  • Apache Lucene
  • Jsoup: parsing

Support vector machine (SVM)
• Non-probabilistic binary linear classifier
• The number of iterations can be parametrized
• Slow!
• “One Vs. All” approach with committee [1, 2]
• The model with the most votes is the winner

[1] e Silva, Sergio Roberto de Lima, and Mauro Roisenberg. "Continuous authentication by keystroke dynamics using committee machines." Intelligence and Security Informatics. Springer Berlin Heidelberg, 2006. 686-687.
[2] Sun, Bing-Yu, et al. "Support vector machine committee for classification." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 648-653.

Committee models: Finance vs. Sport, Finance vs. Movies, Finance vs. Cars, Sport vs. Movies, Sport vs. Cars, Movies vs. Cars
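The voting step can be sketched in plain Java. This is a minimal illustration, not the project's code: the slide names a “One Vs. All” approach, while the models listed above are pairwise, and the sketch follows the pairwise listing. `CommitteeSketch`, `PairwiseModel`, and `toyCommittee` are hypothetical names, and each trained SVM is replaced by a toy predicate that compares per-class scores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class CommitteeSketch {

    /** One committee member: a binary decision between two class labels. */
    static final class PairwiseModel {
        final String first;
        final String second;
        // Hypothetical stand-in for a trained SVM: true votes for first, false for second.
        final Predicate<double[]> prefersFirst;

        PairwiseModel(String first, String second, Predicate<double[]> prefersFirst) {
            this.first = first;
            this.second = second;
            this.prefersFirst = prefersFirst;
        }

        String vote(double[] features) {
            return prefersFirst.test(features) ? first : second;
        }
    }

    /** Toy committee: one member per unordered pair, so k classes give k*(k-1)/2 members. */
    static List<PairwiseModel> toyCommittee(String[] classes) {
        List<PairwiseModel> committee = new ArrayList<>();
        for (int i = 0; i < classes.length; i++) {
            for (int j = i + 1; j < classes.length; j++) {
                final int a = i;
                final int b = j;
                // Assumption for illustration only: features[c] is a score for class c.
                committee.add(new PairwiseModel(classes[a], classes[b], f -> f[a] >= f[b]));
            }
        }
        return committee;
    }

    /** Every member votes once; the label with the most votes wins.
     *  Ties go to the label that reached the top count first. */
    static String classify(List<PairwiseModel> committee, double[] features) {
        Map<String, Integer> votes = new HashMap<>();
        String winner = null;
        int best = 0;
        for (PairwiseModel member : committee) {
            String label = member.vote(features);
            int count = votes.merge(label, 1, Integer::sum);
            if (count > best) {
                best = count;
                winner = label;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        String[] classes = {"finance", "sport", "movies", "cars"};
        List<PairwiseModel> committee = toyCommittee(classes);
        System.out.println(committee.size() + " committee members");
        System.out.println(classify(committee, new double[] {0.1, 0.9, 0.3, 0.2}));
    }
}
```

With four classes the committee has six members, which matches the six pairings listed above. In a real setup each predicate would wrap a model trained on pages from its two classes only.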

Architecture

Implementation details - Training
1. Set of pages is used as input to the models

String[] pagesCars = {
    "http://g1.globo.com/carros/index.html",
    "http://quatrorodas.abril.com.br/"
};
String[] pagesFinance = {
    "http://www.valor.com.br/",
    "http://www.infomoney.com.br/",
    "http://exame.abril.com.br/"
};
String[] pagesSport = {
    "http://globoesporte.globo.com/",
    "http://oledobrasil.com.br/",
    "http://espn.uol.com.br"
};
String[] pagesMovies = {
    "http://www.imdb.com/list/ls002231878/",
    "http://www.adorocinema.com/",
    "http://www.filmeb.com.br/",
    "http://www.revistabula.com/3165-lista-dos-100-melhores-filmes-de-todos-os-tempos-segundo-hollywood/"
};

2. Set of pages is used as input to the models

Implementation details - Training
3. Clean the page and calculate the feature vector using HashingTF
• Get only the page text (i.e., exclude HTML tags)
• Use Lucene to remove stopwords, symbols, numbers and other meaningless parts
• Calculate the term frequency and create a feature vector
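The cleaning and term-frequency step can be illustrated with a small, dependency-free version of the hashing trick. This is a sketch under stated assumptions rather than the Spark or Lucene code the slides refer to: `HashingTfSketch` and `featurize` are hypothetical names, the stopword list is a tiny placeholder, `String.hashCode` stands in for whatever hash function the real HashingTF uses, and the input is assumed to be page text with the HTML tags already removed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class HashingTfSketch {

    // Tiny illustrative stopword list (an assumption; real analyzers ship much larger ones).
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "to", "in"));

    /** Maps text to a fixed-size term-frequency vector using the hashing trick. */
    static double[] featurize(String text, int numFeatures) {
        double[] vector = new double[numFeatures];
        // Splitting on every non-letter run drops symbols and numbers.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;
            }
            // floorMod keeps the bucket index non-negative even for negative hash codes.
            int bucket = Math.floorMod(token.hashCode(), numFeatures);
            vector[bucket] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = featurize("The price of the stock and the stock market in 2015!", 16);
        System.out.println(Arrays.toString(v));
    }
}
```

The vector length is fixed up front, so no vocabulary has to be stored or shared between workers. The price is that distinct terms can collide in the same bucket, which matters more as `numFeatures` gets smaller.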

Implementation details - Training

16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.filmeb.com.br/ the model predicts movies
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.valor.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.infomoney.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://exame.abril.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://oledobrasil.com.br/ the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://espn.uol.com.br the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://sportv.globo.com/site/ the model predicts movies

4. Test the data against the models

Experiments
• First dataset
  • Classes: Finance (Infomoney), Sports (Lance), Movies (IMDB), and Cars (4 Rodas)
• Second dataset
  • Classes: Lifestyle, Soap opera, Technology
• Most of the documents are correctly classified, but there was also a lot of ambiguity:

Other problems
• Templates in portals (headers and footers)
• Documents with too little information (e.g., "assine já", i.e. "subscribe now", pages)
• Documents with too much information (e.g., the main page)

Focused crawler
• With 100 labeled pages of each kind, we ran the focused crawler with Carreira, Mercados, Onde Investir and Negócios
• The page structure is easier to test and provides much better results:
• The errors are due to texts that belong to more than one category, for instance:

Performance experiments
• Three experiments:
  • 1: 200,000 classifications in 30 minutes, with 4 classes
  • 2: 180,000 classifications, with 8 classes
  • 3: Focused crawler, with 3 classes
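For a rough sense of scale, the first experiment's figures translate into an average throughput as follows. Only experiment 1 states a duration, so no rate can be derived for the other two.

```latex
\[
\frac{200\,000\ \text{classifications}}{30 \times 60\ \text{s}} \approx 111\ \text{classifications/s}
\]
```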

Current version
• eCrawler
• Disambiguation
• Pipeline of methods:

Conclusions
• Cloud-ready technologies, such as Apache Spark and Elasticsearch, enable the Smart Crawler to expand according to the application's needs;
• Using SVM, a traditional machine learning method, within a committee machine can improve the generalization power of the classification components;
• The architecture is designed to be general-purpose, so it can be used to crawl different domains and make this content available to transformation, search, and retrieval operations.
• The source code is available at:
  • https://github.com/lhzsantana/smart-crawler


Thank you!
