Web-Crawling
What is a Web Crawler?
A program that browses the WWW in a methodical, automated manner or in an orderly fashion.
Also known as ants, automatic indexers, bots, Web spiders, Web robots, or Web scutters.
The process is called Web crawling or spidering.
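The crawl loop itself can be sketched in a few lines of plain Java. This is a minimal, network-free illustration (the class and method names here are my own, not from any library): a frontier queue of URLs to visit, a set of already-seen URLs, and a pluggable link-fetching function standing in for the real HTTP download.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Minimal breadth-first crawl loop: visit pages in the order they are
// discovered, never revisit a URL, and stop after maxPages pages.
public class CrawlSketch {
    public static List<String> crawl(String seed,
                                     Function<String, List<String>> fetchLinks,
                                     int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);                       // "fetch" the page
            for (String link : fetchLinks.apply(url)) {
                if (seen.add(link)) {               // true only on first sighting
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

The `seen` set is what keeps a real crawler from looping forever on cyclic links; `fetchLinks` is where an actual implementation would download the page and extract its anchors.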
Common Uses of a Web Crawler
Search engines
Presenting information in a manageable format
Testing / validation / maintenance of websites
Business Uses of a Web Crawler
Collecting social media/networking data to identify potential customers
Product listings & consumer reviews
Customs tariff data collection
E-mail address harvesting
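As an illustration of the last item above, extracting e-mail addresses from fetched page text is usually done with a regular expression. This is a hedged sketch (class name and pattern are mine): the pattern is deliberately simple and will miss some exotic but valid address forms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scan a block of page text and collect everything that looks like an
// e-mail address, in order of appearance.
public class EmailExtractor {
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static List<String> extract(String pageText) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```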
How a Web Crawler Works
Web Crawler Examples
Yahoo! Slurp -- Yahoo Search crawler
Bingbot -- Microsoft's Bing web crawler
Googlebot -- Google's web crawler
How Google Works
Googlebot, a web crawler that finds and fetches web pages.
The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
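The indexer and query-processor roles described above can be illustrated with a toy inverted index in Java. This is a sketch under simplifying assumptions (names are mine; a real engine adds ranking, stemming, and index compression): the indexer maps each word to the set of documents containing it, and the query processor intersects those sets for a multi-word query.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy inverted index: word -> set of document IDs containing that word.
public class TinyIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // The "indexer": record every word of the document.
    public void addDocument(String docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // The "query processor": documents containing ALL query words.
    public Set<String> query(String q) {
        Set<String> result = null;
        for (String word : q.toLowerCase().split("\\W+")) {
            Set<String> docs = index.getOrDefault(word, Collections.emptySet());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```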
Web-Crawling Techniques
HttpUnit
HtmlUnit
HttpURLConnection
HttpUnit
At the center of HttpUnit is the WebConversation class:
WebConversation wc = new WebConversation();
WebRequest req = new GetMethodWebRequest("http://www.metacube.com/productengg.asp");
WebResponse resp = wc.getResponse(req);
Navigating a web page link using the response:
WebLink link = resp.getLinkWith("linkId");
link.click();
WebResponse secondResponse = wc.getCurrentPage();
Retrieve the table structure of a web page:
WebTable table = resp.getTables()[0];
Advantages of HttpUnit
A simple way to crawl a website.
A powerful tool for creating test suites to ensure the end-to-end functionality of your web applications.
Can retrieve the forms, table structures, or frames associated with a web page, and can also navigate to a particular web link.
Limitations of HttpUnit:
Not able to parse all of the JavaScript associated with a web page.
HttpUnitOptions.setScriptingEnabled(false) can disable JavaScript, but this can result in wrong page validations.
Everything centers around WebConversations, WebRequests, and WebResponses.
HtmlUnit
An open-source Java library for making HTTP calls that imitate browser functionality.
Higher-level than HttpUnit's API, modeling web interaction in terms of the documents and interface elements the user interacts with.
Can be used for web application testing.
Web Client
The starting point; works as a browser simulator.
WebClient.getPage() is just like typing an address in the browser. It returns an HtmlPage object.
HtmlPage gives you access to much of a web page's content:
@Test
public void testGoogle() throws Exception {
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    assertEquals("Google", currentPage.getTitleText());
}
HTML Elements
HtmlPage gives you the ability to access any of the page's HTML elements and all of their attributes and sub-elements. This includes forms, tables, images, input fields, divs, or any other HTML element you can imagine.
You can also access any of the DOM elements by using XPath:
// Using XPath to get the first result in a Google query
HtmlElement element = (HtmlElement) currentPage.getByXPath("//h3").get(0);
DomNode result = element.getChildNodes().get(0);
HtmlUnit JavaScript Support
Uses the Mozilla Rhino JavaScript engine.
Gives you the ability to run pages with JavaScript, or even run JavaScript code on demand:
ScriptResult result = currentPage.executeJavaScript(javaScriptCode);
You can turn off JavaScript altogether using currentPage.getWebClient().setJavaScriptEnabled(false);
Advantages of HtmlUnit:
JavaScript code is executed just like in normal browsers when the page loads or when a handler is triggered.
HtmlUnit provides the ability to inject code into an existing page via HtmlPage.executeJavaScript(String yourJsCode).
Limitations of HtmlUnit
Pages that use third-party JavaScript libraries might not work when tested via HtmlUnit, like the following web page for crawling Iran's customs portal:
http://portal.irica.gov.ir/Portal/Home/Default.aspx?CategoryID=82cebca0-58cc-4be6-bef5-a94e2242140a
Turning off JavaScript can result in wrong page validations.
HttpURLConnection:
Crawl site data using the standard JDK API only.
Ensure that the HTTP request we form is bit-for-bit identical to the one the browser sends.
Analyze the HTTP request using any 3rd-party tool, then identify the HTTP request headers and POST data.
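Once the analyzer has shown which POST fields the site expects, building the form-encoded body by string concatenation (as on the following slides) gets repetitive. A small helper that encodes a name/value map is a common alternative; this sketch is my own (it assumes Java 10+ for the Charset overload of URLEncoder.encode, which avoids a checked exception). LinkedHashMap preserves insertion order, which matters when the replayed request must match the captured one exactly.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Build an application/x-www-form-urlencoded body from a field map,
// preserving the map's iteration order.
public class FormBody {
    public static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```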
HttpURLConnection Example
Access the URL object with the specified URL string:
URL urlObj = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) urlObj.openConnection();
// Add header information to the particular connection
conn.setRequestMethod("POST");
conn.addRequestProperty("Keep-Alive", "115");
conn.addRequestProperty("Accept-Language", "en-us,en;q=0.5");
conn.addRequestProperty("Connection", "keep-alive");
conn.addRequestProperty("User-Agent", "Mozilla/5.0");
conn.addRequestProperty("Host", "portal.irica.gov.ir");
conn.setRequestProperty("Cookie", iranHomePageInfo.getCookie());
conn.addRequestProperty("Accept", "text/html");
HttpURLConnection Example (Continued)
// Build the post data for navigation to a particular page
String postData = URLEncoder.encode("__EVENTTARGET", "UTF-8")
    + "=" + URLEncoder.encode("WebPart_c91c1c00_0fcc_4fea_8582_3576b443d5f2$LF0$PagerObject", "UTF-8");
postData += "&" + URLEncoder.encode("__EVENTARGUMENT", "UTF-8") + "=" + URLEncoder.encode(pageNumberStr, "UTF-8");
postData += "&" + URLEncoder.encode("__VIEWSTATE", "UTF-8") + "=" + URLEncoder.encode(iranHomePageInfo.getViewState(), "UTF-8");
// Then write the post data to the connection
conn.setDoOutput(true);
OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream());
out.write(postData);
out.flush();
out.close();
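The slides stop after writing the POST body, but to actually crawl the page we still have to read the response. A small helper that drains an InputStream into a String fills the gap; this is my own sketch (in real use, conn.getInputStream() would be passed in after the request above is sent).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Read an entire response stream into a String, one line at a time.
public class ResponseReader {
    public static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```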
HttpURLConnection Pros:
Uses only the basic/standard JDK API for crawling.
Can handle pages with complex JavaScript and 3rd-party JavaScript libraries, because it replays the raw HTTP requests instead of executing scripts.
More focus on HTTP request formation.
HttpURLConnection Cons
More cumbersome, since we have to write complex code to build requests that are bit-for-bit identical.
Have to match cookies and all header information exactly as displayed in HttpAnalyzer or another 3rd-party tool.
Conclusions
Use HttpUnit only for:
Simple websites
Web pages that don't have much JavaScript
Use HtmlUnit when:
HttpUnit is not able to crawl the web page
We require GUI-less browser simulation for Java programs. It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc., just like we do in a "normal" browser.
Conclusions (Continued)
Use HttpURLConnection when:
Neither HttpUnit nor HtmlUnit works
Web pages have a fair amount of JavaScript and 3rd-party JavaScript libraries
Questions?