Web-Crawling
What is a Web Crawler?
A program that browses the WWW in a methodical, automated manner or in an orderly fashion.
Also known as ants, automatic indexers, bots, Web spiders, Web robots, or Web scutters.
The process is called Web crawling or spidering.
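The crawl loop itself can be sketched in a few lines of plain Java. This is a minimal, network-free illustration (the class and method names here are my own, not from any library): a frontier queue of URLs to visit, a set of already-seen URLs, and a pluggable link-fetching function standing in for the real HTTP download.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Minimal breadth-first crawl loop: visit pages in the order they are
// discovered, never revisit a URL, and stop after maxPages pages.
public class CrawlSketch {
    public static List<String> crawl(String seed,
                                     Function<String, List<String>> fetchLinks,
                                     int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);                       // "fetch" the page
            for (String link : fetchLinks.apply(url)) {
                if (seen.add(link)) {               // true only on first sighting
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

The `seen` set is what keeps a real crawler from looping forever on cyclic links; `fetchLinks` is where an actual implementation would download the page and extract its anchors.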
Common Uses of a Web Crawler
Search engines
Presenting information in a manageable format
Testing / validation / maintenance of websites
Business Uses of a Web Crawler
Collecting social media/networking data to identify potential customers
Product listings & consumer reviews
Customs tariff data collection
E-mail address harvesting
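As an illustration of the last item above, extracting e-mail addresses from fetched page text is usually done with a regular expression. This is a hedged sketch (class name and pattern are mine): the pattern is deliberately simple and will miss some exotic but valid address forms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scan a block of page text and collect everything that looks like an
// e-mail address, in order of appearance.
public class EmailExtractor {
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static List<String> extract(String pageText) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```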
How a Web Crawler Works
Web Crawler Examples
Yahoo! Slurp -- Yahoo Search crawler
Bingbot -- Microsoft's Bing web crawler
Googlebot -- Google's web crawler
How Google Works
Googlebot, a web crawler that finds and fetches web pages.
The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
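The indexer and query-processor roles described above can be illustrated with a toy inverted index in Java. This is a sketch under simplifying assumptions (names are mine; a real engine adds ranking, stemming, and index compression): the indexer maps each word to the set of documents containing it, and the query processor intersects those sets for a multi-word query.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy inverted index: word -> set of document IDs containing that word.
public class TinyIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // The "indexer": record every word of the document.
    public void addDocument(String docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // The "query processor": documents containing ALL query words.
    public Set<String> query(String q) {
        Set<String> result = null;
        for (String word : q.toLowerCase().split("\\W+")) {
            Set<String> docs = index.getOrDefault(word, Collections.emptySet());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```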
Web-Crawling Techniques
HttpUnit
HtmlUnit
HttpURLConnection
HttpUnit
At the center of HttpUnit is the WebConversation class:
WebConversation wc = new WebConversation();
WebRequest req = new GetMethodWebRequest("http://www.metacube.com/productengg.asp");
WebResponse resp = wc.getResponse(req);
Navigating a web page link using the response:
WebLink link = resp.getLinkWith("linkId");
link.click();
WebResponse secondResponse = wc.getCurrentPage();
Retrieve the table structure of a web page:
WebTable table = resp.getTables()[0];
Advantages of HttpUnit
A simple way to crawl a website.
A powerful tool for creating test suites to ensure the end-to-end functionality of your web applications.
Can retrieve the forms, table structures, or frames associated with a web page, and can also navigate to a particular web link.
Limitations of HttpUnit:
Not able to parse all of the JavaScript associated with a web page.
HttpUnitOptions.setScriptingEnabled(false) can disable JavaScript, but this can result in wrong page validations.
Everything centers around WebConversations, WebRequests, and WebResponses.
HtmlUnit
An open-source Java library for making HTTP calls that imitate browser functionality.
Higher-level than HttpUnit's API, modeling web interaction in terms of the documents and interface elements the user interacts with.
Can be used for web application testing.
Web Client
The starting point; works as a browser simulator.
WebClient.getPage() is just like typing an address in the browser. It returns an HtmlPage object.
HtmlPage gives you access to much of a web page's content:
@Test
public void testGoogle() throws Exception {
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    assertEquals("Google", currentPage.getTitleText());
}
HTML Elements
HtmlPage gives you the ability to access any of the page's HTML elements and all of their attributes and sub-elements. This includes forms, tables, images, input fields, divs, or any other HTML element you can imagine.
You can also access any of the DOM elements by using XPath:
// Using XPath to get the first result in a Google query
HtmlElement element = (HtmlElement) currentPage.getByXPath("//h3").get(0);
DomNode result = element.getChildNodes().get(0);
HtmlUnit JavaScript Support
Uses the Mozilla Rhino JavaScript engine.
Gives you the ability to run pages with JavaScript, or even run JavaScript code on demand:
ScriptResult result = currentPage.executeJavaScript(javaScriptCode);
You can turn off JavaScript altogether using currentPage.getWebClient().setJavaScriptEnabled(false);
Advantages of HtmlUnit:
JavaScript code is executed just like in normal browsers when the page loads or when a handler is triggered.
HtmlUnit provides the ability to inject code into an existing page via HtmlPage.executeJavaScript(String yourJsCode).
Limitations of HtmlUnit
Pages that use third-party JavaScript libraries might not work when tested via HtmlUnit, like the following web page for crawling Iran's customs portal:
http://portal.irica.gov.ir/Portal/Home/Default.aspx?CategoryID=82cebca0-58cc-4be6-bef5-a94e2242140a
Turning off JavaScript can result in wrong page validations.
HttpURLConnection:
Crawl site data using the standard JDK API only.
Ensure that the HTTP request we form is bit-for-bit identical to the one the browser sends.
Analyze the HTTP request using any 3rd-party tool, then identify the HTTP request headers and POST data.
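Once the analyzer has shown which POST fields the site expects, building the form-encoded body by string concatenation (as on the following slides) gets repetitive. A small helper that encodes a name/value map is a common alternative; this sketch is my own (it assumes Java 10+ for the Charset overload of URLEncoder.encode, which avoids a checked exception). LinkedHashMap preserves insertion order, which matters when the replayed request must match the captured one exactly.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Build an application/x-www-form-urlencoded body from a field map,
// preserving the map's iteration order.
public class FormBody {
    public static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```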
HttpURLConnection Example
Access the URL object with the specified URL string:
URL urlObj = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) urlObj.openConnection();
// Add header information to the particular connection
conn.setRequestMethod("POST");
conn.addRequestProperty("Keep-Alive", "115");
conn.addRequestProperty("Accept-Language", "en-us,en;q=0.5");
conn.addRequestProperty("Connection", "keep-alive");
conn.addRequestProperty("User-Agent", "Mozilla/5.0");
conn.addRequestProperty("Host", "portal.irica.gov.ir");
conn.setRequestProperty("Cookie", iranHomePageInfo.getCookie());
conn.addRequestProperty("Accept", "text/html");
HttpURLConnection Example (Continued)
// Build the post data for navigation to a particular page
String postData = URLEncoder.encode("__EVENTTARGET", "UTF-8")
    + "=" + URLEncoder.encode("WebPart_c91c1c00_0fcc_4fea_8582_3576b443d5f2$LF0$PagerObject", "UTF-8");
postData += "&" + URLEncoder.encode("__EVENTARGUMENT", "UTF-8") + "=" + URLEncoder.encode(pageNumberStr, "UTF-8");
postData += "&" + URLEncoder.encode("__VIEWSTATE", "UTF-8") + "=" + URLEncoder.encode(iranHomePageInfo.getViewState(), "UTF-8");
// Then write the post data to the connection
conn.setDoOutput(true);
OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream());
out.write(postData);
out.flush();
out.close();
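The slides stop after writing the POST body, but to actually crawl the page we still have to read the response. A small helper that drains an InputStream into a String fills the gap; this is my own sketch (in real use, conn.getInputStream() would be passed in after the request above is sent).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Read an entire response stream into a String, one line at a time.
public class ResponseReader {
    public static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```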
HttpURLConnection Pros:
Uses only the basic/standard JDK API for crawling.
Can handle pages with complex JavaScript and 3rd-party JavaScript libraries, because it replays the raw HTTP requests instead of executing scripts.
More focus on HTTP request formation.
HttpURLConnection Cons
More cumbersome, since we have to write complex code to build requests that are bit-for-bit identical.
Have to match cookies and all header information exactly as displayed in HttpAnalyzer or another 3rd-party tool.
Conclusions
Use HttpUnit only for:
Simple websites
Web pages that don't have much JavaScript
Use HtmlUnit when:
HttpUnit is not able to crawl the web page
We require GUI-less browser simulation for Java programs. It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc., just like we do in a "normal" browser.
Conclusions (Continued)
Use HttpURLConnection when:
Neither HttpUnit nor HtmlUnit works
Web pages have a fair amount of JavaScript and 3rd-party JavaScript libraries
Questions?