Data Scraping with Excel - Campixx 2013 - Maik Schmidt
DESCRIPTION
In my Campixx session I showed how to scrape data from the web using Excel and XPath. Among other things, we scraped: standard KPIs, a malware checker, an index checker, Google SERPs and Google Suggest.
TRANSCRIPT
![Page 1: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/1.jpg)
17.03.2013 – Berlin - SEO Campixx
Data Scraping with Excel – by Maik Schmidt
![Page 2: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/2.jpg)
Who I am
• Maik Schmidt
• SEO consultant at Catbird Seat (2010)
• Winner of the SEO contest "KubaSEOTräume"
@chillboyy
facebook.com/chillboy.de
xing.com/profile/Maik_Schmidt11
![Page 3: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/3.jpg)
What are we scraping today?
• Standard KPIs
• Malware checker
• Index checker
• Google SERPs
• Google Suggest
![Page 4: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/4.jpg)
Why Excel?
• Because I can't program
Drawbacks:
• Slow
• Limited data volumes
![Page 5: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/5.jpg)
What do I need?
• Excel
• Niels Bosma SEO Tools for Excel
http://nielsbosma.se/projects/seotools/
![Page 6: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/6.jpg)
Niels Bosma SEO Tools (1/4)
Onpage
• LinkCount
• HtmlTitle
• HtmlMetaDescription
• HtmlMetaKeywords
• HtmlMeta
• HtmlFirst
• HtmlH1
• HtmlH2
• HtmlH3
• HtmlCanonical
• W3CValidate
• PageCodeToTextRatio
• PageSize
• PageTextSize
• PageCodeSize
• HttpStatus
• HttpHeader
• ResponseTime
• PageEncoding
• IsFoundOnPage
Content
• FindDuplicatedContent
• CountWords
• LCS
• SpinText
Backlinks
• CheckBacklink
• GooglePageRank
• GoogleResultCount
• GoogleIndexCount
• GoogleLinkCount
• AlexaReach
• AlexaPopularity
• AlexaLinkCount
• DmozEntries
• WikipediaLinks
Social
• FacebookLikes
• GooglePlusCount
• TwitterCount
and many more.
![Page 7: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/7.jpg)
Niels Bosma SEO Tools (2/4)
SEOlytics
• Backlinks
• SVR (visibility)
• Keyword rankings
• Domain metrics
• LinkCount/URL
• Link history
![Page 8: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/8.jpg)
Niels Bosma SEO Tools (3/4)
MajesticSEO
• Largest backlink DB
• Fresh Index
• Historic Index
• Trust/Citation Flow
![Page 9: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/9.jpg)
Niels Bosma SEO Tools (4/4)
Google Analytics
• Similar to:
http://ga-dev-tools.appspot.com/explorer/
• =GoogleAnalytics(string id,string metrics,string startDate,string endDate,[string dimensions,string segment,string filter,string sort,integer startIndex,integer maxResults,bool excludeHeaderInResult,bool excludeDimensionsInResult]) : {string}
![Page 10: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/10.jpg)
X-Path Basics
XPath lets you address specific parts of an XML document.
Examples:
• Document root node: /
• Direct child element: XML_element_name
• Direct child of the root node: /XML_element_name
• Child of a child: XML_element_name/XML_element_name
• Descendant of the root: //XML_element_name
• Descendant of a node: XML_element_name//XML_element_name
• Parent of a node: ../
• A far cousin of a node: ../../XML_element_name/XML_element_name
/html/body/div/div/div/h3[position()=1]
Gets the content of the first H3 tag on this path (used to scrape Sichtbarkeitsindex.de)
"//h3[@class='x']/a";"href"
Gets the href of all links inside H3 tags with the class "x" (used to scrape Google SERPs)
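The two XPath patterns above can be tried outside Excel. The following is a minimal sketch using Python's standard library; the sample markup and all names in it are invented for illustration and are not from the talk. Note that `xml.etree.ElementTree` supports only a subset of XPath, so the position predicate is written as `[1]` instead of `[position()=1]`, and paths are relative to the root element.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page (an assumption, not taken from the talk).
SAMPLE = """
<html>
  <body>
    <div><div><div>
      <h3>First heading</h3>
      <h3>Second heading</h3>
    </div></div></div>
    <h3 class="x"><a href="http://example.com/a">A</a></h3>
    <h3 class="x"><a href="http://example.com/b">B</a></h3>
    <h3 class="y"><a href="http://example.com/c">C</a></h3>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Counterpart of /html/body/div/div/div/h3[position()=1]:
# the path starts below <html>, and [1] selects the first h3 child.
first_h3_text = root.find("body/div/div/div/h3[1]").text

# Counterpart of //h3[@class='x']/a, then reading the "href" attribute.
links = [a.get("href") for a in root.findall(".//h3[@class='x']/a")]

print(first_h3_text)
print(links)
```

Real web pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice; the sketch only illustrates how the path expressions select nodes.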
![Page 11: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/11.jpg)
Finding the XPath the easy way
With the Firefox plugin Firebug (and FirePath), the XPath can be found quickly and easily.
![Page 12: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/12.jpg)
Standard KPIs
SOURCES:
• Free SI: Sichtbarkeitsindex.de/deinedomain.de
• SI API: http://api.sistrix.net/domain.sichtbarkeitsindex?api_key=xy&domain=deinewebseite.de
• Alexa Rank: http://www.alexa.com/siteinfo/deinedomain.de
=XPathOnUrl([URL];"/html/body/div/div/div/h3[position()=1]")
=XPathOnUrl([SI API URL];"response/answer/sichtbarkeitsindex";"value")
=XPathOnUrl([Alexa URL];"//table[@id='siteStats']/tbody/tr[1]/td[2]/div")
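As an illustration of what the SI API formula does, here is a minimal Python sketch that reads the `value` attribute at `response/answer/sichtbarkeitsindex`. The sample response body is hypothetical: only the element path and the attribute name come from the formula above, while the other attribute and the numbers are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body (assumption); only the path
# response/answer/sichtbarkeitsindex and the "value" attribute
# are taken from the Excel formula.
SAMPLE_RESPONSE = """<response>
  <answer>
    <sichtbarkeitsindex domain="deinewebseite.de" value="1.2345"/>
  </answer>
</response>"""


def visibility_index(xml_text):
    """Return the visibility index as a float, or None if the node is missing."""
    root = ET.fromstring(xml_text)  # root is the <response> element
    node = root.find("answer/sichtbarkeitsindex")
    if node is None or node.get("value") is None:
        return None
    return float(node.get("value"))


print(visibility_index(SAMPLE_RESPONSE))
```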
![Page 13: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/13.jpg)
Google Safe Browsing API
Source: http://safebrowsing.clients.google.com/safebrowsing/diagnostic?site=domain.de
=UrlProperty([URL];"domain")
=XPathOnUrl([Google SafeBrowsing URL]; "/html/body/center/div/div/blockquote/p[position()=1]")
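The first step, reducing a full URL to its domain and building the diagnostic URL from it, might look like this in Python. This is a sketch under assumptions: `host_of` and `diagnostic_url` are hypothetical helper names, and whether `UrlProperty([URL];"domain")` returns the host with or without a `www.` prefix is not stated in the slides.

```python
from urllib.parse import urlsplit

# Base URL as given on the slide; the site parameter is appended.
DIAGNOSTIC = "http://safebrowsing.clients.google.com/safebrowsing/diagnostic?site="


def host_of(url):
    """Rough stand-in for UrlProperty([URL];"domain"): the host part of the URL."""
    return urlsplit(url).hostname


def diagnostic_url(url):
    """Build the Safe Browsing diagnostic URL for the host of the given URL."""
    return DIAGNOSTIC + host_of(url)


print(diagnostic_url("http://www.example.de/some/page?x=1"))
```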
![Page 14: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/14.jpg)
Index Checker
Source: http://www.google.de/search?gcx=c&sourceid=chrome&ie=UTF-8&pws=0&q=info:deinewebseite.de
=HttpStatus([USER URL])
=WENN(ISTFEHLER(IDENTISCH(TEIL(XPathOnUrl("http://www.google.de/search?gcx=c&sourceid=chrome&ie=UTF-8&q="&("info:"&(A2))&"&pws=0";"//li[@class='g']//h3[@class='r']//a";"href");8;LÄNGE(A2));A2));"not indexed";"indexed")
=WENN(HtmlCanonical(A2)=A2;"self canonical";HtmlCanonical(A2))
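The logic of the index and canonical checks can be sketched in Python as follows. The function names are hypothetical, and the input is assumed to be the `href` of the first result for an `info:` query, already scraped. One deliberate difference: the Excel formula only reacts to an error from the lookup, whereas this sketch also returns "not indexed" when the first result points to a different URL; that stricter reading is an assumption about the intent.

```python
def index_status(first_result_href, url):
    """Compare the first info: result with the checked URL."""
    # No result at all corresponds to the XPath lookup failing.
    if not first_result_href:
        return "not indexed"
    # "/url?q=" is 7 characters long, hence start position 8 in TEIL(...;8;...).
    prefix_len = len("/url?q=")
    candidate = first_result_href[prefix_len:prefix_len + len(url)]
    return "indexed" if candidate == url else "not indexed"


def canonical_status(url, canonical):
    """Mirror of =WENN(HtmlCanonical(A2)=A2;"self canonical";HtmlCanonical(A2))."""
    return "self canonical" if canonical == url else canonical


print(index_status("/url?q=http://example.de/&sa=U", "http://example.de/"))
print(canonical_status("http://example.de/a", "http://example.de/b"))
```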
![Page 15: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/15.jpg)
Scraping Google Suggest
• Source: http://google.de/complete/search?output=toolbar&hl=de&q=
• Scrapes the keyword followed by a single letter, with and without a space in between
• Array function to scrape the 10 results per query
• Second iteration over the top 10
More than 600 suggested keywords!
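The query expansion described above can be sketched in Python. The helper names are hypothetical, and the sample XML is an assumption about the shape of the `output=toolbar` response, not something shown on the slides. With 26 letters, each with and without a space, one keyword yields 52 queries.

```python
import string
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Base URL as given on the slide.
BASE = "http://google.de/complete/search?output=toolbar&hl=de&q="


def suggest_urls(keyword):
    """Build one query URL per letter, without and with a space before it."""
    urls = []
    for letter in string.ascii_lowercase:
        for separator in ("", " "):
            urls.append(BASE + quote(keyword + separator + letter))
    return urls


# Assumed response shape (not confirmed by the slides).
SAMPLE_XML = """<toplevel>
  <CompleteSuggestion><suggestion data="seo tools"/></CompleteSuggestion>
  <CompleteSuggestion><suggestion data="seo test"/></CompleteSuggestion>
</toplevel>"""


def parse_suggestions(xml_text):
    """Collect the data attribute of every suggestion element."""
    root = ET.fromstring(xml_text)
    return [s.get("data") for s in root.findall("CompleteSuggestion/suggestion")]


urls = suggest_urls("seo")
print(len(urls), urls[:2])
print(parse_suggestions(SAMPLE_XML))
```

Feeding the collected suggestions back into `suggest_urls` would correspond to the second iteration mentioned on the slide.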
![Page 16: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/16.jpg)
Scraping Google SERPs
Source:
http://www.google.de/search?q=deinkeyword&num=100&start=0&pws=0
Formula:
=XPathOnUrl([URL];"(//h3[@class='r']/a)["&A1&"]";"href")
Result:
/url?q=http://de.wikipedia.org/wiki/Suchmaschinenoptimierung&sa=U&ei=bTU2UP6sPMfNsgbAnoHYBQ&ved=0CB0QFjAA&usg=AFQjCNHwx6lcRxVC0-eBeDJ6GgHBiHGtFQ
Then:
=RECHTS(B1;LÄNGE(B1)-7)
=RECHTS(C1;LÄNGE(C1)-SUCHEN("&";C1))
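The scraped `href` is a Google redirect link that wraps the actual target in its `q` parameter. As a sketch of the cleanup step in Python (an alternative to string slicing, with hypothetical function names), the standard library can parse the query string directly:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def serp_url(keyword, start=0, num=100):
    """Build the search URL in the form given on the slide."""
    params = {"q": keyword, "num": num, "start": start, "pws": 0}
    return "http://www.google.de/search?" + urlencode(params)


def target_url(href):
    """Return the q parameter of a /url?q=... redirect link, or None."""
    values = parse_qs(urlsplit(href).query).get("q")
    return values[0] if values else None


# Result string from the slide.
HREF = ("/url?q=http://de.wikipedia.org/wiki/Suchmaschinenoptimierung"
        "&sa=U&ei=bTU2UP6sPMfNsgbAnoHYBQ&ved=0CB0QFjAA"
        "&usg=AFQjCNHwx6lcRxVC0-eBeDJ6GgHBiHGtFQ")

print(serp_url("deinkeyword"))
print(target_url(HREF))
```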
![Page 17: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/17.jpg)
Scraping Google SERPs
![Page 18: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/18.jpg)
What else?
Analytics for Twitter by Microsoft
& Power Pivot
![Page 19: Data Scraping with Excel - Campixx 2013 - Maik Schmidt](https://reader036.vdocument.in/reader036/viewer/2022062300/5584e3fdd8b42ad73a8b5260/html5/thumbnails/19.jpg)
The End
Be creative!
The Excel files shown live will be available for download on the blog at www.catbirdseat.de
With the examples and tools shown, you can in theory scrape any website and process the data in Excel