TRANSCRIPT
© 2008 CrawlWall.com
Competitive Counter-Intelligence
Stop Snooping Competitors
Techniques for protecting your SEO investment from prying competitive eyes
Bill Atchison
Chief Web Arachnologist
CrawlWall.com
“The ‘Bot Stops Here!”
Intelligence Gathering Motives
Why let competitors use your hard work and paid research as low-hanging fruit to launch their business?
• Competitors want the path of least resistance to encroach on your online business
• SEOs want to make easy money helping competitors rank by leveraging your information
• SEO tool vendors want to sell your data for a profit and help those competing against you
Who else gathers intelligence?
• Competitive Shopping Sites
• Intelligence gathering Spybots
  – Copyright Compliance
  – Branding Compliance
  – Corporate Security Monitoring
  – Media Monitoring (mp3, mpeg, etc.)
  – Myriad of Safe-Site Monitoring solutions
• Data Aggregators
• Lawyers!
• And many more…
How is Information Collected?
Various places used to gather data:
– Google, Yahoo! and MSN’s cache pages
– Internet Archives
– Whois, Domaintools, etc.
– SEO tools that directly crawl your site
NOARCHIVE Bans Cache
Eliminate search engine cache pages to stop covert researchers from gathering data on your meta tags, internal anchor text, and outbound links
Insert this meta tag in all your web pages:
<meta name='ROBOTS' content='NOARCHIVE'>
Visit http://www.noarchive.net for more information
Ban the Internet Archive
The Internet Archive (Archive.org) is used to covertly gather both historical and recent site data
Block Archive.org in robots.txt:
User-agent: ia_archiver
Disallow: /
Also block them in .htaccess, since they have been known to crawl anyway and the results show up in Alexa cache pages:
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver
RewriteRule ^.* - [F]
Lawyers love Archive.org!
Private Whois Registration
Remove clues about you, your administrative and technical contacts, and how many domains you have.
The WHOIS data is easily blocked using proxy domain registrations.
Sample WHOIS using DomainsByProxy:
MYDOMAINNAME.COM
Domains by Proxy, Inc.
DomainsByProxy.com
xxxx N. Rd., Ste xxx, PMB xxx
Scottsdale, Arizona xxxxx
United States
Another upside: no public WHOIS email addresses for spammers to harvest.
Whitelist Robots.txt
Use robots.txt to tell well behaved ‘bots whether they’re allowed to crawl or not
Sample robots.txt file:
# allow bots we like
User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:

# all other bots banned
User-agent: *
Disallow: /
Whitelist .htaccess
Badly behaved crawlers that won’t honor robots.txt get stopped at the server with .htaccess.
Sample .htaccess code:
# allow just search engines we like, we're OPT-IN only
BrowserMatchNoCase Google good_pass
BrowserMatchNoCase Slurp good_pass
BrowserMatchNoCase msnbot good_pass
BrowserMatchNoCase Teoma good_pass
BrowserMatchNoCase Jeeves good_pass

# allow Firefox, MSIE, Opera etc.
BrowserMatchNoCase ^Mozilla good_pass
BrowserMatchNoCase ^Opera good_pass

order deny,allow
deny from all
allow from env=good_pass
Verify Spider Identity
Make sure the search engines are who they claim to be using full-trip reverse DNS checking, to avoid spoofing:
IP 66.249.73.58 -> crawl-66-249-73-58.googlebot.com -> IP 66.249.73.58
Sample .htaccess code:
<FilesMatch "\.(s?html?|php[45]?)$">
  SetEnvIfNoCase User-Agent "!(Googlebot|msnbot|Teoma)" notRDNSbot
  Order Deny,Allow
  Deny from all
  Allow from env=notRDNSbot
  Allow from googlebot.com
  Allow from search.live.com
  Allow from ask.com
</FilesMatch>
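The same full-trip check (IP to hostname, hostname back to IP, hostname suffix must belong to a trusted engine) can be sketched outside Apache. A minimal Python sketch, assuming the function name and the allowed-suffix list are illustrative choices rather than anything CrawlWall ships:

```python
import socket

# Allowed reverse-DNS suffixes mirror the hosts whitelisted in the
# .htaccess sample above; adjust for the engines you trust.
ALLOWED_SUFFIXES = (".googlebot.com", ".search.live.com", ".ask.com")

def verify_spider(ip, allowed_suffixes=ALLOWED_SUFFIXES):
    """Full-trip check: IP -> hostname -> IP must round-trip, and the
    hostname must belong to a known search engine domain."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith(allowed_suffixes):             # trusted domain?
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                            # trip must close
```

The suffix check happens *between* the two lookups on purpose: a spoofer can control the reverse record for their own IP, but they cannot make a real googlebot.com hostname resolve forward to it.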
‘Bot Blockers for Everything Else
There will still be crawlers gathering competitive information that pretend to be human browsers so they don’t get caught
Tools such as robots.txt and .htaccess can’t stop those that don’t want to be stopped
Complete your arsenal with a ‘bot blocker script specifically designed to stop the unstoppable crawlers
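As one illustration of what such a ‘bot blocker script can look for: crawlers tend to request pages far faster than human readers. The rate-based heuristic below is a minimal sketch; the window and threshold values are invented for the example, and a real blocker would combine several signals.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds: any IP requesting more than MAX_HITS pages
# within WINDOW seconds is treated as a crawler, not a human reader.
WINDOW = 10.0
MAX_HITS = 20

_hits = defaultdict(deque)   # per-IP timestamps of recent requests

def is_bot(ip, now=None):
    """Record one request from `ip` and return True once it exceeds
    the request-rate threshold inside the sliding window."""
    now = time.monotonic() if now is None else now
    q = _hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:   # drop hits outside the window
        q.popleft()
    return len(q) > MAX_HITS
```

A web app would call `is_bot()` on every request and serve a 403 (or a tarpit page) once it returns True, regardless of what the User-Agent header claims.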
Summary
Remove Competitive Vulnerabilities:
• Eliminate Search Engine Cache Pages
• OPT-OUT of Archive.org
• OPT-IN only allowed spiders
• ‘Bot blocker scripts to catch hidden threats

Get Better Results:
• Tighter controls on copyrighted content
• Improved search engine ranking after thwarting unwanted competition
• Better server performance for visitors and legit search engine crawls
Resources
Visit the following sites and forums for more details:
Robots.txt Forum
http://www.webmasterworld.com/robots_txt/

Apache Web Server Forum
http://www.webmasterworld.com/apache/

Search Engine Spider Identification Forum
http://www.webmasterworld.com/search_engine_spiders/

The NoArchive Initiative
http://www.noarchive.net/

The Web Robots Pages
http://www.robotstxt.org/
Thank You!
Bill Atchison
Chief Web Arachnologist
CrawlWall.com
“The ‘Bot Stops Here!”