TRANSCRIPT
© 2008 CrawlWall.com
Competitive Counter-Intelligence
Stop Snooping Competitors
Techniques for protecting your SEO investment from prying competitive eyes
Bill Atchison
Chief Web Arachnologist
CrawlWall.com
“The ‘Bot Stops Here!”
Intelligence Gathering Motives
Why let competitors use your hard work and paid research as low-hanging fruit to launch their business?
• Competitors want the path of least resistance to encroach on your online business
• SEOs want to make easy money helping competitors rank by leveraging your information
• SEO tool vendors want to sell your data for a profit and help those competing against you
Who else gathers intelligence?
• Competitive Shopping Sites
• Intelligence gathering Spybots
  – Copyright Compliance
  – Branding Compliance
  – Corporate Security Monitoring
  – Media Monitoring (mp3, mpeg, etc.)
  – Myriad of Safe-Site Monitoring solutions
• Data Aggregators
• Lawyers!
• And many more…
How is Information Collected?
Various places used to gather data:
– Google, Yahoo! and MSN’s cache pages
– Internet Archives
– Whois, Domaintools, etc.
– SEO tools that directly crawl your site
NOARCHIVE Bans Cache
Eliminate search engine cache pages to stop covert researchers from gathering data on your meta tags, internal anchor text, and outbound links
Insert this meta tag in all your web pages:
<meta name='ROBOTS' content='NOARCHIVE'>
Visit http://www.noarchive.net for more information
Ban the Internet Archive
The Internet Archive (Archive.org) is used to covertly gather both historical and recent site data
Block Archive.org in robots.txt:
User-agent: ia_archiver
Disallow: /
Also block them in .htaccess, since they have been known to crawl anyway and the results show up in Alexa cache pages:
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver
RewriteRule ^.* - [F]
Lawyers love Archive.org!
Private Whois Registration
Remove clues about you, your administrative and technical contacts, and how many domains you have.
The WHOIS data is easily blocked using proxy domain registrations.
Sample WHOIS using DomainsByProxy:
MYDOMAINNAME.COM
Domains by Proxy, Inc.
DomainsByProxy.com
xxxx N. Rd., Ste xxx, PMB xxx
Scottsdale, Arizona xxxxx
United States
Another upside: no public WHOIS email addresses for spammers to harvest.
Whitelist Robots.txt
Use robots.txt to tell well behaved ‘bots whether they’re allowed to crawl or not
Sample robots.txt file:
# allow bots we like
User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:

# all other bots banned
User-agent: *
Disallow: /
Whitelist .htaccess
Badly behaved crawlers that won’t honor robots.txt get stopped at the server with .htaccess.
Sample .htaccess code:
# allow just search engines we like, we're OPT-IN only
BrowserMatchNoCase Google good_pass
BrowserMatchNoCase Slurp good_pass
BrowserMatchNoCase msnbot good_pass
BrowserMatchNoCase Teoma good_pass
BrowserMatchNoCase Jeeves good_pass

# allow Firefox, MSIE, Opera etc.
BrowserMatchNoCase ^Mozilla good_pass
BrowserMatchNoCase ^Opera good_pass

order deny,allow
deny from all
allow from env=good_pass
Verify Spider Identity
Make sure the search engines are who they claim to be using full-trip reverse DNS checking, to avoid spoofing:
IP 66.249.73.58 -> crawl-66-249-73-58.googlebot.com -> IP 66.249.73.58
Sample .htaccess code:
<FilesMatch "\.(s?html?|php[45]?)$">
  SetEnvIfNoCase User-Agent "!(Googlebot|msnbot|Teoma)" notRDNSbot
  Order Deny,Allow
  Deny from all
  Allow from env=notRDNSbot
  Allow from googlebot.com
  Allow from search.live.com
  Allow from ask.com
</FilesMatch>
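The same full-trip check (IP to hostname, hostname back to IP, hostname suffix must belong to a trusted engine) can be sketched outside Apache. A minimal Python sketch, assuming the function name and the allowed-suffix list are illustrative choices rather than anything CrawlWall ships:

```python
import socket

# Allowed reverse-DNS suffixes mirror the hosts whitelisted in the
# .htaccess sample above; adjust for the engines you trust.
ALLOWED_SUFFIXES = (".googlebot.com", ".search.live.com", ".ask.com")

def verify_spider(ip, allowed_suffixes=ALLOWED_SUFFIXES):
    """Full-trip check: IP -> hostname -> IP must round-trip, and the
    hostname must belong to a known search engine domain."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith(allowed_suffixes):             # trusted domain?
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                            # trip must close
```

The suffix check happens *between* the two lookups on purpose: a spoofer can control the reverse record for their own IP, but they cannot make a real googlebot.com hostname resolve forward to it.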
‘Bot Blockers for Everything Else
There will still be crawlers gathering competitive information that pretend to be human browsers so they don’t get caught
Tools such as robots.txt and .htaccess can’t stop those that don’t want to be stopped
Complete your arsenal with a ‘bot blocker script specifically designed to stop the unstoppable crawlers
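As one illustration of what such a ‘bot blocker script can look for: crawlers tend to request pages far faster than human readers. The rate-based heuristic below is a minimal sketch; the window and threshold values are invented for the example, and a real blocker would combine several signals.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds: any IP requesting more than MAX_HITS pages
# within WINDOW seconds is treated as a crawler, not a human reader.
WINDOW = 10.0
MAX_HITS = 20

_hits = defaultdict(deque)   # per-IP timestamps of recent requests

def is_bot(ip, now=None):
    """Record one request from `ip` and return True once it exceeds
    the request-rate threshold inside the sliding window."""
    now = time.monotonic() if now is None else now
    q = _hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:   # drop hits outside the window
        q.popleft()
    return len(q) > MAX_HITS
```

A web app would call `is_bot()` on every request and serve a 403 (or a tarpit page) once it returns True, regardless of what the User-Agent header claims.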
Summary
Remove Competitive Vulnerabilities:
• Eliminate Search Engine Cache Pages
• OPT-OUT of Archive.org
• OPT-IN only allowed spiders
• ‘Bot blocker scripts to catch hidden threats

Get Better Results:
• Tighter controls on copyrighted content
• Improved search engine ranking after thwarting unwanted competition
• Better server performance for visitors and legit search engine crawls
Resources
Visit the following sites and forums for more details:
Robots.txt Forum
http://www.webmasterworld.com/robots_txt/

Apache Web Server Forum
http://www.webmasterworld.com/apache/

Search Engine Spider Identification Forum
http://www.webmasterworld.com/search_engine_spiders/

The NoArchive Initiative
http://www.noarchive.net/

The Web Robots Pages
http://www.robotstxt.org/
Thank You!
Bill Atchison
Chief Web Arachnologist
CrawlWall.com
“The ‘Bot Stops Here!”