Blocking Bad Bots and Scrapers with .htaccess

Upload: arogall7926

Post on 09-Oct-2015

DESCRIPTION

Stopping Spam and Bots

TRANSCRIPT

    Blocking Bad Bots and Scrapers with .htaccess

    by Charles Torvalds

    This article shows two methods of blocking this entire list of bad robots and web scrapers with .htaccess files: using SetEnvIfNoCase, or using RewriteRules with mod_rewrite.

    Contents
    Blocking Bad Robots and Web Scrapers with RewriteRules
    Alternate RewriteCond Rules
    Block Bad Bots with SetEnvIfNoCase
    Original Bad Bot / Web Scraper List

    Blocking Bad Robots and Web Scrapers with RewriteRules

    Blocking Bad Bots and Scrapers with .htaccess http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers...

    1 of 11 11/10/2014 5:09 AM

    ErrorDocument 403 /403.html
    RewriteEngine On
    RewriteBase /
    # IF THE UA STARTS WITH THESE
    RewriteCond %{HTTP_USER_AGENT} ^(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(libwww-perl|widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) [NC,OR]
    # STARTS WITH WEB
    RewriteCond %{HTTP_USER_AGENT} ^web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|sit
    # ANYWHERE IN UA -- GREEDY REGEX
    RewriteCond %{HTTP_USER_AGENT} ^.*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collecto
    # ISSUE 403 / SERVE ERRORDOCUMENT
    RewriteRule . - [F,L]
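
When ErrorDocument 403 /403.html is combined with a rule that forbids every URL, the blocked client may also be forbidden from fetching /403.html itself, in which case Apache falls back to its built-in 403 text. Below is a minimal sketch of one way around that. It assumes the error page lives at /403.html as in the rules above, and the two short name lists are placeholders for illustration, not the article's full list.

```apache
ErrorDocument 403 /403.html
RewriteEngine On
RewriteBase /

# Assumption: exempt the error page itself so the 403 response can still serve it.
# A condition without [OR] is ANDed with the [OR] chain that follows.
RewriteCond %{REQUEST_URI} !^/403\.html$

# UA starts with one of these names (illustrative subset)
RewriteCond %{HTTP_USER_AGENT} ^(alexibot|blackwidow) [NC,OR]
# UA contains one of these words anywhere (illustrative subset)
RewriteCond %{HTTP_USER_AGENT} (download|extract) [NC]

# Issue 403 for everything else requested by a matching UA
RewriteRule .? - [F,L]
```

The last condition in an [OR] chain carries no OR flag, so the chain ends there and the rule fires only when the exemption and at least one User-Agent condition both hold.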

    Alternate RewriteCond Rules

    RewriteEngine on
    #Block spambots
    RewriteCond %{HTTP:User-Agent} (?:Alexibot|Art-Online|asterias|BackDoorbot|Black.Hole|BlackWidow|BlowFish|botALot|BuiltbotTough|Bullseye|BunnySlippers|Cegbfeieh|Cheesebot|CherryPicker|ChinaClaw|CopyRightCheck|cosmos|Crescent|Custo|DISCo|DittoSpyder|DownloadsDemon|eCatch|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|ExpresssWebPictures|ExtractorPro|EyeNetIE|FlashGet|Foobot|FrontPage|GetRight|GetWeb!|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula|Harvest|hloader|HMView|httplib|HTTrack|humanlinks|ImagesStripper|ImagesSucker|IndysLibrary|InfonaviRobot|InterGET|InternetsNinja|Jennybot|JetCar|JOCsWebsSpider|Kenjin.Spider|Keyword.Density|larbin|LeechFTP|Lexibot|libWeb/clsHTTP|LinkextractorPro|LinkScan/8.1a.Unix|LinkWalker|lwp-trivial|MasssDownloader|Mata.Hari|Microsoft.URL|MIDownstool|MIIxpc|Mister.PiX|MistersPiX|moget|Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|Navroad|NearSite|NetAnts|NetMechanic|NetSpider|NetsVampire|NetZIP|NICErsPRO|NPbot|Octopus|Offline.Explorer|OfflinesExplorer|OfflinesNavigator|Openfind|Pagerabber|PapasFoto|pavuk|pcBrowser|ProgramsSharewares1|ProPowerbot/2.14|ProWebWalker|ProWebWalker|psbot/0.1|QueryN.Metasearch|ReGet|RepoMonkey|RMA|SiteSnagger|SlySearch|SmartDownload|Spankbot|spanner|Superbot|SuperHTTP|Surfbot|suzuran|Szukacz/1.4|tAkeOut|Teleport|TeleportsPro|Telesoft|The.Intraformant|TheNomad|TightTwatbot|Titan|toCrawl/UrlDispatcher|toCrawl/UrlDispatcher|True_Robot|turingos|Turnitinbot/1.5|URLy.Warning|VCI|VoidEYE|WebAuto|WebBandit|WebCopier|WebEMailExtrac.*|WebEnhancer|WebFetch|WebGosIS|Web.Image.Collector|WebsImagesCollector|WebLeacher|WebmasterWorldForumbot|WebReaper|WebSauger|WebsiteseXtractor|Website.Quester|WebsitesQuester|Webster.Pro|WebStripper|WebsSucker|WebWhacker|WebZip|Wget|Widow|[Ww]eb[Bb]andit|WWW-Collector-E|WWWOFFLE|XaldonsWebSpider|Xenu's|Zeus) [NC]
    RewriteRule .? - [F]

    Block Bad Bots with SetEnvIfNoCase

    ErrorDocument 403 /403.html
    # IF THE UA STARTS WITH THESE
    SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) HTTP_SAFE_BADBOT
    SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.
    SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|gr
    SetEnvIfNoCase ^User-Agent$ .*(libwww-perl|aesop_com_spiderman) HTTP_SAFE_BADBOT
    Deny from env=HTTP_SAFE_BADBOT
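
The Deny from env= line is Apache 2.2 access-control syntax; on Apache 2.4 it is only available through mod_access_compat. Below is a minimal sketch of the same idea using the Require directives from mod_authz_core. It assumes Apache 2.4 and that HTTP_SAFE_BADBOT is set by SetEnvIfNoCase lines like those above; the two-name pattern is a shortened placeholder, not the article's full list.

```apache
# Flag the request when the User-Agent contains a listed name (illustrative subset)
SetEnvIfNoCase ^User-Agent$ .*(alexibot|blackwidow) HTTP_SAFE_BADBOT

# Apache 2.4: grant access to everyone except requests carrying the flag
<RequireAll>
    Require all granted
    Require not env HTTP_SAFE_BADBOT
</RequireAll>
```

A negated Require only works inside a container such as RequireAll, which is why Require all granted appears alongside it.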

    Original Bad Bot / Web Scraper List

    WebBandit, 2icommerce, Accoona, ActiveTouristBot, adressendeutschland, aipbot, Alexibot, Alligator, AllSubmitter, almaden, anarchie, Anonymous, Apexoo, Aqua_Products, asterias, ASSORT, ATHENS, AtHome, Atomz, attache, autoemailspider, autohttp, b2w, bew, BackDoorBot, Badass, Baiduspider, Baiduspider+, BecomeBot, berts, Bitacle, Biz360, Black.Hole, BlackWidow, bladder fusion, Blog Checker, BlogPeople, Blogshares Spiders, Bloodhound, BlowFish, Board Bot, Bookmark search tool, BotALot, BotRightHere, Bot mailto:craftbot@yahoo.com, Bropwers, Browsezilla, BuiltBotTough, Bullseye, BunnySlippers, Cegbfeieh, CFNetwork, CheeseBot, CherryPicker, Crescent, charlotte/, ChinaClaw, Convera, Copernic, CopyRightCheck, cosmos, Crescent

    c-spider, curl, Custo, Cyberz, DataCha0s, Daum, Deweb, Digger, Digimarc, digout4uagent, DIIbot, DISCo, DittoSpyder, DnloadMage, Download, dragonfly, DreamPassport, DSurf, DTS Agent, dumbot, DynaWeb, e-collector, EasyDL, EBrowse, eCatch, ecollector, edgeio, efp@gmx.net, EirGrabber, Email Extractor, EmailCollector, EmailSiphon, EmailWolf, EmeraldShield, Enterprise_Search, EroCrawler, ESurf, Eval, Everest-Vulcan, Exabot, Express, Extractor, ExtractorPro, EyeNetIE, FairAd, fastlwspider, fetch, FEZhead, FileHound, findlinks, Flaming AttackBot, FlashGet, FlickBot, Foobot, Forex, Franklin Locator, FreshDownload, FrontPage, FSurf, Gaisbot, Gamespy_Arcade, genieBot

    GetBot, Getleft, GetRight, GetWeb!, Go!Zilla, Go-Ahead-Got-It, GOFORITBOT, GrabNet, Grafula, grub, Harvest, Hatena Antenna, heritrix, HLoader, HMView, holmes, HooWWWer, HouxouCrawler, HTTPGet, httplib, HTTPRetriever, HTTrack, humanlinks, IBM_Planetwide, iCCrawler, ichiro, iGetter, Image Stripper, Image Sucker, imagefetch, imds_monitor, IncyWincy, Industry Program, Indy, InetURL, InfoNaviRobot, InstallShield DigitalWizard, InterGET, IRLbot, Iron33, ISSpider, IUPUI Research Bot, Jakarta, java/, JBH Agent, JennyBot, JetCar, jeteye, jeteyebot, JoBo, JOC Web Spider, Kapere, Kenjin, Keyword Density, KRetrieve, ksoap, KWebGet, LapozzBot, larbin, leech, LeechFTP, LeechGet

    leipzig.de, LexiBot, libWeb, libwww-FM, libwww-perl, LightningDownload, LinkextractorPro, Linkie, LinkScan, linktiger, LinkWalker, lmcrawler, LNSpiderguy, LocalcomBot, looksmart, LWP, Mac Finder, Mail Sweeper, mark.blonin, MaSagool, Mass, Mata Hari, MCspider, MetaProducts Download Express, Microsoft Data Access, Microsoft URL Control, MIDown, MIIxpc, Mirror, Missauga, Missouri College Browse, Mister, Monster, mkdb, moget, Moreoverbot, mothra/netscan, MovableType, Mozi!, Mozilla/22, Mozilla/3.0 (compatible), Mozilla/5.0 (compatible; MSIE 5.0), MSIE_6.0, MSIECrawler, MSProxy, MVAClient, MyFamilyBot, MyGetRight, nameprotect, NASA Search, Naver, Navroad, NearSite, NetAnts, netattache, NetCarta, NetMechanic, NetResearchServer, NetSpider, NetZIP, Net Vampire, NEWT ActiveX

    Nextopia, NICErsPRO, ninja, NimbleCrawler, noxtrumbot, NPBot, Octopus, Offline, OK Mozilla, OmniExplorer, OpaL, Openbot, Openfind, OpenTextSiteCrawler, Oracle Ultra Search, OutfoxBot, P3P, PackRat, PageGrabber, PagmIEDownload, panscient, Papa Foto, pavuk, pcBrowser, perl, PerMan, PersonaPilot, PHP version, PlantyNet_WebRobot, playstarmusic, Plucker, Port Huron, Program Shareware, Progressive Download, ProPowerBot, prospector, ProWebWalker, Prozilla, psbot, psycheclone, puf, PushSite, PussyCat, PuxaRapido, Python-urllib, QuepasaCreep, QueryN, Radiation, RealDownload, RedCarpet, RedKernel, ReGet, relevantnoise, RepoMonkey, RMA, Rover, Rsync, RTG30, Rufus, SAPO, SBIder, scooter

    ScoutAbout, script, searchpreview, searchterms, Seekbot, Serious, Shai, shelob, Shim-Crawler, SickleBot, sitecheck, SiteSnagger, Slurpy Verifier, SlySearch, SmartDownload, sna-, snagger, Snoopy, sogou, sootle, So-net bat_bot, SpankBot bat_bot, spanner bat_bot, SpeedDownload, Spegla, Sphere, Sphider, SpiderBot, sproose, SQ Webscanner, Sqworm, Stamina, Stanford, studybot, SuperBot, SuperHTTP, Surfbot, SurfWalker, suzuran, Szukacz, tAkeOut, TALWinHttpClient, tarspider, Teleport, Telesoft, Templeton, TestBED, The Intraformant, TheNomad, TightTwatBot, Titan, toCrawl/UrlDispatcher, True_Robot, turingos, TurnitinBot, Twisted PageGetter, UCmore, UdmSearch, UMBC, UniversalFeedParser, URL Control, URLGetFile

    URLy Warning, URL_Spider_Pro, UtilMind, vayala, vobsub, VCI, VoidEYE, VoilaBot, voyager, w3mir, Web Image Collector, Web Sucker, Web2WAP, WebaltBot, WebAuto, WebBandit, WebCapture, webcollage, WebCopier, WebCopy, WebEMailExtrac, WebEnhancer, WebFetch, WebFilter, WebFountain, WebGo, WebLeacher, WebMiner, WebMirror, WebReaper, WebSauger, WebSnake, Website, WebStripper, WebVac, webwalk, WebWhacker, WebZIP, Wells Search, WEP Search 00, WeRelateBot, Wget, WhosTalking, Widow, Wildsoft Surfer, WinHttpRequest, WinHTTrack, WUMPUS, WWWOFFLE, wwwster, WWW-Collector, Xaldon, Xenu's, Xenus, XGET, Y!TunnelPro, YahooYSMcm, YaDirectBot, Yeti, Zade, ZBot, zerxbot

    Zeus, ZyBorg

    April 8th, 2008

    Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/), just credit with a link. This site is not supported or endorsed by The Apache Software Foundation (ASF). All software and documentation produced by The ASF is licensed. "Apache" is a trademark of The ASF. NCSA HTTPd (http://hoohoo.ncsa.illinois.edu/). UNIX is a registered Trademark of The Open Group (http://www.opengroup.org/). POSIX is a registered Trademark of The IEEE (http://standards.ieee.org/).