Googling for Gold: Web Crawlers, Hacking and Defense Explained

Richard Ford, PhD, and Helayne Ray, Center for Information Assurance, Florida Institute of Technology

Information leakage from a company can occur in many different ways. In this article, the problem of confidential data becoming available on the World Wide Web is examined. Companies are recommended to put in place a regular audit of the information available about themselves online. Once this data is gathered, it can be used to plug procedural weaknesses as well as to deny confidential and proprietary data to hackers and competitors.


Introduction

Imagine your company’s secrets available to competitors at the touch of a button. Imagine your internal costing and five-year plan available for all to see – even cached in a search engine! Such a scenario may seem far-fetched, but in fact such incidents happen each and every day on the WWW.

In this article, some of the ways in which a hacker or competitor might examine a company’s information assets online are examined. Although tools like whois and “looking glasses” can provide lots of interesting data, the focus of this article is the inadvertent disclosure of information via the Web. While this risk is not particularly new, it is one that appears to be growing: in our experiments, most companies we targeted had information that we believe was not intentionally published to the Web, and that was at least sensitive, if not confidential or secret.

Part of the reason for this explosion in the publication of confidential data is the ease and flexibility of Web publishing. When sharing documents or sending information such as presentations to various groups both inside and outside a company, it is often tempting to place materials on the Web; not necessarily on a company’s “public” Web site, but sometimes on a different “private” server. Unless this location is properly secured, such a practice is a lot like having a conversation in a crowded room: while the conversation may feel private, the privacy relies on others not listening in. Similarly, if one places confidential documents on the Web, their security is a matter of chance rather than design.

All of the above sounds obvious, yet it is easy to find examples, using any search engine, of large companies who have reams of sensitive data available to anyone who would look. In this paper, we chose to use the Google search engine for our examples – not because Google is particularly intrusive in any way, but simply because it is our personal favourite search engine, and has a handy page-caching feature which can assist in our task.

Wearing our black hats

Before entering into a full discussion of the specific threat, it is important to establish the context of this article. One of the most important lessons in corporate security is understanding the mind of the adversary… that is, the person or group from which you are trying to protect the enterprise. While this topic alone is enough for an entire article, for the purposes of this discussion it is sufficient to consider two different groups of people: those who are trying to break into any computer, and those who are both highly skilled and specifically attempting to break into your company’s machines.

In the former case, the preventatives outlined in this article will not actually add a great deal of protection; those intent on simply breaking in whenever they have time to kill seem quite content scanning a few million IP addresses, looking for vulnerable machines. Such attackers carry out little research on their target, as they do not have a specific target in mind. However, even these “script kiddies” use the Web to search for Web password files left in readable directories, or other configuration errors.

However, it is against the determined, targeted attacker that the real risk of Web publication becomes clear. Let us assume that one has done due diligence: a good firewall is installed and well configured; users are trained, and one’s defenses are robust. What steps might a determined adversary take against the network? Well, like any good campaign, the battle begins not with a portscan, but with a much more subtle approach: intelligence gathering.

While the task of intelligence gathering may seem daunting at first glance, further reflection reveals many different sources of information about a company, ranging from the fairly benign (for example, domain registration information) to the dangerous (the company’s trash, for example). For the sake of brevity, we will focus on just one of these mechanisms: Web search engines. However, it is our belief that all these other channels of information leakage are just as dangerous, if not as convenient to the attacker.

Crawlers and history

Web search engines have become a part of daily life. In fact, they are such a powerful tool that when researching a new topic, the first step is to “google” for the information in order to get a brief overview. Most subjects and words provide a raft of useful information, and most users spend little time thinking about how that information is generated.

As one might imagine, search engines generally gather data by visiting websites. This process, known as Web crawling, is very effective, and can map large parts of Web space in a fairly automated manner.

Crawler-based search engines start with a software program called a “spider” or “crawler”, which visits a Web page, reads it, and then follows links to other pages within the site. This program is run on a regular basis to keep up with changes made to websites. The contents of every Web page are recorded in the index of a search engine such as Google, giving the engine the ability to sift through the millions of pages it has recorded and find matches for a keyword search.
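
To make the crawling process concrete, the following is a minimal sketch of such a spider in Python. This is our illustration rather than anything from a real search engine: the seed URL and the page budget are hypothetical placeholders, and a production crawler would add politeness delays, robots.txt checks and a real index.

    # Toy crawler sketch (illustrative only): fetch a page, record its text in a
    # dictionary standing in for the index, and follow its links breadth-first.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        seen, queue, index = set(), deque([seed]), {}
        while queue and len(index) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue  # unreachable page; a real crawler would retry later
            index[url] = html  # a real engine would tokenize and rank this text
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith(("http://", "https://")):
                    queue.append(absolute)  # follow links to other pages
        return index

    # Hypothetical seed URL:
    # pages = crawl("http://www.example.com/", max_pages=5)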

Web crawling is not new; spiders have been around since the early days of the Web, primarily the province of search engines. As John M. Fisher of CIAC (the Department of Energy emergency security response team) said, search engines are so efficient at gathering information of all sorts that:

“Perhaps these search engines have become a little too powerful. Some websites have provided a few too many links, including information related to system configuration. For example, doing a search for keywords such as ‘root’, ‘daemon:’, ‘passwd’, etc., will return back a few stray /etc/passwd or /etc/group files, located on systems that have very poor Web configurations. This is a way to quickly find those Web sites that are not maintained very well and are most vulnerable to attack.” (CIAC bulletin, March 1996)

Such warnings periodically recur; a recent article in New Scientist echoed this concern to a wider audience, yet the message still does not seem to be filtering through to those who need to hear it. Hackers no longer need to visit a website to attack it; instead, they can get all the information they need from pages cached by Google. And since the information is cached, hackers can access it without alerting a webmaster, even if the data has already been removed from the Web (according to several sources, the average life span of a Web page is 100 days, but cached pages persist much longer).

Since technology has made the Web an integral and necessary part of business operations, hackers are using this technique to find confidential information which they can use as a backdoor into a company’s innermost secrets. The hackers rely on the fact that websites sometimes contain hidden links to sensitive pages which contain the usernames and passwords required to access secure parts of the site. Since the spider has “crawled” through the Web and recorded the contents of all linked pages, sensitive or not, the hacker only has to use well-chosen keywords with a search engine to find confidential information.

One well-documented exploit is to search for the keyword “bash history”. Bash is an acronym for Bourne Again Shell, an operating system shell for Unix, and it stores a history of the commands it executes. If a spider can find a link to a file called “bash history”, it will index it, and Google may add it to its cache. If a hacker searches for this file, they may be rewarded with recently used user IDs and passwords. Similarly, one can search for “private” files used by e-commerce applications or other applications (such as ICQ). When encountered, these files can be downloaded and their contents exploited. In preparation for this article, we found many files that contained usernames and encrypted passwords plainly visible online, as well as several misconfigured deployments of Microsoft FrontPage extensions. The trick is knowing exactly what to search for.
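
The same trick can be turned around for defence: before a spider finds them, an administrator can probe their own Web root for the kinds of files described above. The sketch below is ours, not the authors’; the host name and the path list are hypothetical examples (the FrontPage service.pwd path reflects the misconfigured extensions mentioned above) and should be replaced with paths relevant to your own applications.

    # Defensive sketch: check whether commonly exposed files are readable on a
    # site you own. The base URL and path list are hypothetical examples.
    from urllib.error import HTTPError, URLError
    from urllib.request import urlopen

    SENSITIVE_PATHS = [".bash_history", ".htpasswd", "_vti_pvt/service.pwd"]

    def audit_site(base="http://www.example.com/"):
        for path in SENSITIVE_PATHS:
            url = base + path
            try:
                with urlopen(url, timeout=5) as response:
                    print(f"EXPOSED ({response.status}): {url}")
            except HTTPError:
                pass  # 403/404 and the like: not world-readable, which is the goal
            except URLError:
                print(f"could not reach {url}")

    audit_site()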

Another approach often taken by hackers is to search for directories that are world-readable – that is, those directories that do not contain a default Web page, but allow directory listing. This is trivially accomplished by searching for the words “Index of” and “Name Last Modified Size Description”, plus whatever target keyword the hacker is searching for. This approach can yield very interesting information.
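
Again, the telltale strings are useful defensively: a short script can fetch a directory URL on a server you own and flag it if the response looks like an automatic listing. A sketch, assuming an Apache-style listing format and a hypothetical URL:

    # Sketch: flag a directory that returns an automatic listing instead of a
    # default page. "Index of" and "Parent Directory" are the markers that
    # Apache-style listings emit, and that attackers search for.
    from urllib.request import urlopen

    def has_directory_listing(url):
        page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        return "Index of" in page and "Parent Directory" in page

    print(has_directory_listing("http://www.example.com/files/"))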

Arachnophobia

Despite the fact that the problem is not new, a few minutes of playing with one’s favourite search engine can provide information ranging from amusing to, at least in security terms, horrifying. Based on searches conducted in preparation for this article, several well-known companies are either not aware of this issue or are blasé about the information we discovered. In the following examples, the company names and details have been changed to protect the naïve; however, these examples are all pulled from personal experience and research.

One of the first things that one can do using Web crawlers is search for items that really have no business being on the Web in the first place. In our searches, we encountered internal security policies, internal sales price lists and even five-year strategic plans. While the latter cases impact corporate security indirectly, and are a marvelous (though legally questionable) source of business intelligence, the former example has a direct impact on security.

Examples of simple searches using certain keywords:

• Search for “bash history”
• Search for world-readable directories
• Search for “Company X” followed by “Confidential” or “Proprietary”

Just as easily, a hacker can gain confidential information by using keywords such as the name of your company followed by the word Confidential, or Proprietary. We tested this on two well-known software companies, using their name followed by the word Confidential as the keyword. Within five minutes, we found 10 sites for each. Upon closer examination of the content of these sites, we determined that roughly one third of the documents marked Confidential were probably public articles that had been incorrectly labelled “confidential”. The remaining documents, however, did appear to be genuinely proprietary in nature; good manners and etiquette prevent us from disclosing too much of their content, but it should suffice to say that in at least two cases, we can think of no legitimate reason why a company would ever intentionally publish such information to the general public.

Among other gems stored online was a demo intranet server that included a complete description of the company’s VPN; while the actual server was no longer available, Google came to the rescue: despite not being able to return images, the text of the page was still cached though the site itself was long gone. Other items of interest were internal pricing schemes (want to know the real dealer cost of a product?), a raft of sales presentations and even a five-year strategic plan for a well-known company!

One interesting aspect of all this searching was that it illustrated the power of the net. Not only were problems found at company sites, but companies were also exposed due to the carelessness of business partners. Various consultancy firms have inadvertently had their private customer repositories crawled, leading to beta versions of documents and proposals becoming public; such documents could make extremely interesting reading for competitors and investors! So much information is now available online with a (false) expectation of privacy that it is easy to think of examples where you yourself may have placed confidential information on the Web for easy download. Did you remove it when you were finished? Did a Web crawler inadvertently snag it while it was up there?

Countermeasures

Given the overall scale of the problem, it seems clear that many companies are not taking the problem of Web information leakage seriously: despite evidence to the contrary, the risk of exploitation seems to be either underestimated or ignored by administrators.

This underestimation may be due to some of the difficulties faced in determining actual damage. Linking a particular attack with Web-crawled data is difficult (if even possible), and measuring the cost of information leakage is also challenging. However, given the ease with which it is possible to monitor and manage published data, it seems reasonable that any company should engage in a regular Web content security audit. Once crawled information is gathered, various steps can be taken to remove or modify the offending data (or business partner!).

Based upon our experiments, much information about a company can be gathered by searching for simple combinations such as “Company X Confidential” or “Company X Proprietary”. Even the legalese in confidential documents can be used to track down other similar documents once one example is found! Thus, it is a good exercise to attempt to purge the Web of offending content. Carry out searches on the popular Web search engines on a regular basis, exactly like a hacker might; if this has not been part of your regular audit process, you may be very surprised by what you find!
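
As a starting point for such an audit, the queries can be generated mechanically and pasted into the search engine of your choice on a schedule. A small sketch; the company name and marker words are placeholders rather than anything taken from our experiments:

    # Sketch: generate the audit queries recommended above. COMPANY and MARKERS
    # are hypothetical placeholders; adapt them to your organisation.
    COMPANY = "Company X"
    MARKERS = ["Confidential", "Proprietary"]

    def audit_queries(company):
        queries = [f'"{company}" "{marker}"' for marker in MARKERS]
        # World-readable directories that mention the company name:
        queries.append(f'"Index of" "Name Last Modified Size Description" "{company}"')
        return queries

    for query in audit_queries(COMPANY):
        print(query)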

In addition to this, look for directory listings in Google from domains you own, or that contain your company name. By making good use of the Boolean functions of Web search engines, the large number of search hits can be culled down to just a handful that require a second (and more time-consuming) manual review. Strong knowledge of the Web applications that your company provides will also help in searching for critical files that should not be readable.

If (or, if this is your first experiment, when) offending content is found, several of the search engines are very flexible in aiding content removal. For example, Google offers several ways to remove content (see http://www.google.com/remove.html). According to Google, your webmaster first inserts the appropriate meta tags into the HTML code of the pages to be removed. To prevent only Google’s robots from caching the page, use the following tag: <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">; otherwise, use <META NAME="ROBOTS" CONTENT="NOARCHIVE">. This tag only removes the cached link for the page the next time Google crawls your site. If your request is urgent, you must create a robots.txt file and provide Google with the URL to it, and they will remove it within 24 hours (see http://services.google.com:8882/urlconsole/controller?cmd=robots).
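
Once a robots.txt file is in place, it is worth verifying that it says what you think it says. Python’s standard library includes a robots.txt parser that can answer, for a given user agent, whether a URL may be fetched; a minimal sketch with hypothetical URLs:

    # Sketch: verify what your robots.txt actually tells a crawler, using the
    # standard-library parser. All URLs here are hypothetical examples.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("http://www.example.com/robots.txt")
    robots.read()  # fetches and parses the live robots.txt

    # Would Google's crawler be allowed to fetch this private page?
    print(robots.can_fetch("Googlebot", "http://www.example.com/private/plan.html"))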

Conclusion

The problem of the publication of confidential or sensitive information crawled by a search engine is far from new, yet it appears that security practitioners are not actively monitoring this threat. Such a situation is unacceptable and dangerous, as the impact can range from merely embarrassing to critical information disclosure. In this article we have illustrated the current scope of the threat, and suggested a number of steps that can be taken to ameliorate disclosure both by the company and by third parties related to the company. Go Google your online information, and you may be surprised by what you find.

About the author

Dr. Richard Ford is an internationally acknowledged computer security expert, who lectures worldwide on malicious software and viruses, computer security, and the integration of security and technology. Before being appointed Research Professor at Florida Institute of Technology’s Center for Information Assurance, Dr. Ford worked in a variety of high-technology roles, at Cenetec, Verio and IBM Research.

Author contact

Email: rford, [email protected]

Trends, codes and virus attacks – 2003 year in review

Roger Levenhagen, managing director of Trend Micro UK

A shortlist of the most memorable events of 2003 will undoubtedly include the demise of Saddam Hussein’s regime, but what will surely be omitted from most top 10s is the continuing threat posed by malware – viruses and other malicious code – and the nuisance of spam messaging. These omissions are made despite worrying developments in mass-mail worms, blended threats and increasing spam complexity.

For instance, the most noted mass attack of 2003 occurred at the beginning of the year, when SQL Slammer became the fastest-spreading malware ever. The attack targeted Microsoft SQL Server 2000, executing a denial-of-service attack that resided in the memory of affected Microsoft SQL servers. This meant that antivirus scanners, which were unable to support memory scans, failed to detect the code. By some estimates, the global spread of SQL Slammer was completed in less than 10 minutes.

Indeed, despite ongoing developments and improvements to content management solutions, including antivirus, firewall and other security innovations, the ferocity of the latest batch of codes is a threat which cannot be ignored. As we enter 2004, vigilance remains a priority if digital information is to remain secure. Whether at Internet gateways or on home desktops, users will need to exercise extreme caution in light of increasingly effective methods of infiltration. Yet, however concerning, last year’s viral developments simply act as a precursor to the malware that will shape 2004. This article will consider the trends that determined viral activity last year, and those which will influence the year to come.

Malware essentials

Computer viruses have come a long way since they were first described by Fred Cohen, then a New Haven professor, in a 1984 research paper entitled “Experiments with Computer Viruses”. In it he warns: “Viral attacks appear to be easy to develop… can be designed to leave few if any traces in most current systems, are effective against modern security policies for multi-level usage, and require only minimal expertise to implement. Their potential threat is severe and they can spread very quickly through a computer system.”

This rather prescient summary became increasingly familiar as malware emerged from its floppy-disk origins into today’s mass-mailing world. Back then, viruses simply attacked one file at a time until they were detected and removed by antivirus products. Their modern-day equivalents reflect an inter-connected digital society, where viruses utilise users’ bandwidth to propagate malware to a much wider audience, at a much faster rate. For instance, statistics collected by Trend Micro show that between 2001 and 2003, 100% of viral outbreaks had Internet worm-like characteristics.

These figures are part of a wider study that outlines the broad range of viral activities that took place in 2003. The comprehensive study shows that outbreaks caused by script and macro viruses have seemingly lost their potency, virtually disappearing by 2003.

In addition, Trend Micro’s data indicates that Internet relay chat (IRC) re-emerged in 2003 as a potent vector of malware distribution, especially for infestations that appear void of any worming capabilities. For example, in a corporate setting, the act of connecting to rogue chat servers via a company’s network can often facilitate the spread of catastrophic viral infections that would otherwise have little chance of dissemination.

The investigation also shows that mass-mailing worms have dominated malware activity over the last 12 months, despite the best efforts of developers to improve attachment filters. This kind of insurgence has been confirmed as an extremely effective distribution method, especially when used in conjunction with social engineering tactics that entice users to click and execute attachments.