How Google Works - Baseline


Executive Summary

The way the company manages information technology is rarely discussed outside its walls. But Google's approach to information management says a lot about how it runs so effectively and efficiently, and may be the way other organizations deploy technology in the future. Here's how.


How Google Works: A Dissection

By David F. Carr

With his unruly hair dipping across his forehead, Douglas Merrill walks up to the lectern set up in a ballroom of the Arizona Biltmore Resort and Spa, looking like a slightly rumpled university professor about to start a lecture. In fact, he is here on this April morning to talk about his work as director of internal technology for Google to a crowd of chief information officers gathered at a breakfast sponsored by local recruiting firm Phoenix Staffing.

Google, the secretive, extraordinarily successful $6.1 billion global search engine company, is one of the most recognized brands in the world. Yet it selectively discusses its innovative information management infrastructure, which is based on one of the largest distributed computing/grid systems in the world.

Merrill is about to give his audience a rare glimpse into the future according to Google, and explain the workings of the company and the computer systems behind it.

For all the razzle-dazzle surrounding Google (everything from the press it gets for its bring-your-dog-to-work casual workplace, to its stock price, market share, dizzying array of beta product launches and its death-match competition with Microsoft), it must also solve more basic issues like billing, collection, reporting revenue, tracking projects, hiring contractors, recruiting and evaluating employees, and managing videoconferencing systems; in other words, common business problems.

But this does not mean that Google solves these problems in a conventional way, as Merrill is about to explain.

"We're about not ever accepting that the way something has been done in the past is necessarily the best way to do it today," he says.

Among other things, that means that Google often doesn't deploy standard business applications on standard hardware. Instead, it may use the same text parsing technology that drives its search engine to extract application input from an e-mail, rather than a conventional user interface based on data entry forms. Instead of deploying an application to a conventional server, Merrill may deploy it to a proprietary server-clustering infrastructure that runs across its worldwide data centers.

Google runs on hundreds of thousands of servers (by one estimate, in excess of


450,000), racked up in thousands of clusters in dozens of data centers around the world. It has data centers in Dublin, Ireland; in Virginia; and in California, where it just acquired the million-square-foot headquarters it had been leasing. It recently opened a new center in Atlanta, and is currently building two football-field-sized centers in The Dalles, Ore.

By having its servers and data centers distributed geographically, Google delivers faster performance to its worldwide audience, because the speed of the connection between any two computers on the Internet is partly a factor of the speed of light, as well as delays caused by network switches and routers. And although search is still Google's big moneymaker, those servers are also running a fast-expanding family of other applications like Gmail, Blogger, and now even Web-based word processors and spreadsheets.

That's why there is so much speculation about Google the Microsoft-killer, the latest firm nominated to drive everything to the Web and make the Windows desktop irrelevant. Whether or not you believe that, it's certainly true that Google and Microsoft are banging heads. Microsoft expects to make about a $1.5 billion capital investment in server and data structure infrastructure this year. Google is likely to spend at least as much to maintain its lead, following an $838 million investment in 2005.

And at Google, large-scale systems technology is all-important. In 2005, it indexed 8 billion Web pages. Meanwhile, its market share continues to soar. According to a recent ComScore Networks qSearch survey, Google's market share for search among U.S. Internet users reached 43% in April, compared with 28% for Yahoo and 12.9% for The Microsoft Network (MSN). And Google's market share is growing; a year ago, it was 36.5%. The same survey indicates that Americans conducted 6.6 billion searches online in April, up 4% from the previous month. Google sites led the pack with 2.9 billion search queries performed, followed by Yahoo sites (1.9 billion) and MSN-Microsoft (858 million).

This growth is driven by an abundance of scalable technology. As Google noted in its most recent annual report filing with the SEC: "Our business relies on our software and hardware infrastructure, which provides substantial computing resources at low cost. We currently use a combination of off-the-shelf and custom software running on clusters of commodity computers. Our considerable investment in developing this infrastructure has produced several key benefits. It simplifies the storage and processing of large amounts of data, eases the deployment and operation of large-scale global products and services, and automates much of the administration of large-scale clusters of computers."

Google buys, rather than leases, computer equipment for maximum control over its infrastructure. Google chief executive officer Eric Schmidt defended that strategy in a May 31 call with financial analysts. "We believe we get tremendous competitive advantage by essentially building our own infrastructures," he said.

Google does more than simply buy lots of PC-class servers and stuff them in racks, Schmidt said: "We're really building what we think of internally as supercomputers."

Because Google operates at such an extreme scale, it's a system worth studying, particularly if your organization is pursuing or evaluating the grid computing strategy, in which high-end computing tasks are performed by many low-cost computers working in tandem.

Despite boasting about this infrastructure, Google turned down requests for interviews with its designers, as well as for a follow-up interview with Merrill. Merrill did answer questions during his presentation in Phoenix, however, and the division of the company that sells the Google Search Appliance helped fill in a few blanks.

In general, Google has a split personality when it comes to questions about its back-end systems. To the media, its answer is, "Sorry, we don't talk about our infrastructure." Yet, Google engineers crack the door open wider when addressing computer science audiences, such as rooms full of graduate students whom it is interested in recruiting. As a result, sources for this story included technical presentations available from the University of Washington Web site, as well as other technical conference presentations, and papers published by Google's research arm, Google Labs.

What Other CIOs Can Learn

While few organizations work on the scale of Google, CIOs can learn much from how the company deploys its technology and the clues it provides to the future of technology. Even if Google falters as a company, or competitors and imitators eventually take away its lead, the systems architecture it exemplifies will live on. "Google is a harbinger in the same way that in the early 1960s, a fully loaded IBM 1460 mainframe gave us a taste of more powerful computers to come," says Baseline columnist Paul A. Strassmann.

Strassmann, whose career includes senior information-technology management roles at government agencies including the Department of Defense, has been advising the military that it needs to become more like Google. This recommendation is partly inspired by his admiration for the company's skill at rapidly deploying computing power throughout the world as it's done the past few years.


Player Roster

Douglas Merrill
Vice President, Engineering, and Senior Director of Information Systems
Merrill plays the role of chief information officer at Google, managing both traditional enterprise systems and the applications developed by Google engineers for internal use. A student of social and political organization who holds a Ph.D. in psychology from Princeton, Merrill worked on computer simulations of education, team dynamics and organizational effectiveness for the Rand Corp. and went on to serve as a senior vice president at Charles Schwab, where he was responsible for, among other things, information security and the company's common infrastructure.

Eric Schmidt
Chairman and CEO
Schmidt shares responsibility for Google's day-to-day management with Page and Brin. Schmidt, who was previously the chief executive officer of Novell and, before that, chief technology officer of Sun Microsystems, was brought into the company in 2001 to give Google a more seasoned technology leader.

Dave Girouard
Vice President and General Manager, Google Enterprise
Girouard leads the division responsible for adapting Google technologies for the enterprise market, with the Google Search Appliance and Google Mini utilities as its primary products. Girouard comes from multimedia search company Virage and previously was a systems consultant for Booz Allen Hamilton and Accenture (then known as Andersen Consulting).

W.M. Coughran Jr.
Vice President of Engineering for Systems Infrastructure
Coughran is responsible for large-scale distributed computing programs underlying Google's products. Since joining Google in early 2003, he has refined the teams that handle Web crawling, storage and other systems.

Urs Hölzle
Senior Vice President, Operations, and Google Fellow
Hölzle was Google's first engineering vice president and led the development of the company's operational infrastructure. He now focuses on keeping it running and improving its performance.

Vinton G. Cerf
Vice President and Chief Internet Evangelist
A father of the Internet, he is responsible for identifying new enabling technologies and applications for Google.

Terry Winograd
Professor and director of the Human-Computer Interaction Group, Stanford University
An early adviser to the Google co-founders, Winograd joined them in authoring an academic paper on PageRank and worked for the company in 2002 and 2003, while on leave from Stanford. As a specialist in human interaction with computers, he contributed to Google's user interface design in areas such as the display of search results.

Larry Page
Co-Founder and President, Products
Sergey Brin
Co-Founder and President, Technology
Page devised Google's original search relevancy ranking algorithm, PageRank, while working toward a Ph.D. in computer science from Stanford. He was also heavily involved in the design of Google's early distributed computer design and server racks. Brin's research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data. The two met while Ph.D. candidates at Stanford, where they collaborated on the Web indexing project that became Google.

"Clearly, they have the largest [computer] network anywhere," he says.

Among other things, Google has developed the capability to rapidly deploy prefabricated data centers anywhere in the world by packing them into standard 20- or 40-foot shipping containers, Strassmann says. That's just the sort of capability an army on the move would like to have at its disposal to provide tactical support for battle or relief operations.

According to Strassmann, the idea of portable data centers has been kicking around the military for years, but today exists mostly as a collection of PowerPoint slides. And Google-like rapid access to information might be just what's needed in a war zone, where reports from the field too often pile up unread, he says. The military equivalent of Googling your competition would be for a field officer to spend a few seconds typing a query into a mobile device to get the latest intelligence about a hostile village before the troops move in. "It's Google skinned down into the hands of a Marine," he says.

Google's creation of the shipping container data centers was also reported in November by Triumph of the Nerds author Robert X. Cringely in his blog at Pbs.org. He describes it as a system of 5,000 Opteron processors and 3.5 petabytes of disk storage that can be dropped off overnight by a tractor-trailer rig. (A petabyte is a quadrillion bytes, the next order of magnitude up from a terabyte, which is a trillion.) The idea is to plant one of these puppies anywhere Google owns access to fiber, basically turning the entire Internet into a giant processing and storage grid. Google will not comment on these reports.

Greg Linden, one of the architects of Amazon.com's early systems, is fascinated by what Google has created but wary of the hype it sometimes attracts. "Google definitely has been influential, changing how many companies think about the computers that power their systems," says Linden, now CEO at Findory.com, a personalized news site he founded. "But it might be going a bit far to say they are leading a revolution by themselves."

Although Google's requirements might seem so exotic that few other organizations would need anything like its technology stack, Linden says he spoke with a hedge fund technology manager who said how much he would love to employ the distributed computing software behind Google's data centers to run data-intensive market simulations. And Linden says BigTable, Google's system for managing very large databases, sounds like something he would have loved to have had at Amazon.

Google Base Case

Headquarters: 1600 Amphitheatre Pkwy., Mountain View, CA 94043

Phone: (650) 253-0000

Business: Web search and supporting information products

Chief Information Officer: Douglas Merrill, vice president, engineering, and senior director of information systems

2005 Financials: $6 billion revenue; $1.46 billion net profit

Challenge: Deliver technologies to organize the world's information and make it universally accessible and useful.

Baseline Goals:

• Build the systems and infrastructure to support a global, $100 billion company.

• Expand data center infrastructure, which, as measured by property and equipment, more than doubled, from $379 million in 2004 to $962 million last year.

• Maintain high productivity as measured by revenue per employee, which ranged from $1.38 million to $1.55 million in 2005.


It is, however, something that Merrill can tap into at will. One of five engineering vice presidents, Merrill is Google's senior director of information systems; effectively, the CIO.

For some basic corporate functions like financial management, Merrill chooses the same kind of technologies you would find in any other corporate data center. On the other hand, Google doesn't hesitate to create applications for internal use and put them on its own server grid. If the company's software engineers think they can tinker up something to make themselves more productive (for example, a custom-built project tracking system), Merrill doesn't stand in their way.

The Start of the Story

Google started with a research project into the structure of the Web led by two Stanford University Ph.D. candidates, Larry Page and Sergey Brin. After initially offering to sell the search engine they had created to an established firm, such as Yahoo, but failing to find a buyer, they established Google in 1998 to commercialize the technology. For the first few years of the company's existence, the co-founders were determined to avoid making money through advertising, which they thought would corrupt the integrity of search results. But when their initial technology licensing business model fell flat, they compromised. While keeping the Google home page ad-free, they began inserting text ads next to their search results, with ad selection driven by the search keywords. The ads were clearly separated from the search results, but highly effective because of their relevance. At pennies per click, the ad revenue pouring into Google began mounting quickly.

When Google went public in 2004, analysts, competitors and investors were stunned to learn how much money the company with the ad-free home page was raking in: $1.4 billion in the first half of 2004 alone. Last year, revenue topped $6 billion.

Google and its information-technology infrastructure had humble beginnings, as Merrill illustrates early in his talk with a slide-show photo of Google.stanford.edu, the academic research project version of the search engine from about 1997, before the formation of Google Inc., when the server infrastructure consisted of a jumble of PCs scavenged from around campus.

"Would any of you be really proud to have this in your data center?" Merrill asks, pointing to the disorderly stack of servers connected by a tangle of cables. But this is the start of the story, he adds, part of an approach that says don't necessarily do it the way everyone else did. "Just find some way of doing it cheap and effectively, so we can learn."

The basic tasks that Google had to perform in 1997 are the same it must perform today. First, it scours the Web for content, crawling from one page to another, following links. Copies of pages it finds must be stored in a document repository, along with their Internet addresses, and further analyzed to produce indexes of words and links occurring in each document. When someone types keywords into Google, the search engine compares them against the index to determine the best matches and displays links to them, along with relevant excerpts from the cached Web documents. To make all this work, Google had to store and analyze a sizable fraction of all the content on the Web, which posed both technical and economic challenges.

By 1999, the Google.com search engine was running in professionally managed Internet data centers run by companies like Exodus. But the equipment Google was parking there was, if anything, more unconventional, based on hand-built racks with corkboard trays. The hardware design was led by Page, a natural engineer who once built a working ink-jet printer out of Legos. His team assembled racks of bare motherboards, mounted four to a shelf on corkboard, with cheap no-name hard drives purchased at Fry's Electronics. These were packed close together ("like blade servers before there were blade servers," Merrill says). The individual servers on these racks exchanged and replicated data over a snarl of Ethernet cables plugged into Hewlett-Packard network switches. The first Google.com production system ran on about 30 of these racks. You can see one for yourself at the Computer History Museum, just a few blocks away from Google's Mountain View headquarters.

Part of the inspiration behind this tightly packed configuration was that, at the time, data centers were charging by the square foot, and Page wanted to fit the maximum computer power into the smallest amount of space. Frills like server enclosures around the circuit boards would have just gotten in the way.

This picture makes the data center manager in Merrill shudder. "Like, the cable management is really terrible," he says. "Why aren't cables carefully color-coded and tied off? Why is the server numbering scheme so incoherent?"

Base Technologies

Google primarily relies on its own internally developed software for data and network management, and has a reputation for being skeptical of "not invented here" technologies, so relatively few vendors can claim it as a customer.

Application | Product | Supplier
Distributed file system | Google File System | Google proprietary
Distributed scheduling | Global Work Queue | Google proprietary
Very large database management systems | BigTable; Berkeley DB | Google proprietary; Sleepycat Software/Oracle
Server operating system | Red Hat Linux (with kernel-level modifications by Google) | Red Hat, Google
Web protocol acceleration | NetScaler Application Delivery | Citrix Systems
Web content translation | Rosette Language Analyzers for Chinese, Japanese and Korean (used in combination with Google proprietary translation technology) | Basis Technology
File conversion and content extraction | Outside In | Stellent

Google's primary programming languages include C/C++, Java and Python. Guido van Rossum, Python's creator, went to work for Google at the end of 2005. The company also has created Sawzall, a special-purpose distributed computing job preparation language.


The computer components going into those racks were also purchased for rock-bottom cost, rather than reliability. Hard drives were of a poorer quality than you would put into your kid's computer at home, Merrill says, along with possibly defective memory boards sold at a fire-sale discount. But that was just part of the strategy of getting a lot of computing power without spending a lot of money. Page, Brin and the other Google developers started with the assumption that components would fail regularly and designed their search engine software to work around that.

Google's co-founders knew the Web was growing rapidly, so they would have to work very hard and very smart to make sure their index of Web pages could keep up. They pinned their hopes on falling prices for processors, memory chips and storage, which they believed would continue to fall even faster than the Web was growing. Back at the very beginning, they were trying to build a search engine on scavenged university resources. Even a little later, when they were launching their company on the strength of a $100,000 investment from Sun Microsystems co-founder Andy Bechtolsheim, they had to make their money stretch as far as possible. So, they built their system from the cheapest parts money would buy.

At the time, many dot-com companies, flush with cash, were buying high-end Sun servers with features like RAID hard drives. RAID, for redundant arrays of independent disks, boosts the reliability of a storage device through internal redundancy and automatic error correction. Google decided to do the same thing by different means: it would make entire computers redundant with each other, making many frugally constructed computers work in parallel to deliver high performance at a low cost.

By 1999, when Google received $25 million in venture funding, the frugality that had driven the early systems design wasn't as much of a necessity. But by then, it had become a habit.

Later Google data centers tidied up the cabling, and corkboard (which turned out to pose a fire hazard) vanished from the server racks. Google also discovered that there were downsides to its approach, such as the intense cooling and power requirements of densely packed servers.

In a 2003 paper, Google noted that power requirements of a densely packed server rack could range from 400 to 700 watts per square foot, yet most commercial data centers could support no more than 150 watts per square foot. In response, Google was investigating more power-efficient hardware, and reportedly switched from Intel to AMD processors for this reason. Google has not confirmed the choice of AMD, which was reported earlier this year by Morgan Stanley analyst Mark Edelstone.

Although Google's infrastructure has gone through many changes since the company's launch, the basic clustered server rack design continues to this day. "With each iteration, we got better at doing this, and doing it our way," Merrill says.

A Role Model

For Google, the point of having lots of servers is to get them working in parallel. For any given search, thousands of computers work simultaneously to deliver an answer.

For CIOs to understand why parallel processing makes so much sense for search, consider how long it would take for one person to find all the occurrences of the phrase "service-oriented architecture" in the latest issue of Baseline. Now assign the same task to 100 people, giving each of them one page to scan, and have them report their results back to a team leader who will compile the results. You ought to get your answer close to 100 times faster.

On the other hand, if one of the workers on this project turns out to be dyslexic, or suddenly drops dead, completion of the overall search job could be delayed or return bad results. So, when designing a parallel computing system, particularly one like Google's, built from commodity components rather than top-shelf hardware, it's critical to build in error correction and fault tolerance (the ability of a system to recover from failures).

Some basic concepts Google would use to scale up had already been defined by 1998, when Page and Brin published their paper on "The Anatomy of a Large-Scale Hypertextual Web Search Engine." In its incarnation as a Stanford research project, Google had already collected more than 25 million Web pages. Each document was scanned to produce one index of the words it contained and their frequency, and that, in turn, was transformed into an inverted index relating keywords to the pages in which they occurred.
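As a rough illustration of the idea (a minimal sketch, not Google's code; the pages and tokenization rule below are invented for the example), an inverted index can be built in a few lines of Python:

    # Illustrative sketch only: build an inverted index mapping each word
    # to the pages it appears on, with per-page occurrence counts.
    from collections import defaultdict
    import re

    pages = {  # hypothetical crawled documents
        "http://example.com/a": "grid computing uses many cheap computers",
        "http://example.com/b": "Google built its search on cheap computers",
    }

    inverted_index = defaultdict(dict)  # word -> {url: count}
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            inverted_index[word][url] = inverted_index[word].get(url, 0) + 1

    # Answering a query is then a lookup rather than a scan of every document.
    print(inverted_index["computers"])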

But Google did more than just analyze occurrences of keywords. Page had come up with a formula he dubbed PageRank to improve the relevance of search results by analyzing the link structure of the Web. His idea was that, like frequent citations of an academic paper, links to a Web site were clues to its importance. If a site with a lot of incoming links, like the Yahoo home page, linked to a particular page, that link would also carry more weight. Google would eventually have to modify and augment Page's original formula to battle search engine spam tactics. Still, PageRank helped establish Google's reputation for delivering better search results.
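The formula published in the 1998 paper can be computed iteratively; the sketch below follows that published update rule (with the paper's 0.85 damping factor), while the three-page link graph is purely hypothetical:

    # Illustrative sketch of the iterative PageRank calculation from the
    # 1998 Page/Brin paper; the tiny link graph here is made up.
    links = {          # page -> pages it links to
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
    }
    d = 0.85           # damping factor used in the original paper
    ranks = {page: 1.0 for page in links}

    for _ in range(50):                     # iterate until the ranks settle
        new_ranks = {}
        for page in links:
            incoming = sum(ranks[p] / len(links[p])
                           for p in links if page in links[p])
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks

    print(ranks)  # pages with more (and better-ranked) incoming links score higher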

Previous search engines had not analyzed links in such a systematic way. According to The Google Story, a book by Washington Post writer David Vise and Mark Malseed, Page had noticed that early search engine king AltaVista listed the number of links associated with a page in its search results but didn't seem to be making any other use of them. Page saw untapped potential.



In addition to recording which pages linked to which other pages, he designed Google to analyze the anchor text, the text of a link on a Web page, typically displayed to a Web site visitor with underlining and color, as an additional clue to the content of the target page. That way, it could sometimes divine that a particular page was relevant even though the search keywords didn't appear on that page. For example, a search for cardiopulmonary resuscitation might find a page that exclusively uses the abbreviation CPR because one of the Web pages linking to it spells out the term in the link's anchor text.

All this analysis requires a lot of storage. Even back at Stanford, the Web document repository alone was up to 148 gigabytes, reduced to 54 gigabytes through file compression, and the total storage required, including the indexes and link database, was about 109 gigabytes. That may not sound like much today, when you can buy a Dell laptop with a 120-gigabyte hard drive, but in the late 1990s commodity PC hard drives maxed out at about 10 gigabytes.

To cope with these demands, Page and Brin developed a virtual file system that treated the hard drives on multiple computers as one big pool of storage. They called it BigFiles. Rather than save a file to a particular computer, they would save it to BigFiles, which in turn would locate an available chunk of disk space on one of the computers in the server cluster and give the file to that computer to store, while keeping track of which files were stored on which computer. This was the start of what essentially became a distributed computing software infrastructure that runs on top of Linux.

The overall design of Google's software infrastructure reflects the principles of grid computing, with its emphasis on using many cheap computers working in parallel to achieve supercomputer-like results.

Some definitions of what makes a grid a grid would exclude Google as having too much of a homogenous, centrally controlled infrastructure, compared with a grid that teams up computers running multiple operating systems and owned by different organizations. But the company follows many of the principles of grid architecture, such as the goal of minimizing network bottlenecks by minimizing data transmission from one computer to another. Instead, whenever possible, processing instructions are sent to the server or servers containing the relevant data, and only the results are returned over the network.

"They've gotten to the point where they're distributing what really should be considered single computers across continents," says Colin Wheeler, an information systems consultant with experience in corporate grid computing projects, including a project for the Royal Bank of Canada.

Having studied Google's publications, he notes that the company has had to tinker with computer science fundamentals in a way that few enterprises would: "I mean, who writes their own file system these days?"

The Google File System

In 2003, Google's research arm, Google Labs, published a paper on the Google File System (GFS), which appears to be a successor to the BigFiles system Page and Brin wrote about back at Stanford, as revamped by the systems engineers they hired after forming Google. The new document covered the requirements of Google's distributed file system in more detail, while also outlining other aspects of the company's systems such as the scheduling of batch processes and recovery from subsystem failures.

"The idea is to store data reliably even in the presence of unreliable machines," says Google Labs distinguished engineer Jeffrey Dean, who discussed the system in a 2004 presentation available by Webcast from the University of Washington.

For example, the GFS ensures that for every file, at least three copies are stored on different computers in a given server cluster. That means if a computer program tries to read a file from one of those computers, and it fails to respond within a few milliseconds, at least two others will be able to fulfill the request. Such redundancy is important because Google's search system regularly experiences "application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking and power supplies," according to the paper.

The files managed by the system typically range from 100 megabytes to several gigabytes. So, to manage disk space efficiently, the GFS organizes data into 64-megabyte chunks, which are roughly analogous to the blocks on a conventional file system, the smallest unit of data the system is designed to support. For comparison, a typical Linux block size is 4,096 bytes. It's the difference between making each block big enough to store a few pages of text, versus several fat shelves full of books.

To store a 128-megabyte file, the GFS would use two chunks. On the other hand, a 1-megabyte file would use one 64-megabyte chunk, leaving most of it empty, because such small files are so rare in Google's world that they're not worth worrying about (files more commonly consume multiple 64-megabyte chunks).

A GFS cluster consists of a master server and hundreds or thousands of chunkservers, the computers that actually store the data. The master server contains all the metadata, including file names, sizes and locations. When an application requests a given file, the master server provides the addresses of the relevant chunkservers. The master also listens for a heartbeat from the chunkservers it manages; if the heartbeat stops, the master assigns another server to pick up the slack.

In technical presentations, Google talks about running more than 50 GFS clusters, with thousands of servers per cluster, managing petabytes of data.

More recently, Google has enhanced its software infrastructure with BigTable, a super-sized database management system it developed, which Dean described in an October presentation at the University of Washington. BigTable stores structured data used by applications such as Google Maps, Google Earth and My Search History. Although Google does use standard relational databases, such as MySQL, the volume and variety of data Google manages drove it to create its own database engine. BigTable database tables are broken into smaller pieces called tablets that can be stored on different computers in a GFS cluster, allowing the system to manage tables that are too big to fit on a single server.



Reducing Complexity

Google's distributed storage architecture for data is combined with distributed execution of the software that parses and analyzes it.

To keep software developers from spending too much time on the arcana of distributed programming, Google invented MapReduce as a way of simplifying the process. According to a 2004 Google Labs paper, without MapReduce the company found that the issues of how to parallelize the computation, distribute the data and handle failures tended to obscure the simplest computation with large amounts of complex code.

Much as the GFS offers an interface for storage across multiple servers, MapReduce takes programming instructions and assigns them to be executed in parallel on many computers. It breaks calculations into two parts: a first stage, which produces a set of intermediate results, and a second, which computes a final answer. The concept comes from functional programming languages such as Lisp (Google's version is implemented in C++, with interfaces to Java and Python).

A typical first-week training assignment for a new programmer hired by Google is to write a software routine that uses MapReduce to count all occurrences of words in a set of Web documents. In that case, the "map" would involve tallying all occurrences of each word on each page, not bothering to add them at this stage, just ticking off records for each one like hash marks on a sheet of scratch paper. The programmer would then write a "reduce" function to do the math; in this case, taking the scratch paper data, the intermediate results, and producing a count for the number of times each word occurs on each page. One example, from a Google developer presentation, shows how the phrase "to be or not to be" would move through this process:

Map
key:   to  be  or  not  to  be
value: 1   1   1   1    1   1

Reduce
key:   to  be  or  not
value: 2   2   1   1
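MapReduce itself is a proprietary C++ system with Java and Python interfaces, but the word-count exercise can be imitated on a single machine in a few lines of Python to show the two stages; this sketch is illustrative only:

    # A toy, single-machine imitation of the word-count example: a "map" step
    # that emits (word, 1) pairs, and a "reduce" step that sums them. The real
    # MapReduce runtime distributes these steps across a cluster.
    from collections import defaultdict

    def map_phase(documents):
        for text in documents:
            for word in text.lower().split():
                yield word, 1                  # one hash mark per occurrence

    def reduce_phase(pairs):
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count              # add up the hash marks
        return dict(totals)

    print(reduce_phase(map_phase(["to be or not to be"])))
    # -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}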

While this might seem trivial, it's the kind of calculation Google performs ad infinitum. More important, the general technique can be applied to many statistical analysis problems. In principle, it could be applied to other data mining problems that might exist within your company, such as searching for recurring categories of complaints in warranty claims against your products. But it's particularly key for Google, which invests heavily in a statistical style of computing, not just for search but for solving other problems like automatic translation between human languages such as English and Arabic (using common patterns drawn from existing translations of words and phrases to divine the rules for producing new translations).

MapReduce includes its own middleware, server software that automatically breaks computing jobs apart and puts them back together. This is similar to the way a Java programmer relies on the Java Virtual Machine to handle memory management, in contrast with languages like C++ that make the programmer responsible for manually allocating and releasing computer memory. In the case of MapReduce, the programmer is freed from defining how a computation will be divided among the servers in a Google cluster.

Typically, programs incorporating MapReduce load large quantities of data, which are then broken up into pieces of 16 to 64 megabytes. The MapReduce run-time system creates duplicate copies of each map or reduce function, picks idle worker machines to perform them and tracks the results. Worker machines load their assigned piece of input data, process it into a structure of key-value pairs, and notify the master when the mapped data is ready to be sorted and passed to a reduce function. In this way, the map and reduce functions alternate chewing through the data until all of it has been processed. An answer is then returned to the client application.

If something goes wrong along the way, and a worker fails to return the results of its map or reduce calculation, the master reassigns it to another computer.

As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes.

Google's Secrets

For all the papers it has published, Google refuses to answer many questions. "We generally don't talk about our strategy ... because it's strategic," Page told Time magazine when interviewed for a Feb. 20 cover story.

One of the technologies Google has made public, PageRank, is Page's approach to ranking pages based on the interlinked structure of the Web. It has become one of the most famous elements of Google's technology because he published a paper on it, including the mathematical formula. Stanford holds the patent, but through 2011 Google has an exclusive license to PageRank.

Still, Yahoo's research arm was able to treat PageRank as fair game for a couple of its own papers about how PageRank might be improved upon; for example, with its own TrustRank variation based on the idea that trustworthy sites tend to link to other trustworthy sites. Even if competitors can't use PageRank per se, the information Page published while still at Stanford gave competitors a starting point to create something similar.

"PageRank is well known because Larry published it. Well, they'll never do that again," observes Simson Garfinkel, a postdoctoral fellow at Harvard's Center for Research on Computation and Society, and an authority on information security and Internet privacy.


"Today, Google seems to have created a very effective cult of secrecy," he says. "People I know go to Google, and I never hear from them again."

Because Google, which now employs more than 6,800, is hiring so many talented computer scientists from academia (according to The Mercury News in San Jose, it hires on average 12 new employees a day and recently listed 1,800 open jobs), it must offer them some freedom to publish, Garfinkel says. He has studied the GFS paper and finds it really interesting because of what it doesn't say and what it glosses over. At one point, they say it's important to have each file replicated on more than three computers, but they don't say how many more. At the time, maybe the data was on 50 computers. Or maybe it was three computers in each cluster. And although the GFS may be one important part of the architecture, there are probably seven layers [of undisclosed technology] between the GFS system and what users are seeing.

One of Google's biggest secrets is exactly how many servers it has deployed. Officially, Google says the last confirmed statistic for the number of servers it operates was 10,000. In his 2005 book The Google Legacy, Infonortics analyst Stephen E. Arnold puts the consensus number at 150,000 to 170,000. He also says Google recently shifted from using about a dozen data centers with 10,000 or more servers to some 60 data centers, each with fewer machines. A New York Times report from June put the best guess at 450,000 servers for Google, as opposed to 200,000 for Microsoft.

The exact number of servers in Google's arsenal is irrelevant, Garfinkel says. "Anybody can buy a lot of servers. The real point is that they have developed software and management techniques for managing large numbers of commodity systems, as opposed to the fundamentally different route Microsoft and Yahoo went."

Other Web operations like Yahoo that launched earlier built their infrastructure around smaller numbers of relatively high-end servers, according to Garfinkel. In addition to saving money, Google's approach is better "because machines fail, and they fail whether you buy expensive machines or cheap machines."

Of particular interest to CIOs is one widely cited estimate that Google enjoys a 3-to-1 price-performance advantage over its competitors; that is, that its competitors spend $3 for every $1 Google spends to deliver a comparable amount of computing power. This comes from a paper Google engineers published in 2003, comparing the cost of an eight-processor server with that of a rack of 176 two-processor servers that delivers 22 times more processor power and three times as much memory for a third of the cost. In this example, the eight-processor server was supposed to represent the traditional approach to delivering high performance, compared with Google's relatively cheap servers, which at the time used twin Intel Xeon processors.

But although Google executives often claim to enjoy a price-performance advantage over their competitors, the company doesn't necessarily claim that it's a 3-to-1 difference. The numbers in the 2003 paper were based on a hypothetical comparison, not actual benchmarks versus competitors, according to Google. Microsoft and Yahoo have also had a few years to react with their own cost-cutting moves.

How Google Manages Its Global Workforce

Google's Douglas Merrill, a corporate technology director with a background in psychology, is firmly convinced that while technology plays a role in keeping projects on track, so do the Ping-Pong tables and the elaborate cafeteria at company headquarters in Mountain View, Calif., the social settings that get coders away from their desks and talking to each other about their projects.

As he explained during a presentation in Phoenix, that's one of the things that made Google successful as a startup, but now the challenge is to maintain that quirky culture as the company grows. It now has more than 6,800 employees and dozens of sales and engineering offices around the world.

Having overseas engineering offices is proving to be important because people who are based in France, or India, or China tend to do a better job of localizing Google's applications; that is, not just translating the user interface but changing the way the system works to meet local requirements.

E-mail is a pretty good answer for tracking projects, Merrill says, but e-mail alone is not sufficient for really managing them, and not for keeping the people who are working on them happy and productive. "At the end of the day, face time rules," he says.

The closest thing to a technological answer to reaching those overseas workers is videoconferencing, and even then face-to-face contact is still better. So, Google has had to hire a cadre of managers who spend most of their time on planes, in addition to hiring local managers for those offices. That goes against the grain for a company that prides itself on its flat management structure, but it's necessary, Merrill says.

Remote offices also tend to suffer disproportionately from minor inconsistencies in technology infrastructure, which can hobble efforts at collaboration between engineers in different offices. For example, an application might depend on a certain configuration of Lightweight Directory Access Protocol (LDAP) servers, which are used to store information about networks, devices and user accounts. "We don't really want to spend days tracking down problems in code because the LDAP schema that we use in Dublin is different than the LDAP schema that we use in New York," Merrill says. Those sorts of things used to happen. "The time we spent doing that cost money, it cost productivity and, most importantly, it cost frustration."

Trying to dictate the proper configuration from the home office, and have remote offices set things up to those specifications, didn't work, he says. Instead, he developed an office-in-a-box, with LDAP servers and other basic infrastructure preconfigured at headquarters and shipped into the field. (This almost sounds like the prefab data center in a container Google is reported to have created, except that this is more the size of a steamer trunk.)

"We took the data center, we shrunk it, put it in a brightly colored box and rolled one into every office," Merrill explains. "Then we built a whole bunch of tools that work in the background to verify that the LDAP change made in one place is replicated to all the other offices-in-a-box around the world."

Keeping the technology synchronized is one way of keeping people and projects in sync, but there are always limits to what it can accomplish, he says: "That's Merrill's Law: There are no technological solutions to social problems." (D.F.C.)


"Google is very proud of the cost of its infrastructure, but we've also driven to a very low cost," says Lars Rabbe, who as chief information officer at Yahoo is responsible for all data center operations.

Microsoft provided a written statement disputing the idea that Google enjoys a large price-performance advantage. Microsoft uses clusters of inexpensive computers where it makes sense, but it also uses high-end multi-processor systems where that provides an advantage, according to the statement. And Windows does a fine job of supporting both, according to Microsoft.

Certainly, Yahoo's systems design is different from Google's, Rabbe says. Google grew up doing one thing only and wound up with a search-driven architecture that is very uniform, with a lot of parallelism. Yahoo has increased its focus on the use of parallel computing with smaller servers, he says, but is likely to continue to have a more heterogeneous server infrastructure. For example, Yahoo launched the company on the FreeBSD version of Unix but has mixed in Linux and Windows servers to support specific applications. While the Yahoo and Google companies compete as search engines, he says, "We also do 150 other things, and some of those require a very different type of environment."

Still, Gartner analyst Ray Valdes believes Google retains an advantage in price-performance, as well as in overall computing power. "I buy that. Even though I'm usually pretty cynical and skeptical, as far as I know, nobody has gone to that extent and pushed the envelope in the way they have," he says. "Google is still doing stuff that others are not doing."

The advantage will erode over time, and Google will eventually run up against the limits of how much homegrown technology it can manage, Valdes predicts: "The maintenance of their own system software will become more of a drag to them."

But Google doesn't buy this traditional argument for why enterprises should stick to application-level code and leave more fundamental technologies like file systems, operating systems and databases to specialized vendors. Google's leaders clearly believe they are running a systems engineering company to compete with the best of them.

Exotic but Not Unique

Google's systems seem to work well for Google. But if you could run your own systems on the Google File System, would you want to? Or is this an architecture only a search engine could love?

Distributed file systems have been around since the 1980s, when the Sun Microsystems Network File System and the Andrew File System, developed at Carnegie Mellon University, first appeared. Software engineer and blogger Jeff Darcy says the system has a lot in common with the HighRoad system he worked on at EMC. However, he notes that Google's decision to relax conventional requirements for file system data consistency in the pursuit of performance makes its system unsuitable for a wide variety of potential applications. And because it doesn't support operating system integration, it's really more of a storage utility than a file system per se, he says. To a large extent, Google's design strikes him as more of a synthesis of many prior efforts in distributed storage.

Google Courts the Enterprise

While there is no doubt about the power Google.com commands among advertisers and Webmasters, Google the enterprise vendor is another thing entirely.

Since 99% of Google's $6 billion in revenue continues to come from advertising, the Google Enterprise division represents a tiny part of the overall business. This is the group that wants to sell you a little bit of Google in a box: the Search Appliance product line, which embeds a variation of the search engine software that powers Google.com in a yellow server box or blue blade server that enterprise customers can plug into their own data centers.

The appliance product line includes the entry-level Google Mini, which starts at $1,995 for a model capable of indexing 100,000 documents. It is often used to provide the search capability for public Web sites, although it is also used internally by small enterprises or departments of larger ones. Beyond that level, the Search Appliance line includes the Google Appliance GB-1001, which can handle up to a million documents; and the GB-5005 and GB-8008, which, when delivered in the form of multiple servers in a rack and working together as a Google File System cluster, can handle many millions of documents. "It really is like a little Google data center in a box," says Matt Glotzbach, head of products for Google Enterprise.

Because the technology is delivered in the form of an appliance, customers aren't supposed to crack open the case and tinker with the technology inside, and they aren't provided with root access to the server operating environment.

Instead, Google provides a set of Web-based administration screens, as well as application programming interfaces for modifying the appliance's behavior. The most significant development on that front is the introduction of the OneBox application programming interface, which allows data drawn from other systems that otherwise wouldn't be indexed by the search appliance to be displayed at the top of the search results.

The enterprise version of OneBox is modeled after the feature of Google's public Web site that inserts links to data from weather reports, phone listings or maps into search results when the search engine recognizes a pattern associated with an address, a city name or a phone number in the keywords entered by the user. Similarly, the appliance can be programmed to recognize patterns associated with purchase order numbers or common business queries, and insert links to related data. For example, a search for quarterly sales data could go beyond searching intranet Web content and pop up a link to a more structured data source, such as a Cognos financial analysis application.

Dave Girouard, vice president and general manager of the Google Enterprise business unit, says Google has been beefing up its capabilities to address enterprise requirements, making the appliances easier to install, use and manage.

However, Google still needs to do a better job of addressing enterprise requirements, particularly in terms of support, according to Gartner's Whit Andrews, an authority on the evolution of search technology. It has taken Google a while to recognize that it needs to do business with the enterprise in a different way from how it does business with the advertiser, he says.

Enterprises that have bought the appliances often give positive reports on the value Google delivers for the money and on the ease of setup and administration, Andrews says. But support is another story, according to Andrews, who has talked to customers who say Google's response to problems with the appliances is too often along the lines of, "Yeah, we know about that, we'll get back to you."

Google says it is investing in improved support. Andrews agrees that the support is better, but says it's still not enough.

However, Google can point to many happy customers. Brown Rudnick Berlack Israels, an international law firm based in Boston, got the Google Mini it purchased up and running in about an hour, says Keith Schultz, who manages the firm's Web sites. "We've really had no problems with it at all," Schultz says. (D.F.C.)


Despite those caveats, Darcy says he also sees many aspects of the GFS as cool and useful, and gives Google credit for bringing things that might have been done mostly as research projects and turning them into a system stable enough and complete enough to be used in commercial infrastructure.

Google software engineers considered and rejected modifying an existing distributed file system because they felt they had different design priorities, revolving around redundant storage of massive amounts of data on cheap and relatively unreliable computers.

Despite having published details on technologies like the Google File System, Google has not released the software as open source and shows little interest in selling it. The only way it is available to another enterprise is in embedded form: if you buy a high-end version of the Google Search Appliance, one that is delivered as a rack of servers, you get Google's technology for managing that cluster as part of the package.

However, the developers working on Nutch, an Apache Software Foundation open-source search engine, have created a distributed software environment called Hadoop that includes a distributed file system and implementation of MapReduce inspired by Google's work.

    Google, the EnterpriseIn the filing for its 2004 IPO, Google included a letter fromPage and Brin that declared, Google is not a conventional

    company. We do not intend to become one. Maybe so, butGoogle still has to satisfy conventional corporate require-ments for managing money and complying with laws.

    In 2001, Page and Brin hired a more seasoned technologyexecutive to be Googles official corporate leaderEricSchmidt, formerly chief technology officer of Sun Microsystemsand then CEOof Novell.

    As recounted in The Google Story, one of the first battlesSchmidt had to fight was to bring in Oracle Financials to controlthe companys finances. Page and Brin had been using Quicken,a financial management system for individuals and small busi-nesses. But by this time, Google had 200 employees and $20million in revenue.

The book quotes Schmidt as saying, "That was a huge fight. They couldn't imagine why it made sense to pay all that money to Oracle when Quicken was available."

So, Google does employ conventional enterprise technology, sometimes. But if you want to understand how the kind of proprietary technology Google possesses can be employed within an enterprise, look at how it's used within Google.

In public presentations, Merrill tries to put Google's technology in the context of how he is using it to address basic business processes like interviewing and hiring the best people, tracking their performance and coordinating projects. "We say our mission is to organize information and make it universally accessible and useful," Merrill says. "We had to figure out how to apply that internally as well."

Consider how Google handles project management. Every week, every Google technologist receives an automatically generated e-mail message asking, essentially, what did you do this week and what do you plan to do next week? This homegrown project management system parses the answer it gets back and extracts information to be used for follow-up. So, next week, Merrill explains, the system will ask, "Last week, you said you would do these six things. Did you get them done?"

A more traditional project tracking application would use a form to make users plug the data into different fields and checkboxes, giving the computer more structured data to process. But instead of making things easier for the computer, Google's approach is to make things easier for the user and make the computer work harder. Employees submit their reports as unstructured e-mail, and the project tracking software works to understand the content of those e-mail notes in the same way that Google's search engine extracts context and meaning from Web pages.
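
The article does not describe Google's parser, but a toy version of the idea (free text in, structured follow-up items out) might look like the sketch below. The e-mail wording, keywords and function names are assumptions made for illustration.

    import re

    # Toy illustration: turn an unstructured weekly status e-mail into a
    # list of commitments a tracking system could ask about next week.

    EMAIL = """
    This week I finished the billing report and reviewed the ad-serving patch.
    Next week I plan to migrate the HR database and I will draft the Q3 roadmap.
    """

    def extract_commitments(text):
        # Treat any sentence that talks about next week or uses a
        # planning verb as a commitment to follow up on.
        commitments = []
        for sentence in re.split(r"[.\n]+", text):
            sentence = sentence.strip()
            if re.search(r"\b(next week|plan to|will)\b", sentence, re.IGNORECASE):
                commitments.append(sentence)
        return commitments

    for item in extract_commitments(EMAIL):
        print("Follow up next week:", item)

A production system would need far more robust natural-language handling, but the shape is the same: the burden of imposing structure falls on the software, not on the person writing the e-mail.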

If Google employees found the project tracking system to be a hassle to work with, they probably wouldn't use it, regardless of whether it was supposed to be mandatory, Merrill says. But because it's as easy as reading and responding to an e-mail, "We get pretty high compliance."

Those project tracking reports go into a repository (searchable, of course) so that managers can dip in at any time for an overview of the progress of various efforts. Other Google employees can troll around in there as well and register their interest in a project they want to track, regardless of whether they have any official connection to that project.

"What we're looking for here is lots of accidental cross-pollination," Merrill explains, so that employees in different offices, perhaps in different countries, can find out about other projects that might be relevant to their own work.

Despite Google's reputation for secrecy toward outsiders, internally the watchword is "living out loud," Merrill says. "Everything we do is a 360-degree public discussion."

The company takes a more traditional approach to recording financial transactions, however. "Hey, I want those revenues, I really do," he says. That means running financial management software on servers with more conventional virtues, like disks that don't fail very often, he says.

On the other hand, Google runs internally developed human-resources systems on the clustered server architecture, and that's been working fine, according to Merrill: "In general, because of the price-performance trade-off, under current market conditions I can get about a 1,000-fold computer power increase at about 33 times lower cost if I go to the failure-prone infrastructure. So, if I can do that, I will."
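
To see what a claim like that means in practice, here is a back-of-the-envelope comparison of cost per unit of computing power. The prices and throughput figures are invented for illustration and are deliberately far more conservative than Merrill's numbers; the point is the method of comparison, not the specific ratios.

    # Hypothetical price-performance comparison between a few "reliable"
    # enterprise servers and a large cluster of cheap commodity machines.
    # Every number below is an assumption for illustration only.

    enterprise = {"unit_cost": 100_000.0, "relative_throughput": 1.0, "units": 4}
    commodity = {"unit_cost": 1_500.0, "relative_throughput": 0.25, "units": 2_000}

    def totals(cfg):
        cost = cfg["unit_cost"] * cfg["units"]
        power = cfg["relative_throughput"] * cfg["units"]
        return cost, power

    e_cost, e_power = totals(enterprise)
    c_cost, c_power = totals(commodity)

    print(f"Compute ratio (commodity vs. enterprise): {c_power / e_power:.0f}x")
    print(f"Cost per unit of compute, enterprise: ${e_cost / e_power:,.0f}")
    print(f"Cost per unit of compute, commodity:  ${c_cost / c_power:,.0f}")
    # With these assumed numbers the commodity cluster delivers roughly
    # 125 times the computing power at about a 17th of the cost per unit.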

Numbers like Merrill's ought to catch the attention of any technology manager. Of course, it's true that most corporate enterprises don't run the same applications at the same scale that Google does, and few would find it worthwhile to invest in tinkering with file systems and other systems engineering fundamentals.

But as much as it has poured effort into perfecting its use of those technologies, Google did not invent distributed computing. Companies willing to experiment with commercially available and open-source products for grid computing, distributed file systems and distributed processing may be able to find their own route to Google-like results.


Red Hat: Still Savvy

Forging ahead with the same business model for more than 12 years might seem old hat to some in the constantly changing world of information technology, but business customers say Red Hat wears it well.

The Raleigh, N.C.-based firm has been selling and supporting open-source Linux software since its founding in 1993. Aside from Google, Red Hat's customers include Amazon.com, Lehman Brothers and Yahoo, and most buyers continue to tip their fedoras to Red Hat's support and pricing.

But while Red Hat has plenty to brag about (its revenues are up more than 80% annually in the past two years, and the company recently acquired open-source middleware company JBoss), some customers say competitor Novell and its SuSE Linux software is nipping at Red Hat's heels.

The JBoss acquisition helped strengthen Red Hat's relationship with Fiserv Investment Support Services (Fiserv ISS), a Denver-based financial services firm.

Fiserv ISS needed a platform for its new data warehouse and turned to Red Hat. Rick Kendall, the firm's chief information officer, says Red Hat delivered on its promises of high-level support and consulting throughout the project. And Red Hat's Enterprise Linux worked with two back-end systems that Fiserv ISS had inherited when the company was created from four smaller firms.

The company had also been using a JBoss application server, so having both the Red Hat and JBoss products under one roof was a plus for Kendall.

Today, Fiserv ISS has 35 different servers running Red Hat software, and Kendall says he plans to expand that deployment over the next five years.

"This is core blocking-and-tackling technology infrastructure that has to work," Kendall says. "And it has."

The City of Chicago is another happy customer, though its platform architect says Novell appears to have made improvements, such as running Oracle databases and improving its support, since three years ago, when she opted for Red Hat.

In 2003, the Windy City started a pilot program with the Red Hat Enterprise Linux platform to power the city's Business & Information Services Department, which supports enterprise applications throughout Chicago.

The city replaced old Sun Solaris servers and installed Red Hat on DL580 G2 servers from Hewlett-Packard. The architect says the $65,000 she pays annually for Red Hat licenses and support on 65 servers is one-fourth the price of competitors' products. She also says the Red Hat products have needed little or no maintenance since installation.

At Liberty University in Lynchburg, Va., the director of information-technology development and engineering bought into Red Hat but isn't 100% sold.

Charged in 2004 with creating a more manageable environment for the school's bustling Web operations, he opted for Red Hat because of its reputation for efficient Web serving and for its Global File System, an operating system feature that manages server clusters like the ones in the school's new storage-area network.

The tool helped cut the number of hours per week needed to manage the environment to 5, from 15 or 20, since the project was completed in April, he says. He also credits Red Hat's consultants with teaching his team the software's ins and outs.

Still, despite a positive experience with Red Hat, he says he's keeping his eye on Novell SuSE Linux, which has added a server management tool, and on other competitors. "We are a Red Hat campus for now," he says. "Not saying we'll be a Red Hat campus forever."

-BRIAN P. WATSON

Red Hat: At A Glance
Open-Source Software Dossier

1801 Varsity Drive / Raleigh, NC 27606 / (919) 754-3700 / www.redhat.com
Ticker: RHAT (NASDAQ)   Employees: 1,300
Matthew J. Szulik, Chairman, CEO & President; Paul Cormier, EVP, Engineering
Products: Enterprise Linux 4 gives businesses an open-source computing platform on a subscription basis, with new versions released every 18 months. Global File System manages and connects clustered servers.

Financials*
                2006 FYTD   2005 FY    2004 FY
Revenue         $278.33M    $151.13M   $82.41M
Net income      $79.69M     $45.43M    $13.73M
R&D spending    82.6%       80.4%      72.8%
* Fiscal year ends Feb. 28.

Source of Optimism (chart): accumulated deficit, in millions**: $290.96M; $152.11M.
** Accumulated deficit represents net loss to date. Source: company reports.

Reference Checks
Fiserv Investment Support Services: Rick Kendall, CIO
City of Chicago: Platform Architect, (312) 744-8735
Heritage Propane Partners: Dir., I.T., (406) 442-9759
Liberty University: Dir., Engineering and Development

Fiserv Investment Support Services plans to increase its use of Red Hat software in the next five years following a data warehousing project, says CIO Rick Kendall.