2009 Claremont Report on Database Research (ACM)



56 | Communications of the ACM | June 2009 | Vol. 52

Contributed Articles

Rakesh Agrawal, Anastasia Ailamaki, Philip A. Bernstein, Eric A. Brewer, Michael J. Carey, Surajit Chaudhuri, AnHai Doan, Daniela Florescu, Michael J. Franklin, Hector Garcia-Molina, Johannes Gehrke, Le Gruenwald, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Hank F. Korth, Donald Kossmann, Samuel Madden, Roger Magoulas, Beng Chin Ooi, Tim O'Reilly, Raghu Ramakrishnan, Sunita Sarawagi, Michael Stonebraker, Alexander S. Szalay, Gerhard Weikum

DOI: 10.1145/1516046.151606

Database research is expanding, with major efforts in system architecture, new languages, cloud services, mobile and virtual worlds, and interplay between structure and text.

The Claremont Report on Database Research

A group of database researchers, architects, users, and pundits met in May 2008 at the Claremont Resort in Berkeley, CA, to discuss the state of database research and its effects on practice. This was the seventh meeting of this sort over the past 20 years and was distinguished by a broad consensus that the database community is at a turning point in its history, due to both an explosion of data and usage scenarios and major shifts in computing hardware and platforms. Here, we explore the conclusions of this self-assessment. It is by definition somewhat inward-focused but may be of interest to the broader computing community as both a window into upcoming directions in database research and


a description of some of the community issues and initiatives that surfaced. We describe the group's consensus view of new focus areas for research, including database engine architectures, declarative programming languages, interplay of structured data and free text, cloud data services, and mobile and virtual worlds. We also report on discussions of the database community's growth and processes that may be of interest to other research areas facing similar challenges.

Over the past 20 years, small groups of database researchers have periodically gathered to assess the state of the field and propose directions for future research.1,3–7 Reports of the meetings served to foster debate within the database research community, explain research directions to external organizations, and help focus community efforts on timely challenges.

The theme of the Claremont meeting was that database research and the data-management industry are at a turning point, with unusually rich opportunities for technical advances, intellectual achievement, entrepreneurship, and benefits for science and society. Given the large number of opportunities, it is important for the database research community to address issues that maximize relevance within the field, across computing, and in external fields as well.

The sense of change that emerged in the meeting was a function of several factors:

Excitement over big data. In recent years, the number of communities working with large volumes of data has grown considerably to include not only traditional enterprise applications and Web search but also e-science efforts (in astronomy, biology, earth science, and more), digital entertainment, natural-language processing, and social-network analysis. While the user base for traditional database management systems (DBMSs) is growing quickly, there is also a groundswell of effort to design new custom data-management solutions from simpler components. The ubiquity of big data is expanding the base of users and developers of data-management technologies and will undoubtedly shake up the database research field.

Data analysis as profit center. In traditional enterprise settings, the barriers between IT departments and business units are coming down, and there are many examples of companies where data is indeed the business itself. As a consequence, data capture, integration, and analysis are no longer viewed as a business cost but as the keys to efficiency and profit. The value of software to support data analytics has been growing as a result. In 2007, corporate acquisitions of business-intelligence vendors alone totaled $15 billion,2 and that is only the front end of the data-analytics tool chain. Market pressure for better analytics also brings new users to the technology with new demands. Statistically sophisticated analysts are being hired in a growing number of industries, with increasing interest in running their formulae on the raw data. At the same time, a growing number of nontechnical decision makers want to get their hands on the numbers as well in simple and intuitive ways.

Ubiquity of structured and unstructured data. There is an explosion of structured data on the Web and on enterprise intranets. This data is from a variety of sources beyond traditional databases, including large-scale efforts to extract structured information from text, software logs and sensors, and


crawls of deep-Web sites. There is also an explosion of text-focused semistructured data in the public domain in the form of blogs, Web 2.0 communities, and instant messaging. New incentive structures and Web sites have emerged for publishing and curating structured data in a shared fashion as well. Text-centric approaches to managing the data are easy to use but ignore latent structure in the data that might add significant value. The race is on to develop techniques that extract useful data from mostly noisy text and structured corpora, enable deeper exploration into individual data sets, and connect data sets together to wring out as much value as possible.

Expanded developer demands. Programmer adoption of relational DBMSs and query languages has grown significantly in recent years, accelerated by the maturation of open source systems (such as MySQL and PostgreSQL) and the growing popularity of object-relational mapping packages (such as Ruby on Rails). However, the expanded user base brings new expectations for programmability and usability from a larger, broader, less-specialized community of programmers. Some of them are unhappy or unwilling to drop into SQL, viewing DBMSs


as unnecessarily complicated and daunting to learn and manage relative to other open source components. As the ecosystem for database management evolves beyond the typical DBMS user base, opportunities are emerging for new programming models and new system components for data management and manipulation.

Architectural shifts in computing. While the variety of user scenarios is increasing, the computing substrates for data management are shifting dramatically as well. At the macro scale, the rise of cloud computing services suggests fundamental changes in software architecture. It democratizes access to parallel clusters of computers; every programmer has the opportunity and motivation to design systems and services that scale out incrementally to arbitrary degrees of parallelism. At a micro scale, computer architectures have shifted the focus of Moore's Law from increasing clock speed per chip to increasing the number of processor cores and threads per chip. In storage technologies, major changes are under way in the memory hierarchy due to the availability of more and larger on-chip caches, large inexpensive RAM, and flash memory. Power consumption has become an increasingly important aspect of the price/performance metric of large systems. These hardware trends alone motivate a wholesale reconsideration of data-management software architecture.

These factors together signal an urgent, widespread need for new data-management technologies. There is an opportunity for making a positive difference. Traditionally, the database community is known for the practical relevance of its research; relational databases are emblematic of technology transfer. But in recent years, the externally visible contribution of the database research community has not been as pronounced, and there is a mismatch between the notable expansion of the community's portfolio and its contribution to other fields of research and practice. In today's increasingly rich technical climate, the database community must recommit itself to impact and breadth. Impact is evaluated by external measures, so success involves helping new classes of users, powering new computing platforms, and making conceptual breakthroughs across computing. These should be the motivating goals for the next round of database research.

To achieve these goals, discussion at the 2008 Claremont Resort meeting


revolved around two broad agendas we call reformation and synthesis. The reformation agenda involves deconstructing traditional data-centric ideas and systems and reforming them for new applications and architectural realities. One part of this entails focusing outside the traditional RDBMS stack and its existing interfaces, emphasizing new data-management systems for growth areas (such as e-science). Another part of the reformation agenda involves taking data-centric ideas like declarative programming and query optimization outside their original context in storage and retrieval to attack new areas of computing where a data-centric mindset promises to yield significant benefit. The synthesis agenda is intended to leverage research ideas in areas that have yet to develop identifiable, agreed-upon system architectures, including data integration, information extraction, and data privacy. Many of these subcommunities of database research seem ready to move out of the conceptual and algorithmic phase to work together on comprehensive artifacts (such as systems, languages, and services) that combine multiple techniques to solve complex user problems. Efforts toward synthesis can serve as rallying points for research, likely leading to new challenges and breakthroughs, and promise to increase the overall visibility of the work.

Research Opportunities

After two days of intense discussion at the 2008 Claremont meeting, it was surprisingly easy for the group to reach consensus on a set of research topics for investigation in coming years. Before exploring them, we stress a few points regarding what is not on the list. First, while we tried to focus on new opportunities, we do not propose they be pursued at the expense of existing good work. Several areas we deemed critical were left off because they are already focus topics in the database community. Many were mentioned in previous reports1,3–7 and are the subject of significant efforts that require continued investigation and funding. Second, we kept the list short, favoring focus over coverage. Though most of us have other promising research topics we would have liked to discuss at greater length here, we focus on topics that


attracted the broadest interest within the group.

In addition to the listed topics, the main issues raised during the meeting included management of uncertain information, data privacy and security, e-science and other scholarly applications, human-centric interaction with data, social networks and Web 2.0, personalization and contextualization of query- and search-related tasks, streaming and networked data, self-tuning and adaptive systems, and the challenges raised by new hardware technologies and energy constraints. Most are captured in the following discussion, with many cutting across multiple topics.

Revisiting database engines. System R and Ingres pioneered the architecture and algorithms of relational databases; current commercial databases are still based on their designs. But many of the changes in applications and technology demand a reformation of the entire system stack for data management. Current big-market relational database systems have well-known limitations. While they provide a range of features, they have only narrow regimes in which they provide peak performance; online transaction processing (OLTP) systems are tuned for lots of small, concurrent transactional debit/credit workloads, while decision-support systems are tuned for a few read-mostly, large-join-and-aggregation workloads. Meanwhile, for many popular data-intensive tasks developed over the past decade, relational databases provide poor price/performance and have been rejected; critical scenarios include text indexing, serving Web pages, and media delivery. New workloads are emerging in the sciences, Web 2.0-style applications, and other environments where database-engine technology could prove useful but is not bundled in current database systems.

Even within traditional application domains, the database marketplace today suggests there is room for significant innovation. For example, in the analytics markets for business and science, customers can buy petabytes of storage and thousands of processors, but the dominant commercial database systems typically cannot scale that far for many workloads. Even when they can, the cost of software and


management relative to hardware is exorbitant. In the OLTP market, business imperatives like regulatory compliance and rapid response to changing business conditions raise the need to address data life-cycle issues (such as data provenance, schema evolution, and versioning).

Given these requirements, the commercial database market is wide open to new ideas and systems, as reflected in the recent funding climate for entrepreneurs. It is difficult to recall when there were so many start-up companies developing database engines, and the challenging economy has not trimmed the field much. The market will undoubtedly consolidate over time, but things are changing fast, and it remains a good time to try radical ideas.

Some research projects have begun taking revolutionary steps in database system architecture. There are two distinct directions: broadening the useful range of applicability for multi-purpose database systems (for example, to incorporate streams, text search, XML, and information integration) and radically improving performance by designing special-purpose database systems for specific domains (for example, read-mostly analytics, streams, and XML). Both directions have merit, and the overlap in their stated targets suggests they may be more synergistic than not. Special-purpose techniques (such as new storage and compression formats) may be reusable in more general-purpose systems, and general-purpose architectural components (such as extensible query optimizer frameworks) may help speed prototyping of new special-purpose systems.

Important research topics in the core database engine area include:

  • Designing systems for clusters of many-core processors that exhibit limited and nonuniform access to off-chip memory;
  • Exploiting remote RAM and flash as persistent media, rather than relying solely on magnetic disk;
  • Treating query optimization and physical data layout as a unified, adaptive, self-tuning task to be carried out continuously;
  • Compressing and encrypting data at the storage layer, integrated with data layout and query optimization;


This opens opportunities for the database community to extend its contribution to the broader community, developing more powerful and efficient languages and runtime mechanisms that help these developers address more complex problems.

As another example of declarative programming, in the past five years a variety of new declarative languages, often grounded in Datalog, have been developed for domain-specific systems in fields as diverse as networking and distributed systems, computer games, machine learning and robotics, compilers, security protocols, and information extraction. In many of these scenarios, the use of a declarative language has reduced code size by orders of magnitude while also enabling distributed or parallel execution. Surprisingly, the groups behind these efforts have coordinated very little with one another; the move to revive declarative languages in these new contexts has grown up organically.
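The flavor of this declarative style can be sketched in a few lines. The example below is a hypothetical illustration, not drawn from any of the systems named above: network reachability stated as two Datalog-style rules, evaluated by a generic bottom-up fixpoint loop in Python. The rules are the entire "program"; distribution and parallelism can then be left to the engine.

```python
# Naive bottom-up evaluation of two Datalog-style rules for reachability:
#   reach(X, Y) :- link(X, Y).
#   reach(X, Z) :- reach(X, Y), link(Y, Z).
# The two rules are the whole specification; the loop is a generic engine.

def reachability(link_facts):
    reach = set(link_facts)               # rule 1: every link is reachable
    while True:
        derived = {(x, z)                 # rule 2: join reach with link
                   for (x, y1) in reach
                   for (y2, z) in link_facts
                   if y1 == y2}
        if derived <= reach:              # fixpoint: nothing new derived
            return reach
        reach |= derived

# Hypothetical example facts: a chain a -> b -> c -> d
print(sorted(reachability({("a", "b"), ("b", "c"), ("c", "d")})))
```

In a real Datalog engine the same two rules would run unchanged whether evaluation is centralized, parallel, or distributed, which is the code-size and execution-strategy win the paragraph above describes.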

A third example arises in enterprise-application programming. Recent language extensions (such as Ruby on Rails and LINQ) encourage query-like logic in programmer design patterns. But these packages have yet to address the challenge of enterprise-style programming across multiple machines; the closest effort here is DryadLINQ, focusing on parallel analytics rather than on distributed application development. For enterprise applications, a key distributed design decision is the partitioning of logic and data across multiple tiers, including Web clients, Web servers, application servers, and a backend DBMS. Data independence is particularly valuable here, allowing programs to be specified without making a priori permanent decisions about physical deployment across tiers. Automatic optimization processes could make these decisions and move data and code as needed to achieve efficiency and correctness. XQuery has been proposed as an existing language that would facilitate this kind of declarative programming, in part because XML is often used in cross-tier protocols.

It is unusual to see this much energy surrounding new data-centric programming techniques, but the opportunity brings challenges as


well. The research challenges include language design, efficient compilers and runtimes, and techniques to optimize code automatically across both the horizontal distribution of parallel processors and the vertical distribution of tiers. It seems natural that the techniques behind parallel and distributed databases (partitioned dataflow and cost-based query optimization) should extend to new environments. However, to succeed, these languages must be fairly expressive, going beyond simple MapReduce and select-project-join-aggregate dataflows. This agenda will require synthesis work to harvest useful techniques from the literature on database and logic programming languages and optimization, as well as to realize and extend them in new programming environments.
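To make the baseline concrete, the sketch below shows what a select-project-join-aggregate dataflow amounts to when written out by hand in plain Python; the table and column names are made up for illustration. This is the shape of computation the languages above must go well beyond (recursion, ordering, iteration, state).

```python
# A minimal select-join-aggregate dataflow over hypothetical tables,
# written as ordinary Python: the expressiveness baseline discussed above.

orders = [{"cust": 1, "amount": 30}, {"cust": 1, "amount": 70},
          {"cust": 2, "amount": 5}]
customers = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]

def join(left, right, lkey, rkey):
    # Nested-loop equi-join producing merged rows.
    return [{**l, **r} for l in left for r in right if l[lkey] == r[rkey]]

def sum_by(rows, group_col, agg_col):
    # Group-by with a SUM aggregate.
    totals = {}
    for row in rows:
        totals[row[group_col]] = totals.get(row[group_col], 0) + row[agg_col]
    return totals

big = [o for o in orders if o["amount"] > 10]   # select
joined = join(big, customers, "cust", "id")     # join
print(sum_by(joined, "region", "amount"))       # aggregate: {'EU': 100}
```

Each stage here partitions cleanly over data, which is why such dataflows parallelize so well; the open problem the text raises is keeping that property while admitting richer programs.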

To genuinely improve programmer productivity, these new approaches also need to pay attention to the softer issues that capture the hearts and minds of programmers (such as attractive syntax, typing and modularity, development tools, and smooth interaction with the rest of the computing ecosystem, including networks, files, user interfaces, Web services, and other languages). This work also needs to consider the perspective of programmers who want to use their favorite programming languages and data services as primitives in those languages. Example code and practical tutorials are also critical.

To execute successfully, database research must look beyond its traditional boundaries and find allies throughout computing. This is a unique opportunity for a fundamental reformation of the notion of data management, not as a single system but as a set of services that can be embedded as needed in many computing contexts.

Interplay of structured and unstructured data. A growing number of data-management scenarios involve both structured and unstructured data. Within enterprises, we see large heterogeneous collections of structured data linked with unstructured data (such as document and email repositories). On the Web, we also see a growing amount of structured data, primarily from three sources: millions of databases hidden behind forms (the deep Web); hundreds of millions of high-quality data items in HTML tables on Web pages and a growing number of mashups providing dynamic views on structured data; and data contributed by Web 2.0 services (such as photo and video sites, collaborative annotation services, and online structured-data repositories).

A significant long-term goal for the database community is to transition from managing traditional databases consisting of well-defined schemata for structured business data to the much more challenging task of managing a rich collection of structured, semistructured, and unstructured data spread over many repositories in the enterprise and on the Web, sometimes referred to as the challenge of managing dataspaces.

In principle, this challenge is closely related to the general problem of data integration, a longstanding area for database research. The recent advances in this area and the new issues due to Web 2.0 resulted in significant discussion at the Claremont meeting. On the Web, the database community has contributed primarily in two ways: First, it developed technology that enables the generation of domain-specific (vertical) search engines with relatively little effort; and second,


it developed domain-independent technology for crawling through forms (that is, automatically submitting well-formed queries to forms) and surfacing the resulting HTML pages in a search-engine index. Within the enterprise, the database research community recently contributed to enterprise search and the discovery of relationships between structured and unstructured data.

The first challenge database researchers face is how to extract structure and meaning from unstructured and semistructured data. Information-extraction technology can now pull structured entities and relationships out of unstructured text, even in unsupervised Web-scale contexts. We expect in coming years that hundreds of extractors will be applied to a given data source. Hence developers and analysts need techniques for applying and managing predictions from large numbers of independently developed extractors. They also need algorithms that can introspect about the correctness of extractions and therefore combine multiple pieces of extraction evidence in a principled fashion. The database community is not alone in these efforts; to contribute in this area, database researchers should continue


to strengthen ties with researchers in information retrieval and machine learning.

Context is a significant aspect of the semantics of the data, taking multiple forms (such as the text and hyperlinks that surround a table on a Web page, the name of the directory in which data is stored, accompanying annotations or discussions, and relationships to physically or temporally proximate data items). Context helps analysts interpret the meaning of data in such applications because the data is often less precise than in traditional database applications, as it is extracted from unstructured text, extremely heterogeneous, or sensitive to the conditions under which it was captured. Better database technology is needed to manage data in context. In particular, there is a need for techniques to discover data sources, enhance the data by discovering implicit relationships, determine the weight of an object's context when assigning it semantics, and maintain the provenance of data through these steps of storage and computation.

The second challenge is to develop methods for querying and deriving insight from the resulting sea of heterogeneous data. A specific problem is to develop methods to answer keyword queries over large collections of heterogeneous data sources. We must be able to break down the query to extract its intended semantics and route the query to the relevant source(s) in the collection. Keyword queries are just one entry point into data exploration, and there is a need for techniques that lead users into the most appropriate querying mechanism. Unlike previous work on information integration, the challenges here are that we cannot assume we have semantic mappings for the data sources and we cannot assume that the domain of the query or the data sources is known. We need to develop algorithms for providing best-effort services on loosely integrated data. The system should provide meaningful answers to queries with no need for manual integration and improve over time in a pay-as-you-go fashion as semantic relationships are discovered and refined. Developing index structures to support querying hybrid data is also a significant challenge. More generally, we need to develop new notions of correctness and consistency in order to provide metrics and enable users or system designers to make cost/quality trade-offs. We also need to develop the appropriate systems


concepts around which these functionalities are tied.
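A toy sketch of best-effort keyword-query routing follows; the source names, term catalogs, and overlap-scoring rule are assumptions for illustration, standing in for whatever metadata a real system would mine in pay-as-you-go fashion. Each source is scored by how many query keywords it is known to cover, and matching sources are returned ranked.

```python
# Best-effort routing of a keyword query over heterogeneous sources,
# without semantic mappings: rank sources by keyword/catalog overlap.
# Sources and their term catalogs are hypothetical.

sources = {
    "water_db":   {"water", "well", "village", "quality"},
    "census_db":  {"population", "village", "region"},
    "photo_site": {"photo", "tag", "location"},
}

def route(query_terms, catalogs):
    scored = [(len(query_terms & terms), name)
              for name, terms in catalogs.items()]
    # Highest overlap first; drop sources with no matching terms at all.
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

print(route({"water", "village"}, sources))
```

As semantic relationships are discovered over time, such a scorer could be refined incrementally (synonyms, learned mappings) without ever requiring up-front manual integration, which is the pay-as-you-go property described above.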

In addition to managing existing data collections, there is an opportunity to innovate in the creation of data collections. The emergence of Web 2.0 creates the potential for new kinds of data-management scenarios in which users join ad hoc communities to create, collaborate, curate, and discuss data online. As an example, consider creating a database of access to clean water in different places around the world. Since such communities rarely agree on schemata ahead of time, the schemata must be inferred from the data; however, the resulting schemata are still used to guide users to consensus. Systems in this context must incorporate visualizations that drive exploration and analysis. Most important, these systems must be extremely easy to use and so will probably require compromising on some typical database functionality and providing more semiautomatic hints mined from the data. There is an important opportunity for a feedback loop here; as more data is created with such tools, information extraction and querying could become easier. Commercial and academic prototypes are beginning to appear, but there is plenty of room for additional innovation and contributions.

Cloud data services. Economic and technological factors have motivated a resurgence of shared computing infrastructure, providing software and computing facilities as a service, an approach known as cloud services or cloud computing. Cloud services provide efficiencies for application providers by limiting up-front capital expenses and by reducing the cost of ownership over time. Such services are typically hosted in a data center using shared commodity hardware for computation and storage. A varied set of cloud services is available today, including application services (salesforce.com), storage services (Amazon S3), compute services (Amazon EC2, Google App Engine, and Microsoft Azure), and data services (Amazon SimpleDB, Microsoft SQL Data Services, and Google's Datastore). They represent a major reformation of data-management architectures, with more on the horizon. We anticipate many future data-centric applications leveraging data services in the cloud.

A cross-cutting theme in cloud services is the trade-off providers face between functionality and operational costs. Today's early cloud data services offer an API that is much more restricted than that of traditional database systems, with a minimalist query language, limited consistency guarantees, and in some cases explicit constraints on resource utilization. This limited functionality pushes more programming burden on developers but allows cloud providers to build more predictable services and offer service-level agreements that would be difficult to provide for a full-function SQL data service. More work and experience are needed on several fronts to fully understand the continuum between today's early cloud data services and more full-function but possibly less-predictable alternatives.
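As a concrete illustration of the restricted surface described above, the following sketch implements a toy in-memory data service in the style of early cloud data stores: put/get by key plus a capped single-attribute equality query, with no joins, aggregates, or transactions. The class and method names are invented for illustration; this is not any vendor's actual API.

```python
class MiniDataService:
    """A hypothetical, minimalist cloud-data-service interface:
    far less expressive than SQL, but easy to make predictable."""

    def __init__(self):
        self._items = {}  # key -> attribute dict

    def put(self, key, attrs):
        # Last-writer-wins, no transactions: mirrors the limited
        # consistency guarantees the text describes.
        self._items[key] = dict(attrs)

    def get(self, key):
        return self._items.get(key)

    def query(self, attr, value, limit=100):
        # Only single-attribute equality selection, with a hard cap on
        # results -- an explicit resource-utilization constraint.
        matches = []
        for key, attrs in self._items.items():
            if attrs.get(attr) == value:
                matches.append(key)
                if len(matches) >= limit:
                    break
        return matches
```

Anything beyond this interface (a join, a group-by) becomes the developer's problem, which is precisely the programming burden the trade-off shifts onto applications.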

Manageability is particularly important in cloud environments. Relative to traditional systems, it is complicated by three factors: limited human intervention, high-variance workloads, and a variety of shared infrastructures. In the majority of cloud-computing settings, there will be no database administrators or system administrators to assist developers with their cloud-based applications; the platform must do much of that work automatically. Mixed workloads have always been difficult to tune but may be unavoidable in this context. Even a single customer's workload can vary widely over time; the elastic provisioning of cloud services makes it economical for a user to occasionally harness orders-of-magnitude more resources than usual for short bursts of work.

Meanwhile, service tuning depends heavily on the way the shared infrastructure is virtualized. For example, Amazon EC2 uses hardware-level virtual machines as its programming interface. On the opposite end of the spectrum, salesforce.com implements multi-tenant hosting of many independent schemas in a single managed DBMS. Many other virtualization solutions are possible, each with different views into the workloads above and platforms below and different abilities to control each. These variations require revisiting traditional roles and responsibilities for resource management across layers.
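To make the shared-DBMS end of that virtualization spectrum concrete, here is a toy sketch in Python with SQLite of multi-tenant hosting, where isolation comes from a tenant-id predicate inside one shared database rather than from a per-customer virtual machine. The schema and names are invented, and this is emphatically not salesforce.com's actual design.

```python
import sqlite3

# One shared DBMS holds every tenant's rows; a tenant_id column plus a
# mandatory predicate on every query is the isolation boundary, in
# contrast with hardware-level virtualization (one VM per customer).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (tenant_id TEXT, name TEXT, balance REAL)")

def insert_account(tenant, name, balance):
    conn.execute("INSERT INTO accounts VALUES (?, ?, ?)",
                 (tenant, name, balance))

def accounts_for(tenant):
    # All access is scoped to one tenant; the provider sees every
    # tenant's workload inside a single managed system.
    cur = conn.execute(
        "SELECT name, balance FROM accounts "
        "WHERE tenant_id = ? ORDER BY name", (tenant,))
    return cur.fetchall()

insert_account("acme", "ops", 100.0)
insert_account("acme", "r&d", 250.0)
insert_account("globex", "ops", 75.0)
```

The point of the contrast is visibility: the shared-DBMS provider can observe and tune per-tenant queries directly, while a VM-level provider sees only opaque machine load.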

The need for manageability adds urgency to the development of self-managing database technologies that have been explored over the past decade. Adaptive, online techniques will be required to make these systems viable, while new architectures and APIs, including the flexibility to depart from traditional SQL and transactional semantics when prudent, reduce requirements for backward compatibility and increase the motivation for aggressive redesign.

The sheer scale of cloud computing involves its own challenges. Today's SQL databases were designed in an era of relatively reliable hardware and intensive human administration; as a result, they do not scale effectively to thousands of nodes being deployed in a massively shared infrastructure. On the storage front, it is unclear whether these limitations should be addressed with different transactional implementation techniques, different storage semantics, or both simultaneously. The database literature is rich in proposals on these issues. Cloud services have begun to explore simple pragmatic approaches, but more work is needed to synthesize ideas from the literature in modern cloud computing regimes. In terms of query processing and optimization, it will not be feasible to exhaustively search a domain that considers thousands of processing sites, so some limitations on either the domain or the search will be required. Finally, it is unclear how programmers will express their programs in the cloud, as discussed earlier.

The sharing of physical resources in a cloud infrastructure puts a premium on data security and privacy that cannot be guaranteed by physical boundaries of machines or networks. Hence cloud services are fertile ground for efforts to synthesize and accelerate the work the database community has done in these areas. The key to success is to specifically target usage scenarios in the cloud, seated in practical economic incentives for service providers and customers.

As cloud data services become popular, new scenarios will emerge with their own challenges. For example, we anticipate specialized services that are pre-loaded with large data sets (such as stock prices, weather history, and Web crawls). The ability to mash up interesting data from private and public domains will be increasingly attractive and provide further motivation for the challenges discussed earlier concerning the interplay of structured and unstructured data. The desire to mash up data also points to the inevitability of services reaching out across clouds, an issue already prevalent in scientific data grids that typically have large shared data servers at multiple sites, even within a single discipline. It also echoes, in the large, the standard proliferation of data sources in most enterprises. Federated cloud architectures will only add to these challenges.

Mobile applications and virtual worlds. This new class of applications, exemplified by mobile services and virtual worlds, is characterized by the need to manage massive amounts of diverse user-created data, synthesize it intelligently, and provide real-time services. The database community is beginning to understand the challenges faced by these applications, but much more work is needed. Accordingly, the discussion about these topics at the meeting was more speculative than about the earlier topics, but they still deserve attention.

Two important trends are changing the nature of the field. First, the platforms on which mobile applications are built (hardware, software, and network) have attracted large user bases and ubiquitously support powerful interactions on the go. Second, mobile search and social networks suggest an exciting new set of mobile applications that can deliver timely information (and advertisements) to mobile users depending on location, personal preferences, social circles, and extraneous factors (such as weather), as well as the context in which they operate. Providing these services requires synthesizing user input and behavior from multiple sources to determine user location and intent.

The popularity of virtual worlds like Second Life has grown quickly and in many ways mirrors the themes of mobile applications. While they began as interactive simulations for multiple users, they increasingly blur the distinctions with the real world and suggest the potential for a more

data-rich mix. The term co-space is sometimes used to refer to a coexisting space for both virtual and physical worlds. In it, locations and events in the physical world are captured by a large number of sensors and mobile devices and materialized within a virtual world. Correspondingly, certain actions or events within the virtual world affect the physical world (such as shopping, product promotion, and experiential computer gaming). Applications of co-space include rich social networking, massive multi-player games, military training, edutainment, and knowledge sharing.

In both areas, large amounts of data flow from users and get synthesized and used to affect the virtual and/or real world. These applications raise new challenges, including how to process heterogeneous data streams in order to materialize real-world events, how to balance privacy against the collective benefit of sharing personal real-time information, and how to apply more intelligent processing to send interesting events in the co-space to someone in the physical world.

The programming of virtual actors in games and virtual worlds requires large-scale parallel programming; declarative methods have been proposed as a solution in this environment, as discussed earlier. These applications also require development of efficient systems, as suggested earlier in the context of database engines, including appropriate storage and retrieval methods, data-processing engines, parallel and distributed architectures, and power-sensitive software techniques for managing the events and communications across a large number of concurrent users.

Moving Forward

The 2008 Claremont meeting also involved discussions on the database research community's processes, including organization of publication procedures, research agendas, attraction and mentorship of new talent, and efforts to ensure a benefit from the research on practice and toward furthering our understanding of the field. Some of the trends seen in database research are echoed in other areas of computer science. Whether or not they are, the discussion may be of broader interest in the field.


Prior to the meeting, a team led by one of the participants performed a bit of ad hoc data analysis over database conference bibliographies from the DBLP repository (dblp.uni-trier.de). While the effort was not scientific, the results indicated that the database research community has doubled in size over the past decade, as suggested by several metrics: number of published papers, number of distinct authors, number of distinct institutions to which these authors belong, and number of session topics at conferences, loosely defined. This served as a backdrop to the discussion that followed. An open question is whether this phenomenon is emerging at larger scales, in computer science and in science in general. If so, it may be useful to discuss the management of growth at those larger scales.
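The flavor of that ad hoc analysis can be sketched as follows: count papers and distinct authors per year from records in DBLP's XML style. The three records below are invented for illustration; a real run would stream the full, far larger dblp.uni-trier.de dump rather than an inline string.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A tiny fragment imitating DBLP's XML record format (records invented).
SAMPLE = """<dblp>
  <inproceedings key="conf/sigmod/x98"><author>A. Jones</author>
    <author>B. Smith</author><year>1998</year></inproceedings>
  <inproceedings key="conf/vldb/y08"><author>B. Smith</author>
    <author>C. Wu</author><year>2008</year></inproceedings>
  <inproceedings key="conf/sigmod/z08"><author>D. Li</author>
    <year>2008</year></inproceedings>
</dblp>"""

def papers_and_authors_by_year(xml_text):
    """Return (papers-per-year, distinct-authors-per-year) counts."""
    papers = defaultdict(int)
    authors = defaultdict(set)
    for record in ET.fromstring(xml_text):
        year = record.findtext("year")
        papers[year] += 1
        for a in record.findall("author"):
            authors[year].add(a.text)
    return papers, {y: len(names) for y, names in authors.items()}
```

Plotting such per-year counts over a decade of conference records is the kind of quick evidence behind the "doubled in size" observation.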

The growth of the database community puts pressure on the content and processes of database research publications. In terms of content, the increasingly technical scope of the community makes it difficult for individual researchers to keep track of the field. As a result, survey articles and tutorials are increasingly important to the community. These efforts should be encouraged informally within the community, as well as via professional incentive structures (such as academic tenure and promotion in industrial labs). In terms of processes, the reviewing load for papers is increasingly burdensome, and there was a perception at the Claremont meeting that the quality of reviews had been decreasing. It was suggested at the meeting that the lack of face-to-face program-committee meetings in recent years has exacerbated the problem of poor reviews and removed opportunities for risky or speculative papers to be championed effectively over well-executed but more pedestrian work.

There was some discussion at the meeting about recent efforts (notably by ACM-SIGMOD and VLDB) to enhance the professionalism of papers and the reviewing process via such mechanisms as double-blind reviewing and techniques to encourage experimental repeatability. Many participants were skeptical that the efforts to date have contributed to long-term research quality, as measured in intellectual and practical relevance. At the same time, it was acknowledged that the database community's growth increases the need for clear and clearly enforced processes for scientific publication. The challenge going forward is to find policies that simultaneously reward big ideas and risk-taking while providing clear and fair rules for achieving these rewards. The publication venues would do well to focus as much energy on processes to encourage relevance and innovation as they do on processes to encourage rigor and discipline.

In addition to tuning the mainstream publication venues, there is an opportunity to take advantage of other channels of communication. For example, the database research community has had little presence in the relatively active market for technical books. Given the growing population of developers working with big data sets, there is a need for accessible books on scalable data-management algorithms and techniques that programmers can use to build software. The current crop of college textbooks is not targeted at this market. There is also an opportunity to present database research contributions as big ideas in their own right, targeted at intellectually curious readers outside the specialty. In addition to books, electronic media (such as blogs and wikis) can complement technical papers by opening up different stages of the research life cycle to discussion, including status reports on ongoing projects, concise presentation of big ideas, vision statements, and speculation. Online fora can also spur debate and discussion if appropriately provocative. Electronic media underscore the modern reality that it is easy to be widely published but much more difficult to be widely read. This point should be reflected in the mainstream publication context, as well as by authors and reviewers. In the end, the consumers of an idea define its value.

Given the growth in the database research community, the time is ripe for ambitious projects to stimulate collaboration and cross-fertilization of ideas. One proposal is to foster more data-driven research by building a globally shared collection of structured data, accepting contributions from all parties. Unlike previous efforts in this vein, the collection should not be designed for any particular benchmark; in fact, it is likely that most of the interesting problems suggested by this data are as yet unidentified.

There was also discussion at the meeting of the role of open source software development in the database community. Despite a tradition of open source software, academic database researchers have only rarely reused or shared software. Given the current climate, it might be useful to move more aggressively toward sharing software and collaborating on software projects across institutions. Information integration was mentioned as an area in which such an effort is emerging.

Finally, interest was expressed in technical competitions akin to the Netflix Prize (www.netflixprize.com) and KDD Cup (www.sigkdd.org/kddcup/index.php) competitions. To kick off this effort in the database domain, meeting participants identified two promising areas for competitions: system components for cloud computing (likely measured in terms of efficiency) and large-scale information extraction (likely measured in terms of accuracy and efficiency). While it was noted that each of these proposals requires a great deal of time and care to realize, several participants volunteered to initiate efforts. That work has begun with the 2009 SIGMOD Programming Contest (db.csail.mit.edu/sigmod09contest).




Correspondence regarding this article should be addressed to Joseph M. Hellerstein (hellerstein@cs.berkeley.edu).

© 2009 ACM 0001-0782/09/0600 $10.00