
DRAFT DOCUMENT [v1.0 May 7 2013]


MIT Big Data Initiative at CSAIL Member Workshop #1: Big Data Integration April 4, 2013

MEETING REPORT

TALK SUMMARIES

Table of Contents

TALK SUMMARIES

Introduction and Workshop Objectives

Session I: Understanding Needs from a User Perspective
   Executive Summary of Session I
   Meltem Dincer, VP, Content Marketplace, Thomson Reuters: Content Marketplace
   Simon Thompson, Chief Researcher, BT: Data Integration at BT
   Dr. John Gilbertson, Director of Pathology Informatics, Massachusetts General Hospital: Pathology Practice (in the age of big data and computation)

Session II: Systems and Tools I
   Executive Summary of Session II
   Jayant Madhavan, Google Research: Big Data Integration for Data Enthusiasts
   David Reed, SVP, Special Projects, SAP Labs: Perspectives on Big Data Integration
   Scott Schneider, IBM TJ Watson Research Center: Data Integration with IBM InfoSphere Streams
   Andrew McCallum, University of Massachusetts Amherst: From Unstructured Text to Big Databases

Session III: Systems and Tools II
   Executive Summary of Session III
   Christopher Ahlberg, Co-Founder, Recorded Future: Web Intelligence and Data Integration
   Michael Stonebraker, MIT CSAIL: Data Tamer: A Scalable Data Curation System
   John Fisher, MIT CSAIL: Big Analytics – Challenges for Multi-Modal Data Fusion
   Fadel Adib and Sam Madden, MIT CSAIL: Harvesting Application Information for Industry-Scale Relational Schema Matching

Breakout Groups
   Breakout Group Spreadsheets


Introduction and Workshop Objectives
Sam Madden and Elizabeth Bruce

The mission of the Big Data Initiative at CSAIL is to identify and develop the technologies that are needed to solve the next generation of data challenges; this will involve the development of scalable and reusable software platforms, algorithms, interfaces, and visualizations designed to deal with data that is high-volume, high-rate, or high-complexity (or some combination of these). The goal of the Big Data Integration Workshop is to understand and articulate the challenges around big data integration, with a particular focus on how Big Data makes the situation better or worse.

Broadly speaking, "Data Integration" is the problem of combining or linking two or more data sets so that they can be treated as a single coherent data set. A classic example of data integration involves company mergers or acquisitions, where employee databases need to be combined to form a single new database for the merged organization. Data integration problems that occur in this example include:

• Transforming the two data sets into the same schema (names, addresses, other identifying features), noting that the terminology (identifiers) used in the two schemas may be different and also there may not be a perfect mapping between the fields used in the schemas; one database might include additional details that are not present in the other database.

• Identifying duplicates – duplicate records may be present for employees who worked at both companies, possibly with conflicting information; clearly the goal is to preserve the current information and overwrite the stale information, while at the same time not losing anything of value.

• Resolving semantic ambiguities – ambiguities may arise around how the two human resource departments have accounted for personal days, sick days, unpaid leave days, and vacation time – their policies may not align perfectly and, in any case, the data integration process and accounting conventions on a going-forward basis will be driven by the personnel policies of the new merged entity.
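To make the schema-transformation and duplicate-identification steps above concrete, here is a small, hypothetical Python sketch. The field names, mappings, and the exact-match duplicate check are invented for illustration and are not drawn from any system discussed at the workshop.

# Hypothetical sketch: mapping two employee schemas into one merged schema
# and flagging potential duplicates. All field names are illustrative.

def to_merged_schema(record, mapping):
    """Rename a record's fields according to a source-to-target mapping;
    fields with no mapping are kept under an 'extras' key rather than dropped."""
    merged, extras = {}, {}
    for field, value in record.items():
        if field in mapping:
            merged[mapping[field]] = value
        else:
            extras[field] = value
    merged["extras"] = extras
    return merged

# Two acquired companies use different identifiers for the same concepts.
mapping_a = {"emp_name": "name", "home_addr": "address", "vac_days": "pto_days"}
mapping_b = {"fullName": "name", "address1": "address", "leaveBalance": "pto_days"}

rec_a = {"emp_name": "Jane Doe", "home_addr": "12 Main St", "vac_days": 10}
rec_b = {"fullName": "Jane Doe", "address1": "12 Main St.", "leaveBalance": 8,
         "parkingSpot": "B14"}  # detail with no counterpart in schema A

merged_a = to_merged_schema(rec_a, mapping_a)
merged_b = to_merged_schema(rec_b, mapping_b)

# A crude duplicate check on the shared key; a real system would use fuzzier
# matching and a policy for reconciling conflicting values (e.g., pto_days).
if merged_a["name"] == merged_b["name"]:
    print("Possible duplicate employee:", merged_a["name"])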

More generally, data integration involves problems around:

• Different data types – e.g., text, images, records, and a plethora of file formats to be considered.

• Different representations of the same types of data – as given in the example previously, two different database schemas representing similar information.

• Different ways of referring to the same object – for example, First Name, Last Name; Professor Last Name; Initial, Last Name; Last Name only - all might point to the same person, or not. The ambiguities must be resolved efficiently and with a high degree of accuracy.

• Different granularities of information – for example, combining zip code-level crime statistics with block-level housing prices.


• Conflicts concerning uncertain facts and duplicates – for example, different sensors may produce different estimates of a car’s position or speed, or two different databases may report different addresses for the same person. In the first case, “sensor fusion” or other probabilistic techniques are often used to estimate the true position or speed. In the second case, the information in both records could be correct – perhaps the person has a home in the city and one in the country, or a post office box and a physical address.
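As an illustration of the "different ways of referring to the same object" problem, the following Python sketch resolves name variants with a simple string-similarity check from the standard library. The normalization steps and the 0.6 threshold are arbitrary choices made for the example, not a recommended method.

# Illustrative only: deciding whether two identifiers refer to the same person
# using difflib (Python standard library) on normalized, order-insensitive names.
from difflib import SequenceMatcher

def same_entity(a, b, threshold=0.6):
    """Treat two identifiers as the same entity when their normalized
    similarity ratio exceeds a threshold."""
    def norm(s):
        tokens = s.lower().replace(".", "").replace(",", " ").split()
        return " ".join(sorted(tokens))
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

for variant in ["Jane Q. Doe", "Prof. Doe", "J. Doe", "Doe, Jane"]:
    print(variant, "->", same_entity("Jane Doe", variant))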

Taking a broad view, the Big Data perspective articulated throughout the workshop highlighted several themes. Big Data means Big Variety: there is an increasing need to combine classic enterprise systems with disparate forms and new types of data, including text, graphics, video and other imagery, and social media. Big Data Integration involves more than just joining two databases together; there may be many more repositories or sources of data to be integrated, or perhaps no databases at all.

To date, the solution to data integration has been manual effort: significant labor-intensive projects on the part of IT departments. It is difficult to make this approach work at scale: in a Big Data world, data integration often requires tens, hundreds, or thousands of data sets to be combined, requiring automated, intelligent processes that manage the integration process.

A specific example that was given involved geospatial data, where road maps, locations of points of interest (businesses, homes, parks, etc.), and other data, such as restaurant reviews, need to be combined to create a coherent map of the world (e.g., for presentation on a smartphone or website). This is a notoriously difficult problem to get right, such that even multi-billion dollar companies like Apple and Google experience many well-publicized errors every year related to this integration challenge. A successful solution to such data integration challenges requires sophisticated entity resolution (the task of finding entries that refer to the same entity across different data sources), a non-trivial task due to the lack of uniformity in identifiers used. In the mapping example, consider the identifiers that are used to refer to roads and particular places such as “Boston”. Does Boston refer to a city in Massachusetts? A street name? A place? And even if there is agreement about what “Boston” is being referenced, what are the precise coordinates of Boston? Further, data format standards are complex; the NavTeq GDF format is thousands of pages long, and there are hundreds of different coordinate systems that can be used to reference a specific spatial location on the globe.

There is a large body of work in the research community focused on these and other data integration and entity resolution problems. Some of the most desirable elements of solutions include:

• Scalability to encompass many sources and types of data
• Versatility and ability to handle incomplete, conflicting, and incorrect data
• Adaptability to new data arriving
• Improved methodologies in using schema, data, and context to resolve conflicts and ambiguities

The workshop then turned to the first session, Understanding Needs from a User Perspective.


Session I: Understanding Needs from a User Perspective

Executive Summary of Session I

Session I included speakers from Thomson Reuters, BT, and Massachusetts General Hospital. Each speaker provided examples from his or her organization related to big data integration, highlighting a range of issues, from new product development to complexities caused by legacy systems, organizational structure, and scale, to the need to conform to regulatory requirements. Human factors are highly important, both in developing the skill sets required to manage Big Data and as drivers of organizational change.

Meltem Dincer, VP, Content Marketplace, Thomson Reuters

Content Marketplace

This presentation focused on the challenges faced by a firm which has a wealth of data assets, gathered partly through acquisition of other firms. These assets are housed in different locations, in diverse formats, and unlocking their potential is, at this point in time, a largely manual effort. One hypothetical use case was offered – a medical device manufacturer, Medtronic, is approached by a pharmaceutical firm, Elan Corporation, to build a delivery system for a new drug. The medical device manufacturer will need the following information in order to assess the potential project:

• What is known about the drug and its mechanisms of action?
• What patents does the pharma firm hold?
• What do their financials look like?
• Who is running the firm?
• What partnerships do they currently have?
• What legal disputes have they been involved in?
• What judges have been involved in those disputes?
• What similar cases are out there?
• Etc.

Thomson Reuters (TR) has this information, and the TR Content Marketplace team has the task of assembling the data and putting the pieces together in a coherent way. Today, not all of this content is connected. To answer the questions in the hypothetical use case, a TR customer would have to use 18 TR products to search through at least 10 types of content to research how these elements are related. At the same time, a TR product manager wanting to leverage the same information in a new, more streamlined product would have to find and negotiate with each content owner and then, if all agreed, do the months of development work necessary to make the data interoperable and accessible on a new platform. Among the goals of the TR Content Marketplace are:


• Standardizing the information model and business processes to eliminate duplication across databases
• Federated mastering of content, assigning a clear set of rules around ownership and accountability
• Consolidating core entities around dedicated authorities and organizations
• Standardizing content databases to master unique identifiers that are resilient to change over time
• Connecting content databases through relationships and publishing the data models for all
• Putting the customer in the driver’s seat by publishing content in a standard format

In the example of the device maker and the pharma firm, the types of data sought include: the organization in question, business contacts at the firm, drug databases, SEC filings, stock market trading symbols, patents, court actions, news articles, and a rational way to provide mapping between these disparate forms of information. What TR wants to end up with is a framework of "connected content" such that entities have unique identifiers and specified relationships between them, a way to make it open so that anyone who wants to define new relationships can easily do so, and a way for users to pick and choose which content sets they are interested in. The long-term vision includes a more organized process for data mining and navigation across the full set of relevant databases and a better way to handle both the more structured and the less structured data types. In the long term, TR will continue to have a spectrum of data from structured to unstructured, and will bring in semantic web, images, and text data. TR will define its own ontology as well as mine external ontologies and relationships and put them all together, building products based on customers' needs rather than on how content was constructed.

The challenges are substantial; the stores of content at TR have grown through the acquisition of other companies over the years. The financial content alone contains over 30 content sets, with copies being maintained by the original owners, often in ways that suit their own needs. Intermediary systems developed to achieve concordance wind up adding their own interpretations. The problems were amplified after the integration between Thomson and Reuters, and there is little room for error; the customers pay for the content and, due to the typical nature of its use (finance, law, pharma), they require precision and accuracy.

One of the biggest challenges is entity disambiguation. Today, beyond basic text matching, it is mostly a manual process. An example was given of disambiguating organizations when combining multiple data sets. So the question is, "Can the machines help? Can machines do what our people do?" Key areas include:

• Big data stores helping to bring disparate databases into one place.
• Algorithms designed to mimic "expert content analysts" (a sketch of this idea follows the list). They would:
  o Use multiple data points
  o Search across different content sets
  o Search the web
  o Find facts
  o Make judgment calls based on previous knowledge and patterns or "experience"
  o Narrow down the choices
• When the machine cannot replace a human, can it assist with parts of the workflow and data gathering?
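As a rough illustration of the "expert content analyst" idea above, the following Python sketch scores candidate organization records against a query entity using several data points and then narrows the choices by ranking. The fields, weights, and records are all invented; TR's actual matching logic was not described at this level of detail.

# Hedged sketch: score candidate organization records against a query entity
# across multiple data points, then rank to narrow the choices.
# Field names and weights are hypothetical.

def score_candidate(query, candidate, weights=None):
    weights = weights or {"name": 0.5, "ticker": 0.3, "country": 0.2}
    score = 0.0
    for field, weight in weights.items():
        q, c = query.get(field), candidate.get(field)
        if q and c and q.strip().lower() == c.strip().lower():
            score += weight
    return score

query = {"name": "Acme Pharma", "ticker": "ACMP", "country": "Ireland"}
candidates = [
    {"name": "Acme Pharma", "ticker": "ACMP", "country": "Ireland"},
    {"name": "Acme Pharmaceuticals", "ticker": "ACMP", "country": "Ireland"},
    {"name": "Acme Partners", "ticker": None, "country": "US"},
]

ranked = sorted(candidates, key=lambda c: score_candidate(query, c), reverse=True)
for c in ranked:
    print(round(score_candidate(query, c), 2), c["name"])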

Simon Thompson, Chief Researcher, BT

Data Integration at BT

British Telecommunications (BT) offers a good example of an organization that grapples with data integration complexity caused by legacy and organizational complexity, as well as complexity caused by scale. Structurally, BT in 2013 has five organizational units: Retail (consumer customers and small businesses in the UK), Global (services to multi-nationals), Wholesale (infrastructure resale to other communications providers), Openreach (which provides infrastructure for the local access network across the UK), and finally TSO (internal infrastructure and technology services). On the retail side, the firm is expanding into television and has recently made a 1 billion pound investment in sports broadcasting. It has also bought an LTE license in the UK. BT has several other operations and businesses it has acquired, such as Counterpane, Radianz, and Infonet. As a result, BT has many legacy systems it must work with.

Legal issues can play a role in how data can be integrated. For example, as a telecommunications firm based in the UK, BT is governed by OFCOM (an independent regulator in the UK) and must follow strict guidelines on its operations and business practices as laid out in the Data Protection Act and in European directives. Certain data sets cannot be merged due to regulatory mandates, which are intended to ensure a level playing field for all telecoms providers and to prevent monopolies from emerging. BT cannot always do what would make good business sense because it is potentially anti-competitive. BT also has legacy issues, which can be difficult to resolve because a strong business case is needed to change the status quo and there is a certain "carrying capacity" for change. In facing change, the IT organization must flex and solutions must be provided to meet product launch times. Finally, BT handles a range of customer, government, and financial data and needs to make sure that data does not migrate from one place to another, for privacy and security reasons. Two examples of data integration complexity were discussed.

1) In the case of consumer customer record keeping, BT Retail faces issues of ownership and entity resolution. BT Retail has data on 40M+ different addresses, 10M+ consumer customers, and 30M+ live contacts over 20M+ access nodes, which requires aggregation of millions of records and entity resolution across 1.3B records (about 177 GB of data). In a normal residential household, for example, there might be a single user or two equal users (husband and wife – either of whom might pay the bills); in a small-business context, there are often multiple points of contact for billing, services, repairs, and emergency notification. One problem in the join process is the resolution of the business entity to the business contact point, as there may be multiple business contact points phoning in a fault event. For this application, three-year-old hardware is currently used. To give some perspective on the complexity of the hardware, roughly 31 servers, with 476 cores and 820 GB of RAM, are required. The management and maintenance of server architectures of this sort, with many layers of specialty and complexity, each layer optimized for a particular function, is non-trivial.

2) The next example focused on network monitoring in UK Global Services, where the data integration challenge is merging data around real-time events from different networks and integrating the data into a single consistent view showing users what is happening in the network in real time. NETREP, a network monitoring and real-time information system, is a typical example of a mid-sized application running on this system. It is a COTS package, providing real-time customer reporting in the GS stack (MPLS). There are dozens of similar applications within BT. The primary concern when procuring the software was to find a product that could be relied upon to scale. However, even though this was the objective, the change in operational context and demand profile has presented challenges and an increasingly complex solution. This example illustrates the complexity of the hardware stack required and the substantial amount of administration, solution design, and network design work that platforms have to do in building big applications. Much of the work would disappear if these applications could be developed on a single "big data architecture."

In terms of organizational complexity, BT has 20 different platforms, each of which runs scores of individual systems. Each platform has a data architect, a platform architect, and a team working for them. All systems are registered and must be granted security and architectural compliance; this is a formal process, with a board of architects and signoff required for registration to take place. The central "Chief Architect's Office" owns the Enterprise Data Model. Under this model, 900 entities are mapped and formalized (xsd), and there is a constant process of surveillance and inspection of coverage across platforms. In this way, data is managed at the platform interfaces so managers know what comes in and goes out; however, there are proprietary models in the different packages, there is legacy, there are "tactical" systems, and there is denormalization of data to enable performance. BT's requirements for new technology are:

• Is cheap to build (low entry costs), scales, and can be maintained in life indefinitely without massive fees
• Isn’t bewilderingly complex
• Manages and mitigates the load on feeder systems/ingress
• Has low operations costs and carbon impact
• Is secure
• Enables business continuity
• Is easy to use
• Offers functional properties that RDBMS do not

The future vision is data integration at the speed of software, not the speed of administration.

Dr. John Gilbertson, Director of Pathology Informatics, Massachusetts General Hospital

Pathology Practice (in the age of big data and computation)

What is a Pathologist? Pathology is a big part of medicine: about 70% of medical decisions are based on a laboratory test. Pathologists analyze blood, fluids, and tissue for the presence and nature of disease. When clinicians have questions, they take a sample, order a lab test, and send the sample to pathology. Pathologists generate a lot of medical data and the volume is growing; well over half the data elements in EMRs (Electronic Medical Records) or registries come from pathology. The data is mainly quantified measurements, with some free text data as well. Pathology generates a lot of raw data, especially in genetic sequencing and microscopic imaging (terabytes per year). Increasingly, pathologists have the ability to collect data at the molecular level: genomics gives the ability to collect data on tens of thousands of genes. Over the last several years, MGH has built up the ability to generate a lot of data and now has much better models of disease and therapy; can machines help interpret the data? In both anatomic and clinical pathology, molecular data has become central to the ways in which pathologists think about disease and it has been embraced by a large number of pathologists in all subspecialties; thus molecular concepts are coming to define Pathology's vision of itself, branching out into laboratory and molecular medicine. Some of today's challenges include:

• Data is linked directly from a patient to a specimen to a result; linking data in other ways, for example, directly linking different specimens from the same patient, is very difficult.

• Data can be sparse, since data is generated only when tests are requested; the pathology department does not store data and is not, at present, in the business of creating a data warehouse.

The longer-term vision is that molecular and genomics data can be integrated and used to infuse clinical services with new ideas and techniques, which will transform the way medical professionals understand disease and practice medicine. Factors in unlocking this potential include:

• Establishing stewardship of the data in the EMR
• Promoting tools for data analysis, presentation, and communication
• Removing IT-related barriers to effective practice
• Enhancing laboratory processes

For those working in the field, the key question is: “Can informatics be embraced by a large number of pathologists in all subspecialties, eventually becoming part of Pathology’s vision of itself and ultimately become central to the way in which Pathology approaches disease?” Part of the approach consisted of developing a clinical fellowship training program in pathology informatics, which encompasses the study and management of information, information systems, and processes in the practice of pathology. MGH Pathology Data Strategy:

• Fundamental Infrastructure. Today, there are planned upgrades to Laboratory Information Systems, integration of Laboratory Information Systems across Partners Healthcare, and a new "common clinical" information system, including a new EMR.

• Transformational Infrastructure. In the future, digitize specimens (including stained slides, which provide data at the molecular level) and apply computational power to the study of morphology and the practice of anatomic pathology; actively curate data and create a clinical computational environment (data warehouse).

• Transformational Environment. Leverage data for disease modeling, image analysis, creating knowledge bases, decision support, integrated reporting, etc.


In the context of Computational Pathology, there is activity in correlative research, predictive modeling, data exploration, and visualization. At MGH, the department's vision for the future of computational pathology is to provide the ability to bring laboratory and clinical information together with disease models in order to generate reports and decisions effectively.

Session II: Systems and Tools I

Executive Summary of Session II

Session II included speakers from Google Research, SAP Labs, the TJ Watson Research Center at IBM, and the University of Massachusetts Amherst. Discussions focused on various approaches to integrating data using new tools and provided details on classic problems in data integration including data merges (match, mismatch or no match), efficiency, speed, and accuracy. Both philosophically and practically, the best data integration systems will provide for revisability, explorability, interpolation and correspondence.

Jayant Madhavan, Google Research

Big Data Integration for Data Enthusiasts

A common view of the challenge of big data is that it involves running computations over enormous data sets: petabytes, exabytes, or more. However, this is only one aspect of the challenge. One area of interest to the Structured Data Research group at Google concerns the activities and needs of "data enthusiasts": people who are data experts within certain disciplines but lack technical expertise; they may be journalists, social scientists, non-governmental organization staff, or high school teachers or students. Many times they are engaged in advocacy efforts and are trying to achieve positive outcomes through data awareness.

Google Labs has developed Fusion Tables to help meet the needs of this user group; as a process for data management, with integration in the cloud, Fusion Tables is easy to use and promotes sharing, collaborating, exploring, visualizing, and publishing data-driven research. Fusion Tables was launched in 2009 and is now part of Google Apps. It is an experimental data visualization web application to gather, visualize, and share larger data tables. Since 2009, many millions of tables have been uploaded, the SQL API is used widely to access reference tables, and Google has added search to find public tables and tables extracted from the Web. Embedded maps are popular with journalists, and numerous examples may be given of how combining information from two data sets can enhance a story or illustrate key points in a graphical and possibly interactive way. Fusion Tables offers many ways to visualize the data, clearly providing some of the most appropriate ones for users to choose from – maps, bar graphs, pie charts, and other standard graphical elements. When users seek to produce interactive visualizations on large data sets, the applications require fast server-side retrieval, fast rendering, and low network delays.

The success of Fusion Tables hinges on support for merged tables: virtual tables that are the join of two or more underlying data tables. The joins must work, using proximate matching, entity-based matching, or other ways to pull the data together. Use cases include merging complementary data sets and merging reference data sets. The community benefits of merges flow from the fact that they foster an ecosystem for data reuse; in short:

• They draw on high quality reference tables that may be used by many, but are managed by a dedicated few
• The extent of the data reuse serves as a crowd-sourced quality signal
• Per-table permissions lead to sophisticated sharing models

The performance benefits are found in efficient in-memory optimizations in two areas: 1) multiple visualizations share in-memory indices, reducing their combined footprint, and 2) splitting tables into fixed and changing sub-tables leads to higher update rates. In searching for tables which may be candidates for a merge, there are several challenges. The initial query will be aimed at finding tables on the web that match keyword searches. The first challenge is one of extraction – not everything within a <table> is a data table, so it is important to be able to identify and distinguish data tables from navigation and formatting ones. The second challenge is one of ranking – and this is not as simple as restricting Google.com results, in part because the content outside of the table might be necessary, but misleading. In table search, there are several ways to work on the quality of the results:

• Class-property queries are data-seeking and plentiful and they can be improved over web search

• Detecting subject columns and their corresponding semantic classes can improve search quality of results

• Detecting header columns and their corresponding properties can also improve results

Returning to the discussion of merge search, there are several notes on process. First, in matching the join columns, the coverage must handle entity overlap. In matching keyword entries, the system may use token-based matching, synonymy, or other means. Considered as a whole, there is a subtle yet critical difference from web search: in these cases, recall is more important than precision, and traditional IR optimizations can get in the way.

Looking to the future, Google is studying how to make data research easier for the enthusiasts. Ideas include finding ways to automatically suggest datasets that complete, complement, or contradict the ones which the enthusiasts are currently using. Similarly, it will be useful to suggest visualizations that highlight trends in the data automatically. The goal is to assist data enthusiasts in data integration, data analysis, and storytelling, resulting in stronger, more vibrant, data-driven communications.
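As a purely illustrative example of token-based matching with synonyms when joining tables on a textual key (mentioned in the merge-search discussion above), the following Python sketch uses Jaccard overlap of normalized tokens; the synonym table, 0.5 threshold, and data values are invented and do not describe how Fusion Tables is implemented.

# Hedged illustration: join two small "tables" on free-text place names using
# token overlap plus a tiny synonym table, rather than exact key equality.
SYNONYMS = {"ma": "massachusetts", "mass": "massachusetts"}

def tokens(text):
    """Lowercase, split, and expand abbreviations via the synonym table."""
    raw = text.lower().replace(",", " ").split()
    return {SYNONYMS.get(t, t) for t in raw}

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

populations = [("Boston MA", 600000), ("Cambridge MA", 100000)]   # made-up values
reviews = [("Boston, Massachusetts", 4.5), ("Cambridge, Massachusetts", 4.4)]

merged = []
for place_a, pop in populations:
    for place_b, rating in reviews:
        if jaccard(place_a, place_b) >= 0.5:   # token-overlap join condition
            merged.append((place_a, pop, rating))

print(merged)   # each population row paired with its matching review row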

David Reed, SVP, Special Projects, SAP Labs

Perspectives on Big Data Integration

When there is a wealth of disparate details, there is a challenge in "seeing through the data" to determine what the whole picture is. In the signal processing world, data is understood to provide an approximation of the "real world," and the real world is the one that matters the most. A dataset plus a model leads to a perspective, and it is proposed that a better approach to sharing may be perspective-based.

The difficult issue in integration is in correlating independent or autonomous observations of a common reality. There is an interest in completeness, or the equivalence of a perspective, developed in part through observations and sampling. Inferences and assumptions come into play, enabling elaboration of the perspective. The methodology entails knowledge of the provenance of the data and a level of confidence in the reliability of the data. Finally, philosophical issues of knowability and practical issues of security and privacy must be covered. Once viewed from this standpoint, Big Data poses several opportunities and challenges, including revisability:

• The flawed sequencer – revising results
  o Data retention – immutability
  o Deterministic computation
• Bringing new information to old results – revising conclusions
  o Inference perspectives
  o Falsification and/or anomaly detection
• Improving capture – revising sources and methods
  o Validation and calibration
  o Integration across methodology

There is also the question of Explorability – questions are driven by the real world, not by the raw data per se. This view discounts the value of schema-based queries, because query languages should represent abstractions related to real-world measures, opinion polls as opposed to opinions themselves, for example. Data models and metadata more explicitly express applicability to the real world. However, there are further processes of analysis and interpretation at work to produce the final results, among them: Interpolation and Correspondence.

• Sampling processes are rarely synchronized across data sets
  o Naturally concurrent processes – no update locks on reality
  o Sensible queries do not depend on simultaneity or serializability
• Indicia rarely correspond
  o Coding of data entry or sensors fundamentally variable
  o No such thing as “equi-join” across geographic datasets, for example
• Interpretations are usually approximations
  o What’s wrong with sharing a model rather than the data
  o Model fitting rather than reasoning, with many potential models
  o Exceptions may be excluded, interpreted probabilistically, or used to falsify

To conclude, the concept is to move towards real abstractions for diversity in Big Data. One potential new abstraction lies in the Perspectives approach. Each operation on the data will create a new perspective, which has a new unique name. The perspective records how the result has been computed, what sources were used for the information and other details sufficient to recreate that perspective. New information capture will create a new perspective as well, a new version for all existing perspectives, in keeping with the principle of revisability. Again, perspectives are traceable back to original data sources – and encompass raw data, tools, and cleaning. Perspectives define methods for interpolation and correlation with other perspectives. The main difference from the DBMS view is that perspectives focus on reality, not tables and schemas.
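A minimal sketch of the perspective abstraction described above, assuming a simple Python representation; the field names and the uuid-based naming scheme are illustrative choices, not SAP's design.

# Minimal sketch: every operation produces a new, uniquely named object that
# records its sources and how it was computed, so results stay traceable and
# revisable. This is an assumption-laden illustration, not an implementation.
import uuid
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass(frozen=True)
class Perspective:
    data: list
    sources: List[str]          # names of the perspectives / raw inputs used
    operation: str              # description of how the result was computed
    name: str = field(default_factory=lambda: f"perspective-{uuid.uuid4().hex[:8]}")

def apply(op_name: str, fn: Callable, *parents: Perspective) -> Perspective:
    """Apply fn to the parents' data and wrap the result in a new Perspective."""
    result = fn(*[p.data for p in parents])
    return Perspective(data=result,
                       sources=[p.name for p in parents],
                       operation=op_name)

raw = Perspective(data=[3, -1, 2], sources=["sensor-feed"], operation="raw capture")
cleaned = apply("drop negatives and sort", lambda xs: sorted(x for x in xs if x >= 0), raw)
print(cleaned.name, "|", cleaned.operation, "|", cleaned.sources, "|", cleaned.data)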


Scott Schneider, IBM TJ Watson Research Center

Data Integration with IBM InfoSphere Streams

This talk addressed data integration with IBM InfoSphere Streams. Streams applications are data flow graphs that consist of:

• Tuples – a structured data item
• Operators – reusable stream processing elements for analytics
• Streams – a series of tuples with a fixed type
• Processing elements – operator groups

The stream processing language source enters the compiler, is divided up, and sent out to the processing elements across a grid of x86 hosts. Supported data formats include stock quotes, medical data, video, text, and many other forms of information.

Companies have grown increasingly interested in the use of social media to gauge interest in movies, for example, and IBM has developed a unique Social Media Analytics Architecture that takes in streams of data from various social media outlets, including Twitter, Facebook, RSS feeds, and LinkedIn. After analyzing the material, profiling the consumers or prospects, and integrating the information, it generates predictive models that help customers make timely decisions on marketing and communications efforts. Other applications include medical data integration, using InfoSphere Streams to integrate large amounts of medical data into a snapshot of a neonatal patient, for example.

Finally, Scott discussed MARIO, an integrated analytics composer that uses planning to generate application code from a high-level set of goals. Specifically, given processing goals, MARIO integrates goal refinement guidance from business users and composes an application flow automatically, generating application code for InfoSphere Streams or other platforms through the use of plug-ins.
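To illustrate the data-flow model (tuples flowing through chained operators), here is a small Python-generator analogue; it is not SPL and not the InfoSphere Streams API, and the stock-quote tuples and window size are invented.

# Illustrative analogue of a stream data-flow graph: source -> filter ->
# aggregate -> sink, with tuples passed between generator-based operators.

def source():
    """Source operator: emits a stream of (symbol, price) tuples."""
    for tup in [("IBM", 190.1), ("SAP", 61.3), ("IBM", 191.0), ("SAP", 60.8)]:
        yield tup

def filter_symbol(stream, symbol):
    """Filter operator: pass through only tuples for one symbol."""
    for sym, price in stream:
        if sym == symbol:
            yield sym, price

def moving_average(stream, window=2):
    """Aggregate operator: emit the average of the last `window` prices."""
    buf = []
    for sym, price in stream:
        buf.append(price)
        if len(buf) > window:
            buf.pop(0)
        yield sym, sum(buf) / len(buf)

# Compose the flow graph and run it; printing plays the role of a sink.
for tup in moving_average(filter_symbol(source(), "IBM")):
    print(tup)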

Andrew McCallum, University of Massachusetts Amherst

From Unstructured Text to Big Databases

A key challenge in Big Data is to build models that mine actionable knowledge from unstructured text. As an example, consider extracting lists of job openings from classified ads and feeding them into a portal. Text would have to be mined to extract position title, company, salary, and required experience. This extracted job information could then be mined for information on trends in hiring and regional statistics, and cross-referenced against national figures on unemployment, hot sectors, and salary ranges.

Another example of text mining involves academic research papers, thousands of which are published each year. If we could build tools that parse and understand papers, we could automate the construction of a knowledge base of all the publishing scientists in the world in order to help build better tools and accelerate the progress of science at the highest levels. The knowledge base could also help people find papers to read and cite in their own work, and find reviewers, collaborators, and prospective employees. It would also provide a broad view of the trends and landscape of science. Finally, it could provide a platform for a new model of publishing, one where publications could be maintained in an archive and evaluated by peers, with public comments and ratings, forming an open peer review process for these papers.

In the construction of such a knowledge base, information extraction components are not perfect and errors can snowball. To mitigate these issues, a variety of techniques are needed: representing and capturing uncertainty from the information extraction, figuring out how to use existing database contents to improve the performance of information extraction, and incrementally updating inferences as new data arrives. In addition, this information extraction process should be automated as much as possible, with humans supporting it with expert knowledge rather than operating the mechanical aspects of the integration process. Integrating human knowledge presents a number of challenges – sometimes humans are wrong, disagree, or have out-of-date information. On the other hand, they are able to reason jointly about truth and can serve as editors and arbiters of correctness.

Given the inherently uncertain nature of knowledge extraction, Andrew calls his database an epistemological database, drawing on the tenet of epistemological philosophy that truth is inferred, not observed. The key challenge in such a system is determining what to do when a new piece of information is added to the database, particularly when it conflicts with what is already present. There are several other characteristics of an epistemological database:

• Never-ending inference
  o Traditional model: knowledge base entries are locked in
  o Epistemological model: knowledge base entries may always be reconsidered, with more evidence and more time
• Resolution is foundational
  o Not just for the co-reference of entity mentions
  o But also to align values, ontologies, schemas, relations, and events
• Resource-bounded information gathering
  o Not just full processing on the whole web
  o But focusing queries and processing where it is needed and most fruitful
• Smart parallelism
  o Not just MapReduce, or black box
  o But reasoning about inference and parallelism together

Research ingredients for this new database include learning, entity resolution, crowd-sourcing human edits, relation extraction, and probabilistic programming. In his talk, Andrew focused on entity resolution with conditional random fields (CRFs). He argued for using high-level properties of the data – groups of "super entities" – to partition the data across multiple nodes in a parallel system. Each node can then perform entity resolution among these coarse super entities, improving parallelism. Hence, inference is used not only for truth discovery, but simultaneously for strategizing about the data distribution. This form of smart parallelism proves to be much faster. In addition, Andrew argued for moving from pair-based co-references to entity-based co-references. Specifically, entity-based co-references:

• Are more efficient, with fewer terms, avoiding N^2 runtimes
• Provide joint inference on all attributes of the entity in a way that pairwise co-reference cannot
• Support human edits better
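As a toy illustration of the "super entity" partitioning idea (not Andrew's CRF-based system), the following Python sketch groups author mentions into coarse blocks by a cheap surname key so that expensive pairwise resolution only happens within each block; the blocking key and the mentions are invented.

# Hedged sketch of blocking for entity resolution: coarse partitioning first,
# then quadratic pairwise comparison only inside each (small) block.
from collections import defaultdict
from itertools import combinations

mentions = ["A. McCallum", "Andrew McCallum", "A McCallum", "S. Madden", "Sam Madden"]

def block_key(mention):
    """Coarse key: the shared surname (an assumption made for illustration)."""
    return mention.replace(".", "").split()[-1].lower()

blocks = defaultdict(list)
for m in mentions:
    blocks[block_key(m)].append(m)

# Each block could be shipped to a different node; within a block we do the
# pairwise comparisons, which stay cheap because the blocks are small.
for key, members in blocks.items():
    pairs = list(combinations(members, 2))
    print(f"block '{key}': {len(members)} mentions, {len(pairs)} pairwise comparisons")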


In one specific example, looking at PubMed and the Web of Science, there are 200 million author mentions. Taking up about 400 GB, the inference covered 100,000 samples per second and took 24 hours of inference time on 3 machines with 48 cores. Andrew also discussed the importance of probabilistic programming languages, which make it easy to specify rich, complex models involving data structures, control mechanisms, and abstraction. One example he gave was Factorie, implemented in Scala, which is:

• Object-oriented – variables, factors, inference, and learning methods are objects; there is inheritance
• Embedded in a general-purpose programming language
• Scalable to billions of variables and factors, integrating tightly into the backend of the database

In summary, information extraction and integration is done with joint probabilistic inference and machine learning; Andrew argued it can be done non-greedily and still scale. He presented the model of epistemological databases, which contain entities and infer relations from evidence, rather than recording simple relational facts.

Session III: Systems and Tools II

Executive Summary of Session III

Session III featured speakers from Recorded Future, Data Tamer, and MIT CSAIL. The themes included leveraging web-based information to develop models of human social behavior, the development of a scalable data curation system, a sophisticated approach to 3D mapping, and research on using application information for relational schema matching. The challenges presented in each case echoed some of the key issues raised in earlier talks and provided insights on some of the innovative approaches being developed related to the management and mining of Big Data.

Christopher Ahlberg, Co-Founder, Recorded Future

Web Intelligence and Data Integration

The growth of social media and ubiquitous computing has changed the landscape for public activities, such as protests, around the world. Taking the example of the unrest in Egypt, a distant observer might ask, "What will happen next?" The answer depends, in part, on several questions, including, "Who is in the protest zone now? Who is speaking there? What is the sentiment? Is it getting better or worse? What has caused the unrest? How does it compare to other places?" There are tens of thousands of news websites and blogs, many thousands of government websites and organizations, and new global communications capabilities within single domains such as Facebook, Twitter, Instagram, PasteBin, and others. Languages include English, French, Spanish, Chinese, Arabic, Russian, and Farsi.

Recorded Future is a platform that provides a way to graph the content around a theme such as "Egypt Protest" and show the concentrations and trends. It makes it possible to conduct longitudinal analysis on events, looking at the progress of events over time (e.g., in the last 12 months), and also allows users to compare events (e.g., the Cairo or Benghazi protests), or to create even broader cross-comparisons between these and other possibly linked events. Recorded Future also makes it possible to search for patterns in dates or geography, highlighting trends for particular days of the week or in particular continents or countries, and offering several ways to cut across the data and produce relevant graphics. The tools allow the user to form hypotheses and compare them to actual data, evaluating the narratives and providing a base for further investigation. Looking ahead, the tools also allow for speculation: a researcher might see what is co-occurring with an event in Tahrir Square now, look back to see what occurred last month, and then look ahead to see what could co-occur next month, based on future dates mentioned in mined data sources. At the technical level, such integrations involve several considerations:

• Accurate spelling of key terms in seven languages
• Accommodation of synonyms for the same term, particularly with regard to locations
• Fitting a location, such as Tahrir Square, into a geographic hierarchy/ontology
• Relating the location or entity to other people, places, and organizations
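A hypothetical sketch of the second and third considerations above: normalizing location synonyms to a canonical entity and placing it in a simple containment hierarchy. The synonym and hierarchy tables are invented and far smaller than anything a real system would use.

# Illustrative only: canonicalize location mentions and walk a toy hierarchy.
LOCATION_SYNONYMS = {
    "tahrir square": "Tahrir Square",
    "midan tahrir": "Tahrir Square",
    "place tahrir": "Tahrir Square",
}

HIERARCHY = {
    "Tahrir Square": "Cairo",
    "Cairo": "Egypt",
    "Egypt": "Africa",
}

def canonicalize(mention):
    return LOCATION_SYNONYMS.get(mention.strip().lower(), mention.strip())

def ancestors(entity):
    chain = []
    while entity in HIERARCHY:
        entity = HIERARCHY[entity]
        chain.append(entity)
    return chain

loc = canonicalize("Midan Tahrir")
print(loc, "->", ancestors(loc))   # Tahrir Square -> ['Cairo', 'Egypt', 'Africa']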

If the data is integrated well, it will allow for analysis of the non-obvious and drive a great user experience; if not, it can lead to misleading inferences or missed observations.

Michael Stonebraker, MIT CSAIL

Data Tamer: A Scalable Data Curation System

Data curation involves ingesting, validating, transforming, correcting, consolidating, and visualizing information that is to be integrated. In conventional data integration settings, a person defines a global schema and assigns a programmer to each data source involved in the integration. The programmer then develops an understanding of the data source, writes the local-to-global mapping in a scripting language, writes a cleaning routine, and runs the ETL process. This is labor intensive and scales to (at most) 25 data sources.

A typical modern enterprise has hundreds of data sources inside its firewalls and wants additional sources that are in the public domain. Web aggregators often have more than that, giving rise to truly long tail applications. Examples of long tail applications include Verisk Health, which integrates 300 medical insurance sources; Novartis, which integrates 8,000 lab notebooks; and Goby.com, which integrates events and things to do from 80,000 URLs. The so-called "Goby Problem" is to determine which listings point to the same entity. Names may change slightly and specific details of entries may differ, creating a need to verify whether they are indeed the same listing or not.

In general, integration is done in an ad hoc manner and is application-specific, but, in the end, each application follows roughly the same process. Data Tamer seeks to automate this process, replacing the ad hoc techniques currently in use with a more systematic approach. It does this by inverting the normal ETL architecture, relying on machine learning and statistics and turning to humans for help only when the automated algorithms are uncertain.


During the ingest phase, Data Tamer assumes that a data source consists of a collection of records, with each record composed of attribute-value pairs. This material is loaded into Postgres. In the schema integration phase, the program must be told whether there is a predefined partial schema, a complete global schema, or nothing. It then starts integrating the data sources, using synonyms, templates, and authoritative tables for help. For the first few sources, it will ask the crowd for answers, asking fewer questions over time as its confidence improves. To measure similarity, it employs standard information retrieval techniques like cosine similarity on attribute names and data. After modest training, Data Tamer was able to correctly and automatically match 90% of the attributes on the Goby and Novartis problems, cutting the human costs of integration down dramatically. In cases where crowd-sourcing is used, the process consists of a hierarchy of experts with specializations. This is modulated by means of algorithms to adjust the "expertness" of the experts and a marketplace to perform load balancing. A large-scale evaluation is underway at Novartis and it is working.

Moving to entity consolidation (aka duplicate elimination), entity matching is performed on all attributes, weighted by value, presence, and distribution. The idea is to view this as a data clustering problem, with a first pass to try and identify groups of potentially duplicate entities, followed by pairwise comparison within blocks (otherwise N^2 in the number of records). Data Tamer is being spun out as a commercial company, which solicits and runs pilots; as work progresses, they will also develop the visualization capabilities and address scalability issues.
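For readers unfamiliar with the attribute-similarity step, the following Python sketch computes cosine similarity between attribute names treated as bags of character trigrams; the trigram representation and the example attributes are assumptions for illustration and are not the Data Tamer implementation.

# Hedged sketch: cosine similarity between attribute names, each represented
# as a bag of character trigrams, used to propose schema matches.
import math
from collections import Counter

def trigrams(name):
    s = f"  {name.lower()}  "                       # pad so short names still share edges
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

source_attrs = ["phone_number", "venue_name", "zipcode"]      # hypothetical source schema
target_attrs = ["phone", "name_of_venue", "postal_code", "category"]

for s in source_attrs:
    best = max(target_attrs, key=lambda t: cosine(s, t))
    print(f"{s:13s} -> {best:14s} (similarity {cosine(s, best):.2f})")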

John Fisher, MIT CSAIL

Big Analytics – Challenges for Multi-Modal Data Fusion

John's talk concerned distributed sensor systems that produce a high volume of unstructured data. The tools, methods, and applications used in distributed sensing include information theory, machine learning, nonparametric Bayesian models, inference, and graphical models. John used the example of a rich data set gathered from six cameras on an airplane looking at the city of Columbus, Ohio. Terabytes of data are collected during a twenty-minute flight. John discussed the specific task of constructing a 3D model of the region based on this imagery. Producing a high quality model requires incorporating multiple data sources, including full motion video (FMV), GPS/INS (which knows where the cameras were), LIDAR (which measures 3D geometry), and OpenStreetMap (where the data includes major roads and locations on those roads, as well as waypoints).

John's approach, called Multi-Modal Fusion, is a 3D reconstructor that takes in multiple data sources. The approach is to formulate the problem as inference in a graphical model. Multi-Modal Fusion includes a variety of new algorithms and theoretical developments that combine all of the above data sets, producing a much better 3D image than that produced by existing techniques. The key is to use an optimal percentage of each type of data, not all of it. To simplify the inputs to the rendering task, the system takes human inputs about what the data priorities are.

When handling a massive amount of data, it can be helpful to pose the basic question that is driving the analysis; this will determine how much integration is needed over terabytes of data for an answer of a certain precision. The required precision depends on the reason for building a 3D model – emergency planning? Road maps? Traffic patterns? Each purpose will require a particular level of detail, but if there is a clear view on how much data of each type will be needed to complete the task effectively, there will be significant computational savings. As Voltaire said, "The perfect is the enemy of the good." So how much is just enough? Information sources vary in quality and cost, and higher cost is not necessarily commensurate with higher quality. To formalize this notion, there is a broad class of information measures – f-divergences – which are fundamentally linked to bounds on risk. Submodularity, as applied to information measures, is a key enabler and captures the diminishing returns in a quantitative fashion. This approach:

• Provides offline and online performance bounds
• Guarantees a tractable planning method
• Exploits submodular properties, which are intimately related to the structure of the graphical model's local properties and computations and yield global properties
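For reference, the standard definitions behind these two notions (stated generically here, not as the speaker's specific formulation) are the f-divergence between distributions P and Q, for a convex function f with f(1) = 0, and the diminishing-returns property of a submodular set function F over sets of measurements:

\[
  D_f(P \,\|\, Q) \;=\; \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx,
  \qquad f \text{ convex},\; f(1) = 0.
\]

\[
  F(A \cup \{x\}) - F(A) \;\ge\; F(B \cup \{x\}) - F(B)
  \qquad \text{for all } A \subseteq B \text{ and } x \notin B.
\]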

In asking the question, “What is the structure of the world?”, there are several challenges. One problem lies in Simultaneous Localization and Mapping (SLAM) – a robot, for example, might not know where it is and yet it is trying to map the world and figure out where it is in that world simultaneously. There are certain schemes that may be used to try and approximate that case. Specifically, information measures, as a proxy for optimal sensing, capture the magnitude of the uncertainty and can be related to a large family of risk functions. Among its goals are reducing uncertainty and relating uncertainty to risk. Hence, information gathering can be taken as an experimental data goal, where efficient information planning leads to guarantees on the information gathering. The method John has developed in his research is:

• Non-greedy – considers subsets of measurements
• Non-myopic – plans multiple sources into the future

A tractable plan might have far less than exponential complexity and yet be at least as good as an optimal plan. For example, if someone is studying cars at an intersection, is the focus on how many vehicles are there at a set point in time? Or is it important to see the cars moving – how they are positioned in relation to each other as they pass through? In talking about sensing problems, there is a physical model and a database model – a query will be an expression of some abstraction. With 30 years of research on databases, there is now interest in moving the computation to the data. Data sources are distributed, so if the researcher decides which ones to look at and can perform some pre-analysis locally, then he can be more judicious about what is selected and pulled in to a larger computation. There are tradeoffs between the number of resources available, the number of queries posed, and the amount of precision in the answers. Some of the methods discussed here can improve the results of these tradeoffs.

Fadel Adib and Sam Madden, MIT CSAIL

Harvesting Application Information for Industry-Scale Relational Schema Matching

Relational schema matching relates elements from a source database to a target database. In many cases, there are not one-to-one matches between all attributes, leading to individual attributes in the source database that map to several in the target database, as well as attributes with no match in the target. In industry-scale relational schema matching settings, there are hundreds or even thousands of fields in the source and destination, with complexity that is exponential in the number of source fields.

Fadel presented a scheme that tries to automate the matching process by harvesting information from the user interface and application dynamics. Application information can often provide more information than schemas alone. For example, the user interface helps match obscure fields; a typical input at that level might be "City" and "State", while underneath, the fields might be "address_city" and "addrct". The user interface also helps to resolve one-to-many matches and it provides additional relational information. The application information is extracted using a crawler that explores the application and finds new pages. The scheme then employs a user-interface-to-database linker to determine connections between the web fields and the database, and concludes with a feature generator that generates features used in matching. The matching process is first trained on one or more small integration problems, with the result being used to automate the matching of larger problems. The overall matching is achieved through an efficient quadratic optimization.

Two experimental setups have shown good accuracy in results. For small schemas, the system was tested on a publication database with 20 fields. Baseline accuracy was 66%, while Fadel's approach showed 90% accuracy. On industry-scale schemas involving a CRM database with 300 fields, the baseline accuracy was 39% while Fadel's approach achieved 82% accuracy.
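A hedged sketch of the general idea (combining a schema-derived feature with a UI-derived feature into one matching score): the fields, labels, equal weights, and similarity function below are invented for illustration and are not the authors' feature set or their quadratic-optimization formulation.

# Illustrative only: match source fields to target fields by combining
# attribute-name similarity (schema feature) with form-label similarity
# (feature harvested from the user interface).
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

source = {"addrct": "City", "addrst": "State"}            # db field -> UI label
target = {"address_city": "City", "address_state": "State"}

for s_field, s_label in source.items():
    scored = []
    for t_field, t_label in target.items():
        score = 0.5 * sim(s_field, t_field) + 0.5 * sim(s_label, t_label)
        scored.append((score, t_field))
    best_score, best_field = max(scored)
    print(f"{s_field} -> {best_field} (score {best_score:.2f})")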


Breakout Groups

Breakout Group Spreadsheets

Team 1

Team 2

Team 3

Team 4

Team 5 (aka Team 6)

[Breakout group spreadsheets are not reproduced in this text version.]