TRANSCRIPT
Real Time Analytics: Algorithms and Systems
Type: Tutorial Paper
Authors: Arun Kejriwal (Machine Zone Inc.), Sanjeev Kulkarni, Karthik Ramasamy (Twitter Inc.)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
In-depth overview of streaming analytics: applications, algorithms, and platforms
Contribution
Description of various types of data contributing to the field of Big Data: social media, IoT, healthcare, machine data (cloud), and connected vehicles
KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing
Type: Demo Paper
Authors: Xu Chu, John Morcos, Ihab Ilyas, Paolo Papotti, Mourad Ouzzani, Nan Tang (Qatar Computing Research Institute), Yin Ye (Google)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
Issues with data cleaning
What are the external sources?
Problems with external sources
VINERy: A Visual IDE for Information Extraction
Presented by: Omar Alqahtani
Fall 2015
Demonstration Paper
Authors Yunyao Li
IBM Research Almaden
Elmer Kim
Treasure Data, Inc.
Marc A. Touchette
IBM Silicon Valley Lab
Ramiya Venkatachalam
IBM Silicon Valley Lab
Hao Wang
IBM Silicon Valley Lab
Motivation
Extractor development remains a major bottleneck in satisfying the increasing demands of real-world applications based on IE.
Lowering the barrier to entry for extractor development becomes a critical requirement.
Related Works
Previous work has focused on reducing the manual effort involved in extractor development.
WizIE is a promising wizard-like environment, but it requires a non-trivial rule language.
Special-purpose systems.
Contribution
VINERY, a Visual INtegrated Development Environment for Information extRaction, consists of:
The foundation of VINERY is VAQL, a visual programming language for information extraction.
VINERY embeds VAQL in a web-based visual IDE for constructing extractors, which are translated into AQL and executed.
VINERY includes a rich set of easily customizable pre-built extractors to help jump-start extractor development.
VINERY provides features to support the entire life cycle of extractor development.
WADaR: Joint Wrapper and Data Repair
Authors: Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
Department of Computer Science, Oxford University, United Kingdom
Dipartimento di Matematica, Informatica ed Economia, Università della Basilicata, Italy
[email protected]
Paper Type: Demo
Presented by:
Ranjan_KY
Fall 2015
Motivation
Web scraping (or wrapping) is a popular means of acquiring data from the web.
Recent advances have made scalable wrapper generation possible, enabling data acquisition processes that involve thousands of sources.
However, no scalable tools exist that support these tasks.
Problem
Modern wrapper-generation systems leverage a number of features, ranging from HTML and visual structures to knowledge bases and micro-data.
Nevertheless, automatically-generated wrappers often suffer from errors resulting in under/over segmented data, together with missing or spurious content.
Under and over segmentation of attributes are commonly caused by irregular HTML markups or by multiple attributes occurring within the same DOM node.
Incorrect column types are instead associated with the lack of domain knowledge, supervision, or micro-data during wrapper generation.
The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper so that future wrapper executions can produce cleaner data
Demonstration
WADaR takes as input a (possibly incorrect) wrapper and a target relation schema, and iteratively repairs both the generated relations and the wrapper by observing the output of the wrapper execution.
A key observation is that errors in the extracted relations are likely to be systematic as wrappers are often generated from templated websites.
WADaR’s repair process: (i) annotating the extracted relations with standard entity recognizers; (ii) computing Markov chains describing the most likely segmentation of attribute values in the records; and (iii) inducing regular expressions that re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.
Related work
In this paper, related work was not evaluated in detail
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805–816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(13):1486–1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record alignment. In SIGMOD, pages 1713–1728. ACM, 2015.
Association Rules with Graph Patterns
Authors: Wenfei Fan (1,2), Xin Wang (3), Yinghui Wu (4), Jingbo Xu (1,2)
1 Univ. of Edinburgh  2 Beihang Univ.  3 Southwest
Presented by: Zohreh Raghebi
Fall 2015
Motivation
We propose graph-pattern association rules (GPARs) for social media marketing
Extending association rules for itemsets, GPARs help us discover regularities between entities in social graphs
We study the problem of discovering top k diversified GPARs
We also study the problem of identifying potential customers with GPARs
Introduction
A graph-pattern association rule (GPAR) R(x, y) is defined as Q(x, y) ⇒ q(x, y),
where Q(x, y) is a graph pattern in which x and y are two designated nodes,
and q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q are imposed
We refer to Q and q as the antecedent and consequent of R
We model R(x, y) as a graph pattern PR, by extending Q with a (dotted) edge q(x, y).
We treat q(x, y) as pattern Pq , and q(x, G) as the set of matches of x in G by Pq
DIVERSIFIED RULE DISCOVERY
We are interested in GPARs for a particular event q(x, y)
However, this often generates an excessive number of rules, which often pertain to the same or similar people
This motivates us to study a diversified mining problem, to discover GPARs that are both interesting and diverse
Problem. Based on the objective function, the diversified GPAR mining problem (DMP) is stated as follows.
◦ Input: A graph G, a predicate q(x, y), a support bound σ, and positive integers k and d.
◦ Output: A set Lk of k nontrivial GPARs pertaining to q(x, y) such that (a) F(Lk) is maximized; and (b) for each GPAR R ∈ Lk, supp(R, G) ≥ σ.
DIVERSIFIED RULE DISCOVERY
DMP is a bi-criteria optimization problem to discover GPARs for a particular event q(x, y) with high support and a balanced confidence and diversity.
In practice, users can freely specify q(x, y) of interests
proper parameters (e.g., support, confidence, diversity) can be estimated from query logs or recommended by domain experts
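The slides do not spell out the objective function F or the diversity measure. As an illustrative sketch only, the greedy heuristic below selects k rules by a hypothetical weighted sum of a rule's confidence and its pairwise diversity from already-selected rules; the names `conf`, `div`, and the weight `lam` are assumptions, not the paper's definitions.

```python
# Illustrative greedy top-k selection balancing confidence and diversity.
# `conf`, `div`, and `lam` are hypothetical stand-ins for the paper's
# objective F, not its actual definitions.

def greedy_diversified_top_k(rules, conf, div, k, lam=0.5):
    """Pick k rules one at a time, each time taking the rule with the
    best weighted sum of confidence and diversity w.r.t. the picks so far."""
    selected, remaining = [], list(rules)
    while remaining and len(selected) < k:
        def gain(r):
            return lam * conf(r) + (1 - lam) * sum(div(r, s) for s in selected)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: rules are (name, topic, confidence); diversity = different topic.
rules = [("R1", "music", 0.9), ("R2", "music", 0.85), ("R3", "travel", 0.6)]
conf = lambda r: r[2]
div = lambda r, s: 0.0 if r[1] == s[1] else 1.0
picked = greedy_diversified_top_k(rules, conf, div, k=2)
print([r[0] for r in picked])
```

Note how the less confident but more diverse R3 is picked over R2, which covers the same "topic" as R1 — the bi-criteria tradeoff the DMP formalizes.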
IDENTIFYING CUSTOMERS
Consider a set Σ of GPARs pertaining to the same q(x, y), i.e., their consequents are the same event q(x, y).
We define the set of entities identified by Σ in a (social) graph G with confidence η
Problem. We study the entity identification problem (EIP):
◦ Input: A set Σ of GPARs pertaining to the same q(x, y), a confidence bound η > 0, and a graph G.
◦ Output: Σ(x, G, η), i.e., all potential customers x of y in G identified by at least one GPAR in Σ, with confidence of at least η.
Keys for Graphs
Authors: Wenfei Fan (1,2), Zhe Fan (3), Chao Tian (1,2), Xin Luna Dong (4)
1 University of Edinburgh  2 Beihang University  3 Hong Kong Baptist University  4 Google Inc.
{wenfei@inf., chao.tian@}ed.ac.uk, [email protected], [email protected]
Motivation
Keys for graphs aim to uniquely identify entities represented by vertices in a graph.
We propose a class of keys that are recursively defined in terms of graph patterns, and are interpreted with subgraph isomorphism.
Extending conventional keys for relations and XML, these keys find applications in object identification, knowledge fusion, and social network reconciliation.
As an application, we study the entity matching problem: given a graph G and a set Σ of keys, find all pairs of entities (vertices) in G that are identified by keys in Σ.
We provide two parallel scalable algorithms for entity matching: one in MapReduce and one in a vertex-centric asynchronous model.
More details
Entity resolution is to identify records that refer to the same real-world entity.
Keys for graphs yield a deterministic method to provide an invariant connection between vertices and the real-world entities
The quality of matches identified by keys highly depends on keys discovered and used, although keys help us reduce false positives.
We defer the topic of key discovery to another paper and focus primarily on the efficiency of applying such constraints.
Entity resolution
Finally, we remark that entity resolution is just one application of keys for graphs; others include digital citations and knowledge base expansion.
Entity matching is different from record matching, which identifies tuples in relations and does not enforce topological constraints in the matching process.
Graph pattern matching
Consider a graph G and an entity e in G.
We say that G matches Q(x) at e if there exist a set S of triples in G and a valuation ν of Q(x) in S such that ν(x) = e,
ν is a bijection between Q(x) and S.
We refer to S as a match of Q(x) in G at e under ν.
Intuitively, ν is an isomorphism from Q(x) to S when Q(x) and S are depicted as graphs.
That is, we adopt subgraph isomorphism for the semantics of graph pattern matching
Examples
Example 4: Consider Q4(x) and G2:
a set S1 of triples in G2: {(com1, name_of, “AT&T”), (com4, name_of, “AT&T”), (com1, parent_of, com4), (com3, parent_of, com4)}.
Then S1 is a match of Q4(x) in G2 at com4, which maps variable x to com4, name* to “AT&T”, wildcard company to com1, and company to com3.
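The match in Example 4 can be checked mechanically. The sketch below is a toy backtracking matcher under the subgraph-isomorphism semantics described above; the triple encoding of Q4(x) is an assumption reconstructed from the valuation in the example, not the paper's actual pattern syntax.

```python
# Toy subgraph-isomorphism-style matcher for triple patterns.
# The encoding of Q4(x) below is an ASSUMPTION reconstructed from the
# valuation in Example 4 (x -> com4, name* -> "AT&T", companies -> com1, com3).

G2 = {
    ("com1", "name_of", "AT&T"),
    ("com4", "name_of", "AT&T"),
    ("com1", "parent_of", "com4"),
    ("com3", "parent_of", "com4"),
}

# Variables start with "?"; "?x" is the designated entity variable.
Q4 = [
    ("?x", "name_of", "?name"),
    ("?y1", "name_of", "?name"),
    ("?y1", "parent_of", "?x"),
    ("?y2", "parent_of", "?x"),
]

def is_var(t):
    return t.startswith("?")

def match(pattern, triples, binding=None, used=None):
    """Backtracking search for a valuation of the pattern variables such
    that each pattern triple maps to a distinct graph triple (a bijection,
    i.e. the subgraph-isomorphism semantics of the slides)."""
    binding = binding or {}
    used = used or set()
    if not pattern:
        yield dict(binding)
        return
    ps, pp, po = pattern[0]
    for triple in triples - used:
        new, ok = dict(binding), True
        for pv, gv in zip((ps, pp, po), triple):
            if is_var(pv):
                if new.setdefault(pv, gv) != gv:  # conflicting binding
                    ok = False
                    break
            elif pv != gv:                        # constant mismatch
                ok = False
                break
        if ok:
            yield from match(pattern[1:], triples, new, used | {triple})

matches = list(match(Q4, G2))
entities = {b["?x"] for b in matches}
print(entities)
```

On this toy data, com4 is the only entity at which G2 matches the assumed Q4(x), consistent with the example.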
Keys for Graphs:
Keys. A key for entities of type τ is a graph pattern Q(x),where x is a designated entity variable of type τ
Optimization of Common Table Expressions in MPP Database Systems
Authors: Amr El-Helw*, Venkatesh Raghavan*, Mohamed A. Soliman*, George Caragea*, Zhongxian Gu†, Michalis Petropoulos‡
* Pivotal Inc., Palo Alto, CA, USA  † Datometry Inc., San Francisco, CA, USA  ‡ Amazon Web Services, Palo Alto, CA, USA
Presented by: Zohreh Raghebi
Fall 2015
Motivation
Big Data analytics is becoming increasingly common in many business domains, including financial corporations, government agencies, and insurance providers.
Big Data analytics often involves complex queries with similar or identical expressions.
Massively Parallel Processing (MPP) databases address these challenges by distributing storage and query processing across multiple nodes and processes.
Common Table Expressions (CTEs) are commonly used in complex analytical queries that often have many repeated computations
A CTE can be seen as a temporary table that exists just for one query.
The purpose of CTEs is to avoid re-execution of expressions referenced more than once within a query.
CTEs may be defined explicitly, or generated implicitly by the query optimizer
Background
CTEs follow a producer/consumer model: the data is produced by the CTE definition and consumed in all the locations where that CTE is referenced.
One possible approach to executing CTEs is to expand (inline) all CTE consumers, rewriting the query internally to replace each reference to the CTE. This approach simplifies query execution logic, but may incur performance overhead due to executing the same expression multiple times.
Background
Alternatively, the CTE expression can be separately optimized and executed only once;
the results are kept in memory, or written to disk if the data does not fit in memory
The data is then read whenever the CTE is referenced.
This approach avoids the cost of repeated execution of the same expression,
although it may incur an overhead of disk I/O.
The impact of this approach on query optimization time is rather limited,
since the optimizer chooses one plan to be shared by all CTE consumers.
However, important optimization opportunities could be missed due to fixing one execution plan for all consumers
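A minimal single-node illustration of the producer/consumer shape, using SQLite from Python (an assumption for demonstration; the paper targets MPP engines, and whether SQLite inlines or materializes the CTE is up to its own optimizer): one CTE definition is referenced by two consumers.

```python
# One CTE definition (producer), two references (consumers).
# Plain SQLite, single node: this shows the query shape only, not
# the MPP execution model discussed in the paper.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item (id INTEGER, color TEXT, price REAL)")
con.executemany("INSERT INTO item VALUES (?, ?, ?)",
                [(1, "red", 10.0), (2, "blue", 5.0), (3, "red", 8.0)])

row = con.execute("""
    WITH red_items AS (
        SELECT * FROM item WHERE color = 'red'
    )
    SELECT (SELECT COUNT(*) FROM red_items),   -- consumer 1
           (SELECT AVG(price) FROM red_items)  -- consumer 2
""").fetchone()
print(row)
```

Inlining would rewrite each `red_items` reference as the full subquery; materialization would compute it once and share the result — the tradeoff the contextualized optimizer must cost out per consumer.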
Challenges : Deadlock Hazard
MPP systems leverage parallel query execution
where different parts of the query plan execute simultaneously as separate processes,
possibly running on different machines.
In some cases, a process has to wait until another process produces the data it needs.
For complicated queries involving multiple CTEs, the optimizer needs to guarantee that no two or more processes could be waiting on each other during query execution.
CTE constructs need to be cleanly abstracted within the query optimization framework to guarantee deadlock-free plans.
Enumerating Inlining Alternatives and Contextualized Optimization
The approaches of always inlining CTEs, or never inlining CTEs, can easily be shown to be sub-optimal.
The query optimizer needs to efficiently enumerate and cost plan alternatives that combine the benefits of these approaches.
CTEs should not be optimized in isolation without taking into account the context in which they occur.
Isolated optimization can easily miss several optimization opportunities
1. This approach avoids repeated computation. However, it does not take advantage of the index on i_color.
2. The opposite approach: all occurrences of the CTE are replaced by the expansion of the CTE. This allows the optimizer to utilize the index on i_color, but it suffers from repeated computation.
3. Figure 1(c) depicts a possible plan in which one occurrence of the CTE is expanded, allowing the use of the index, while the other two occurrences are not inlined, to avoid recomputing the common expression.
Contributions
A novel framework for the optimization of CTEs in MPP database systems.
Our framework extends and builds upon our optimizer infrastructure to allow optimization of CTEs within the context where they are used in a query
A new technique in which a CTE does not get re-optimized for every reference in the query, but only when there are optimization opportunities, e.g. pushing down filters or sort operations.
This ensures that the optimization time does not grow exponentially with the number of CTE consumers
Contribution
A cost-based approach for deciding whether or not to expand CTEs in a given query.
The cost model takes into account disk I/O as well as the cost of repeated CTE execution.
A query execution model that guarantees that the CTE producer is always executed before the CTE consumer(s).
In MPP settings, this is crucial for deadlock-free execution
Fuzzy Joins in MapReduce: An Experimental Study
Authors: Ben Kimmett, Venkatesh Srinivasan, Alex Thomo (University of Victoria, Canada) {blk,srinivas,thomo}@uvic.ca
Presented by: Zohreh Raghebi
Fall 2015
Motivation
We report experimental results for the MapReduce algorithms proposed by Afrati, Das Sarma, Menestrina, Parameswaran and Ullman in ICDE'12 (fuzzy join using MapReduce) to compute fuzzy joins of binary strings using Hamming distance. Their algorithms come with a complete theoretical analysis; however, no experimental evaluation was provided.
Methods
Several algorithms have been proposed for performing a "fuzzy join" (an operation that finds pairs of similar items) in MapReduce; this work concentrates on binary strings and Hamming distance. The algorithms proposed are:
Naive, which compares every string in the set with every other.
Ball-Hashing, which sends each string to a 'ball' of all 'nearby' strings within a certain similarity.
Methods
Anchor Points, a randomized algorithm that selects a set of strings and compares any pair of strings that have a close enough distance to a member of the set
Splitting, an algorithm that splits the strings into pieces and compares only strings with matching pieces
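The Splitting idea rests on a pigeonhole argument: if two equal-length strings are within Hamming distance d, then splitting each into d + 1 contiguous pieces leaves at least one piece on which they agree, so only strings sharing a (piece index, piece value) key need to be compared. A small single-machine sketch of this idea (in MapReduce, the keys below would be emitted in the map phase):

```python
# Single-machine sketch of the Splitting idea for Hamming-distance
# fuzzy join: bucket strings by (piece index, piece value), then compare
# only strings that share a bucket.
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def fuzzy_join_splitting(strings, d):
    n = len(strings[0])
    # d + 1 contiguous pieces: at most d mismatches can touch at most
    # d pieces, so similar strings agree on at least one piece.
    bounds = [round(i * n / (d + 1)) for i in range(d + 2)]
    buckets = defaultdict(set)
    for s in strings:
        for i in range(d + 1):
            buckets[(i, s[bounds[i]:bounds[i + 1]])].add(s)
    pairs = set()
    for group in buckets.values():
        for a, b in combinations(sorted(group), 2):
            if hamming(a, b) <= d:   # verify candidates
                pairs.add((a, b))
    return pairs

data = ["0000", "0001", "0011", "1111"]
print(fuzzy_join_splitting(data, 1))
```

Only candidate pairs sharing a piece are verified, which is where the communication/processing tradeoff discussed below comes from.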
Conclusion
It is argued that there is a tradeoff between communication cost and processing cost, and that there is a skyline of the proposed algorithms; i.e., none dominates another.
One of our objectives is to see whether we can observe this skyline in practical terms.
We observe via experiments that some algorithms are almost always preferable to others.
Splitting is a clear winner. Ball-Hashing suffers for all distance thresholds except the very small ones.
A Natural Language Interface for Querying General and Individual Knowledge
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Yael Amsterdamer , Tel Aviv University
Anna Kukliansky, Tel Aviv University
Tova Milo, Tel Aviv University
Publication: VLDB 2015
Type: Research Paper
Motivation
Many real-life scenarios (queries) require the joint analysis of general knowledge, which includes facts about the world, with individual knowledge, which relates to the opinions or habits of individuals.
“What are the most interesting places near Forest Hotel, Buffalo, we should visit in the fall?” Locations and opening hours are general knowledge.
Which locations are interesting depends on people's opinions or habits.
Existing platforms require users to specify their information needs in a formal, declarative language, which may be too complicated for naive users.
Hence, a question in the natural language should be translated into a well-formed query.
Related Work
The NL-to-query translation problem has been previously studied for queries over general data (knowledge), including SQL/XQuery/SPARQL queries.
Crowdsourcing: asking users to refine the translated query.
NL tools for parsing and detecting the semantics of NL sentences.
Challenges
The mix of general and individual knowledge leads to unique challenges:
Distinguishing the individual and general parts of the question (query).
The crowd information regarding the individual part of the NL question may not be in the knowledge base, so most current techniques, which are based on aligning questions to the knowledge base, do not apply.
Integrating the queries generated for the individual and general parts of the question into a single well-formed query.
Contributions
The modular design of a translation framework, to solve the challenges mentioned on the previous slide.
The development of new modules.
Knowledge Representation
Knowledge representation must be expressive enough to account for both general knowledge, to be queried from an ontology, and individual knowledge, to be collected from the crowd.
RDF: publicly available knowledge bases such as DBPedia and Linked-GeoData. {Buffalo, NY inside USA}.
{Buffalo, NY has Label "interesting"}.
{I visit Buffalo, NY}.
Query Language
The query language to which NL questions are translated, should naturally match the knowledge representation.
QASSIS-QL query language, which extends SPARQL, the RDF query language, with crowd mining capabilities
NL Processing Tools
Distinguishes the individual and general parts of the question (query) according to grammatical roles.
IX Detector
Dependency Parser: This tool parses a given text into a standard structure called a dependency graph. This structure is a directed graph (typically, a tree) with labels on the edges. It exposes different types of semantic dependencies between the terms of a sentence (the grammatical roles of the words).
Query Generators
It is left to perform the translation from the NL representation to the query language representation.
Missing Parameters
Limit
Threshold
Experimental Results
In this experiment, we arbitrarily chose the first 500 questions from the Yahoo! Answers repositories.
Aggregate Estimations Over Location Based Services
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Weimo Liu, The George Washington University
Md Farhadur Rahman, University of Texas at Arlington
Saravanan Thirumuruganathan, University of Texas at Arlington
Nan Zhang, The George Washington University
Gautam Das, University of Texas at Arlington
Publication: VLDB 2015
Type: Research Paper
Introduction
Location-returned services (LR-LBS): these services return the locations of the k returned tuples, e.g., Google Maps.
Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, only other attributes such as ID and ranking, e.g., WeChat and Sina Weibo.
A k-nearest-neighbors (kNN) query returns the k tuples nearest to the query point according to a ranking function (Euclidean distance in this paper).
Introduction (2)
LBS with a kNN interface: hidden databases with limited access, usually through a public web query interface or API.
These interfaces impose some constraints:
Query limitation: 10,000 per user per day in Google Maps
Maximum coverage limit, for example 5 miles away from the query point
Aggregate Estimations: For many applications, it is important to collect aggregate statistics in such hidden databases such as sum, count, or distributions of the tuples satisfying certain selection conditions.
A hotel recommendation application would like to know the average review scores for Marriott vs Hilton hotels in Google Maps;
A cafe chain startup would like to know the number of Starbucks restaurants in a certain geographical region;
A demographics researcher may wish to know the gender ratio of users of social networks in China, etc.
Motivation / Goals
Aggregate information can be obtained by:
Entering into data sharing agreements with the location-based service providers, but this approach can often be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data.
Getting the whole data set through the limited interfaces would take too long.
Goals:
Approximate estimates of such aggregates by only querying the database via its restrictive public interface.
Minimizing the query cost (i.e., ask as few queries as possible)
Making the aggregate estimations as accurate as possible.
Related Work
Analytics and Inference over LBS:
Estimating COUNT and SUM aggregates.
Error reduction, such as bias correction
Aggregate Estimations over Hidden Web Repositories: Unbiased estimators for COUNT and SUM aggregates for static databases.
Efficient techniques to obtain random samples from hidden web databases that can then be utilized to perform aggregate estimation.
Estimating the size of search engines.
Contributions
For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG) for estimating COUNT and SUM aggregates represents a significant improvement over prior work along multiple dimensions: a novel way of precisely calculating Voronoi cells leads to completely unbiased estimations; the top-k returned tuples are leveraged rather than only the top-1; and several innovative techniques are developed for reducing error and increasing efficiency.
For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work. The algorithm is not bias-free, but the bias can be controlled to any desired precision.
Background: Voronoi Diagrams
[Figure: Top-1 Voronoi vs. Top-2 Voronoi diagrams]
In a Voronoi diagram, for each point, there is a corresponding region consisting of all points closer to that point than to any other.
LR-LBS-AGG / LNR-LBS-AGG Algorithms
Precisely compute Voronoi cells; the area of a tuple's Voronoi cell gives the probability that a uniformly random query point returns that tuple as its top answer, from which Count(*) is estimated.
Extensions:
Computing Voroni cells faster
Error reduction
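The intuition behind the Voronoi-based estimator: a uniformly random query point returns tuple t as its top-1 answer with probability p(t) equal to the relative area of t's Voronoi cell, so 1/p(t) averaged over random draws is an unbiased estimate of COUNT(*). The paper computes cell areas precisely; the sketch below instead estimates them crudely by extra probing of the same kNN oracle, purely to illustrate the principle (all names and parameters are assumptions, and the clamp on p introduces a small bias the paper's precise computation avoids).

```python
# Illustrative Voronoi-area COUNT(*) estimator over a top-1 kNN oracle.
# NOT the paper's algorithm: cell areas are estimated by Monte Carlo
# probing rather than computed precisely.
import random

random.seed(7)

# Hidden database: 50 points in the unit square, reachable only via top1().
hidden = [(random.random(), random.random()) for _ in range(50)]

def top1(q):
    """The kNN oracle: nearest hidden tuple to query point q."""
    return min(hidden, key=lambda t: (t[0] - q[0])**2 + (t[1] - q[1])**2)

def estimate_count(draws=40, probes=400):
    total = 0.0
    for _ in range(draws):
        q = (random.random(), random.random())
        t = top1(q)
        # Estimate the area of t's top-1 Voronoi cell: the fraction of
        # random probe points whose nearest tuple is t.
        hits = sum(top1((random.random(), random.random())) == t
                   for _ in range(probes))
        p = max(hits, 1) / probes   # clamp to avoid division by zero
        total += 1.0 / p            # Horvitz-Thompson style term
    return total / draws

est = estimate_count()
print(round(est))
```

With exact cell areas the expectation is exactly the database size (here 50); the probing noise is why the paper's precise Voronoi computation matters.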
Experimental Results
Datasets:
Offline Real-World Dataset (OpenStreetMap, USA Portion): to verify the correctness of the algorithm.
Online LBS Demonstrations: to evaluate the efficiency of the algorithm. Google Maps
Sina Weibo