TRANSCRIPT
Real Time Analytics: Algorithms and Systems
Type: Tutorial Paper
Authors: Arun Kejriwal (Machine Zone Inc.), Sanjeev Kulkarni, Karthik Ramasamy (Twitter Inc.)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
In-depth overview of streaming analytics: applications, algorithms, and platforms
Contribution
Description of various types of data contributing to the field of Big Data: social media, IoT, healthcare, machine data (cloud), and connected vehicles
KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing
Type: Demo Paper
Authors: Xu Chu, John Morcos, Ihab Ilyas, Paolo Papotti, Mourad Ouzzani, Nan Tang (Qatar Computing Research Institute), Yin Ye (Google)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
Issues with data cleaning
What are the external sources?
Problems with external sources
VINERy: A Visual IDE for Information Extraction
Presented by: Omar Alqahtani
Fall 2015
Demonstration Paper
Authors Yunyao Li
IBM Research Almaden
Elmer Kim
Treasure Data, Inc.
Marc A. Touchette
IBM Silicon Valley Lab
Ramiya Venkatachalam
IBM Silicon Valley Lab
Hao Wang
IBM Silicon Valley Lab
Motivation
Extractor development remains a major bottleneck in satisfying the increasing demands of real-world applications based on IE.
Lowering the barrier to entry for extractor development becomes a critical requirement.
Related Works
Previous work has focused on reducing the manual effort involved in extractor development.
WizIE is a promising wizard-like environment, but it requires a non-trivial rule language.
Special-purpose systems.
Contribution
VINERY, a Visual INtegrated Development Environment for Information extRaction, consists of:
The foundation of VINERY is VAQL, a visual programming language for information extraction.
VINERY embeds VAQL in a web-based visual IDE for constructing extractors, which are translated into AQL and executed.
VINERY includes a rich set of easily customizable pre-built extractors to help jump-start extractor development.
VINERY provides features to support the entire life cycle of extractor development.
WADaR: Joint Wrapper and Data Repair
Authors: Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
Department of Computer Science, Oxford University, United Kingdom
Dipartimento di Matematica, Informatica ed Economia, Università della Basilicata, Italy
[email protected]
Paper Type: Demo
Presented by:
Ranjan_KY
Fall 2015
Motivation
Web scraping (or wrapping) is a popular means of acquiring data from the web.
Recent advances have made scalable wrapper generation possible, enabling data acquisition processes that involve thousands of sources.
However, no scalable tools exist that support these tasks.
Problem
Modern wrapper-generation systems leverage a number of features, ranging from HTML and visual structures to knowledge bases and micro-data.
Nevertheless, automatically-generated wrappers often suffer from errors resulting in under/over segmented data, together with missing or spurious content.
Under and over segmentation of attributes are commonly caused by irregular HTML markups or by multiple attributes occurring within the same DOM node.
Incorrect column types are instead associated with the lack of domain knowledge, supervision, or micro-data during wrapper generation.
The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper so that future wrapper executions can produce cleaner data
Demonstration
WADaR takes as input a (possibly incorrect) wrapper and a target relation schema, and iteratively repairs both the generated relations and the wrapper by observing the output of the wrapper execution.
A key observation is that errors in the extracted relations are likely to be systematic as wrappers are often generated from templated websites.
WADaR’s repair process: (i) annotating the extracted relations with standard entity recognizers; (ii) computing Markov chains describing the most likely segmentation of attribute values in the records; and (iii) inducing regular expressions that re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.
Related work
In this paper, related work was not evaluated in detail
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805–816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(13):1486–1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record alignment. In SIGMOD, pages 1713–1728. ACM, 2015.
Association Rules with Graph Patterns
Authors: Wenfei Fan (1,2), Xin Wang (3), Yinghui Wu (4), Jingbo Xu (1,2)
1 Univ. of Edinburgh  2 Beihang Univ.  3 Southwest
Presented by: Zohreh Raghebi
Fall 2015
Motivation
We propose graph-pattern association rules (GPARs) for social media marketing
Extending association rules for itemsets, GPARs help us discover regularities between entities in social graphs
We study the problem of discovering top k diversified GPARs
We also study the problem of identifying potential customers with GPARs
Introduction
A graph-pattern association rule (GPAR) R(x, y) is defined as Q(x, y) ⇒ q(x, y),
where Q(x, y) is a graph pattern in which x and y are two designated nodes,
and q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q are imposed
We refer to Q and q as the antecedent and consequent of R
We model R(x, y) as a graph pattern PR, by extending Q with a (dotted) edge q(x, y).
We treat q(x, y) as pattern Pq , and q(x, G) as the set of matches of x in G by Pq
DIVERSIFIED RULE DISCOVERY
We are interested in GPARs for a particular event q(x, y)
However, this often generates an excessive number of rules, which often pertain to the same or similar people
This motivates us to study a diversified mining problem, to discover GPARs that are both interesting and diverse
Problem. Based on the objective function, the diversified GPAR mining problem (DMP) is stated as follows.
◦ Input: A graph G, a predicate q(x, y), a support bound σ, and positive integers k and d.
◦ Output: A set Lk of k nontrivial GPARs pertaining to q(x, y) such that (a) F(Lk) is maximized; and (b) for each GPAR R ∈ Lk, supp(R, G) ≥ σ.
DIVERSIFIED RULE DISCOVERY
DMP is a bi-criteria optimization problem to discover GPARs for a particular event q(x, y) with high support and a balanced confidence and diversity.
In practice, users can freely specify q(x, y) of interests
proper parameters (e.g., support, confidence, diversity) can be estimated from query logs or recommended by domain experts
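The slides do not spell out the objective function F or the diversity measure. As an illustrative sketch only, the greedy heuristic below selects k rules by a hypothetical weighted sum of a rule's confidence and its pairwise diversity from already-selected rules; the names `conf`, `div`, and the weight `lam` are assumptions, not the paper's definitions.

```python
# Illustrative greedy top-k selection balancing confidence and diversity.
# `conf`, `div`, and `lam` are hypothetical stand-ins for the paper's
# objective F, not its actual definitions.

def greedy_diversified_top_k(rules, conf, div, k, lam=0.5):
    """Pick k rules one at a time, each time taking the rule with the
    best weighted sum of confidence and diversity w.r.t. the picks so far."""
    selected, remaining = [], list(rules)
    while remaining and len(selected) < k:
        def gain(r):
            return lam * conf(r) + (1 - lam) * sum(div(r, s) for s in selected)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: rules are (name, topic, confidence); diversity = different topic.
rules = [("R1", "music", 0.9), ("R2", "music", 0.85), ("R3", "travel", 0.6)]
conf = lambda r: r[2]
div = lambda r, s: 0.0 if r[1] == s[1] else 1.0
picked = greedy_diversified_top_k(rules, conf, div, k=2)
print([r[0] for r in picked])
```

Note how the less confident but more diverse R3 is picked over R2, which covers the same "topic" as R1 — the bi-criteria tradeoff the DMP formalizes.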
IDENTIFYING CUSTOMERS
Consider a set Σ of GPARs pertaining to the same q(x, y), i.e., their consequents are the same event q(x, y).
We define the set of entities identified by Σ in a (social) graph G with confidence η
Problem. We study the entity identification problem (EIP):
◦ Input: A set Σ of GPARs pertaining to the same q(x, y), a confidence bound η > 0, and a graph G.
◦ Output: Σ(x, G, η), i.e., all potential customers x of y in G identified by at least one GPAR in Σ, with confidence of at least η.
Keys for Graphs
Authors: Wenfei Fan (1,2), Zhe Fan (3), Chao Tian (1,2), Xin Luna Dong (4)
1 University of Edinburgh  2 Beihang University  3 Hong Kong Baptist University  4 Google Inc.
{wenfei@inf., chao.tian@}ed.ac.uk, [email protected], [email protected]
Motivation
Keys for graphs aim to uniquely identify entities represented by vertices in a graph.
We propose a class of keys that are recursively defined in terms of graph patterns, and are interpreted with subgraph isomorphism.
Extending conventional keys for relations and XML, these keys find applications in object identification, knowledge fusion, and social network reconciliation.
As an application, we study the entity matching problem: given a graph G and a set Σ of keys, find all pairs of entities (vertices) in G that are identified by keys in Σ.
We provide two parallel scalable algorithms for entity matching: one in MapReduce and one in a vertex-centric asynchronous model.
More details
Entity resolution is to identify records that refer to the same real-world entity.
Keys for graphs yield a deterministic method to provide an invariant connection between vertices and the real-world entities
The quality of matches identified by keys highly depends on keys discovered and used, although keys help us reduce false positives.
We defer the topic of key discovery to another paper and focus primarily on the efficiency of applying such constraints.
Entity resolution
Finally, we remark that entity resolution is just one application of keys for graphs; others include digital citations and knowledge base expansion.
Entity matching is different from record matching, which identifies tuples in relations and does not enforce topological constraints in the matching process.
Graph pattern matching
Consider a graph G and an entity e in G.
We say that G matches Q(x) at e if there exist a set S of triples in G and a valuation ν of Q(x) in S such that ν(x) = e,
ν is a bijection between Q(x) and S.
We refer to S as a match of Q(x) in G at e under ν.
Intuitively, ν is an isomorphism from Q(x) to S when Q(x) and S are depicted as graphs.
That is, we adopt subgraph isomorphism for the semantics of graph pattern matching
Examples
Example 4: Consider Q4(x) and G2:
a set S1 of triples in G2: {(com1, name_of, “AT&T”), (com4, name_of, “AT&T”), (com1, parent_of, com4), (com3, parent_of, com4)}.
Then S1 is a match of Q4(x) in G2 at com4, which maps variable x to com4, name* to “AT&T”, wildcard company to com1, and company to com3.
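The match in Example 4 can be checked mechanically. The sketch below is a toy backtracking matcher under the subgraph-isomorphism semantics described above; the triple encoding of Q4(x) is an assumption reconstructed from the valuation in the example, not the paper's actual pattern syntax.

```python
# Toy subgraph-isomorphism-style matcher for triple patterns.
# The encoding of Q4(x) below is an ASSUMPTION reconstructed from the
# valuation in Example 4 (x -> com4, name* -> "AT&T", companies -> com1, com3).

G2 = {
    ("com1", "name_of", "AT&T"),
    ("com4", "name_of", "AT&T"),
    ("com1", "parent_of", "com4"),
    ("com3", "parent_of", "com4"),
}

# Variables start with "?"; "?x" is the designated entity variable.
Q4 = [
    ("?x", "name_of", "?name"),
    ("?y1", "name_of", "?name"),
    ("?y1", "parent_of", "?x"),
    ("?y2", "parent_of", "?x"),
]

def is_var(t):
    return t.startswith("?")

def match(pattern, triples, binding=None, used=None):
    """Backtracking search for a valuation of the pattern variables such
    that each pattern triple maps to a distinct graph triple (a bijection,
    i.e. the subgraph-isomorphism semantics of the slides)."""
    binding = binding or {}
    used = used or set()
    if not pattern:
        yield dict(binding)
        return
    ps, pp, po = pattern[0]
    for triple in triples - used:
        new, ok = dict(binding), True
        for pv, gv in zip((ps, pp, po), triple):
            if is_var(pv):
                if new.setdefault(pv, gv) != gv:  # conflicting binding
                    ok = False
                    break
            elif pv != gv:                        # constant mismatch
                ok = False
                break
        if ok:
            yield from match(pattern[1:], triples, new, used | {triple})

matches = list(match(Q4, G2))
entities = {b["?x"] for b in matches}
print(entities)
```

On this toy data, com4 is the only entity at which G2 matches the assumed Q4(x), consistent with the example.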
Keys for Graphs:
Keys. A key for entities of type τ is a graph pattern Q(x),where x is a designated entity variable of type τ
Optimization of Common Table Expressions in MPP Database Systems
Authors: Amr El-Helw*, Venkatesh Raghavan*, Mohamed A. Soliman*, George Caragea*, Zhongxian Gu†, Michalis Petropoulos‡
* Pivotal Inc., Palo Alto, CA, USA  † Datometry Inc., San Francisco, CA, USA  ‡ Amazon Web Services, Palo Alto, CA, USA
Presented by: Zohreh Raghebi
Fall 2015
Motivation
Big Data analytics is becoming increasingly common in many business domains, including financial corporations, government agencies, and insurance providers.
Big Data analytics often involves complex queries with similar or identical expressions.
Massively Parallel Processing (MPP) databases address these challenges by distributing storage and query processing across multiple nodes and processes.
Common Table Expressions (CTEs) are commonly used in complex analytical queries that often have many repeated computations
A CTE can be seen as a temporary table that exists just for one query.
The purpose of CTEs is to avoid re-execution of expressions referenced more than once within a query.
CTEs may be defined explicitly, or generated implicitly by the query optimizer
Background
CTEs follow a producer/consumer model: the data is produced by the CTE definition and consumed in all the locations where that CTE is referenced.
One possible approach to executing CTEs is to expand (inline) all CTE consumers, rewriting the query internally to replace each reference to the CTE. This approach simplifies query execution logic, but may incur performance overhead due to executing the same expression multiple times.
Background
Alternatively, the CTE expression can be separately optimized and executed only once;
the results are kept in memory, or written to disk if the data does not fit in memory
The data is then read whenever the CTE is referenced.
This approach avoids the cost of repeated execution of the same expression,
although it may incur an overhead of disk I/O.
The impact of this approach on query optimization time is rather limited,
since the optimizer chooses one plan to be shared by all CTE consumers.
However, important optimization opportunities could be missed due to fixing one execution plan for all consumers
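A minimal single-node illustration of the producer/consumer shape, using SQLite from Python (an assumption for demonstration; the paper targets MPP engines, and whether SQLite inlines or materializes the CTE is up to its own optimizer): one CTE definition is referenced by two consumers.

```python
# One CTE definition (producer), two references (consumers).
# Plain SQLite, single node: this shows the query shape only, not
# the MPP execution model discussed in the paper.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item (id INTEGER, color TEXT, price REAL)")
con.executemany("INSERT INTO item VALUES (?, ?, ?)",
                [(1, "red", 10.0), (2, "blue", 5.0), (3, "red", 8.0)])

row = con.execute("""
    WITH red_items AS (
        SELECT * FROM item WHERE color = 'red'
    )
    SELECT (SELECT COUNT(*) FROM red_items),   -- consumer 1
           (SELECT AVG(price) FROM red_items)  -- consumer 2
""").fetchone()
print(row)
```

Inlining would rewrite each `red_items` reference as the full subquery; materialization would compute it once and share the result — the tradeoff the contextualized optimizer must cost out per consumer.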
Challenges : Deadlock Hazard
MPP systems leverage parallel query execution
where different parts of the query plan execute simultaneously as separate processes,
possibly running on different machines.
In some cases, a process has to wait until another process produces the data it needs.
For complicated queries involving multiple CTEs, the optimizer needs to guarantee that no two or more processes could be waiting on each other during query execution.
CTE constructs need to be cleanly abstracted within the query optimization framework to guarantee deadlock-free plans.
Enumerating Inlining Alternatives and Contextualized Optimization
The approaches of always inlining CTEs, or never inlining CTEs, can easily be shown to be sub-optimal.
The query optimizer needs to efficiently enumerate and cost plan alternatives that combine the benefits of these approaches.
CTEs should not be optimized in isolation without taking into account the context in which they occur.
Isolated optimization can easily miss several optimization opportunities
1. This approach avoids repeated computation. However, it does not take advantage of the index on i_color.
2. The opposite approach: all occurrences of the CTE are replaced by the expansion of the CTE. This allows the optimizer to utilize the index on i_color, but it suffers from repeated computation.
3. Figure 1(c) depicts a possible plan in which one occurrence of the CTE is expanded, allowing the use of the index, while the other two occurrences are not inlined, to avoid recomputing the common expression.
Contributions
A novel framework for the optimization of CTEs in MPP database systems.
Our framework extends and builds upon our optimizer infrastructure to allow optimization of CTEs within the context where they are used in a query
A new technique in which a CTE does not get re-optimized for every reference in the query, but only when there are optimization opportunities, e.g. pushing down filters or sort operations.
This ensures that the optimization time does not grow exponentially with the number of CTE consumers
Contribution
A cost-based approach for deciding whether or not to expand CTEs in a given query.
The cost model takes into account disk I/O as well as the cost of repeated CTE execution.
A query execution model that guarantees that the CTE producer is always executed before the CTE consumer(s).
In MPP settings, this is crucial for deadlock-free execution
Fuzzy Joins in MapReduce: An Experimental Study
Authors: Ben Kimmett, Venkatesh Srinivasan, Alex Thomo (University of Victoria, Canada) {blk,srinivas,thomo}@uvic.ca
Presented by: Zohreh Raghebi
Fall 2015
Motivation
We report experimental results for the MapReduce algorithms proposed by Afrati, Das Sarma, Menestrina, Parameswaran and Ullman in ICDE'12 (fuzzy join using MapReduce) to compute fuzzy joins of binary strings using Hamming distance. Their algorithms come with a complete theoretical analysis; however, no experimental evaluation was provided.
Methods
Several algorithms have been proposed for performing a "fuzzy join" (an operation that finds pairs of similar items) in MapReduce; this work concentrates on binary strings and Hamming distance. The algorithms proposed are:
Naive, which compares every string in the set with every other.
Ball-Hashing, which sends each string to a 'ball' of all 'nearby' strings within a certain similarity.
Methods
Anchor Points, a randomized algorithm that selects a set of strings and compares any pair of strings that have a close enough distance to a member of the set
Splitting, an algorithm that splits the strings into pieces and compares only strings with matching pieces
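The Splitting idea rests on a pigeonhole argument: if two equal-length strings are within Hamming distance d, then splitting each into d + 1 contiguous pieces leaves at least one piece on which they agree, so only strings sharing a (piece index, piece value) key need to be compared. A small single-machine sketch of this idea (in MapReduce, the keys below would be emitted in the map phase):

```python
# Single-machine sketch of the Splitting idea for Hamming-distance
# fuzzy join: bucket strings by (piece index, piece value), then compare
# only strings that share a bucket.
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def fuzzy_join_splitting(strings, d):
    n = len(strings[0])
    # d + 1 contiguous pieces: at most d mismatches can touch at most
    # d pieces, so similar strings agree on at least one piece.
    bounds = [round(i * n / (d + 1)) for i in range(d + 2)]
    buckets = defaultdict(set)
    for s in strings:
        for i in range(d + 1):
            buckets[(i, s[bounds[i]:bounds[i + 1]])].add(s)
    pairs = set()
    for group in buckets.values():
        for a, b in combinations(sorted(group), 2):
            if hamming(a, b) <= d:   # verify candidates
                pairs.add((a, b))
    return pairs

data = ["0000", "0001", "0011", "1111"]
print(fuzzy_join_splitting(data, 1))
```

Only candidate pairs sharing a piece are verified, which is where the communication/processing tradeoff discussed below comes from.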
Conclusion
It is argued that there is a tradeoff between communication cost and processing cost, and that there is a skyline of the proposed algorithms; i.e., none dominates another.
One of our objectives is to see whether we can observe this skyline in practical terms.
We observe via experiments that some algorithms are almost always preferable to others.
Splitting is a clear winner. Ball-Hashing suffers for all distance thresholds except the very small ones.
A Natural Language Interface for Querying General and Individual Knowledge
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Yael Amsterdamer , Tel Aviv University
Anna Kukliansky, Tel Aviv University
Tova Milo, Tel Aviv University
Publication: VLDB 2015
Type: Research Paper
Motivation
Many real-life scenarios (queries) require the joint analysis of general knowledge, which includes facts about the world, with individual knowledge, which relates to the opinions or habits of individuals.
“What are the most interesting places near Forest Hotel, Buffalo, we should visit in the fall?” Locations and opening hours are general knowledge.
Which locations are interesting depends on people's opinions or habits.
Existing platforms require users to specify their information needs in a formal, declarative language, which may be too complicated for naive users.
Hence, a question in the natural language should be translated into a well-formed query.
Related Work
The NL-to-query translation problem has been previously studied for queries over general data (knowledge), including SQL/XQuery/SPARQL queries.
Crowdsourcing: asking users to refine the translated query.
NL tools for parsing and detecting the semantics of NL sentences.
Challenges
The mix of general and individual knowledge leads to unique challenges:
Distinguishing the individual and general parts of the question (query).
The crowd information regarding the individual part of the NL question may not be in the knowledge base, so most current techniques, which are based on aligning questions to the knowledge base, do not apply.
Integrating the queries generated for the individual and general parts of the question into a single well-formed query.
Contributions
The modular design of a translation framework, to solve the challenges mentioned on the previous slide.
The development of new modules.
Knowledge Representation
Knowledge representation must be expressive enough to account for both general knowledge, to be queried from an ontology, and individual knowledge, to be collected from the crowd.
RDF: publicly available knowledge bases such as DBPedia and Linked-GeoData. {Buffalo, NY inside USA}.
{Buffalo, NY has Label "interesting"}.
{I visit Buffalo, NY}.
Query Language
The query language to which NL questions are translated, should naturally match the knowledge representation.
QASSIS-QL query language, which extends SPARQL, the RDF query language, with crowd mining capabilities
NL Processing Tools
Distinguishes the individual and general parts of the question (query) according to grammatical roles.
IX Detector
Dependency Parser: This tool parses a given text into a standard structure called a dependency graph. This structure is a directed graph (typically, a tree) with labels on the edges. It exposes different types of semantic dependencies between the terms of a sentence (the grammatical roles of the words).
Query Generators
It is left to perform the translation from the NL representation to the query language representation.
Missing Parameters
Limit
Threshold
Experimental Results
In this experiment, we arbitrarily chose the first 500 questions from the Yahoo! Answers repositories.
Aggregate Estimations Over Location Based Services
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Weimo Liu, The George Washington University
Md Farhadur Rahman, University of Texas at Arlington
Saravanan Thirumuruganathan, University of Texas at Arlington
Nan Zhang, The George Washington University
Gautam Das, University of Texas at Arlington
Publication: VLDB 2015
Type: Research Paper
Introduction
Location-returned services (LR-LBS): these services return the locations of the k returned tuples, e.g., Google Maps.
Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, only other attributes such as ID and ranking, e.g., WeChat and Sina Weibo.
A k-nearest-neighbors (kNN) query returns the k tuples nearest to the query point according to a ranking function (Euclidean distance in this paper).
Introduction (2)
LBS with a kNN interface: hidden databases with limited access, usually through a public web query interface or API.
These interfaces impose some constraints:
Query limitation: 10,000 per user per day in Google Maps
Maximum coverage limit, for example 5 miles away from the query point
Aggregate Estimations: For many applications, it is important to collect aggregate statistics in such hidden databases such as sum, count, or distributions of the tuples satisfying certain selection conditions.
A hotel recommendation application would like to know the average review scores for Marriott vs Hilton hotels in Google Maps;
A cafe chain startup would like to know the number of Starbucks restaurants in a certain geographical region;
A demographics researcher may wish to know the gender ratio of users of social networks in China, etc.
Motivation / Goals
Aggregate information can be obtained by:
Entering into data sharing agreements with the location-based service providers, but this approach can often be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data.
Getting the whole data set through the limited interfaces would take too long.
Goals:
Approximate estimates of such aggregates by only querying the database via its restrictive public interface.
Minimizing the query cost (i.e., ask as few queries as possible)
Making the aggregate estimations as accurate as possible.
Related Work
Analytics and Inference over LBS:
Estimating COUNT and SUM aggregates.
Error reduction, such as bias correction
Aggregate Estimations over Hidden Web Repositories: Unbiased estimators for COUNT and SUM aggregates for static databases.
Efficient techniques to obtain random samples from hidden web databases that can then be utilized to perform aggregate estimation.
Estimating the size of search engines.
Contributions
For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG) for estimating COUNT and SUM aggregates represents a significant improvement over prior work along multiple dimensions: a novel way of precisely calculating Voronoi cells leads to completely unbiased estimations; the top-k returned tuples are leveraged rather than only the top-1; and several innovative techniques are developed for reducing error and increasing efficiency.
For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work. The algorithm is not bias-free, but the bias can be controlled to any desired precision.
Background: Voronoi Diagrams
[Figure: Top-1 Voronoi vs. Top-2 Voronoi diagrams]
In a Voronoi diagram, for each point, there is a corresponding region consisting of all points closer to that point than to any other.
LR-LBS-AGG / LNR-LBS-AGG Algorithms
Precisely compute Voronoi cells; the area of a tuple's Voronoi cell gives the probability that a uniformly random query point returns that tuple as its top answer, from which Count(*) is estimated.
Extensions:
Computing Voroni cells faster
Error reduction
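The intuition behind the Voronoi-based estimator: a uniformly random query point returns tuple t as its top-1 answer with probability p(t) equal to the relative area of t's Voronoi cell, so 1/p(t) averaged over random draws is an unbiased estimate of COUNT(*). The paper computes cell areas precisely; the sketch below instead estimates them crudely by extra probing of the same kNN oracle, purely to illustrate the principle (all names and parameters are assumptions, and the clamp on p introduces a small bias the paper's precise computation avoids).

```python
# Illustrative Voronoi-area COUNT(*) estimator over a top-1 kNN oracle.
# NOT the paper's algorithm: cell areas are estimated by Monte Carlo
# probing rather than computed precisely.
import random

random.seed(7)

# Hidden database: 50 points in the unit square, reachable only via top1().
hidden = [(random.random(), random.random()) for _ in range(50)]

def top1(q):
    """The kNN oracle: nearest hidden tuple to query point q."""
    return min(hidden, key=lambda t: (t[0] - q[0])**2 + (t[1] - q[1])**2)

def estimate_count(draws=40, probes=400):
    total = 0.0
    for _ in range(draws):
        q = (random.random(), random.random())
        t = top1(q)
        # Estimate the area of t's top-1 Voronoi cell: the fraction of
        # random probe points whose nearest tuple is t.
        hits = sum(top1((random.random(), random.random())) == t
                   for _ in range(probes))
        p = max(hits, 1) / probes   # clamp to avoid division by zero
        total += 1.0 / p            # Horvitz-Thompson style term
    return total / draws

est = estimate_count()
print(round(est))
```

With exact cell areas the expectation is exactly the database size (here 50); the probing noise is why the paper's precise Voronoi computation matters.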
Experimental Results
Datasets:
Offline Real-World Dataset (OpenStreetMap, USA Portion): to verify the correctness of the algorithm.
Online LBS Demonstrations: to evaluate the efficiency of the algorithm. Google Maps
Sina Weibo