csis7101: course summary
DESCRIPTION
CSIS7101: Course summary. Spatial data Spatiotemporal data Multimedia and Time-series data Data mining I (association rules and sequence patterns) Data mining II (clustering and classification) Data warehousing and OLAP Strings and biological data Semi-structured and XML data - PowerPoint PPT PresentationTRANSCRIPT
Dr. N. Mamoulis Advanced Database Technologies
1
CSIS7101: Course summary
Spatial data Spatiotemporal data Multimedia and Time-series data Data mining I (association rules and sequence patterns)Data mining II (clustering and classification)Data warehousing and OLAP Strings and biological data Semi-structured and XML data Storage and query processing on modern machines Cache conscious indexes
Dr. N. Mamoulis Advanced Database Technologies
2
Spatial Data The R-tree and the R*-tree What are they?
Dynamic, balanced trees that index Minimum Bounding Rectangles
Each entry in a directory node is an <MBR,ptr> pair. Each entry in a data node is an <MBR,oid> pair
How are they constructed/updated? They use special insert/split algorithms The entry MBRs in directory nodes should have (i)
minimal margins – more “square-like”, (ii) minimal overlap – good search behavior, (iii) minimal area – small “dead space”.
Dr. N. Mamoulis Advanced Database Technologies
3
Spatial Data (cont’d) Spatial Joins Using R-trees How is the spatial join performed?
By synchronous traversal, following recursively directory node entries that overlap.
How is the computational cost minimized? The space restriction technique The plane sweep heuristic
How is the I/O cost minimized? Using some ordering and pinning
techniques (optional reading)
Dr. N. Mamoulis Advanced Database Technologies
4
Spatial Data (cont’d) Nearest Neighbor Search Using R-trees Which is the optimal algorithm for NN search?
INN method in the “distance browsing” paper. How does it work?
It uses a priority queue that organizes <dist, ptr> pairs according to their (smallest) distance from the query object.
Initially all entries of the root are placed in the heap At each step, the pair with the minimum distance is
retrieved. If it is an object, it is output If it is an object MBR, the actual object is fetched an
inserted in the queue entry at a directory level, the entries in the node pointed
by it are loaded and inserted in the queue.
Dr. N. Mamoulis Advanced Database Technologies
5
Spatiotemporal Data
Two types of problems:1. Indexing the current positions and
movements of objects and querying their anticipated future positions.
2. Indexing and querying the past movements of mobile objects.
On Indexing Mobile ObjectsIndexing the Positions of Continuously Moving ObjectsMV3R-Tree (optional reading)
1
2
Dr. N. Mamoulis Advanced Database Technologies
6
Spatiotemporal Data (cont’d)
Indexing current/future locations mobile objects (problem 1):
Dual transformation Transform the line trajectories to a dual
representation and index the resulting points How is the 1D problem solved in the dual plane?
The TPR-tree Like the R-tree, but the MBRs are time-
parameterized to conservative bounding intervals (CBI).
How are the CBI computed? What is the best way to group objects into a CBI? By minimizing an objective function (e.g., overlap)
over the time the TPR-tree is valid. How do we answer queries using the TPR-tree?
Dr. N. Mamoulis Advanced Database Technologies
7
Multimedia and time-series data
When is Nearest Neighbor Search Meaningful? Consider a query object q, with distance dist(q, NN)
from its nearest neighbor NN and with distance dist(q, FN) from its farthest neighbor FN.
If dist(q,NN)/dist(q,FN) converges to 1 with the increase of dimensionality then NN search is not meaningful.
In other words, if the distance of the nearest neighbor is not statistically different to the distance of the farthest neighbor, then NN search is meaningless.
If the data form clusters in the high dimensional space, then NN search is meaningful. Also, if the data actually lie in a (hidden) embedded space, which is low dimensional then NN search is meaningful.
Dr. N. Mamoulis Advanced Database Technologies
8
Multimedia and time-series data
What is FASTMAP? A technique to map objects from a high-
dimensional (unknown) space to points in a low dimensional space, so that the original distance between any pair of objects is preserved as much as possible after the transformation.
How does FASTMAP work? It picks the furthest pair of objects (Oa and Ob)
at a time, called pivots, and the coordinates of all points in that dimension is defined by their projection on the line segment OaOb.
Dr. N. Mamoulis Advanced Database Technologies
9
Multimedia and time-series data
Indexing Time-Series Data (optional) The similarity between two time-series is computed
using an expensive dynamic time-warping technique.
In order to index the data and reduce the similarity search cost, the series are approximated by an envelope that bounds it and at the same time makes DTW applicable on it.
The lower-upper bounds of the envelope are in turn approximated by MBRs and indexed using a high-dimensional index (e.g., R-tree).
An algorithm like INN is applied on the R-tree to filter fast series that may not be similar to the query time-series, in a branch-and-bound way.
Dr. N. Mamoulis Advanced Database Technologies
10
Data MiningData mining topics covered:
Mining association rules: If A and B is bought then also C is bought.
Mining frequent sequence patterns: If in a sequence of event transmissions B is close and
after A, then C will appear soon. Classification:
Given a training sample, create a classifier, that predicts the class labels of unclassified data.
Clustering: Classify a dataset into k clusters (of unknown class
label), such that the distance between tuples (objects) in the same cluster is small, and the distance between tuples in different clusters is large.focus
Dr. N. Mamoulis Advanced Database Technologies
11
Mining Association RulesFirst find, frequent itemsets, then create association rules (how?).How does the Apriori Algorithm work?
Finds frequent itemsets level-by-level First, frequent items are found (i.e., frequent sets
with only one item) Frequent itemsets of level l are joined to generate
candidate itemsets of level l+1. Candidates of level l+1 with a subset not appearing
in the frequent itemsets of level l are pruned. The frequency of the remaining ones is counted by scanning the database.
Dr. N. Mamoulis Advanced Database Technologies
12
Mining Association RulesHow does the FP-tree method work?
First finds the items with support>s and sorts them in decreasing support value.
Then the FP-tree is constructed for these items by considering the tuples found in the database as paths of the tree. Finally, the nodes of the tree with the same label (corresponding to the same item) are linked using a linked list.
Then the frequent itemsets are found from the FP-tree, for each item individually by accessing the paths containing the item.
Dr. N. Mamoulis Advanced Database Technologies
13
ClusteringHow does the basic k-means clustering algorithm work?
1. First the objects are distributed (randomly) to arbitrary clusters.
2. The center of each cluster is computed (by averaging the values in each dimension).
3. Rellocate each object to the cluster whose center is closest to it.
4. Reapply 2-3 until the clusters do not change.
Dr. N. Mamoulis Advanced Database Technologies
14
ClusteringHow do k-medoids clustering algorithm work?
1. They initially pick randomly k medoids (representatives).
2. Each object is assigned to the cluster defined by the nearest medoid.
3. Find the medoid-object pair, such that if they are swapped (the object takes the place of the medoid) the benefit is maximized.
4. Repeat 2-3 until there is no change.
Dr. N. Mamoulis Advanced Database Technologies
15
Hierarchical ClusteringWhat is hierarchical clustering?
Bottom-up or agglomerative: Initially each object is a cluster. Then the closest clusters are merged iteratively until only k clusters remain.
Top-down or divisive: Initially all objects are in one cluster. Then at each step the cluster with the best split is iteratively split until k clusters remain.
Dr. N. Mamoulis Advanced Database Technologies
16
CUREHow does CURE work?
It is an agglomerative algorithm, that creates the clusters bottom-up.
Medoid-based, but each cluster has more than one representatives.
The representatives are scattered, thus (i) non-spherical clusters can be discovered, (ii) clusters linked weakly are not merged (as a density-based method would do), (iii) outliers are not included in the clusters.
Random sampling is used to avoid the excessive O(n2) cost of applying the algorithm on the whole dataset.
Dr. N. Mamoulis Advanced Database Technologies
17
PROCLUSWhat is PROCLUS?
An algorithm that discovers clusters in a subspace of the original high-dimensional space.
Why is PROCLUS useful? In many cases, clustering (and nearest-
neighbor search) is not meaningful in a high dimensional space, because there are no natural clusters there.
However, clusters could be found in dimensional subspaces, where not all dimensions are relevant to the cluster.
Dr. N. Mamoulis Advanced Database Technologies
18
PROCLUSHow does PROCLUS work?
1. Use a greedy method to find a set M of >k medoids
2. Pick k medoids at random3. Approximate the optimal set of dimensions for
each medoid4. Assign points to clusters and evaluate the
cluster goodness5. Replace “bad” medoids with random ones
from M6. Repeat 3-5 until only good enough medoids
remain.7. Approximate the optimal set of dimensions for
each medoid and assign the points to clusters.
Dr. N. Mamoulis Advanced Database Technologies
19
Data Warehousing and OLAP
What is a data warehouse? a large collection of heterogeneous data which
have been cleaned, integrated and consolidated. The warehouse contains old, historical data which are useful for data analysis tasks.
What is On-Line Analytical Processing? Data analysis on Data Warehouses and decision
making.
What is the Data cube? Models all combinations of multidimensional
aggregate views of the warehouse.
How are the multidimensional views related? A lattice models the partial order of the views and
shows if a view can be computed from another.
12
11
98
34
8
Dr. N. Mamoulis Advanced Database Technologies
20
Data Warehousing and OLAP
What defines the hierarchical relationship between two views:
View A can be computed by view B if the dimensions of A are a subset of the dimensions of B. Example: We can compute the cuboid
(product,city) from the cuboid (product, city, time).
View A can be computed by view B if each dimension in A is at the same or higher hierarchical conceptual level than the corresponding dimension in B. Example: We can compute the cuboid
(product,country) from the cuboid (product, city).
12
11
98
34
8
Dr. N. Mamoulis Advanced Database Technologies
21
View Selection in Data Warehouses
What is the view selection problem? If we compute and materialize the whole cube
we can answer efficiently every OLAP query. However:
The size of the cube can be very large, thus we may not be able to store all views due to space constraints.
When the warehouse is updated, we may not have enough time to update all materialized views (maintenance cost constraint).
Therefore, given space and update time constraints we have to choose which set of views to materialize based on their benefit in answering queries
12
11
98
34
8
Dr. N. Mamoulis Advanced Database Technologies
22
View Selection in Data Warehouses
Two approaches in view selection: Static view selection: Given specific query
statistics, space and time constraints select the best views to materialize (paper: Implementing Data Cubes Efficiently).
Dynamic view materialization: The set materialized views is determined and dynamically based on (i) the frequency with which they are retrieved, (ii) their benefit to answer other queries, (iii) their space, (iv) their maintenance cost (paper: DynaMat).
12
11
98
34
8
Dr. N. Mamoulis Advanced Database Technologies
23
View Selection in Data Warehouses
Static view selection: How are the views selected?
Using a greedy algorithm.
Dynamic view materialization: How are the views replaced in the buffer
pool? Using a replacement policy that considers the
benefit of a materialized result dynamically. Which views are maintained during an
update? Those which can be maintained fast and provide a
larger benefit.
12
11
98
34
8
Dr. N. Mamoulis Advanced Database Technologies
24
Strings and Biological Data
What is approximate string matching? Given a text t, a query q and a distance
threshold k, find all substrings in t whose distance from q is at most k.
What distance function is used? The edit distance, which counts the
minimum number of operations (insert, delete, replace) to transform one string to the other.
How is the problem solved exactly? Using an (expensive) dynamic
programming algorithm.
Dr. N. Mamoulis Advanced Database Technologies
25
Strings and Biological Data
How does the dynamic programming algorithm work?
Problem: find the edit distance between strings x and y. Create a (|x|+1)(|y|+1) matrix C, where Ci,j represents the
minimum number of operations to match x1..i with y1..j. The matrix is constructed as follows.
Ci,0 = i C0,j = j Ci,j =
Ci-1,Cj-1 if xi=yi, 1+min(Ci-1,Cj, Ci,Cj-1, Ci-1,Cj-1), else.
How is it adapted for substring matching? The difference is that we set C0,j=0 for all j, since any text
position could be the potential start of a match. If the similarity distance bound is k, we report all positions,
where Cm ≤k (m is the last row – m = |q|).
Dr. N. Mamoulis Advanced Database Technologies
26
Strings and Biological Data
What can we do to reduce the cost of DP? We can use cheap filtering techniques that avoid
applying DP at each position of t. They do not provide solutions, but aim to reduce
the search effort. They are based on the idea that it is most probable
for a text area not to match a query rather than to match it.
They work in two steps. First a cheap heuristic is used to determine whether
an area of the text t could match with the query q. If not search is abandoned for this area, otherwise
an expensive search algorithm (i.e., dynamic programming) is applied
Dr. N. Mamoulis Advanced Database Technologies
27
Strings and Biological Data
Other ways to perform substring matching
Build a suffix-tree for t and use it for approximate and exact queries.
If t is large the suffix tree cannot be built in memory.
We can solve the problem, by splitting the possible substrings into groups and build a suffix tree in memory for each of them.
Paper: A Database Index to Large Biological Sequences (optional reading)
Dr. N. Mamoulis Advanced Database Technologies
28
Strings and Biological Data
Other ways to perform substring matching All substrings are transformed to high
dimensional points and indexed. A query is then applied on the indexes using a
lower bound of the edit distance. Given an alphabet Σ = {α1,α2,...,αd}, we can
transform a string to a d-dimensional point, called frequency vector, which stores the global frequence of each character in the substring.
The idea is extended to capture also the local frequencies of the characters in the substring, using wavelet transforms.
Dr. N. Mamoulis Advanced Database Technologies
29
Semi-structured Data
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
What are semi-structured data? In various application domains, the data are semi-
structured; the database schema is loose-defined.
What is XML? A language like HTML, where tags describe the
data itself. Tags are called elements in XML. XML can be used to describe semi-structured data.
Why do we need special management techniques for semi-structured data? There is no well-defined schema, so we cannot
store the data in relational tables in an efficient way.
Dr. N. Mamoulis Advanced Database Technologies
30
Semi-structured Data
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
How do we store semi-structured data in a database? Solution 1:
Use specialized storage methods, query languages and query evaluation techniques for semi-structured data.
Solution 2: Represent XML data in relational tables,
transform queries to SQL, and use the mature relational DB technology.
Dr. N. Mamoulis Advanced Database Technologies
31
Semi-structured Data
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
How can we model semi-structured data? An XML document can be represented
as a node-labeled graph. The labels of the graph are element
tags, attribute names and values. Most documents can be represented
by trees. The edges that transform a tree to a graph come from ID references.
Dr. N. Mamoulis Advanced Database Technologies
32
Semi-structured Data
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
What types of queries apply on XML data? absolute path expressions.
Example: book/author/name/lastname/Smith simple path expressions.
Example: //author/name/lastname/Smith regular path expressions.
Example: //author//Smith expressions of complex structure.
Example: book[/title/XML][//author//Smith]
Dr. N. Mamoulis Advanced Database Technologies
33
Indexing Semi-structured Data
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
Several storage schemes and indexes have been proposed for XML queries. Some of them index the paths or
subgraphs of the XML structures. Example: the A(k)-index
Some decompose the information and flatten it into relational DB tables. Example: The method described in the
Structural Joins paper
Dr. N. Mamoulis Advanced Database Technologies
34
Indexing Semi-structured Data
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
The A(k)-index: Indexes simple paths. Useful for
simple path queries. Creates a structural summary of the
XML graphs. All the paths up to length k in the documents are preserved into the summary graph.
Based on the notion of k-bisimilarity.
Dr. N. Mamoulis Advanced Database Technologies
35
Flattening XML data to relational tables
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
The position of each element/attribute occurrence is represented as a 3-tuple (Document-id, StartPos:EndPos, LevelNum)Values (text) is encoded using (Document-id, StartPos, LevelNum): Document-id is the id of the document that
contains the element StartPos is the number of words from the
beginning of the document until the start of the element
EndPos is the number of words from the beginning of the document until the end of the element
LevelNum is the nesting depth of the element
Dr. N. Mamoulis Advanced Database Technologies
36
XML query processing
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
The query is broken into binary parent-child or ancestor descendant relationships.Example:
book[/title/XML][//author//Smith]
Broken to: book/title title/XML book//author author//Smith
book
title
XML
author
Smith
Dr. N. Mamoulis Advanced Database Technologies
37
XML query processing
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
Each binary query is executed as a join, and their results are “stitched” together to formulate the results of the whole query.Thus the “heart” of XML query processing is the algorithm that joins the elements table to retrieve the results for each individual query component.A simple, tree merge join may perform many passes to the “inner” table, one for each element in the outer table that matches the elements there. In order to avoid this a stack-tree join algorithm is proposed.
Dr. N. Mamoulis Advanced Database Technologies
38
XML query processing
<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>
How does the stack-tree join algorithm work? It keeps a stack to keep nested AList
elements which are in the same path as the current element from DList.
When a qualifying element in DList is found, all elements of AList in the stack are output.
Dr. N. Mamoulis Advanced Database Technologies
39
DBMSs on Modern Machines
Memory and disk latency are not improving as fast as CPU processing speed, memory bandwidth and disk bandwidth:
1985 1990 1995 20001
10
100
CPU(kHz)
mem.bw
disk.bw
disk.lat(ms)
mem.lat(ns)
Also random accesses become very expensive and indexes less efficient
The new bottleneck is memory access and DB operators have to be tuned
DBMSs have to be redesigned for the characteristics of the new machines
Dr. N. Mamoulis Advanced Database Technologies
40
The new bottleneck: Memory Access
Memory is now hierarchical: two levels of caching
CPUL1 cache
L2 cache
Main memory
CPU
die
L1 cache-line
L2 cache-line
Memory page
Memory-latency: the time needed to transfer 1 byte from the main memory to the L2 cache.Cache (L2) miss: if the requested data is not in cache and needs to be fetched from main memoryCache-line: The transfer unit from main memory to cache (e.g., L2 cache-line = 128 bytes)
Dr. N. Mamoulis Advanced Database Technologies
41
The new bottleneck: Memory Access
Why is there memory latency? Accessing a specific address from memory loads also
the information next to it in the cache-line (in parallel over a wide bus). A cache-line at a time is accessed from main memory.
If the requested information is already in cache (e.g., the gender of the second tuple), main memory is not accessed. Otherwise, a cache-miss occurs.
“trash” may be loaded to cache together with the useful information. Therefore data which are likely to be accessed together by a program should be placed close in main memory.
Dr. N. Mamoulis Advanced Database Technologies
42
Storage schemes
How are database relations stored in memory/disk? The Normal Storage Model stores the tuples sequentially. The Decomposed Storage Model breaks the relation into
binary tables <id,att1>, <id,att2>, ... , <id,attn>. The Partition Attributes Accross (PAX) model stores the
relation like NSM, but decomposes the information in the pages, in order to reduce the cache misses (optional reading).
What is the impact of these storage schemes in query processing
How many cache misses will occur during a selection on a relation if it stored by NSM, or DSM? What if we use unary tables?
There is a trade-off between fetching garbage to trash (NSM) and joining decomposed data (DSM).
Dr. N. Mamoulis Advanced Database Technologies
43
Performing Joins in Main Memory
Join operators (which may require multiple passes on the data) should be implemented with care in main memory.What techniques can we use to speed-up join processing?
Preprocess the data to get rid of irrelevant information to the join.
Use blocking and partitioning to reduce the problem into small problems that can be performed in cache.
Use radix-clustering to perform hashing in multiple passes that reduce the overall hash-join cost.
Dr. N. Mamoulis Advanced Database Technologies
44
Database Indexing in Main Memory
In future databases all data but few large tables will be memory-resident.Therefore is it important to build efficient main-memory indexes.These indexes should consider the hierarchical memories and the memory-access bottleneck.Indexes like the B+-tree and the R-tree are also suitable for main memory, if the node size is set to the cache-linesize.
Dr. N. Mamoulis Advanced Database Technologies
45
Database Indexing in Main Memory
What techniques are used to improve the performance of these structures in memory?
Pointer elimination. Pointers are eliminated and the children of a specific node are allocated in sequential memory blocks.
How is the size of the structure and the search cost affected if pointer elimination is used?
The capacity of each node (nearly) doubles. Thus the height of the tree becomes shorter and queries are processed faster.
Hard-wiring binary search. The binary search code is replaced by if-else statements. We may choose to leave some key-slots empty in order to reduce the depth of the search tree.
Quantization and compression (for R-trees). We replace exact MBR coordinates by relative coordinates to the MBR of the containing node. We can compress further these coordinates by approximating them in a quantized grid.
Dr. N. Mamoulis Advanced Database Technologies
46
The format of the final exam
It consists of simple problems which require that you have digested the materialThere will be problems from (almost) all topics covered in the course.
The problems do not require knowledge of the technical details.
However, you should be able to apply the simple versions of the algorithms if you are asked to for simple problems:
Example 1: I may give you an AList and a DList, and I want you to show me the steps (and stack) of the structural join algorithm (topic: semi-structured data).
Example 2: I may give you the data warehouse settings and ask you to find the views selected for materialization if the greedy algorithm is applied (topic: data warehousing).