csis7101: course summary

Dr. N. Mamoulis Advanced Database Technologies


Spatial data

Spatiotemporal data

Multimedia and time-series data

Data mining I (association rules and sequence patterns)

Data mining II (clustering and classification)

Data warehousing and OLAP

Strings and biological data

Semi-structured and XML data

Storage and query processing on modern machines

Cache-conscious indexes



Spatial Data: The R-tree and the R*-tree

What are they? Dynamic, balanced trees that index Minimum Bounding Rectangles (MBRs).

Each entry in a directory node is an <MBR,ptr> pair. Each entry in a data node is an <MBR,oid> pair.

How are they constructed/updated? They use special insert/split algorithms. The entry MBRs in directory nodes should have (i) minimal margins (more “square-like”), (ii) minimal overlap (good search behavior), (iii) minimal area (small “dead space”).



Spatial Data (cont’d): Spatial Joins Using R-trees

How is the spatial join performed? By synchronous traversal of the two trees, recursively following the directory node entries that overlap.

How is the computational cost minimized? With the space restriction technique and the plane sweep heuristic.

How is the I/O cost minimized? Using some ordering and pinning techniques (optional reading).



Spatial Data (cont’d): Nearest Neighbor Search Using R-trees

Which is the optimal algorithm for NN search? The INN method in the “distance browsing” paper.

How does it work? It uses a priority queue (heap) that organizes <dist, ptr> pairs according to their (smallest) distance from the query object.

Initially, all entries of the root are placed in the queue. At each step, the pair with the minimum distance is retrieved:

If it is an object, it is output.

If it is an object MBR, the actual object is fetched and inserted in the queue.

If it is an entry at a directory level, the entries in the node pointed to by it are loaded and inserted in the queue.
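The queue-driven search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node layout (a dict with "leaf" and "entries"), the function names and the sample tree are hypothetical, and objects are plain points, so an object's MBR already gives its exact distance and the separate "fetch the actual object" step is skipped.

```python
import heapq
import itertools
import math

def mindist(point, mbr):
    """Smallest distance from a 2D point to a rectangle (xlo, ylo, xhi, yhi)."""
    x, y = point
    xlo, ylo, xhi, yhi = mbr
    dx = max(xlo - x, 0.0, x - xhi)
    dy = max(ylo - y, 0.0, y - yhi)
    return math.hypot(dx, dy)

def incremental_nn(root, query):
    """Yield (distance, oid) pairs in increasing distance from `query`.

    Assumed node layout: {"leaf": bool, "entries": [(mbr, child_or_oid), ...]}.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = []

    def push_entries(node):
        for mbr, target in node["entries"]:
            item = (mindist(query, mbr), next(tie), node["leaf"], target)
            heapq.heappush(heap, item)

    push_entries(root)
    while heap:
        dist, _, is_object, target = heapq.heappop(heap)
        if is_object:
            yield dist, target      # closest remaining item is an object
        else:
            push_entries(target)    # directory entry: expand the node

# Hypothetical two-level tree with four point objects.
leaf1 = {"leaf": True, "entries": [((1, 1, 1, 1), "a"), ((2, 3, 2, 3), "b")]}
leaf2 = {"leaf": True, "entries": [((8, 8, 8, 8), "c"), ((9, 6, 9, 6), "d")]}
root = {"leaf": False, "entries": [((1, 1, 2, 3), leaf1), ((8, 6, 9, 8), leaf2)]}
order = [oid for _, oid in incremental_nn(root, (0, 0))]
```

Because the function is a generator, the caller can stop after the first result (1-NN) or keep pulling neighbors one by one; the second leaf is only expanded when its MBR reaches the front of the queue.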



Spatiotemporal Data

Two types of problems:

1. Indexing the current positions and movements of objects and querying their anticipated future positions.

2. Indexing and querying the past movements of mobile objects.

Papers for problem 1: On Indexing Mobile Objects; Indexing the Positions of Continuously Moving Objects.

Paper for problem 2: MV3R-Tree (optional reading).



Spatiotemporal Data (cont’d)

Indexing current/future locations of mobile objects (problem 1):

Dual transformation: Transform the line trajectories to a dual representation and index the resulting points. How is the 1D problem solved in the dual plane?

The TPR-tree: Like the R-tree, but the MBRs are time-parameterized to conservative bounding intervals (CBI).

How are the CBI computed?

What is the best way to group objects into a CBI? By minimizing an objective function (e.g., overlap) over the time the TPR-tree is valid.

How do we answer queries using the TPR-tree?
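One common way to obtain a conservative bounding interval in one dimension is to let the lower bound start at the smallest position and move with the smallest velocity, and the upper bound start at the largest position and move with the largest velocity. The sketch below assumes linear motion, a reference time t = 0 and queries for t >= 0; the function names and the sample objects are hypothetical.

```python
def conservative_interval(objects):
    """objects: list of (position at t=0, velocity) pairs in 1D.

    Returns (lo, v_lo, hi, v_hi); at time t >= 0 the interval is
    [lo + v_lo * t, hi + v_hi * t] and contains every object.
    """
    lo = min(p for p, _ in objects)
    hi = max(p for p, _ in objects)
    v_lo = min(v for _, v in objects)
    v_hi = max(v for _, v in objects)
    return lo, v_lo, hi, v_hi

def interval_at(cbi, t):
    """Evaluate the time-parameterized interval at time t >= 0."""
    lo, v_lo, hi, v_hi = cbi
    return lo + v_lo * t, hi + v_hi * t

moving = [(0, 1), (5, -1), (3, 2)]   # hypothetical (position, velocity) pairs
cbi = conservative_interval(moving)
bounds_at_2 = interval_at(cbi, 2)
```

The interval never shrinks over time, which is why it is conservative rather than minimal: it stays valid without being recomputed, at the price of growing dead space.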



Multimedia and time-series data

When is Nearest Neighbor Search Meaningful? Consider a query object q, with distance dist(q, NN) from its nearest neighbor NN and with distance dist(q, FN) from its farthest neighbor FN.

If dist(q,NN)/dist(q,FN) converges to 1 as the dimensionality increases, then NN search is not meaningful.

In other words, if the distance of the nearest neighbor is not statistically different from the distance of the farthest neighbor, then NN search is meaningless.

If the data form clusters in the high-dimensional space, then NN search is meaningful. Also, if the data actually lie in a (hidden) embedded space that is low-dimensional, then NN search is meaningful.
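The ratio dist(q,NN)/dist(q,FN) is easy to measure empirically. The following sketch assumes uniformly distributed random points, which is the unclustered case where NN search degrades; the function name, the sample sizes and the seed are arbitrary choices for illustration.

```python
import math
import random

def nn_fn_ratio(dim, n_points=200, seed=0):
    """Return dist(q, NN) / dist(q, FN) for uniform random data in `dim` dimensions."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return min(dists) / max(dists)

low_dim_ratio = nn_fn_ratio(2)
high_dim_ratio = nn_fn_ratio(200)
```

For uniform data the ratio in 200 dimensions comes out much closer to 1 than in 2 dimensions, which is the effect the slide describes. Clustered data, or data lying in a low-dimensional embedded space, would not show it to the same degree.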




What is FASTMAP? A technique to map objects from a high-dimensional (unknown) space to points in a low-dimensional space, so that the original distance between any pair of objects is preserved as much as possible after the transformation.

How does FASTMAP work? It picks the furthest pair of objects (Oa and Ob), called pivots, one pair per target dimension, and the coordinate of each object in that dimension is defined by its projection on the line segment OaOb.
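The projection on the pivot line needs only pairwise distances, via the cosine law. The sketch below shows the coordinate for a single dimension; the function name is hypothetical, and the pivots are assumed to be already chosen (the full method also has to select pivots and adjust distances for the following dimensions, which is omitted here).

```python
import math

def fastmap_coordinate(dist, a, b, obj):
    """Coordinate of `obj` on the line through pivots a and b (cosine law).

    `dist` is any distance function; only distances are used, no coordinates.
    """
    d_ab = dist(a, b)
    return (dist(a, obj) ** 2 + d_ab ** 2 - dist(b, obj) ** 2) / (2 * d_ab)

# Euclidean points are used only to make the result easy to check by hand.
pivot_a, pivot_b = (0, 0), (4, 0)
x = fastmap_coordinate(math.dist, pivot_a, pivot_b, (1, 3))
```

With the pivots on the x-axis, the computed coordinate is the x-coordinate of the object, as expected from a projection.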




Indexing Time-Series Data (optional)

The similarity between two time-series is computed using an expensive dynamic time-warping (DTW) technique.

In order to index the data and reduce the similarity search cost, each series is approximated by an envelope that bounds it and, at the same time, makes DTW applicable to it.

The lower and upper bounds of the envelope are in turn approximated by MBRs and indexed using a high-dimensional index (e.g., R-tree).

An algorithm like INN is applied on the R-tree to quickly filter out, in a branch-and-bound way, series that cannot be similar to the query time-series.



Data Mining

Data mining topics covered:

Mining association rules: If A and B are bought, then C is also bought.

Mining frequent sequence patterns: If, in a sequence of event transmissions, B occurs shortly after A, then C will appear soon.

Classification: Given a training sample, create a classifier that predicts the class labels of unclassified data.

Clustering: Partition a dataset into k clusters (of unknown class label), such that the distance between tuples (objects) in the same cluster is small, and the distance between tuples in different clusters is large.



Mining Association Rules

First find frequent itemsets, then create association rules (how?).

How does the Apriori Algorithm work? It finds frequent itemsets level-by-level:

First, frequent items are found (i.e., frequent sets with only one item).

Frequent itemsets of level l are joined to generate candidate itemsets of level l+1.

Candidates of level l+1 with a subset not appearing in the frequent itemsets of level l are pruned. The frequency of the remaining ones is counted by scanning the database.
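The level-wise steps above (join, prune, count) can be sketched as follows. This is a compact illustration rather than an optimized implementation: the function name, the use of an absolute support count as the threshold, and the toy transactions are assumptions, and candidates are counted with a naive scan.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(level)
    k = 1
    while level:
        # Join: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Count: one scan of the database for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(level)
        k += 1
    return frequent

baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(baskets, min_support=3)
```

In this toy database every item and every pair is frequent at support 3, while {A, B, C} appears in only two baskets and is rejected in the counting step.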



Mining Association Rules

How does the FP-tree method work?

First, it finds the items with support > s and sorts them in decreasing order of support.

Then the FP-tree is constructed for these items by considering the tuples found in the database as paths of the tree. In addition, the nodes of the tree with the same label (corresponding to the same item) are linked using a linked list.

Then the frequent itemsets are found from the FP-tree, for each item individually, by accessing the paths containing the item.



Clustering

How does the basic k-means clustering algorithm work?

1. First the objects are distributed (randomly) to k arbitrary clusters.

2. The center of each cluster is computed (by averaging the values in each dimension).

3. Reallocate each object to the cluster whose center is closest to it.

4. Repeat steps 2-3 until the clusters do not change.
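The four steps map directly onto a short loop. The sketch below assumes Euclidean distance and points given as tuples; the function name, the seed, the iteration cap and the handling of an empty cluster (re-seeding it with a random point) are illustrative choices that the slide does not specify.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]        # step 1: random clusters
    centers = []
    for _ in range(max_iter):
        centers = []
        for c in range(k):                             # step 2: compute centers
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:                            # assumption: re-seed empty cluster
                members = [rng.choice(points)]
            centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]                 # step 3: reallocate objects
        if new_labels == labels:                       # step 4: stop when stable
            break
        labels = new_labels
    return labels, centers

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
```

On two well-separated groups like these, the loop settles after a few iterations with each group in its own cluster. On harder data the result depends on the random start, so the algorithm is usually run several times.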



Clustering

How do k-medoids clustering algorithms work?

1. They initially pick k medoids (representatives) at random.

2. Each object is assigned to the cluster defined by the nearest medoid.

3. Find the medoid-object pair such that, if they are swapped (the object takes the place of the medoid), the benefit is maximized, and perform the swap.

4. Repeat steps 2-3 until there is no change.



Hierarchical Clustering

What is hierarchical clustering?

Bottom-up or agglomerative: Initially each object is a cluster. Then the closest clusters are merged iteratively until only k clusters remain.

Top-down or divisive: Initially all objects are in one cluster. Then, at each step, the cluster with the best split is split, until k clusters remain.



CURE

How does CURE work?

It is an agglomerative algorithm that creates the clusters bottom-up.

It is medoid-based, but each cluster has more than one representative.

The representatives are scattered, thus (i) non-spherical clusters can be discovered, (ii) weakly linked clusters are not merged (as a density-based method would do), (iii) outliers are not included in the clusters.

Random sampling is used to avoid the excessive O(n^2) cost of applying the algorithm on the whole dataset.



PROCLUS

What is PROCLUS? An algorithm that discovers clusters in a subspace of the original high-dimensional space.

Why is PROCLUS useful? In many cases, clustering (and nearest-neighbor search) is not meaningful in a high-dimensional space, because there are no natural clusters there.

However, clusters could be found in lower-dimensional subspaces, since not all dimensions are relevant to a cluster.



PROCLUS

How does PROCLUS work?

1. Use a greedy method to find a set M of more than k medoids.

2. Pick k medoids at random.

3. Approximate the optimal set of dimensions for each medoid.

4. Assign points to clusters and evaluate the cluster goodness.

5. Replace “bad” medoids with random ones from M.

6. Repeat steps 3-5 until only good enough medoids remain.

7. Approximate the optimal set of dimensions for each medoid and assign the points to clusters.



Data Warehousing and OLAP

What is a data warehouse? A large collection of heterogeneous data which have been cleaned, integrated and consolidated. The warehouse contains old, historical data which are useful for data analysis tasks.

What is On-Line Analytical Processing? Data analysis on data warehouses and decision making.

What is the data cube? It models all combinations of multidimensional aggregate views of the warehouse.

How are the multidimensional views related? A lattice models the partial order of the views and shows if a view can be computed from another.


Data Warehousing and OLAP

What defines the hierarchical relationship between two views?

View A can be computed from view B if the dimensions of A are a subset of the dimensions of B. Example: We can compute the cuboid (product, city) from the cuboid (product, city, time).

View A can be computed from view B if each dimension in A is at the same or higher hierarchical conceptual level than the corresponding dimension in B. Example: We can compute the cuboid (product, country) from the cuboid (product, city).
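Both rules can be combined into a single test: every dimension of A must be present in B at the same or a finer level. The sketch below is an illustration only; the HIERARCHY table (levels listed from finest to coarsest), the representation of a view as a dict from dimension to level, and the function name are hypothetical.

```python
# Hypothetical concept hierarchies, listed from the finest to the coarsest level.
HIERARCHY = {
    "product": ["product"],
    "location": ["city", "country"],
    "time": ["day", "month", "year"],
}

def derivable(view_a, view_b):
    """True if view_a can be computed from view_b.

    A view maps each of its dimensions to the level at which it is grouped.
    """
    for dim, level_a in view_a.items():
        if dim not in view_b:
            return False              # B has already aggregated this dimension away
        levels = HIERARCHY[dim]
        if levels.index(view_b[dim]) > levels.index(level_a):
            return False              # B is coarser than A on this dimension
    return True

product_city = {"product": "product", "location": "city"}
product_city_time = {"product": "product", "location": "city", "time": "day"}
product_country = {"product": "product", "location": "country"}
```

Here `derivable(product_city, product_city_time)` and `derivable(product_country, product_city)` both hold, matching the two examples on the slide, while the reverse directions do not.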


View Selection in Data Warehouses

What is the view selection problem? If we compute and materialize the whole cube, we can efficiently answer every OLAP query. However:

The size of the cube can be very large, thus we may not be able to store all views due to space constraints.

When the warehouse is updated, we may not have enough time to update all materialized views (maintenance cost constraint).

Therefore, given space and update time constraints, we have to choose which set of views to materialize based on their benefit in answering queries.



Two approaches in view selection:

Static view selection: Given specific query statistics, space and time constraints, select the best views to materialize (paper: Implementing Data Cubes Efficiently).

Dynamic view materialization: The set of materialized views is determined dynamically, based on (i) the frequency with which they are retrieved, (ii) their benefit in answering other queries, (iii) their space, (iv) their maintenance cost (paper: DynaMat).



Static view selection: How are the views selected? Using a greedy algorithm.

Dynamic view materialization: How are the views replaced in the buffer pool? Using a replacement policy that considers the benefit of a materialized result dynamically.

Which views are maintained during an update? Those which can be maintained fast and provide a larger benefit.
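For static view selection, a greedy pass can be sketched as follows. The sketch makes several simplifying assumptions that are not stated on the slide: the constraint is a fixed number k of views to pick, the cost of answering a view is the size of the smallest materialized view it can be computed from, views are identified by their set of dimensions (no hierarchies), and the base cuboid is always materialized. The sizes and names are invented for illustration.

```python
def greedy_select(sizes, k, top):
    """Pick k views greedily by benefit.

    sizes: {view (frozenset of dimensions): number of rows}
    top:   the base cuboid, assumed to be materialized already
    """
    def derivable(w, v):
        return w <= v                      # w's dimensions are a subset of v's

    selected = [top]

    def cost(w):
        # Cheapest already materialized view that can answer w.
        return min(sizes[v] for v in selected if derivable(w, v))

    def benefit(v):
        # Total saving over all views that v could answer more cheaply.
        return sum(max(0, cost(w) - sizes[v]) for w in sizes if derivable(w, v))

    for _ in range(k):
        candidates = [v for v in sizes if v not in selected]
        if not candidates:
            break
        selected.append(max(candidates, key=benefit))
    return selected[1:]

# Hypothetical lattice over dimensions p (product) and c (city).
sizes = {frozenset("pc"): 100, frozenset("p"): 50, frozenset("c"): 20, frozenset(): 1}
chosen = greedy_select(sizes, k=2, top=frozenset("pc"))
```

In the first round the view on c has benefit (100 - 20) * 2 = 160, against 100 for p and 99 for the empty view, so c is picked first. In the second round p saves 50 and the empty view only 19, so p follows.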


Strings and Biological Data

What is approximate string matching? Given a text t, a query q and a distance threshold k, find all substrings in t whose distance from q is at most k.

What distance function is used? The edit distance, which counts the minimum number of operations (insert, delete, replace) to transform one string to the other.

How is the problem solved exactly? Using an (expensive) dynamic programming algorithm.




How does the dynamic programming algorithm work?

Problem: find the edit distance between strings x and y. Create a (|x|+1)×(|y|+1) matrix C, where $C_{i,j}$ represents the minimum number of operations to match $x_{1..i}$ with $y_{1..j}$. The matrix is constructed as follows:

$$C_{i,0} = i, \qquad C_{0,j} = j,$$

$$C_{i,j} = \begin{cases} C_{i-1,j-1} & \text{if } x_i = y_j, \\ 1 + \min(C_{i-1,j},\ C_{i,j-1},\ C_{i-1,j-1}) & \text{otherwise.} \end{cases}$$

How is it adapted for substring matching? Here x is the query q and y is the text t. The difference is that we set $C_{0,j} = 0$ for all j, since any text position could be the potential start of a match. If the distance bound is k, we report all positions j where $C_{m,j} \le k$ (m is the last row, m = |q|).
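The recurrence and its substring variant differ only in how the first row is initialized, so one function can serve both. The following is a straightforward sketch; the function names and the flag `substring` are illustrative, and reported positions are 1-based end positions of matches in the text.

```python
def edit_distance_matrix(x, y, substring=False):
    """Fill the DP matrix C; with substring=True the first row is all zeros."""
    rows, cols = len(x) + 1, len(y) + 1
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        C[i][0] = i
    for j in range(cols):
        C[0][j] = 0 if substring else j
    for i in range(1, rows):
        for j in range(1, cols):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return C

def edit_distance(x, y):
    """Edit distance between the whole strings x and y."""
    return edit_distance_matrix(x, y)[-1][-1]

def approximate_matches(q, t, k):
    """1-based text positions where a match of q with at most k errors ends."""
    last_row = edit_distance_matrix(q, t, substring=True)[-1]
    return [j for j in range(1, len(t) + 1) if last_row[j] <= k]

d = edit_distance("kitten", "sitting")
ends = approximate_matches("abc", "xxabcxx", 0)
```

The cost is proportional to |q|·|t| matrix cells, which is why the filtering techniques discussed next try to avoid running it over the whole text.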




What can we do to reduce the cost of DP? We can use cheap filtering techniques that avoid applying DP at each position of t.

They do not provide solutions, but aim to reduce the search effort. They are based on the idea that a text area is much more likely not to match a query than to match it.

They work in two steps. First, a cheap heuristic is used to determine whether an area of the text t could match the query q. If not, search is abandoned for this area; otherwise, an expensive search algorithm (i.e., dynamic programming) is applied.




Other ways to perform substring matching

Build a suffix-tree for t and use it for approximate and exact queries.

If t is large the suffix tree cannot be built in memory.

We can solve the problem by splitting the possible substrings into groups and building a suffix tree in memory for each of them.

Paper: A Database Index to Large Biological Sequences (optional reading)




Other ways to perform substring matching

All substrings are transformed to high-dimensional points and indexed. A query is then applied on the indexes using a lower bound of the edit distance.

Given an alphabet Σ = {α1,α2,...,αd}, we can transform a string to a d-dimensional point, called frequency vector, which stores the global frequency of each character in the substring.

The idea is extended to also capture the local frequencies of the characters in the substring, using wavelet transforms.
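A frequency vector and one simple lower bound on the edit distance can be sketched as follows. The bound used here rests on an assumption spelled out in the comments: each insert or delete changes one character count by one, and each replace lowers one count and raises another, so at least max(surplus, deficit) operations are needed. The function names are hypothetical, and the wavelet extension for local frequencies is not covered.

```python
from collections import Counter

def frequency_vector(s, alphabet):
    """Number of occurrences of each alphabet character in s."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]

def frequency_distance(u, v):
    """Lower bound on the edit distance of two strings with vectors u and v.

    One edit operation can remove at most one surplus character and add at
    most one missing character, so max(surplus, deficit) operations are needed.
    """
    surplus = sum(a - b for a, b in zip(u, v) if a > b)
    deficit = sum(b - a for a, b in zip(u, v) if b > a)
    return max(surplus, deficit)

dna = "ACGT"
vec = frequency_vector("GATTACA", dna)
bound = frequency_distance(frequency_vector("ACGT", dna),
                           frequency_vector("AAAA", dna))
```

Because the bound never exceeds the true edit distance, any substring whose bound is already above the threshold k can be discarded without running dynamic programming on it.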



Semi-structured Data

<A> <B>xxx</B> <C> <D>yyy</D> <D>zzz</D> </C></A>

What are semi-structured data? In various application domains, the data are semi-structured; the database schema is loosely defined.

What is XML? A language like HTML, where tags describe the data itself. Tags are called elements in XML. XML can be used to describe semi-structured data.

Why do we need special management techniques for semi-structured data? There is no well-defined schema, so we cannot store the data in relational tables in an efficient way.





How do we store semi-structured data in a database?

Solution 1: Use specialized storage methods, query languages and query evaluation techniques for semi-structured data.

Solution 2: Represent XML data in relational tables, transform queries to SQL, and use the mature relational DB technology.





How can we model semi-structured data?

An XML document can be represented as a node-labeled graph.

The labels of the graph are element tags, attribute names and values.

Most documents can be represented by trees. The edges that transform a tree to a graph come from ID references.





What types of queries apply on XML data?

Absolute path expressions. Example: book/author/name/lastname/Smith

Simple path expressions. Example: //author/name/lastname/Smith

Regular path expressions. Example: //author//Smith

Expressions of complex structure. Example: book[/title/XML][//author//Smith]



Indexing Semi-structured Data


Several storage schemes and indexes have been proposed for XML queries.

Some of them index the paths or subgraphs of the XML structures. Example: the A(k)-index.

Some decompose the information and flatten it into relational DB tables. Example: the method described in the Structural Joins paper.



Indexing Semi-structured Data


The A(k)-index: Indexes simple paths. Useful for simple path queries.

Creates a structural summary of the XML graphs. All the paths up to length k in the documents are preserved in the summary graph.

Based on the notion of k-bisimilarity.



Flattening XML data to relational tables


The position of each element/attribute occurrence is represented as a 3-tuple (Document-id, StartPos:EndPos, LevelNum). Values (text) are encoded using (Document-id, StartPos, LevelNum):

Document-id is the id of the document that contains the element.

StartPos is the number of words from the beginning of the document until the start of the element.

EndPos is the number of words from the beginning of the document until the end of the element.

LevelNum is the nesting depth of the element.
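With this encoding, structural relationships reduce to comparisons on the tuples: an ancestor's interval strictly contains the descendant's interval, and a parent is an ancestor exactly one level up. The sketch below illustrates this; the class and function names are hypothetical, and the positions assigned to the sample elements A, B, C and D are invented, not derived by counting words.

```python
from typing import NamedTuple

class Element(NamedTuple):
    doc_id: int
    start: int
    end: int
    level: int

def is_ancestor(a, d):
    """a contains d: same document and a's interval strictly encloses d's."""
    return a.doc_id == d.doc_id and a.start < d.start and d.end < a.end

def is_parent(a, d):
    """a is an ancestor of d and d sits exactly one nesting level below a."""
    return is_ancestor(a, d) and d.level == a.level + 1

# Hypothetical positions for <A><B>..</B><C><D>..</D><D>..</D></C></A>.
A = Element(1, 1, 14, 1)
B = Element(1, 2, 4, 2)
C = Element(1, 5, 13, 2)
D1 = Element(1, 6, 8, 3)
```

For example, `is_ancestor(A, D1)` holds but `is_parent(A, D1)` does not, while `is_parent(C, D1)` holds. These are the predicates that the join algorithms on the element tables evaluate.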



XML query processing


The query is broken into binary parent-child or ancestor-descendant relationships. Example:

book[/title/XML][//author//Smith]

Broken to: book/title, title/XML, book//author, author//Smith

[Figure: query tree with nodes book, title, XML, author, Smith.]





Each binary query is executed as a join, and their results are “stitched” together to form the results of the whole query.

Thus the “heart” of XML query processing is the algorithm that joins the elements table to retrieve the results for each individual query component.

A simple tree-merge join may perform many passes over the “inner” table, one for each element in the outer table that matches elements there. In order to avoid this, a stack-tree join algorithm is proposed.





How does the stack-tree join algorithm work? It maintains a stack of nested AList elements which are on the same path as the current element from DList.

When a qualifying element in DList is found, it is paired with every AList element in the stack and the pairs are output.
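The stack-based join can be sketched for the ancestor-descendant case as follows. This is a simplified illustration under stated assumptions: both lists belong to a single document, each element is a (StartPos, EndPos) pair, both lists are sorted by StartPos, and the output is ordered by descendant. The function name and the sample lists are hypothetical.

```python
def stack_tree_desc(alist, dlist):
    """Return (ancestor, descendant) pairs with a single pass over both lists."""
    out, stack = [], []
    i = j = 0
    while j < len(dlist):
        d = dlist[j]
        a = alist[i] if i < len(alist) else None
        top_end = stack[-1][1] if stack else None
        if stack and top_end < d[0] and (a is None or top_end < a[0]):
            stack.pop()                       # top ends before both next elements
        elif a is not None and a[0] < d[0]:
            stack.append(a)                   # a may contain d: keep it on the stack
            i += 1
        else:
            out.extend((s, d) for s in stack) # every stacked element contains d
            j += 1
    return out

ancestors = [(1, 20), (2, 9), (10, 19)]       # (2,9) and (10,19) are nested in (1,20)
descendants = [(3, 4), (11, 12), (21, 22)]
pairs = stack_tree_desc(ancestors, descendants)
```

Tracing the stack on this input: (1,20) and (2,9) are pushed before (3,4) is reported with both; (2,9) is popped and (10,19) pushed before (11,12) is reported; both remaining entries are popped before (21,22), which has no ancestor. Each list is read once, which is the advantage over the tree-merge join.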



DBMSs on Modern Machines

Memory and disk latency are not improving as fast as CPU processing speed, memory bandwidth and disk bandwidth:

[Figure: plot over the years 1985-2000 (vertical axis 1 to 100) of CPU (kHz), mem.bw, disk.bw, disk.lat (ms) and mem.lat (ns).]

Also, random accesses become very expensive and indexes less efficient.

The new bottleneck is memory access, and DB operators have to be tuned accordingly.

DBMSs have to be redesigned for the characteristics of the new machines.



The new bottleneck: Memory Access

Memory is now hierarchical: two levels of caching

[Figure: memory hierarchy with CPU, L1 cache, L2 cache and main memory (labels: CPU die, L1 cache-line, L2 cache-line, memory page).]

Memory-latency: the time needed to transfer 1 byte from the main memory to the L2 cache.

Cache (L2) miss: occurs if the requested data is not in cache and needs to be fetched from main memory.

Cache-line: the transfer unit from main memory to cache (e.g., L2 cache-line = 128 bytes).



The new bottleneck: Memory Access

Why is there memory latency? Accessing a specific address from memory also loads the information next to it in the cache-line (in parallel over a wide bus). One cache-line at a time is accessed from main memory.

If the requested information is already in cache (e.g., the gender of the second tuple), main memory is not accessed. Otherwise, a cache-miss occurs.

“Trash” may be loaded to cache together with the useful information. Therefore, data which are likely to be accessed together by a program should be placed close together in main memory.



Storage schemes

How are database relations stored in memory/disk?

The N-ary Storage Model (NSM) stores the tuples sequentially.

The Decomposed Storage Model (DSM) breaks the relation into binary tables <id,att1>, <id,att2>, ... , <id,attn>.

The Partition Attributes Across (PAX) model stores the relation like NSM, but decomposes the information within each page, in order to reduce the cache misses (optional reading).

What is the impact of these storage schemes on query processing?

How many cache misses will occur during a selection on a relation if it is stored by NSM or by DSM? What if we use unary tables?

There is a trade-off between fetching “trash” into the cache (NSM) and joining decomposed data (DSM).



Performing Joins in Main Memory

Join operators (which may require multiple passes on the data) should be implemented with care in main memory.

What techniques can we use to speed up join processing?

Preprocess the data to get rid of information that is irrelevant to the join.

Use blocking and partitioning to reduce the problem into small problems that can be performed in cache.

Use radix-clustering to perform hashing in multiple passes that reduce the overall hash-join cost.
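The multi-pass idea behind radix-clustering can be illustrated with plain Python lists. In the sketch below the keys are assumed to be integer hash values already, each pass splits every existing cluster on the next few bits (highest bits first), and `bits_per_pass` is a hypothetical parameter. The point of several passes is that each pass writes to only a few clusters at a time; the sketch shows the clustering logic, not the cache behavior.

```python
def radix_cluster(keys, bits_per_pass):
    """Cluster integer keys on their lowest sum(bits_per_pass) bits.

    One pass per entry of bits_per_pass; returns 2**sum(bits_per_pass)
    clusters, where cluster c holds the keys whose low bits equal c.
    """
    shift = sum(bits_per_pass)
    clusters = [list(keys)]
    for bits in bits_per_pass:
        shift -= bits                       # position of this pass's bit group
        mask = (1 << bits) - 1
        next_clusters = []
        for cluster in clusters:
            buckets = [[] for _ in range(1 << bits)]
            for key in cluster:
                buckets[(key >> shift) & mask].append(key)
            next_clusters.extend(buckets)   # sub-clusters stay contiguous
        clusters = next_clusters
    return clusters

two_pass = radix_cluster(range(16), [1, 2])   # 2 clusters, then 4 within each
one_pass = radix_cluster(range(16), [3])      # 8 clusters in a single pass
```

Both calls produce the same eight clusters; the two-pass version never fans out to more than four clusters at once. In a hash join, both relations would be clustered the same way, and matching clusters, which are small enough to fit in cache, joined pairwise.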



Database Indexing in Main Memory

In future databases, all data except a few large tables will be memory-resident.

Therefore, it is important to build efficient main-memory indexes.

These indexes should consider the hierarchical memories and the memory-access bottleneck.

Indexes like the B+-tree and the R-tree are also suitable for main memory, if the node size is set to the cache-line size.



Database Indexing in Main Memory

What techniques are used to improve the performance of these structures in memory?

Pointer elimination. Pointers are eliminated and the children of a specific node are allocated in sequential memory blocks.

How is the size of the structure and the search cost affected if pointer elimination is used?

The capacity of each node (nearly) doubles. Thus the height of the tree becomes shorter and queries are processed faster.

Hard-wiring binary search. The binary search code is replaced by if-else statements. We may choose to leave some key-slots empty in order to reduce the depth of the search tree.

Quantization and compression (for R-trees). We replace exact MBR coordinates by relative coordinates to the MBR of the containing node. We can compress further these coordinates by approximating them in a quantized grid.



The format of the final exam

It consists of simple problems which require that you have digested the material.

There will be problems from (almost) all topics covered in the course.

The problems do not require knowledge of the technical details.

However, you should be able to apply the simple versions of the algorithms to simple problems, if asked:

Example 1: I may give you an AList and a DList, and I want you to show me the steps (and stack) of the structural join algorithm (topic: semi-structured data).

Example 2: I may give you the data warehouse settings and ask you to find the views selected for materialization if the greedy algorithm is applied (topic: data warehousing).
