Building Optimal Websites with the Constrained Subtree Selection Problem

Brent Heeringa (joint work with Micah Adler)

09 November 2004

A website design problem (for example: a new kitchen store)

Given products, their popularity, and their organization:

How do we create a good website?
• Navigation is natural
• Access to information is timely

[Figure: products paring, chef, bread, steak with popularities 0.26, 0.33, 0.27, 0.14, organized under Knives by Type and by Maker (Wüsthof, Henckels)]

Good website: Natural Navigation

• Organization is a DAG
• The transitive closure (TC) of the DAG enumerates all viable categorical relationships and introduces shortcuts
• A subgraph of the TC preserves the logical relationship between categories

[Figure: a DAG, its Transitive Closure (TC), and a Subgraph of TC]
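As a minimal sketch of how the transitive closure introduces shortcuts, the following computes the set of nodes reachable from each category. The kitchen-store DAG used here is hypothetical (the slides do not spell out its edges), and `transitive_closure` is an illustrative name.

```python
def transitive_closure(dag):
    """Map every node to the set of nodes reachable from it (dag must be acyclic)."""
    closure = {}

    def reach(node):
        if node not in closure:
            closure[node] = set()
            for child in dag.get(node, ()):
                # A child is reachable, and so is everything reachable from it.
                closure[node] |= {child} | reach(child)
        return closure[node]

    for node in dag:
        reach(node)
    return closure


# Hypothetical organization of the knife catalog (edges are assumptions).
dag = {
    "Knives": ["Type", "Maker"],
    "Type": ["paring", "chef", "bread", "steak"],
    "Maker": ["Wüsthof", "Henckels"],
    "Wüsthof": ["paring", "chef"],
    "Henckels": ["bread", "steak"],
}
tc = transitive_closure(dag)
```

Here `tc["Knives"]` contains `"chef"` even though the DAG has no direct Knives-to-chef edge; that extra edge is the kind of shortcut a solution subtree may use.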

Good website: Timely Access to Info

Two obstacles to finding info quickly:
• Time scanning a page for the correct link
• Time descending the DAG

Associate a cost with each obstacle:
• Page cost (function of out-degree of node)
• Path cost (sum of page costs on path)

Good access structure:
• Minimize expected path cost
• Optimal subgraph is always a full tree

[Figure: a leaf of weight 1/2 reached through a page with 3 links and a page with 2 links. Page Cost = # links; Path Cost = 3+2 = 5; Weighted Path Cost = 5/2]
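The cost definitions above can be sketched in a few lines. The tree below is a hypothetical one chosen so that one leaf reproduces the 3+2 = 5 path from the figure; the function names and the default page cost (number of links) are assumptions for illustration.

```python
from fractions import Fraction


def path_costs(children, root, page_cost=lambda links: links):
    """Return {leaf: sum of page costs on the root-to-leaf path}."""
    costs = {}
    stack = [(root, 0)]
    while stack:
        node, cost = stack.pop()
        kids = children.get(node, [])
        if not kids:
            costs[node] = cost          # a leaf: its path cost is complete
        for kid in kids:
            stack.append((kid, cost + page_cost(len(kids))))
    return costs


def expected_path_cost(children, weights, root, page_cost=lambda links: links):
    """Weighted sum of leaf path costs under the leaf distribution `weights`."""
    costs = path_costs(children, root, page_cost)
    return sum(weights[leaf] * costs[leaf] for leaf in weights)


# Hypothetical tree: the root page has 3 links, page "a" has 2 links.
children = {"root": ["a", "b", "c"], "a": ["x", "y"]}
weights = {"x": Fraction(1, 2), "y": Fraction(1, 6),
           "b": Fraction(1, 6), "c": Fraction(1, 6)}
costs = path_costs(children, "root")
total = expected_path_cost(children, weights, "root")
```

Leaf `x` has path cost 5 and weighted path cost 5/2, matching the figure; the other weights are made up to complete a distribution.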

Constrained Subtree Selection (CSS)

An instance of CSS is a triple (G, γ, w):
• G is a rooted DAG with n leaves (constraint graph)
• γ is a function of the out-degree of each internal node (degree cost)
• w is a probability distribution over the n leaves (weights)

A solution is any directed subtree of the transitive closure of G which includes the root and leaves

An optimal solution is one which minimizes the expected path cost

[Figure: a constraint graph with leaf weights 1/4, 1/4, 1/4, 1/4 and γ(x)=x]

[Figure: a solution for leaf weights 1/4, 1/4, 1/4, 1/4 with γ(x)=x. Weighted path costs: 3(1/4), 5(1/4), 5(1/4), 3(1/4). Cost: 4]

[Figure: a solution for leaf weights 1/2, 1/6, 1/6, 1/6 with γ(x)=x. Cost: 3 1/2]

Constraint-Free Graphs and k-favorability

Constraint-Free Graph:
• Every directed, full tree with n leaves is a subtree of the TC
• CSS is no longer constrained by the graph

k-favorable degree cost:
• Fix γ. There exists k>1 such that any constraint-free instance of CSS under γ has an optimal tree with maximal out-degree k

Linear Degree Cost - γ(x)=x

• 5 paths w/ cost 5

• 3 paths w/ cost 5
• 2 paths w/ cost 4

• Prefer binary structure when a leaf has at least half the mass [Figure: a leaf with mass > 1/2]
• Prefer ternary structure when mass is uniformly distributed

CSS with 2-favorable degree costs and constraint-free (C.F.) graphs is the Huffman coding problem
Examples: quadratic, exp, ceiling of log
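To illustrate the Huffman connection, here is a minimal sketch. It assumes a 2-favorable degree cost, so an optimal tree can be taken to be binary and every page then costs γ(2). Under that assumption the expected path cost is γ(2) times the expected leaf depth, which the greedy Huffman merge minimizes. The function names are illustrative, not from the talk.

```python
import heapq
from fractions import Fraction


def huffman_expected_depth(weights):
    """Expected leaf depth of a Huffman tree (sum of all merged weights)."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b              # each merge pushes its leaves one level deeper
        heapq.heappush(heap, a + b)
    return total


def binary_css_cost(weights, gamma):
    """Cost of the Huffman tree when every internal node has out-degree 2."""
    return gamma(2) * huffman_expected_depth(weights)


w = [Fraction(1, 2), Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)]
depth = huffman_expected_depth(w)
quadratic = binary_css_cost(w, lambda x: x * x)
```

With the quadratic degree cost named on the slide, each binary page costs 4, so the cost is four times the expected depth.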

Results

Complexity:
• NP-Complete for equal weights and many γ
• Sufficient condition on γ
• Hardness depends on constraint graph

Highlighted Results:
• Theorem: O(n^(γ(k)+k))-time DP algorithm when γ is integer-valued and k-favorable and G is constraint-free (for example, γ(x)=x)
• Theorem: poly-time constant-approximation when γ ≥ 1 and k-favorable and G has constant out-degree
• Approximate Hotlink Assignment - [Kranakis et al.]

Other results:
• Characterizations of optimal trees for uniform probability distributions

Related Work

Adaptive Websites [Perkowitz & Etzioni]
• Challenge to the AI community
• Novel views of websites: page synthesis problem

Hotlink Assignment [Kranakis, Krizanc, Shende, et al.]
• Add 1 hotlink per page to minimize expected distance from root to leaves
• Recently: pages have cost proportional to their size
• Hotlinks don’t change page cost

Optimal Prefix-Free Codes [Golin & Rote]
• Min code for n words with r symbols where symbol a_i has cost c_i
• Resembles CSS without a constraint graph

Exact Cover by 3-Sets

INPUT: (X, C) where X = (x_1, …, x_n), n = 3k, and C = (C_1, …, C_m) with C_i ⊆ X
OUTPUT: C’ ⊆ C where |C’| = k and C’ covers X
QUESTION: Given K and (X, C), is there a cover of size K?
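For intuition about the problem used in the hardness reduction, a brute-force check of Exact Cover by 3-Sets can be sketched as follows. This is exponential-time illustration only (the problem is NP-complete), and the function name and sample instance are assumptions.

```python
from itertools import combinations


def exact_cover_by_3_sets(X, C):
    """Return k = |X|/3 pairwise-disjoint sets from C that cover X, or None."""
    k, remainder = divmod(len(X), 3)
    if remainder:
        return None
    universe = set(X)
    for choice in combinations(C, k):
        covered = set().union(*choice)
        # k sets whose sizes sum to |X| and whose union is X must be disjoint.
        if covered == universe and sum(len(s) for s in choice) == len(universe):
            return list(choice)
    return None


X = [1, 2, 3, 4, 5, 6]
C = [{1, 2, 3}, {3, 4, 5}, {4, 5, 6}, {1, 2, 6}]
cover = exact_cover_by_3_sets(X, C)
no_cover = exact_cover_by_3_sets(X, [{1, 2, 3}, {3, 4, 5}])
```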

Sufficient condition on γ:

For every integer k, there exists an integer s(k) such that

Lopsided Trees

Recall: γ(x)=x, and G is constraint-free

Node level = path cost

Adding an edge increases level

Grow lopsided trees level by level

Lopsided Trees

We know the exact cost of the tree up to the current level i:
• Exact cost of m leaves
• Remaining n-m leaves must have path cost at least i

Lopsided Trees

Exact cost of C: 3 • (1/3)=1

Remaining mass up to level 4: (2/3) • 4 = 8/3

Total: 1+8/3=11/3

Lopsided Trees

Tree cost at Level 5 in terms of tree cost at Level 4: add in the mass of the remaining leaves

Cost at Level 5 (no new leaves): 11/3 + 2/3 = 13/3
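The level-by-level accounting above can be sketched directly: leaves already placed contribute their exact cost, and the remaining mass is charged the current level. The function name and argument layout are assumptions; the numbers reproduce the slide's example, where leaf C has weight 1/3 at level 3.

```python
from fractions import Fraction


def truncation_cost(placed, total_mass, level):
    """Cost of a tree truncated at `level`.

    placed: list of (leaf_level, weight) for leaves at or above `level`.
    """
    exact = sum(leaf_level * weight for leaf_level, weight in placed)
    remaining = total_mass - sum(weight for _, weight in placed)
    return exact + level * remaining


placed = [(3, Fraction(1, 3))]                 # leaf C from the example
cost_4 = truncation_cost(placed, 1, 4)
cost_5 = truncation_cost(placed, 1, 5)
```

Moving from level 4 to level 5 with no new leaves adds exactly the remaining mass 2/3, which is the increment shown on the slide.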

Lopsided Trees

Equality on trees:
• Equal number of leaves at or above frontier
• Equal number of leaves at each relative level below frontier

Nodes have out-degree ≤ 3
Nodes are at most γ(3) levels below the frontier
(m; l1, l2, l3) = signature

Example Signature: (2; 3, 2, 0)
• 2: C and F are leaves
• 3: G, H, I are 1 level past the frontier
• 2: J and K are 2 levels past the frontier
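A signature can be computed from node levels alone, as this small sketch shows. The frontier level and the exact levels of C and F are hypothetical (the slide only says they are leaves at or above the frontier); the function name is illustrative.

```python
def signature(frontier, leaf_levels, node_levels, depth=3):
    """Return (m, l1, ..., l_depth) for a tree truncated at `frontier`.

    leaf_levels: levels of the leaves at or above the frontier.
    node_levels: levels of the nodes that lie below the frontier.
    """
    m = sum(1 for level in leaf_levels if level <= frontier)
    below = [0] * depth
    for level in node_levels:
        if frontier < level <= frontier + depth:
            below[level - frontier - 1] += 1
    return (m, *below)


# Assumed layout: frontier at level 4, C at level 3, F at level 4,
# G, H, I one level past the frontier, J and K two levels past it.
sig = signature(4, leaf_levels=[3, 4], node_levels=[5, 5, 5, 6, 6])
```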

Inductive Definition

Let CSS(m,l1,l2,l3) = min cost tree with sig. (m; l1, l2, l3)

Can we define CSS(m,l1,l2,l3) in terms of optimal substructures?

Which trees, when grown by one level, have signature (m; l1, l2, l3)?

Which signatures (m’; l’1, l’2, l’3) lead to (m; l1, l2, l3)?

[Figure: a tree with Sig: (0; 2, 0, 0) grown by one level into a tree with Sig: (1; 0, 0, 3)]

Growing a tree only affects the frontier
Only l1 affects the next level
• Choose leaves
• The remaining nodes are internal
• Choose degree-2 (d2)
• Remaining nodes are degree-3 (d3)

O(n^2) choices
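The growth step can be sketched as an enumeration of successor signatures. This assumes γ(x)=x with out-degrees 2 and 3, so the children of a degree-d node land d levels below it; the pruning rule (placed leaves plus pending nodes may not exceed n) and the function name are assumptions for illustration.

```python
def successors(sig, n):
    """All signatures reachable by growing `sig` one level, for n leaves."""
    m, l1, l2, l3 = sig
    grown = set()
    for leaves in range(l1 + 1):              # nodes at the next level that become leaves
        for d2 in range(l1 - leaves + 1):     # internal nodes of degree 2
            d3 = l1 - leaves - d2             # the rest are internal nodes of degree 3
            nxt = (m + leaves, l2, l3 + 2 * d2, 3 * d3)
            if sum(nxt) <= n:                 # every pending node needs at least one leaf
                grown.add(nxt)
    return grown


options = successors((0, 2, 0, 0), n=4)
```

The pair (leaves, d2) ranges over O(n^2) values, which is the count of choices stated above.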

The other direction

The original question (warning: here be symbols)

Which (m’; l’1, l’2, l’3) → (m; l1, l2, l3)?
• l’1 and d2 are sufficient
• l’1 and d2 are both O(n)
• O(n^2) possibilities for (m’; l’1, l’2, l’3)

CSS(m,l1,l2,l3) = min cost tree with sig. (m; l1, l2, l3)
= min { CSS(m’,l’1,l’2,l’3) + c_m’ } for 1≤d2≤l’1≤n
(c_m’ is the sum of the smallest n-m’ weights)

CSS(n,0,0,0) = cost of optimal tree

Analysis:
• Table size = O(n^4)
• Each cell takes O(n^2) lookups
• O(n^6) algorithm
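A minimal sketch of this dynamic program for γ(x)=x follows. It is not the talk's table-filling order: instead of iterating over a fixed ordering of the O(n^4) table, it propagates signatures forward one level at a time in a dictionary. It assumes out-degrees 2 and 3 only, and that heavier leaves are placed at shallower levels; `css_linear_cost` is an illustrative name.

```python
from fractions import Fraction


def css_linear_cost(weights):
    """Minimum expected path cost for gamma(x) = x on a constraint-free graph."""
    w = sorted(weights, reverse=True)       # heavier leaves sit at shallower levels
    n = len(w)
    if n < 2:
        raise ValueError("need at least two leaves")
    # suffix[m] = total weight of the n - m lightest leaves (those not yet placed)
    suffix = [sum(w[m:]) for m in range(n + 1)]
    # Level-0 truncations: the root has degree 2 (children at level 2) or 3 (level 3).
    current = {sig: 0 for sig in [(0, 0, 2, 0), (0, 0, 0, 3)] if sum(sig) <= n}
    best = None
    while current:
        grown = {}
        for (m, l1, l2, l3), cost in current.items():
            new_cost = cost + suffix[m]     # every unplaced leaf moves one level deeper
            for leaves in range(l1 + 1):
                for d2 in range(l1 - leaves + 1):
                    d3 = l1 - leaves - d2
                    sig = (m + leaves, l2, l3 + 2 * d2, 3 * d3)
                    pending = sig[1] + sig[2] + sig[3]
                    if sig[0] + pending > n:
                        continue            # would need more than n leaves
                    if pending == 0:
                        if sig[0] == n and (best is None or new_cost < best):
                            best = new_cost
                        continue            # finished tree, or a dead end
                    if sig not in grown or new_cost < grown[sig]:
                        grown[sig] = new_cost
        current = grown
    return best


uniform = css_linear_cost([Fraction(1, 4)] * 4)
skewed = css_linear_cost([Fraction(1, 2)] + [Fraction(1, 6)] * 3)
```

On the two four-leaf examples from the CSS definition slides this returns 4 and 3 1/2 respectively.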

Lower Bound on Cost

Lemma: H(w)/log(k) is a lower bound on the cost of an optimal tree
• For any k-favorable degree cost γ, with γ ≥ 1
• G is constraint-free

c(T) ≥ c’(T) ≥ c’(T’) ≥ H(w)/log(k) (Shannon)

[Figure: trees T and T’ with unit costs]
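The bound in the lemma is easy to evaluate numerically. The sketch below assumes base-2 logarithms for both the entropy and log(k) (the ratio is the same in any base) and uses the uniform four-leaf example with k=3; the function name is illustrative.

```python
import math


def entropy_lower_bound(weights, k):
    """H(w) / log(k): a lower bound on the optimal cost when gamma >= 1."""
    entropy = -sum(w * math.log2(w) for w in weights if w > 0)
    return entropy / math.log2(k)


bound = entropy_lower_bound([0.25, 0.25, 0.25, 0.25], k=3)
```

For four equally likely leaves the entropy is 2 bits, so the bound is 2/log2(3), comfortably below the cost of 4 found for γ(x)=x.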

A Simple Lemma

Lemma 2: For any tree with m weighted nodes there exists 1 node (splitter) which, when removed, divides the tree into subtrees with at most half the weight of the original tree.

[Figure: removing the splitter leaves subtrees each with weight < 1/2]

Approximation Algorithm

Let G be a DAG where the out-degree of every node is at most d
Choose a spanning tree T from G

Balance-Tree(T):
• Find a splitter node in T (Lemma 2)
• Stop if splitter is child of root
• Disconnect the splitter and reconnect it to the root (root has degree at most d+1)
• Call Balance-Tree on all subtrees

[Figure: the splitter] Mass of each subtree is at most half of the whole tree
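The first step of Balance-Tree, finding a splitter, can be sketched as a walk from the root toward the heaviest subtree. This covers only the splitter search of Lemma 2, not the reconnection and recursion; the data layout (a `children` dict and a `weight` dict) and the function name are assumptions.

```python
def find_splitter(children, weight, root):
    """Return a node whose removal leaves pieces of at most half the total weight."""
    # Visit parents before children, then accumulate subtree weights bottom-up.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    subtree = {}
    for node in reversed(order):
        subtree[node] = weight.get(node, 0) + sum(
            subtree[child] for child in children.get(node, []))
    total = subtree[root]
    node = root
    while True:
        heavy = [c for c in children.get(node, []) if 2 * subtree[c] > total]
        if not heavy:
            return node          # no child subtree exceeds half the weight
        node = heavy[0]          # at most one child can hold more than half


children = {"r": ["a", "b"], "a": ["x", "y"]}
balanced = find_splitter(children, {"x": 3, "y": 3, "b": 2}, "r")
skewed = find_splitter(children, {"x": 5, "y": 1, "b": 2}, "r")
```

When the walk stops at a node, each child subtree holds at most half the weight, and the part of the tree outside that node holds less than half because the walk only steps into subtrees holding more than half.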

Approximation Algorithm

Analysis:
• Mass under any node is at most half of the mass under its grandparent
• Path length to leaf with weight w_i is at most -2 log(w_i)

Theorem: O(m)-time O(log(k) γ(d+1))-approx to optimal solution
• For any DAG G with m nodes and out-degree d
• For every k-favorable degree cost γ ≥ 1

[Figure labels on the approximation ratio: Upper Bound on Node Cost; Weighted Path Length]

Proposed Problem 1 (CSS in constraint-free graphs, equal leaf weights)

Question: Polytime algorithm for CSS with:
• Constraint-free graphs
• Equal leaf weights
• Increasing degree cost

Good News:
• Characterizations for linear and log degree costs
• Near linear time algorithms for r-ary Varn Codes (Huffman codes with r unequal letter costs, uniform probability distribution)

Varn Codes (infinite lopsided tree)

[Figure: infinite lopsided tree, 5 Leaves, Symbol Costs = (3,3,3,8,8)] Note: Not the 5 highest leaves!

[Figure: infinite lopsided tree, 6 Leaves, Symbol Costs = (3,3,3,8,8)] Note: m internal nodes are the highest m nodes in the infinite tree

Proposed Problem 1 (CSS in constraint-free graphs, equal leaf weights)

Bad News:
• No notion of an infinite lopsided tree in CSS
• Degree change = structure change

Optimal CSS tree is fairly balanced
Property: No leaf may appear above the level of any other internal node
Proof: If it were the case, we could switch branches and decrease the cost of the tree

Intuition: There is some k which optimizes the breadth-to-depth tradeoff. The optimal tree repeats this structure. The fringe requires some computation time.

Proposed Problem 2 (Dynamic CSS)

CSS often applies to environments which are inherently dynamic
• Web pages change popularity
• Access patterns change on file systems

Question: Given a CSS tree with property P, how much time does it take to maintain P after an update?

P = minimum cost, approximation-ratio of min cost

Restrict attention to
• Integer leaf weights (rational distributions)
• Unit updates

Proposed Problem 2 (Dynamic CSS)

Good News: Knuth (and later Vitter) studied Dynamic Huffman Codes (DHC)
Motivation: One-pass encoding
Protocol:
• Both parties maintain the optimal tree for the first t characters
• Encode and decode character t+1
• Update tree

Optimality of the tree is maintained in time proportional to encoding

DHC: Sibling Property

A binary tree with n leaves is a Huffman tree iff:
• The n leaves have nonnegative weights w_1…w_n
• The weight of each internal node is the sum of the weights of its children
• The nodes can be numbered in non-decreasing order by weight such that siblings are numbered consecutively and their common parent has a higher number

[Figure: Huffman tree with leaves A to F and numbered nodes]

Numbering corresponds to merging in greedy algorithm
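The link between the numbering and the greedy merge order can be sketched as follows: each merge pops the two lightest nodes and gives them the next two consecutive numbers, and the parent is numbered later. The leaf weights below are hypothetical (the figure's values are not recoverable here), and `huffman_numbering` is an illustrative name.

```python
import heapq
from itertools import count


def huffman_numbering(weights):
    """Return [(number, weight, name)] with nodes numbered in greedy merge order."""
    tick = count()                      # tie-breaker so equal weights compare cleanly
    heap = [(w, next(tick), name) for name, w in weights.items()]
    heapq.heapify(heap)
    numbered = []
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        for w, _, name in (a, b):       # siblings receive consecutive numbers
            numbered.append((len(numbered) + 1, w, name))
        parent = (a[0] + b[0], next(tick), "(" + a[2] + b[2] + ")")
        heapq.heappush(heap, parent)    # the parent is numbered after its children
    w, _, name = heap[0]
    numbered.append((len(numbered) + 1, w, name))
    return numbered


# Hypothetical leaf weights for six leaves.
leaves = {"A": 2, "B": 3, "C": 4, "D": 5, "E": 6, "F": 11}
numbering = huffman_numbering(leaves)
node_weights = [w for _, w, _ in numbering]
```

With six leaves there are 11 nodes, the listed weights are non-decreasing, and nodes 2j-1 and 2j are always siblings, which is the sibling property.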

[Figure: the same tree with the weight of B increased]

What happens if we increase B?

Node 4 violates the Sibling Property

[Figure: the Huffman tree before the update]

Before updating: Exchange current node with node with highest number having the same weight

[Figure: the tree after the exchange]

Different, but still optimal, greedy choice when merging nodes

[Figure: the tree after the exchanges]

Now, safe to increase B, because it can’t be greater than the next highest!

Proposed Problem 2 (Dynamic CSS)

Good News: DHC generalizes to k-ary alphabets

Claim: DHC is an O(γ(k))-approximation for CSS when
• γ is k-favorable and γ(x) ≥ 1
• graphs are constraint-free

Proposed Problem 2 (Dynamic CSS)

Bad News:
• DHC doesn’t generalize to Huffman codes with unequal letter costs
• Sibling property = Greedy algorithm

Future:
• Explore DHC for unequal letter costs
• Maintain approximation ratio in constant degree graphs in time proportional to the height (We can do it in linear time already)

Proposed Problem 3 (Category Tree - CT)

Scenario:
• Large reservoir of songs in iTunes
• Song is a vector of categorical values
• Common to search all the songs for the right one

Question: Can we organize the songs by categories so that the average search time is minimized?

Proposed Problem 3 (Category Tree - CT)

Category Tree: CT(γ,C,S)
• γ is the degree cost
• C=(d_1,…,d_m) are the m category sizes
• S is a set of objects drawn from C

Solution: Rooted, oriented tree
• Internal nodes are categories
• Edges are appropriate categorical values
• Leaves are objects

Optimal solution: Minimize expected path cost
• Path cost is defined as in CSS
• Optimal solution corresponds to an adaptive ordering of the categories

Proposed Problem 3 (Constrained Category Tree - CCT)

Constrained Category Tree: CCT(γ,C,S)
• γ is the degree cost
• C=(d_1,…,d_m) are the m category sizes
• S is a set of objects drawn from C

Solution: Rooted, oriented tree
• Internal nodes are categories (and internal nodes at the same depth have the same category)
• Edges are appropriate categorical values
• Leaves are objects

Optimal solution: Minimize expected path cost
• Path cost is defined as in CSS
• Optimal solution corresponds to a fixed ordering of the categories

Proposed Problem 3 (Category Tree - CT)

CT and CCT are classical Decision Tree problems

Decision Tree (DT):
• Input: m binary tests T=(T_1…T_m) and n objects O=(O_1…O_n)
• Output: Binary tree where internal nodes are T_i and leaves are O_i
• Measure: Total external path length

CT and CCT are NP-Complete
• Reduction from Exact Cover by 3-Sets (XC3)
• Resembles hardness proof for Decision Tree

Proposed Problem 3 (Category Tree - CT)

Decision Tree Inference (DTI):
• Input: m examples: T/F labeled binary strings from {0,1}^n
• Output: Binary tree, consistent with the examples, where internal nodes are string positions and leaves are TRUE or FALSE
• Measure: Number of leaves (i.e. size of tree)

CT and CCT are not instances of DTI
DT doesn’t easily reduce to DTI
Most complexity results (lower bounds on approximations) are for DTI only!

Timetable

Solve some subset of open problems

1-2 academic years

Open Problems

Theorem: There is an O(n^(γ(k)+k))-time algorithm for any instance (G,γ,w) of CSS where G is constraint free, γ is k-favorable, and γ maps the positive integers to the positive integers and is non-decreasing

Proof: c(T) ≥ c’(T) ≥ c’(T’) ≥ H(w)/log(k)
• T is the optimal tree for CSS with cost c
• T’ is the optimal tree for OPC with cost c’ for k symbols each with weight 1 (i.e. γ(x)=1)
• H is entropy

NO

Signatures as Representation

Different lopsided trees share common substructure when truncated

Level-i-Truncations: Include a node iff its parent is at level at most i
Level-i-Signatures: [m; l_1, .., l_γ(k)]
• m is the # of leaves ≤ level i
• l_j is the # of nodes at level i+j

Cost of Level-i-Truncation:
• Exact cost for m leaves
• Cost up to the truncation for the remaining n-m leaves

The Dynamic Programming Table

Signatures = Table entries
• MIN[m; l_1, .., l_γ(k)] gives the min-cost of all truncated trees with signature [m; l_1, .., l_γ(k)]
• O(n^(γ(k)+1)) entries

A level-i truncation is the parent of O(n^(k-1)) level-(i+1) truncations
A level-i signature is the parent of O(n^(k-1)) level-(i+1) signatures
• Choose how many nodes at the next level will be internal
• Among those, choose how many will be degree 2, degree 3, …, degree k
• O(n^(k-1)) choices

Consistent ordering of entries
O(n^(γ(k)+k)) algorithm; MIN[n;0,…,0] contains the minimum cost

Set of products
• The desired information, e.g., chef & paring knives

Popularity of products
• Weights

Hierarchical organization of products into categories
• Single, global category (the root)
• Products are endpoints (leaves)
• General to specific trajectory

Adaptive Websites [Perkowitz & Etzioni]
• Page synthesis (novel view) with clustering and concept learning using access logs
• Efficiently find topic of interest (effort)

Hotlink Assignment [Kranakis, Krizanc, Shende, et al.]
• Add k hotlinks per page to minimize expected distance from root to leaves
• Recently: pages have fixed cost proportional to their size
• Hotlinks don’t change path-cost

Optimal Prefix-Free Codes [Golin & Rote]
• Min code for n words with r symbols where symbol a_i has cost c_i
• Resembles CSS without a constraint graph

Lopsided Trees

[m; l_1, .., l_γ(k)] = MIN[m, l_1, .., l_γ(k)]
• n leaves so at most O(n^(γ(k)+1)) entries
• Entry stores the minimum cost of a tree bearing that signature
• Total ordering on signatures, consistent with the growing process
• O(n^(k-1)) choices
• O(n^(γ(k)+k)) algorithm

Lopsided Trees

Tree cost at Level 5 in terms of Tree cost at Level 4:

Cost at Level 5: 11/3+2/3=13/3

Cost at Level 6: 13/3+1/2=29/6

The original question (warning: here be symbols)

Which (m’,l’1,l’2,l’3) → (m,l1,l2,l3)?

Suppose we know
• l’1 (the # of nodes one level below the frontier)
• d2 (the # of l’1 which are degree-2 nodes in (m,l1,l2,l3))

Let’s determine the values of the remaining variables

[Figure: the l’1 nodes one level below the frontier, of which d2 nodes have degree 2]

m = m’ + l’1 - d2 - d3
• m: the new number of leaves
• m’: the old number of leaves
• l’1: nodes at one level below the frontier
• d2: internal nodes of degree 2
• d3: internal nodes of degree 3

m = m’ + l’1 - d2 - l3/3
• l3/3: internal nodes of degree 3 (each degree-3 node contributes 3 nodes to l3, so d3 = l3/3)

l’2 = l1
• l’2: the old number of nodes at 2 levels below the frontier
• l1: new nodes one level below the frontier

l2 = l’3 + 2d2
• l2: the new number of nodes 2 levels below the frontier
• l’3: the old number of nodes 3 levels below the frontier
• 2d2: d2 nodes are binary so they contribute 2d2 to the frontier
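Putting the relations together, the predecessor signature can be recovered from the new signature plus the two free quantities l’1 and d2. The sketch below assumes γ(x)=x with out-degrees 2 and 3 and takes the old l’3 to be the new l2 minus the 2d2 nodes contributed by binary nodes; the function name and validity checks are illustrative.

```python
def predecessor(sig, l1_old, d2):
    """Old signature that grows into `sig`, or None if the choice is infeasible."""
    m, l1, l2, l3 = sig
    if l3 % 3:
        return None                     # l3 must consist of triples of children
    d3 = l3 // 3                        # internal nodes of degree 3
    m_old = m - l1_old + d2 + d3        # from m = m' + l'1 - d2 - d3
    l2_old = l1                         # old level-2 nodes are now one level below
    l3_old = l2 - 2 * d2                # remove the children of the binary nodes
    if m_old < 0 or l3_old < 0 or d2 + d3 > l1_old:
        return None
    return (m_old, l1_old, l2_old, l3_old)


before = predecessor((1, 0, 0, 3), l1_old=2, d2=0)
```

For the signatures shown in the growth figure, (1; 0, 0, 3) with l’1 = 2 and d2 = 0 maps back to (0; 2, 0, 0).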

Organized Data

Premise: People organize data so it is easy to find
• Natural navigation
• Popular items are easily accessible

Observation: Most existing data could be better organized
• Files clutter folders; directory structures lose consistency
• Web pages are buried deep in the website
• Searching takes too much time

Question: How can we automatically improve access to organized information?

Thesis Goals: Models for information organization tasks
• Novel deliberation cost
• Computational complexity
• Algorithms and approximations

Outline

Prior Work: Constrained Subtree Selection
• Definitions: k-favorable, constraint-free
• Related work
• Polytime DP algorithm for restricted case
• Other results

Proposed Future Work:
• Dynamic CSS
• Algorithms for open CSS problems
• Category Tree: A decision tree problem