Orthogonal Array Based Big Data Subsampling, Hongquan Xu (hqxu/stat201A/talk-bigdata2019.pdf · 2019)
TRANSCRIPT
Orthogonal Array Based Big Data
Subsampling
Hongquan Xu
University of California, Los Angeles
Joint work with
Lin Wang, Jake Kramer and Weng Kee Wong
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 1 / 20
Motivation
Data are big (and sometimes redundant)
Analyzing the full data may be computationally infeasible
Storing all of the data may not be possible
In many settings, X is big while the label (or response) Y is expensive to obtain
Big X vs. small Y:
Patients' records → performance of a new medicine
Images as visual stimuli → brain response
Linear Regression Setup
y = β0 + β1x1 + · · ·+ βpxp + ε
Question:
1. If the budget only allows k responses (labels), with k ≪ n, which k points should be labeled?
2. When all responses are available, how should a subsample of size k ≪ n be chosen to accelerate the computation?
Leverage Subsampling
A subsampling method consists of
sampling probabilities πi, i = 1, …, n, with Σ_{i=1}^n πi = 1
a weighted estimator β̂s = (Xs^T W Xs)^{-1} Xs^T W Ys, where Xs is the subsample taken from X and W = diag(w1, …, wk) is a weight matrix
1. Uniform sampling: πi = 1/n, wi = 1
2. Leveraging: πi = hii/(p + 1), wi = 1/πi, where hii = (X(X^T X)^{-1} X^T)ii
(Drineas et al., 2006; Ma et al., 2015)
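The two ingredients above can be sketched in a few lines; this is a minimal illustration of leverage subsampling under the stated setup, not the authors' implementation, and the function name `leverage_subsample` is ours.

```python
import numpy as np

def leverage_subsample(X, y, k, rng):
    """Sample k rows with probabilities proportional to leverage scores,
    then compute the weighted least-squares estimator."""
    n, d = X.shape  # d = p + 1 columns, including the intercept
    h = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)  # h_ii
    pi = h / d                       # pi_i = h_ii / (p + 1); sums to 1
    idx = rng.choice(n, size=k, replace=True, p=pi)
    w = 1.0 / pi[idx]                # weights w_i = 1 / pi_i
    Xs, ys = X[idx], y[idx]
    # weighted estimator (Xs^T W Xs)^{-1} Xs^T W ys, via lstsq on scaled rows
    sw = np.sqrt(w)
    beta_hat, *_ = np.linalg.lstsq(Xs * sw[:, None], ys * sw, rcond=None)
    return beta_hat

rng = np.random.default_rng(0)
n, p = 10000, 3
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, (n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(size=n)
print(leverage_subsample(X, y, 500, rng))  # roughly recovers beta
```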
Information-based Subsampling
The OLS estimator for a subsample Xs is β̂s = (Xs^T Xs)^{-1} Xs^T Ys, with
E(β̂s) = β, Var(β̂s) = σ²(Xs^T Xs)^{-1}.
The Fisher information matrix for β with the subdata is Ms = Xs^T Xs.
D-optimality: find the k-point subsample Xs that maximizes det(Ms)
Available approach: IBOSS (Wang H., Yang, and Stufken, 2018, JASA)
Include data points with extreme (largest and smallest) covariate values
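The IBOSS idea of keeping extreme covariate values can be sketched roughly as follows. This is a simplified reading of the approach, not Wang, Yang, and Stufken's algorithm; the function name is ours.

```python
import numpy as np

def iboss_subsample(X, k):
    """Pick k rows: for each covariate in turn, take the k/(2p) smallest
    and k/(2p) largest values among rows not yet chosen (simplified sketch)."""
    n, p = X.shape
    r = k // (2 * p)
    chosen = np.zeros(n, dtype=bool)
    for j in range(p):
        avail = np.where(~chosen)[0]
        order = avail[np.argsort(X[avail, j])]   # sort remaining rows by covariate j
        chosen[order[:r]] = True                 # r smallest values
        chosen[order[-r:]] = True                # r largest values
    return np.where(chosen)[0]

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (1000, 2))
idx = iboss_subsample(X, 100)
print(len(idx))  # 100
```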
Orthogonal Array-based Subsampling
Lemma
Suppose each covariate is scaled to [−1, 1]. For a subsample Xs of size k,
det(Ms) ≤ k^(p+1),
with equality (D-optimality) if and only if Xs forms a two-level OA with levels in {−1, 1} and strength t ≥ 2.
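A quick numeric check of the lemma with p = 2 and k = 4: the full 2² factorial is a strength-2 two-level OA, and with the intercept column it attains det(Ms) = k^(p+1) = 4³ = 64.

```python
import numpy as np

# Strength-2 two-level OA in p = 2 factors: the full 2^2 factorial
oa = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
Xs = np.column_stack([np.ones(4), oa])   # prepend the intercept column
Ms = Xs.T @ Xs                           # Fisher information Xs^T Xs = 4 * I_3
print(np.linalg.det(Ms))                 # 64.0, the upper bound k^(p+1)
```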
A Measure of “Similarity”
For x, y ∈ [−1, 1], define
δ(x, y) = 2 − (x² + y²)/2 if sign(x) = sign(y); 1 − (x² + y²)/2 otherwise.
Consequently,
1 ≤ δ(x, y) ≤ 2 if sign(x) = sign(y); 0 ≤ δ(x, y) ≤ 1 otherwise.
δ(−1, 1) = δ(1, −1) = 0
δ(−1, −1) = δ(1, 1) = 1
[Figure: values of δ(x, y) over the square [−1, 1]².]
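The definition transcribes directly to code; a minimal sketch, with the function name ours:

```python
import numpy as np

def delta(x, y):
    """Similarity measure on [-1, 1]: larger when x and y share a sign."""
    base = 1.0 - (x**2 + y**2) / 2.0
    return base + 1.0 if np.sign(x) == np.sign(y) else base

print(delta(-1, 1))     # 0.0: opposite corners are least similar
print(delta(1, 1))      # 1.0: same corner
print(delta(0.5, -0.5)) # 0.75: opposite signs, smaller magnitudes
```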
J-optimality
For two data points xi = (xi1, …, xip) and xj = (xj1, …, xjp), let
δ(xi, xj) = Σ_{l=1}^p δ(xil, xjl)
and define the J-optimality criterion (Xu 2002, Technometrics) as
J(Xs) = Σ_{1≤i<j≤k} [δ(xi, xj)]².
Theorem
For any k-point subsample Xs over [−1, 1]^p,
J(Xs) ≥ [k²p(p + 1) − 4kp²]/8,
with equality if and only if Xs forms an OA with strength 2.
Question: How to find Xs that minimizes J(Xs)?
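The bound can be checked numerically for p = 2, k = 4: the full 2² factorial (a strength-2 OA) attains equality. A minimal sketch, with function names ours:

```python
import numpy as np
from itertools import combinations

def delta(x, y):
    """Coordinatewise similarity measure on [-1, 1]."""
    base = 1.0 - (x**2 + y**2) / 2.0
    return base + 1.0 if np.sign(x) == np.sign(y) else base

def J(Xs):
    """J-optimality: sum over row pairs of squared row similarities."""
    return sum(sum(delta(a, b) for a, b in zip(xi, xj)) ** 2
               for xi, xj in combinations(Xs, 2))

Xs = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])  # strength-2 OA
k, p = Xs.shape
bound = (k**2 * p * (p + 1) - 4 * k * p**2) / 8
print(J(Xs), bound)  # 4.0 4.0: the OA attains the lower bound
```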
Algorithm: select and eliminate points simultaneously
Step 0. Given the n × p matrix X, scale each variable to [−1, 1].
Step 1. Find the point farthest from the origin, include it in Xs, and let i = 1.
Step 2. For each x ∈ X, compute the J-score J(x, Xs) = Σ_{xs ∈ Xs} δ(x, xs)².
Step 3. Find the x* ∈ X that minimizes J(x, Xs) and add x* to Xs.
Step 4. Keep the t = ⌊n/i⌋ points of X with the t smallest J(x, Xs) values; remove x* and all other points from X.
Step 5. Increase i by 1 and repeat Steps 2–4 until Xs contains k points.
Note: The complexity is O(np log k), since Σ_{i=1}^k 1/i = O(log k).
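The select-and-eliminate loop above can be sketched as follows. This is a simplified reading of the algorithm under the stated steps, not the authors' code: variable names are ours, ties are broken arbitrarily, and a small guard keeps enough candidates to finish.

```python
import numpy as np

def delta_rows(x, Xs):
    """delta(x, xs) summed over coordinates, for each row xs of Xs."""
    base = 1.0 - (x[None, :]**2 + Xs**2) / 2.0
    same = np.sign(x[None, :]) == np.sign(Xs)      # +1 where signs agree
    return (base + same).sum(axis=1)

def oa_subsample(X, k):
    n, p = X.shape
    first = int(np.argmax((X**2).sum(axis=1)))     # Step 1: farthest from origin
    sel = [first]
    cand = np.delete(np.arange(n), first)
    i = 1
    while len(sel) < k:
        # Step 2: J-score of each candidate against the current subsample
        scores = np.array([np.sum(delta_rows(X[c], X[sel])**2) for c in cand])
        order = np.argsort(scores)
        sel.append(int(cand[order[0]]))            # Step 3: smallest J-score joins Xs
        i += 1                                     # Step 5
        keep = max(k - len(sel) + 1, n // i)       # Step 4: retain ~floor(n/i) best
        cand = cand[order[1:keep + 1]]
    return np.array(sel)

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (500, 2))
idx = oa_subsample(X, 20)
print(len(idx))  # 20
```

Each round both adds a point and shrinks the candidate pool, which is what drives the O(np log k) complexity noted above.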
Algorithm (illustration)
[Figures: five successive snapshots over [−1, 1]² showing the candidate set shrinking as points are selected and eliminated.]
A Toy Simulation
Setup: x1, x2 iid ~ Unif[−1, 1], n = 1000, p = 2, k = 100, ε ~ N(0, 1)
y = 1 + 2x1 + 2x2 + ε
[Figure: three panels in (x1, x2): the full data (1000 × 2), the IBOSS subsample with k = 100, and the OA-based subsample with k = 100.]
Comparison
[Figure: boxplots of MSEs for the slope parameters over 1000 repetitions, for Unif_Sampling, Leveraging, IBOSS, and OA-based; MSE axis from 0.00 to 0.20.]
Theoretical Properties
Assume that the covariates x1, …, xp are i.i.d. uniform on [a, b], and consider the model
y = β0 + β1x1 + ··· + βpxp + ε.
Theorem
The OA-based subsample minimizes the MSE, i.e., the sum of the variances of the coefficient estimates, almost surely as n → ∞.
Model with Interactions
y = β0 + β1x1 + ··· + βpxp + β12x1x2 + ··· + β(p−1)p xp−1xp + ε
Define the J-optimality criterion as
J4(Xs) = Σ_{1≤i<j≤k} [δ(xi, xj)]⁴.
Assume that the covariates x1, …, xp are i.i.d. uniform on [a, b].
Theorem
The OA-based subsample minimizes the MSE, i.e., the sum of the variances of the coefficient estimates, almost surely as n → ∞.
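J4 differs from J only in the exponent on the pairwise similarities; a minimal sketch with function names ours, checked on the same 2² factorial example:

```python
import numpy as np
from itertools import combinations

def delta(x, y):
    """Coordinatewise similarity measure on [-1, 1]."""
    base = 1.0 - (x**2 + y**2) / 2.0
    return base + 1.0 if np.sign(x) == np.sign(y) else base

def J4(Xs):
    """Fourth-power analogue of J, for models with two-factor interactions."""
    return sum(sum(delta(a, b) for a, b in zip(xi, xj)) ** 4
               for xi, xj in combinations(Xs, 2))

Xs = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])  # strength-2 OA
print(J4(Xs))  # 4.0
```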
Simulation
Setup: x1, …, x10 iid ~ Unif[−1, 1], n = 10⁴, p = 10, k = 128, ε ~ N(0, 3²)
y = 1 + x1 + ··· + x10 + x1x2 + ··· + x9x10 + ε
[Figure: boxplots of MSEs for the slope parameters over 100 repetitions, for Unif_Sampling, Leveraging, IBOSS, and OA-based; MSE axis from 0.5 to 1.5.]
Summary
Propose a new framework for subsampling: OA-based subsampling
Computational complexity: O(np log k)
Minimizing J(Xs) for a linear model without interactions attains optimality for coefficient estimation for large n
Minimizing J4(Xs) for a linear model with interactions attains optimality for coefficient estimation for large n
More efficient than leverage sampling and IBOSS