

Elements of Statistical Learning
Solutions to the Exercises
Yu Zhang, sjtuzy@gmail.com
November 25, 2009
Exercise 2.6 Consider a regression problem with inputs x_i and outputs y_i, and a parameterized model f_θ(x) to be fit by least squares. Show that if there are observations with tied or identical values of x, then the fit can be obtained from a reduced weighted least squares problem.
Proof For known heteroskedasticity (e.g., grouped data with known group sizes), weighted least squares (WLS) yields efficient unbiased estimates. Figure 1 illustrates observations with tied or identical values of x; Section 2.7.1 of the textbook also discusses this situation.

Figure 1: Observations with tied or identical values of x.
If there are multiple observation pairs (x_i, y_{il}), l = 1, 2, . . . , N_i, at each value of x_i, the least squares criterion can be rewritten as follows:

argmin_θ Σ_i Σ_{l=1}^{N_i} (f_θ(x_i) - y_{il})^2 = argmin_θ Σ_i ( N_i f_θ(x_i)^2 - 2 f_θ(x_i) Σ_{l=1}^{N_i} y_{il} + Σ_{l=1}^{N_i} y_{il}^2 )
  = argmin_θ Σ_i N_i { (f_θ(x_i) - ȳ_i)^2 + constant }
  = argmin_θ Σ_i N_i (f_θ(x_i) - ȳ_i)^2,        (1)

where ȳ_i = (1/N_i) Σ_{l=1}^{N_i} y_{il}. This is a reduced weighted least squares problem on the group means ȳ_i with weights N_i.
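The reduction is easy to check numerically. Below is a minimal sketch (not part of the original solution) for a hypothetical linear model f_θ(x) = a + b x and synthetic grouped data: ordinary least squares on all observations gives the same coefficients as weighted least squares on the group means ȳ_i with weights N_i.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grouped data: each distinct x value x_i is observed N_i times.
x_unique = np.array([10.0, 15.0, 20.0, 25.0])
N_i = np.array([3, 5, 2, 4])
x_all = np.repeat(x_unique, N_i)
y_all = 2.0 + 0.3 * x_all + rng.normal(scale=0.5, size=x_all.size)

# Ordinary least squares on all N = sum(N_i) observations.
X_all = np.column_stack([np.ones_like(x_all), x_all])
beta_full, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

# Reduced weighted least squares on the group means ybar_i with weights N_i,
# i.e. minimize sum_i N_i * (a + b*x_i - ybar_i)^2 via sqrt(N_i)-scaled rows.
y_bar = np.array([y_all[x_all == xg].mean() for xg in x_unique])
X_red = np.column_stack([np.ones_like(x_unique), x_unique])
w = np.sqrt(N_i)
beta_wls, *_ = np.linalg.lstsq(w[:, None] * X_red, w * y_bar, rcond=None)

print(np.allclose(beta_full, beta_wls))  # True: the two fits coincide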
Exercise 3.19 Show that ‖β̂^ridge‖ increases as its tuning parameter λ → 0. Does the same property hold for the lasso and partial least squares estimates? For the latter, consider the tuning parameter to be the successive steps in the algorithm.
Proof Let λ_2 < λ_1 and let β_1, β_2 be the corresponding optimal solutions. Denote the loss function by

f_λ(β) = ‖Y - Xβ‖_2^2 + λ‖β‖_2^2.

Since β_1 minimizes f_{λ_1} and β_2 minimizes f_{λ_2}, we have

f_{λ_1}(β_2) + f_{λ_2}(β_1) ≥ f_{λ_2}(β_2) + f_{λ_1}(β_1)
λ_1‖β_2‖_2^2 + λ_2‖β_1‖_2^2 ≥ λ_2‖β_2‖_2^2 + λ_1‖β_1‖_2^2
(λ_1 - λ_2)‖β_2‖_2^2 ≥ (λ_1 - λ_2)‖β_1‖_2^2
‖β_2‖_2^2 ≥ ‖β_1‖_2^2.

So ‖β̂^ridge‖ increases as its tuning parameter λ → 0.
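A quick numerical illustration of this monotonicity (a sketch with synthetic data, using the closed-form ridge estimate β̂_λ = (X⊤X + λI)^{-1}X⊤y):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=50)

def ridge_norm(lam):
    # Closed-form ridge estimate: beta = (X'X + lam*I)^{-1} X'y.
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return np.linalg.norm(beta)

lambdas = [100.0, 10.0, 1.0, 0.1, 0.0]                 # decreasing lambda
norms = [ridge_norm(lam) for lam in lambdas]
print(all(a <= b for a, b in zip(norms, norms[1:])))   # True: the norm grows as lambda shrinks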
Similarly, in the lasso case ‖β̂‖_1 increases as λ → 0, but this does not guarantee that the ℓ_2-norm increases. Figure 2 gives a direct view of this property.

Figure 2: Lasso.
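A corresponding sketch for the lasso (using scikit-learn's Lasso solver, which is an assumption of this example; the penalty parameter is called alpha there): the ℓ_1 norm of the coefficients does not decrease as the penalty shrinks, while the ℓ_2 norm need not be monotone.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + rng.normal(size=100)

for alpha in [1.0, 0.5, 0.1, 0.05, 0.01]:        # decreasing penalty
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    # l1 norm is non-decreasing along this path; the l2 norm may dip.
    print(alpha, np.abs(coef).sum(), np.linalg.norm(coef))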
In the partial least squares case, it can be shown that the PLS algorithm is equivalent to the conjugate gradient (CG) method. This is a procedure that iteratively computes approximate solutions of Aβ = b by minimizing the quadratic function

φ(β) = (1/2) β⊤Aβ - b⊤β

along directions that are A-orthogonal (Ex. 3.18). The approximate solution obtained after m steps is equal to the PLS estimate obtained after m iterations (and after p steps it coincides with the least squares solution). The canonical algorithm can be written as follows:
1. Initialization: β_0 = 0, d_0 = r_0 = b - Aβ_0 = b
2. a_i = (d_i^H r_i) / (d_i^H A d_i)
3. β_{i+1} = β_i + a_i d_i
4. r_{i+1} = b - Aβ_{i+1} (= r_i - a_i A d_i)
5. b_i = -(r_{i+1}^H A d_i) / (d_i^H A d_i)
6. d_{i+1} = r_{i+1} + b_i d_i
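The six steps translate directly into code. Below is a minimal sketch (assuming a real symmetric positive definite A, so the Hermitian transpose reduces to the ordinary transpose); applied to the normal equations A = X⊤X, b = X⊤y it produces the PLS/CG iterates, and the recorded norms ‖β_i‖ illustrate the monotone growth proved below.

import numpy as np

def conjugate_gradient(A, b, n_steps):
    """Canonical CG for A beta = b, with A symmetric positive definite."""
    beta = np.zeros_like(b)
    r = b - A @ beta                      # step 1: d_0 = r_0 = b
    d = r.copy()
    norms = []
    for _ in range(n_steps):
        Ad = A @ d
        a_i = (d @ r) / (d @ Ad)          # step 2
        beta = beta + a_i * d             # step 3
        r_new = r - a_i * Ad              # step 4
        b_i = -(r_new @ Ad) / (d @ Ad)    # step 5
        d = r_new + b_i * d               # step 6
        r = r_new
        norms.append(np.linalg.norm(beta))
    return beta, norms

# Example on a least squares problem (the PLS connection): A = X'X, b = X'y.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 6))
y = rng.normal(size=60)
beta, norms = conjugate_gradient(X.T @ X, X.T @ y, n_steps=6)
print(norms)                                                      # non-decreasing sequence
print(np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0]))    # after p steps: the OLS fit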
The squared norm of β_{j+1} = β_j + a_j d_j can be written as

‖β_{j+1}‖^2 = ‖β_j‖^2 + a_j^2 ‖d_j‖^2 + 2 a_j d_j^H β_j.        (2)

We need only show that a_j d_j^H β_j > 0.

∵ r_{j+1}^H d_{j+1} = r_{j+1}^H r_{j+1} + b_j r_{j+1}^H d_j and ⟨r_{j+1}, d_j⟩ = 0        (3)
∴ a_j = (d_j^H r_j) / (d_j^H A d_j) = ‖r_j‖^2 / ‖d_j‖_A^2 > 0   (A is positive definite; ‖d_j‖_A^2 = d_j^H A d_j)        (4)
∵ β_j = Σ_{i=0}^{j-1} a_i d_i        (5)
∴ d_j^H β_j = Σ_{i=0}^{j-1} a_i d_j^H d_i        (6)

Now we need to show that d_j^H d_i > 0 for i ≠ j. By Step 6, we have d_j = r_j + Σ_{i=0}^{j-1} ( Π_{k=i}^{j-1} b_k ) r_i.

∵ ⟨r_i, r_j⟩ = 0 for i ≠ j        (7)
∴ b_k > 0 ⟹ d_j^H d_i > 0        (8)
∵ A d_i = a_i^{-1} (r_i - r_{i+1})        (9)
∴ b_j = -(r_{j+1}^H (r_j - r_{j+1})) / (a_j ‖d_j‖_A^2) = ‖r_{j+1}‖^2 / (a_j ‖d_j‖_A^2) > 0        (10)

So in the PLS case, ‖β̂‖_2 increases with m.
Exercise 4.1 Show how to solve the generalized eigenvalue problem max a⊤Ba subject to a⊤Wa = 1 by transforming it to a standard eigenvalue problem.
Proof Since W is the common (within-class) covariance matrix, W is positive semi-definite. W.l.o.g. assume W is positive definite, so we can write

W = P^2        (11)

with P symmetric positive definite. Let b = Pa; then the objective becomes

a⊤Ba = b⊤P^{-1}BP^{-1}b = b⊤B*b,        (12)

where B* = P^{-1}BP^{-1}, and the constraint a⊤Wa = 1 becomes b⊤b = 1. The problem is thus transformed into a standard eigenvalue problem for B*.
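A sketch of this transformation in code (synthetic symmetric B and positive definite W; P is the symmetric square root of W obtained from an eigendecomposition):

import numpy as np

rng = np.random.default_rng(4)
p = 5
B = rng.normal(size=(p, p)); B = B @ B.T                   # synthetic symmetric B
W = rng.normal(size=(p, p)); W = W @ W.T + p * np.eye(p)   # synthetic positive definite W

# W = P^2 via the symmetric square root of W.
evals, U = np.linalg.eigh(W)
P_inv = U @ np.diag(1.0 / np.sqrt(evals)) @ U.T

# Standard eigenvalue problem for B* = P^{-1} B P^{-1}; take the top unit eigenvector b.
B_star = P_inv @ B @ P_inv
vals, vecs = np.linalg.eigh(B_star)
b = vecs[:, -1]                  # b'b = 1
a = P_inv @ b                    # back-transform: a = P^{-1} b, so a'Wa = b'b = 1

print(np.isclose(a @ W @ a, 1.0), np.isclose(a @ B @ a, vals[-1]))  # True True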
Exercise 4.2 Suppose we have features x ∈ R^p, a two-class response, with class sizes N_1, N_2, and the target coded as -N/N_1, N/N_2.
(d) Show that this result holds for any (distinct) coding of the two classes.
W.l.o.g. consider any distinct coding y_1, y_2 of the two classes, and let θ = (β, β_0)⊤. Setting the partial derivatives of RSS(θ) to zero gives

∂RSS(θ)/∂β = -2 Σ_{i=1}^{N} (y_i - β_0 - β⊤x_i) x_i = 0        (13)
∂RSS(θ)/∂β_0 = -2 Σ_{i=1}^{N} (y_i - β_0 - β⊤x_i) = 0        (14)
It yields

( Σ_i x_i x_i⊤ ) β = Σ_i y_i x_i - N β_0 x̄        (15)
β_0 = (1/N) Σ_i (y_i - β⊤x_i) = ȳ - β⊤x̄        (16)

Substituting (16) into (15), we get

( Σ_i (x_i - x̄) x_i⊤ ) β = Σ_i (y_i - ȳ) x_i.        (17)

With the coding y_1, y_2 the right-hand side of (17) becomes

Σ_i (y_i - ȳ) x_i = N_1 y_1 μ̂_1 + N_2 y_2 μ̂_2 - (N_1 μ̂_1 + N_2 μ̂_2) (N_1 y_1 + N_2 y_2) / (N_1 + N_2)        (18)
  = (1/N) ( N_1 N_2 y_2 (μ̂_2 - μ̂_1) - N_1 N_2 y_1 (μ̂_2 - μ̂_1) )        (19)
  = (N_1 N_2 / N) (y_2 - y_1) (μ̂_2 - μ̂_1).        (20)
Hence the proof in (c) still holds.
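A numerical sanity check of (d), as a sketch with synthetic two-class data: the least squares slope vector points in the same direction for any two distinct codings.

import numpy as np

rng = np.random.default_rng(5)
N1, N2, p = 40, 60, 4
X = np.vstack([rng.normal(0.0, 1.0, size=(N1, p)),
               rng.normal(1.0, 1.0, size=(N2, p))])
labels = np.r_[np.zeros(N1), np.ones(N2)]

def ls_slope(code0, code1):
    # Least squares fit of the coded target on (1, x); return the slope part beta.
    y = np.where(labels == 0, code0, code1)
    Z = np.column_stack([np.ones(len(y)), X])
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return theta[1:]

b1 = ls_slope(-1.0, 1.0)   # one arbitrary distinct coding
b2 = ls_slope(3.0, 7.5)    # another
print(np.allclose(b1 / np.linalg.norm(b1), b2 / np.linalg.norm(b2)))  # True: same direction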
Exercise 4.3 Suppose we transform the original predictors X to v via linear regression. In detail, let v = X(X⊤X)^{-1}X⊤Y = XB̂, where Y is the indicator response matrix. Similarly, for any input x ∈ R^p, we get a transformed vector w = B̂⊤x ∈ R^K. Show that LDA using v is identical to LDA in the original space.
Proof Prof. Zhang has given a solution to this problem, but only two students noticed that the problem is tricky. Here I give the main part of the solution.

• First, y_1, . . . , y_K must be linearly independent; then we have r(Σ̂_y) = K.
• Assume that B̂ is p × K with r(B̂) = K < p and r(Σ̂_x) = p. Note that

r( B̂(B̂⊤Σ̂_xB̂)^{-1}B̂⊤ ) ≤ r(B̂) < r(Σ̂_x),

so B̂(B̂⊤Σ̂_xB̂)^{-1}B̂⊤ ≠ Σ̂_x^{-1} when dim(Y) < dim(X).

• The problem is therefore equivalent to proving that

B̂(B̂⊤Σ̂_xB̂)^{-1}B̂⊤(x̄_k - x̄_l) = Σ̂_x^{-1}(x̄_k - x̄_l).

Let P = Σ̂_xB̂(B̂⊤Σ̂_xB̂)^{-1}B̂⊤; then

P^2 = P.        (21)
Hence P is a projection matrix. Note that Y = (y_1, y_2, . . . , y_K) is the indicator response matrix, so

x̄_k = (1/N_k) X⊤y_k,

and, writing D = Y⊤Y = diag(N_1, . . . , N_K),

Σ̂_x = Σ_{k=1}^{K} Σ_{g_i=k} (x_i - x̄_k)(x_i - x̄_k)⊤ / (N - K)
  = (1/(N - K)) ( Σ_{k=1}^{K} Σ_{g_i=k} x_i x_i⊤ - Σ_{k=1}^{K} N_k x̄_k x̄_k⊤ )
  = (1/(N - K)) ( X⊤X - Σ_{k=1}^{K} (1/N_k) X⊤y_k y_k⊤X )
  = (1/(N - K)) ( X⊤X - X⊤Y D^{-1} Y⊤X ).
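This matrix identity can be verified directly; the following is a sketch, where Y is the N × K indicator matrix and D = Y⊤Y.

import numpy as np

rng = np.random.default_rng(6)
N, p, K = 90, 4, 3
g = np.repeat(np.arange(K), N // K)             # class labels
X = rng.normal(size=(N, p)) + g[:, None]        # class-dependent mean shift
Y = np.eye(K)[g]                                # indicator response matrix

# Pooled within-class covariance, computed class by class.
S = np.zeros((p, p))
for k in range(K):
    Xk = X[g == k]
    xbar = Xk.mean(axis=0)
    S += (Xk - xbar).T @ (Xk - xbar)
S /= (N - K)

# Matrix form: (X'X - X'Y (Y'Y)^{-1} Y'X) / (N - K).
D = Y.T @ Y
S_mat = (X.T @ X - X.T @ Y @ np.linalg.inv(D) @ Y.T @ X) / (N - K)
print(np.allclose(S, S_mat))  # True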
Note that

P (N_1 x̄_1, N_2 x̄_2, . . . , N_K x̄_K) = P X⊤Y,        (22)

so we need only show that P X⊤Y = X⊤Y, where

P X⊤Y = Σ̂_xB̂ (B̂⊤Σ̂_xB̂)^{-1} B̂⊤X⊤Y.

By the definition of B̂, we have

B̂⊤X⊤Y = Y⊤X(X⊤X)^{-1}X⊤Y
Σ̂_xB̂ = (1/(N - K)) ( (X⊤X)(X⊤X)^{-1}X⊤Y - X⊤Y D^{-1} Y⊤X(X⊤X)^{-1}X⊤Y )
  = (1/(N - K)) X⊤Y ( I - D^{-1} Y⊤X(X⊤X)^{-1}X⊤Y )
B̂⊤Σ̂_xB̂ = (1/(N - K)) Y⊤X(X⊤X)^{-1}X⊤Y ( I - D^{-1} Y⊤X(X⊤X)^{-1}X⊤Y ).

Let Q = Y⊤X(X⊤X)^{-1}X⊤Y; then

B̂⊤X⊤Y = Q        (23)
Σ̂_xB̂ = (1/(N - K)) X⊤Y (I - D^{-1}Q)        (24)
(B̂⊤Σ̂_xB̂)^{-1} = (N - K) ( Q(I - D^{-1}Q) )^{-1}        (25)

Since B̂⊤Σ̂_xB̂ is invertible, Q(I - D^{-1}Q) is an invertible matrix, and

K = r( Q(I - D^{-1}Q) ) ≤ r(Q)  ⟹  r(Q) = K.        (26)

So Q is an invertible matrix, and it yields

(B̂⊤Σ̂_xB̂)^{-1} = (N - K) (I - D^{-1}Q)^{-1} Q^{-1}.        (27)
Combining (23), (24), and (27), we have

P X⊤Y = (1/(N - K)) X⊤Y (I - D^{-1}Q) · (N - K) (I - D^{-1}Q)^{-1} Q^{-1} · Q        (28)
  = X⊤Y.        (29)

In particular P x̄_k = x̄_k for every k, so left-multiplying P(x̄_k - x̄_l) = x̄_k - x̄_l by Σ̂_x^{-1} gives

B̂(B̂⊤Σ̂_xB̂)^{-1}B̂⊤(x̄_k - x̄_l) = Σ̂_x^{-1}(x̄_k - x̄_l),

so LDA using v is identical to LDA in the original space.
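The key identity P X⊤Y = X⊤Y can also be confirmed numerically; the sketch below reuses the same construction of Σ̂_x, B̂ and P as above.

import numpy as np

rng = np.random.default_rng(7)
N, p, K = 120, 5, 3
g = np.repeat(np.arange(K), N // K)
X = rng.normal(size=(N, p)) + g[:, None]
Y = np.eye(K)[g]                                                     # indicator matrix

D = Y.T @ Y
Sigma = (X.T @ X - X.T @ Y @ np.linalg.inv(D) @ Y.T @ X) / (N - K)   # within-class covariance
B = np.linalg.solve(X.T @ X, X.T @ Y)                                # B-hat = (X'X)^{-1} X'Y
P = Sigma @ B @ np.linalg.inv(B.T @ Sigma @ B) @ B.T

print(np.allclose(P @ P, P))              # P is a projection, eq. (21)
print(np.allclose(P @ X.T @ Y, X.T @ Y))  # P X'Y = X'Y, eqs. (28)-(29)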