Motif representation using position weight matrix
Xiaohui Xie
University of California, Irvine
Position weight matrix
Position weight matrix representation of a motif with width $w$:

$$\theta = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} & \theta_{14} \\ \theta_{21} & \theta_{22} & \theta_{23} & \theta_{24} \\ & \cdots & & \\ \theta_{w1} & \theta_{w2} & \theta_{w3} & \theta_{w4} \end{pmatrix} \qquad (1)$$

where each row represents one position of the motif and is normalized:

$$\sum_{j=1}^{4} \theta_{ij} = 1 \qquad (2)$$

for all $i = 1, 2, \cdots, w$.
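Concretely, a PWM is just a $w \times 4$ array whose rows are probability vectors. A minimal sketch in Python (numpy assumed; the numbers are invented for illustration):

```python
import numpy as np

# A toy position weight matrix for a motif of width w = 3.
# Rows are motif positions; columns are the letters A, C, G, T.
theta = np.array([
    [0.80, 0.10, 0.05, 0.05],   # position 1 strongly prefers A
    [0.10, 0.10, 0.70, 0.10],   # position 2 strongly prefers G
    [0.25, 0.25, 0.25, 0.25],   # position 3 is uninformative
])

# Check the normalization constraint (2): each row sums to 1.
assert np.allclose(theta.sum(axis=1), 1.0)
```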
Likelihood
Given the position weight matrix $\theta$, the probability of generating a sequence $S = (S_1, S_2, \cdots, S_w)$ from $\theta$ is

$$P(S|\theta) = \prod_{i=1}^{w} P(S_i|\theta_i) \qquad (3)$$

$$= \prod_{i=1}^{w} \theta_{i,S_i} \qquad (4)$$

For convenience, we have converted $S$ from a string of $\{A, C, G, T\}$ to a string of $\{1, 2, 3, 4\}$.
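Equation (4) picks out one entry of $\theta$ per position and multiplies them. A sketch, reusing the toy matrix above and coding the letters as 0–3 (0-based, rather than the 1–4 used in the text):

```python
import numpy as np

def motif_likelihood(S, theta):
    """P(S|theta) = prod_i theta[i, S_i], with S a 0-based integer sequence."""
    w = theta.shape[0]
    assert len(S) == w
    return float(np.prod(theta[np.arange(w), S]))

theta = np.array([[0.80, 0.10, 0.05, 0.05],
                  [0.10, 0.10, 0.70, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])

# The sequence "AGT" coded as (0, 2, 3):
print(motif_likelihood([0, 2, 3], theta))  # 0.8 * 0.7 * 0.25 = 0.14
```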
Likelihood
Suppose we observe not just one, but a set of sequences $S_1, S_2, \cdots, S_n$, each of which contains exactly $w$ letters. Assume each of them is generated independently from the model $\theta$. Then, the likelihood of observing these $n$ sequences is

$$P(S_1, S_2, \cdots, S_n|\theta) = \prod_{k=1}^{n} P(S_k|\theta) \qquad (5)$$

$$= \prod_{k=1}^{n} \prod_{i=1}^{w} \theta_{i,S_{ki}} = \prod_{i=1}^{w} \prod_{j=1}^{4} \theta_{ij}^{c_{ij}} \qquad (6)$$

where $c_{ij}$ is the number of letter $j$ at position $i$ (note that $\sum_{j=1}^{4} c_{ij} = n$ for all $i$).
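Equation (6) shows that the data enter the likelihood only through the count matrix $c$. A sketch of computing $c_{ij}$ and the log-likelihood, again with 0-based letter codes:

```python
import numpy as np

def count_matrix(seqs):
    """c[i, j] = number of letter j at position i across the n sequences."""
    seqs = np.asarray(seqs)          # shape (n, w), entries in {0, 1, 2, 3}
    n, w = seqs.shape
    c = np.zeros((w, 4), dtype=int)
    for i in range(w):
        c[i] = np.bincount(seqs[:, i], minlength=4)
    return c

def log_likelihood(c, theta):
    """log P(S_1..S_n | theta) = sum_ij c_ij log theta_ij, as in (6)."""
    return float(np.sum(c * np.log(theta)))

seqs = [[0, 2, 3], [0, 2, 1], [0, 1, 3]]   # three toy sequences of width 3
c = count_matrix(seqs)
assert (c.sum(axis=1) == len(seqs)).all()  # sum_j c_ij = n for every i
```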
Parameter estimation
Now suppose we do not know $\theta$. How do we estimate it from the observed sequence data $S_1, S_2, \cdots, S_n$?

One solution: calculate the likelihood of observing the provided $n$ sequences for different values of $\theta$,

$$L(\theta) = P(S_1, S_2, \cdots, S_n|\theta) = \prod_{k=1}^{n} \prod_{i=1}^{w} \theta_{i,S_{ki}} \qquad (7)$$

and pick the value with the largest likelihood, that is, find the $\theta^*$ that achieves

$$\max_{\theta} P(S_1, S_2, \cdots, S_n|\theta) \qquad (8)$$
Maximum likelihood estimation
Maximum likelihood estimation of $\theta$ is

$$\hat{\theta}_{ML} = \arg\max_{\theta} \log L(\theta) = \arg\max_{\theta} \sum_{i=1}^{w} \sum_{j=1}^{4} c_{ij} \log \theta_{ij} \quad \text{s.t.} \quad \sum_{j=1}^{4} \theta_{ij} = 1, \ \forall i = 1, \cdots, w \qquad (9)$$
Optimization with equality constraints
Construct a Lagrangian function taking the equality constraints into account:

$$g(\theta) = \log L(\theta) + \sum_{i=1}^{w} \lambda_i \Big(1 - \sum_{j=1}^{4} \theta_{ij}\Big) \qquad (10)$$

Solve the unconstrained optimization problem

$$\hat{\theta} = \arg\max_{\theta} g(\theta) = \arg\max_{\theta} \Big[ \sum_{i=1}^{w} \sum_{j=1}^{4} c_{ij} \log \theta_{ij} + \sum_{i=1}^{w} \lambda_i \Big(1 - \sum_{j=1}^{4} \theta_{ij}\Big) \Big] \qquad (11)$$
Optimization with equality constraints
Take the derivatives of $g(\theta)$ w.r.t. $\theta_{ij}$ and the Lagrange multiplier $\lambda_i$ and set them to 0:

$$\frac{\partial g(\theta)}{\partial \theta_{ij}} = 0 \qquad (12)$$

$$\frac{\partial g(\theta)}{\partial \lambda_i} = 0 \qquad (13)$$

Equation (12) gives $c_{ij}/\theta_{ij} = \lambda_i$, so $\theta_{ij} = c_{ij}/\lambda_i$; substituting into the constraint (13) gives $\lambda_i = \sum_j c_{ij} = n$, which leads to:

$$\hat{\theta}_{ij} = \frac{c_{ij}}{n} \qquad (14)$$

which is simply the frequency of each letter at each position ($c_{ij}$ is the number of letter $j$ at position $i$).
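So the constrained optimum reduces to per-position letter frequencies; a sketch:

```python
import numpy as np

def mle_pwm(c):
    """Maximum likelihood PWM from equation (14): theta_ij = c_ij / n."""
    n = c.sum(axis=1, keepdims=True)   # each row of c sums to n
    return c / n

c = np.array([[3, 0, 0, 0],
              [0, 1, 2, 0],
              [0, 1, 0, 2]])
theta_ml = mle_pwm(c)
assert np.allclose(theta_ml.sum(axis=1), 1.0)
```

Note that a letter never observed at a position gets $\hat{\theta}_{ij} = 0$, making any sequence containing it impossible under the model; this is one motivation for the Bayesian treatment that follows.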
Bayes’ Theorem
$$P(\theta|S) = \frac{P(S|\theta)\,P(\theta)}{P(S)} \qquad (15)$$

Each term in Bayes' theorem has a conventional name:

1. $P(S|\theta)$ – the conditional probability of $S$ given $\theta$, also called the likelihood.
2. $P(\theta)$ – the prior probability or marginal probability of $\theta$.
3. $P(\theta|S)$ – the conditional probability of $\theta$ given $S$, also called the posterior probability of $\theta$.
4. $P(S)$ – the marginal probability of $S$, which acts as a normalizing constant.
Maximum a posteriori (MAP) estimation
MAP (or posterior mode) estimation of $\theta$ is

$$\hat{\theta}_{MAP}(S) = \arg\max_{\theta} P(\theta|S_1, S_2, \cdots, S_n) \qquad (16)$$

$$= \arg\max_{\theta} \big[ \log L(\theta) + \log P(\theta) \big] \qquad (17)$$

Assume $P(\theta) = \prod_{i=1}^{w} P(\theta_i)$ (independence of $\theta_i$ at different positions $i$).

Model $P(\theta_i)$ with a Dirichlet distribution:

$$(\theta_{i1}, \theta_{i2}, \theta_{i3}, \theta_{i4}) \sim \mathrm{Dir}(\alpha_1, \alpha_2, \alpha_3, \alpha_4) \qquad (18)$$
Dirichlet Distribution
Probability density function of the Dirichlet distribution $\mathrm{Dir}(\alpha)$ of order $K \ge 2$:

$$p(x_1, \cdots, x_K; \alpha_1, \cdots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \qquad (19)$$

for all $x_1, \cdots, x_K > 0$ with $\sum_{i=1}^{K} x_i = 1$. The density is zero outside this open $(K-1)$-dimensional simplex.

$\alpha = (\alpha_1, \cdots, \alpha_K)$ are parameters with $\alpha_i > 0$ for all $i$. $B(\alpha)$, the normalizing constant, is the multinomial beta function:

$$B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\big(\sum_{i=1}^{K} \alpha_i\big)} \qquad (20)$$
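For intuition, both sampling from and evaluating this density are available off the shelf; a sketch using numpy and scipy:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 2.0, 2.0, 2.0])
rng = np.random.default_rng(0)

x = rng.dirichlet(alpha)         # one draw from the (K-1)-simplex
print(x, x.sum())                # components are positive and sum to 1
print(dirichlet.pdf(x, alpha))   # density from equation (19)
```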
Gamma function
Gamma function for positive real $z$:

$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt \qquad (21)$$

$$\Gamma(z + 1) = z\,\Gamma(z) \qquad (22)$$

If $n$ is a positive integer, then

$$\Gamma(n + 1) = n! \qquad (23)$$
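Both identities are easy to spot-check numerically, e.g. with Python's standard library:

```python
import math

z = 2.5
assert math.isclose(math.gamma(z + 1), z * math.gamma(z))   # recurrence (22)
assert math.isclose(math.gamma(5 + 1), math.factorial(5))   # (23): Gamma(6) = 5! = 120
```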
Properties of Dirichlet distribution
Dirichlet distribution:

$$p(x_1, \cdots, x_K; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \qquad (24)$$

Expectation (define $\alpha_0 = \sum_{i=1}^{K} \alpha_i$):

$$E[X_i] = \frac{\alpha_i}{\alpha_0} \qquad (25)$$

Variance:

$$\mathrm{Var}[X_i] = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)} \qquad (26)$$

Covariance:

$$\mathrm{Cov}[X_i, X_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)} \qquad (27)$$
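A quick Monte Carlo sanity check of the mean and variance formulas (the sample size is an arbitrary choice):

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0, 4.0])
a0 = alpha.sum()
rng = np.random.default_rng(0)
X = rng.dirichlet(alpha, size=200_000)   # 200k draws from Dir(alpha)

# Compare empirical moments with equations (25) and (26).
print(X.mean(axis=0), alpha / a0)
print(X.var(axis=0), alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))
```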
Posterior Distribution
Conditional probability:

$$P(S_1, S_2, \cdots, S_n|\theta) = \prod_{i=1}^{w} \prod_{j=1}^{4} \theta_{ij}^{c_{ij}}$$

Prior probability:

$$p(\theta_{i1}, \cdots, \theta_{i4}; \alpha) = \frac{1}{B(\alpha)} \prod_{j=1}^{4} \theta_{ij}^{\alpha_j - 1}$$

Posterior probability:

$$P(\theta_i|S_1, \cdots, S_n) = \mathrm{Dir}(c_{i1} + \alpha_1, c_{i2} + \alpha_2, c_{i3} + \alpha_3, c_{i4} + \alpha_4)$$

Maximum a posteriori estimate:

$$\hat{\theta}^{MAP}_{ij} = \frac{c_{ij} + \alpha_j - 1}{n + \alpha_0 - 4} \qquad (28)$$

where $\alpha_0 \equiv \sum_j \alpha_j$.
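Equation (28) amounts to adding pseudocounts $\alpha_j - 1$ before normalizing. A sketch; the uniform prior $\alpha = (2, 2, 2, 2)$ (add-one smoothing) is an arbitrary choice for illustration:

```python
import numpy as np

def map_pwm(c, alpha):
    """MAP PWM from equation (28): theta_ij = (c_ij + alpha_j - 1) / (n + alpha_0 - 4)."""
    n = c.sum(axis=1, keepdims=True)
    alpha = np.asarray(alpha, dtype=float)
    return (c + alpha - 1.0) / (n + alpha.sum() - 4.0)

c = np.array([[3, 0, 0, 0],
              [0, 1, 2, 0],
              [0, 1, 0, 2]])
theta_map = map_pwm(c, alpha=[2.0, 2.0, 2.0, 2.0])  # no zero entries now
assert np.allclose(theta_map.sum(axis=1), 1.0)
```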
Mixture of sequences
Suppose we have a more difficult situation: among the set of $n$ given sequences $S_1, S_2, \cdots, S_n$, only a subset of them are generated by a weight matrix model $\theta$. How do we identify $\theta$ in this case?

Let us first define the "non-motif" (also called background) sequences. Suppose they are generated from a single distribution

$$p^0 = (p^0_A, p^0_C, p^0_G, p^0_T) = (p^0_1, p^0_2, p^0_3, p^0_4) \qquad (29)$$
Likelihood for mixture of sequences
Now the problem is that we do not know which sequence is generated from the motif model ($\theta$) and which one is generated from the background model ($\theta_0$).

Suppose we are provided with such label information:

$$z_i = \begin{cases} 1 & \text{if } S_i \text{ is generated by } \theta \\ 0 & \text{if } S_i \text{ is generated by } \theta_0 \end{cases} \qquad (30)$$

for all $i = 1, 2, \cdots, n$. Then, the likelihood of observing the $n$ sequences is

$$P(S_1, S_2, \cdots, S_n|z, \theta, \theta_0) = \prod_{i=1}^{n} \big[ z_i P(S_i|\theta) + (1 - z_i) P(S_i|\theta_0) \big]$$
Maximum Likelihood
Find the joint probability of the sequences and the labels:

$$P(S, z|\theta, \theta_0) = P(S|z, \theta, \theta_0)\,P(z) = \prod_{i=1}^{n} P(z_i) \big[ z_i P(S_i|\theta) + (1 - z_i) P(S_i|\theta_0) \big]$$

where $z \equiv (z_1, \cdots, z_n)$ and $P(z) = \prod_i P(z_i)$.

Marginalize over the labels to derive the likelihood:

$$L(\theta) = P(S|\theta, \theta_0) = \prod_{i=1}^{n} \big[ P(z_i = 1) P(S_i|\theta) + P(z_i = 0) P(S_i|\theta_0) \big]$$

Maximum likelihood estimate:

$$\hat{\theta}_{ML} = \arg\max_{\theta} \log L(\theta)$$
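The marginal likelihood can be evaluated directly once $q = P(z_i = 1)$, the motif model $\theta$, and the background $\theta_0$ are given; a sketch (here $\theta_0$ is a single letter distribution applied i.i.d. across positions, as defined earlier):

```python
import numpy as np

def mixture_log_likelihood(seqs, theta, theta0, q):
    """log L(theta) = sum_i log[q P(S_i|theta) + (1-q) P(S_i|theta0)]."""
    seqs = np.asarray(seqs)                        # (n, w), 0-based letter codes
    n, w = seqs.shape
    pos = np.arange(w)
    p_motif = np.prod(theta[pos, seqs], axis=1)    # P(S_i | theta) per sequence
    p_bg = np.prod(theta0[seqs], axis=1)           # P(S_i | theta0), i.i.d. background
    return float(np.sum(np.log(q * p_motif + (1 - q) * p_bg)))
```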
Lower bound on $L(\theta)$
Log likelihood function:

$$\log L(\theta) = \sum_{i=1}^{n} \log \big[ P(z_i = 1) P(S_i|z_i = 1) + P(z_i = 0) P(S_i|z_i = 0) \big]$$

where $P(S_i|z_i = 1) = P(S_i|\theta)$ and $P(S_i|z_i = 0) = P(S_i|\theta_0)$.

Jensen's inequality:

$$\log(q_1 x + q_2 y) \ge q_1 \log(x) + q_2 \log(y)$$

for all $q_1, q_2 \ge 0$ with $q_1 + q_2 = 1$.
EM-algorithm
Lower bound on $\log L(\theta)$:

$$\log L(\theta) \ge \sum_{i=1}^{N} \left\{ q_i \log \frac{P(z_i = 1) P(S_i|z_i = 1)}{q_i} + (1 - q_i) \log \frac{P(z_i = 0) P(S_i|z_i = 0)}{1 - q_i} \right\} \equiv \sum_{i=1}^{N} \phi(q_i, \theta)$$

Expectation-Maximization: alternate between two steps:

E-step:
$$\hat{q}_i = \arg\max_{q_i} \phi(q_i, \hat{\theta})$$

M-step:
$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \phi(\hat{q}_i, \theta)$$
E-Step
Auxiliary function:

$$\phi(q_i, \theta) = q_i \log \frac{P(z_i = 1) P(S_i|z_i = 1)}{q_i} + (1 - q_i) \log \frac{P(z_i = 0) P(S_i|z_i = 0)}{1 - q_i}$$

E-step:

$$\hat{q}_i = \arg\max_{q_i} \phi(q_i, \hat{\theta})$$

which leads to

$$\hat{q}_i = \frac{P(z_i = 1) P(S_i|z_i = 1)}{P(z_i = 1) P(S_i|z_i = 1) + P(z_i = 0) P(S_i|z_i = 0)} = P(z_i = 1|S_i)$$
M-Step
Auxiliary function:

$$\phi(q_i, \theta) = q_i \log \frac{P(z_i = 1) P(S_i|z_i = 1)}{q_i} + (1 - q_i) \log \frac{P(z_i = 0) P(S_i|z_i = 0)}{1 - q_i}$$

M-step (dropping terms that do not depend on $\theta$):

$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \phi(\hat{q}_i, \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \big[ \hat{q}_i \log P(S_i|\theta) + (1 - \hat{q}_i) \log P(S_i|\theta_0) \big]$$

which leads to

$$\hat{\theta}_{ij} = \frac{\sum_{k=1}^{N} \hat{q}_k I(S_{ki} = j)}{\sum_{k=1}^{N} \hat{q}_k}$$

where $I(a)$ is an indicator function: $I(a) = 1$ if $a$ is true and 0 otherwise.
Summary of EM-algorithm
Initialize parameters $\theta$.

Repeat until convergence:

- E-step: estimate the expected values of the labels, given the current parameter estimate:
  $$\hat{q}_i = P(z_i = 1|S_i)$$
- M-step: re-estimate the parameters, given the expected estimates of the labels:
  $$\hat{\theta}_{ij} = \frac{\sum_{k=1}^{N} \hat{q}_k I(S_{ki} = j)}{\sum_{k=1}^{N} \hat{q}_k}$$

The procedure is guaranteed to converge to a local maximum or saddle point solution.
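Putting the two steps together gives a compact implementation. A sketch, treating $q = P(z_i = 1)$ as fixed and known (an assumption made for simplicity; it could also be re-estimated in the M-step):

```python
import numpy as np

def em_motif(seqs, theta0, q, n_iter=100, seed=0):
    """EM for the two-component mixture: returns the PWM estimate theta
    and the posterior label probabilities q_hat (one per sequence)."""
    seqs = np.asarray(seqs)                    # (n, w), 0-based letter codes
    n, w = seqs.shape
    pos = np.arange(w)
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.ones(4), size=w)  # random row-stochastic initialization

    for _ in range(n_iter):
        # E-step: q_hat_i = P(z_i = 1 | S_i), from the closed form above.
        p1 = q * np.prod(theta[pos, seqs], axis=1)
        p0 = (1 - q) * np.prod(theta0[seqs], axis=1)
        q_hat = p1 / (p1 + p0)

        # M-step: theta_ij = sum_k q_hat_k I(S_ki = j) / sum_k q_hat_k.
        theta = np.zeros((w, 4))
        for i in range(w):
            for j in range(4):
                theta[i, j] = q_hat[seqs[:, i] == j].sum()
        theta /= q_hat.sum()
    return theta, q_hat
```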
What about MAP estimate?
Consider a Dirichlet prior distribution on $\theta_i$:

$$\theta_i \sim \mathrm{Dir}(\alpha) \quad \forall i = 1, \cdots, w$$

Initialize parameters $\theta$.

Repeat until convergence:

- E-step: estimate the expected values of the labels, given the current parameter estimate:
  $$\hat{q}_i = P(z_i = 1|S_i)$$
- M-step: re-estimate the parameters, given the expected estimates of the labels:
  $$\hat{\theta}_{ij} = \frac{\sum_{k=1}^{N} \hat{q}_k I(S_{ki} = j) + \alpha_j - 1}{\sum_{k=1}^{N} \hat{q}_k + \alpha_0 - 4}$$
Different methods for parameter estimation
So far, we have introduced two methods: ML and MAP.

Maximum likelihood (ML):
$$\hat{\theta}_{ML} = \arg\max_{\theta} P(S|\theta)$$

Maximum a posteriori (MAP):
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta|S)$$

Bayes estimator: the one minimizing the MSE (mean squared error),
$$\hat{\theta} = \arg\min_{\hat{\theta}} E\big[(\hat{\theta}(S) - \theta)^2\big]$$
whose solution is the posterior mean:
$$\hat{\theta} = E[\theta|S] = \int \theta P(\theta|S)\, d\theta$$
Joint distribution
Joint distribution of labels and sequences:

$$P(S, z|\theta, \theta_0) = P(S|z, \theta, \theta_0)\,P(z) = \prod_{i=1}^{n} P(z_i) \big[ z_i P(S_i|\theta) + (1 - z_i) P(S_i|\theta_0) \big]$$

Joint distribution of $S$, $z$, $\theta$ and $\theta_0$:

$$P(S, z, \theta, \theta_0) = \prod_{i=1}^{n} P(z_i) \big[ z_i P(S_i|\theta) + (1 - z_i) P(S_i|\theta_0) \big] P(\theta) P(\theta_0)$$

Find the joint distribution of $S$ and $z$:

$$P(S, z) = \int_{\theta} \int_{\theta_0} \prod_{i=1}^{n} P(z_i) \big[ z_i P(S_i|\theta) + (1 - z_i) P(S_i|\theta_0) \big] P(\theta) P(\theta_0)\, d\theta\, d\theta_0$$
Posterior distribution of labels
Posterior distribution of $z$:

$$P(z|S) = \frac{1}{P(S)} \int_{\theta} \int_{\theta_0} \prod_{i=1}^{n} P(z_i) \big[ z_i P(S_i|\theta) + (1 - z_i) P(S_i|\theta_0) \big] P(\theta) P(\theta_0)\, d\theta\, d\theta_0$$

$$\sim q^m (1 - q)^{n - m} \left[ \prod_{i=1}^{w} \int_{\theta_i} \prod_{j=1}^{4} \theta_{ij}^{n_{ij}} P(\theta_i)\, d\theta_i \right] \frac{B(n_0 + \alpha_0)}{B(\alpha_0)}$$

$$\sim q^m (1 - q)^{n - m} \prod_{i=1}^{w} \frac{B(n_i + \alpha)}{B(\alpha)} \cdot \frac{B(n_0 + \alpha_0)}{B(\alpha_0)}$$

where $m = \sum_{i=1}^{n} z_i$, $P(z_i = 1) = q$, and $P(\theta_i) = \mathrm{Dir}(\alpha)$, $P(\theta_0) = \mathrm{Dir}(\alpha_0)$ are Dirichlet priors. $n_{ij} \equiv \sum_{k=1}^{n} z_k I(S_{ki} = j)$ is the number of letter $j$ at position $i$ among the sequences with label 1, and $n_i \equiv (n_{i1}, \cdots, n_{i4})$. $n_{0,j} \equiv \sum_{k=1}^{n} (1 - z_k) \sum_{i=1}^{w} I(S_{ki} = j)$ and $n_0 \equiv (n_{0,1}, \cdots, n_{0,4})$.
Sampling
Posterior distribution of $z$:

$$P(z|S) \sim q^m (1 - q)^{n - m} \prod_{i=1}^{w} \frac{B(n_i + \alpha)}{B(\alpha)} \cdot \frac{B(n_0 + \alpha_0)}{B(\alpha_0)}$$

Posterior distribution of $z_k$ conditioned on all other labels $z_{-k} \equiv \{z_i \,|\, i = 1, \cdots, n,\ i \ne k\}$:

$$P(z_k = 1|z_{-k}, S) \sim q \prod_{i=1}^{w} \frac{B(n_{-k,i} + \Delta(S_{ki}) + \alpha)}{B(n_{-k,i} + \alpha)}$$

where $n_{-k,ij} \equiv \sum_{l=1, l \ne k}^{n} z_l I(S_{li} = j)$ is the number of letter $j$ at position $i$ among all sequences with label 1, excluding the $k$th sequence; $n_{-k,i} \equiv (n_{-k,i1}, \cdots, n_{-k,i4})$; and $\Delta(l) = (b_1, \cdots, b_4)$ with $b_l = 1$ and all other $b_j = 0$.
Posterior distribution
Posterior distribution of $z_k$ conditioned on $z_{-k}$:

$$P(z_k = 1|z_{-k}, S) \sim q \prod_{i=1}^{w} \frac{n_{-k,iS_{ki}} + \alpha_{S_{ki}} - 1}{\sum_j \big[ n_{-k,ij} + \alpha_j - 1 \big]} = q \prod_{i=1}^{w} \theta_{iS_{ki}}$$

Note that $\theta_{iS_{ki}}$ is the same as the MAP estimate of the frequency weight matrix using all sequences with label 1, excluding the $k$th sequence.

Similarly,

$$P(z_k = 0|z_{-k}, S) \sim (1 - q) \prod_{i=1}^{w} \frac{n_{-k,0S_{ki}} + \alpha_{0,S_{ki}} - 1}{\sum_j \big[ n_{-k,0j} + \alpha_{0,j} - 1 \big]} = (1 - q) \prod_{i=1}^{w} \theta_{0,S_{ki}}$$

$\theta_{0,S_{ki}}$ is the same as the MAP estimate of the background distribution.
Gibbs sampling
Posterior probability:

$$P(z_k = 1|z_{-k}, S) \sim q \prod_{i=1}^{w} \theta_{iS_{ki}}, \qquad P(z_k = 0|z_{-k}, S) \sim (1 - q) \prod_{i=1}^{w} \theta_{0,S_{ki}}$$

Gibbs sampling:

$$P(z_k = 1|z_{-k}, S) = \frac{q \prod_{i=1}^{w} \theta_{iS_{ki}}}{q \prod_{i=1}^{w} \theta_{iS_{ki}} + (1 - q) \prod_{i=1}^{w} \theta_{0,S_{ki}}}$$
Gibbs sampling
Initialize the labels $z$: assign the value of $z_i$ randomly according to $P(z_i = 1) = q$ for all $i = 1, \cdots, n$.

Repeat until convergence:

- For $i = 1$ to $n$:
  - Update the $\theta$ matrix using the MAP estimate (excluding the $i$th sequence).
  - Sample the value of $z_i$, as sketched below.
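A sketch of the full sampler in Python, using the MAP update of $\theta$ from equation (28) with a uniform pseudocount prior ($\alpha_j = 2$ everywhere, an arbitrary illustrative choice) and a fixed background distribution estimated empirically from all the data:

```python
import numpy as np

def gibbs_motif(seqs, q=0.5, alpha=2.0, n_sweeps=200, seed=0):
    """Gibbs sampler for the labels z; theta is re-estimated by its MAP
    value from the currently labeled sequences, excluding sequence i."""
    seqs = np.asarray(seqs)                 # (n, w), 0-based letter codes
    n, w = seqs.shape
    pos = np.arange(w)
    rng = np.random.default_rng(seed)
    z = (rng.random(n) < q).astype(int)     # initialize z_i ~ Bernoulli(q)
    theta0 = np.bincount(seqs.ravel(), minlength=4) / seqs.size  # fixed background

    for _ in range(n_sweeps):
        for i in range(n):
            # Counts from sequences with label 1, excluding the i-th sequence.
            mask = (z == 1)
            mask[i] = False
            c = np.zeros((w, 4))
            for j in range(4):
                c[:, j] = (seqs[mask] == j).sum(axis=0)
            # MAP estimate of theta, equation (28), with pseudocount alpha - 1.
            theta = (c + alpha - 1) / (mask.sum() + 4 * (alpha - 1))
            # Sample z_i from its conditional, as on the previous slide.
            p1 = q * np.prod(theta[pos, seqs[i]])
            p0 = (1 - q) * np.prod(theta0[seqs[i]])
            z[i] = rng.random() < p1 / (p1 + p0)
    return z
```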