Sequence Alignment

Xiaohui Xie

University of California, Irvine

Sequence Alignment – p.1/36

Pairwise sequence alignment

Example: Given two sequences $S = \text{ACCTGA}$ and $T = \text{AGCTA}$, find the minimal number of edit operations needed to transform $S$ into $T$.

Edit operations:

Insertion

Deletion

Substitution
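The example above can be checked with a short sketch. It assumes unit cost for each of the three edit operations (the slide does not state costs), and the function name `edit_distance` is hypothetical. It uses the same tabular idea that the dynamic programming slides develop later.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimal number of insertions, deletions and substitutions turning s into t.

    Assumption: every edit operation costs 1.
    """
    # prev[j] is the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete s[i-1]
                         cur[j - 1] + 1,       # insert t[j-1]
                         prev[j - 1] + cost)   # substitute, or keep a match
        prev = cur
    return prev[-1]


print(edit_distance("ACCTGA", "AGCTA"))  # 2: one substitution and one deletion
```

Only two rows of the table are kept at a time, which is enough when just the number of operations is needed.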

Biological Motivation

Comparing or retrieving DNA/protein sequences in databases

Comparing two or more sequences for similarities

Finding patterns within a protein or DNA sequence

Tracking the evolution of sequences

...


Pairwise alignment

Definition: An alignment of two sequences $S$ and $T$ is obtained by first inserting spaces ('-') either into, before or at the ends of $S$ and $T$ to obtain $S'$ and $T'$ such that $|S'| = |T'|$, and then placing $S'$ on top of $T'$ such that every character in $S'$ is uniquely aligned with a character in $T'$.

Example: two aligned sequences:

S: GTAGTACAGCT-CAGTTGGGATCACAGGCTTCT
   |||| || ||| ||||||   ||||||  |||
T: GTAGAACGGCTTCAGTTG---TCACAGCGTTC-

Similarity measure

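$\sigma(a, b)$ - the score (weight) of the alignment of character $a$ with character $b$, where $a, b \in \Sigma \cup \{\text{'-'}\}$ and $\Sigma = \{A, C, G, T\}$.

For example

$$\sigma(a, b) = \begin{cases} \ldots & a = b \text{ and } a, b \in \Sigma \\ \ldots & a \neq b \text{ and } a, b \in \Sigma \\ -1 & a \neq b \text{ and } (a = \text{'-'} \text{ or } b = \text{'-'}) \end{cases}$$

Similarity between $S$ and $T$ given the alignment $(S', T')$:

$$V(S, T) = \sum_{i=1}^{d} \sigma(S'_i, T'_i)$$

where $d = |S'| = |T'|$ is the length of the alignment.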
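As a rough illustration, the similarity of a given alignment is just a column-by-column sum of $\sigma$. The sketch below scores the aligned pair from the "Pairwise alignment" slide. The score values (match = +2, mismatch = -1, space = -1) are an assumption borrowed from the scoring used on later slides, and the names `sigma` and `alignment_score` are hypothetical.

```python
GAP = "-"


def sigma(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Score of one alignment column (assumed values, see the text above)."""
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def alignment_score(s_aln: str, t_aln: str) -> int:
    """Sum sigma over all columns of an alignment (S', T')."""
    if len(s_aln) != len(t_aln):
        raise ValueError("aligned sequences must have equal length")
    return sum(sigma(a, b) for a, b in zip(s_aln, t_aln))


s_aln = "GTAGTACAGCT-CAGTTGGGATCACAGGCTTCT"
t_aln = "GTAGAACGGCTTCAGTTG---TCACAGCGTTC-"
# 24 matches, 4 mismatches and 5 spaces: 24*2 - 4 - 5 = 39
print(alignment_score(s_aln, t_aln))
```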

Global alignment

INPUT: Two sequences $S$ and $T$ of roughly the same length

Q: What is the maximum similarity between the two? Find a best alignment.

Nomenclature

$\Sigma$ - an alphabet, a non-empty finite set. For example, $\Sigma = \{A, C, G, T\}$

A string over $\Sigma$ is any finite sequence of characters from $\Sigma$

$\Sigma^n$ - the set of all strings over $\Sigma$ of length $n$. Note that $\Sigma^0 = \{\epsilon\}$

The set of all strings over $\Sigma$ of any length is denoted $\Sigma^* = \bigcup_{n \in \mathbb{N}} \Sigma^n$

a substring of a string $T = t_1 \cdots t_n$ is a string $\hat{T} = t_{1+i} \cdots t_{m+i}$, where $0 \leq i$ and $m + i \leq n$

a prefix of a string $T = t_1 \cdots t_n$ is a string $\hat{T} = t_1 \cdots t_m$, where $m \leq n$

a suffix of a string $T = t_1 \cdots t_n$ is a string $\hat{T} = t_{n-m+1} \cdots t_n$, where $m \leq n$

a subsequence of a string $T = t_1 \cdots t_n$ is a string $\hat{T} = t_{i_1} \cdots t_{i_m}$ such that $i_1 < \cdots < i_m$, where $m \leq n$
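The four definitions above map directly onto simple string checks. This is a minimal sketch with hypothetical helper names; the only non-obvious part is the subsequence test, which consumes a single iterator over `t` so that the matched positions are strictly increasing.

```python
def is_substring(p: str, t: str) -> bool:
    """p occurs in t as a contiguous block."""
    return p in t


def is_prefix(p: str, t: str) -> bool:
    return t.startswith(p)


def is_suffix(p: str, t: str) -> bool:
    return t.endswith(p)


def is_subsequence(p: str, t: str) -> bool:
    """p can be obtained from t by deleting characters (order preserved)."""
    it = iter(t)
    # 'ch in it' advances the iterator, so each match lies after the previous one.
    return all(ch in it for ch in p)


t = "AGCTA"
print(is_substring("ACT", t), is_subsequence("ACT", t))  # False True
```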

Nomenclature

Biology | Computer Science
Sequence | String, word
Subsequence | Substring (contiguous)
N/A | Subsequence
N/A | Exact matching
Alignment | Inexact matching

Pairwise global alignment

Example: one possible alignment between ACGCTTTG and CATGTAT is

S: AC--GCTTTG
T: -CATG-TAT-

Global alignment

Input: Two sequences $S = s_1 \cdots s_n$ and $T = t_1 \cdots t_m$ ($n$ and $m$ are approximately the same).

Question: Find an optimal alignment $S \to S'$ and $T \to T'$ such that

$$V = \sum_{i=1}^{d} \sigma(S'_i, T'_i)$$

is maximal, where $d = |S'| = |T'|$.

Dynamic programming

Let $V(i, j)$ be the optimal alignment score of $S_{1 \cdots i}$ and $T_{1 \cdots j}$ ($0 \leq i \leq n$, $0 \leq j \leq m$). $V$ has the following properties:

Base conditions:

$$V(i, 0) = \sum_{k=1}^{i} \sigma(S_k, \text{'-'}) \qquad (1)$$

$$V(0, j) = \sum_{k=1}^{j} \sigma(\text{'-'}, T_k) \qquad (2)$$

Recurrence relationship:

$$V(i, j) = \max \begin{cases} V(i-1, j-1) + \sigma(S_i, T_j) \\ V(i-1, j) + \sigma(S_i, \text{'-'}) \\ V(i, j-1) + \sigma(\text{'-'}, T_j) \end{cases} \qquad (4)$$

for all $i \in [1, n]$ and $j \in [1, m]$.

Tabular computation of optimal alignment

pseudo code:

for i = 0 to n do
begin
    for j = 0 to m do
    begin
        if i = 0 or j = 0 then
            set V(i,j) from the base conditions
        else
            calculate V(i,j) using V(i-1,j-1), V(i,j-1) and V(i-1,j)
    end
end
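The tabular computation can be sketched in a few lines. The score values (match = +2, mismatch = -1, space = -1) are assumptions in line with the scoring quoted on the neighbouring slides, and `nw_table` is a hypothetical name.

```python
def nw_table(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Fill the global alignment table V, with V[i][j] for s[:i] against t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned entirely against spaces.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap
    # Recurrence: each cell depends on its diagonal, upper and left neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
    return V


V = nw_table("ACCTGA", "AGCTA")
print(V[-1][-1])  # 6: four matches, one mismatch, one space
```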

Tabular computation

[Figure: dynamic programming table for an example alignment]

Score: match=+2, mismatch=-1.

Pairwise alignment

Reconstruction of the alignment: Traceback

Establish pointers in the cells of the table as the values are computed.

The time complexity of the algorithm is $O(nm)$. The space complexity of the algorithm is $O(n + m)$ if only $V(S, T)$ is required and $O(nm)$ for the reconstruction of the alignment.
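A possible way to realise the traceback is to store one pointer per cell while filling the table and then walk back from cell $(n, m)$. This is a sketch under assumed scores (match = +2, mismatch = -1, space = -1); the name `global_align` and the pointer labels "D", "U", "L" are hypothetical. When several moves tie, one of them is kept arbitrarily, so a different optimal alignment with the same score may exist.

```python
def global_align(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (score, S', T') for one optimal global alignment of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[""] * (m + 1) for _ in range(n + 1)]  # "D" diagonal, "U" up, "L" left
    for i in range(1, n + 1):
        V[i][0], ptr[i][0] = i * gap, "U"
    for j in range(1, m + 1):
        V[0][j], ptr[0][j] = j * gap, "L"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Store the pointer together with the value it produced.
            V[i][j], ptr[i][j] = max((V[i - 1][j - 1] + sub, "D"),
                                     (V[i - 1][j] + gap, "U"),
                                     (V[i][j - 1] + gap, "L"))
    # Traceback: follow the pointers from (n, m) back to (0, 0).
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "D":
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif move == "U":
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(a)), "".join(reversed(b))


score, s_aln, t_aln = global_align("ACGCTTTG", "CATGTAT")
print(score)
print(s_aln)
print(t_aln)
```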

Global alignment in linear space

Let $V^r(i, j)$ denote the optimal alignment value of the last $i$ characters in sequence $S$ against the last $j$ characters in sequence $T$.

$$V(n, m) = \max_{k \in [0, m]} \left\{ V\left(\tfrac{n}{2}, k\right) + V^r\left(\tfrac{n}{2}, m - k\right) \right\} \qquad (5)$$

Global alignment in linear space

Hirschberg's algorithm:

1. Compute $V(i, j)$. Save the values of the $n/2$-th row. Denote $V(i, j)$ the forward matrix $F$.

2. Compute $V^r(i, j)$. Save the values of the $n/2$-th row. Denote $V^r(i, j)$ the backward matrix $B$.

3. Find the column $k^*$ such that $F(n/2, k^*) + B(n/2, m - k^*)$ is maximal.

4. Now that $k^*$ is found, recursively partition the problem into two subproblems: i) Find the path from $(0, 0)$ to $(n/2, k^*)$. ii) Find the path from $(n/2, k^*)$ to $(n, m)$.

Hirschberg’s algorithm

The time complexity of Hirschberg's algorithm is $O(nm)$. The space complexity of Hirschberg's algorithm is $O(\min(m, n))$.
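The four steps of Hirschberg's algorithm can be sketched as follows. The scores (match = +2, mismatch = -1, space = -1) and all function names are assumptions made for illustration. `last_row` keeps only two rows, so its working space is linear in the length of `t`; reaching $O(\min(m, n))$ would additionally require putting the shorter sequence in the role of `t`, which this sketch does not do.

```python
GAP = "-"


def sub_score(a: str, b: str, match: int = 2, mismatch: int = -1) -> int:
    return match if a == b else mismatch


def last_row(s: str, t: str, gap: int = -1):
    """Last row of the global alignment table of s against t, keeping two rows."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = max(prev[j - 1] + sub_score(s[i - 1], t[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
        prev = cur
    return prev


def hirschberg(s: str, t: str, gap: int = -1):
    """Return one optimal global alignment (S', T') of s and t."""
    if not s:
        return GAP * len(t), t
    if not t:
        return s, GAP * len(s)
    if len(s) == 1:
        # Base case: pair the single character with its best partner in t,
        # unless leaving it unpaired (two extra spaces) scores higher.
        j = max(range(len(t)), key=lambda k: sub_score(s, t[k]))
        if sub_score(s, t[j]) >= 2 * gap:
            return GAP * j + s + GAP * (len(t) - j - 1), t
        return s + GAP * len(t), GAP + t
    mid = len(s) // 2
    F = last_row(s[:mid], t, gap)               # forward scores of row n/2
    B = last_row(s[mid:][::-1], t[::-1], gap)   # backward scores of row n/2
    m = len(t)
    k = max(range(m + 1), key=lambda c: F[c] + B[m - c])  # best split column k*
    a1, b1 = hirschberg(s[:mid], t[:k], gap)
    a2, b2 = hirschberg(s[mid:], t[k:], gap)
    return a1 + a2, b1 + b2


def alignment_score(a: str, b: str, gap: int = -1) -> int:
    return sum(gap if GAP in (x, y) else sub_score(x, y) for x, y in zip(a, b))


s_aln, t_aln = hirschberg("ACCTGA", "AGCTA")
print(s_aln, t_aln, alignment_score(s_aln, t_aln))
```

The score of the returned alignment should agree with the last entry of `last_row(s, t)`, which gives a cheap consistency check.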

Local alignment problem

Input: Given two sequences $S$ and $T$

Question: Find the subsequences $\alpha$ of $S$ and $\beta$ of $T$ whose similarity (optimal global alignment) is maximal (over all such pairs of subsequences). Here "subsequence" is used in the biological sense, i.e. a contiguous substring.

Example: $S = \text{GGTCTGAG}$ and $T = \text{AAACGA}$

Score: match = 2; indel/substitution = -1

The optimal local alignment is $\alpha = \text{CTGA}$ and $\beta = \text{CGA}$:

CTGA ($\alpha \in S$)
C-GA ($\beta \in T$)

Local Suffix Alignment Problem

Input: Given two sequences $S$ and $T$ and two indices $i$ and $j$

Question: Find a (possibly empty) suffix $\alpha$ of $S_{1 \cdots i}$ and a (possibly empty) suffix $\beta$ of $T_{1 \cdots j}$ such that the value of the alignment between $\alpha$ and $\beta$ is maximal over all alignments of suffixes of $S_{1 \cdots i}$ and $T_{1 \cdots j}$.

Terminology and Restriction

$V(i, j)$: denotes the value of the optimal local suffix alignment for a given pair $i, j$ of indices.

Limit the pair-wise scores by:

$$\sigma(x, y) = \begin{cases} \geq 0 & x, y \text{ match} \\ \leq 0 & x, y \text{ do not match, or one of them is a space} \end{cases} \qquad (6)$$

Local Suffix Alignment Problem

Recursive Definitions

Base conditions: $V(i, 0) = 0$, $V(0, j) = 0$ for all $i$ and $j$.

Recurrence relation:

$$V(i, j) = \max \begin{cases} 0 \\ V(i-1, j-1) + \sigma(S_i, T_j) \\ V(i-1, j) + \sigma(S_i, \text{'-'}) \\ V(i, j-1) + \sigma(\text{'-'}, T_j) \end{cases} \qquad (7)$$

The value 0 corresponds to choosing the empty suffixes.

Compute $i^*$ and $j^*$:

$$V(i^*, j^*) = \max_{i \in [1, n],\, j \in [1, m]} V(i, j)$$
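The recurrence and the search for $(i^*, j^*)$ can be sketched as below, using the scores from the local alignment example (match = +2, indel/substitution = -1). The function name `local_align` is hypothetical; the traceback stops at the first cell whose value is 0, which marks the start of the local alignment.

```python
def local_align(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (score, alpha', beta') for one optimal local alignment of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # base conditions: first row/column are 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            if V[i][j] > best:
                best, bi, bj = V[i][j], i, j   # remember (i*, j*)
    # Trace back from (i*, j*) until a cell with value 0 is reached.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(local_align("GGTCTGAG", "AAACGA"))  # (5, 'CTGA', 'C-GA')
```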

Local Suffix Alignment Problem

Score: match=+2, mismatch=-1.


Gap Penalty

Definition: A gap is any maximal, consecutive run of spaces in a single sequence of a given alignment.

Definition: The length of a gap is the number of indel operations in it.

Example:

S: attc--ga-tggacc
T: a--cgtgatt---cc

7 matches, #gaps = 4, #spaces = 8, 0 mismatches.
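The counts in the example can be reproduced with a small sketch. The name `gap_stats` is hypothetical; a gap is found as a maximal run of '-' within one aligned sequence, exactly as in the definition above.

```python
import re


def gap_stats(s_aln: str, t_aln: str):
    """Return (matches, mismatches, gaps, spaces) for an alignment (S', T')."""
    # A gap is a maximal run of spaces inside a single aligned sequence.
    runs = re.findall(r"-+", s_aln) + re.findall(r"-+", t_aln)
    n_gaps = len(runs)
    n_spaces = sum(len(r) for r in runs)
    columns = list(zip(s_aln, t_aln))
    matches = sum(1 for a, b in columns if a == b and a != "-")
    mismatches = sum(1 for a, b in columns if a != b and "-" not in (a, b))
    return matches, mismatches, n_gaps, n_spaces


print(gap_stats("attc--ga-tggacc", "a--cgtgatt---cc"))  # (7, 0, 4, 8)
```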

Affine Gap Penalty Model

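A total penalty for a gap of length $q$ is:

$$W_{\text{total}} = W_g + q W_s$$

where

$W_g$: the weight for "opening the gap"

$W_s$: the weight for "extending the gap" with one more space

Under this model, the score for a particular alignment $S \to S'$ and $T \to T'$ is:

$$\sum_{i \in \{k \,:\, S'_k \neq \text{'-'} \;\&\; T'_k \neq \text{'-'}\}} \sigma(S'_i, T'_i) + W_g \cdot \#\text{gaps} + W_s \cdot \#\text{spaces}$$

The weights are added to the score, so penalties correspond to negative values of $W_g$ and $W_s$.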
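As a worked illustration, the formula can be applied to the alignment from the "Gap Penalty" slide (7 matches, 0 mismatches, 4 gaps, 8 spaces). The numerical weights below are assumptions chosen only for the arithmetic: $\sigma = +2$ for a match, $W_g = -2$, $W_s = -1$.

```latex
% Assumed weights (not from the slides): match = +2, W_g = -2, W_s = -1
\begin{aligned}
\text{score} &= 7 \cdot 2 + W_g \cdot \#\text{gaps} + W_s \cdot \#\text{spaces} \\
             &= 14 + (-2)(4) + (-1)(8) \\
             &= -2
\end{aligned}
```

With the same weights, merging the spaces into fewer, longer gaps would raise the score, which is the behaviour the affine model is meant to encourage.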

Global alignment with affine gap penalty

To align sequences $S$ and $T$, consider the prefixes $S_{1 \cdots i}$ and $T_{1 \cdots j}$.

Any alignment of these two prefixes is one of the following three types:

Type 1 ($A(i, j)$): Characters $S_i$ and $T_j$ are aligned opposite each other.

S: ************i
T: ************j

Type 2 ($L(i, j)$): Character $S_i$ is aligned to a character to the left of $T_j$.

S: ************i------
T: ******************j

Type 3 ($R(i, j)$): Character $S_i$ is aligned to a character to the right of $T_j$.

S: ******************i
T: *************j-----

Global alignment with affine gap penalty

$A(i, j)$ – the maximum value of any alignment of Type 1

$L(i, j)$ – the maximum value of any alignment of Type 2

$R(i, j)$ – the maximum value of any alignment of Type 3

$V(i, j)$ – the maximum value of any alignment

Recursive Definition

Base conditions:

$$V(0, 0) = 0 \qquad (8)$$

$$V(i, 0) = R(i, 0) = W_g + i W_s \qquad (9)$$

$$V(0, j) = L(0, j) = W_g + j W_s \qquad (10)$$

Recurrence relation:

$$V(i, j) = \max\{A(i, j), L(i, j), R(i, j)\} \qquad (11)$$

$$A(i, j) = V(i-1, j-1) + \sigma(S_i, T_j) \qquad (12)$$

$$L(i, j) = \max\{L(i, j-1) + W_s,\; V(i, j-1) + W_g + W_s\} \qquad (13)$$

$$R(i, j) = \max\{R(i-1, j) + W_s,\; V(i-1, j) + W_g + W_s\} \qquad (14)$$
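Equations (8) to (14) translate almost line by line into code. The sketch below assumes match = +2, mismatch = -1, $W_g = -2$ and $W_s = -1$ (negative because the weights are added to the score), and it treats the entries of $L$ and $R$ that the base conditions leave undefined as $-\infty$. The name `affine_global_score` is hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(s: str, t: str, match: int = 2, mismatch: int = -1,
                        w_g: int = -2, w_s: int = -1):
    """Optimal global alignment score under the affine gap model."""
    n, m = len(s), len(t)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    L = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    R = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0                                   # (8)
    for i in range(1, n + 1):
        V[i][0] = R[i][0] = w_g + i * w_s         # (9)
    for j in range(1, m + 1):
        V[0][j] = L[0][j] = w_g + j * w_s         # (10)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            a = V[i - 1][j - 1] + sub                                        # (12)
            L[i][j] = max(L[i][j - 1] + w_s, V[i][j - 1] + w_g + w_s)        # (13)
            R[i][j] = max(R[i - 1][j] + w_s, V[i - 1][j] + w_g + w_s)        # (14)
            V[i][j] = max(a, L[i][j], R[i][j])                               # (11)
    return V[n][m]


# Three matches (+6) and a single gap of length 2 (-2 + 2*(-1) = -4).
print(affine_global_score("ACGGT", "ACT"))  # 2
```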

Local alignment problem

Input: Given two sequences $S$ and $T$

Question: Find the subsequences $\alpha$ of $S$ and $\beta$ of $T$ whose similarity (optimal global alignment) is maximal (over all such pairs of subsequences).

Example: $S = \text{GGTCTGAG}$ and $T = \text{AAACGA}$

Score: match = 2; indel/substitution = -1

The optimal local alignment is $\alpha = \text{CTGA}$ and $\beta = \text{CGA}$:

CTGA ($\alpha \in S$)
C-GA ($\beta \in T$)

Suppose the maximal local alignment score between the two sequences is $S$. How do we measure the significance of $S$?

Measure statistical significance

One possible solution:

1. Generate many random sequences $T_1, T_2, \cdots, T_N$ (e.g. $N > 10{,}000$)

2. Find the optimal alignment score $S_i$ between $S$ and $T_i$ for all $i$

3. p-value $= \sum_{i=1}^{N} I(S_i \geq S) / N$

However, the solution is not practical.
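The three steps can be sketched as a small Monte Carlo experiment. The slide does not say how the random sequences are generated; shuffling the letters of $T$ is an assumption made here, as are the scores (match = +2, mismatch = -1, space = -1), the default of 200 samples and the function names. Each sample costs a full $O(nm)$ alignment, which is why the approach does not scale.

```python
import random


def local_score(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score of s and t, keeping two rows of the table."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def empirical_p_value(s: str, t: str, n_random: int = 200, seed: int = 0) -> float:
    """Fraction of shuffled versions of t that score at least as well as t itself."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    observed = local_score(s, t)
    letters = list(t)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(letters)           # assumed null model: random permutation of t
        if local_score(s, "".join(letters)) >= observed:
            hits += 1
    return hits / n_random


print(empirical_p_value("GGTCTGAG", "AAACGA"))
```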

Extreme value distribution (EVD)

Suppose that $X_1, X_2, \cdots, X_n$ are iid random variables. Denote the maximum of these r.v. by $X_{\max} = \max\{X_1, X_2, \cdots, X_n\}$.

Suppose that $X_1, \cdots, X_n$ are continuous r.v. with density function $f_X(x)$ and cumulative distribution function $F_X(x)$.

Question: what is the distribution of $X_{\max}$?

Extreme value distribution (EVD)

Note that $\text{Prob}(X_{\max} \leq x) = [\text{Prob}(X \leq x)]^n$. Hence

$$F_{X_{\max}}(x) = (F_X(x))^n$$

Density function of $X_{\max}$:

$$f_{X_{\max}}(x) = n f_X(x) (F_X(x))^{n-1}$$

Example: the exponential distribution

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0 \qquad (15)$$

$$F_X(x) = 1 - e^{-\lambda x}, \quad x \geq 0 \qquad (16)$$

Mean: $1/\lambda$; Variance: $1/\lambda^2$

EVD of the exponential distribution

The EVD:

$$f_{X_{\max}}(x) = n \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{n-1} \qquad (17)$$

$$F_{X_{\max}}(x) = (1 - e^{-\lambda x})^n \qquad (18)$$
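Equations (17) and (18) can be checked against each other numerically, since the density should be the derivative of the cumulative distribution. The sketch below does this with a central finite difference; the function names and the parameter values ($n = 10$, $\lambda = 1.5$, $x = 2$) are arbitrary choices for illustration.

```python
import math


def cdf_max(x: float, n: int, lam: float) -> float:
    """Equation (18): CDF of the maximum of n iid exponential(lam) variables."""
    return (1.0 - math.exp(-lam * x)) ** n


def pdf_max(x: float, n: int, lam: float) -> float:
    """Equation (17): density of the maximum."""
    return n * lam * math.exp(-lam * x) * (1.0 - math.exp(-lam * x)) ** (n - 1)


n, lam, x, h = 10, 1.5, 2.0, 1e-6
# Central difference approximation of d/dx F_max(x).
numeric = (cdf_max(x + h, n, lam) - cdf_max(x - h, n, lam)) / (2 * h)
print(numeric, pdf_max(x, n, lam))  # the two values should agree closely
```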

EVD of the exponential distribution

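Mean and variance of $X_{\max}$:

$$E[X_{\max}] = \frac{1}{\lambda}\left(1 + \frac{1}{2} + \cdots + \frac{1}{n}\right) \xrightarrow{n \to \infty} \frac{1}{\lambda}(\gamma + \log n) \qquad (19)$$

$$\text{Var}[X_{\max}] = \frac{1}{\lambda^2}\left(1 + \frac{1}{2^2} + \cdots + \frac{1}{n^2}\right) \xrightarrow{n \to \infty} \frac{\pi^2}{6\lambda^2} \qquad (20)$$

where $\gamma = 0.5772\ldots$ is Euler's constant.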
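The two limits can be checked by summing the series directly. This is a small sketch with hypothetical function names; $\lambda = 2$ and $n = 10{,}000$ are arbitrary illustrative values.

```python
import math

EULER_GAMMA = 0.5772156649015329


def mean_max(n: int, lam: float) -> float:
    """Exact mean from equation (19): harmonic sum divided by lambda."""
    return sum(1.0 / k for k in range(1, n + 1)) / lam


def var_max(n: int, lam: float) -> float:
    """Exact variance from equation (20): sum of 1/k^2 divided by lambda^2."""
    return sum(1.0 / k ** 2 for k in range(1, n + 1)) / lam ** 2


n, lam = 10_000, 2.0
print(mean_max(n, lam), (EULER_GAMMA + math.log(n)) / lam)
print(var_max(n, lam), math.pi ** 2 / (6 * lam ** 2))
```

The mean grows like $\log n$ while the variance stays bounded, which is what motivates the rescaling on the next slide.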

Asymptotic distribution

Asymptotic formula for the distribution of $X_{\max}$

Define a rescaled $X_{\max}$:

$$U = \frac{X_{\max} - \log(n)/\lambda}{1/\lambda} = \lambda X_{\max} - \log n$$

As $n \to \infty$, the mean of $U$ approaches $\gamma$ and the variance of $U$ approaches $\pi^2/6$.

Gumbel distribution

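The cumulative distribution:

$$\text{Prob}(U \leq u) = \text{Prob}(X_{\max} \leq (u + \log n)/\lambda) \qquad (21)$$

$$= (1 - e^{-u}/n)^n \qquad (22)$$

$$\to e^{-e^{-u}} \quad \text{as } n \to \infty \qquad (23)$$

Or equivalently

$$\text{Prob}(U \geq u) = 1 - e^{-e^{-u}} \quad \text{as } n \to \infty$$

which is called the Gumbel distribution.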
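The convergence from (22) to (23) is easy to see numerically. In this sketch the function names are hypothetical and $u = 0.5$ is an arbitrary test point; the finite-$n$ expression is only meaningful for $u \geq -\log n$, where the base of the power is non-negative.

```python
import math


def cdf_u_finite(u: float, n: int) -> float:
    """Equation (22): exact CDF of the rescaled maximum U for finite n."""
    return (1.0 - math.exp(-u) / n) ** n


def cdf_gumbel(u: float) -> float:
    """Equation (23): limiting Gumbel CDF."""
    return math.exp(-math.exp(-u))


u = 0.5
for n in (10, 1_000, 1_000_000):
    print(n, cdf_u_finite(u, n), cdf_gumbel(u))
```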

EVD of the exponential distribution

EVD for large $u$

The density function

$$f_U(u) = e^{-u} e^{-e^{-u}} \approx e^{-u}\left(1 - e^{-u} + \tfrac{1}{2} e^{-2u} - \ldots\right) \approx e^{-u}$$

which decays much slower than the Gaussian distribution.

Karlin & Altschul statistics

For local ungapped alignments between two sequences of length $m$ and $n$, the probability that there is a match of a score greater than $S$ is:

$$P(x \geq S) = 1 - e^{-Kmn e^{-\lambda S}}$$

Denote $E(S) = Kmn e^{-\lambda S}$ - the expected number of unrelated matches with score greater than $S$.

Significance requirement: $E(S)$ should be significantly less than 1, that is

$$S > \frac{\log(mn) + \log K}{\lambda}$$
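The formulas on this slide can be evaluated directly. In the sketch below the function names are hypothetical, and the parameter values ($K = 0.1$, $\lambda = 0.3$, sequence lengths 300 and 1,000,000) are illustrative assumptions, not values taken from the slides; in practice $K$ and $\lambda$ depend on the scoring scheme and the letter frequencies.

```python
import math


def expected_hits(score: float, m: int, n: int, K: float, lam: float) -> float:
    """E(S) = K m n exp(-lambda S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score: float, m: int, n: int, K: float, lam: float) -> float:
    """P(x >= S) = 1 - exp(-E(S)); expm1 keeps precision when E(S) is tiny."""
    return -math.expm1(-expected_hits(score, m, n, K, lam))


def min_significant_score(m: int, n: int, K: float, lam: float) -> float:
    """Score at which E(S) = 1; significant scores must lie well above it."""
    return (math.log(m * n) + math.log(K)) / lam


m, n, K, lam = 300, 1_000_000, 0.1, 0.3   # assumed illustrative values
threshold = min_significant_score(m, n, K, lam)
print(threshold)
print(expected_hits(threshold + 20, m, n, K, lam), p_value(threshold + 20, m, n, K, lam))
```

For small $E(S)$ the p-value and $E(S)$ are nearly equal, which is why the expected number of hits is often reported on its own.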
