DSaA W13 String Matching

String matching
Data Structures and Algorithms
1
DSaA 2012/2013
String matching
String matching algorithms:
" Naive
" Rabin-Karp
" Finite automaton
" Knuth-Morris-Pratt (KMP)
� epsilon, � delta, Ć phi, � sigma, Ą - pi
2
DSaA 2012/2013
String matching problem
" We assume that the text is an array T [1..n] of length n and that the pattern
is an array P[1..m] of length m d" n. We further assume that the elements of
P and T are characters drawn from a finite alphabet S. For example, we
may have S = {0, 1} or S = {a, b,. . . , z}. The character arrays P and T are
often called strings of characters.
" We say that pattern P occurs with shift s in text T (or, equivalently, that
pattern P occurs beginning at position s + 1 in text T) if 0 d" s d" n - m and
T [s + 1..s + m] = P[1..m] (that is, if T [s + j] = P[ j], for 1 d" j d" m). If P occurs
with shift s in T , then we call s a valid shift; otherwise, we call s an invalid
shift. The string-matching problem is the problem of finding all valid shifts
with which a given pattern P occurs in a given text T
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
a c b a b a b a a a b a b b a
Text T
a b a
Pattern P 3 occurrences
a b a
of pattern with shifts
1 2 3
a b a
3, 5, 9
a b a
3
DSaA 2012/2013
Complexity - comparison
Algorithm Preprocessing time Worse case
Matching time
Naive
0 O((n-m+1)m)
Rabin-Karp
W�(m)
O((n-m+1)m)
Finite automaton
O(m|S|) W�(n)
Knutt-Morris-Pratt
W�(m) W�(n)
4
DSaA 2012/2013
Notation and terminology
" We shall let S* (read "sigma-star") denote the set of all
finite-length strings formed using characters from the
alphabet S. In this lecture, we consider only finite-length
strings. The zero-length empty string, denoted e, also
belongs to S*. The length of a string x is denoted |x|. The
concatenation of two strings x and y, denoted xy, has
length |x| + |y| and consists of the characters from x
followed by the characters from y.
" We say that a string w is a prefix of a string x, denoted w
Ź" x, if x = wy for some string y��S*. Note that if w Ź" x,
then |w| d" |x|. Similarly, we say that a string w is a suffix
of a string x, denoted w �" x, if x = yw for some y��S*. It
follows from w �" x that |w| d" |x|. The empty string e is
both a suffix and a prefix of every string. For example,
we have ab Ź" abcca and cca �" abcca. It is useful to note
that for any strings x and y and any character a, we have
x�"y if and only if xa�"ya. Also note that Ź" and �" are
transitive relations.
5
DSaA 2012/2013
Notation and terminology
" Overlapping-suffix lemma:
Suppose that x, y, and z are strings such
that x�"z and y�"z. If |x| d" |y|, then x�"y. If
|x|e"|y|, then y�"x. If |x| = |y|, then x = y.
For brevity of notation, we shall denote the k-character prefix P[1..k]
of the pattern P[1..m] by Pk. Thus, P0 = e and Pm = P = P[1..m]. Similarly,
we denote the k-character prefix of the text T as Tk. Using this notation,
we can state the string-matching problem as that of finding all shifts s in
the range 0 d" s d" n-m such that P �"Ts+m.
The test "x = y" is assumed to take time W�(t + 1), where t is the length of the
longest string z such that zŹ"x and zŹ"y.
6
DSaA 2012/2013
Naive algorithm - code
Naive_String_Matcher(T,P)
{ 1} n := length[T]
{ 2} m := length[P]
{ 3} for s:=0 to n-m do
{ 4} if P[1..m]=T[s+1..s+m] then
{ 5} print patter occurs with shift s
T=an, P=am (n-m+1)m character comparisons
T=an, P=am-1b
Worse-case complexity: O((n-m+1)m)
T= random , P= random average d"2(n-m+1)
character comparisons
Average complexity: O(n-m+1)
7
DSaA 2012/2013
The Rabin-Karp algorithm - idea
" For expository purposes, let us assume that S = {0, 1, 2, . . . , 9}, so that each
character is a decimal digit. (In the general case, we can assume that each character
is a digit in radix-d notation, where d = |S|.) We can then view a string of k
consecutive characters as representing a length-k decimal number. The character
string 31415 thus corresponds to the decimal number 31,415.
" Given a pattern P[1..m], we let p denote its corresponding decimal value. In a similar
manner, given a text T[1..n], we let ts denote the decimal value of the length-m
substring T[s + 1..s + m], for s = 0, 1, . . . , n-m. Certainly, ts = p if and only if
T[s+1..s+m] = P[1..m]; thus, s is a valid shift if and only if ts = p. If we could compute
p in time W�(m) and all the ts values in a total of W�(n - m + 1) time, then we could
determine all valid shifts s in time W�(m) + W�(n-m+1) = W�(n) by comparing p with each
of the ts's. We can compute p in time W�(m) using Horner's rule:
p = P[m] + 10 (P[m - 1] + 10(P[m - 2] + � � � + 10(P[2] + 10P[1]) )).
" The value t0 can be similarly computed from T[1..m] in time W�(m). To compute the
remaining values t1, t2, . . . , tn-m in time W�(n - m), it suffices to observe that ts+1 can be
computed from ts in constant time, since
ts+�1 =�10(�ts -�10m-�1T[s +�1])�+�T[s +� m +�1]
T=..314152.., m=5, ts=31415
ts+1=10(31415-10000*3)+2=14152
Problem! : p and ts may be too large to work with conveniently.
8
DSaA 2012/2013
The Rabin-Karp algorithm
" We can compute p and the ts's modulo a
suitable modulus q.
2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1 3 1 4 1 5
q=13
& &
7
8 9 3 11 0 1 7 8 4 5 1011 7 9 11
valid spurious
match hit
2 3 5 9 0 2
35902 a" (23590 3*10000)*10+2 (mod 13)
a" (8 2*3)*10 + 2 (mod 13)
a" 9
8 9
m-�1
h �� d (�mod q)�
where
ts+�1 =� (�d(�ts -�T[�s +�1]�h)�+�T[�s +� m +�1]�)�mod q
9
DSaA 2012/2013
The Rabin-Karp algorithm - code
Rabin_Karp_Matcher(T,P)
{ 1} n := length[T]
{ 2} m := length[P]
{ 3} h := dm-1 mod q
{ 4} p := 0
{ 5} t0 := 0
{ 6} for i:=1 to m do
{ 7} p := (d*p+P[i]) mod q
{ 8} to := (d*t0+T[i]) mod q
{ 9} for s:=0 to n-m do
{10} if p=ts then
{11} if P[1..m]=T[s+1..s+m] then
{12} print patter occurs with shift s
{13} if s < n-m then
{14} ts+1:=(d*(ts-T[s+1]*h)+T[s+m+1]) mod q
The modulus q is typically chosen as a prime such that dq just fits within
one computer word (double word)
10
DSaA 2012/2013
The Rabin-Karp algorithm - analysis
" RABIN-KARP-MATCHER takes W�(m) preprocessing time, and its
matching time is W�((n - m + 1)m) in the worst case, since (like the naive
string-matching algorithm) the Rabin-Karp algorithm explicitly verifies
every valid shift. If P = am and T = an, then the verifications take time
W�((n - m + 1)m), since each of the n - m + 1 possible shifts is valid.
" In many applications, we expect few valid shifts (perhaps some
constant c of them); in these applications, the expected matching time
of the algorithm is only O((n - m + 1) + cm) = O(n+m), plus the time
required to process spurious hits. After a heuristic analysis on an
assumption we can then expect that the number of spurious hits is
O(n/q), since the chance that an arbitrary ts will be equivalent to p,
modulo q, can be estimated as 1/q. Since there are O(n) positions at
which the test of line 10 fails and we spend O(m) time for each hit, the
expected matching time taken by the Rabin-Karp algorithm is
O(n) + O(m(v + n/q)),
where v is the number of valid shifts. This running time is O(n) if v=O(1)
and we choose q = m. That is, if the expected number of valid shifts is
small (O(1)) and the prime q is chosen to be larger than the length of
the pattern, then we can expect the Rabin-Karp procedure to use only
O(n + m) matching time. Since m Ł� n, this expected matching time is
O(n).
11
DSaA 2012/2013
Finite automaton
" A finite automaton M is a 5-tuple (Q, q0, A, S, d), where:
Q is a finite set of states,
q0��Q is the start state,
A��Q is a distinguished set of accepting states,
S is a finite input alphabet,
d is a function from Q � S into Q, called the transition function of M.
The finite automaton begins in state q0 and reads the characters of its
input string one at a time. If the automaton is in state q and reads
input character a, it moves ("makes a transition") from state q to
state d(q, a). Whenever its current state q is a member of A, the
machine M is said to have accepted the string read so far. An input
that is not accepted is said to be rejected.
Q = {0, 1}, qo=0, A={1} S = { a, b},
a
d
a b b
0 1
a
0 1 0
1 0 0
b
12
DSaA 2012/2013
Automaton cont.
" A finite automaton M induces a function f , called the final-state
function, from S* to Q such that f(w) is the state M ends up in after
scanning the string w. Thus, M accepts a string w if and only if
f(w)��A. The function f is defined by the recursive relation:
f(�e)�=� q0
f(�wa)�=� d(�f(�w)�,a)�, for w��S*, a ��S
We define an auxiliary function s , called the suffix function corresponding to P.
The function s is a mapping from S* to {0, 1, . . . , m} such that s(x) is the length
of the longest prefix of P that is a suffix of x:
s(x) = max { k: Pk �" x}
We define the string-matching automaton that corresponds to a given pattern
P[1..m] as follows:
- The state set Q is {0, 1, . . . , m}. The start state q0 is state 0, and state m is the
only accepting state.
- The transition function d is defined by the following equation, for any state q
and character a:
(� )�
d(�q,a)�=� s Pqa
13
DSaA 2012/2013
Automaton - example
c
b,c
a
c
a
a
a b a b a c a
0 1 2 3 4 5 6 7
b b
a
b,c
b,c
b,c
c
d
a b c P
0 1 0 0 a
1 1 2 0 b i - 1 2 3 4 5 6 7 8 9 10 11
2 3 0 0 a T[i] - a b a b a b a c a b a
3 1 4 0 b state f(T[i]) 0 1 2 3 4 5 4 5 6 7 2 3
4 5 0 0 a
5 1 4 6 c
6 7 0 0 a
7 1 2 0
14
DSaA 2012/2013
Finite automaton matcher - code
Finite_Automaton_Matcher(T,P)
{ 1} n := length[T]
{ 2} q := 0
{ 3} for i:=1 to n do
{ 4} q := d(q,T[i])
{ 5} if q = m then
{ 6} s := i - m
{ 7} print patter occurs with shift s
Compute_Transition_Function(P,S)
{ 1} m := length[P]
{ 2} for q:=0 to m do
{ 3} for a��S do
{ 4} k := min(m+1,q+2)
{ 5} repeat
{ 6} k := k 1
{ 7} until Pk �" Pqa
{ 8} d(q,a) := k
{ 9} return d
15
DSaA 2012/2013
Finite automaton matcher - analysis
" The running time of COMPUTE-TRANSITION-
FUNCTION is O(m3|S|), because the outer loops
contribute a factor of m|S|, the inner repeat loop
can run at most m + 1 times, and the test Pk �"
Pqa on line 6 can require comparing up to m
characters.
" Much faster procedures exist; the time required to
compute d from P can be improved to O(m|S|) by
utilizing some cleverly computed information
about the pattern P. With this improved
procedure for computing d, we can find all
occurrences of a length-m pattern in a length-n
text over an alphabet S with O(m |S|)
preprocessing time and W�(n) matching time.
16
DSaA 2012/2013
Knuth-Morris-Pratt s idea
" Given a pattern P[1..m], the
b a c b a b a b a a b c b a b T
prefix function for the pattern
s
a b a b a c a P
P is the function p : {1, 2, . . . ,
m} ��{0, 1, . . . , m - 1} such
q
that
p[q]=max{k: kb a c b a b a b a a b c b a b T
?
s " That is, p[q] is the length of the
a b a b a c a P
longest prefix of P that is a
k
proper suffix of Pq.
a b a b a Pq
i 1 2 3 4 5 6 7 8 9 10
a b a Pk
P[i] a b a b a b a b c a
p[i] 0 0 1 2 3 4 5 6 0 1
17
DSaA 2012/2013
KMP - code
KMP_Matcher(T,P)
{ 1} n := length[T]
{ 2} m := length[P]
{ 3} p := Compute_Prefix_Function(P)
{ 4} q :=0
{ 5} for i:=1 to n do
{ 6} while q>0 and P[q+1]`"T[i] do
{ 7} q := p[q]
{ 8} if P[q+1] = T[i] then
{ 9} q := q+1
{10} if q=m then
{11} print patter occurs with shift i-m
{12} q := p[q]
Compute_Prefix_Function(P)
{ 1} m := length[P]
{ 2} p[1] := 0
{ 3} k := 0
{ 4} for q:=2 to m do
i 1 2 3 4 5 6 7 8 9 10
{ 5} while k>0 and P[k+1]`"P[q] do
P[i] a b a b a b a b c a
{ 6} k := p[k]
{ 7} if P[k+1]=P[q] then
p[i] 0 0 1 2 3 4 5 6 0 1
{ 8} k := k+1
{ 9} p[q] := k
{10} return p
18
DSaA 2012/2013
KMP - analysis
" The running time of COMPUTE-PREFIX-FUNCTION is W�(m), using
the potential method of amortized analysis. We associate a potential
of k with the current state k of the algorithm. This potential has an
initial value of 0, by line 3. Line 6 decreases k whenever it is
executed, since p[k] < k. Since p[k] = 0 for all k, however, k can
never become negative. The only other line that affects k is line 8,
which increases k by at most one during each execution of the for
loop body. Since k < q upon entering the for loop, and since q is
incremented in each iteration of the for loop body, k < q always
holds. (This justifies the claim that p[q] < q as well, by line 9.) We
can pay for each execution of the while loop body on line 6 with the
corresponding decrease in the potential function, since p[k] < k.
Line 8 increases the potential function by at most one, so that the
amortized cost of the loop body on lines 5-9 is O(1). Since the
number of outer-loop iterations is W�(m), and since the final potential
function is at least as great as the initial potential function, the total
actual worst-case running time of COMPUTE-PREFIX-FUNCTION
is W�(m).
" A similar amortized analysis, using the value of q as the potential
function, shows that the matching time of KMP-MATCHER is W�(n).
19
DSaA 2012/2013

Wyszukiwarka

Podobne podstrony:
W13
String concat
String indexOf
STRING (3)
stringappendoperator
w13 2
Strings LWB
question matching
W13 MPiS
String equals
StringSeqHolder
strings01
String substring
jobs matching2
strings SP
function mysql real escape string

więcej podobnych podstron