"Great algorithms are the poetry of computation." -- Francis Sullivan
Poetry always gives us an enjoyable experience, if not a meaningful enlightenment about the true nature of life. But ten years ago, when I first read about KMP in CLRS's classic Introduction to Algorithms, all I went through was confusion. Though the code was succinct like a poem, I kept wondering how someone could come up with such an exotic, if not esoteric, yet still efficient algorithm. It felt as if someone had simply been lucky enough to be struck by lightning, until I read Sedgewick's fantastic Algorithms. He explains this great algorithm from a DFA angle, which reveals its delicacy in an intuitive and natural way.
But Sedgewick only explains the K part of KMP in his lecture and book, i.e. Knuth's original algorithm, which uses a DFA branching on an alphabet. The flaw is obvious: if we have a large alphabet like Unicode, the memory needed to store this DFA will be huge, even if the pattern is composed of only a few distinct letters. This is where Knuth's student Pratt came in with a way to reduce the space. As Sedgewick mentions in his lecture, the improved algorithm is more complicated, which is why he doesn't discuss it in detail there. But to my surprise, the improved algorithm is still quite understandable if we follow the way he shows us how to build a DFA. The only difference is that what we need now is an NFA.
This is why I am writing this blog post. I will briefly recap the DFA explanation of KMP first. Then I will explain the improved algorithm from an NFA point of view. Finally, we will modify this NFA a little bit to make it find all occurrences in the text. The reader should at least know what problem the KMP algorithm solves. Section 5.3 of Sedgewick's Algorithms is recommended, especially the parts explaining brute-force search and KMP search.
The recap here is not intended to be thorough. Please check Sedgewick's lecture on KMP.
KMP is a classic linear-time substring search algorithm. The critical idea is to avoid backing up, so that we can move straight down the text. The key part is that when a mismatch happens, we know exactly which pattern index to compare against the current text index next. Since we never want to back up, we never decrease the text index. We can think of the pattern index as a state: the match having proceeded to pat[j] means we have already matched j characters. Let's call it the match state. Assume we have a machine such that, once we tell it the current state and the next character, it tells us which pattern index to match against next. Then the matching routine becomes much simpler:
```python
def match(txt, n, pat, m, dfa):
    j = 0  # match state
    for i in range(n):
        j = dfa[j][ord(txt[i])]
        if j == m:
            return i - m + 1  # index where the match starts
    return -1
```
By nature, this machine is a DFA: once the state and the input are determined, so is the next state.
Fortunately, this machine can be constructed recursively. Let's work through the example pattern ABABAC. The match case is easy: we just increment the state. For the mismatch case, let's say we are now at state 3, about to match pat[3]. What we know is that the past 3 characters are "ABA", both in the text and in the pattern. If backup were allowed, we would go back and compare txt[i-2] against pat[0]. But since we have already recursively constructed the DFA for the earlier states, and we also know the inputs (the last two characters, "BA"), we can compute the resulting state virtually by feeding "BA" to the DFA starting from state 0. It turns out the final state is 1. Therefore, the destinations of the mismatch transitions at state 3 are the same as those from state 1.
Since this virtual state also evolves in a recursive manner, we can compute it along with the DFA construction in one pass.
Here's the Python implementation of the DFA construction:

```python
R = 256  # alphabet size (extended ASCII)

def build(pat, m):
    dfa = [[0 for _ in range(R)] for _ in range(m)]
    dfa[0][ord(pat[0])] = 1
    j = 0  # virtual state running through the dfa
    for i in range(1, m):
        for k in range(R):
            dfa[i][k] = dfa[j][k]       # copy mismatch transitions
        dfa[i][ord(pat[i])] = i + 1     # set the match transition
        j = dfa[j][ord(pat[i])]         # advance the virtual state
    return dfa
```
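To sanity-check the construction, here is a quick standalone run on the example pattern. The functions above are repeated so the snippet executes by itself, and the text string is my own made-up example; the assertion checks the claim that state 3's mismatch transitions equal state 1's:

```python
R = 256  # alphabet size (extended ASCII)

def build(pat, m):
    dfa = [[0 for _ in range(R)] for _ in range(m)]
    dfa[0][ord(pat[0])] = 1
    j = 0  # virtual state
    for i in range(1, m):
        for k in range(R):
            dfa[i][k] = dfa[j][k]
        dfa[i][ord(pat[i])] = i + 1
        j = dfa[j][ord(pat[i])]
    return dfa

def match(txt, n, pat, m, dfa):
    j = 0
    for i in range(n):
        j = dfa[j][ord(txt[i])]
        if j == m:
            return i - m + 1
    return -1

pat = "ABABAC"
dfa = build(pat, len(pat))

# At state 3 the mismatch transitions are copied from state 1;
# only the match character 'B' differs.
assert all(dfa[3][k] == dfa[1][k] for k in range(R) if k != ord('B'))

txt = "ABABABABACX"
print(match(txt, len(txt), pat, len(pat), dfa))  # -> 4
```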
One idea to save space is not to record transitions for every input. A natural way is to differentiate only between match and mismatch. This means that in the FSM we have only two transitions at any state: match and mismatch. It's easy to see that the extra space needed is linear in M (the length of the pattern). But the problem is that since we only differentiate match and mismatch, we have no way to stay deterministic when a mismatch occurs. Indeed, there is only one deterministic way to match at any state, but there are many ways to mismatch. This means that if we take a mismatch transition, we have to check again whether there is a match. If there is, fine, we can take the deterministic match transition. If there is not, by definition, we have to take another mismatch transition and check again, until we reach state 0.
This may feel like someone got struck by lightning again. But believe me, if you work on the DFA for a while and hold on tightly to the two-transition-only idea, it is quite natural to arrive at this NFA method.
Another thing to notice is that we don't need to maintain the match transitions at all, because at any time we can check whether txt[i] == pat[j] to determine if there is a match, and if there is, we simply increment the state. The NFA we need only has to tell us which pattern index to try next, given the match state we are at; the input is always implicitly a mismatch. Suppose we already have this NFA. The matching routine becomes:
```python
def match(txt, n, pat, m, nfa):
    j = 0  # match state
    for i in range(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]  # try mismatch transitions until a match or state 0
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            return i - m + 1
    return -1
```
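As a quick check, here is the routine run standalone with a failure table for "ABABAC" that I derived by hand (the construction code comes later in the post); the text string is my own made-up example:

```python
def match(txt, n, pat, m, nfa):
    j = 0  # match state
    for i in range(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            return i - m + 1
    return -1

pat = "ABABAC"
nfa = [0, 0, 0, 1, 2, 3]  # hand-computed mismatch transitions for "ABABAC"
txt = "ABABABAC"
print(match(txt, len(txt), pat, len(pat), nfa))  # -> 2
```

Note how at text index 5 the machine mismatches at state 5, falls back to state 3 via nfa[5], and immediately re-matches without ever moving the text index backward.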
Like the DFA, the NFA can be constructed recursively. But since the machine is now non-deterministic, the destination of a mismatch transition only indicates the next state to try. Much as in the DFA construction, this virtual state can be computed along with the NFA construction. Taking "ABABAC" as the example pattern again, the FSM will be
Here's the code to construct it:

```python
def build(pat, m):
    nfa = [0 for _ in range(m)]
    j = 0  # virtual state running through the nfa
    for i in range(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    return nfa
```
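Running the construction on the example pattern gives the table used throughout this post (the function above is repeated so the snippet executes by itself):

```python
def build(pat, m):
    nfa = [0 for _ in range(m)]
    j = 0  # virtual state
    for i in range(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    return nfa

print(build("ABABAC", 6))  # -> [0, 0, 0, 1, 2, 3]
```

For example, nfa[5] = 3 means that on a mismatch at state 5 we next try to match against pat[3].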
The idea behind matching all occurrences is that after finding one match, we continue the matching process by taking a mismatch transition regardless of the input. What we need is to add this transition to the FSM above. As always, let's assume we already have this NFA. The matching routine is quite straightforward:
```python
def match_all(txt, n, pat, m, nfa):
    ret = []  # all occurrences
    j = 0  # match state
    for i in range(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            ret.append(i - m + 1)
            j = nfa[j]  # wild mismatch after a full match
    return ret
```
Take "ABABAC" as the example pattern again. The modified FSM will be
Please notice that, in this example, the mismatch transition at state 6 goes to state 0. But this does NOT mean that we reset the state after a complete match. The mismatch transition has to be computed recursively from state 5; it merely happens to be 0 in this example. When at state 5, the mismatch transition goes to state 3. Since the match characters at state 3 and state 5 are different, we take another mismatch transition from state 3, to state 1. Once again, the match characters at state 1 and state 5 are different, so we take one more mismatch transition, to state 0. That's the end of the story.
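This chain of states can be traced mechanically. Here is a small sketch, assuming the hand-computed failure table [0, 0, 0, 1, 2, 3] for states 0 through 5; it mirrors what the construction code below does in its last iteration:

```python
pat = "ABABAC"
nfa = [0, 0, 0, 1, 2, 3]  # hand-computed mismatch transitions for states 0..5

# Compute the mismatch transition for state 6 by chaining from state 5.
j = nfa[5]
chain = [j]
while j > 0 and pat[5] != pat[j]:
    j = nfa[j]
    chain.append(j)
if pat[5] == pat[j]:
    j += 1

print(chain)  # states tried: [3, 1, 0]
print(j)      # the mismatch transition for state 6 -> 0
```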
The Python implementation of the construction will be:

```python
def build_all(pat, m):
    nfa = [0 for _ in range(m + 1)]
    j = 0  # virtual state
    for i in range(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    nfa[m] = j  # mismatch transition taken after a complete match
    return nfa
```
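Putting build_all and match_all together on a pattern that overlaps with itself shows why the extra transition matters. Both functions are repeated so the snippet runs by itself; the pattern "ABA" and text "ABABABA" are my own made-up example:

```python
def build_all(pat, m):
    nfa = [0 for _ in range(m + 1)]
    j = 0  # virtual state
    for i in range(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    nfa[m] = j
    return nfa

def match_all(txt, n, pat, m, nfa):
    ret = []  # all occurrences
    j = 0  # match state
    for i in range(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            ret.append(i - m + 1)
            j = nfa[j]  # wild mismatch after a full match
    return ret

pat, txt = "ABA", "ABABABA"
nfa = build_all(pat, len(pat))
print(nfa)  # -> [0, 0, 0, 1]; nfa[3] = 1 preserves the overlapping "A"
print(match_all(txt, len(txt), pat, len(pat), nfa))  # -> [0, 2, 4]
```

Because nfa[3] = 1 rather than 0, the trailing "A" of one occurrence is reused as the leading "A" of the next, so the overlapping occurrences at indices 2 and 4 are both found.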
If you observe carefully, it is not hard to see that the mismatch transition from state 0 always goes to state 0. Thus we can save one memory cell. Admittedly, this makes the NFA index off by one, and it also affects the execution order of the inner loop: the overall effect is that nfa[i] = j has to be placed at the end of the loop instead of the beginning. The KMP algorithm introduced in CLRS's Introduction to Algorithms uses this implementation. But I find it too clever to be smart; it makes the algorithm more twisted, and the gain doesn't justify the loss. That's only my opinion, though.
"An algorithm must be seen to be believed." -- Donald Knuth
Sedgewick uses this quote from Knuth somewhat as a footnote to his research on algorithm visualization. But I kind of feel that Knuth is talking about the beauty of algorithms, just as a poem must be sung to be fully enjoyed.