> *Great algorithms are the poetry of computation.* (Francis Sullivan)

Poetry always gives us an enjoyable experience, if not a meaningful enlightenment
about the true nature of life. But ten years ago, when I first read about KMP in
CLRS' classic *Introduction to Algorithms*, all I went through was confusion.
Though the code was succinct like a poem, I kept wondering how someone could
come up with such an exotic, if not esoteric, yet still efficient algorithm. It
felt like someone just got lucky enough to be struck by lightning, until I read
Sedgewick's fantastic *Algorithms*. He explains this great algorithm from a DFA
angle, which reveals the delicacy of the algorithm in an intuitive and natural
way.

But Sedgewick only explains the K part of KMP in his lecture and book, i.e. Knuth's original algorithm, which uses a DFA branching on an alphabet. The flaw is obvious: if we have a large alphabet like Unicode, the memory needed to store this DFA will be huge, even if the pattern is composed of only a few distinct letters. This is where Knuth's student Pratt came in with a way to reduce the space. As Sedgewick mentions in his lecture, the improved algorithm is more complicated, which is why he doesn't discuss it in detail there. But to my surprise, the improved algorithm is still quite understandable if we follow the way he shows us how to build a DFA. The only difference is that what we need now is an NFA.

This is why I am writing this blog. I will briefly recap the DFA explanation of
KMP first. Then I will explain the improved algorithm from an NFA point of view.
Finally, we will modify this NFA a little bit to make it search for all
occurrences in the text. The reader should at least know what problem the KMP
algorithm is solving. Section 5.3 of Sedgewick's *Algorithms* is recommended,
and the parts explaining the brute-force search and KMP search are strongly
recommended.

The recap here is not intended to be thorough. Please check Sedgewick's lecture on KMP.

KMP is a classic linear-time algorithm for substring matching. The critical idea
is to avoid backup so that we can move down the text directly. The key part is
that when a mismatch happens, we know exactly which pattern index we should
compare against the current text index. Since we don't want backup, we never
decrease the text index. We can think of the pattern index as a state: a match
that has proceeded to `pat[j]` means we have already matched `j` characters.
Let's call it the match state. Assume we have a machine such that, once we tell
it the current state and the next character, it tells us which pattern index to
match against next. Then the match would be much simpler:

```
def match(txt, n, pat, m, dfa):
    j = 0  # match state
    for i in range(n):
        j = dfa[j][ord(txt[i])]  # one transition per text character
        if j == m:
            return i - m + 1  # the match starts here
    return -1
```

By nature, this machine is a DFA, since once the state and the input are determined, so is the result.

Fortunately, this machine can be constructed recursively. Let's work on an
example pattern `ABABAC`. The match case is easy: we just increment the state.

For the mismatch case, let's say we are now at `pat[3]`. What we know is that
the past 3 chars are `"ABA"` in both the text and the pattern, i.e.
`txt[i-3:i] == "ABA"`. If backup were allowed, we would need to compare
`txt[i-2]`, which is `'B'`, with `pat[0]`. But since we have already
recursively constructed the DFA up to this point, and we also know the inputs
(`"BA"`), we can compute the final state virtually by feeding `"BA"` to the
DFA starting at state 0. It turns out the final state will be 1. Then the
destinations of the mismatch transitions at state 3 are the same as those at state 1.
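A quick way to convince yourself of this virtual-state trick: the DFA state after consuming a string from state 0 always equals the length of the longest pattern prefix that is a suffix of what was consumed. This is the standard invariant; the brute-force check below is my own sketch and ignores the already-accepted corner case.

```
def virtual_state(pat, s):
    # brute force: the longest prefix of pat that is a suffix of s
    for k in range(min(len(pat), len(s)), -1, -1):
        if s.endswith(pat[:k]):
            return k

print(virtual_state("ABABAC", "BA"))    # 1, as in the state-3 example
print(virtual_state("ABABAC", "ABAB"))  # 4
```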

Since this virtual state also evolves recursively, we can compute it along with the DFA construction in one pass.

Here's the Python implementation of the DFA construction

```
R = 256  # alphabet size; extended ASCII here, though any alphabet works

def build(pat, m):
    dfa = [[0 for _ in range(R)] for _ in range(m)]
    dfa[0][ord(pat[0])] = 1  # the match transition at state 0
    j = 0  # virtual state run through dfa
    for i in range(1, m):
        for k in range(R):
            dfa[i][k] = dfa[j][k]  # copy mismatch transitions from the virtual state
        dfa[i][ord(pat[i])] = i + 1  # set the match transition
        j = dfa[j][ord(pat[i])]  # advance the virtual state
    return dfa
```
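As a sanity check on the `ABABAC` example, here is a self-contained sketch (it repeats the construction so it runs on its own; `R = 256` is an assumed extended-ASCII alphabet) verifying that state 3's mismatch transitions coincide with state 1's:

```
R = 256  # assumed alphabet size (extended ASCII)

def build(pat, m):
    # same construction as above, repeated so this snippet runs standalone
    dfa = [[0 for _ in range(R)] for _ in range(m)]
    dfa[0][ord(pat[0])] = 1
    j = 0  # virtual state
    for i in range(1, m):
        for k in range(R):
            dfa[i][k] = dfa[j][k]
        dfa[i][ord(pat[i])] = i + 1
        j = dfa[j][ord(pat[i])]
    return dfa

dfa = build("ABABAC", 6)

# feeding "BA" into the DFA from state 0 lands in state 1, as claimed
s = 0
for c in "BA":
    s = dfa[s][ord(c)]
print(s)  # 1

# so state 3 agrees with state 1 on every input except its own match char 'B'
print(all(dfa[3][c] == dfa[1][c] for c in range(R) if c != ord('B')))  # True
```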

One idea to save space is not to record transitions for every input. A natural way is to differentiate only between match and mismatch. This means the FSM has only two transitions at any state: match and mismatch. It's easy to see that the extra space needed is linear in M (the length of the pattern). But the problem is that since we only differentiate match and mismatch, we have no way to be deterministic when a mismatch occurs. Indeed, there is only one deterministic way to match at any state, but there are many ways to mismatch. This means that if we take a mismatch transition, we have to check again whether there is a match. If there is, we can now take a deterministic match transition. If there is not, by definition we have to take another mismatch transition and check again, until we reach state 0.

This may feel like someone again got struck by lightning. But believe me, if you work on the DFA for a while and hold tightly to the two-type-transition-only idea, it is quite natural to arrive at this NFA method.

Another thing to notice is that we don't need to store the match transitions at
all, because at any time we can check whether `txt[i] == pat[j]` to determine
if there is a match, and if there is, we increment the state. The NFA we need
only has to tell us which pattern index to try next, given the match state we
are at; the input is always implicitly a mismatch. Suppose we already have this
NFA. The match will be:

```
def match(txt, n, pat, m, nfa):
    j = 0  # match state
    for i in range(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]  # take mismatch transitions until a match or state 0
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            return i - m + 1
    return -1
```
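Assuming for a moment the mismatch table that the construction below produces for `ABABAC` (hand-coded here so the snippet runs standalone), the search behaves like this:

```
def match(txt, n, pat, m, nfa):
    j = 0  # match state
    for i in range(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            return i - m + 1
    return -1

# hand-coded mismatch table for "ABABAC" (an assumption here; the
# construction in the next section yields exactly this array)
nfa = [0, 0, 0, 1, 2, 3]
print(match("ABABABAC", 8, "ABABAC", 6, nfa))  # 2
```

Note how at text index 5 the partial match of 5 characters falls back to state 3 instead of restarting from scratch.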

Like the DFA, the NFA can be constructed recursively too. But since the machine
is now non-deterministic, the destination of a mismatch transition only
indicates the next state to try. Quite like the DFA construction, this virtual
state can also be computed along with the NFA construction. Let's take
`"ABABAC"` as the example pattern again; the FSM will be

Here's the code to construct it

```
def build(pat, m):
    nfa = [0 for _ in range(m)]
    j = 0  # virtual state run through nfa
    for i in range(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    return nfa
```
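Running the construction on the example pattern gives the full mismatch table (the code is repeated so this snippet runs standalone); note that `nfa[3] == 1`, matching the state-3 discussion above:

```
def build(pat, m):
    nfa = [0 for _ in range(m)]
    j = 0  # virtual state run through nfa
    for i in range(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    return nfa

print(build("ABABAC", 6))  # [0, 0, 0, 1, 2, 3]
```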

The idea behind matching all occurrences is that after finding one match, we continue the matching process by treating any input as a mismatch. What we need to do is add this transition to the above FSM. As always, let's assume we already have this NFA. The match is quite straightforward:

```
def match_all(txt, n, pat, m, nfa):
    ret = []  # all occurrences
    j = 0  # match state
    for i in range(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            ret.append(i - m + 1)
            j = nfa[j]  # wild mismatch
    return ret
```

Let's take `"ABABAC"` again as the example pattern. The modified FSM will be

Please notice that in this example the mismatch transition at state 6 goes to 0. But this does NOT mean that we reset the state after a complete match. The mismatch transition has to be recursively computed from state 5; it merely happens to be 0 in this example. At state 5, the mismatch transition goes to state 3. But the match transition characters at state 3 and state 5 are different, so we take another mismatch transition from state 3, which goes to state 1. Once again, the match transition characters at state 1 and state 5 are different, so we mismatch-transit again, to state 0. That's the end of the story.

The Python implementation of `build_all` will be:

```
def build_all(pat, m):
    nfa = [0 for _ in range(m + 1)]
    j = 0  # virtual state
    for i in range(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    nfa[m] = j  # mismatch transition out of the final (accepting) state
    return nfa
```
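To see the whole thing working end to end, here is a standalone run (both functions repeated from above) that finds the overlapping occurrences of `"ABA"` in `"ABABA"` and checks that `nfa[6]` for `"ABABAC"` is indeed 0, as discussed:

```
def build_all(pat, m):
    nfa = [0 for _ in range(m + 1)]
    j = 0  # virtual state
    for i in range(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    nfa[m] = j
    return nfa

def match_all(txt, n, pat, m, nfa):
    ret = []  # all occurrences
    j = 0  # match state
    for i in range(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            ret.append(i - m + 1)
            j = nfa[j]  # treat the next input as a wild mismatch
    return ret

print(build_all("ABABAC", 6)[6])  # 0
nfa = build_all("ABA", 3)
print(match_all("ABABA", 5, "ABA", 3, nfa))  # [0, 2]
```

The two occurrences of `"ABA"` overlap at index 2, which a naive restart-after-match search would also find, but only by backing up in the text.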

If you observe carefully, it is not hard to see that the mismatch transition
from state 0 always goes to state 0. Thus we can save one memory cell. Of
course, this makes the NFA index off by 1, and it also affects the inner loop's
execution order: the overall effect is that `nfa[i] = j` has to be put at the
end of the loop instead of the beginning. The KMP algorithm introduced in CLRS'
*Introduction to Algorithms* uses this implementation. But I find it too clever
to be smart; it makes the algorithm more twisted, and the gain doesn't justify
the loss. That's only my opinion, though.
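For reference, here is my sketch of that off-by-one variant in the classic prefix-function formulation (the name `build_clrs` is mine; `pi[i]` is the length of the longest proper border of `pat[:i+1]`, and equals `nfa[i+1]` from above):

```
def build_clrs(pat, m):
    pi = [0 for _ in range(m)]
    j = 0  # virtual state, now off by one
    for i in range(1, m):
        while j > 0 and pat[i] != pat[j]:
            j = pi[j - 1]  # index shifted by one compared to nfa
        if pat[i] == pat[j]:
            j += 1
        pi[i] = j  # assignment moved to the end of the loop
    return pi

print(build_clrs("ABABAC", 6))  # [0, 0, 1, 2, 3, 0]
```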

> *An algorithm must be seen to be believed.* (Donald Knuth)

Sedgewick uses this quote of Knuth's somewhat as a footnote to his research on algorithm visualization. But I kind of feel that Knuth is talking about the beauty of algorithms, as a poem must be sung to be fully enjoyed.