algorithms · leetcode · 2017-06-04 · yuex

Poetry, Algorithm, and KMP

Great algorithms are the poetry of computation. -- Francis Sullivan

Poetry always gives us an enjoyable experience, if not a meaningful enlightenment about the true nature of life. But 10 years ago, when I first read about KMP in CLRS' classic Introduction to Algorithms, all I got was confusion. Though the code was succinct like a poem, I kept wondering how someone could come up with such an exotic, if not esoteric, yet still efficient algorithm. It felt like someone just got lucky enough to be struck by lightning, until I read Sedgewick's fantastic Algorithms. He explains this great algorithm from a DFA angle, which reveals its delicacy in an intuitive and natural way.

But Sedgewick only explains the K part of KMP in his lecture and book, i.e. Knuth's original algorithm, which uses a DFA branching on an alphabet. The flaw is obvious: with a large alphabet like Unicode, the memory needed to store this DFA will be huge, even if the pattern is composed of only a few distinct letters. This is where Knuth's student Pratt came in with a way to reduce the space. As Sedgewick mentions in his lecture, the improved algorithm is more complicated, which is why he doesn't discuss it in detail there. But to my surprise, the improved algorithm is still quite understandable if we follow the way he shows us how to build a DFA. The only difference is that what we need now is an NFA.

This is why I am writing this blog. I will briefly recap the DFA explanation of KMP first. Then I will explain the improved algorithm from an NFA point of view. Finally, we will modify this NFA a little bit to make it search for all occurrences in the text. The reader should at least know what problem the KMP algorithm is solving. Section 5.3 of Sedgewick's Algorithms is recommended, especially the parts explaining the brute-force search and the KMP search.

Recap

This recap is not intended to be thorough. Please check Sedgewick's lecture on KMP.

KMP is a classic linear-time algorithm for substring matching. The critical idea is to avoid backup so that we can move through the text directly. The key part is that when a mismatch happens, we know exactly which pattern index should be compared against the current text index. Since we never want backup, we never decrease the text index. We can think of the pattern index as a state: a match that has proceeded to pat[j] means we have already matched j characters. Let's call it the match state. Assume we have a machine such that, given the current state and the next character, it tells us which pattern index to match against next. Then the match would be much simpler:

def match(txt, n, pat, m, dfa):
    j = 0 # match state
    for i in xrange(n):
        j = dfa[j][ord(txt[i])]
        if j == m: # all m characters matched
            return i-m+1 # start index of the occurrence
    return -1 # no match

By nature, this machine is a DFA: once the state and the input are determined, so is the result.

Fortunately, this machine can be constructed recursively. Let's work on an example pattern, "ABABAC". The match case is easy: we just increment the state.

For the mismatch case, let's say we are now at pat[3]. What we do know is that the past 3 chars are "ABA", both in the text and in the pattern, i.e. txt[i-3:i] == "ABA". If backup were allowed, we would need to compare txt[i-2], which is 'B', with pat[0]. But since we have already recursively constructed the DFA up to this point, and we also know the inputs ("BA"), we can compute the final state virtually by feeding "BA" to the DFA starting from state 0. It turns out the final state will be 1. Then the mismatch transitions at state 3 have the same destinations as those at state 1.
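
To make this concrete, here is a minimal sketch of that virtual run. The helper restart_state is hypothetical, only for illustration, and assumes the dfa table built by the code below (indexed by ord of the character):

# a hypothetical helper, not part of the final algorithm: feed the
# already-matched characters (minus the first one) through the DFA
# built so far, starting from state 0
def restart_state(dfa, chars):
    j = 0
    for c in chars:
        j = dfa[j][ord(c)]
    return j

# for the pattern "ABABAC", restart_state(dfa, "BA") is 1, so the
# mismatch transitions at state 3 copy those at state 1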

Since this virtual state also evolves recursively, we can compute it along with the DFA construction in one pass.

[figure: the DFA for the pattern "ABABAC"]

Here's the Python implementation of the DFA construction:

R = 256 # alphabet size, extended ASCII

def build(pat, m):
    dfa = [[0 for _ in xrange(R)] for _ in xrange(m)]
    dfa[0][ord(pat[0])] = 1 # the only non-zero transition at state 0
    j = 0 # virtual state run through dfa
    for i in xrange(1, m):
        for k in xrange(R):
            dfa[i][k] = dfa[j][k] # copy mismatch transitions from the virtual state
        dfa[i][ord(pat[i])] = i+1 # set match transition
        j = dfa[j][ord(pat[i])] # advance the virtual state
    return dfa
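
As a quick sanity check, here is a hypothetical run of the two functions above (the text string is made up for illustration):

pat = "ABABAC"
dfa = build(pat, len(pat))
# the mismatch transitions at state 3 do equal those at state 1
assert all(dfa[3][k] == dfa[1][k] for k in xrange(R) if k != ord('B'))

txt = "AABABABACAA"
print match(txt, len(txt), pat, len(pat), dfa) # 3, since txt[3:9] == pat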

From DFA to NFA

One idea to save space is to not record transitions for every input. A natural way is to only differentiate match from mismatch. This means that in the FSM we have only two transitions at any state: match and mismatch. It's easy to see that the extra space needed would be linear in M (the length of the pattern). The problem is that since we only differentiate match and mismatch, we have no way to be deterministic when a mismatch occurs. Indeed, there is only one deterministic way to match at any state, but there are many ways to mismatch. This means that if we take a mismatch transition, we have to check again whether there is a match. If there is, okay, we can now take a deterministic match transition. If there is not, by definition, we have to take another mismatch transition and check again, until we have reached state 0.

This may again feel like someone got struck by lightning. But believe me, if you work on the DFA for a while and hold on tightly to the two-types-of-transitions-only idea, it is quite natural to arrive at this NFA method.

Another thing to notice is that we don't need to store the match transitions at all, because at any time we can check txt[i] ==? pat[j] to determine whether there is a match, and if there is, we increment the state. All the NFA needs to do is tell us which pattern index to try next, given the match state we are at; the input is always implicitly a mismatch. Suppose we already have this NFA; the match will be:

def match(txt, n, pat, m, nfa):
    j = 0 # match state
    for i in xrange(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j] # take mismatch transitions until a match or state 0
        if txt[i] == pat[j]:
            j += 1 # take the match transition
        if j >= m:
            return i-m+1
    return -1

Like the DFA, the NFA can be constructed recursively too. But since the machine is now non-deterministic, the destination of a mismatch transition only indicates the next state to try. Much like in the DFA construction, this virtual state can be computed along with the NFA construction. Let's take "ABABAC" as the example pattern again; the FSM will be:

[figure: the NFA for the pattern "ABABAC"]

Here's the code to construct it:

def build(pat, m):
    nfa = [0 for _ in xrange(m)]
    j = 0 # virtual state run through nfa
    for i in xrange(1, m):
        nfa[i] = j # the mismatch transition at state i
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    return nfa
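
For the example pattern, this construction yields exactly the mismatch transitions drawn above. A hypothetical check, reusing the made-up text from the DFA section:

pat = "ABABAC"
nfa = build(pat, len(pat))
print nfa # [0, 0, 0, 1, 2, 3]

txt = "AABABABACAA"
print match(txt, len(txt), pat, len(pat), nfa) # 3, same as the DFA version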

Match Them All

The idea for matching all occurrences is that after finding one match, we continue the match process by treating any input as a mismatch. All we need to do is add this transition to the above FSM. As always, let's assume we already have this NFA. The match is quite straightforward:

def match_all(txt, n, pat, m, nfa):
    ret = [] # all occurrences
    j = 0    # match state
    for i in xrange(n):
        while j > 0 and txt[i] != pat[j]:
            j = nfa[j]
        if txt[i] == pat[j]:
            j += 1
        if j >= m:
            ret.append(i-m+1)
            j = nfa[j] # wild mismatch
    return ret

Let's take "ABABAC" again as the example pattern. The modified FSM will be:

[figure: the modified NFA for the pattern "ABABAC", with a mismatch transition added at state 6]

Please notice that in this example the mismatch transition at state 6 goes to state 0. But this does NOT mean we simply reset the state after a complete match. The mismatch transition has to be recursively computed from state 5; it merely happens to be 0 in this example. When at state 5, the mismatch transition goes to state 3. But the match transition characters at state 3 and state 5 are different, so we take another mismatch transition from state 3, which goes to state 1. Once again, the match transition characters at state 1 and state 5 are different, so we take one more mismatch transition, to state 0. That's the end of the story.

The Python implementation of the construction will be:

def build_all(pat, m):
    nfa = [0 for _ in xrange(m+1)]
    j = 0 # virtual state
    for i in xrange(1, m):
        nfa[i] = j
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j]
        if pat[i] == pat[j]:
            j += 1
    nfa[m] = j # the extra mismatch transition at the final state
    return nfa
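
Here is a hypothetical run. Note how nfa[6] comes out as 0 for the example pattern, and how a pattern with a non-trivial border like "ABA" (a made-up example) is found even at overlapping positions:

pat = "ABABAC"
print build_all(pat, len(pat)) # [0, 0, 0, 1, 2, 3, 0], so nfa[6] is 0

pat = "ABA"
nfa = build_all(pat, len(pat)) # [0, 0, 0, 1]
print match_all("ABABA", 5, pat, len(pat), nfa) # [0, 2], overlapping matches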

Final Trick

If you observe carefully, it is not hard to see that the mismatch transition from state 0 always goes to state 0. Thus we can save one memory cell. Of course, this makes the NFA index off by one, and it also affects the execution order of the inner loop: the overall effect is that nfa[i] = j has to be put at the end of the loop instead of the beginning. The KMP algorithm introduced in CLRS' Introduction to Algorithms uses this implementation. But I find it too clever to be smart; it makes the algorithm more twisted, and the gain doesn't justify the loss. But that's only my opinion.
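
For comparison, here is a minimal sketch of that variant, my own reconstruction rather than the code from CLRS. The table is shifted down by one index, so a mismatch at pat[j] now consults nfa[j-1]:

def build_clrs(pat, m):
    # nfa[i] is now the state to try after a mismatch at pat[i+1],
    # i.e. the whole table is shifted down by one index
    nfa = [0 for _ in xrange(m)]
    j = 0 # virtual state
    for i in xrange(1, m):
        while j > 0 and pat[i] != pat[j]:
            j = nfa[j-1]
        if pat[i] == pat[j]:
            j += 1
        nfa[i] = j # assignment moved to the end of the loop
    return nfa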

Summary

An algorithm must be seen to be believed. -- Donald Knuth

Sedgewick uses this quote of Knuth's somewhat as a footnote to his research on algorithm visualization. But I kind of feel that Knuth is talking about the beauty of algorithms, just as a poem must be sung to be fully enjoyed.