KMP算法中DFA的构造

Posted by ChaosNyaruko on April 20, 2021

以下内容转载自https://stackoverflow.com/questions/30548170/dfa-construction-in-knuth-morris-pratt-algorithm,帮助理解KMP算法中根据模式串构造DFA,以及与LPS的关系


Question

I am referring to the outline of the Knuth-Morris-Pratt (KMP) algorithm for substring search in Sedgewick’s book “Algorithms” (4th ed.).

The KMP algorithm uses a backup in substring search based on a deterministic finite automaton (DFA). I understand how the DFA enters the algorithm, however I don’t understand how to construct the DFA, which is done by the following code snippet:

dfa[pat.charAt(0)][0] = 1;
for (int X = 0; j = 1; j< M; j++) {
    for (int c = 0; c < R; c++) {
       dfa[c][j] = dfa[c][X];
    }
    dfa[pat.charAt(j)][j] = j+1;
    X = dfa[pat.charAt(j)][X];
}

where M is the the length of the pattern pat and R the size of the alphabet. The charAt() function returns the integer value of the character at the respective position.

Can someone explain in what way this piece of code constructs a dfa? I am lost in the actual intuitive meaning of the inner for-loop.

Answer

Let us take a look at the following FA for pattern ACACAGA.

dfa

enter image description here

The above diagrams represent graphical and tabular representations of pattern ACACAGA.

Here, number of states in DFA is M + 1 where M is length of the pattern. The main thing to construct DFA is to get the next state from the current state for every possible character. Given a character x and a state k, we can get the next state by considering the string “pat[0..k-1]x” which is basically concatenation of pattern characters pat[0], pat1 … pat[k-1] and the character x. The idea is to get length of the longest prefix of the given pattern such that the prefix is also suffix (LPS) of “pat[0..k-1]x”. The value of length gives us the next state.

For example, let us see how to get the next state from current state 5 and character ‘C’ in the above diagram. We need to consider the string, “pat[0..5]C” which is “ACACAC”. The lenght of the longest prefix of the pattern such that the prefix is suffix of “ACACAC”is 4 (“ACAC”). So the next state (from state 5) is 4 for character ‘C’.

// dfa[i][j] = k denotes the transition function will go k'th state 
// with character i from state j

// DFA will go always (i + 1)'th state from i'th state 
//if the character c is equal to current character of pattern 
dfa[pat.charAt(0)][0] = 1;

//  here X is initialized with LPS (longest prefix suffix) 
// of pattern[0....j - 1]. LPS[0] is always 0
for (int X = 0; j = 1; j< M; j++) {
    for (int c = 0; c < R; c++) { // for all possible characters
        // transition function from j'th state taking character c 
        // will go to the same state where it went from X(LPS)'th state
        // taking character c (justify it with an example) 
        dfa[c][j] = dfa[c][X];
    }
    // DFA will go always (i + 1)th state from i'th state if 
    // the character c is equal to current character of pattern
    dfa[pat.charAt(j)][j] = j + 1;
    X = dfa[pat.charAt(j)][X]; // update LPS
}

References

网上大多数资料,特别是中文资料,表述的并不是很清楚,相较之下:

http://www.huangbin.fun/KMP%E7%AE%97%E6%B3%95-%E5%9F%BA%E4%BA%8E%E3%80%8A%E7%AE%97%E6%B3%95%E3%80%8B%E7%AC%AC%E5%9B%9B%E7%89%88.html#%E8%A6%81%E7%82%B9

讲的还可以,补充了为什么是pat[1..j-1]