以下内容转载自https://stackoverflow.com/questions/30548170/dfa-construction-in-knuth-morris-pratt-algorithm,帮助理解KMP算法中根据模式串构造DFA,以及与LPS的关系
Question
I am referring to the outline of the Knuth-Morris-Pratt (KMP) algorithm for substring search in Sedgewick’s book “Algorithms” (4th ed.).
The KMP algorithm uses a backup in substring search based on a deterministic finite automaton (DFA). I understand how the DFA enters the algorithm, however I don’t understand how to construct the DFA, which is done by the following code snippet:
dfa[pat.charAt(0)][0] = 1;
for (int X = 0; j = 1; j< M; j++) {
for (int c = 0; c < R; c++) {
dfa[c][j] = dfa[c][X];
}
dfa[pat.charAt(j)][j] = j+1;
X = dfa[pat.charAt(j)][X];
}
where M
is the the length of the pattern pat
and R
the size of the alphabet. The charAt()
function returns the integer value of the character at the respective position.
Can someone explain in what way this piece of code constructs a dfa? I am lost in the actual intuitive meaning of the inner for-loop.
Answer
Let us take a look at the following FA for pattern ACACAGA.
The above diagrams represent graphical and tabular representations of pattern ACACAGA.
Here, number of states in DFA is M + 1 where M is length of the pattern. The main thing to construct DFA is to get the next state from the current state for every possible character. Given a character x and a state k, we can get the next state by considering the string “pat[0..k-1]x” which is basically concatenation of pattern characters pat[0], pat1 … pat[k-1] and the character x. The idea is to get length of the longest prefix of the given pattern such that the prefix is also suffix (LPS) of “pat[0..k-1]x”. The value of length gives us the next state.
For example, let us see how to get the next state from current state 5 and character ‘C’ in the above diagram. We need to consider the string, “pat[0..5]C” which is “ACACAC”. The lenght of the longest prefix of the pattern such that the prefix is suffix of “ACACAC”is 4 (“ACAC”). So the next state (from state 5) is 4 for character ‘C’.
// dfa[i][j] = k denotes the transition function will go k'th state
// with character i from state j
// DFA will go always (i + 1)'th state from i'th state
//if the character c is equal to current character of pattern
dfa[pat.charAt(0)][0] = 1;
// here X is initialized with LPS (longest prefix suffix)
// of pattern[0....j - 1]. LPS[0] is always 0
for (int X = 0; j = 1; j< M; j++) {
for (int c = 0; c < R; c++) { // for all possible characters
// transition function from j'th state taking character c
// will go to the same state where it went from X(LPS)'th state
// taking character c (justify it with an example)
dfa[c][j] = dfa[c][X];
}
// DFA will go always (i + 1)th state from i'th state if
// the character c is equal to current character of pattern
dfa[pat.charAt(j)][j] = j + 1;
X = dfa[pat.charAt(j)][X]; // update LPS
}
References
网上大多数资料,特别是中文资料,表述的并不是很清楚,相较之下:
http://www.huangbin.fun/KMP%E7%AE%97%E6%B3%95-%E5%9F%BA%E4%BA%8E%E3%80%8A%E7%AE%97%E6%B3%95%E3%80%8B%E7%AC%AC%E5%9B%9B%E7%89%88.html#%E8%A6%81%E7%82%B9
讲的还可以,补充了为什么是pat[1..j-1]