The remainder of this article describes the table-based kind of parser, the alternative being a recursive descent parser which is usually coded by hand (although not always; see e.g. ANTLR for an LL(*) recursive-descent parser generator).
An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. If such a parser exists for a certain grammar and it can parse sentences of this grammar without backtracking, then the grammar is called an LL(k) grammar. A language that has an LL(k) grammar is known as an LL(k) language. There are LL(k+n) languages that are not LL(k) languages. A corollary of this is that not all context-free languages are LL(k) languages.
LL(1) grammars are very popular because the corresponding LL parsers only need to look at the next token to make their parsing decisions. Languages based on grammars with a high value of k have traditionally been considered to be difficult to parse, although this is less true now given the availability and widespread use of parser generators supporting LL(k) grammars for arbitrary k.
An LL parser is called an LL(*) parser if it is not restricted to a finite k tokens of lookahead, but can make parsing decisions by recognizing whether the following tokens belong to a regular language (for example by use of a Deterministic Finite Automaton).
There is contention between the "European school" of language design, which prefers LL-based grammars, and the "US school", which predominantly prefers LR-based grammars. This is largely due to teaching traditions and the detailed description of specific methods and tools in certain textbooks; another influence may be Niklaus Wirth at ETH Zürich in Switzerland, whose research has described a number of ways of optimising LL(1) languages and compilers.
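The parser consists of
* an input buffer, holding the input string (built from the grammar)
* a stack on which to store the terminals and non-terminals from the grammar yet to be parsed
* a parsing table which tells it what (if any) grammar rule to apply given the symbol on top of its stack and the next input token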
The parser applies the rule found in the table by matching the top-most symbol on the stack (row) with the current symbol in the input stream (column).
When the parser starts, the stack already contains two symbols:
[ S, $ ]
where '$' is a special terminal to indicate the bottom of the stack and the end of the input stream, and 'S' is the start symbol of the grammar. The parser will attempt to rewrite the contents of this stack to what it sees on the input stream. However, it only keeps on the stack what still needs to be rewritten.
To explain how the parser works, consider the following small grammar:
# S → F
# S → ( S + F )
# F → a
and parse the following input:
:( a + a )
The parsing table for this grammar looks as follows:
:{| border="1" align="center" cellspacing="3" padding="5" class="wikitable" align="center"
|----- align="center"
| || ( || ) || a || + || $
|----- align="center"
| S || 2 || - || 1 || - || -
|----- align="center"
| F || - || - || 3 || - || -
|}
(Note that there is also a column for the special terminal, represented here as $, that is used to indicate the end of the input stream.)
Thus, in its first step, the parser reads the input symbol '(' and the stack-top symbol 'S'. The parsing table instruction comes from the column headed by the input symbol '(' and the row headed by the stack-top symbol 'S'; this cell contains '2', which instructs the parser to apply rule (2). The parser has to rewrite 'S' to '( S + F )' on the stack and write the rule number 2 to the output. The stack then becomes:
[ (, S, +, F, ), $ ]
Since the '(' from the input stream did not match the top-most symbol, 'S', from the stack, it was not removed, and remains the next-available input symbol for the following step.
In the second step, the parser removes the '(' from its input stream and from its stack, since they match. The stack now becomes:
[ S, +, F, ), $ ]
Now the parser has an 'a' on its input stream and an 'S' as its stack top. The parsing table instructs it to apply rule (1) from the grammar and write the rule number 1 to the output stream. The stack becomes:
[ F, +, F, ), $ ]
The parser now has an 'a' on its input stream and an 'F' as its stack top. The parsing table instructs it to apply rule (3) from the grammar and write the rule number 3 to the output stream. The stack becomes:
[ a, +, F, ), $ ]
In the next two steps the parser reads the 'a' and '+' from the input stream and, since they match the next two items on the stack, also removes them from the stack. This results in:
[ F, ), $ ]
In the next three steps the parser will replace 'F' on the stack by 'a', write the rule number 3 to the output stream and remove the 'a' and ')' from both the stack and the input stream. The parser thus ends with '$' on both its stack and its input stream.
In this case the parser will report that it has accepted the input string and write the following list of rule numbers to the output stream:
: [ 2, 1, 3, 3 ]
This is indeed a list of rules for a leftmost derivation of the input string, which is:
: S → ( S + F ) → ( F + F ) → ( a + F ) → ( a + a )
#include <iostream>
#include <map>
#include <stack>

enum Symbols {
	// the symbols:
	// Terminal symbols:
	TS_L_PARENS,	// (
	TS_R_PARENS,	// )
	TS_A,		// a
	TS_PLUS,	// +
	TS_EOS,		// $, in this case corresponds to '\0'
	TS_INVALID,	// invalid token

	// Non-terminal symbols:
	NTS_S,		// S
	NTS_F		// F
};

/*
Converts a valid token to the corresponding terminal symbol
*/
enum Symbols lexer(char c)
{
	switch(c)
	{
		case '(':
			return TS_L_PARENS;
		case ')':
			return TS_R_PARENS;
		case 'a':
			return TS_A;
		case '+':
			return TS_PLUS;
		case '\0':	// this will act as the $ terminal symbol
			return TS_EOS;
		default:
			return TS_INVALID;
	}
}

int main(int argc, char **argv)
{
	using namespace std;

	if (argc < 2)
	{
		cout << "usage:\n\tll '(a+a)'" << endl;
		return 0;
	}

	// LL parser table, maps < non-terminal, terminal > pair to a rule number
	map< enum Symbols, map<enum Symbols, int> > table;
	stack<enum Symbols> ss;	// symbol stack
	const char *p;		// input cursor

	// initialize the symbols stack
	ss.push(TS_EOS);	// terminal, $
	ss.push(NTS_S);		// non-terminal, S

	// initialize the symbol stream cursor
	p = &argv[1][0];

	// setup the parsing table
	table[NTS_S][TS_L_PARENS] = 2;
	table[NTS_S][TS_A] = 1;
	table[NTS_F][TS_A] = 3;

	while(ss.size() > 0)
	{
		if(lexer(*p) == ss.top())
		{
			// stack top and input symbol match: consume both
			cout << "Matched symbols: " << lexer(*p) << endl;
			p++;
			ss.pop();
		}
		else
		{
			// look up the rule for (stack top, input symbol);
			// a missing table entry yields 0 and ends up in the default case
			cout << "Rule " << table[ss.top()][lexer(*p)] << endl;
			switch(table[ss.top()][lexer(*p)])
			{
				case 1:	// 1. S → F
					ss.pop();
					ss.push(NTS_F);		// F
					break;

				case 2:	// 2. S → ( S + F )
					// the right-hand side is pushed in reverse order
					ss.pop();
					ss.push(TS_R_PARENS);	// )
					ss.push(NTS_F);		// F
					ss.push(TS_PLUS);	// +
					ss.push(NTS_S);		// S
					ss.push(TS_L_PARENS);	// (
					break;

				case 3:	// 3. F → a
					ss.pop();
					ss.push(TS_A);		// a
					break;

				default:
					cout << "parsing table defaulted" << endl;
					return 0;
			}
		}
	}

	cout << "finished parsing" << endl;

	return 0;
}
Unfortunately, the First-sets are not sufficient to compute the parsing table. This is because a right-hand side w of a rule might ultimately be rewritten to the empty string. So the parser should also use the rule A → w if ε is in Fi(w) and it sees on the input stream a symbol that could follow A. Therefore we also need the Follow-set of A, written as Fo(A) here, which is defined as the set of terminals a such that there is a string of symbols αAaβ that can be derived from the start symbol. Computing the Follow-sets for the nonterminals in a grammar can be done as follows:
# initialize every Fo(Ai) with the empty set
# if there is a rule of the form Aj → wAiw' , then
#* if the terminal a is in Fi(w' ), then add a to Fo(Ai)
#* if ε is in Fi(w' ), then add Fo(Aj) to Fo(Ai)
# repeat step 2 until all Fo sets stay the same.
Now we can define exactly which rules will be contained where in the parsing table. If T[A, a] denotes the entry in the table for nonterminal A and terminal a, then
: T[A,a] contains the rule A → w if and only if
:: a is in Fi(w) or
:: ε is in Fi(w) and a is in Fo(A).
If the table contains at most one rule in every one of its cells, then the parser will always know which rule it has to use and can therefore parse strings without backtracking. It is in precisely this case that the grammar is called an LL(1) grammar.
* FIRST/FOLLOW conflict
: The FIRST and FOLLOW sets of a grammar rule overlap. With an epsilon in the FIRST set, it is unknown which alternative to select.
: An example of an LL(1) conflict:
 S -> A 'a' 'b'
 A -> 'a' | epsilon
: The FIRST set of A is now { 'a', epsilon } and the FOLLOW set is { 'a' }.
* Left recursion
: Left recursion will cause a FIRST/FIRST conflict with all alternatives.
 E -> E '+' term | alt1 | alt2
* Substitution
: Substituting a rule into another rule to remove indirect or FIRST/FOLLOW conflicts. Note that this may cause a FIRST/FIRST conflict.
* Left recursion removal
: A simple example of left recursion removal: the following production rule has left recursion on E
 E -> E '+' T
 E -> T
: This rule is nothing but a list of Ts separated by '+', which in regular expression form is T ('+' T)*. So the rule could be rewritten as
 E -> T Z
 Z -> '+' T Z
 Z -> ε
: Now there is no left recursion and no conflicts on either of the rules.
However, not all CFGs have an equivalent LL(k)-grammar, e.g.:
 S -> A | B
 A -> 'a' A 'b' | ε
 B -> 'a' B 'b' 'b' | ε
It can be shown that there does not exist any LL(k)-grammar accepting the language generated by this grammar.
This text is licensed under the Creative Commons CC-BY-SA License. This text was originally published on Wikipedia and was developed by the Wikipedia community.