- Order:
- Duration: 9:38
- Published: 27 Oct 2010
- Uploaded: 27 Mar 2011
- Author: guillemgodoy
In a context free grammar the left hand side of a production rule is always a single nonterminal symbol. In a general grammar, it could be a string of terminal and/or nonterminal symbols. The grammars are called context free because – since all rules only have a nonterminal on the left hand side – one can always replace that nonterminal symbol with what is on the right hand side of the rule. The context in which the symbol occurs is therefore not important.
Context-free languages are exactly those which can be understood by a finite state computer with a single infinite stack. In order to keep track of nested units, one pushes the current parsing state at the start of the unit, and one recovers it at the end.
Context-free grammars play a central role in the description and design of programming languages and compilers. They are also used for analyzing the syntax of natural languages. Noam Chomsky originally proposed the use of context-free grammars for generating the base, i.e. the input for transformations (Chomsky 1957, 1965). In later work (e.g. Chomsky 1981), the idea of formulating a grammar consisting of explicit rewrite rules was abandoned. In other generative frameworks, e.g. Generalized Phrase Structure Grammar (Gazdar et al. 1985), context-free grammars were taken to be the mechanism for the entire syntax, eliminating transformations.
An essential property of these block structures is that logical units never overlap. For example, the sentence: John, whose blue car was in the garage, walked to the green store.
can be logically parenthesized as follows:
:: (John, ((whose blue car) (was (in the garage)))), (walked (to (the green store))).
Natural human languages rarely allow overlapping constructions, such as:
:: [ John saw (a blue car in an ad yesterday] with bright yellow headlights).
It is hard to find cases of possible overlap and many such cases can be explained in an alternative way that doesn't assume overlap.
The formalism of context-free grammars was developed in the mid-1950s by Noam Chomsky, and also their classification as a special type of formal grammar (which he called phrase-structure grammars).
A context-free grammar provides a simple and precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks, capturing the "block structure" of sentences in a natural way. Its simplicity makes the formalism amenable to rigorous mathematical study. Important features of natural language syntax such as agreement and reference are not part of the context free grammar, but the basic recursive structure of sentences, the way in which clauses nest inside other clauses, and the way in which lists of adjectives and adverbs are swallowed by nouns and verbs, is described exactly.
Block structure was introduced into computer programming languages by the Algol project, which, as a consequence, also featured a context-free grammar to describe the resulting Algol syntax. This became a standard feature of computer languages, and the notation for grammars used in concrete descriptions of computer languages came to be known as Backus-Naur Form, after two members of the Algol language design committee.
The "block structure" aspect that context-free grammars capture is so fundamental to grammar that the terms syntax and grammar are often identified with context-free grammar rules, especially in computer science. Formal constraints not captured by the grammar are then considered to be part of the "semantics" of the language.
Context-free grammars are simple enough to allow the construction of efficient parsing algorithms which, for a given string, determine whether and how it can be generated from the grammar. An Earley parser is an example of such an algorithm, while the widely used LR and LL parsers are more efficient algorithms that deal only with more restrictive subsets of context-free grammars.
where
1. is a finite set of non-terminal characters or variables. They represent different types of phrase or clause in the sentence. They are sometimes called syntactic categories. Each variable represents a language.
2. is a finite set of terminals, disjoint from , which make up the actual content of the sentence.The set of terminals is the alphabet of the language defined by the grammar.
3. is a relation from to such that . These relations are called productions or rewrite rules.
4. is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of .
is a finite set. The members of are called the rules or productions of the grammar. The asterisk represents the Kleene star operation.
A language is said to be a context-free language (CFL), if there exists a CFG , such that .
:S → SS :S → (S) :S → ()
The first rule allows Ss to multiply; the second rule allows Ss to become enclosed by matching parentheses; and the third rule terminates the recursion.
Starting with S, and applying the rules, one can construct:
:S → SS → SSS → (S)SS → ((S))SS → ((SS))S(S) :→ ((()S))S(S) → ((()()))S(S) → ((()()))()(S) :→ ((()()))()(())
:S → SS :S → () :S → (S) :S → [] :S → [S]
with terminal symbols [ ] ( ) and nonterminal S.
The following sequence can be derived in that grammar: : ([ [ [ ()() [ ][ ] ] ]([ ]) ])
However, there is no context-free grammar for generating all sequences of two different types of parentheses, each separately balanced disregarding the other, but where the two types need not nest inside one another, for example:
: [ [ [ [(((( ] ] ] ]))))(([ ))(([ ))([ )( ])( ])( ])
The terminals here are a and b, while the only non-terminal is S. The language described is all nonempty strings of s and s that end in .
This grammar is regular: no rule has more than one nonterminal in its right-hand side, and each of these nonterminals is at the same end of the right-hand side.
Every regular grammar corresponds directly to a nondeterministic finite automaton, so we know that this is a regular language.
It is common to list all right-hand sides for the same left-hand side on the same line, using | to separate them, like this:
:S → a | aS | bS
So that this is the same grammar described in a terser way.
:S → aSb :S → ab
This grammar generates the language , which is not regular (according to Pumping Lemma for regular languages).
The special character ε stands for the empty string. By changing the above grammar to :S → aSb | ε we obtain a grammar generating the language instead. This differs only in that it contains the empty string while the original grammar did not.
This grammar can, for example, generate the string
:( x + y ) * x - z * y / ( x + x )
as follows:
:S (the start symbol) : → S - S (by rule 5) : → S * S - S (by rule 6, applied to the leftmost S) : → S * S - S / S (by rule 7, applied to the rightmost S) : → ( S ) * S - S / S (by rule 8, applied to the leftmost S) : → ( S ) * S - S / ( S ) (by rule 8, applied to the rightmost S) : → ( S + S ) * S - S / ( S ) (etc.) : → ( S + S ) * S - S * S / ( S ) : → ( S + S ) * S - S * S / ( S + S ) : → ( x + S ) * S - S * S / ( S + S ) : → ( x + y ) * S - S * S / ( S + S ) : → ( x + y ) * x - S * y / ( S + S ) : → ( x + y ) * x - S * y / ( x + S ) : → ( x + y ) * x - z * y / ( x + S ) : → ( x + y ) * x - z * y / ( x + x )
Note that many choices were made underway which S
was going to be rewritten next.
These choices look quite arbitrary. As a matter of fact, they are.
Also, many choices were made on which rule to apply to the selected S
.
These do not look so arbitrary: they usually affect which terminal string comes out at the end.
To see this we can look at the parse tree of this derivation:
S
|
/|\
S - S
/ \
/|\ /|\
S * S S / S
/ | | \
/|\ x /|\ /|\
( S ) S * S ( S )
/ | | \
/|\ z y /|\
S + S S + S
| | | |
x y x x
Starting at the top, step by step, an S in the tree is expanded, until no more unexpanded S
es (non-terminals) remain.
Picking a different order of expansion will produce a different derivation, but the same parse tree.
The parse tree will only change, if we pick a different rule to apply at some position in the tree.
But can a different parse tree still produce the same terminal string,
which is ( x + y ) * x - z * y / ( x + x )
in this case?
Yes, for this grammar, this is possible.
Grammars with this property are called ambiguous.
For example, x + y * z
can be produced with these two different parse trees:
S S
| |
/|\ /|\
S * S S + S
/ \ / \
/|\ z x /|\
x + y y * z
However, the language described by this grammar is not inherently ambiguous: an alternative, unambiguous grammar can be given for the language, for example:
:T → x :T → y :T → z :S → S + T :S → S - T :S → S * T :S → S / T :T → ( S ) :S → T
(once again picking S
as the start symbol).
Context-free grammars are not limited in application to mathematical ("formal") languages. For example, it has been suggested that a class of Tamil poetry called Venpa is described by a context-free grammar.
=== Derivations and syntax trees ===
There are two common ways to describe how a given string can be shown to be derivable from the start symbol of a given grammar. The simplest way is to list the set of consecutive derivations, beginning with the start symbol and ending with the string itself along with the rules applied to produce each intermediate step. However, if we introduce a strategy such as, "always replace the left-most nonterminal first"; then, for context-free grammars the list of applied grammar rules is by itself sufficient and is called the leftmost derivation of a string. For example, if we take the following grammar:
: (1) S → S + S : (2) S → 1 : (3) S → a
and the string "1 + 1 + a" then a left derivation of this string is the list [ (1), (1), (2), (2), (3) ], that is, the following one step derivations:
S → S + S → S + S + S → 1 + S + S → 1 + 1 + S → 1 + 1 + a
Analogously, the rightmost derivation is defined as the list that we get if we always replace the rightmost nonterminal first. In this case this could be the list [ (1), (3), (1), (2), (2)].
S → S + S → S + a → S + S + a → S + 1 + a → 1 + 1 + a
The distinction between leftmost derivation and rightmost derivation is important because in most parsers the transformation of the input is defined by giving a piece of code for every grammar rule that is executed whenever the rule is applied. Therefore it is important to know whether the parser determines a leftmost or a rightmost derivation because this determines the order in which the pieces of code will be executed. See for an example LL parsers and LR parsers.
A derivation also imposes in some sense a hierarchical structure on the string that is derived. For example, if the string "1 + 1 + a" is derived according to the leftmost derivation:
:S → S + S (1) : → S + S + S (1) : → 1 + S + S (2) : → 1 + 1 + S (2) : → 1 + 1 + a (3)
the structure of the string would be:
: { { { 1 }S + { 1 }S }S + { a }S }S where { ... }S indicates a substring recognized as belonging to S. This hierarchy can also be seen as a tree:
S
/|\
/ | \
/ | \
S '+' S
/|\ |
/ | \ |
S '+' S 'a'
| |
'1' '1'
This tree is called a concrete syntax tree (see also abstract syntax tree) of the string. In this case the presented leftmost and the rightmost derivations define the same syntax tree; however, there is another (leftmost) derivation of the same string
:S → S + S (1) : → 1 + S (2) : → 1 + S + S (1) : → 1 + 1 + S (2) : → 1 + 1 + a (3)
and this defines the following syntax tree:
S
/|\
/ | \
/ | \
S '+' S
| /|\
| / | \
'1' S '+' S
| |
'1' 'a'
If, for certain strings in the language of the grammar, there is more than one parsing tree, then the grammar is said to be an ambiguous grammar. Such grammars are usually hard to parse because the parser cannot always decide which grammar rule it has to apply. Usually, ambiguity is a feature of the grammar, not the language, and an unambiguous grammar can be found that generates the same context-free language. However, there are certain languages that can only be generated by ambiguous grammars; such languages are called inherently ambiguous.
Because of the especially simple form of production rules in Chomsky Normal Form grammars, this normal form has both theoretical and practical implications. For instance, given a context-free grammar, one can use the Chomsky Normal Form to construct a polynomial-time algorithm that decides whether a given string is in the language represented by that grammar or not (the CYK algorithm).
Still, many problems remain undecidable. Examples:
A reduction can be demonstrated to this problem from the well-known undecidable problem of determining whether a Turing machine accepts a particular input (the Halting problem). The reduction uses the concept of a computation history, a string describing an entire computation of a Turing machine. We can construct a CFG that generates all strings that are not accepting computation histories for a particular Turing machine on a particular input, and thus it will accept all strings only, if the machine doesn't accept that input.
The undecidability of this problem is a direct consequence of the previous: we cannot even decide whether a CFG is equivalent to the trivial CFG defining the language of all strings.
Each of these problems is undecidable.
In computer science, examples of this approach include affix grammars, attribute grammars, indexed grammars, and Van Wijngaarden two-level grammars.
Similar extensions exist in linguistics.
Another extension is to allow additional terminal symbols to appear at the left hand side of rules, constraining their application. This produces the formalism of context-sensitive grammars.
Category:Compiler theory Category:Formal languages Category:Programming language topics Category:Wikipedia articles with ASCII art
This text is licensed under the Creative Commons CC-BY-SA License. This text was originally published on Wikipedia and was developed by the Wikipedia community.