We have seen that a lexical analyser can identify tokens with the help of regular expressions and pattern rules. But a lexical analyser cannot check the syntax of a given sentence due to the limitations of the regular expressions. Regular expressions cannot check balancing tokens, such as parenthesis. Therefore, this phase uses context-free grammar (CFG), which is recognised by push-down automata.
CFG, on the other hand, is a superset of Regular Grammar, as depicted below:
It implies that every Regular Grammar is also context-free, but there exists some problems, which are beyond the scope of Regular Grammar. CFG is a helpful tool in describing the syntax of programming languages.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce terminologies used in parsing technology.
A context-free grammar has four components:
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or on- terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S); from where the production begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right side of a production, for that non-terminal.
Top Down Parsers:
===============
a) Recursive Descent Parser:
- There is a recursive function for every Variable.
- Terminals have a match function
b) LL(1) Parser:
- Left Recursive grammars are not allowed
- Ambiguous grammars are not allowed
- Productions are atomic (i.e not more than one production is allowed in the table rows)
- First and Follow techniques are used to place productions in the rows of the (Variables , Terminals) table.
Shift-Reduce Parsing
==================
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as shift-step and reduce-step.
Shift step: The shift step refers to the advancement of the input pointer to the next input symbol, which is called the shifted symbol. This symbol is pushed onto the stack. The shifted symbol is treated as a single node of the parse tree.
Reduce step : When the parser finds a complete grammar rule (RHS) and replaces it to (LHS), it is known as reduce-step. This occurs when the top of the stack contains a handle. To reduce, a POP function is performed on the stack which pops off the handle and replaces it with LHS non-terminal symbol.
Bottom Up Parsers:
================
a) Operator Precedence Parser:
- Can accept Ambiguous grammars
- Productions have to be in the form of Operand-operator-Operand
- Operation relations table and Operation function table are used for Operator precedence and associativity.
- Operation relations table has space complexity of O(n^2)
- Operation functions table has space complexity of O(2n)
b) SR Parsers: (Shift Reduce Parsers : Stack is used)
- All the parsers differ only by their parsing table entries.
- All the productions uses items i,e (.) in the beginning of the RHS productions.
Ex. S —> .AA
- All the parsers use Canonical Collection of LR(0) items (i.e S’ -> .S production)
- Productions are grouped to represent “Shift” (Transition Production) and “Reduce” (Completed Productions) states.
- SR (Shift/Reduce) and RR (Reduce/Reduce) issues may appear in Parsing table.
Types of LR Parsers:
LR(0) and SLR(1) rely on the canonical LR(0) items
(i) LR(0) Parser :
- The reduce step is present across all the entries in the table.
(ii) SLR(1) Parser :
- The reduce step is only present for the First of the Left hand side production entry.
- LALR(1) and CLR(1) rely on the LR(1) item i.e LR(0) + Look head ($) symbols
- Have more number of entries and statues than the LR(0) items due to the look ahead symbols.
- Along with each production there is an Look ahead symbol associated, which is computed as First of items after . in each production for each step.
- CLR(1) parsing table size can be reduced by merging the Shift steps (State numbers), which will make the look aheads irrelevant.
- CLR(1) parsing table can be converted to LALR(1) parsing table by joining the States thereby ignoring the look ahead symbols.
(iii) LALR(1) Parser :
- The states and look aheads in CLR(1) are merged .
- May results in RR (reduce/reduce) conflict due to merging of look aheads.
(iv) CLR(1) Parser :
Notes:
=====
- Number of steps in CLR(1) >= Number of states in LR(0), SLR(1) & LALR(1) parsing table.
i.e CLR(1) >= LR(0) = SLR(1) = LALR(1)
=====================================================================
SDTs : Syntax Directed Translations are of 2 types
(i) S - Attributed SDT :
- If an SDT uses only synthesised attributes, it is called as S-attributed SDT.
- These attributes are evaluated using S-attributed SDTs that have their semantic actions written after the production (right hand side).
(ii) L - Attributed SDT :
- This form of SDT uses both synthesised and inherited attributes with restriction of not taking values from right siblings.
- In L-attributed SDTs, a non-terminal can get values from its parent, child, and sibling nodes.
As in the following production
S → ABC
S can take values from A, B, and C (synthesised). A can take values from S only. B can take values from S and A. C can get values from S, A, and B. No non-terminal can get values from the sibling to its right.
Attributes in L-attributed SDTs are evaluated by depth-first and left-to-right parsing manner.
We may conclude that if a definition is S-attributed, then it is also L-attributed as L-attributed definition encloses S-attributed definitions.
- SDTs are used for evaluating an arithmetic expression based on arithmetic precedence and associativity.
- SDTs are also used for infix to postfix translations
Ex. A+B —> AB+
=====================================================================