Initial Parser – XM3 Compiler Design

The Lexer currently returns the following tokens:

%%

"int" { return INT; }
"return" { return RETURN; }

[a-zA-Z_][a-zA-Z0-9_]* {
    yylval.str = _strdup(yytext);
    return ID;
}

[0-9]+ {
    int value = atoi(yytext);
    yylval.intval = value;
    return ICONST;
}

"+" {return PLUS;}
"-" {return MINUS;}
"/" {return DIVIDE;}
"*" {return MULT;}
"=" {return ASSIGN;}

";" { return SEMI; }
"(" { return OP_PAR; }
")" { return CL_PAR; }
"{" { return OP_BRACE; }
"}" { return CL_BRACE; }

%%

The Parser must accept each of the above tokens and recognize the valid grammar rules built from them. The first step is to declare, in the Parser, the tokens and the union used to pass values through yylval. Tokens are declared with %token and the union is declared with %union. If a token also carries a value, the name of the union member that holds it goes in angle brackets after %token. See the code below for the declarations:

%union {
	int intval;
	double doubleval;
	char charval;
	char* str;
	struct AST_Node* node;
};

%token RETURN INT 
%token SEMI OP_PAR CL_PAR OP_BRACE CL_BRACE EOL EQ 
%token PLUS MINUS MULT DIVIDE ASSIGN
%token <intval> ICONST
%token <str> ID
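
To make the link between a token and its union member more concrete, here is a minimal C sketch of what happens on the lexer side. It is not the generated code: demo_token, demo_value, demo_yylval and demo_lex are hypothetical stand-ins for the token codes, the %union, yylval and yylex(), and the token code values are arbitrary.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; bison normally generates these in the parser header. */
enum demo_token { DEMO_END = 0, DEMO_ICONST = 258, DEMO_ID = 259 };

/* Mirrors two members of the %union above. */
union demo_value {
    int intval;
    char *str;
};

/* Stand-in for the global yylval that the lexer fills in before returning. */
union demo_value demo_yylval;

/* Hand-written stand-in for yylex(): classifies one lexeme and stores its value. */
int demo_lex(const char *lexeme)
{
    assert(lexeme != NULL);
    if (lexeme[0] == '\0')
        return DEMO_END;
    if (isdigit((unsigned char)lexeme[0])) {
        demo_yylval.intval = atoi(lexeme);   /* value read by <intval> tokens */
        return DEMO_ICONST;
    }
    /* Copy the text, because the lexer reuses its buffer for the next token. */
    size_t n = strlen(lexeme) + 1;
    char *copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, lexeme, n);
    demo_yylval.str = copy;                  /* value read by <str> tokens */
    return DEMO_ID;
}
```

Calling demo_lex("42") returns DEMO_ICONST and leaves 42 in demo_yylval.intval, while demo_lex("main") returns DEMO_ID and leaves a copy of the name in demo_yylval.str. The angle-bracket tag on %token is what tells bison which of the two members to read for a given token.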

Now that the tokens have been declared, the grammar rules need to be set. As a starting point we will treat a C program as a series of functions. These functions contain statements (e.g., loop statements, if statements, or declarations), which themselves contain expressions (e.g., additions or assignments). The expressions, in turn, contain variables and numbers.

The program, functions, statements, and expressions are represented as nonterminal symbols in bison. A nonterminal is made up of tokens and other nonterminals (which in turn break down into tokens), and each arrangement forms a grammar rule. For example, the expression nonterminal breaks down as follows (note that term is also a nonterminal):

expression:
term {;}
| ID ASSIGN expression {;}
| expression PLUS expression {;}
| expression MINUS expression {;}
| expression MULT expression {;}
| expression DIVIDE expression {;}
;

term:
ICONST {;}
| ID {;}
| OP_PAR expression CL_PAR {;}
;

In the above code, “|” separates the alternatives for a nonterminal. Take the nonterminal “term”: it can be an integer (ICONST token), a variable (ID token), or an expression within parentheses (so that a parenthesized expression is evaluated before anything outside it). Note also that the expression nonterminal refers to itself. This means an expression of any length is represented by a single expression nonterminal, because the rules apply recursively until only terms and operators are left.
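
One consequence of the recursion is worth spelling out. Because every operator rule has the form "expression OP expression", the grammar alone does not say how an input such as 1 - 2 - 3 should be grouped, and bison reports shift/reduce conflicts for rules like these. The usual remedy is to add precedence declarations such as %left PLUS MINUS, then %left MULT DIVIDE, with %right ASSIGN above both. The C sketch below only illustrates why the grouping matters; demo_expr and the two functions are hypothetical and are not part of the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expression tree: a leaf holds a number, an inner node an operator. */
struct demo_expr {
    char op;                          /* '+', '-', '*', '/' or 0 for a leaf */
    int value;                        /* used only when op == 0 */
    const struct demo_expr *left;
    const struct demo_expr *right;
};

/* Evaluate a tree recursively, mirroring the way the grammar recurses. */
int demo_eval(const struct demo_expr *e)
{
    assert(e != NULL);
    if (e->op == 0)
        return e->value;
    int l = demo_eval(e->left);
    int r = demo_eval(e->right);
    switch (e->op) {
    case '+': return l + r;
    case '-': return l - r;
    case '*': return l * r;
    default:  assert(e->op == '/' && r != 0); return l / r;
    }
}

/* Evaluate "1 - 2 - 3" grouped to the left (left_grouping != 0) or to the right. */
int demo_one_minus_two_minus_three(int left_grouping)
{
    struct demo_expr one = {0, 1, NULL, NULL};
    struct demo_expr two = {0, 2, NULL, NULL};
    struct demo_expr three = {0, 3, NULL, NULL};
    if (left_grouping) {
        struct demo_expr inner = {'-', 0, &one, &two};      /* (1 - 2) */
        struct demo_expr outer = {'-', 0, &inner, &three};  /* (1 - 2) - 3 */
        return demo_eval(&outer);
    }
    struct demo_expr inner = {'-', 0, &two, &three};        /* (2 - 3) */
    struct demo_expr outer = {'-', 0, &one, &inner};        /* 1 - (2 - 3) */
    return demo_eval(&outer);
}
```

The left grouping evaluates to -4, which is what C requires, while the right grouping evaluates to 2. Without precedence declarations bison resolves each conflict by shifting, which produces the right grouping.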

To declare these nonterminals, %type is used. Their values will eventually be nodes, so that the program can be represented as an abstract syntax tree, which is why <node> is used. For now, the only statement the Lexer passes a token for is the return statement. Therefore, return_st will also be a type (more types will be required in the future):

%type <node> program function statement expression term return_st

Setting the grammar rules for these types results in:

program:
function {;}
;

function:
INT ID OP_PAR CL_PAR OP_BRACE statement CL_BRACE {;}
;

statement:
return_st {;}
;

return_st:
RETURN SEMI {;} /*Return nothing*/
| RETURN expression SEMI {;} /*Return a value*/
;

expression:
term {;}
| ID ASSIGN expression {;}
| expression PLUS expression {;}
| expression MINUS expression {;}
| expression MULT expression {;}
| expression DIVIDE expression {;}
;

term:
ICONST {;}
| ID {;}
| OP_PAR expression CL_PAR {;}
;

In the above code the grammar rules are set but no C code is executed when they are matched (each action is just {;}). In bison, $$ represents the value of the rule's result (the nonterminal on the left, which in this case is always a node), and $x represents the value of the x-th token or nonterminal on the right-hand side of the rule. For example, in the code below $$ is the value of nonterminal, $1 of TOKEN1, $2 of nonterminal2, and $3 of TOKEN2:

nonterminal:
TOKEN1 nonterminal2 TOKEN2
;
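
One way to picture $$ and $x is as slots on a stack of values that the parser keeps next to the symbols it has read. The sketch below is a simplified model, not bison's generated code: demo_stack, demo_push and demo_reduce_keep_second are hypothetical names, and every value is a plain int rather than the union.

```c
#include <assert.h>

#define DEMO_STACK_MAX 16

/* Hypothetical value stack: one slot per symbol the parser has seen so far. */
struct demo_stack {
    int slot[DEMO_STACK_MAX];
    int top;                      /* number of slots in use */
};

void demo_push(struct demo_stack *s, int value)
{
    assert(s->top < DEMO_STACK_MAX);
    s->slot[s->top++] = value;
}

/* Reduce "nonterminal: TOKEN1 nonterminal2 TOKEN2" with the action $$ = $2. */
int demo_reduce_keep_second(struct demo_stack *s)
{
    assert(s->top >= 3);
    const int *rhs = &s->slot[s->top - 3];  /* rhs[0] is $1, rhs[1] is $2, rhs[2] is $3 */
    int result = rhs[1];                    /* the action: $$ = $2 */
    s->top -= 3;                            /* pop the three right-hand-side values */
    demo_push(s, result);                   /* push $$ in their place */
    return result;
}
```

If the three topmost values are 0, 7 and 0 (the two tokens carry no value of interest), the reduction leaves a single 7 in their place. The parenthesized term rule works this way: the value of the inner expression survives and the parentheses disappear.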

Adding C code gives each type an action that calls a function to create a node (see the post on creating ast.h and ast.c). The exception is the program type, which calls ast_traverse (for now used to print the abstract syntax tree, but later used to create the next stage of the intermediate representation):

program:
function { ast_traverse($1); }
;

function:
INT ID OP_PAR CL_PAR OP_BRACE statement CL_BRACE { 
$$ = new_ast_func_node($6); 
}
;

statement:
return_st { $$ = $1; }
;

return_st:
RETURN SEMI { $$ = new_ast_return_node(0, NULL); } /*Return nothing*/
| RETURN expression SEMI { $$ = new_ast_return_node(1, $2); } /*Return a value*/
;

expression:
term { $$ = $1;}
| ID ASSIGN expression { $$ = new_ast_expr_node(new_ast_ID_node(INTtype, $1, NULL), ASSIGNop, $3); }
| expression PLUS expression { $$ = new_ast_expr_node($1, PLUSop, $3); }
| expression MINUS expression { $$ = new_ast_expr_node($1, MINUSop, $3); }
| expression MULT expression { $$ = new_ast_expr_node($1, MULTop, $3); }
| expression DIVIDE expression { $$ = new_ast_expr_node($1, DIVIDEop, $3); }
;

term:
ICONST { $$ = new_ast_const_node(INTtype, $1); }
| ID { $$ = new_ast_ID_node(INTtype, $1, NULL); }
| OP_PAR expression CL_PAR { $$ = $2; }
;
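
To show the shape of the tree these actions build, here is a small C sketch that stands in for the node functions. The real definitions live in ast.h and ast.c from the other post and are not reproduced here, so everything prefixed with demo_ (the node layout, the constructors, the traversal) is a hypothetical simplification, not the author's API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node kinds and layout; the real ones are defined in ast.h. */
enum demo_kind { DEMO_CONST, DEMO_EXPR, DEMO_RETURN };

struct demo_node {
    enum demo_kind kind;
    int value;                 /* DEMO_CONST: the number; DEMO_RETURN: 1 if a value is returned */
    char op;                   /* DEMO_EXPR: operator character */
    struct demo_node *left;    /* DEMO_EXPR: left operand; DEMO_RETURN: returned expression */
    struct demo_node *right;   /* DEMO_EXPR: right operand */
};

static struct demo_node *demo_alloc(enum demo_kind kind)
{
    struct demo_node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->kind = kind;
    return n;
}

struct demo_node *demo_const_node(int value)
{
    struct demo_node *n = demo_alloc(DEMO_CONST);
    n->value = value;
    return n;
}

struct demo_node *demo_expr_node(struct demo_node *l, char op, struct demo_node *r)
{
    struct demo_node *n = demo_alloc(DEMO_EXPR);
    n->op = op;
    n->left = l;
    n->right = r;
    return n;
}

struct demo_node *demo_return_node(int has_value, struct demo_node *expr)
{
    struct demo_node *n = demo_alloc(DEMO_RETURN);
    n->value = has_value;
    n->left = expr;
    return n;
}

/* Print the tree with indentation and return how many nodes were visited. */
int demo_traverse(const struct demo_node *n, FILE *out, int depth)
{
    if (n == NULL)
        return 0;
    fprintf(out, "%*s", depth * 2, "");
    switch (n->kind) {
    case DEMO_CONST:  fprintf(out, "const %d\n", n->value); break;
    case DEMO_EXPR:   fprintf(out, "expr %c\n", n->op); break;
    case DEMO_RETURN: fprintf(out, "return\n"); break;
    }
    return 1 + demo_traverse(n->left, out, depth + 1)
             + demo_traverse(n->right, out, depth + 1);
}

/* Release a tree built by the constructors above. */
void demo_free(struct demo_node *n)
{
    if (n == NULL)
        return;
    demo_free(n->left);
    demo_free(n->right);
    free(n);
}
```

For the input "return 1 + 2;" the actions would fire from the inside out: two constant nodes first, then the expression node that joins them, then the return node on top. In the sketch this corresponds to demo_return_node(1, demo_expr_node(demo_const_node(1), '+', demo_const_node(2))), a tree of four nodes.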

At this point the union, tokens, and types are declared, and the grammar rules have their corresponding C code. The Parser still needs a lot of expansion (e.g., more operators in expressions, more statements, and functions that contain multiple statements), but a starting point has been set. The last step in setting up the Parser is to name the start symbol of the grammar with %start; without it, bison uses the first nonterminal in the grammar rules. For our compiler this is the program type, as all the tokens passed to the parser make up a single program:

%start program