aidannearing – XM3 Compiler Design

Symbols Breakdown

Details of variables and functions are stored in the Complier’s symbol table so that the Complier can later access information for the symbol. The Complier uses the information in the symbol table to help perform semantic analysis and generate assembly code.

The first field of the symbol table is a string for the symbol name. The symbol name is stored as it helps the Complier find the symbol in the symbol table:

#define MAXIDLEN 31 // max length of identifier (defined by C standard)

typedef struct symbol {
	char id[MAXIDLEN];
}symbol_t;

Sometimes the symbol table will create a collision, in this case a linked list at that index of the symbol table is required. Therefore, each symbol has a field to allow the creation of a linked list:

typedef struct symbol {
	char id[MAXIDLEN];
	struct symbol* next;
}symbol_t;

For now the symbols can either be for a variable or a function (may have more types in the future). This means a field is needed to tell the Complier whether the symbol is for a variable or function:

typedef struct symbol {
	char id[MAXIDLEN];
	int sym_type;
	struct symbol* next;
}symbol_t;

For semantic analysis, it is also important to store what variable type the symbol is for:

// Enum used for the different C variable types
typedef enum Variable_t
{
	CHARtype,
	DOUBLEtype,
	FLOATtype,
	INTtype,
	LONGtype,
	SHORTtype,
	VOIDtype,
	STRtype,
	NUMtotal //Stores how many variables types are handled
} VariableType;

typedef struct symbol {
	char id[MAXIDLEN];
	int sym_type;
	VariableType data_type;
	struct symbol* next;
}symbol_t;

In the creation of the symbol table, it was decided that each scope of the C program should have its own scope. This means that the Complier must know the name and scope of a symbol to locate it:

// Updates to ast.h
/* Function Nodes */
typedef struct AST_Node_Func_t {
	Node_Type type;
	char* funcname;
	scope_t* scope;
	AST_Node* statement_list;
	AST_Node* param_list;
} AST_Node_Func;

typedef struct AST_Node_Func_Call_t {
	Node_Type type;
	char* funcname;
	AST_Node* param_list;
	scope_t* scope;
} AST_Node_Func_Call;

typedef struct AST_Node_ID_t {
	Node_Type type;
	char* varname;
	int vartype;
	scope_t* scope;
	Value val;
} AST_Node_ID;

// Update to symtab.h
typedef struct symbol {
	char id[MAXIDLEN];
	int scope;
	int sym_type;
	VariableType data_type;

	struct symbol* next;
}symbol_t;

Variables will have values stored in memory at a relative offset to the base pointer of a stack frame. To allow the complier to access variables stored in memory, a field for the base pointer offset was added (note this field is used for variables, it is not used for functions):

typedef struct symbol {
	char id[MAXIDLEN];
	int scope;
	int sym_type;
	VariableType data_type;
	int bp_offset;
	struct symbol* next;
}symbol_t;

Functions won’t have values stored in memory like variables; however, it is useful to store how many parameters are passed to the function (i.e., allows the Complier to know how many variables must be popped from the stack in a return from a function call). Therefore, a field was added to store how many parameters a function has (this will have to be updated in the future to account for parameter types):

typedef struct symbol {
	char id[MAXIDLEN];
	int scope;
	int sym_type;
	VariableType data_type;
	int bp_offset;
	int param_count;
	struct symbol* next;
}symbol_t;

Finally, a field was added to store the values of variables:

typedef union Value {
	int ival;
	double dval;
	long lval;
	float fval;
	char cval;
	char* sval;
	short shval;
} Value;

typedef struct symbol {
	char id[MAXIDLEN];
	int scope;
	int sym_type;
	VariableType data_type;
	Value value;
	int bp_offset;
	int param_count;
	struct symbol* next;
}symbol_t;

Function Calls

Although the Parser can handle functions, it cannot handle function calls. Therefore, a new token called func_call_st was created. The expected grammar for a function call is the function name followed by parameters contained within parenthesis:

// Update to code in Parser.y
statement:
return_st { $$ = $1; } 
| declaration_st SEMI { $$ = $1; }
| func_call_st SEMI { $$ = $1; }
| expression SEMI { $$ = $1; }
;

func_call_st:
ID OP_PAR call_parameters CL_PAR { 
	$$ = new_ast_func_call_node($1, $3); 
}
;

See below for the code added to create function call nodes in the abstract syntax tree:

// Update to code in ast.h
typedef struct AST_Node_Func_Call_t {
	Node_Type type;
	char* funcname;
	AST_Node* param_list;
	scope_t* scope;
} AST_Node_Func_Call;

// Update to code in ast.c
AST_Node* new_ast_func_call_node(char* fname, AST_Node* param_list) {
	AST_Node_Func_Call* node = malloc(sizeof(AST_Node_Func_Call));

	node->type = FUNC_CALL_NODE;
	node->funcname = _strdup(fname);
	node->param_list = param_list;
	node->scope = current_scope;
	return (AST_Node*)node;
}

Note that another new token besides func_call_st is used, called call_parameters. Although the parameters token was created previously, it cannot be reused as parameters in function calls do not include the variable type. Due to this, the call_parameters/call_parameter tokens were created the same as the parameters/parameter tokens except variable types weren’t included:

// Update to code in Parser.y
call_parameters:
call_parameters call_parameter {
	AST_Node_ParamList* temp_paramlist = (AST_Node_ParamList*) $1;
	$$ = new_ast_paramlist_node($1, $2, temp_paramlist->param_cnt + 1); 
}
| call_parameter  { $$ = new_ast_paramlist_node(NULL, $1, 1); }
| VOID { $$ = NULL; }
| { $$ = NULL; } /*No paramters passed*/
;

call_parameter:
variable COMMA { $$ = new_ast_call_param_node($1); }
| variable { $$ = new_ast_call_param_node($1); }
;

The parameter list node created for the parameters token is reused (list to store an unknown amount of parameters for a function or function call); however, a new abstract syntax tree node was created for call parameters (same as parameter nodes, except variable type is not stored):

// Update to code in ast.h
typedef struct AST_Node_Call_Param_t {
	Node_Type type;
	AST_Node* ID;
} AST_Node_Call_Param;

// Update to code in ast.c
AST_Node* new_ast_call_param_node(AST_Node* ID) {
	AST_Node_Call_Param* node = malloc(sizeof(AST_Node_Call_Param));

	node->type = CALL_PARAM_NODE;
	node->ID = ID;
	return (AST_Node*)node;
}

Code to insert call parameters into the symbol table will need to be added in the future for the Complier to be able to access the parameters when generating assembly code.

Function Parameters

Using the same method as was done with declarations, the grammar for functions was updated in the Parser to account for multiple functions in a program:

functions:
functions function {
	AST_Node_StateList* temp_funclist = (AST_Node_StateList*) $1;
	$$ = new_ast_statelist_node($1, $2, temp_funclist->statement_cnt + 1); 
}
| function  { 
	incr_scope();
	$$ = new_ast_statelist_node(NULL, $1, 1); 
}
;

However, the grammar for a single function still does not account for parameters (see that the grammar only looks for an open and a close parenthesis with nothing in between):

function:
INT ID OP_PAR CL_PAR OP_BRACE statement CL_BRACE {
      $$ = new_ast_func_node($2,$6); }
;

The updated grammar for a function that accounts for parameters will look like the following with the addition of a new token called parameters:

// Additions to Parser.y
function:
var_type ID OP_PAR parameters CL_PAR OP_BRACE statements CL_BRACE {
      $$ = new_ast_func_node($2,$4,$7); }
;

This update requires the new_ast_func_node() function to now take an additional parameter for the parameters token:

// Additions to ast.c
AST_Node* new_ast_func_node(char* fname, AST_Node* param_list, AST_Node* state_list) {
	AST_Node_Func* node = malloc(sizeof(AST_Node_Func));

	node->type = FUNC_NODE;
	node->funcname = _strdup(fname);
	node->statement_list = state_list;
	node->param_list = param_list;
	node->scope = current_scope;
	return (AST_Node*)node;
}

As functions will have an unknown amount of parameters, a parameter list node is created for each function. The design of the parameter list node is the same as the previously created statement list. Function parameters in C are a variable type followed by a variable name. Therefore, the parameter nodes found within a parameter list were designed to contain the parameter type and name:

// Additions to ast.h
typedef struct AST_Node_ParamList_t {
	Node_Type type;
	AST_Node* paramlist;
	AST_Node* param;
	int param_cnt;
} AST_Node_ParamList;

typedef struct AST_Node_Param_t {
	Node_Type type;
	VariableType vartype;
	AST_Node* ID;
} AST_Node_Param;

Functions were made to create parameter list and parameter nodes:

// Additions to ast.c
AST_Node* new_ast_param_node(int vtype, AST_Node* ID) {
	AST_Node_Param* node = malloc(sizeof(AST_Node_Param));

	node->type = PARAM_NODE;
	node->ID = ID;
	node->vartype = vtype;
	return (AST_Node*)node;
}

AST_Node* new_ast_paramlist_node(AST_Node* prmlist, AST_Node* prm, int prmcnt) {
	AST_Node_ParamList* node = malloc(sizeof(AST_Node_ParamList));

	node->type = PARAMLIST_NODE;
	node->paramlist = prmlist;
	node->param = prm;
	node->param_cnt = prmcnt;
	return (AST_Node*)node;
}

A function can either have multiple parameters, a single parameter, or no parameters (marked by empty space or void). Therefore, the grammar for the parameters token was made to account for these four possibilities (note, grammar resembles that of the statements token):

parameters:
parameters parameter {
	AST_Node_ParamList* temp_paramlist = (AST_Node_ParamList*) $1;
	$$ = new_ast_paramlist_node($1, $2, temp_paramlist->param_cnt + 1); 
}
| parameter  { 	
	$$ = new_ast_paramlist_node(NULL, $1, 1); 
}
| VOID { 
	$$ = NULL; 
}
| { /*No paramters passed*/
	$$ = NULL; 
} 
;

The token parameter must account for parameters in the function. As previously mentioned, parameters in a C function are variable types followed by variable names; however, if there are multiple parameters, they will be separated with commas. The grammar for the parameter token was created to account for both parameters that aren’t and are separated by commas:

parameter:
var_type variable COMMA { 
	AST_Node_ID* temp_ID = (AST_Node_ID*) $2;
	$$ = new_ast_param_node($1, $2); 
}
| var_type variable { 
	AST_Node_ID* temp_ID = (AST_Node_ID*) $2;
	$$ = new_ast_param_node($1, $2); 
}
;

Future updates for parameters include inserting the parameters into the symbol table and accounting for function call parameters.

Hash Table and Scopes

For the Complier it was decided to use a hash table for the symbol table due to hash tables having fast lookup speeds. Hash tables are arrays paired with a hash function that uses symbol names to calculate the index of the array where the symbol is to be stored:

The hash function calculates the index to store the symbol by summing the decimal values of the ASCII characters in the symbol name, then taking the remainder of the sum divided by the size of the hash table’s array. See below for the function used to calculate the index of the hash table to store a symbol:

unsigned int hash(char* key) {
	unsigned int hashval = 0;

	// sum ASCII values of characters
	while (*key) {
		hashval += *key;
		key++;
	}
	return hashval % COLUMN_SIZE; // return modulo on size of table
}

It is possible to have what is called a collision where the same index is calculated for different symbol names. In this case a linked list is created at that index of the hash table. For example, say val1 and val2 ended up having the same index calculated from the hash function, the table would look like the figure below:

This allows for multiple values to be stored at a single index of the hash table. The downside of collisions is that they increase the lookup time as the Compiler must perform a linear search of each entry of the linked list to look for specific a symbol name.

Another important thing to consider is that a C program can have multiple symbols that share a name if they are in different scopes of the program. This will cause a collision issue as the same index will be calculated for each symbol; however, the names of the symbols will be the same so a search through a linked list cannot be used as done before . To overcome this, it was decided to give each scope of the program it’s own symbol table.

Each scope contains the next scope, the previous scope, sub scope, the scope number (global would be 0, number increases the further away it is from the global scope), and a symbol table:

// Addition to symtab.h
typedef struct scope
{
	struct scope* next;
	struct scope* prev;
	struct scope* sub_scope;
	int scope_num;
	symbol_t** table;
}scope_t;

Functions were created to initialize, increment, and decrement scopes:

// Addition to symtab.c
scope_t* init_scope() {
	// allocate memory
	scope_t* scp = (scope_t*)calloc(1, sizeof(scope_t));
	scp->next = NULL;
	scp->prev = NULL;
	scp->sub_scope = NULL;
	scp->scope_num = curr_scope;
	scp->table = (symbol_t**)calloc(COLUMN_SIZE, sizeof(symbol_t*));
	return scp;
}

void incr_scope() {
	scope_t* scp = init_scope();
	current_scope->next = scp;
	scp->prev = current_scope;
	current_scope = scp;
	curr_scope++;
}

extern void decr_scope() {
	current_scope = current_scope->prev;
	curr_scope--;
}

The main function was updated to create a global and main scope:


// Addition to main.c
int curr_scope = 0;
int curr_sub_scope = 0;
int ROW_SIZE = 10;
int COLUMN_SIZE = 20;
float RESIZE = 0.5;

scope_t* current_scope;
scope_t* global_scope;
scope_t* main_scope;

int main(int argc, char* argv[]) {
	global_scope = init_scope();
	current_scope = global_scope;
	incr_scope();
        ...

The Parser was also updated to increment the scope whenever the grammar for a function is encountered (note, may need to add the incr_scope() function in other areas of the Parser as the grammar of the Compiler grows):

// Addition to Parser.y
functions:
functions function {
	AST_Node_StateList* temp_funclist = (AST_Node_StateList*) $1;
	$$ = new_ast_statelist_node($1, $2, temp_funclist->statement_cnt + 1); 
}
| function  { 
	incr_scope();
	$$ = new_ast_statelist_node(NULL, $1, 1); 
}
;

Purpose of the Symbol Table

The Compiler must be able to store information about symbols found within a C program. Symbols can include variables, function names, other objects and more. Information of these symbols must be stored as it allows the Complier to generate assembly code and the information can be used for semantic analysis.

For example, say a function call is performed with the parameters val1, val2, and val3. These parameters will be stored in a stack frame in a relative location to the base pointer for that stack frame. These relative location allow the Complier to access the parameters in the function call. These relative locations that are vital to the Complier being able to access parameters and are just one of the details stored in the symbol table.

Other details stored in the symbol table include:

Symbol Name
Symbol Scope
Symbol Type (i.e., variable or function name)
Symbol Variable Type
Symbol Value
Symbol Address
Symbol Base Pointer Offset (mainly used for variables)
Symbol Parameters Count (mainly used for functions)

Declaration Parser Grammar/AST Code

The initial Parser’s grammar must be expanded to account for many types of statements. The first will be declaration statements. Starting with the most basic C declaration statement, there is a variable type followed by a variable name:

int var_name;

First, the grammar was updated by adding a token for variables. Just using the previously created term token won’t work as it also allows constant and expressions within parenthesis which should not be in the left side of a declaration statement. Note for now only integer variables are handled:

term:
ICONST { $$ = new_ast_const_node(INTtype, $1); }
| variable
| OP_PAR expression CL_PAR { $$ = $2; }
;

variable:
ID  { $$ = new_ast_ID_node(INTtype, $1, $1); }
;

Using the previously created var_type token, the starting grammar for a declaration statement that accounts for a single variable with no value being assigned is:

declaration_st:
var_type variable SEMI { ; }
;

However, sometimes declarations assign a value, such as the following:

int var_name = 1;

This means that another token is needed that can be a variable or a variable assigned to a value. This token is called a declaration:

declaration_st:
var_type declaration SEMI { ; }
;

declaration:
variable { ; }
| variable ASSIGN expression { ; }
;

Now out parser can recognize a declaration statement for a single variable. Yet, it cannot handle a declaration statement for multiple statements such as the below example:

int var_name1, var_name2 = 100, var_name3;

To handle multiple declarations in a single line of C code, another new token called declarations is used. The declarations tokens allows for multiple comma separated declaration tokens to be handled:

declaration_st:
var_type declarations SEMI { ; }
;

declarations:
declarations declaration { ; }
| declaration { ; }
;

declaration:
variable { ; }
| variable COMMA { ; }
| variable ASSIGN expression { ; }
| variable ASSIGN expression COMMA { ; }

;

The parser can now recognize declaration statements, but the code to be executed when the declaration grammar is encountered must be added. Since a single line of C code can declare multiple variables, it is unknown how many AST nodes will be needed for a single declaration statement. For this, an AST (Abstract Syntax Tree) structure is created to store a list of declarations is used (fields are variable type and a pointer to a list of individual declarations):

// Addition to ast.h
typedef struct AST_Node_Declaration_List_t {
	Node_Type type;
	int vartype;
	AST_Node* decls;
} AST_Node_Decl_List;

// Addition to ast.c
AST_Node* new_ast_decl_list_node(int vtype, AST_Node* decls)
{
	AST_Node_Decl_List* node = malloc(sizeof(AST_Node_Decl_List));

	node->type = DECLS_NODE;
	node->vartype = vtype;
	node->decls = decls;

	return (AST_Node*)node;
}

Also, a structure must be created for the individual declarations stored in the declration list. The fields for this structure are the variable type, a node pointing to the variable, and a value (NULL unless variable is assigned a value in declaration):

// Additions to ast.h
typedef struct AST_Node_Declaration_t {
	Node_Type type;
	int vartype;
	AST_Node* var;
	AST_Node* val;
} AST_Node_Decl;

// Additions to ast.c
AST_Node* new_ast_decl_node(int vtype, AST_Node* var, AST_Node* val)
{
	AST_Node_Decl* node = malloc(sizeof(AST_Node_Decl));

	node->type = DECL_NODE;
	node->vartype = vtype;
	node->var = var;
	node->val = val;

	return (AST_Node*)node;
}

Adding the C code to the Parser leads to the following:

declaration_st:
var_type declarations { $$ = new_ast_decl_list_node($1, $2); }
;

declarations:
declarations declaration {
	AST_Node_StateList* temp_decl_list = (AST_Node_StateList*) $1;
	$$ = new_ast_statelist_node($1, $2, temp_decl_list->statement_cnt + 1); 
}
| declaration { $$ = new_ast_statelist_node(NULL, $1, 1); }
;

/* $2 -> declaration name, $1 -> declaration type*/
declaration:
variable 
{
	AST_Node_ID* temp_ID = $1;
	insert(temp_ID->varname, SYMTAB_LOCAL, decl_type, 0); 
	$$ = new_ast_decl_node(decl_type, $1, NULL); 
}
| variable COMMA
{
	AST_Node_ID* temp_ID = $1;
	insert(temp_ID->varname, SYMTAB_LOCAL, decl_type, 0); 
	$$ = new_ast_decl_node(decl_type, $1, NULL); 
}
| variable ASSIGN expression 
{
	AST_Node_ID* temp_ID = $1;
	insert(temp_ID->varname, SYMTAB_LOCAL, decl_type, 0); 
	$$ = new_ast_decl_node(decl_type, $1, $3);
}
| variable ASSIGN expression COMMA
{
	AST_Node_ID* temp_ID = $1;
	insert(temp_ID->varname, SYMTAB_LOCAL, decl_type, 0); 
	$$ = new_ast_decl_node(decl_type, $1, $3);
}
;

Now the grammar of the Compiler can handle C declaration statements.

Expanding Variable Types

Looking at the grammar for a function in the Parser, it is observed that the only valid variable type is an integer (INT):

function:
INT ID OP_PAR CL_PAR OP_BRACE statements CL_BRACE {  
     $$ = new_ast_func_node($6); 
}
;

The compiler must be able to handle more variable types such as characters or doubles. The first step to adding new variable types is to create an enum used to distinguish between the different variable types:

typedef enum Variable_t
{
	CHARtype,
	DOUBLEtype,
	FLOATtype,
	INTtype,
	LONGtype,
	SHORTtype,
	VOIDtype
} VariableType;

Then, the Lexer is updated so that it can return tokens for the different variable types as well as their corresponding enum values:

"char" {
    yylval.intval = CHARtype; 
    return CHAR;
}
"double" {
    yylval.intval = DOUBLEtype; 
    return DOUBLE;
}
"float" {
    yylval.intval = FLOATtype; 
    return FLOAT;
}
"int" {
    yylval.intval = INTtype; 
    return INT;
}
"long" {
    yylval.intval = LONGtype; 
    return LONG;
}
"short" {
    yylval.intval = SHORTtype; 
    return SHORT;
}
"void" {
    yylval.intval = VOIDtype; 
    return VOID;
}

Next, the Parser must be updated. First, the tokens must be added to the Parser (must be <intval> tokens as they are also passing the enum values through the yylval union):

%token <intval> CHAR DOUBLE FLOAT INT LONG SHORT VOID /*Variable Types*/

Now, the compiler is able to pass different variable types to the Parser; however, the grammar rules within the Parser still need to be updated. One solution would to be have 7 rules for function where the variable type changes each time. This is inefficient as every time the compiler has a grammar rule with a variable type in it, multiple lines would be required. Therefore, a new %type <intval> called “var_type” will be created:

%type <intval> var_type

The grammar for var_type is set so that it can be any variable type ($1 in this case is the enum value passed through yylval):

var_type:
CHAR { $$ = $1; }
| DOUBLE { $$ = $1; }
| FLOAT { $$ = $1; }
| INT { $$ = $1; }
| LONG { $$ = $1; }
| SHORT { $$ = $1; }
| VOID { $$ = $1; }
;

With var_type introduced, the grammar rule for function can be updated to account for multiple variable types without needing to add more lines:

function:
var_type ID OP_PAR CL_PAR OP_BRACE statements CL_BRACE { 
    $$ = new_ast_func_node($6); 
}
;

The addition of the var_type will not only help with the existing grammar for functions but will also prove useful for other grammar rules with variable types that are to be added in the future (i.e., declarations).

Adding Recursive Statement Nodes

Taking a look at how the function and statement types were previously implemented in the Parser, it is observed that a function can only hold one statement node:

function:
INT ID OP_PAR CL_PAR OP_BRACE statement CL_BRACE {;}
;

statement:
return_st {;}
;

In reality, a function will contain many statements, so the Parser must be modified. To allow a function to contain multiple statements, a statement list node will be used. The statement list node will contain a pointer to a statement node as well as a pointer to another statement list node:

Since a statement list node contains a pointer to another statement list node, any amount of statements can be contained within a single statement list tree. Therefore, if the function node contains a statement list node rather than a statement node, any amount of statement nodes can be contained within a single function.

Another field to be implemented in the statement list node is the statement count. The statement count represents how many statement nodes can be found from traversing the tree starting at that particular statement list node. The statement count will be useful later when translating to the XM3 language as the CEX instruction (conditional execution) needs to know the amount of instructions to run/not run. Below shows a statement list tree that uses the statement count:

Now, to implement the statement list node into the compiler. First, a struct for a statement list node is needed for the abstract syntax tree. It will contain fields for the node type, a statement list node pointer, a statement node pointer, and an integer for the statement count:

typedef struct AST_Node_StateList_t {
	Node_Type type;
	AST_Node* statelist;
	AST_Node* statement;
	int statement_cnt;
} AST_Node_StateList;

Then, the Parser needs recursive grammar to enable the creation of statement list nodes. This will require a new %type <node> called statements (refer to the post “Initial Parser” for an explanation of bison types). The two possible grammars for a statements type will be:

A statements type followed by a statement type
A single statement type.

This allows a statements type to represent any amount of subsequent statement types as “statements” is replaced with “statement” or “statements statement” where “statements” can continue to be replaced. See the below code for the implementation of the “statements” type grammar in the Parser (note: function now uses the statements type rather than the statement type):

function:
INT ID OP_PAR CL_PAR OP_BRACE statements CL_BRACE {  
     $$ = new_ast_func_node($6); }
;

statements:
statements statement {;}
| statement {;}
;

The example below shows a visual representation of the creation of a statements type. Say 4 subsequent statement types are passed to the Parser (note statement1 is passed to the Parser first, and statement4 is passed to the Parser last):

The Parser has been updated to have recursive grammar; however, it still needs C code to create the statement list nodes. For this, the below function is used to create statement list nodes:

AST_Node* new_ast_statelist_node(AST_Node* stmlist, AST_Node* stm, int stmcnt)
{
	AST_Node_StateList* node = malloc(sizeof(AST_Node_StateList));

	node->type = STATELIST_NODE;
	node->statelist = stmlist;
	node->statement = stm;
	node->statement_cnt = stmcnt;
	return (AST_Node*)node;
}

Now, it is needed to update the C code in the actual Parser. Again, two cases must be considered as the grammar for the statements type can either be “statements statement” or just “statement”.

First, take the case where the statements type is just “statement”. In this case, only a single statement needs to be handled. Therefore, the pointer to the next statement list node should be NULL. Also, since there are no further statement list nodes, the statement count should be 1. Therefore, the Parser will call the function new_ast_statelist_node with the parameters of statement list node pointer = NULL, statement node pointer = $1 (pointer to statement node returned from statement type), and statement count = 1:

statements:
statements statement {;}
| statement { $$ = new_ast_statelist_node(NULL, $1, 1);}
;

Then, the next case of the statements type being “statements statement”. Take the previous example of there being 4 subsequent statement types. Initially the Parser will only see the first statement. This will be handled as described above for the “statement” case. Then, when the Parser is passed the second statement it will see the grammar as:

In this case, statements1 is a statement list node where the statement node pointer is to the node for statement1 and statement2 is a statement node. Therefore, the statement list node created for statement2 should have the next statement list node pointer point to statements1, should have the statement node pointer point to statement2, and should have the statement count be 1 greater than that of the statement count in statements1. See below for a visual representation of the abstract syntax tree nodes being created :

Implementing the case of “statements statement” in the Parser requires the statement count field node from statements node to be known. To do this, a temporary node is created to access the field. See below for the implementation of the “statements statement” case:

statements:
statements statement { 
    AST_Node_StateList* temp_statelist = (AST_Node_StateList*) $1;
    $$ = new_ast_statelist_node($1, $2, temp_statelist->statement_cnt + 1); 
}
| statement { $$ = new_ast_statelist_node(NULL, $1, 1); }
;

The compiler can now handle a function with multiple statements. Running a test, the abstract syntax tree is created as expected:

Adding More Operators and Operator Precedence

When the Parser was first set up it only was able to recognize the =, +, -, *, and / operators. More operators are needed to create a C compiler such as logical and bitwise operators. First, the Lexer must be updated to return tokens for these operators. For this, first the enum of operator types was expanded:

typedef enum Operator_t
{
	PLUSop,
	MINUSop,
	MULTop,
	DIVop,
	MODop,
	EQop,
	NOTEQop,
	GTop,
	LTop,
	GTEQop,
	LTEQop,
	LSHIFTop,
	RSHIFTop,
	BITANDop,
	BITXORop,
	BITORop,
	LOGANDop,
	LOGORop,
	ASSIGNop,
	PLUS_ASSIGNop,
	MIN_ASSIGNop,
	MULT_ASSIGNop,
	DIV_ASSIGNop,
	MOD_ASSIGNop,
	LSHIFT_ASSIGNop,
	RSHIFT_ASSIGNop,
	BITAND_ASSIGNop,
	BITXOR_ASSIGNop,
	BITOR_ASSIGNop
} Operator;

Then, the Lexer was updated to return tokens for each of the operators. Note that these operators now also return an integer to be used to set the operator field in an expression node:

%%

"+" {
    yylval.intval = PLUSop;
    return PLUS;
}
"-" {
    yylval.intval = MINUSop;
    return MINUS;
}
"/" {
    yylval.intval = DIVop;
    return DIVIDE;
}
"*" {
    yylval.intval = MULTop;
    return MULT;
}
"==" {
    yylval.intval = EQop;
    return EQ_COMP;
}
"+=" {  
    yylval.intval = PLUS_ASSIGNop;
    return PLUS_ASSIGN;
}
"-=" {  
    yylval.intval = MIN_ASSIGNop;
    return MIN_ASSIGN;
}
"/=" {  
    yylval.intval = DIV_ASSIGNop;
    return DIV_ASSIGN;
}
"*=" {  
    yylval.intval = MULT_ASSIGNop;
    return MULT_ASSIGN;
}
"%=" {  
    yylval.intval = MOD_ASSIGNop;
    return MOD_ASSIGN;
}
"&=" {  
    yylval.intval = BITAND_ASSIGNop;
    return BITAND_ASSIGN;
}
"|=" {  
    yylval.intval = BITOR_ASSIGNop;
    return BITOR_ASSIGN;
}
"^=" {  
    yylval.intval = BITXOR_ASSIGNop;
    return BITXOR_ASSIGN;
}
"<<=" {  
    yylval.intval = LSHIFT_ASSIGNop;
    return LSHIFT_ASSIGN;
}
">>=" {  
    yylval.intval = RSHIFT_ASSIGNop;
    return RSHIFT_ASSIGN;
}
"!=" {
    yylval.intval = NOTEQop;
    return NOTEQ_COMP;
}
"<<" {
    yylval.intval = LSHIFTop;
    return L_SHIFT;
}
">>" {
    yylval.intval = RSHIFTop;
    return R_SHIFT;
}
">=" {
    yylval.intval = GTEQop;
    return GTEQ_COMP;
}
"<=" { 
    yylval.intval = LTEQop;
    return LTEQ_COMP;
}
">" {
    yylval.intval = GTop;
    return GT_COMP;
}
"<" {  
    yylval.intval = LTop;
    return LT_COMP;
}
"=" {  
    yylval.intval = ASSIGNop;
    return ASSIGN;
}
"%" {
    yylval.intval = MODop;
    return MOD;
}
"&&" {
    yylval.intval = LOGANDop;
    return LOGAND;
}
"&" {
    yylval.intval = BITANDop;
    return BITAND;
}
"||" {
    yylval.intval = LOGORop;
    return LOGOR;
}
"|" {
    yylval.intval = BITORop;
    return BITOR;
}
"^" {
    yylval.intval = BITXORop;
    return BITXOR;
}

Updating the token declarations in the Parser results in:

%token <intval> ASSIGN PLUS_ASSIGN MIN_ASSIGN MULT_ASSIGN DIV_ASSIGN MOD_ASSIGN LSHIFT_ASSIGN RSHIFT_ASSIGN BITAND_ASSIGN BITXOR_ASSIGN BITOR_ASSIGN

%token <intval> BITAND BITXOR BITOR LOGAND LOGOR
%token <intval> EQ_COMP NOTEQ_COMP
%token <intval> GT_COMP LT_COMP GTEQ_COMP LTEQ_COMP
%token <intval> R_SHIFT L_SHIFT
%token <intval> PLUS MINUS
%token <intval> MULT DIVIDE MOD
%token <intval> INCR

Then, the grammar for the expression type must be updated to account for these new operators. Note that now we can use $2 in the create_new_ast_expr_node function rather than an enum as the integer is passed from the Lexer:

expression:
term { $$ = $1; }
| variable ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable PLUS_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable MIN_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable MULT_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable DIV_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable MOD_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable LSHIFT_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable RSHIFT_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable BITAND_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable BITXOR_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| variable BITOR_ASSIGN expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression LOGOR expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression LOGAND expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression BITOR expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression BITXOR expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression BITAND expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression EQ_COMP expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression NOTEQ_COMP expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression GT_COMP expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression LT_COMP expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression GTEQ_COMP expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression LTEQ_COMP expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression R_SHIFT expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression L_SHIFT expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression PLUS expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression MINUS expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression MULT expression { $$ = new_ast_expr_node($1, $2, $3); }
| expression DIVIDE expression { $$ = new_ast_expr_node($1, $2, $3); }
;

Now are Parser can handle more operators; however, there is no operator precedence. In C, operator precedence is determines what operators should be evaluated before others. For example, take the expression 1*2+3. The multiplication of 1 and 2 should occur first; however, without operator precedence, the compiler evaluates the addition first:

To add operator precedence in bison, the %left and %right tokens can be used. %left is used for left to right precedence (operators on the same precedence level occur left to right) and %right is the inverse of %left. Precedence levels are set up as follows (the lower they are, the earlier they will be evaluated):

%right ASSIGN PLUS_ASSIGN MIN_ASSIGN MULT_ASSIGN DIV_ASSIGN MOD_ASSIGN LSHIFT_ASSIGN RSHIFT_ASSIGN BITAND_ASSIGN BITXOR_ASSIGN BITOR_ASSIGN
%left LOGOR
%left LOGAND
%left BITOR
%left BITXOR 
%left BITAND 
%left EQ_COMP NOTEQ_COMP
%left GT_COMP LT_COMP GTEQ_COMP LTEQ_COMP
%left R_SHIFT L_SHIFT
%left PLUS MINUS
%left MULT DIVIDE MOD

Now running the same expression results in the multiplication occurring first:

There are still more operators to be added but for now a good base has been set and operator precedence has been introduced.

Initial Parser

From the Lexer, we are currently returning the following tokens:

%%

"int" { return INT; }
"return" { return RETURN; }

[a-zA-Z_][a-zA-Z0-9_]* {
    yylval.str = _strdup(yytext);
    return ID;
}

[0-9]+ {
    int value = atoi(yytext);
    yylval.intval = value;
    return ICONST;
}

"+" {return PLUS;}
"-" {return MINUS;}
"/" {return DIVIDE;}
"*" {return MULT;}
"=" {return ASSIGN;}

";" { return SEMI; }
"(" { return OP_PAR; }
")" { return CL_PAR; }
"{" { return OP_BRACE; }
"}" { return CL_BRACE; }

%%

The Parser must be able to take in each of the above tokens and recognize valid grammar rules. The first step is declaring the tokens in the Parser as well as the union used to pass values through yylval. Tokens are declared with %token and the union is declared with %union. If a token is also passing a value through the union it must contain the data type within angle brackets. See the code below for the declarations:

%union {
	int intval;
	double doubleval;
	char charval;
	char* str;
	struct AST_Node* node;
};

%token RETURN INT 
%token SEMI OP_PAR CL_PAR OP_BRACE CL_BRACE EOL EQ 
%token PLUS MINUS MULT DIVIDE ASSIGN
%token <intval> ICONST
%token <str> ID

Now that the tokens have been declared, grammar rules need to be set. For a starting point we will be treating a C program as a series of functions. These functions contain statements (i.e., loop statements, if statements, or declarations) which themselves contain expressions (i.e., additions or assignments). Moreover the expressions contain variables and numbers.

The program, functions, statements, and expressions will need to be represented as nonterminal tokens in bison. These nonterminal tokens will be made up of tokens and other nonterminal tokens (which breakdown into tokens) to represent valid grammar rules. For example, the expression nonterminal token would breakdown as follows (note that term is also a nonterminal token):

expression:
term {;}
| variable assign expression {;}
| expression PLUS expression {;}
| expression MINUS expression {;}
| expression MULT expression {;}
| expression DIVIDE expression {;}
;

term:
ICONST {;}
| ID {;}
| OP_PAR expression CL_PAR {;}
;

In the above code the “|” is used to represent alternative options for a nonterminal token. Take the nonterminal token “term”. It can be an integer (ICONST token), a variable (ID token) or an expression within parenthesis (used so that expressions within parenthesis occur before those outside). It is also important to note that the expression nonterminal token contains itself. This means any amount expressions will be represented by a single expression nonterminal token as recursion will occur until only terms and operators are left.

To declare these nonterminal tokens, %type is used. These %types will eventually contain nodes so that the program can be represented as an abstract syntax tree, so <node> will be used. For now, the only statement type the Lexer is passing a token for is a return statement. Therefore, return_st will also be a type (more types will be required in the future):

%type <node> program function statement expression term return_st

Setting the grammar rules for these types results in:

program:
function {;}
;

function:
INT ID OP_PAR CL_PAR OP_BRACE statement CL_BRACE {;}
;

statement:
return_st {;}
;

return_st:
RETURN SEMI {;} /*Return nothing*/
| RETURN expression SEMI {;} /*Return a value*/
;

expression:
term {;}
| variable ASSIGN expression {;}
| expression PLUS expression {;}
| expression MINUS expression {;}
| expression MULT expression {;}
| expression DIVIDE expression {;}
;

term:
ICONST {;}
| ID {;}
| OP_PAR expression CL_PAR {;}
;

In the above code the grammar rules are set but no C code is executed when they are encountered. In bison, $$ represents the output (nonterminal token which in this case is always a node), and $x (where x is the order of the token/nonterminal token in the grammar rule) to represent one of the parts of the grammar. For example, in the below code $$ = nonterminal, $1 = TOKEN1, $2 = nonterminal2, and $3 = TOKEN2:

nonterminal:
TOKEN1 nonterminal2 TOKEN2
;

Adding C code results in the types having functions to creating nodes (see post for creating ast.h and ast.c) except for the program type which calls the function ast_traverse (for now used to print out the abstract syntax tree, but later will be used to create next stage of intermediate representation):

program:
function { ast_traverse($1); }
;

function:
INT ID OP_PAR CL_PAR OP_BRACE statement CL_BRACE { 
$$ = new_ast_func_node($6); 
}
;

statement:
return_st { $$ = $1; }
;

return_st:
RETURN SEMICOLON { $$ = new_ast_return_node(0, NULL); } /*Return nothing*/
| RETURN expression SEMICOLON { $$ = new_ast_return_node(1, $2); } /*Return a value*/
;

expression:
term { $$ = $1;}
| variable ASSIGN expression { $$ = new_ast_expr_node($1, ASSIGNop, $3); }
| expression PLUS expression { $$ = new_ast_expr_node($1, PLUSop, $3); }
| expression MINUS expression { $$ = new_ast_expr_node($1, MINUSop, $3); }
| expression MULT expression { $$ = new_ast_expr_node($1, MULTop, $3); }
| expression DIVIDE expression { $$ = new_ast_expr_node($1, DIVIDEop, $3); }
;

term:
ICONST { $$ = new_ast_const_node(INTtype, $1); }
| ID { $$ = new_ast_ID_node(INTtype, $1, NULL); }
| OP_PAR expression CL_PAR { $$ = $2; }
;

At this point we now have the union, tokens, and types declared. Also, we have the grammar rules and corresponding C code set for the grammar rules. The Parser still requires a lot of expansion (i.e., more operators in expressions, more statements, a function being able to contain multiple statements, and more) but a starting point has been set. The last step to setting up the Parser is defining the default starting symbol of the grammar using %start. For our compiler this is the program type as all the tokens passed to the parser make up a single program:

%start program