Intermediate Representation – XM3 Compiler Design

Function Calls

Although the Parser can handle functions, it cannot handle function calls. Therefore, a new token called func_call_st was created. The expected grammar for a function call is the function name followed by parameters contained within parenthesis:

// Update to code in Parser.y
statement:
return_st { $$ = $1; } 
| declaration_st SEMI { $$ = $1; }
| func_call_st SEMI { $$ = $1; }
| expression SEMI { $$ = $1; }
;

func_call_st:
ID OP_PAR call_parameters CL_PAR { 
	$$ = new_ast_func_call_node($1, $3); 
}
;

See below for the code added to create function call nodes in the abstract syntax tree:

// Update to code in ast.h
typedef struct AST_Node_Func_Call_t {
	Node_Type type;
	char* funcname;
	AST_Node* param_list;
	scope_t* scope;
} AST_Node_Func_Call;

// Update to code in ast.c
AST_Node* new_ast_func_call_node(char* fname, AST_Node* param_list) {
	AST_Node_Func_Call* node = malloc(sizeof(AST_Node_Func_Call));

	node->type = FUNC_CALL_NODE;
	node->funcname = _strdup(fname);
	node->param_list = param_list;
	node->scope = current_scope;
	return (AST_Node*)node;
}

Note that another new token besides func_call_st is used, called call_parameters. Although the parameters token was created previously, it cannot be reused as parameters in function calls do not include the variable type. Due to this, the call_parameters/call_parameter tokens were created the same as the parameters/parameter tokens except variable types weren’t included:

// Update to code in Parser.y
call_parameters:
call_parameters call_parameter {
	AST_Node_ParamList* temp_paramlist = (AST_Node_ParamList*) $1;
	$$ = new_ast_paramlist_node($1, $2, temp_paramlist->param_cnt + 1); 
}
| call_parameter  { $$ = new_ast_paramlist_node(NULL, $1, 1); }
| VOID { $$ = NULL; }
| { $$ = NULL; } /*No paramters passed*/
;

call_parameter:
variable COMMA { $$ = new_ast_call_param_node($1); }
| variable { $$ = new_ast_call_param_node($1); }
;

The parameter list node created for the parameters token is reused (list to store an unknown amount of parameters for a function or function call); however, a new abstract syntax tree node was created for call parameters (same as parameter nodes, except variable type is not stored):

// Update to code in ast.h
typedef struct AST_Node_Call_Param_t {
	Node_Type type;
	AST_Node* ID;
} AST_Node_Call_Param;

// Update to code in ast.c
AST_Node* new_ast_call_param_node(AST_Node* ID) {
	AST_Node_Call_Param* node = malloc(sizeof(AST_Node_Call_Param));

	node->type = CALL_PARAM_NODE;
	node->ID = ID;
	return (AST_Node*)node;
}

Code to insert call parameters into the symbol table will need to be added in the future for the Complier to be able to access the parameters when generating assembly code.

Function Parameters

Using the same method as was done with declarations, the grammar for functions was updated in the Parser to account for multiple functions in a program:

functions:
functions function {
	AST_Node_StateList* temp_funclist = (AST_Node_StateList*) $1;
	$$ = new_ast_statelist_node($1, $2, temp_funclist->statement_cnt + 1); 
}
| function  { 
	incr_scope();
	$$ = new_ast_statelist_node(NULL, $1, 1); 
}
;

However, the grammar for a single function still does not account for parameters (see that the grammar only looks for an open and a close parenthesis with nothing in between):

function:
INT ID OP_PAR CL_PAR OP_BRACE statement CL_BRACE {
      $$ = new_ast_func_node($2,$6); }
;

The updated grammar for a function that accounts for parameters will look like the following with the addition of a new token called parameters:

// Additions to Parser.y
function:
var_type ID OP_PAR parameters CL_PAR OP_BRACE statements CL_BRACE {
      $$ = new_ast_func_node($2,$4,$7); }
;

This update requires the new_ast_func_node() function to now take an additional parameter for the parameters token:

// Additions to ast.c
AST_Node* new_ast_func_node(char* fname, AST_Node* param_list, AST_Node* state_list) {
	AST_Node_Func* node = malloc(sizeof(AST_Node_Func));

	node->type = FUNC_NODE;
	node->funcname = _strdup(fname);
	node->statement_list = state_list;
	node->param_list = param_list;
	node->scope = current_scope;
	return (AST_Node*)node;
}

As functions will have an unknown amount of parameters, a parameter list node is created for each function. The design of the parameter list node is the same as the previously created statement list. Function parameters in C are a variable type followed by a variable name. Therefore, the parameter nodes found within a parameter list were designed to contain the parameter type and name:

// Additions to ast.h
typedef struct AST_Node_ParamList_t {
	Node_Type type;
	AST_Node* paramlist;
	AST_Node* param;
	int param_cnt;
} AST_Node_ParamList;

typedef struct AST_Node_Param_t {
	Node_Type type;
	VariableType vartype;
	AST_Node* ID;
} AST_Node_Param;

Functions were made to create parameter list and parameter nodes:

// Additions to ast.c
AST_Node* new_ast_param_node(int vtype, AST_Node* ID) {
	AST_Node_Param* node = malloc(sizeof(AST_Node_Param));

	node->type = PARAM_NODE;
	node->ID = ID;
	node->vartype = vtype;
	return (AST_Node*)node;
}

AST_Node* new_ast_paramlist_node(AST_Node* prmlist, AST_Node* prm, int prmcnt) {
	AST_Node_ParamList* node = malloc(sizeof(AST_Node_ParamList));

	node->type = PARAMLIST_NODE;
	node->paramlist = prmlist;
	node->param = prm;
	node->param_cnt = prmcnt;
	return (AST_Node*)node;
}

A function can either have multiple parameters, a single parameter, or no parameters (marked by empty space or void). Therefore, the grammar for the parameters token was made to account for these four possibilities (note, grammar resembles that of the statements token):

parameters:
parameters parameter {
	AST_Node_ParamList* temp_paramlist = (AST_Node_ParamList*) $1;
	$$ = new_ast_paramlist_node($1, $2, temp_paramlist->param_cnt + 1); 
}
| parameter  { 	
	$$ = new_ast_paramlist_node(NULL, $1, 1); 
}
| VOID { 
	$$ = NULL; 
}
| { /*No paramters passed*/
	$$ = NULL; 
} 
;

The token parameter must account for parameters in the function. As previously mentioned, parameters in a C function are variable types followed by variable names; however, if there are multiple parameters, they will be separated with commas. The grammar for the parameter token was created to account for both parameters that aren’t and are separated by commas:

parameter:
var_type variable COMMA { 
	AST_Node_ID* temp_ID = (AST_Node_ID*) $2;
	$$ = new_ast_param_node($1, $2); 
}
| var_type variable { 
	AST_Node_ID* temp_ID = (AST_Node_ID*) $2;
	$$ = new_ast_param_node($1, $2); 
}
;

Future updates for parameters include inserting the parameters into the symbol table and accounting for function call parameters.

Adding Recursive Statement Nodes

Taking a look at how the function and statement types were previously implemented in the Parser, it is observed that a function can only hold one statement node:

function:
INT ID OP_PAR CL_PAR OP_BRACE statement CL_BRACE {;}
;

statement:
return_st {;}
;

In reality, a function will contain many statements, so the Parser must be modified. To allow a function to contain multiple statements, a statement list node will be used. The statement list node will contain a pointer to a statement node as well as a pointer to another statement list node:

Since a statement list node contains a pointer to another statement list node, any amount of statements can be contained within a single statement list tree. Therefore, if the function node contains a statement list node rather than a statement node, any amount of statement nodes can be contained within a single function.

Another field to be implemented in the statement list node is the statement count. The statement count represents how many statement nodes can be found from traversing the tree starting at that particular statement list node. The statement count will be useful later when translating to the XM3 language as the CEX instruction (conditional execution) needs to know the amount of instructions to run/not run. Below shows a statement list tree that uses the statement count:

Now, to implement the statement list node into the compiler. First, a struct for a statement list node is needed for the abstract syntax tree. It will contain fields for the node type, a statement list node pointer, a statement node pointer, and an integer for the statement count:

typedef struct AST_Node_StateList_t {
	Node_Type type;
	AST_Node* statelist;
	AST_Node* statement;
	int statement_cnt;
} AST_Node_StateList;

Then, the Parser needs recursive grammar to enable the creation of statement list nodes. This will require a new %type <node> called statements (refer to the post “Initial Parser” for an explanation of bison types). The two possible grammars for a statements type will be:

A statements type followed by a statement type
A single statement type.

This allows a statements type to represent any amount of subsequent statement types as “statements” is replaced with “statement” or “statements statement” where “statements” can continue to be replaced. See the below code for the implementation of the “statements” type grammar in the Parser (note: function now uses the statements type rather than the statement type):

function:
INT ID OP_PAR CL_PAR OP_BRACE statements CL_BRACE {  
     $$ = new_ast_func_node($6); }
;

statements:
statements statement {;}
| statement {;}
;

The example below shows a visual representation of the creation of a statements type. Say 4 subsequent statement types are passed to the Parser (note statement1 is passed to the Parser first, and statement4 is passed to the Parser last):

The Parser has been updated to have recursive grammar; however, it still needs C code to create the statement list nodes. For this, the below function is used to create statement list nodes:

AST_Node* new_ast_statelist_node(AST_Node* stmlist, AST_Node* stm, int stmcnt)
{
	AST_Node_StateList* node = malloc(sizeof(AST_Node_StateList));

	node->type = STATELIST_NODE;
	node->statelist = stmlist;
	node->statement = stm;
	node->statement_cnt = stmcnt;
	return (AST_Node*)node;
}

Now, it is needed to update the C code in the actual Parser. Again, two cases must be considered as the grammar for the statements type can either be “statements statement” or just “statement”.

First, take the case where the statements type is just “statement”. In this case, only a single statement needs to be handled. Therefore, the pointer to the next statement list node should be NULL. Also, since there are no further statement list nodes, the statement count should be 1. Therefore, the Parser will call the function new_ast_statelist_node with the parameters of statement list node pointer = NULL, statement node pointer = $1 (pointer to statement node returned from statement type), and statement count = 1:

statements:
statements statement {;}
| statement { $$ = new_ast_statelist_node(NULL, $1, 1);}
;

Then, the next case of the statements type being “statements statement”. Take the previous example of there being 4 subsequent statement types. Initially the Parser will only see the first statement. This will be handled as described above for the “statement” case. Then, when the Parser is passed the second statement it will see the grammar as:

In this case, statements1 is a statement list node where the statement node pointer is to the node for statement1 and statement2 is a statement node. Therefore, the statement list node created for statement2 should have the next statement list node pointer point to statements1, should have the statement node pointer point to statement2, and should have the statement count be 1 greater than that of the statement count in statements1. See below for a visual representation of the abstract syntax tree nodes being created :

Implementing the case of “statements statement” in the Parser requires the statement count field node from statements node to be known. To do this, a temporary node is created to access the field. See below for the implementation of the “statements statement” case:

statements:
statements statement { 
    AST_Node_StateList* temp_statelist = (AST_Node_StateList*) $1;
    $$ = new_ast_statelist_node($1, $2, temp_statelist->statement_cnt + 1); 
}
| statement { $$ = new_ast_statelist_node(NULL, $1, 1); }
;

The compiler can now handle a function with multiple statements. Running a test, the abstract syntax tree is created as expected:

Creating ast.c and ast.h

Once the Parser has accepted a series of tokens as a valid grammar, it will execute C code within the braces after the grammar:

return_st: RETURN ID SEMICOLON {/*C code to be executed when grammar is encountered*/}

What we want the Parser to produce is some form of intermediate representation (IR) that the compiler can analyze to eventually output the program in the XM3 machine’s language. Therefore, most of the C code to be executed when a valid grammar is encountered will produce IR. The IR output by the Parser take the form of abstract syntax trees (AST).

An abstract syntax tree is a tree of nodes that represent a program. For example, take the line of code: “a = 1+2-1;”. An AST for this code could be:

For the compiler, we will use nodes of various types to represent the initial C program as an AST. This AST can be traversed by later sections of the compiler to eventually produce the desired output program.

The first thing required for an AST is a basic node. Note that this node will be type cast later to hold more specific details for the different types of nodes. This basic node will have fields for a node type, left node pointer, and a right node pointer:

 typedef struct AST_node_t {
	Node_Type type;
	struct AST_Node_t* left;
	struct AST_Node_t* right;
} AST_Node;

Node_Type is an enum of the various possible node types. To start with, the following node types are declared (note that many more types will be required):

typedef enum Node_Type
{
	BASIC_NODE,
	// functions
	FUNC_NODE,
	// declarations
	CONST_NODE,
        ID_NODE,
	EXPR_NODE,
	// statements
	RETURN_NODE
} Node_Type;

The declarations for the various node type structs are as follows:

/* Function Nodes */
typedef struct AST_Node_Func_t {
	Node_Type type;
	AST_Node* statement;
} AST_Node_Func;

typedef struct AST_Node_Const_t {
	Node_Type type;
	int const_type;
	int val;
}AST_Node_Const;

typedef struct AST_Node_ID_t {
	Node_Type type;
	char* varname;
	int vartype;
	list_t* val;
} AST_Node_ID;

typedef struct AST_Node_Expr_t {
	Node_Type type;
	Operator op;
	AST_Node* left;
	AST_Node* right;
} AST_Node_Expr;

typedef struct AST_Node_Return_t {
	Node_Type type;
	int ret_type;
	AST_Node* ret_val;
} AST_Node_Return;

In the expression node, an enum is used to hold the operator. For now, the available operators are:

typedef enum Operator_t
{
        ASSIGNop
	PLUSop,
	MINUSop,
	MULTop,
	DIVop
} Operator;

Now that the initial node types have been declared in a header file (ast.h), we need functions for the Parser to execute to create these nodes. These functions allocate space for the node, fill the fields of the nodes with the parameters of the function, and return the node. The following functions (in ast.c) are used:

AST_Node* new_ast_node(Node_Type type, AST_Node* left, AST_Node* right) 
{
	// allocate memory
	AST_Node* node = malloc(sizeof(AST_Node));
	// set members
	node->type = type;
	node->left = left;
	node->right = right;

	return node;
}

AST_Node* new_ast_func_node(AST_Node* statement) {
	AST_Node_Func* node = malloc(sizeof(AST_Node_Func));

	node->type = FUNC_NODE;
	node->statement = statement;
	return (AST_Node*) node;
}

AST_Node* new_ast_const_node(int const_type, int val) 
{	
	AST_Node_Const* node = malloc(sizeof(AST_Node_Const));

	node->type = CONST_NODE;
	node->const_type = const_type;
	node->val = val;

	return (AST_Node*) node;
}

AST_Node* new_ast_ID_node(int vtype, char* vname, Value* val)
{
	AST_Node_ID* node = malloc(sizeof(AST_Node_ID));
	node->type = ID_NODE;
	node->vartype = vtype;
	node->varname = _strdup(vname);
	node->val = val;
	return (AST_Node*)node;
}

AST_Node* new_ast_expr_node(AST_Node* left, int op, AST_Node* right)
{
	AST_Node_Expr* node = calloc(1, sizeof(AST_Node_Expr));

	node->type = EXPR_NODE;
	node->left = left;
	node->op = op;
	node->right = right;

	return (AST_Node*) node;
}

AST_Node* new_ast_return_node(int ret_type, AST_Node* ret_val)
{
	AST_Node_Return* node = malloc(sizeof(AST_Node_Return));

	node->type = RETURN_NODE;
	node->ret_type = ret_type;
	node->ret_val = ret_val;

	return (AST_Node*) node;
}

Now that we have declared various node types and have functions to create them, it is possible to generate an AST. However, we still need a function to traverse the AST. To traverse the AST, we can use recursion to travel to the leftmost node of the tree, then work all the way back to the starting node visiting the right node each step of the way. The following function is used to traverse the AST (note that tab_count is a variable used to help print the tree):

int tab_count = -1;

/* Traverse the AST starting from given node, this is done with recursive calls on the child nodes*/
void ast_traverse(AST_Node* node)
{
	tab_count++;
	// check for empty node
	if (!node) {
		return;
	}

	// Nodes with one left child and one right child
	if (node->type == BASIC_NODE) {
		ast_traverse(node->left);
		ast_traverse(node->right);
		ast_print_node(node, tab_count);
	}
	//Function nodes
	else if (node->type == FUNC_NODE) {
		AST_Node_Func* temp_func = (AST_Node_Func*)node;
		ast_traverse(temp_func->statement);
		ast_print_node(node, tab_count);
	}
	// Return nodes
	else if (node->type == RETURN_NODE) {
		AST_Node_Return* temp_return = (AST_Node_Return*) node;
		ast_traverse(temp_return->ret_val);
		ast_print_node(node, tab_count);
	}
	else if (node->type == EXPR_NODE) {
		AST_Node_Expr* temp_expr = (AST_Node_Expr*) node;
		ast_traverse(temp_expr->left);
		ast_traverse(temp_expr->right);
		ast_print_node(node, tab_count);
	}
	// Nodes with no children
	else {
		ast_print_node(node, tab_count);
	}
	tab_count--;
}

Finally, for debugging, a print node function was created. This print node function performs a switch case on the node type then prints out the important details of the node. tab_count is also passed to the print node function as it enables the user to see which nodes are children of other nodes. The print function is below:

/* Print a node of any type */
void ast_print_node(AST_Node* node, int tab_cnt)
{
	//declare temp node pointers for each node type
	AST_Node_Const* temp_const;
        AST_Node_ID* temp_ID;
	AST_Node_Return* temp_return;
	AST_Node_Func* temp_func;
	AST_Node_Expr* temp_expr;

	for (int i = 0; i < tab_cnt; i++) {
		printf("\t");
	}

	// do a switch on the node type
	switch (node->type) {
	case BASIC_NODE:
		printf("Basic Node\n");
		break;
	case FUNC_NODE:
		temp_func = (AST_Node_Func*) node;
		printf("Function Node");
		break;
	case CONST_NODE:
		temp_const = (AST_Node_Const*) node;
		printf("Const Node val: %d\n", temp_const->val);
		break;
	case ID_NODE:
		temp_ID = (AST_Node_ID*)node;
		printf("ID node: %s\n", temp_ID->varname);
		break;
	case EXPR_NODE:
		temp_expr = (AST_Node_Expr*)node;
		printf("Expression node: %c\n", ops[temp_expr->op]);
		break;
	case RETURN_NODE:
		temp_return = (AST_Node_Return*)node;
		printf("Return Node type: %d\n", temp_return->ret_type);
		break;
	default:
		printf("Error finding node type, aborting...\n");
		break;
	}
}