Returning First Tokens From Lexer

When we set up the regular expressions for the Lexer, the C code (code within braces) to be executed was only print statements:

"return" {printf("RETURN\n")};

For the Lexer to return tokens to the Parser, we need to modify this C code. For example, when the regular expression “return” is observed, a token representing a return keyword should be returned to the Parser:

"return" { return RETURN; }

In the above code, RETURN is a token that is returned to the Parser. Simply returning a token is sufficient for keywords; however, sometimes we want to return more. Say a number is encountered. We not only want to return a number token, but also return the number that was observed. For this a union called yylval is used.

yylval is a global variable used to pass values between the Lexer and Parser. Since yylval is a union it can pass multiple types of values (note union is definied in Parser):

%union {
	int intval;
	double doubleval;
	char charval;
	char* str;
	struct AST_Node* node;
};

Now, since we have yylval, numbers and other desired values (i.e., variable names in the form of strings) can be passed to the Parser:

[0-9]+ {
    int value = atoi(yytext);
    yylval.intval = value;
    return ICONST;
}

The following tokens can now be passed to the Parser from the Lexer (many more will be needed):

%%

"int" { return INT; }
"return" { return RETURN; }

[a-zA-Z_][a-zA-Z0-9_]* {
    yylval.str = _strdup(yytext);
    return ID;
}

[0-9]+ {
    int value = atoi(yytext);
    yylval.intval = value;
    return ICONST;
}

"+" {return PLUS;}
"-" {return MINUS;}
"/" {return DIVIDE;}
"*" {return MULT;}
"=" {return ASSIGN;}

";" { return SEMI; }
"(" { return OP_PAR; }
")" { return CL_PAR; }
"{" { return OP_BRACE; }
"}" { return CL_BRACE; }

%%

Creating ast.c and ast.h

Once the Parser has accepted a series of tokens as a valid grammar, it will execute C code within the braces after the grammar:

return_st: RETURN ID SEMICOLON {/*C code to be executed when grammar is encountered*/}

What we want the Parser to produce is some form of intermediate representation (IR) that the compiler can analyze to eventually output the program in the XM3 machine’s language. Therefore, most of the C code to be executed when a valid grammar is encountered will produce IR. The IR output by the Parser take the form of abstract syntax trees (AST).

An abstract syntax tree is a tree of nodes that represent a program. For example, take the line of code: “a = 1+2-1;”. An AST for this code could be:

For the compiler, we will use nodes of various types to represent the initial C program as an AST. This AST can be traversed by later sections of the compiler to eventually produce the desired output program.

The first thing required for an AST is a basic node. Note that this node will be type cast later to hold more specific details for the different types of nodes. This basic node will have fields for a node type, left node pointer, and a right node pointer:

 typedef struct AST_node_t {
	Node_Type type;
	struct AST_Node_t* left;
	struct AST_Node_t* right;
} AST_Node;

Node_Type is an enum of the various possible node types. To start with, the following node types are declared (note that many more types will be required):

typedef enum Node_Type
{
	BASIC_NODE,
	// functions
	FUNC_NODE,
	// declarations
	CONST_NODE,
        ID_NODE,
	EXPR_NODE,
	// statements
	RETURN_NODE
} Node_Type;

The declarations for the various node type structs are as follows:

/* Function Nodes */
typedef struct AST_Node_Func_t {
	Node_Type type;
	AST_Node* statement;
} AST_Node_Func;

typedef struct AST_Node_Const_t {
	Node_Type type;
	int const_type;
	int val;
}AST_Node_Const;

typedef struct AST_Node_ID_t {
	Node_Type type;
	char* varname;
	int vartype;
	list_t* val;
} AST_Node_ID;

typedef struct AST_Node_Expr_t {
	Node_Type type;
	Operator op;
	AST_Node* left;
	AST_Node* right;
} AST_Node_Expr;

typedef struct AST_Node_Return_t {
	Node_Type type;
	int ret_type;
	AST_Node* ret_val;
} AST_Node_Return;

In the expression node, an enum is used to hold the operator. For now, the available operators are:

typedef enum Operator_t
{
        ASSIGNop
	PLUSop,
	MINUSop,
	MULTop,
	DIVop
} Operator;

Now that the initial node types have been declared in a header file (ast.h), we need functions for the Parser to execute to create these nodes. These functions allocate space for the node, fill the fields of the nodes with the parameters of the function, and return the node. The following functions (in ast.c) are used:

AST_Node* new_ast_node(Node_Type type, AST_Node* left, AST_Node* right) 
{
	// allocate memory
	AST_Node* node = malloc(sizeof(AST_Node));
	// set members
	node->type = type;
	node->left = left;
	node->right = right;

	return node;
}

AST_Node* new_ast_func_node(AST_Node* statement) {
	AST_Node_Func* node = malloc(sizeof(AST_Node_Func));

	node->type = FUNC_NODE;
	node->statement = statement;
	return (AST_Node*) node;
}

AST_Node* new_ast_const_node(int const_type, int val) 
{	
	AST_Node_Const* node = malloc(sizeof(AST_Node_Const));

	node->type = CONST_NODE;
	node->const_type = const_type;
	node->val = val;

	return (AST_Node*) node;
}

AST_Node* new_ast_ID_node(int vtype, char* vname, Value* val)
{
	AST_Node_ID* node = malloc(sizeof(AST_Node_ID));
	node->type = ID_NODE;
	node->vartype = vtype;
	node->varname = _strdup(vname);
	node->val = val;
	return (AST_Node*)node;
}

AST_Node* new_ast_expr_node(AST_Node* left, int op, AST_Node* right)
{
	AST_Node_Expr* node = calloc(1, sizeof(AST_Node_Expr));

	node->type = EXPR_NODE;
	node->left = left;
	node->op = op;
	node->right = right;

	return (AST_Node*) node;
}

AST_Node* new_ast_return_node(int ret_type, AST_Node* ret_val)
{
	AST_Node_Return* node = malloc(sizeof(AST_Node_Return));

	node->type = RETURN_NODE;
	node->ret_type = ret_type;
	node->ret_val = ret_val;

	return (AST_Node*) node;
}

Now that we have declared various node types and have functions to create them, it is possible to generate an AST. However, we still need a function to traverse the AST. To traverse the AST, we can use recursion to travel to the leftmost node of the tree, then work all the way back to the starting node visiting the right node each step of the way. The following function is used to traverse the AST (note that tab_count is a variable used to help print the tree):

int tab_count = -1;

/* Traverse the AST starting from given node, this is done with recursive calls on the child nodes*/
void ast_traverse(AST_Node* node)
{
	tab_count++;
	// check for empty node
	if (!node) {
		return;
	}

	// Nodes with one left child and one right child
	if (node->type == BASIC_NODE) {
		ast_traverse(node->left);
		ast_traverse(node->right);
		ast_print_node(node, tab_count);
	}
	//Function nodes
	else if (node->type == FUNC_NODE) {
		AST_Node_Func* temp_func = (AST_Node_Func*)node;
		ast_traverse(temp_func->statement);
		ast_print_node(node, tab_count);
	}
	// Return nodes
	else if (node->type == RETURN_NODE) {
		AST_Node_Return* temp_return = (AST_Node_Return*) node;
		ast_traverse(temp_return->ret_val);
		ast_print_node(node, tab_count);
	}
	else if (node->type == EXPR_NODE) {
		AST_Node_Expr* temp_expr = (AST_Node_Expr*) node;
		ast_traverse(temp_expr->left);
		ast_traverse(temp_expr->right);
		ast_print_node(node, tab_count);
	}
	// Nodes with no children
	else {
		ast_print_node(node, tab_count);
	}
	tab_count--;
}

Finally, for debugging, a print node function was created. This print node function performs a switch case on the node type then prints out the important details of the node. tab_count is also passed to the print node function as it enables the user to see which nodes are children of other nodes. The print function is below:

/* Print a node of any type */
void ast_print_node(AST_Node* node, int tab_cnt)
{
	//declare temp node pointers for each node type
	AST_Node_Const* temp_const;
        AST_Node_ID* temp_ID;
	AST_Node_Return* temp_return;
	AST_Node_Func* temp_func;
	AST_Node_Expr* temp_expr;

	for (int i = 0; i < tab_cnt; i++) {
		printf("\t");
	}

	// do a switch on the node type
	switch (node->type) {
	case BASIC_NODE:
		printf("Basic Node\n");
		break;
	case FUNC_NODE:
		temp_func = (AST_Node_Func*) node;
		printf("Function Node");
		break;
	case CONST_NODE:
		temp_const = (AST_Node_Const*) node;
		printf("Const Node val: %d\n", temp_const->val);
		break;
	case ID_NODE:
		temp_ID = (AST_Node_ID*)node;
		printf("ID node: %s\n", temp_ID->varname);
		break;
	case EXPR_NODE:
		temp_expr = (AST_Node_Expr*)node;
		printf("Expression node: %c\n", ops[temp_expr->op]);
		break;
	case RETURN_NODE:
		temp_return = (AST_Node_Return*)node;
		printf("Return Node type: %d\n", temp_return->ret_type);
		break;
	default:
		printf("Error finding node type, aborting...\n");
		break;
	}
}

Purpose of the Parser

Once the Lexer has converted a C programs characters to a series of tokens, the compiler must check if the order of these tokens lead to valid statements. For this, the Parser is used.

The Parser is passed the tokens from the Lexer then uses a set rules called grammars to check whether valid statements are being passed. For example say the Lexer converts the C code “return a;” to “RETURN ID SEMICOLON”. The Parser would have a grammar to recognize this pattern tokens:

return_st: RETURN ID SEMICOLON {;}

Note, Bison will be used to implement the Parser for the XM3 compiler. Bison is an open source tool used with Flex.

Initial Lexer Regular Expressions

Regular expressions are a sequence of characters that match a pattern of text. For example, “i”,”n”,”t” -> “int”. The Lexer uses regular expressions to create tokens of valid words when analyzing the c program character by character.

An initial list of regular expressions that the Lexer must be able to recognize include identifiers (i.e., variable names), numbers, keywords (i.e., int or while), operators, and some other basic regular expression (i.e., ;).

For keywords and operators the regular expressions are simple as they are just strings. For example, the regular expression to recognize the else keyword is “else”:

"else" {printf("ELSE_TOKEN\n")};

Then, to recognize a number, the first character must be a digit (0, 1, 2, 3, 4, 5, 6, 7, 8, or 9). Then any amount of digits can follow the initial digit. Using regex, the regular expression for a number is:

[0-9]+ {printf("NUMBER_TOKEN\n")};

Finally, to recognize an identifier, the first character must be a letter or an underscore. Then any amount of letters, underscores, and digits can follow the initial character. Using regex, the regular expression for an identifier is:

[a-zA-Z_][a-zA-Z0-9_]* {printf("ID_TOKEN\n")};

Note, that the C code contained in the braces after the regular expressions are executed when the regular expression is encountered. For now, print statements are used, but later these print statements will be replaced by code used to pass tokens to the parser.

See the code below for the initial regular expressions of the Lexer:

[a-zA-Z_][a-zA-Z0-9_]* {printf("ID_TOKEN\n")};

[0-9]+ {printf("NUMBER_TOKEN\n")};

"case" {printf("CASE\n");}
"default" {printf("DEFAULT\n");}
"else if" {printf("ELIF\n");}
"else" {printf("ELSE\n");}
"if" {printf("IF\n");}
"switch" {printf("SWITCH\n");}
"break" {printf("BRK\n");}
"continue" {printf("CONT\n");}
"for" {printf("FOR\n");}
"do" {printf("DO\n");}
"while" {printf("WHILE\n");}
"char" {printf("CHAR\n");}
"const" {printf("CONST\n");}
"double" {printf("DOUBLE\n");}
"enum" {printf("ENUM\n");}
"extern" {printf("EXTERN\n");}
"float" {printf("float\n");}
"int" {printf("INT\n");}
"long" {printf("LONG\n");}
"signed" {printf("SIGNED\n");}
"short" {printf("SHORT\n");}
"static" {printf("STATIC\n");}
"struct" {printf("STRUCT\n");}
"union" {printf("UNION\n");}
"unsigned" {printf("UNSIGNED\n");}
"void" {printf("VOID\n");}
"volatile" {printf("VOLATILE\n");}
"goto" {printf("GOTO\n");}
"return" {printf("RETURN\n");}

"++" {printf("INC\n");}
"+" {printf("PLUS\n");}
"--" {printf("DEC\n");}
"-" {printf("MINUS\n");}
"/" {printf("DIV\n");}
"*" {printf("MULT\n");}
"==" {printf("EQUAL_CHK\n");}
"+=" {printf("PLUSEQ\n");}
"-=" {printf("MINUSEQ\n");}
"/=" {printf("DIVEQ\n");}
"*=" {printf("MULTEQ\n");}
"%=" {printf("MODEQ\n");}
"&=" {printf("ANDEQ_BIT\n");}
"|=" {printf("OREQ_BIT\n");}
"^=" {printf("XOREQ_BIT\n");}
"<<=" {printf("LSEQ_BIT\n");}
">>=" {printf("RSEQ_BIT\n");}
"!=" {printf("NOT_EQUAL_CHK\n");}
"<<" {printf("L_SHIFT\n");}
">>" {printf("R_SHIFT\n");}
">=" {printf("GTEQ_CHK\n");}
"<=" {printf("LTEQ_CHK\n");}
">" {printf("GT_CHK\n");}
"<" {printf("LT_CHK\n");}
"=" {printf("ASSIGN\n");}
"%" {printf("MOD\n");}
"&&" {printf("AND\n");}
"&" {printf("BITAND\n");}
"||" {printf("OR\n");}
"|" {printf("BITOR\n");}
"!" {printf("NOT\n");}
"^" {printf("BITXOR\n");}
"~" {printf("BITCOMP\n");}

";" {printf("SEMICOLON\n");}
"\n" {printf("NEWLINE\n");}
"}" {printf("OP_PAR\n");}
"}" {printf("CL_PAR\n");}
"}" {printf("OP_BRACE\n");}
"}" {printf("CL_BRACE\n");}
"[" {printf("OP_BRACKET\n");}
"]" {printf("CL_BRACKET\n");}

Purpose of the Lexer

The overall goal of our compiler is to take in C code and output the equivalent program in the XM3 machine’s language. The first step in achieving this goal is to make sure our compiler can understand what the input C code is. The first stage in understanding the C code is the Lexer.

The Lexer converts the C code into many tokens. For example, take the C code “int a = 5;”. The Lexer could convert this C statement to the following tokens: INT IDENTIFIER ASSIGN INTEGER SEMICOLON.

These tokens are the compiler recognizing valid words in the C language. Therefore, the purpose of the Lexer is to breakdown the C language into valid words. These words will later be analyzed by the Parser to see if they fit into any valid grammar.

Note, Flex will be used to implement the Lexer for the XM3 compiler. Flex is an open source tool used to generate lexical analyzers.