Lexer – XM3 Compiler Design

Expanding Variable Types

Looking at the grammar for a function in the Parser, it is observed that the only valid variable type is an integer (INT):

function:
INT ID OP_PAR CL_PAR OP_BRACE statements CL_BRACE {  
     $$ = new_ast_func_node($6); 
}
;

The compiler must be able to handle more variable types such as characters or doubles. The first step to adding new variable types is to create an enum used to distinguish between the different variable types:

typedef enum Variable_t
{
	CHARtype,
	DOUBLEtype,
	FLOATtype,
	INTtype,
	LONGtype,
	SHORTtype,
	VOIDtype
} VariableType;

Then, the Lexer is updated so that it can return tokens for the different variable types as well as their corresponding enum values:

"char" {
    yylval.intval = CHARtype; 
    return CHAR;
}
"double" {
    yylval.intval = DOUBLEtype; 
    return DOUBLE;
}
"float" {
    yylval.intval = FLOATtype; 
    return FLOAT;
}
"int" {
    yylval.intval = INTtype; 
    return INT;
}
"long" {
    yylval.intval = LONGtype; 
    return LONG;
}
"short" {
    yylval.intval = SHORTtype; 
    return SHORT;
}
"void" {
    yylval.intval = VOIDtype; 
    return VOID;
}

Next, the Parser must be updated. First, the tokens must be added to the Parser (must be <intval> tokens as they are also passing the enum values through the yylval union):

%token <intval> CHAR DOUBLE FLOAT INT LONG SHORT VOID /*Variable Types*/

Now, the compiler is able to pass different variable types to the Parser; however, the grammar rules within the Parser still need to be updated. One solution would to be have 7 rules for function where the variable type changes each time. This is inefficient as every time the compiler has a grammar rule with a variable type in it, multiple lines would be required. Therefore, a new %type <intval> called “var_type” will be created:

%type <intval> var_type

The grammar for var_type is set so that it can be any variable type ($1 in this case is the enum value passed through yylval):

var_type:
CHAR { $$ = $1; }
| DOUBLE { $$ = $1; }
| FLOAT { $$ = $1; }
| INT { $$ = $1; }
| LONG { $$ = $1; }
| SHORT { $$ = $1; }
| VOID { $$ = $1; }
;

With var_type introduced, the grammar rule for function can be updated to account for multiple variable types without needing to add more lines:

function:
var_type ID OP_PAR CL_PAR OP_BRACE statements CL_BRACE { 
    $$ = new_ast_func_node($6); 
}
;

The addition of the var_type will not only help with the existing grammar for functions but will also prove useful for other grammar rules with variable types that are to be added in the future (i.e., declarations).

Returning First Tokens From Lexer

When we set up the regular expressions for the Lexer, the C code (code within braces) to be executed was only print statements:

"return" {printf("RETURN\n")};

For the Lexer to return tokens to the Parser, we need to modify this C code. For example, when the regular expression “return” is observed, a token representing a return keyword should be returned to the Parser:

"return" { return RETURN; }

In the above code, RETURN is a token that is returned to the Parser. Simply returning a token is sufficient for keywords; however, sometimes we want to return more. Say a number is encountered. We not only want to return a number token, but also return the number that was observed. For this a union called yylval is used.

yylval is a global variable used to pass values between the Lexer and Parser. Since yylval is a union it can pass multiple types of values (note union is definied in Parser):

%union {
	int intval;
	double doubleval;
	char charval;
	char* str;
	struct AST_Node* node;
};

Now, since we have yylval, numbers and other desired values (i.e., variable names in the form of strings) can be passed to the Parser:

[0-9]+ {
    int value = atoi(yytext);
    yylval.intval = value;
    return ICONST;
}

The following tokens can now be passed to the Parser from the Lexer (many more will be needed):

%%

"int" { return INT; }
"return" { return RETURN; }

[a-zA-Z_][a-zA-Z0-9_]* {
    yylval.str = _strdup(yytext);
    return ID;
}

[0-9]+ {
    int value = atoi(yytext);
    yylval.intval = value;
    return ICONST;
}

"+" {return PLUS;}
"-" {return MINUS;}
"/" {return DIVIDE;}
"*" {return MULT;}
"=" {return ASSIGN;}

";" { return SEMI; }
"(" { return OP_PAR; }
")" { return CL_PAR; }
"{" { return OP_BRACE; }
"}" { return CL_BRACE; }

%%

Initial Lexer Regular Expressions

Regular expressions are a sequence of characters that match a pattern of text. For example, “i”,”n”,”t” -> “int”. The Lexer uses regular expressions to create tokens of valid words when analyzing the c program character by character.

An initial list of regular expressions that the Lexer must be able to recognize include identifiers (i.e., variable names), numbers, keywords (i.e., int or while), operators, and some other basic regular expression (i.e., ;).

For keywords and operators the regular expressions are simple as they are just strings. For example, the regular expression to recognize the else keyword is “else”:

"else" {printf("ELSE_TOKEN\n")};

Then, to recognize a number, the first character must be a digit (0, 1, 2, 3, 4, 5, 6, 7, 8, or 9). Then any amount of digits can follow the initial digit. Using regex, the regular expression for a number is:

[0-9]+ {printf("NUMBER_TOKEN\n")};

Finally, to recognize an identifier, the first character must be a letter or an underscore. Then any amount of letters, underscores, and digits can follow the initial character. Using regex, the regular expression for an identifier is:

[a-zA-Z_][a-zA-Z0-9_]* {printf("ID_TOKEN\n")};

Note, that the C code contained in the braces after the regular expressions are executed when the regular expression is encountered. For now, print statements are used, but later these print statements will be replaced by code used to pass tokens to the parser.

See the code below for the initial regular expressions of the Lexer:

[a-zA-Z_][a-zA-Z0-9_]* {printf("ID_TOKEN\n")};

[0-9]+ {printf("NUMBER_TOKEN\n")};

"case" {printf("CASE\n");}
"default" {printf("DEFAULT\n");}
"else if" {printf("ELIF\n");}
"else" {printf("ELSE\n");}
"if" {printf("IF\n");}
"switch" {printf("SWITCH\n");}
"break" {printf("BRK\n");}
"continue" {printf("CONT\n");}
"for" {printf("FOR\n");}
"do" {printf("DO\n");}
"while" {printf("WHILE\n");}
"char" {printf("CHAR\n");}
"const" {printf("CONST\n");}
"double" {printf("DOUBLE\n");}
"enum" {printf("ENUM\n");}
"extern" {printf("EXTERN\n");}
"float" {printf("float\n");}
"int" {printf("INT\n");}
"long" {printf("LONG\n");}
"signed" {printf("SIGNED\n");}
"short" {printf("SHORT\n");}
"static" {printf("STATIC\n");}
"struct" {printf("STRUCT\n");}
"union" {printf("UNION\n");}
"unsigned" {printf("UNSIGNED\n");}
"void" {printf("VOID\n");}
"volatile" {printf("VOLATILE\n");}
"goto" {printf("GOTO\n");}
"return" {printf("RETURN\n");}

"++" {printf("INC\n");}
"+" {printf("PLUS\n");}
"--" {printf("DEC\n");}
"-" {printf("MINUS\n");}
"/" {printf("DIV\n");}
"*" {printf("MULT\n");}
"==" {printf("EQUAL_CHK\n");}
"+=" {printf("PLUSEQ\n");}
"-=" {printf("MINUSEQ\n");}
"/=" {printf("DIVEQ\n");}
"*=" {printf("MULTEQ\n");}
"%=" {printf("MODEQ\n");}
"&=" {printf("ANDEQ_BIT\n");}
"|=" {printf("OREQ_BIT\n");}
"^=" {printf("XOREQ_BIT\n");}
"<<=" {printf("LSEQ_BIT\n");}
">>=" {printf("RSEQ_BIT\n");}
"!=" {printf("NOT_EQUAL_CHK\n");}
"<<" {printf("L_SHIFT\n");}
">>" {printf("R_SHIFT\n");}
">=" {printf("GTEQ_CHK\n");}
"<=" {printf("LTEQ_CHK\n");}
">" {printf("GT_CHK\n");}
"<" {printf("LT_CHK\n");}
"=" {printf("ASSIGN\n");}
"%" {printf("MOD\n");}
"&&" {printf("AND\n");}
"&" {printf("BITAND\n");}
"||" {printf("OR\n");}
"|" {printf("BITOR\n");}
"!" {printf("NOT\n");}
"^" {printf("BITXOR\n");}
"~" {printf("BITCOMP\n");}

";" {printf("SEMICOLON\n");}
"\n" {printf("NEWLINE\n");}
"}" {printf("OP_PAR\n");}
"}" {printf("CL_PAR\n");}
"}" {printf("OP_BRACE\n");}
"}" {printf("CL_BRACE\n");}
"[" {printf("OP_BRACKET\n");}
"]" {printf("CL_BRACKET\n");}

Purpose of the Lexer

The overall goal of our compiler is to take in C code and output the equivalent program in the XM3 machine’s language. The first step in achieving this goal is to make sure our compiler can understand what the input C code is. The first stage in understanding the C code is the Lexer.

The Lexer converts the C code into many tokens. For example, take the C code “int a = 5;”. The Lexer could convert this C statement to the following tokens: INT IDENTIFIER ASSIGN INTEGER SEMICOLON.

These tokens are the compiler recognizing valid words in the C language. Therefore, the purpose of the Lexer is to breakdown the C language into valid words. These words will later be analyzed by the Parser to see if they fit into any valid grammar.

Note, Flex will be used to implement the Lexer for the XM3 compiler. Flex is an open source tool used to generate lexical analyzers.