Symbols Breakdown

Details of variables and functions are stored in the Complier’s symbol table so that the Complier can later access information for the symbol. The Complier uses the information in the symbol table to help perform semantic analysis and generate assembly code.

The first field of the symbol table is a string for the symbol name. The symbol name is stored as it helps the Complier find the symbol in the symbol table:

#define MAXIDLEN 31 // max length of identifier (defined by C standard)

typedef struct symbol {
	char id[MAXIDLEN];
}symbol_t;

Sometimes the symbol table will create a collision, in this case a linked list at that index of the symbol table is required. Therefore, each symbol has a field to allow the creation of a linked list:

typedef struct symbol {
	char id[MAXIDLEN];
	struct symbol* next;
}symbol_t;

For now the symbols can either be for a variable or a function (may have more types in the future). This means a field is needed to tell the Complier whether the symbol is for a variable or function:

typedef struct symbol {
	char id[MAXIDLEN];
	int sym_type;
	struct symbol* next;
}symbol_t;

For semantic analysis, it is also important to store what variable type the symbol is for:

// Enum used for the different C variable types
typedef enum Variable_t
{
	CHARtype,
	DOUBLEtype,
	FLOATtype,
	INTtype,
	LONGtype,
	SHORTtype,
	VOIDtype,
	STRtype,
	NUMtotal //Stores how many variables types are handled
} VariableType;

typedef struct symbol {
	char id[MAXIDLEN];
	int sym_type;
	VariableType data_type;
	struct symbol* next;
}symbol_t;

In the creation of the symbol table, it was decided that each scope of the C program should have its own scope. This means that the Complier must know the name and scope of a symbol to locate it:

// Updates to ast.h
/* Function Nodes */
typedef struct AST_Node_Func_t {
	Node_Type type;
	char* funcname;
	scope_t* scope;
	AST_Node* statement_list;
	AST_Node* param_list;
} AST_Node_Func;

typedef struct AST_Node_Func_Call_t {
	Node_Type type;
	char* funcname;
	AST_Node* param_list;
	scope_t* scope;
} AST_Node_Func_Call;

typedef struct AST_Node_ID_t {
	Node_Type type;
	char* varname;
	int vartype;
	scope_t* scope;
	Value val;
} AST_Node_ID;

// Update to symtab.h
typedef struct symbol {
	char id[MAXIDLEN];
	int scope;
	int sym_type;
	VariableType data_type;

	struct symbol* next;
}symbol_t;

Variables will have values stored in memory at a relative offset to the base pointer of a stack frame. To allow the complier to access variables stored in memory, a field for the base pointer offset was added (note this field is used for variables, it is not used for functions):

typedef struct symbol {
	char id[MAXIDLEN];
	int scope;
	int sym_type;
	VariableType data_type;
	int bp_offset;
	struct symbol* next;
}symbol_t;

Functions won’t have values stored in memory like variables; however, it is useful to store how many parameters are passed to the function (i.e., allows the Complier to know how many variables must be popped from the stack in a return from a function call). Therefore, a field was added to store how many parameters a function has (this will have to be updated in the future to account for parameter types):

typedef struct symbol {
	char id[MAXIDLEN];
	int scope;
	int sym_type;
	VariableType data_type;
	int bp_offset;
	int param_count;
	struct symbol* next;
}symbol_t;

Finally, a field was added to store the values of variables:

typedef union Value {
	int ival;
	double dval;
	long lval;
	float fval;
	char cval;
	char* sval;
	short shval;
} Value;

typedef struct symbol {
	char id[MAXIDLEN];
	int scope;
	int sym_type;
	VariableType data_type;
	Value value;
	int bp_offset;
	int param_count;
	struct symbol* next;
}symbol_t;

Hash Table and Scopes

For the Complier it was decided to use a hash table for the symbol table due to hash tables having fast lookup speeds. Hash tables are arrays paired with a hash function that uses symbol names to calculate the index of the array where the symbol is to be stored:

The hash function calculates the index to store the symbol by summing the decimal values of the ASCII characters in the symbol name, then taking the remainder of the sum divided by the size of the hash table’s array. See below for the function used to calculate the index of the hash table to store a symbol:

unsigned int hash(char* key) {
	unsigned int hashval = 0;

	// sum ASCII values of characters
	while (*key) {
		hashval += *key;
		key++;
	}
	return hashval % COLUMN_SIZE; // return modulo on size of table
}

It is possible to have what is called a collision where the same index is calculated for different symbol names. In this case a linked list is created at that index of the hash table. For example, say val1 and val2 ended up having the same index calculated from the hash function, the table would look like the figure below:

This allows for multiple values to be stored at a single index of the hash table. The downside of collisions is that they increase the lookup time as the Compiler must perform a linear search of each entry of the linked list to look for specific a symbol name.

Another important thing to consider is that a C program can have multiple symbols that share a name if they are in different scopes of the program. This will cause a collision issue as the same index will be calculated for each symbol; however, the names of the symbols will be the same so a search through a linked list cannot be used as done before . To overcome this, it was decided to give each scope of the program it’s own symbol table.

Each scope contains the next scope, the previous scope, sub scope, the scope number (global would be 0, number increases the further away it is from the global scope), and a symbol table:

// Addition to symtab.h
typedef struct scope
{
	struct scope* next;
	struct scope* prev;
	struct scope* sub_scope;
	int scope_num;
	symbol_t** table;
}scope_t;

Functions were created to initialize, increment, and decrement scopes:

// Addition to symtab.c
scope_t* init_scope() {
	// allocate memory
	scope_t* scp = (scope_t*)calloc(1, sizeof(scope_t));
	scp->next = NULL;
	scp->prev = NULL;
	scp->sub_scope = NULL;
	scp->scope_num = curr_scope;
	scp->table = (symbol_t**)calloc(COLUMN_SIZE, sizeof(symbol_t*));
	return scp;
}

void incr_scope() {
	scope_t* scp = init_scope();
	current_scope->next = scp;
	scp->prev = current_scope;
	current_scope = scp;
	curr_scope++;
}

extern void decr_scope() {
	current_scope = current_scope->prev;
	curr_scope--;
}

The main function was updated to create a global and main scope:


// Addition to main.c
int curr_scope = 0;
int curr_sub_scope = 0;
int ROW_SIZE = 10;
int COLUMN_SIZE = 20;
float RESIZE = 0.5;

scope_t* current_scope;
scope_t* global_scope;
scope_t* main_scope;

int main(int argc, char* argv[]) {
	global_scope = init_scope();
	current_scope = global_scope;
	incr_scope();
        ...

The Parser was also updated to increment the scope whenever the grammar for a function is encountered (note, may need to add the incr_scope() function in other areas of the Parser as the grammar of the Compiler grows):

// Addition to Parser.y
functions:
functions function {
	AST_Node_StateList* temp_funclist = (AST_Node_StateList*) $1;
	$$ = new_ast_statelist_node($1, $2, temp_funclist->statement_cnt + 1); 
}
| function  { 
	incr_scope();
	$$ = new_ast_statelist_node(NULL, $1, 1); 
}
;

Purpose of the Symbol Table

The Compiler must be able to store information about symbols found within a C program. Symbols can include variables, function names, other objects and more. Information of these symbols must be stored as it allows the Complier to generate assembly code and the information can be used for semantic analysis.

For example, say a function call is performed with the parameters val1, val2, and val3. These parameters will be stored in a stack frame in a relative location to the base pointer for that stack frame. These relative location allow the Complier to access the parameters in the function call. These relative locations that are vital to the Complier being able to access parameters and are just one of the details stored in the symbol table.

Other details stored in the symbol table include:

  • Symbol Name
  • Symbol Scope
  • Symbol Type (i.e., variable or function name)
  • Symbol Variable Type
  • Symbol Value
  • Symbol Address
  • Symbol Base Pointer Offset (mainly used for variables)
  • Symbol Parameters Count (mainly used for functions)