COMP10002 Foundations of Algorithms
Strings in C
The small amount of C string handling you may need for the one-hot embedding stage: what a string is, how to compare and copy it, and what usually goes wrong.
Overview
If you have not learnt C strings yet, the good news is that you only need a small subset for this assignment.
For the one-hot embedding stage, the main jobs are:
- store tokens as strings
- compare tokens to see whether two strings are the same
- sort strings into lexicographic order
- copy strings into your own arrays when needed
You do not need advanced text processing.
What a string is
In C, a string is a char array ending with the special character '\0', called the NUL byte
.
For example:
char word[] = "cat";
This is stored as four characters:
'c' 'a' 't' '\0'
How char word[] = "cat"; is stored
One extra slot marks the end of the string
The visible word is cat, but the array also stores one final byte to mark where the string stops.
'\0' is part of the array. It is the byte that tells C where the string ends.char word[] = "cat"; is stored
The visible word is cat, but the array also stores one final byte to mark where the string stops.
| index | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| byte | 'c' | 'a' | 't' | '\0' |
| ASCII | 99 | 97 | 116 | 0 |
The final '\0' is part of the array. It is the byte that tells C where the string ends.
That final '\0' is what tells C where the string ends. It is written as '\0' because it is the character with numeric value zero.
This is not the same as NULL
. '\0' is a byte inside a character array. NULL is used for pointers.
So if you store many tokens, a common pattern is:
char tokens[MAX_TOKENS][MAX_TOKEN_LENGTH + 1];
That means:
- each row stores one token string
- each token can use up to
MAX_TOKEN_LENGTHvisible characters - one extra slot is needed for
'\0'
Reading a string using scanf
To read in a string with scanf, we need to specify the maximum size of our char array.
For example if stdin is “Hello World”, we would need to have atleast 6 chars of space allocated in our char array to store the visible characters and the NUL '/0' character of “Hello”.
We could use the array char str[6] with scanf("%5s ", str) to read {'H','e','l','l','o','\0'} into this array. If we tried to read “Hello! World”, scanf would fail as it would find 6 chars before reaching a whitespace character. To build this format string from the array size constants, the skeleton provides you with the constant TOKEN_STR_SCANF_FORMAT which you can use.
Important Warning
All the inbuilt string functions in C assume your char array is NUL '\0' terminated. If it isn’t, this will cause out of bounds memory access and will likely crash your program or cause weird behaviour. Be careful!
Compare and copy
The most important rule is this:
== does not compare string contents
In C, == on arrays or pointers does not mean “do these two words have the same letters?”. For strings, use functions from <string.h>.
Useful ones here are:
strcmp(a, b)
- returns
0if the strings are equal - returns a negative value if
acomes beforeb - returns a positive value if
acomes afterb
That is why strcmp is useful both for:
- checking whether two tokens are the same
- sorting tokens lexicographically
If you want to use the library sort for Stage 1 rather than writing your own sorting loop, see qsort and function pointers in C.
To copy a string into your own array, use:
strcpy(dest, src);
If you want the length of a string, use:
strlen(s);