COMP10002 Foundations of Algorithms

Strings in C

The small amount of C string handling you may need for the one-hot embedding stage: what a string is, how to compare and copy it, and what usually goes wrong.

Overview

If you have not learnt C strings yet, the good news is that you only need a small subset for this assignment.

For the one-hot embedding stage, the main jobs are:

You do not need advanced text processing.

What a string is

In C, a string is a char array ending with the special character '\0', called the NUL byte .

For example:

char word[] = "cat";

This is stored as four characters:

'c' 'a' 't' '\0'

How char word[] = "cat"; is stored

One extra slot marks the end of the string

The visible word is cat, but the array also stores one final byte to mark where the string stops.

The final '\0' is part of the array. It is the byte that tells C where the string ends.
How char word[] = "cat"; is stored

The visible word is cat, but the array also stores one final byte to mark where the string stops.

index 0 1 2 3
byte 'c' 'a' 't' '\0'
ASCII 99 97 116 0

The final '\0' is part of the array. It is the byte that tells C where the string ends.

That final '\0' is what tells C where the string ends. It is written as '\0' because it is the character with numeric value zero.

This is not the same as NULL . '\0' is a byte inside a character array. NULL is used for pointers.

So if you store many tokens, a common pattern is:

char tokens[MAX_TOKENS][MAX_TOKEN_LENGTH + 1];

That means:

Reading a string using scanf

To read in a string with scanf, we need to specify the maximum size of our char array. For example if stdin is “Hello World”, we would need to have atleast 6 chars of space allocated in our char array to store the visible characters and the NUL '/0' character of “Hello”.

We could use the array char str[6] with scanf("%5s ", str) to read {'H','e','l','l','o','\0'} into this array. If we tried to read “Hello! World”, scanf would fail as it would find 6 chars before reaching a whitespace character. To build this format string from the array size constants, the skeleton provides you with the constant TOKEN_STR_SCANF_FORMAT which you can use.

Important Warning

All the inbuilt string functions in C assume your char array is NUL '\0' terminated. If it isn’t, this will cause out of bounds memory access and will likely crash your program or cause weird behaviour. Be careful!

Compare and copy

The most important rule is this:

== does not compare string contents

In C, == on arrays or pointers does not mean “do these two words have the same letters?”. For strings, use functions from <string.h>.

Useful ones here are:

strcmp(a, b)

That is why strcmp is useful both for:

If you want to use the library sort for Stage 1 rather than writing your own sorting loop, see qsort and function pointers in C.

To copy a string into your own array, use:

strcpy(dest, src);

If you want the length of a string, use:

strlen(s);