CS 2213 Advanced Programming
Hash Tables


Start of Class Thursday, Nov 2, 2000


Review

Reference: Data Structures and Problem Solving Using Java by Mark Weiss, Chapter 19: Hash Tables.

Table lookup is an important operation.
One way to do table lookup is to keep the table sorted and then use a binary search.
A binary search of n sorted items takes about log n operations.

We can do better than this!
It is possible to do a lookup whose time is independent of the number of items in the list.
We do this all of the time with arrays.
Given the array index we can go directly to the corresponding entry.
For an array, the search key is an integer and we assume that the keys consist of all consecutive numbers.

We can extend this to the case in which our key value is something else, say strings, if we have a function that maps strings into integers.
Ideally, if we had n strings, the function would map all of the strings uniquely into the integers from 0 to n-1.
Unfortunately, we usually cannot do this, so we settle for a function, called a hash function, which maps the strings into the integers from 0 to m-1, where m > n.
We apply the function to get the index at which we will store the string.
We do not require the mapping to be unique.
That is, two strings may be mapped into the same index.
This is called a collision.
The larger the value of m, the easier it is to avoid collisions, but this requires more storage.

There are many ways of handling collisions; we will look at a simple one called Linear Probing.

Hashing Strings

When a program is compiled, the compiler must keep a list of the identifiers.
In the simplest case of variables, the compiler keeps the name of the variable (a string) and the location (address) at which it is stored.
When the compiler comes across a variable name, it must look up the address of the variable.
Most compilers use a hash table for this.
Here is a data structure that can be used to store a string and its address:

struct stringval {
   char *str;
   int val;
};
Note that a pointer to the string is stored and not the string itself.

A hash table will be an array of these things, but not all array entries will be used (since the number of strings is usually much less than the size of the array).
A hash table entry will contain one of the stringval structures and a status indicating if it is there:

struct hashentry {
   struct stringval sval;
   int status;
};
The status will be one of the following:
#define HASH_STATUS_EMPTY  0
#define HASH_STATUS_FULL   1
#define HASH_STATUS_USED   2
The first two have an obvious meaning.
The last of these will be explained later.

A hash table will be an array of struct hashentry, possibly like this:

struct hashentry hashtable[HASHSIZE];
The size of the array will be chosen to be larger than the number of strings we might want to store.

The next thing needed is a hash function to map a string into an array index.

A good one will be fast and minimize the number of collisions.
Here are some simple bad ones:

int hf1(char *str) {
   return *str;
}

int hf2(char *str) {
   return strlen(str);
}

int hf3(char *str) {
   int sum;
   for (sum=0;*str!='\0';str++)
      sum = (sum + *str) % HASHSIZE;
   return sum;
}

int hf4(char *str) {
   int sum;
   for (sum=0;*str!='\0';str++)
      sum = sum + *str;
   return sum % HASHSIZE;
}

int hf5(char *str) {
   int sum;
   for (sum=0;*str!='\0';str++)
      sum = sum + *str;
   if (sum < 0) sum = -sum;
   return sum % HASHSIZE;
}

The last three of these will tend to map all short strings into the beginning of the hash table.
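To see just how badly these behave, here is a quick check, with a hypothetical table size of 1009 and a `charsum` function that mirrors the final character-summing variant above:

```c
#include <assert.h>

#define HASHSIZE 1009   /* assumed table size, for illustration only */

/* The first bad hash function from the text:
   the hash value is just the first character,
   so every string with the same first letter collides. */
static int hf1(char *str) {
   return *str;
}

/* Mirrors the final character-summing function above:
   sum the characters, then reduce modulo the table size. */
static int charsum(char *str) {
   int sum;
   for (sum = 0; *str != '\0'; str++)
      sum = sum + *str;
   if (sum < 0) sum = -sum;
   return sum % HASHSIZE;
}
```

For example, hf1("apple") and hf1("avocado") both return 97 ('a'), and charsum maps any three-letter lowercase string to an index below 3*'z' = 366, crowding the front of a 1009-entry table.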

If either operand of % is negative, the sign of the result is implementation-defined.
That is why we need to make sure that sum is not negative in the last hash function.


An aside: from the C standard document:
If either operand is negative, whether the result of the / operator is the largest integer less than or equal to the algebraic quotient or the smallest integer greater than or equal to the algebraic quotient is implementation defined ...

So what is the value of -1/2?

In other words, if the parent of node k is node (k-1)/2, what is the index of the parent of the root?

Several people tried to make their code more efficient by not treating the root as a special case in the trickle-up step of the heap creation stage of heapsort.
This only works if -1/2 = 0.
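On most modern compilers this does hold: C99 and later require integer division to truncate toward zero (under C89 it was implementation-defined). A quick check, with a hypothetical `parent` helper:

```c
#include <assert.h>

/* The parent of heap node k is node (k-1)/2.
   Under C99's truncate-toward-zero division, k == 0 gives
   (-1)/2 == 0, so the root conveniently maps to itself. */
static int parent(int k) {
   return (k - 1) / 2;
}
```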


Here is a simple hash function which is not too bad.

int hashfunction(char *str) {
   int hashval;
   int i;

   hashval = 0;
   for (i=0;str[i]!=0;i++)
      hashval = 37*hashval + str[i];
   if (hashval < 0) hashval = -hashval;
   return hashval % HASHSIZE;
}
This will tend to overflow an integer for strings of moderate length; that is the reason for the test for negative.
(Strictly speaking, signed overflow is undefined behavior in C; declaring hashval as unsigned avoids the issue and makes the test unnecessary.)

Suppose we have

struct stringval sval;
We would store this in the hash table at index hashfunction(sval.str).
What do we do if we get a collision?

Linear Probing

The idea of linear probing is simple.

Insertion
If the computed entry is free, store the value there.
If the computed entry is not free, store it in the next free entry.
This can work well if the number of collisions is small and the entries are spread out so that the next entry is usually free.
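A sketch of insertion with linear probing, using the structures and hash function from the text (the `hashinsert` name and the tiny table size are illustrative assumptions):

```c
#include <assert.h>

#define HASHSIZE 11           /* deliberately tiny, for illustration */
#define HASH_STATUS_EMPTY  0
#define HASH_STATUS_FULL   1
#define HASH_STATUS_USED   2

struct stringval { char *str; int val; };
struct hashentry { struct stringval sval; int status; };

static struct hashentry hashtable[HASHSIZE];

/* the hash function from the text */
static int hashfunction(char *str) {
   int hashval = 0;
   int i;
   for (i = 0; str[i] != 0; i++)
      hashval = 37*hashval + str[i];
   if (hashval < 0) hashval = -hashval;
   return hashval % HASHSIZE;
}

/* Insert with linear probing: start at the hashed index and walk
   forward (wrapping around at the end) until a non-full slot is
   found.  Returns the index used, or -1 if the table is full. */
static int hashinsert(char *str, int val) {
   int i = hashfunction(str);
   int probes;
   for (probes = 0; probes < HASHSIZE; probes++) {
      if (hashtable[i].status != HASH_STATUS_FULL) {
         hashtable[i].sval.str = str;
         hashtable[i].sval.val = val;
         hashtable[i].status = HASH_STATUS_FULL;
         return i;
      }
      i = (i + 1) % HASHSIZE;   /* next slot, wrapping around */
   }
   return -1;                   /* no free slot anywhere */
}
```

With HASHSIZE 11, "a" (97) and "l" (108) both hash to index 9, so inserting "a" uses slot 9 and inserting "l" probes forward to slot 10.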

Look up
To look up an entry, we start at the index determined by the hash function.
If this entry is empty, the string is not in the hash table.
If the entry is full, we do a string compare to see if we found it.
If not, we continue looking at succeeding entries until either we find it or we hit an empty entry.

Deletion
The idea of deletion seems simple: look up the entry, as above, and if found mark the entry as empty.
The following example shows why this naive approach does not work.

Example: Suppose we start with an empty hash table and have two strings, str1 and str2, which both hash to index 50.
We insert str1 at 50 and str2 at 51.
Now we delete str1.
Entry 50 is marked empty.
We now lookup str2.
We look at entry 50 and since it is empty, we assume that str2 is not in the table.
We fix this by marking deleted entries not as empty but as previously used (this is the HASH_STATUS_USED value defined above).
When doing a lookup, we skip over these previously used entries and keep looking.
When doing an insert, we may store a new value in a previously used entry and mark it full.

Important note: When doing a lookup, you may have to wrap around.
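The lookup and deletion rules above can be sketched as follows. The `hashlookup`, `hashdelete`, and `hashinsert` names are illustrative, and the hash function is deliberately degenerate (every string hashes to index 9) so the str1/str2 scenario above is easy to reproduce:

```c
#include <assert.h>
#include <string.h>

#define HASHSIZE 11
#define HASH_STATUS_EMPTY  0
#define HASH_STATUS_FULL   1
#define HASH_STATUS_USED   2

struct stringval { char *str; int val; };
struct hashentry { struct stringval sval; int status; };

static struct hashentry hashtable[HASHSIZE];

/* Degenerate hash for this sketch: every string collides at 9. */
static int hashfunction(char *str) { (void)str; return 9; }

/* Lookup: start at the hashed index; stop at an EMPTY slot,
   skip USED (deleted) slots, compare strings in FULL slots,
   and wrap around at the end of the table.
   Returns the index, or -1 if the string is not present. */
static int hashlookup(char *str) {
   int i = hashfunction(str);
   int probes;
   for (probes = 0; probes < HASHSIZE; probes++) {
      if (hashtable[i].status == HASH_STATUS_EMPTY)
         return -1;                        /* empty slot: not in table */
      if (hashtable[i].status == HASH_STATUS_FULL &&
          strcmp(hashtable[i].sval.str, str) == 0)
         return i;                         /* found it */
      i = (i + 1) % HASHSIZE;              /* keep probing, wrapping */
   }
   return -1;
}

/* Delete: mark the slot USED, not EMPTY, so later lookups
   keep probing past it. */
static int hashdelete(char *str) {
   int i = hashlookup(str);
   if (i < 0) return -1;
   hashtable[i].status = HASH_STATUS_USED;
   return i;
}

/* Insert with linear probing, as in the previous section. */
static void hashinsert(char *str, int val) {
   int i = hashfunction(str);
   while (hashtable[i].status == HASH_STATUS_FULL)
      i = (i + 1) % HASHSIZE;
   hashtable[i].sval.str = str;
   hashtable[i].sval.val = val;
   hashtable[i].status = HASH_STATUS_FULL;
}
```

Replaying the example: insert "str1" (slot 9) and "str2" (slot 10), delete "str1"; a lookup of "str2" still succeeds because the probe skips the USED slot at 9 instead of stopping there.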

Look at Programming Assignment 2 here.



Quadratic Probing

In linear probing, after looking at cell n, we look at n+1, n+2, etc.
In quadratic probing, we look at n+1², n+2², n+3², etc.

Since it is important for table lookup to be fast, it is useful to do this without having to do multiplications.
We can do this by noticing that 2² - 1² = 3, 3² - 2² = 5, 4² - 3² = 7, etc.
That is, each time we add two more than we did the time before.
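A sketch of generating the quadratic probe sequence this way, with no multiplications (the `quadprobe` name and table size are illustrative assumptions):

```c
#include <assert.h>

#define HASHSIZE 16   /* illustration only */

/* Fill out[] with the first `tries` probe indices for a key whose
   home slot is `home`: home, home+1², home+2², home+3², ...
   Consecutive squares differ by consecutive odd numbers
   (1, 3, 5, 7, ...), so each step just adds two more than the
   previous step -- no multiplication needed. */
static void quadprobe(int home, int tries, int out[]) {
   int i = home % HASHSIZE;
   int step = 1;                /* next odd increment */
   int n;
   for (n = 0; n < tries; n++) {
      out[n] = i;
      i = (i + step) % HASHSIZE;
      step += 2;                /* 1, 3, 5, 7, ... */
   }
}
```

Starting from home slot 3, the first four probes are 3, 4, 7, 12, i.e. 3, 3+1², 3+2², 3+3².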
