I took a little break from programming on my own time, but I'm back now. I've recently been studying and writing C, and I want to write a hash table implementation in C. Here's the general idea of how a string-keyed hash table works. You have a big ordinary array with numerical indexes, but you want to access its elements with string keys instead of numbers. So you write a function, called a hash function, that converts each string into a number, and that number is used as the index to insert, delete, update, and retrieve elements of the array. Because a hash table is ultimately just indexing into an array numerically, these operations are really fast and don't slow down as more and more data is added.
The first thing I needed was a hash function. I didn't know what to use, so I found some hash functions on the web. But I didn't know how good they were, and I wanted some idea of how well they worked. A problem with hash functions is that it's difficult to map every distinct string to a distinct number. When a hash function converts two different strings to the same number, that's a collision. You obviously can't have the same numerical index point to different array elements, so you need a collision strategy, and using one reduces performance.
So I wrote a C program to test the 7 different hash functions I found on the web. The test generates 50,000 random strings that are between 1 and 25 characters long inclusive, and each of the random strings is unique -- there are no duplicates. The program takes a hash function, runs it on each of the 50,000 random strings, and counts how many collisions the hash function produces. I ran the program several times for each hash function. Each hash function returns an unsigned long int, which on my machine is a number between 0 and 4,294,967,295 inclusive.
Here were the results:
hash1: 47,635 to 47,645 collisions
hash2: 415 to 428 collisions
hash3: 350 to 396 collisions
hash4: 0 to 2 collisions
hash5: 844 to 905 collisions
hash6: 845 to 910 collisions
hash7: 0 collisions
The program is pretty slow because it uses linear searches, but it doesn't need to be fast. The real test of the hash functions will come when they're used on the actual data the hash table is built for. The collision counts might be different with lots of similar strings.
Here's a good tutorial on hash tables.
25 February 2008
The result for hash1 looks so wrong -- are you sure there's no mistake? To be honest, I didn't look at the actual algorithm or homepage.
I reran them on my Mac (but modified the program to use the same strings for each hash algorithm).
Here are my results:
Interesting info BTW!
25 February 2008
Hey thanks for checking it out!
The results for hash1 are accurate. It's a really bad algorithm: it takes the ASCII value of each character of a string, adds them together, and returns the sum as the hash. The maximum number of different values this algorithm can produce in this test is 3,050. That's why there are so many collisions!
7 January 2009
Thanks for your code. This is very helpful for testing my hash algorithm.
I hope you'll fix the link from http://nickmudge.info/c/hashtest.c.txt to http://nickmudge.info/c/hashtest.c
4 April 2008
I tested a couple of hash functions in your code. My concern is that a hash function should produce keys within the size limit of the table, i.e. key <= size (not worrying about collisions at this point).
Let's say we have an array Keys(0), ..., Keys(n-1). The array 'Keys()' is a hash table. A hash function h transforms the information x into an integer h(x) such that 0 <= h(x) <= n-1, where n is the size of the table.
The information x is then stored at Keys(h(x)).
Btw, thanks for sharing.
19 September 2008
Hi Nick, this is Mukul Sharma. The program is no longer at http://nickmudge.info/c/hashtest.c.txt.
Please send it to me by mail -- it would be really helpful for me.
thank you in advance