Brian Hurt has a nice post about Postgres: Postgres for the win!
At this point, the only complaint I have is that Postgres is doing so much with so little that I’ll never get cool hardware to play with. I’ll be stuck with cheap low-end boxes forever. Sigh.
Here's a post he wrote with some great data on Hash Tables: Problems with Hash Tables
The new sort can order items by most recently submitted or by most recently traded; by default it sorts by both.
The new search provides a convenient interface to Newsconomy's URL-based tag system to find items.
I took a little break from programming on my own time, but I'm back now. I've recently been studying and writing C, and I want to write a hash table implementation in C. Here's the general idea of how a string-keyed hash table works. You have a big ordinary array with numerical indexes, but you want to use string keys instead of numbers to access its elements. So you write a function, called a hash function, that converts each string into a number, and that number is used as the index to insert, delete, update, and retrieve elements of the array. Because a hash table is ultimately just indexing into an array, these operations are very fast and, on average, don't slow down as more and more data is added.
The first thing I needed was a hash function. I didn't know what to use, so I found some hash functions on the web. But I didn't know how good they were, and I wanted some idea of how well they worked. The problem with hash functions is that it's difficult to map every distinct string to a distinct number. When a hash function converts two different strings to the same number, that's a collision. You obviously can't use the same index for two different elements, so you need a collision-handling strategy, and using one reduces performance.
So I wrote a C program to test 7 different hash functions I found on the web. The test generates 50,000 random strings that are between 1 and 25 characters long inclusive, and each string is unique -- there are no duplicates. The program runs each hash function on all 50,000 strings and counts how many collisions it produces. I ran the program several times for each hash function. Each function returns an unsigned long int, which on my machine is a number between 0 and 4,294,967,295 inclusive.
Here were the results:
hash1: 47,635 to 47,645 collisions
hash2: 415 to 428 collisions
hash3: 350 to 396 collisions
hash4: 0 to 2 collisions
hash5: 844 to 905 collisions
hash6: 845 to 910 collisions
hash7: 0 collisions
The program is pretty slow because it uses linear searches, but it doesn't need to be fast. The real test of these hash functions will come when they're run on the actual data the hash table is built for; the collision counts might look different with lots of similar strings.
Here's a good tutorial on hash tables.