I just finished reading a blog entry on the new System.Collections.Generic.Dictionary class in Whidbey, in which Krzysztof Cwalina talked about the difference in collision resolution between the Hashtable and Dictionary classes. (The Hashtable uses probing; the Dictionary uses chaining.) When either data structure reaches its maximum size, it must be expanded. (This maximum size is defined as a percentage of full buckets for the Hashtable, as specified by its loadFactor; for the Dictionary, all of its buckets must be filled before it is resized.) When resizing, the algorithm goes as follows:
- Double the number of buckets
- Find the smallest prime number greater than or equal to that doubled value; this prime becomes the new number of buckets for the data structure.
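In code, the two steps above might look something like this. (This is an illustrative Python sketch, not the Framework's actual implementation; `next_bucket_count` and its naive primality test are names of my own invention.)

```python
def next_bucket_count(current_buckets):
    """Sketch of the resize step: double the bucket count,
    then walk upward to the next prime number."""
    def is_prime(n):
        # Trial division; fine for a sketch.
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    candidate = current_buckets * 2
    while not is_prime(candidate):
        candidate += 1
    return candidate
```

So a table with 11 buckets would grow to 23 (double 11 to get 22, then step up to the next prime), and one with 23 buckets would grow to 47.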
For the Hashtable, which uses probing, it is of great importance that the probe increment and the number of buckets be relatively prime, in order to ensure that the probe sequence visits every bucket rather than cycling through a subset of them. This is ensured by having the number of buckets be prime, since a prime is relatively prime to any number that is not one of its own multiples. But why does Dictionary need a prime number of buckets? There may be some reason I'm missing, but I don't believe it's a requirement.
Anywho, a commenter on Krzysztof's blog asked why the number of buckets is expanded so. That is, why double (and then some) the number of buckets - why not increase it by 10%, or 50%, or something user-defined? From my understanding, this could be user-defined, but doubling the size of a buffer is a good heuristic, since it keeps the amortized cost of each insertion constant. It's what the ArrayList and List data structures do when you exceed their capacities.
But enough rambling, let me address the title line of this blog. When expanding a Hashtable, it is important to end up with a prime number of buckets. But how does one find a prime number? Since a prime number is a number with only 1 and itself as factors, a naive approach would be to loop from 2 to the number minus 1 and see if any of those values divide the number being tested evenly. If one does, the number is not prime; if none do, the number is prime.
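The naive approach described above can be sketched in a few lines of Python (the function name is my own):

```python
def is_prime_naive(n):
    """Naive primality test: try every divisor from 2 up to n - 1."""
    if n < 2:
        return False
    for divisor in range(2, n):
        if n % divisor == 0:
            return False  # Found a factor, so n is composite
    return True  # No factor found, so n is prime
```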
While this will definitely work, it's a bit of overkill. You can actually loop up to just the square root of the number you want to test. Why? Well, say you want to know if x is prime. If x is composite, then some number less than or equal to sqrt(x) must divide x. To prove this, use a proof by contradiction. Assume that x is composite and that it is not divisible by any number less than or equal to sqrt(x). Then there exists a pair of factors, call them y and z, such that y * z = x and y, z > sqrt(x). If that is the case, then y * z must be greater than x, since sqrt(x) * sqrt(x) = x. Since we have reached an absurd conclusion, the premise must be wrong, and therefore a composite number must have a factor less than or equal to its square root. So you can just loop from 2 to sqrt(x) when checking for divisibility.
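Applying the square-root bound, the sketch above shrinks its search space dramatically (again, an illustrative function of my own naming):

```python
import math

def is_prime_sqrt(n):
    """Primality test that only tries divisors up to sqrt(n)."""
    if n < 2:
        return False
    # math.isqrt gives the integer square root, so the range
    # covers every divisor candidate <= sqrt(n).
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True
```

For a number like 7,199,369 this checks roughly 2,700 candidate divisors instead of millions.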
Even this approach is overkill, really. The fundamental theorem of arithmetic - the key pieces of which go back to Euclid - shows that every integer greater than 1 can be written as a product of primes. So rather than looping from 2 to sqrt(x), you really only need to check for divisibility against the prime numbers between 2 and sqrt(x).
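One way to sketch that refinement is to sieve out the primes up to sqrt(x) and test divisibility against only those. (A hypothetical illustration; in practice the sieving cost only pays off if you are testing many numbers against the same set of primes.)

```python
import math

def is_prime_via_primes(x):
    """Test x for primality by dividing only by the primes up to sqrt(x),
    found with a Sieve of Eratosthenes."""
    if x < 2:
        return False
    limit = math.isqrt(x)
    composite = [False] * (limit + 1)
    for p in range(2, limit + 1):
        if not composite[p]:
            # p is prime: it is the only kind of divisor we need to try.
            if x % p == 0:
                return False
            # Mark p's multiples as composite so they are skipped.
            for multiple in range(p * p, limit + 1, p):
                composite[multiple] = True
    return True
```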
Enough mathematics - how does .NET find the next largest prime number? Well, the Hashtable maintains a hard-coded array of 70 prime numbers of steadily increasing size. That is, it starts with 11, then 17, then 23, then 29, then 37, and so on, up to 7,199,369. Obviously it skips over a lot of primes, since the 1,000th prime number is only 7,919. Given a number, we can find a larger prime by consulting this pre-defined array and taking the first entry that exceeds it. But what if the number we have to beat is itself larger than 7,199,369? In that case, the .NET Framework falls back on the ol' tried and true measure - looping from 2 to sqrt(x) and checking for divisibility.
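The lookup-then-fallback strategy can be sketched like so. (A shortened, illustrative table stands in for the Framework's 70-entry array, and `get_prime` is my own name, not the actual internal method.)

```python
import math

# Stand-in for the Framework's table; the real one has 70 entries
# and tops out at 7,199,369.
PRIMES_TABLE = [11, 17, 23, 29, 37, 47, 59, 71, 89, 107]

def is_prime(n):
    """Trial division up to sqrt(n)."""
    if n < 2:
        return False
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True

def get_prime(minimum):
    """Return the first table entry >= minimum; if the table is
    exhausted, search upward by trial division."""
    for p in PRIMES_TABLE:
        if p >= minimum:
            return p
    candidate = minimum
    while not is_prime(candidate):
        candidate += 1
    return candidate
```

Most resizes are satisfied straight from the table; only a request beyond its last entry pays for the trial-division search.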
To learn more about Hashtables in the .NET Framework 1.x, check out: An Extensive Examination of Data Structures, Part 2.