Earlier today a colleague wondered if there was a built-in method in the .NET Framework to determine if a number is prime. (A prime number, you'll recall, is one whose only factors are 1 and itself.) There is such a method, but unfortunately it is not very usable, for two reasons:
- It's a private method in the System.Collections.Hashtable class, meaning you can't call it from your application code, and, more importantly,
- It has a bug in it that will report a certain class of non-prime numbers as prime.
As I discuss in An Extensive Examination of Data Structures: Part 2, the Hashtable class maintains a prime number of buckets. When you add an item to the Hashtable, a hashing function is used to determine which bucket to place the data into. There is a chance, however, that another item is already using that bucket - this is called a collision. When a collision occurs there are a variety of possible remedies; the Hashtable class uses a technique dubbed rehashing. Rehashing applies an alternative hash function to find a new bucket and tries that one out. If that bucket is also taken, the algorithm repeats with yet another hash function, and so on, until an empty bucket is found.
For performance reasons, it's important to have a healthy ratio of empty buckets to filled buckets. Consider the costs if a high percentage of buckets were filled - numerous collisions, requiring the Hashtable to spend an inordinate amount of time rehashing. Too low a percentage, on the other hand, simply wastes space. (For the incurably curious, 72% filled is the ratio Microsoft suggests, and it is the default. Interestingly, you cannot specify a higher ratio than this, but one of the Hashtable's constructor overloads lets you indicate a lower one.)
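To make the resize trigger concrete, here is a minimal sketch in Python. The 0.72 figure comes from the text above; the function name and the exact rounding behavior are my own illustration, not the BCL's actual logic:

```python
# Illustration only (not the actual .NET implementation): with a maximum
# load factor of 0.72, a resize is triggered once adding one more item
# would push the filled fraction past 72% of the bucket count.
def resize_needed(filled_buckets, bucket_count, max_load_factor=0.72):
    """Return True if adding one more item would exceed the load-factor limit."""
    return (filled_buckets + 1) / bucket_count > max_load_factor
```

With the minimum capacity of 11 buckets, for instance, the seventh item fits comfortably (7/11 is about 64%), but adding an eighth would push the table to roughly 73%, past the 72% threshold.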
Whenever adding an item to the Hashtable would push the ratio of filled buckets above the specified limit, the Hashtable increases its size. Specifically, the Hashtable is careful to resize itself to have a prime number of buckets. Why, you might ask? It has to do with the rehashing technique used. Recall that rehashing applies a new hash function each time a collision occurs in placing an item. That is, for any item being added (or searched for), it first uses hash function H_{1}. If that leads to a collision, it attempts to resolve the collision using hash function H_{2}. If that doesn't cut the mustard, on to H_{3} we go, and so on, until an empty bucket is found. For the Hashtable class, these hash functions are defined as:
H_{k}(item) = [GetHash(item) + k * (1 + (((GetHash(item) >> 5) + 1) % (hashsize - 1)))] % hashsize
With rehashing it is imperative that, for a given item, each hash function probes a different bucket. This is guaranteed if the probe increment - the quantity 1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)) - and hashsize are relatively prime, that is, if the two numbers share no common factors. One way to guarantee that two numbers are relatively prime is to make one of them prime; since the increment always falls between 1 and hashsize - 1, a prime hashsize can share no factor with it. And that is what the Hashtable class does: it ensures that hashsize - the number of buckets in the Hashtable - is always a prime number.
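To see why a prime hashsize makes the probe sequence cover every bucket, here's a quick sketch of the probe formula above in Python (illustrative only, not the actual BCL code):

```python
def probe_sequence(hash_value, hashsize):
    # Probe increment from the Hashtable's formula; it always falls in
    # [1, hashsize - 1], so a prime hashsize guarantees coprimality.
    increment = 1 + (((hash_value >> 5) + 1) % (hashsize - 1))
    return [(hash_value + k * increment) % hashsize for k in range(hashsize)]

# With a prime hashsize, the sequence visits every bucket exactly once:
assert sorted(probe_sequence(1234, 11)) == list(range(11))
```

Because the increment and hashsize are coprime, the k-th probe lands on a distinct bucket for each k from 0 to hashsize - 1, so a free bucket is always found eventually.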
Ok, so the Hashtable needs to be able to pick a prime number for the number of buckets when it must increase its capacity. How does it do this? Well, it starts by defining its favorite 70 prime numbers in a private integer array. These prime numbers range from the smallest possible Hashtable capacity - 11 - all the way up to 7,199,369. They are not the first 70 prime numbers, mind you, just 70 prime numbers, each one, on average, 1.253 times larger than the last (assuming I did my math right). But what happens if we have a Hashtable with 7,199,369 buckets and need to grow it? What hashsize is chosen then?
The Hashtable class, in this case, reverts to the old-school way of finding a prime number: it starts at the minimum size needed, moves to the nearest odd number if necessary, and then keeps adding 2, checking each candidate for primality. To check if a number is prime, the Hashtable class calls its private method Primep(int), which returns a Boolean indicating whether the passed-in integer is prime. The Primep(int) method's code is shown below:
private bool Primep(int candidate)
{
    if ((candidate & 1) == 0)
    {
        return (candidate == 2);
    }
    for (int num1 = 3; num1 < ((int) Math.Sqrt((double) candidate)); num1 += 2)
    {
        if ((candidate % num1) == 0)
        {
            return false;
        }
    }
    return true;
}
The method starts by asking, “Is candidate even?” If so, candidate is only prime if it equals 2. If candidate is odd, the code loops through the odd numbers from 3 up to (but not including) the square root of the number being tested, checking whether any of them divides candidate. If any of them do, then candidate is composite (not prime); otherwise, candidate is reported prime.
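The "find the next prime" procedure described earlier can be sketched like so, in Python, using a standard trial-division test (the function names are mine, not the BCL's):

```python
import math

def is_prime(candidate):
    # Standard trial division: test odd divisors up to and including
    # the square root of candidate.
    if candidate < 2:
        return False
    if candidate % 2 == 0:
        return candidate == 2
    for divisor in range(3, math.isqrt(candidate) + 1, 2):
        if candidate % divisor == 0:
            return False
    return True

def next_prime(minimum):
    # Start at the minimum size needed, move to the nearest odd number
    # if necessary, then keep adding 2 until a prime turns up.
    candidate = minimum if minimum % 2 == 1 else minimum + 1
    while not is_prime(candidate):
        candidate += 2
    return candidate
```

For example, asking for a prime of at least 24 skips the odd composites 25 and 27 and lands on 29.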
Wondering why we don't need to test all odd numbers between 3 and candidate - why we can stop at the square root? The reason is that if a number has no factor less than or equal to its square root, it can't have a non-trivial factor greater than its square root either, since you would need a smaller number to multiply it by to reach the number being tested. More formally, assume that a number x doesn't have any factors less than or equal to sqrt(x), but that it does have a factor y > sqrt(x). Then there must exist some number z such that z * y = x, but z must be greater than sqrt(x) as well (since we already know there's no factor <= sqrt(x)). So z * y > sqrt(x) * sqrt(x), or z * y > x, meaning z * y cannot equal x. Hence we have reached a contradiction.
If you look closely at the code used by the Hashtable class, though, you'll find an off-by-one bug. Note that the loop only goes from 3 up to a number less than the square root of candidate - it should be less than or equal to. Why does this matter? Well, consider a candidate that is the square of a prime, such as 25 or 49. Clearly these numbers are composite, but their only factor other than 1 and the number itself is precisely the number's square root - which never gets tested by the Primep() method! Hence, Primep() indicates that numbers like 25 and 49 are prime. Oops.
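You can watch the bug in action with a quick transcription of the two loop bounds into Python (my naming; the logic mirrors the loop shown above):

```python
import math

def primep_buggy(candidate):
    # Mirrors Primep's logic: the strict < bound stops one short of
    # testing the square root itself.
    if candidate % 2 == 0:
        return candidate == 2
    for divisor in range(3, math.isqrt(candidate), 2):
        if candidate % divisor == 0:
            return False
    return True

def primep_fixed(candidate):
    # Same logic with the bound corrected to include the square root.
    if candidate % 2 == 0:
        return candidate == 2
    for divisor in range(3, math.isqrt(candidate) + 1, 2):
        if candidate % divisor == 0:
            return False
    return True

assert primep_buggy(25) and primep_buggy(49)        # squares of primes slip through
assert not primep_fixed(25) and not primep_fixed(49)
```

For 25, the buggy loop tests only 3 (it stops before 5, the square root and the one divisor that matters), so 25 sails through as "prime."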
Fortunately the Hashtable uses those pre-determined prime numbers for Hashtables with a capacity up to 7,199,369, but what if we have a Hashtable that exceeds this capacity? The Hashtable will try to find a prime number larger than twice the current capacity. If, by misfortune, the candidate it settles on happens to be the square of a prime... well, your Hashtable's capacity will become a composite number, which can break the rehashing technique. That is, when rehashing you might end up back at a bucket you have already tried. I imagine you might even end up in a loop, tirelessly hopping around filled buckets, never making progress. (The Hashtable is at least smart enough to throw an exception once it has probed as many times as there are buckets...)
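Here is what that failure mode looks like with hypothetical numbers: if hashsize is composite and the probe increment happens to share a factor with it, the probe sequence can only ever reach a fraction of the buckets.

```python
def probe_buckets(hash_value, increment, hashsize):
    # The set of buckets a fixed-increment probe sequence can ever reach.
    return {(hash_value + k * increment) % hashsize for k in range(hashsize)}

# Hypothetical numbers: hashsize = 25 and increment = 5 share the factor
# 5, so only 5 of the 25 buckets are ever probed - the other 20 might as
# well not exist, and a nearly full table could be probed forever.
assert len(probe_buckets(7, 5, 25)) == 5
```

With a coprime increment the same sequence would reach all 25 buckets, which is exactly the guarantee a prime hashsize buys.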
After I stumbled upon this bug in the .NET Framework I did a bit of Googling to see if what I had found had been discovered previously. And, sure enough, it has, as mentioned in this blog entry by Brad Abrams: Primes in the BCL... Part 2. Brad assures us that this will be fixed in 2.0.
Speaking of 2.0, with Whidbey a new Dictionary data structure is introduced, which uses chaining to handle collisions rather than rehashing. With chaining, each bucket maintains a list of the items that hash to it. When a collision occurs, the item is added to that bucket's list (as opposed to probing for a new bucket, as with rehashing). More info is available in this blog entry by Krzysztof Cwalina, or you could just read An Extensive Examination of Data Structures Using C# 2.0: Part 2.
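A toy version of chaining, sketched in Python (illustrative class and method names, not the Dictionary's actual implementation):

```python
class ChainedTable:
    """Toy chained hashtable: colliding items share a bucket's list
    instead of being probed to another bucket."""

    def __init__(self, buckets=11):
        self.buckets = [[] for _ in range(buckets)]

    def add(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # key already present: replace
                return
        chain.append((key, value))        # collision or empty: append

    def get(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k, v in chain:
            if k == key:
                return v
        raise KeyError(key)
```

One nice property of chaining: the table keeps working (just more slowly) no matter how full it gets, and nothing about its correctness depends on the bucket count being prime.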