Welcome back. Today we're going to look at hashing, which is another approach to implementing symbol tables that can also be very effective in practical applications. Here's our summary of where we left off with red-black BSTs, where we could get guaranteed logarithmic performance for a broad range of symbol table operations. And the question is, can we do better than that? Is logarithmic performance the best we can do? And the answer is that, actually, we can. But it's a different way of accessing the data, and it doesn't support ordered operations. But there are plenty of applications where the extra speed for search and insert that we can get this way is worthwhile.
The basic plan is to try to reduce the problem to being like an array. What we do is use a function known as a hash function that takes our symbol table key and reduces it to an integer, an array index, and we use that array index to store the key and the value in an array, maybe with the value in a parallel array. Now, there are a lot of issues in doing this. The first thing is we need to be able to compute the hash function. That is easy for some types of data, but it can get complicated for more complicated types of data. The other thing is that instead of doing compareTo() calls, we're going to be doing equality tests. So we have to be sure we've got the method that we want for checking whether two keys are equal. All we're going to do is look in the table and see if the key that's there is equal to the key we're looking for. And then there's the problem of collision resolution. Since there are so many possible values for a typical data type, we're going to get the situation where two keys hash to the same array index, and we need a collision resolution strategy to figure out what to do in that case.
And these things are not difficult, but they're all worth articulating as separate issues that we have to deal with in order to get an effective symbol table implementation. Hashing really at its core is a classic space-time tradeoff. If we had no limitation on space at all, then we could have a huge array with space for every possible key and just use the key itself as an index. If our keys are 32-bit integers and we've got a table of size 2 to the 32nd, then we're just fine. If there were no limitation on time at all, then we could just hash everything to the same place and then do sequential search. But sequential search can be slow if we have lots of keys. So hashing lives in the real world, where we don't have unlimited space and we also don't have unlimited time, so we're trying to find something in between. So we'll look at hash functions, and then at two collision resolution methods called separate chaining and linear probing.
Now, we'll look at the implementation of hash functions. Ideally, what we'd like is to be able to take any key and uniformly scramble it to produce a table index. We have two requirements. One is that we have to be able to compute the thing efficiently, in a reasonable amount of time. The other is that every table index should be equally likely for each key. Now, mathematicians and computer scientists have researched this problem in a lot of detail, and there's quite a bit known about it. But in practice this is something that we still have to worry about somewhat. So for example, let's suppose that our keys are phone numbers. It's probably a bad idea to use the first three digits of the phone number as a hash function, because so many phone numbers will have the same area code, so the possible three-digit values are far from equally likely. You have a better chance using the last three digits.
But actually, in most cases, you want to find a way to use all the data. Another example is Social Security numbers. Again, it's not too good to use the first three digits, because they're associated with some geographic region, and it's better to use the last three digits. The real practical challenge with hashing is that every type of key needs a hash function, and you need a different approach for every key type. Now for standard keys like integers and strings and doubles and so forth, we can count on the designers and implementors of Java to implement good hash functions. But if we're going to be implementing symbol tables with our own types of data, we're going to have to worry about these things in order to get a hash function that's effective, one that leads to an effective symbol table implementation.
Hashing is widely used for systems programming and applications, so some conventions for hashing are built into Java. In particular, all Java classes inherit a method called hashCode(), which returns a 32-bit int value. And it's a requirement that if x and y are equal, then their hash codes should be equal. That's a convention that's built into Java, and it enables the hash code to be used for hashing. Also, of course, if they're not equal, then you'd like their hash codes to be not equal, but you can't always get that. Now, the default implementation of hashCode() is typically based on the memory address of the object. So that kind of meets these two requirements for Java. The one that it maybe doesn't meet is the idea that every table position should be equally likely, so usually we'll do some more work to try to make that one happen. As far as the rules go, you can always return 17. That's legal. It doesn't have this highly desirable attribute, but everything would compile.
So you'd have to be a little careful in case somebody is in there doing that. Java has customized implementations for the standard data types that people would use for symbol table keys, and that's the sweet spot for hashing: some expert has done the implementation of the hash code, and your application does not need ordering. But for user-defined types, you're on your own, and we'll talk a little bit about how to implement hash codes.
So here are the Java library implementations for a few standard types. They are what they are, and what we'll do is acknowledge that that's what the hash code is and do some extra work to try to get this extra property that every table position should seem to be equally likely. If it's an Integer, the hash code is supposed to be 32 bits and an integer is 32 bits, so they just return the value. If it's a Boolean, they pick out a couple of particular values to return. There are only two different values of boolean type, so it's hard to think about what you really might want there. For a Double value, they convert to the 64-bit representation and XOR the most significant 32 bits with the least significant 32 bits. This illustrates something that you want to do if you have a lot of bits: you want to try to involve all the bits somehow in the hash function. And for strings, it treats the string as a huge number and then computes the value of that number using 32-bit arithmetic, that is, mod 2 to the 32nd. It uses a way of evaluating a polynomial or a number called Horner's method. It treats the string as a base-31 number, and to compute that whole number, for each character you multiply what you have so far by 31 and add the next character. If you're familiar with Horner's rule, fine. If you're not, you can look at this little example and work out what it is. And again, it involves all the characters of the string in computing the hash function.
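To make those four cases concrete, here is a minimal sketch in Java that mirrors the behavior described above. It is not the library source: the class name LibraryHashSketch and its method names are made up for illustration, and the two boolean constants are the ones given in the Boolean.hashCode() documentation.

```java
// Hypothetical sketch class; the names here are illustrative, not part of Java.
class LibraryHashSketch {

    // Integer: the value is already a 32-bit int, so return it unchanged.
    static int hashInt(int value) {
        return value;
    }

    // Boolean: two fixed values (1231 and 1237 per the Boolean.hashCode() docs).
    static int hashBoolean(boolean value) {
        return value ? 1231 : 1237;
    }

    // Double: XOR the most significant 32 bits with the least significant 32 bits
    // of the 64-bit representation, so all the bits are involved.
    static int hashDouble(double value) {
        long bits = Double.doubleToLongBits(value);
        return (int) (bits ^ (bits >>> 32));
    }

    // String: Horner's method, treating the characters as digits of a base-31 number.
    // int overflow wraps around, which is arithmetic mod 2^32.
    static int hashString(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        // 'c' = 99, 'a' = 97, 'l' = 108:
        // ((99 * 31 + 97) * 31 + 108) * 31 + 108 = 3045982
        System.out.println(hashString("call"));  // prints 3045982
        System.out.println("call".hashCode());   // prints 3045982
    }
}
```

The string loop does one multiply and one add per character, which is the whole point of Horner's method: the cost is proportional to the length of the string.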
Actually, since strings are immutable, what Java does is keep the hash value in an instance variable so it only gets computed once. Once it computes the hash code, it stores it in an instance variable, and the next time you ask for the hash code of that string, it will just provide it. That works because strings are immutable, and it is very effective for performance in lots of applications.
So how about implementing a hash code for our own type of data? Our Transaction type might have a few instance variables: a String, a Date, and a double. We need to compute a hash code, so we need to return a 32-bit value. And again, we want to try to make use of all the pieces of data that we have, and we also want to make use of the hash code implementations for the types of data that we're using. So one thing to do is start out with some small prime number, and then, mimicking Horner's method, add in more data as we get it: for each field, we multiply what we have so far by 31, another small prime, and then add the hash code for that field. If it's a reference type, you just use its hash code. So who is a String, and String has a hashCode() method, so we add that in. And when is a Date, so we multiply by 31 and add that hash code in, trying to take all the bits and scramble all the bits and use them. And for primitive types, take the wrapper type and use its hash code. So that's a simple example of implementing a hash code for our own type of data that might include several different types of instance variables.
So that's the standard recipe. Use the 31x + y rule to combine all the fields. If the field is a primitive type, use the wrapper type's hashCode(). If the field is null, use 0. If it's a reference type, use its hashCode(), which applies the rule recursively. And if you have an array, you have to apply it to each entry, or actually Java implements that in its Arrays library.
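The recipe can be sketched for the Transaction type along the following lines. This is a sketch under stated assumptions rather than a definitive implementation: the field names who and when come from the discussion above, while the name amount for the double field, the constructor, and the equals() method are assumptions added so that the example is complete. The starting prime 17 is just one reasonable choice of small prime.

```java
import java.util.Date;
import java.util.Objects;

// Sketch of a user-defined key type; "amount" and the constructor are assumed names.
final class Transaction {
    private final String who;
    private final Date when;
    private final double amount;

    Transaction(String who, Date when, double amount) {
        // This sketch rejects null fields, so hashCode() needs no null checks.
        this.who = Objects.requireNonNull(who);
        this.when = new Date(Objects.requireNonNull(when).getTime()); // defensive copy
        this.amount = amount;
    }

    // equals() must agree with hashCode(): equal objects need equal hash codes.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Transaction)) return false;
        Transaction that = (Transaction) other;
        return who.equals(that.who)
            && when.equals(that.when)
            && Double.compare(amount, that.amount) == 0;
    }

    // The 31x + y rule: start from a small prime, then fold in each field.
    @Override
    public int hashCode() {
        int hash = 17;
        hash = 31 * hash + who.hashCode();          // reference type: use its hashCode()
        hash = 31 * hash + when.hashCode();         // reference type: use its hashCode()
        hash = 31 * hash + Double.hashCode(amount); // primitive: use the wrapper's hash
        return hash;
    }
}
```

Two Transaction objects built from the same who, when, and amount are equal and therefore produce the same hash code, which is the requirement Java places on hashCode().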
So this recipe works pretty well in practice, and it's used in several of Java's libraries. Now in theory, it's possible to do something that has the property that all positions are equally likely. These are called universal hash functions. These things exist, but they're not so widely applied in practice. So the basic rule is that if you're computing your own, try to use the whole key, but consult an expert if you're seeing some performance problems or you really want to be certain in some performance-critical situation.
Now, what we get back from a hash code is an int value between minus 2 to the 31st and 2 to the 31st minus 1. What we need, if we have an array of size M that we are going to use to store the keys, is an int value between 0 and M minus 1. The value of M is maybe a power of two, or sometimes we'd pick a prime, because of the way that we normally get the big hash code value down to a number between 0 and M minus 1: we just do mod M, and if M is a prime, then from modular arithmetic we know that we're using all the bits of the number at that point too. Now, since the hash code can be negative, this doesn't quite work, because of the way this arithmetic is implemented in Java: the result would be negative, and you can't have that. You want it to be between 0 and M minus 1. So you'd think to take the absolute value. But even if you take the absolute value, that fails one time in a billion: the hash code could be minus 2 to the 31st, and for that one value the absolute value still comes out negative. The math doesn't quite work out right. So you have to just take the low 31 bits: get the hash code out, make it nonnegative, and then mod M is the way to go. So anyway, you can use that code down at the bottom as a template for what you might want to do. And that's what we do in order to get the hash code to be a number between 0 and M minus 1.
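The three variants just discussed can be compared side by side in a small sketch. The class name ModularHashSketch and the method names are hypothetical, and M = 97 is just an arbitrary prime table size chosen for the demonstration.

```java
// Hypothetical sketch comparing three ways to map a hash code to an index in [0, m).
class ModularHashSketch {

    // Buggy: in Java, % takes the sign of the left operand, so this can be negative.
    static int naiveIndex(int hashCode, int m) {
        return hashCode % m;
    }

    // Almost right: Math.abs(Integer.MIN_VALUE) overflows and stays negative,
    // so this still fails for the single hash code -2^31.
    static int absIndex(int hashCode, int m) {
        return Math.abs(hashCode) % m;
    }

    // Correct: mask off the sign bit to keep the low 31 bits, then take mod m.
    static int index(int hashCode, int m) {
        return (hashCode & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        int m = 97; // assumed table size
        System.out.println(naiveIndex(-5, m));               // negative
        System.out.println(absIndex(Integer.MIN_VALUE, m));  // still negative
        System.out.println(index(Integer.MIN_VALUE, m));     // in range
        System.out.println(index("call".hashCode(), m));     // in range
    }
}
```

Masking with 0x7fffffff is not the same as taking the absolute value: it maps a negative hash code to a different nonnegative number, but any fixed mapping into the nonnegative range is fine here, since all we need is a consistent index for each key.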
And if M is prime, it gives us some comfort that we have some possibility of each table position appearing with equal likelihood. So that's our assumption: each key is equally likely to hash to an integer between 0 and M minus 1. It's possible to come close to this, and lots of researchers have done good work to show it, so we'll assume that as a starting point. That allows us to model the situation with a so-called bins and balls model, which directly relates the study of hash functions to classical probability theory. We've got M bins, which correspond to the positions in our hash table, and we have some number of balls, however many keys we have. And we throw them uniformly at random into the M bins. These things are studied in classical combinatorial analysis. For example, there's the birthday problem: how many balls do you throw before you find two hitting the same bin? When do you get the first collision? And the answer to that is about the square root of pi M over 2. When do all the bins fill up? That's called the coupon collector problem. After about M natural log M tosses, every bin has at least one ball. Those are just examples of classic results from combinatorial analysis that help us understand what happens when we do this, which is what we're doing with hashing. And we'll look at more advanced versions of these problems when we want to study hashing. In particular, it's known that after you've thrown M balls into the M bins, the most loaded bin has about log M over log log M balls. So that's going to help us get a handle on the performance of hashing algorithms when we get to the implementations.
So this is just an example showing all the words in A Tale of Two Cities, using a modular hashing function for strings like the one that Java uses. And they're pretty uniformly distributed. That's the summary for hash functions.
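For reference, the three bins and balls results quoted above can be written out as follows, with M the number of bins and balls thrown uniformly at random. These are the standard asymptotic forms, stated here without proof.

```latex
\begin{align*}
\text{tosses until the first collision (birthday problem)} &\sim \sqrt{\pi M / 2} \\
\text{tosses until every bin is nonempty (coupon collector)} &\sim M \ln M \\
\text{balls in the most loaded bin after } M \text{ tosses} &= \Theta\!\left(\frac{\log M}{\log \log M}\right)
\end{align*}
```

As a rough check with M = 365, the first formula gives about 23.9 tosses before the first collision, and the second gives about 2,150 tosses before every bin has been hit.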