Talk:Hash function

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computer science (Rated C-class, High-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.
 

"data" is plural[edit]

The article uses singular, such as "the data is". — Preceding unsigned comment added by 193.11.93.250 (talk) 09:10, 5 November 2013 (UTC)

not necessarily. -- intgr [talk] 13:43, 5 November 2013 (UTC)

MD5 and SHA-1 weaknesses[edit]

This article should document the issues, raised last summer and sharpened last month, with MD5 and some other widely used hash functions. I'm putting a note that the algorithm has issues on this page and the MD5 page. If anyone wants to document this properly, look at:

(I won't write this up properly: not my field and I have very limited interest in the topic) ---- Charles Stewart 15:20, 14 Mar 2005 (UTC)

I confused this page with Cryptographic hash function: sorry, no issue ---- Charles Stewart 15:26, 14 Mar 2005 (UTC)

Who is Jenkins?[edit]

Quote: "See the external link "Hash Functions for Hash Table Lookup" below for a comparison of speed and quality of various hash functions, including Pearson Hashing and some fast, high quality hash functions by Jenkins himself." It would be informative to know who on earth this Jenkins is, before he is alluded to as 'the big man himself'. :) --Egregius 23:11, 29 Mar 2005 (UTC)

He published a hash algorithm in Dr. Dobbs. I'm not aware of any thing on the topic that he has published in any more formal on the topic, nor have I found any articles that examine his hash formally. I would have expected that if it had been formally evaluated it would have been mentioned on his web page. Nahaj 17:48, 1 July 2006 (UTC)
I was just about to ask the same thing. Why link to his source code in particular? If it's an algorithm of note, it ought to be worth having its own page on Wikipedia for, to which this article can link. If not, why mention it? --Kate (talk) 11:13, 24 February 2009 (UTC)

Strongly and Weakly Collision resistant?[edit]

If, given a message x, it is computationally infeasible to find a message y not equal to x such that H(x) = H(y) then H is said to be a weakly collision-free hash function. On the other hand, a strongly collision-free hash function H is one for which it is computationally infeasible to find any two messages x and y such that H(x) = H(y).

This seems awkward and opaque to me...

The only semantic difference I can see in this text seems to suggest that for "strongly" collision free hash functions, even if message x = y, H(x) != H(y)! Which doesn't sound much like a function to me...

Looking around, I found (at http://www.cs.berkeley.edu/~jordan/courses/174-spring02/notes/note14.pdf) the following:

2. A hash function h(x) is said to be weakly collision-free if given a message x1 it is hard to find another message x2 such that h(x1) = h(x2). 3. A hash function h(x) is said to be strongly collision-free if it is hard to find any pair of messages x1, x2 such that h(x1) = h(x2).

This is still not too clear to me.

Weakly collision free seems to me to mean: Given any specific message x, it is computationally infeasible to find a message y such that H(x) = H(y), but if you wish to, you can craft two messages together (x and y), such that H(x) = H(y).

That is, if you're NOT a hacker trying to create a forged message (which is hard), but rather a social engineer trying to create TWO messages which can be interchanged to create plausible deniability (which is easier), you can do it with a Weakly collision free hashing function.

Strongly collision free seems to me to mean: Even though in a large domain of messages there will clearly be a large number of collisions, It is computationally infeasible to find ANY two messages x and y such that x != y and H(x) = H(y), even if you are trying to craft the two together.

Sigh... I'm still not sure how to phrase this is a way which is clear to the casual reader, even assuming that I have correctly interpreted this. Anyone else care to give it a try?

You interpret correctly. But the text makes sense to me as-is.
Maybe this is a clearer non-technical rendering: "Weak means it is hard to find a collision WITH A SPECIFIC message. Strong means it is hard to find ANY MESSAGES that collide." But I agree with the anonymous person above... it makes sense to me as is. [But it is a bit of a technical statement for the math phobic.] Nahaj 17:54, 1 July 2006 (UTC)

I agree with Nahaj. However, I think the distinction between "weakly collision-free hash function" vs. "strongly collision-free hash function" only matters when we are talking about cryptographic hash functions -- right? So I see no need to even mention that distinction in the hash function article -- please delegate it to the cryptographic hash function article.

I agree this is only an issue for cryptographic hashing. There are similar issues that come up in adversarial hashing. For example if someone can find a large group of messages that all fall into the same hash bucket they can make your performance suffer. However I think that collisions resistance is enough to make something a cryptographic hash function. SamHartman 21:33, 3 October 2007 (UTC)

Overhaul of hash function and merging of hash algorithm[edit]

Hi Pfunk42! I just saw your major overhaul of hash function and "merging" of hash algorithm. Very nice work! You beat me to it. I put up those merging notices but never got around to note on the talk pages that I was going to do that work in some days. But I am happy you did it since it saved me the work and you did it so well. I added a picture instead, hope you like it. --David Göthberg 02:31, 5 September 2005 (UTC)

Surjective?[edit]

The article says that hash functions are surjective. Is that true? They're certainly not injective, but my understanding is that "surjective" means that "All elements of the image domain are actually 'used' by the mapping," to quote a posting by Peter Simons from the monoton-devel mailing list. [1]

So a hash function could be surjective, but I don't think it has to be.

But this is so far from my areas of expertise that I don't want to just change it; I may be way offbase. --Elysdir 00:10, 17 December 2005 (UTC)

Basically, you want the outputs of your hash function to be uniformly distributed over the codomain of the function. If the function is not a surjection, there exist values in the codomain that are not in the range. If this is true, then the output is not, strictly speaking, uniformly distributed. However, if the size of the codomain is greater than the square root of the size of the domain, then many quite good hash functions will not be surjective (see birthday paradox). Anyway, it's complicated, but not really an issue anymore since i took out the reference to "surjective". --pfunk42 05:56, 2 January 2006 (UTC)

Stock function?[edit]

What's a stock hash function? Isn't the case when an algorithm become published it becomes stock? What's significant about a stock function the way it is being described in this article? Just a little bit confused. --MegaHasher

Some hash functions are special-purpose. For example, they might only give nice, uniform output for the input they were designed to take. Or they might accept only input of some fixed size. These are not stock because they aren't widely reusable. Also, classes of hash functions or hash function designs are not themselves hash functions. For example, Pearson Hashing is not a stock hash function. Basically, what I'm familiar with being called a "stock hash function" is one which, from a mathematical perspective, is almost universally applicable and good. I say "from a mathematical perspective" because being a stock hash function doesn't mean it's implementation is universally fast. --pfunk42 05:56, 2 January 2006 (UTC)

Randomization Function[edit]

Ok. maybe i should have discussed it before throwing it in there, but this is widely used term for a hash function that is one-to-one. a google search for "randomization function" reveals that most of the results use the term as i describe.

True, a bijective function whose domain equals its range represents a permutation, but calling any one-to-one hash function a "permutation" is highly misleading.

It might be worth creating a separate "Randomization Function" page (with redirect from "Mix Function"), but i'm not sure. Comments welcome.--pfunk42 06:12, 2 January 2006 (UTC)

IMHO, Wikipedia already has too many semi-related pages about random functions. In my mind, permutations used for hashing has less strength than permutations used for pseudo randomness. Even with lots of rounds these permutations probably would not be too good for pseudo randomness, because the period is too short when there are already fast RNG generators with very long periods. --MegaHasher 06:33, 2 January 2006 (UTC)

Many uses of randomization functions, especially those which use them under that name, have no "period" requirement/concern. In fact, periodicity concerns might inappropriately narrow the space of possible functions. Other uses (such as pseudorandom number generation) might call for functions satisfying the requirements of a "randomization function" as well as other, stronger requirements. It would seem to me appropriate to disambiguate such classes of requirements in an article. Here's one thing i believe is a useful observation: a random function is an idealization of a hash function; a random permutation is an idealization of a randomization function. --pfunk42 13:08, 2 January 2006 (UTC)

Intro Edit[edit]

I admit perhaps the original intro is too technical. Hopefully my edit is more to the level of a beginner programmer. --MegaHasher 03:44, 25 January 2006 (UTC)

The example of catching a change in an article suffers from a defect that the check is only probabilistic. The materials in the introduction should be rock solid.

I appreciate your effort, but even so the opening sentence basically says "a hash function is something used to create a hash value". I got to this page by searching for "hash value", which redirected me here. In other words, I still don't know what "hash" means, except with respect to meat dishes. I'm going attempt to redress that. ShawnVW 18:52, 2 March 2006 (UTC)

ShawnVW: I like your new intro. --David Göthberg 01:04, 3 March 2006 (UTC)

I think the intro is good (concise, simple, accurate, and to the point). There is one small inaccuracy, I believe. The input to the hash function is not necessarily variable length. Here are a few definitions (from a Google search of "define hash function") that might be worth consideration as part of the intro:

Web definitions

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. ... en.wikipedia.org/wiki/Hash_function


(Hash functions) take a large amount of data and create a small fingerprint based on that data. www.informit.com/guides/content.aspx


I like aspects the first one, because it includes "possibly," and I like aspects of the second one because it is easy to picture and understand if someone is not familiar with this type of information. Maybe the best of both could be incorporated?

--Mark Nadon 8/13/2013

Hash table section[edit]

Two issues are in this section:

  • Cryptographic hash functions are too slow for hash table usages. Cryptographic hash functions are used to generate security hashes.
  • "too slow" is a brash generalization. some hash tables have records that live on disk, in which case you have virtually all the hashing time in the world. besides, i've seen many-a-hash table using MD5 as the hash function, which is (IIRC) only about 2-3 times slower than Jenkins. (Granted, i quickly pointed out they should use Jenkins instead... :) --pfunk42 21:25, 2 February 2006 (UTC)
  • MD5 is a CRYPTOGRAPHIC hash function, and is designed for a much different use than Jenkin's hash. I'm surprised to see it does that well. But I'd suggest that 2 or 3 times as slow is impractical for many uses. I'v implemented many cryptographic hash functions (1 NIST validated), and I don't use them for anything other than cryptographic work... any more than I would use hash table algorithms for cryptographic use. Cryptographic hash functions are WAY too slow for any hash table use I do. Nahaj 17:32, 1 July 2006 (UTC)
  • A hash function that maps successive keys to successive output values probably has severe funneling, and is not a good hash function to begin with. --MegaHasher 06:31, 1 February 2006 (UTC)
  • another brash generalization. there are cases in which regularity in the input can be exploited to beat "random" hash functions. this is pretty well known thanks to work by Knuth, et al. but the downside is that the *wrong* set of inputs can be very bad... as i discuss in the article!  :) also, some very small hash tables (for example, in implementation of "dictionary"-heavy languages like python, smalltalk, etc.) use hash functions that are pretty bad in terms of collisions but work quite well because low load factor + linear probing (very fast search) + small size + frequent access (likely in CPU cache) mean the hash function must be obscenely fast. --pfunk42 21:25, 2 February 2006 (UTC)
  • (Knuth worked with non-cryptographic hash functions, and the rest of Pfunk42's reply sounds like uses of non-cryptographic hashs. MegaHasher's comment sounds like a sound objection to such a hash for cryptographic work.) Possibly we are talking apples and oranges here? Nahaj 17:32, 1 July 2006 (UTC)

Reverse hash.[edit]

Say I have a hash (I know the hashing function and it has no salt), and I know my cleartext is between 12 and 18 bytes long and contains only ascii a-z \. 0-9 Would it be possible to write a function to generate all the possible cleartexts that fall under this particular category with this hash? Doodle77 03:24, 28 March 2006 (UTC)

In the case of non-crypto hash, yes. -MegaHasher 23:50, 5 April 2006 (UTC)
Actually, Doodle77 hasn't really asked the right question, and MegaHasher's answer is potentially misleading (though technically correct). Regardless of whether a hash function is cryptographically strong, one can write a function to generate all possible cleartexts within a finite range/set that generate a certain hash. The solution is as simple as iterating over the possibilities, hashing them, and comparing the result to the desired hash. This answers Doodle77's question the way he/she asked it; it was asked as a question of computability. Doodle77 was probably interested in whether the solution was efficiently computable, or practical, which is a question of computational complexity. With a space of inputs as large as you describe, efficiently generating matching cleartexts depends on the ability to "reverse" the hash function, which is (presumably; see One-way function) much more difficult for cryptographic hash functions.

--pfunk42 12:36, 23 April 2006 (UTC)

Intro[edit]

It seems the intro has become longer than the main text. Clean-up is again needed. -MegaHasher 23:50, 5 April 2006 (UTC)

The Intro is incorrect. A well designed hash function is not necessarily one-way nor unforgeable. A well designed hash function only needs to distribute input keys well across all buckets. You might be able to prove that a hash function that does a good job of spreading values across buckets is also one-way or non-forgeable, but those are not initial design constraints. There are "niche" applicaitons in which one-way or non-forgeable attributes are important, but it is not a design constraint for hash functions in general. Chuck Simmons 20:06, 14 April 2006 (UTC)

robust noisey room hashing[edit]

"Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room." - very interesting, can we have a source - william sharkey

Algorithms for Audio Identification[edit]

Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences.

This sentence is way too vague. Could someone please elucidate which algorithms and give some perspective on how they work? I assume that one can bring up a more broad topic that is "algorithms for analog signals identification", too, but I really have no knowledge about the topic. Poli 10:38, 12 Jun 2005 (UTC)

A Google search led me to

I suggest we move most of this information on audio identification to the acoustic fingerprint article. --75.37.227.177 14:55, 18 July 2007 (UTC)

Audio Identification NPOV[edit]

"Contrary to an assumption often voiced by people who don't follow the industry carefully,"

That seems a bit too opinionated to be in accord with WP:NPOV and WP:AWW, all the more so that it's uncited. I made the statement a bit more neutral and verifiable, but I still ask that a source be given for this claim, or otherwise it should be removed entirely. The claim regarding the noisy room is also uncited, so I'm requesting a source for that as well. --Ultimus 17:07, 1 July 2006 (UTC)

Audio Identification Facts[edit]

Musicbrainz indeed is an open source system as the article describes; however Musicbrainz does not do audio hashing. It depends on a commercial system to do that—Trevor Caira 11:09, 29 December 2006 (UTC)

How do we know it isn't using human recognition[2]? --75.37.227.177 14:55, 18 July 2007 (UTC)

Merge with one-way security[edit]

The article claims that a merge with One-way security is being considered. As far as I can tell there is no discussion of that here and no proposed reason why the merge is a good idea. I don't see why that merge makes sense. So, I'm going to remove the merge tags. Feel free to add them back and explain why you think it would be a good idea. --SamHartman 21:22, 3 October 2007 (UTC)

Checksums, fingerprints, and cryptographic hash functions are distinct[edit]

The article states that hash functions are the same thing as checksums, fingerprints, and cryptographic hash functions. Actually these concepts are related but distinct; they serve different purposes and have different requirements. Jorge Stolfi (talk) 19:58, 23 February 2008 (UTC)

I noticed that the article then goes on to say under the Descriptions section that hash tables are related to, but not the same as, checksums, contradicting itself in a space of a few paragraphs. Velociostrich (talk) 16:49, 18 November 2012 (UTC)

True! Anyone care to edit it lol, wikipedia is full of "this" these days, a hash is defiantly an extension of the checksum idea, which is an extension of the parity bit idea, this is an emergent field though so some people do not understand the history of the language, it waits for someone to take authority on the idea, but citations are preferred DarkShroom (talk) 10:16, 13 April 2014 (UTC)

Universal class of hash functions[edit]

The article does not mention universal hash functions which are used in applications such as Cuckoo Hashing. It would be nice if they did.

Dr Tom Conway (talk) —Preceding comment was added at 09:30, 31 March 2008 (UTC)


I agree. But I'm biased being one of the authors of Universal Hash Functions, Wegman, so I won't make the change myself. It's much more than Cuckoo Hashing though. Universal Hash Functions almost guarantee that for any data set the hash function will work as extpected. That is to say the programmer doesn't have to worry about collisions. I would propose to replace this section:

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions — pairs of inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.

Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is usually good for hashing, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.

In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m -- see the birthday paradox).

When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-square test.[3]


with something different.

Few Collisions

The odds of a pair of elements colliding under the hash function should be small. If x and y are different it's desirable that the probability of H(x) = H(y) be roughly the size of the range of the function. That is to say if you are hashing into a table the odds of two elements colliding should be no more than the number of elements in the table. The cost of look up in the best of implementations of tables is roughly the number of elements that collide.

Unfortunately hash functions are many to one mappings. The indexes to the tables we use are smaller than the strings. In a cryptography application the authentication tag is typically much smaller than the string being authenticated. So by definition there is some set of inputs that is all mapped to a single element and those inputs by definition collide. A good hash function can require someone to think carefully about the kinds of values being hashed to make sure that there won't be too many inputs that will all map to the same value under the chosen function.

An alternative to careful thought is universal_hashing. A universal hash function has one or more parameters to the function. The value of those parameters are chosen randomly. A universal hash function has the property that for any set of inputs the vast majority of parameter values will distribute the set with few collisions. For example Ha,b(x) might equal ax+b (mod p) (mod q), where p is a prime the size of the largest input and a and b are choosen less than p and where q is the number of elements in the table. There are many implementation tricks to make the calculation of such a function fast.



The current write up refers to perfect hashing which can only be used once the entire set of possibly hashed values has been determined and they tend to be too slow to compute for most examples. There is no reason to bother with a chi-square test when using universal functions the odds of getting a high value on such a test are provably infinitesimal. —Preceding unsigned comment added by MarkWegman (talkcontribs) 23:00, 19 January 2010 (UTC)

Uniformity of telephone numbers/license plates[edit]

Neither telephone numbers nor license plates have a uniform distribution over their number ranges/values. Therefore, using modulo arithmetic will not have the desired effect. 203.97.96.78 (talk) 23:07, 27 August 2008 (UTC)

The highest-order digits ("exchange code") are usually assigned by location. Therfore, a local business that hashes its customer's phones into an 1000-entry table by using the three high-order digits will find that most of the phones are mapped to a coupld of table entries. On the other hand, the lowest-order digits of phone numbers are assigned in an essentially random way within each exchange. So, the modulo 1000 hashshould spread the numbers fairly uniformly in the range [0..999] (which is what is needed). The same hold for a general table size N, up to a few thousand. Dos this make sense? --Jorge Stolfi (talk) 19:00, 21 November 2008 (UTC)
It makes sense, but that claim requires citations. I've edited to request these. 192.12.184.6 (talk) 16:51, 8 November 2013 (UTC)

Hashing is for indexing (?)[edit]

I have restored the qualification "usually a single integer that may serve as an index into an array" for the result of a hash function. To my knowledge, this property is what distinguishes hash functions (in the proper sense of the term) from other related but distinct concepts such as checksums, crypto hash functions, fingerprints, etc.. For instance, a function that returns a 128-bit value cannot be called a hash function.

Perhaps some peopel would rather define "hash function" as a generic term that covers all those concepts. But then we would be left without a name for the original sense ('a function that is suitable for "hashing" scattered data into a table'). All the best, --Jorge Stolfi (talk) 18:41, 21 November 2008 (UTC)

Illustration needed[edit]

Image 1. A typical hash function at work

This article had an illustration that was actually showing a checksum or fingerprint function, rather than a hash function proper (see at right). The illustration has been copied to the checksum and fingerprint (computing) articles. Would someone provide an illustration for hashing in the proper sense? (I don't have any SVG editor at hand). Thanks, --Jorge Stolfi (talk) 19:06, 21 November 2008 (UTC)

I made the original png image that the svg to the right is based on. I see that you removed the image from the article. Why did you do that? What do you think is wrong with the image? That is, what do you think needs to be changed?
It does show an actual hash function at work, in this case it shows the first four bytes (the first 32 bit word) of output from SHA-1. This image has been tested on many beginners and it helped them understand what hash functions are. And it has been discussed with other crypto geeks and computer scientists I know, and this is the best image we have come up with so far.
The reason we choose to show just 4 bytes output in this article is that for table lookup and other such tasks fast short hash sums are usually used. For cryptographic use hash sums are much longer so in the corresponding image over at cryptographic hash function we show the same image but with the full SHA-1 output.
--David Göthberg (talk) 11:24, 25 November 2008 (UTC)
Image 2. Hash table
The problem is that 32 bits are too many for a hash function. Image 1 would be OK for a checksum, and indeed I have adapted it for the checksum article (with 32 bit output) and for the fingerprint (computing) article (with 64-bit output). Moreover collisions are almost unavoidable in a hash function, and should be shown; whereas Image 1 and its original caption suggest that they do not exist. Finally, using SHA-1 for ordinary hashing seems an overkill.
I propose that we use the same illustration in hash table (Image 2) but without the third column (the key/value pairs). I would do the adaptation myself, but I don't have the time now (my SVG editor is emacs ...8-( ). --Jorge Stolfi (talk) 02:21, 26 November 2008 (UTC)
Well, what size the output should have in the image can be debated. Since what hash size one uses depends very much on the usage case, so it can be anything from some bits to a lot of bits.
And the image is not meant to show SHA-1 specifically, I just mentioned that the values you see happens to be produced by SHA-1, thus it does represent the kind of output you might get from a hash function.
I think that adding hash collisions in the first image the readers see perhaps makes things a bit too complicated. The first image should show the basics. And the basics here are that most/many hash sums take input of varying size (thus the short "Fox" text compared to the longer texts) and still produce a fixed size output. And that very small changes in the text produces very different hash sums (thus the two long but similar inputs). The images you added over at checksum and fingerprint (computing) fails to illustrate that checksums and fingerprints usually can take varying length inputs.
If you want to explain collisions then I suggest you use a second image further down in the article. Since my experience when teaching is that people first have to understand the basics I just described above, before they understand why collisions occur. And note that not all hash functions have collisions. Since for some uses we use hash functions that has the same size input as output, and then we often design them so they don't have any collisions at all.
So you removed an image that explains the basics, but you did not replace it with anything else. Thus you did more harm than good. I say we put the image back. If/when there is a better image, then it should perhaps be replaced. But as I said, I think collisions should probably be shown in a second image instead.
--David Göthberg (talk) 20:44, 26 November 2008 (UTC)
  • Well, let me try again:
  • When the term "hash function" was introducted (and for decades thereafter), it meant exclusively the sort of function one uses to build hash tables: it generates a small integer (not a hex string), much smaller than 32 bits. Apart from the (very rare) "perfect hash functions", hash functions were understood to have collisions just as life has taxes; and the many hash table algorithms were basically ways to handle those unavoidable collisions. They had *no* connection to cryptography or trapdoor functions at all. In particular, its is *not* true that "very small changes in the text [should produce] very different hash sums". (And I have never heard them called "hash sums").
  • At the same tme, but quite independently, people invented "checksum functions" to guard against *accidental* errors. They had very different properties from hash functions: were usually 16 or 32 bits, were *guaranteed* to be distinct under *small* differences, etc.. No one ever confused checksums with hash functions. (And no one called them "hash sums").
  • In the late 80's yet another concept became popular, that of "fingerprints"; sort of longish checksums, but designed so that there would be no collisions even among all the files in a large filesystem. This was to be true even for non-random inputs, *but not necessarily under malicious attack*. They typically were 64 bits long or so, and still did not need cryptographic techniques. A small change in the inputhad to generate a different output, but it could be a small one too. Fingerprints were even further removed from hash functions, and, again, no one confused the two. (And no one called them "hash sums").
  • Finally, in the 90s, yet another concept became popular: a function that was almost certainly collision-free, *even for malicious inputs*. These functions needed even more output bits (128, 256, etc.) and cryptographic techniques. For this goal, a small change in the input must necessarily cause a large change in the output Unfortunately, the cryptographers choose to call them "cryptographic hash functions" or "hash sums". However, they are *not* a special class of hash functions; and are distinct in purpose and design from checksums. The three concepts are almost as unlike as bottle, bottle gourd, and bottleneck.
  • So this is the root of the problem: we have three very, very different concepts with confusingly similar names. Each deserves a separate article, with a warning about the distinction. Image 1 is neither a "hash function" nor a "cryptographic hash function"; it could be a checksum function except that using SHA-1 for that purpose is an overkill. I don't think that a wrong image is better than no image. Image 2, minus the right-hand column, shows a proper "hash function".
  • Hope it helps, --Jorge Stolfi (talk) 01:02, 27 November 2008 (UTC)

Collision-free hashing is lossless coding ?[edit]

A hash function that has no collisions is what we call a lossless coding. Cuddlyable3 (talk) 23:48, 26 November 2008 (UTC)

This is a poor analogy - lossless coding usually produces variable-size codings and usually takes advantage of context to determine the code for each symbol, and so isn't really a function at all. A hash function with no collisions is best called a perfect hash function. Dcoetzee 00:08, 27 November 2008 (UTC)
Agreed. Lossless coding is different from perfect hashing in many aspects. It may deserve a "See also", but not more. --Jorge Stolfi (talk) 01:02, 27 November 2008 (UTC)
It's a statement not an analogy. It does not say that lossless codings are hash functions. A hash function with no collisions can be reversed to recover the original data, hence lossless. Whether particular lossless codings should be called functions depending on whether they produce fixed-length codes and on whether they take advantage of context seems moot here. Cuddlyable3 (talk) 11:40, 19 December 2008 (UTC)
I misunderstood your phrasing as suggesting that perfect hash functions should be explained in terms of lossless coding and readers redirected to the article on lossless coding. Perfect hash functions can be used as (very poor) lossless codes; this deserves a link, but they're studied in very different contexts and understanding one will not allow you to automatically understand the other. Dcoetzee 18:49, 19 December 2008 (UTC)
I don't have any editing proposal about this yet. The article talks about perfect = injective hashes which to me seem equivalent to substitution codes, and not very useful (except when the substitution function is both changing and secret. That case is the one-time pad that gives excellent cryptography.) I suggest that the alphabetical ordering in a telephone directory of the names of people who in the real world are in an unrelated order is a combination of two hashes: (i) a very compressive hash with many collisions whereby names are sorted into 26 bins (by their first letters A, B, ... Z). (ii) a perfect hash that allocates exactly one bin (text line) to each person. Cuddlyable3 (talk) 20:59, 19 December 2008 (UTC)
If the hash function had no collision in the first place, and recall that since a hash function is deterministic, this would imply that the inputs have a finite size and are all distinct, why would you use a hash function in the first place .. 121.73.17.17 (talk) 00:44, 6 December 2010 (UTC)

Overlap/contradiction with Hash table section[edit]

This article seems to have some overlap in function and content with Hash_table#Choosing_a_good_hash_function. Also, the latter characterizes the most important test for a hash function as being the avalanche condition, which seems a bit dubious to me - could an expert take a look at this? Dcoetzee 10:49, 24 November 2008 (UTC)

Geometric hashing[edit]

The section Geometric hashing looks like a description of Vector quantization. Vector quantisation collects groups (not just pairs) of data points that are close together (in typically 2- or more dimension space) so they can be represented with small errors by a single vector e.g. the centroid of the data group. Lossy compression is achieved by describing the data set using the reduced set of vectors. An example application is image compression from 24-bit RGB color to an optimum indexed 256-colour palette. In contrast, hashing seeks to group data in its own space that is constructed with the intention that data points that are actually close should not be kept together. Cuddlyable3 (talk) 11:22, 19 December 2008 (UTC)

  • Vector quantization is a special case of geometric hashing, although the latter is a more recent and (afaik) independently developed concept. "Bucket grid" techniques were used in the 1980's in several computer graphics applications, such as mapping a mouse click to the nearest object on the screen. Indeed, 2- and 3-dimensional bucket grids have probably been used since computers were applied to surveying, cartography, etc.. The Hough transform is basically geometric hashing of straight lines. The term "geometric hashing" applied to general shapes (as opposed to points) has a definite origin, perhaps in the 1990's; I will try to find the reference. Jorge Stolfi (talk) 15:55, 8 March 2009 (UTC)

Trivial Hash Function[edit]

The example code given for a trivial hash function doesn't seem to be the best way to explain this; whilst it is a completely valid section of code, it is probably a bit obscure to anyone who doesn't know C and/or C++. An example in some form of pseudocode might make more sense? The parts they may be too obscure to someone unfamiliar with the language are

char*

declaring a pointer to one character within a string of text, the conditional

if(g=h&0xf0000000)

both changing the value of g and checking this value, and

*p!='\0'

catching the end of the string of text. --Elder pegasus (talk) 22:32, 6 March 2009 (UTC)

  • That code was temporarily removed because of the above problems (and also because of the overly optimistic but unsourced claim that it generates no collisions). There is now a generic pseudocode in the section "Hashing variable-length data", which presumably can be made equivalent to the deleted code by the proper choice of the F function. Jorge Stolfi (talk) 16:02, 8 March 2009 (UTC)

Hash functions and "modulo the table size"[edit]

There seems to be some confusion about the "modulo the table size" bit. A hash function used for a hash table with N buckets should return an integer in 0..N-1. That is, the "mod" operation, if it is done at all, is technically inside the hash function, not outside it. The whole discussion about what makes a good hash function assumes this view. As said further on in the article, one way to get a hash function for a given size N is to take a hash function h with a much larger range (such as 232 and then compute h(k) mod N. But that is not the only way, and it may easily give a very poor hash function. If N is a power of two, the mod-N trick gives the low-order bits of h(k); which, for some popular hash functions, are known to be the worst ones. There are many other ways of reducing a large hash value to a smaller one, but they cannot be analyzed or recommended separately from the raw hash function. This is another reason to consider the mod-N operation a part of the hash function, and to avoid mentioning it outside of that section. All the best, --Jorge Stolfi (talk) 01:57, 18 April 2009 (UTC)

MurmurHash conflict[edit]

I'm writing to ask that editors interested in hashing consider taking a look at MurmurHash. Recently, a particular editor with a track record of aggressive and hasty deletionism started cutting the article down and removing key references, turning it into a stub. He removed the entire summary paragraph, repeatedly tried to remove the algorithm, and has taken out almost all of the links to ports in various languages. This lowers the value of the article and I'm obviously very unhappy about it, but I don't see further discussion with him as likely to be productive. Instead, I'd like to get more people involved so that a genuine consensus can be formed. 208.80.104.2 (talk) 13:53, 11 September 2009 (UTC)

I suggest you take a look at WP:CIVILITY before you throw around unfounded accusations such as "aggressive and hasty deletionism".
To uninvolved parties, I'm the editor that's being referred to here. I'm more than happy for a 3rd opinion on the MurmurHash article. Oli Filth(talk|contribs) 17:48, 11 September 2009 (UTC)
Anyone capable of clicking on your talk page can decide for themselves whether I am correct in saying you have a history of aggressive and hasty deletionism. False accusations of incivlity are themselves a violation of WP:CIVILITY, so mind your own manners. 208.80.104.2 (talk) 18:12, 11 September 2009 (UTC)

I've inherited the article, which I've recreated under guidance from the admins who initially opposed it. It's back on the chopping block, so please chime in: Wikipedia:Articles for deletion/MurmurHash (2nd nomination). Phil Spectre (talk) 00:35, 15 November 2009 (UTC)

Different from fingerprint?[edit]

From the two pages it seem like hash function and fingerprint are the same thing. Could someone explain the difference? Thanks, — sligocki (talk) 18:15, 12 November 2009 (UTC)

For example, they both claim to aim to avoid collisions. — sligocki (talk) 18:18, 12 November 2009 (UTC)
Consider that a hash function that collides infrequently might be fine for use in a hash table, but it wouldn't be a good fingerprint. Fingerprints are the subset of hashes with particularly low collision rates and resistance to tampering. Phil Spectre (talk) 03:35, 13 November 2009 (UTC)

Mathematically definiton[edit]

In Introduction it is written, that a hash function is a mathematical function. Therefore I missed a proper mathematical definition of a hash function. For example something like:

A hash function is a polynomial time algorithm H which maps a bit string of arbitrary length to a bit string of a certain length. In formula: H:\{0,1\}^* \rightarrow \{0,1\}^l.

Now you could distinguish certain kinds and properties of hash functions in a more formal way. For example like:

  • fixed length: fix the domain and set H:\{0,1\}^{l'} \rightarrow \{0,1\}^l with l'>l
  • family of hash functions or also called keyed hash functions
  • strongly and weekly collision resistant and explanation of the difference between
  • preimage resistance: for given H(m) it is infeasable to find a certain m
  • second preimage resitance: for given m and H(m) it is infeasable to find a certain m' s.t H(m)=H(m')

I read above that someone more prefer to write such properties in the cryptographic hash function article. But the book I actually read (Introduction to modern cryptography by Katz and Lindell) describe cryptographic hash function as functions which fulfill certain properties but define this properties for ('noncryptographic') hash functions in general. Anyhow at the moment this things aren't described in detail neither in Hash_function, nor in Cryptographic_hash_function.

--Wohingenau (talk) 13:54, 26 March 2010 (UTC)

'Bernstein hash' link goes to French wikipedia[edit]

The link at the bottom of the page to the 'Bernstein hash' page goes to http://fr.wikipedia.org/wiki/Table_de_hachage#Fonction_de_Hachage , which is probably undesirable on the English version of Wikipedia. I did a quick search for 'Bernstein hash' on the English wikipedia and didn't find a specifc page about the algorithm, otherwise I would have fixed the link myself. Either the link should be removed, or the content in the French wiki article should be copied over. 202.20.20.202 (talk) 05:15, 24 May 2010 (UTC)

Help understanding please...[edit]

How are any of these functions regarded as reliable? Take SHA-1, for example: with a message size of up to 2^64-1 bits, and a digest size of 160 bits, any given digest can have many tens of trillions of messages (roughly 2^57, assuming the algorithm yields high efficiency and even distribution). Compared to the fact that any given message yields exactly one of up to 2^160 digests, there's an incredible discrepancy (like, a billion orders of magnitude) and collisions seem inevitable under the circumstances of even very low-tax, common applications (those applications where message sizes are nowhere near the full possible length). I guess the question in short is: How long can messages be, before the risk of frequent collisions becomes a practical problem for the application? Something like message lengths of 2^8 bits appears to be the point at which collisions become absolutely guaranteed. For many cryptographic hash applications, it doesn't matter if you discover the actual message, if you discover any one of the many tens of trillions of messages which match the digest, you're ready to spoof. So, I am missing something. --76.102.243.117 (talk) 22:25, 7 June 2010 (UTC)

It's not how long the messages are, it's how many messages you have. If you have more than 2^m messages, where m=n/2 and n is the bit length of the hashes, then you've got at least a 50% chance that there's at least one collision somewhere.
At least, that's the case if the hash function is good, and there's nothing special about the messages; i.e. if the hashes are generated randomly enough that you can apply statistics to the problem.- Wolfkeeper 22:47, 7 June 2010 (UTC)
That's the Birthday paradox at work.- Wolfkeeper 22:48, 7 June 2010 (UTC)
OK, from there I found the link to Birthday attack... --76.102.243.117 (talk) 10:35, 25 August 2010 (UTC)

Cutter numbers[edit]

Is there a connection between hashes and Cutter numbers (Library Science)? The aim of Cutters, as far as I can tell, is to map points from a large sparse space (eg names of people) to a small space (eg 3 decimal digits). That sounds like hashing. Cutters have the additional property that order is preserved - that is, if name X is alphabetically before name Y, then the Cutter number of X comes before the Cutter number of Y.

An issue that Cutters make use of is that points in the name space are not uniformly distributed. A practical way to generate Cutters for people's names is to use a phone book. Find the page with the name, or where the name would be if it is missing, and use the page number. This takes account of the non-uniform frequency distribution of names.

138.251.194.28 (talk) 14:43, 23 August 2010 (UTC)

Cutter codes are generated according to a set of rules (i.e., a hash function), although some human judgement is allowed to accommodate very common names; on the other hand, such judgement could be incorporated into the rules of one's Cutter algorithm. Cutter is not a hash function itself, but a kind of hash function, and different libraries may use different Cutter tables in generating their Cutter codes (hashes). Here's an example of a Cutter table: Example Cutter table

However, this Wikipedia page defines hash codes as being of "fixed length", which Cutter codes are not, yet gives no explanation why hash codes must be of fixed length.

Must Hash Codes Be of "Fixed Length"?[edit]

This page defines hash codes as being of "fixed length", but offers no explanation of why uniform length is an inherent and necessary trait of hash codes. There are hashing systems (such as Cutter codes) that produce (slightly) varied hash-code lengths, but that satisfy all the functional requirements of a hashing system. I would argue that fixed or uniform length of hash codes is a typical characteristic, but not a required characteristic of hash codes. Anyone know any reason to the contrary? — Preceding unsigned comment added by 76.121.65.17 (talk) 17:28, 23 November 2012 (UTC)

@"Perfect Hasing"[edit]

maybe we should mention that any Perfect Hash function cannot output any smaller data sets than the input? :p

if the hasing function is "perfect", the output cannot be smaller than the input...Divinity76 (talk) 19:08, 19 July 2012 (UTC)


Perfect???: PCR-Hash to texts for Programm Codon Restrict IS rename failure by Polesinskij:

l:=(Length of bytes)

S:=(total plass of bites of all bytes)

U:=(total multiplicat for bites of all bytes)


with sqrt on n=N:

([S/l]=n to sqrt of U)=k

x:=([k(to float)]=n2 to sqrt of U)

total concatenat:l+k+x

second part with_2-sqrt:

beg {z:=l/2; int Y+1; while z>2;} end

beg {k:=2sqrt of U;Y-1; while Y>0;} end

x:=[k(to float)=n to sqrt of U]

total concatenat:l+k+x


— Preceding unsigned comment added by 95.153.189.81 (talk) 19:48, 1 November 2013 (UTC)

Bloom filters and hashing[edit]

Yes, Bloom filters use hash functions, but this section isn't helpful. What does "a compact data structure that provides an enclosing approximation to a set of them" even mean? I'd recommend saying something that makes sense here, or deleting the section.

Paul Kube (talk) 22:54, 4 January 2013 (UTC)

Dubious source of CRC32 information[edit]

Bret Mulvey's analysis referenced in the article shows a CRC-32 implementation where the polynomial used is the most-significant-bit (msb)-first representation of one of the standard CRC-32 implementations. However, his table initialization code is for a lsb-first implementation. (Note that all 256 of his table values will have the most significant 4 bits unset.) His CRC-calculation code is correct for checking the CRC-32 in a protocol where the CRC-32 is appended to the message such that the CRC-32 of the message with the appended CRC-32 is a constant. However, it appears that he didn't append 4 null bytes to his keys (as is effectively done in protocol implementations by zero-ing out the 4 bytes left for the CRC at the end of the message buffer). Note that the worked-through 3-bit CRC example in the Wikipedia article on CRCs includes padding with 3 zero bits.

These two flaws (improperly mixing lsb-first and msb-first implementations, and failing to pad with 32 zero bits) together account for his poor results.

Kmag (talk) 05:48, 15 December 2013 (UTC)

@Kmag: The CRC-32 mapping is given in the article as an example of intermediate hashing. Although CRC-32 is not intended to be a hashing algorithm and his length (4 bytes) is too small to avoid frequent collisions, it does give very different values for slightly different input by design. This is why it is used for error detection. A poor implementation does not mean the algorithm itself is weak. I think it is good that the source was removed because it creates a false impression. However, we don't really need to support the argument that CRC-32 produces very different results for slightly different input, because it is the main characteristic of the algorithm. Nxavar (talk) 10:54, 4 June 2014 (UTC)

Functions versus algorithms[edit]

Every once in a while, someone confuses functions with algorithms. Functions are mappings, algorithms are abstractions of computer programs. Functions are concepts, algorithms are implementations. So saying that "A hash function is any algorithm that maps ..." has things entirely backwards. Correct would be: "A hash function is a function that maps ...", and "A hash algorithm is an algorithm that implements some hash function". Boute (talk) 13:45, 27 April 2014 (UTC)

It's easy to see where the confusion comes from, the word "function" can refer to a subroutine in a computer program (implementation of an algorithm), as well as a "mathematical" function, which is an abstract mapping. Maybe it's easier to just avoid the distinction in the first place? -- intgr [talk] 11:56, 4 June 2014 (UTC)
@Nxavar: recently added this to the lead: "A hashing algorithm evaluates a hash function repeatedly to produce a hash value for input data of arbitrary size". I think you're referring to the "round function" in cryptographic hashes, which is indeed executed multiple times. -- intgr [talk] 12:08, 4 June 2014 (UTC)
I am not refering to the round function. The sentence should be understood in the context of how the hash function was just described: it takes data of fixed size and some hash value as an argument. I chose not to elaborate on how, for data of arbitrary size, the hash function can be used to produce a hash value: a) the input must be broken into chuncks that have the size supported by the hash function, and b) the hash function is evaluated for each chunk, with its output serving as the hash value for the next evaluation. This iterative procedure is the hashing algorithm, and terminates when all chunks have been processed, returning the last hash value. Checksums are produced the same way. Nxavar (talk) 14:22, 4 June 2014 (UTC)
There is nothing wrong in calling an algorithm with no side effects a function. Such algorithms are restricted to producing some output for given input which matches perfectly the definition for a function. Hashing algorithms have no side effects. I suggest that we go back to refering to the hash function as the hashing algorithm. Nxavar (talk) 04:09, 5 June 2014 (UTC)
I forgot to mention that in order to produce same output for same input, the algorithm must not use any global variables, and must be deterministic. Both are true for hashing algorithms. Nxavar (talk) 04:19, 5 June 2014 (UTC)
I think I see now what you mean. So you say that "hash function" is a function of hash(prev_state, fixed_block) -> new_state and a "hash algorithm" applies that function to arbitrary-length inputs?
I think we can't reasonably make progress here by hand-waving and opinions, so please provide some sources supporting this definition. This does not match any definitions of "hash function" that I've seen. Most definitions say "arbitrary-lenght input, fixed-length output". Admittedly these are cryptography sources because I couldn't find solid definitions for non-cryptographic hashes. For example Handbook of Applied Cryptography (Menezes et al) defines it as:
A hash function (in the unrestricted sense) is a function h which has, as a minimum, the following two properties:
1. compression — h maps an input x of arbitrary finite bitlength, to an output h(x) of fixed bitlength n
Applied Cryptography (Schneier) chapter 18:
A one-way hash function, H(M), operates on an arbitrary-length pre-image message, M. It returns a fixed-length hash value, h.
Most cryptographic hashes today use Merkle–Damgård construction or similar, which is indeed such an iterated algorithm, but the resulting construction is usually called "hash function", while the internal function is a "one-way compression function" or just "compression function". -- intgr [talk] 10:41, 5 June 2014 (UTC)
From the pseudocode given in the article for SHA-1:
...
Process the message in successive 512-bit chunks:
break message into 512-bit chunks
for each chunk
...
The chunk size is the block size mentioned in comparison tables of cryptographic hash functions. I do not think there is a distinction between a hash function and a hashing algorithm in the literature, but small hashes are usually based on some arithmetic expression and you can say that you have a "normal" function and an algorithm that uses it. Nxavar (talk) 12:19, 5 June 2014 (UTC)
@Nxavar: I do not think there is a distinction between a hash function and a hashing algorithm in the literature
Are you familiar with verifiability policy? This sounds like original research and does not belong to Wikipedia.
As for SHA-1, I already addressed that in my last comment: SHA-1 is a hash based on the Merkle–Damgård construction. The inner function is called a "one-way compression function" and the resulting construction is a "hash function". The Handbook of Applied Cryptography also uses this terminology. -- intgr [talk] 14:22, 5 June 2014 (UTC)
You focus too much on cryptographic hash functions. This is not the only type of hash function. What is called hash function for small hashes is called compression function for large ones. Cryptographic hash functions are effectively hash functions that produce large hashes. The qualifier cryptographic just signifies the intended area of application. Anyway, I completely agree with not including such information in the lead. What about putting it to the "Description" section, using appropriate terminology? Nxavar (talk) 15:10, 5 June 2014 (UTC)
Cryptographic hash function have higher standards for collision resistance ofcourse, but their basic algorithmic design is the same as for other hash functions. Nxavar (talk) 15:15, 5 June 2014 (UTC)

────────────────────────────────────────────────────────────────────────────────────────────────────

True, I brought up cryptographic hash functions because that's what I could get a solid definition for. Ignore the cryptography, I'm asking you to find a reliable source that supports your point of view. Wikipedia requires sources for contentious claims. If there's no source then Wikipedia shouldn't state this. -- intgr [talk] 16:34, 5 June 2014 (UTC)

Here are some sources:
Hash Functions
Selecting a hashing algorithm and the full article on the authors website.
They deal with small hashes. You can see that the hash functions are indeed refered to as hashing algorithms, and they use an "inner function" which is indeed an arithmetic expression. We now have references for the "outer function" - "inner function" concept, where the former feeds the input data in chunks to the latter, for both simple and cryptographic hash functions. Nxavar (talk) 18:03, 5 June 2014 (UTC)
I am not contesting the fact that you can split a hash function into an inner and outer part. I am talking specifically about the claim you are trying to make about the terms "hash function" and "hashing algorithm":
"A hashing algorithm evaluates a hash function repeatedly to produce a hash value for input data of arbitrary size"
Neither of these sources supports your distinction. In fact, they seem to contradict your claim. "Selecting a Hashing Algorithm" seems to use the two terms interchangeably and the other page describes the whole construction (including the loop) as "hash function". If I am mistaken, be more specific about how they support your claim. -- intgr [talk] 20:07, 5 June 2014 (UTC)
As I mentioned before, we can use appropriate terminology for the article. The terms are indeed used interchangeably, but when you try to use them together the obvious choice is to name the outer function the "hashing algorithm" and the inner function the "hashing function". This terminolgy ofcourse is not compatible with cryptographic hash functions because different terms are used in that case, even though they mean the same thing. However, by putting this content in the Description we won't have to be terse and compress everything to a single sentence. We can give all necessary details for avoiding misinterpretations. Nxavar (talk) 04:12, 6 June 2014 (UTC)
Sorry, but I disagree that your proposed terminology is "obvious" (or even accurate). I don't think you should add it to the article at all, on the grounds that it's original research. -- intgr [talk] 13:17, 6 June 2014 (UTC)
No problem. We can show how "ordinary" hash functions have an inner loop with an arithmetic expression and then show that cryptographic hash functions have an inner loop with a compression function, and that both inner loops work on data of fixed size, and that this is a similarity of the two types of hash function. Nxavar (talk) 14:10, 6 June 2014 (UTC)
Agreed. -- intgr [talk] 17:04, 6 June 2014 (UTC)

"The Hash Keeper database ..."[edit]

Since nobody knows what this is, what is point of the sentence?

"The remaining characters of the string ..".[edit]

Given that the previous so-called sentence runs on and on without saying anything, what "remaining characters"?