Talk:Pearson hashing

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science articles

???

This article has not yet received a rating on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Change of Python snippet

Important for hash function is to give us same result on every run, so shuffle is not allowed. --Bastie (talk) 08:53, 4 June 2023 (UTC)[reply]

I think the code is more broken now, as you've left the shuffle() in, but the array is empty? David Malone (talk) 15:38, 4 June 2023 (UTC)[reply]

Wikipedia is not the right place to share code. It is not the right place to debug code. It is the right place to quote code examples from reliable sources. See wp:rs and wp:or, two of the relevant policies. David Spector (talk) 18:21, 4 June 2023 (UTC)[reply]

I would be happy for someone to delete the Python - it doesn't match the pseudocode anyway, as it starts with the length mod 256, rather than 0. That's not wrong, but it isn't consistent either. David Malone (talk) 13:21, 7 June 2023 (UTC)[reply]

of course wikipedia is no git repository and also wikipedia should be correct even if code is provided

so follow your meaning it is better to remove Python code and let the C# code as example Bastie (talk) 10:48, 17 June 2023 (UTC)[reply]

Removal of Python snippet

I've removed the Python code snippet for the following reasons:

One code snippet is sufficient - the pseudocode is enough to clearly demonstrate the principle of the algorithm
It's redundant - the Python snippet implements the algorithm in precisely the same way as the pseudocode
Precedent on most algorithm articles is to remove duplicate code examples in different languages if there's already one canonical example. Sometimes they've been moved to Wikibooks, but I hardly think that's worth it for 5 lines of code.

Oli Filth^{(talk|contribs)} 17:36, 11 September 2009 (UTC)[reply]

1) It's not sufficient. Pseudo-code is good for showing the concept, since it's not wedded to an implementation. Sample code is good for actually being able to step through a working implementation, as well as using it as a reference for your own.

2) It's not redundant. Look at them for yourself and you'll see there are optimizations in the Perl that were not present in the pseudo-code and which make it less than obvious that the two have identical behavior.

3) This precendent is disputed.

208.80.104.2 (talk) 18:20, 11 September 2009 (UTC)[reply]

Please stop this; you've already made it explicitly clear that you're willing to edit war until someone gets blocked, if you keep this up I'll report you for tendentious editing.

I've pointed you at examples where consensus has been to remove pointless extra code snippets, and your claim that the Python example is "optimised" is nonsense (there's no such optimisation); even if it were true it would not be relevant, as Wikipedia is not a how-to guide. Furthermore, the pseudocode is from a referenced source (apparently), whereas the Python isn't. Oli Filth^{(talk|contribs)} 18:26, 11 September 2009 (UTC)[reply]

Please don't be uncivil my misrepresenting what I've said. I said repeatedly that I would prefer to avoid conflict. I also said that if you try to get me blocked on false charges of incivility, you will expose yourself to a block on true charges of the same. In the interest of avoiding conflict, I will wait until you can manage a civil response. Until then, I will edit as I see fit, within the rules of Wikipedia. 208.80.104.2 (talk) 18:53, 11 September 2009 (UTC)[reply]

So you're opting not to reply to the counterarguments I made?

For reference, I was referring to this edit of yours. Oli Filth^{(talk|contribs)} 18:56, 11 September 2009 (UTC)[reply]

And conveniently ignoring the [1] next edit. Very selective of you. 208.80.104.2 (talk) 18:58, 11 September 2009 (UTC)[reply]

To be clear, so long as you are uncivil, I will not pretend that you have any arguments to refute. 208.80.104.2 (talk) 18:59, 11 September 2009 (UTC)[reply]

That approach will get you nowhere fast. If you aren't going to participate in discussion, then don't make edits to the article! Oli Filth^{(talk|contribs)} 19:02, 11 September 2009 (UTC)[reply]

Practical impact

I have a bit of a conflict here, as I know that Pearson's Hash benchmarks well for many uses, from my own testing. So the practical impact is quite significant, and worth noting. However, as I understand, Wikipedia "no original research" policy frowns on linking back to your own articles, so I have NOT linked back to my own writings. Which also seems to make writing a weblog article to document my own benchmarks as not-quite-suitable. So how to make this all "legitimate"? pbannister (talk) 21:50, 17 January 2010 (UTC)[reply]

Write a good article/blog post on it and wait for somebody else to consider it to be a good source for this article. That's all you can do. 82.33.61.156 (talk) 16:00, 18 January 2010 (UTC)[reply]

I have used Pearson Hashing in my programs, and it works well. But I also found that it is easy to extend it to multiples of 8 bits (without requiring a table of size two to the 16). Even extending it only to 16 bits makes it practical for use in large string-handling programs. I haven't found any place on the Web where this is discussed, but someone would surely find such a discussion valuable. Even so, it does not belong in WP, since this is OR. The world needs a spam-free general discussion forum that covers all of human knowledge, a kind of WP without being limited to that which is known to be true. David Spector (talk) 13:37, 8 August 2011 (UTC)[reply]

Incorrect C code for 64 bits

In article there is talking about function that accepts any kind of byte, but c code example is for null terminated strings. I suggest function phash(char* buffer,int length,unsigned hash) —Preceding unsigned comment added by 89.212.91.6 (talk) 14:00, 25 January 2010 (UTC)[reply]

I second that the C code looks dubious, for 2 reasons. First it claims to return a 64bit value, but unless I am mistaken I see that it returns an 8 bit char. I also think that line 32 (h = 0) is incorrect. In this case, a collision that is found for a single byte of output, will continue to generate a collision no matter how many bytes of output are generated. Simply removing this line would remove this problem. 74.205.213.237 (talk) 20:24, 20 October 2013 (UTC)[reply]

The C implementation does not initialize the retval variable before use. Hence the value returned is undefined. 203.129.23.7 (talk) 10:28, 8 May 2022 (UTC)[reply]

Check the line after Change the first byte - the hash for each byte of the result is initialised there? David Malone (talk)

No, the 'retval' variable appears to be uninitialized. It is a horribly buggy algorithm that obviously has not been tested. Untested and incorrect and unreferenced original research must not be published in WP, so I will attempt to delete it. David Spector (talk) 10:25, 9 May 2022 (UTC)[reply]

I think the usage is technically uninitialised by Appendix J.2 of the C standard. However, the initialised bits are all shifted right off the top of the 64 bit value, so I suspect it was tested and probably works on most C compilers, with less aggressive optimisation. Initialising it to zero would probably have been better than removing the code. David Malone (talk)

You probably did not read the comment I put on the deletion. I did not delete it due to the incorrectness and obvious lack of testing. I deleted it because of WP:OR. It needs to stay out of WP until it has been published and accepted by the programming community. David Spector (talk) 14:42, 9 May 2022 (UTC)[reply]

Extension to 16 bits or more; best hashing?

It is easy to extend PH to 16 bits or more. Simply break up the large number into bytes, then PH each byte, using a seed which is in the program as a random constant for each byte number; concat the results to yield a hash value that is as large as the input domain. As usual, the result may be reduced to any range using bit-masking or mod.

Nobody's studied this alg. The fact that the initial seed is different for each byte should give independence for each byte and good distribution for the result, but this isn't obviously true, so it has to be proved.

Can't be added yet due to WP:OR. I invented this and have used it ever since PH was published.

It's possible that PH is a "best possible" (evenly distributed for many random inputs) non-complete and non-minimal hashing alg. Could this be proved?

David Spector (talk) 18:07, 19 April 2010 (UTC)[reply]

Seems to me there might be a more efficient notion for extending the Pearson hash to 16-bits or more, at the same cost (or nearly so) as the 8-bit hash.

Extending the Pearson hash function to larger values

No attempt at formal proof, and exclusion of original research puts this outside the scope of the Wikipedia article.

pbannister (talk) 00:25, 20 April 2010 (UTC)[reply]

If you think about it, for C-string, all you need to generilize this algorithm is to cast (char*) to (ushort*), (ulong*) or (unsigned long long*) and complement the whole string with zeros to avoid any sneaky SIGSEGVs plus a little bitwise arithmetics. FYI. — Preceding unsigned comment added by 217.9.91.108 (talk) 22:32, 31 January 2013 (UTC)[reply]

Opinions and possible bias

There are several opinionated statements throughout this article:

"The code above is intended to make Pearson's excellent algorithm useful to many."
"Pearson's scheme produces an 8 bit byte, or integer, 0-255. That's not very useful."
"But using the logic in xPear16 to produce 128-bit hashes would be rather pointless, because it could generate those very long hashes only about 3/4 as fast as MD5"
"Beyond hashes 512 hex digits long - (absurdly long) - that scheme would fail"

These are just some of the examples. It may be worth taking a look to reword these statements to remove potential bias. Also, the singular citation seems rather weak to me, personally, but then I'm not willing to buy the article. If the article does support the listed benefits, it would at least seem reassuring to add a reference to the article there ("but it offers these benefits") -- otherwise the claims seem unverified. -66.87.75.174 (talk) 19:28, 15 May 2013 (UTC)[reply]

How useful is an 8-bit hash? :-) But humor aside, I get your point. I have made some revisions in response to the legitimate concern you expressed. But I don't believe that the remark 'excellent algorithm' is objectionable. It is a value judgement, but it's one based on a lot of experience in programming - I have decades - and before I wrote xPear16, I spent a fair amount of time simply studying the logic, so I could develop some deep confidence in it, which I did. I will take the time to explain my 'excellent' assessment rather than withdrawing that particular comment. I think it belongs in the article. Note: I'm not praising my program; I'm praising Pearson's wonderful little algorithm, there. It's some worthy niftiness.

Pearson is excellent. It's elegant and praise-worthy for three reasons. (1) It's remarkable simple, (2) it's fast (compared to other hashing algorithms) and (3) it offers excellent randomness in the values output. In computer science, in general, the first two traits are naturally admired. Simple code tends to be solid, reliable and it can be incredibly efficient. (Binary search is very simple, but wonderfully fast. This is similar. How much praise could we offer binary search? Plenty.) Speed is also generally desirable. In hashing, the third trait (randomness) is very valuable, if not essential. The greater the randomness, the less chance of a collision.

In a 64-bit number space, close to perfect randomness - what Pearson provides - makes hash collisions so unlikely that there is no practical need to test for them. That is NOT the case with 32-bit hashes. (I'm talking here from extensive practical experience, not theoretical grounds. I use hashes a lot.) Given perfectly random 32-bit hashes, and 'only' about a milion values generated, the reality is that one should expect a collision. Even if the chance is (apparently) only one in some thousands, there is almost bound to be a collision. That is the well-known 'birthdays in common' issue. Generally, with a million items, about one collision does occur, even though only about 1/4,000th of the number space is used. A million out of four billion numbers: good chance of collision. It seems contrary to reason, but it happens more often than not.

However in the realm of 64-bit numbers, it's a very different story. (Given the same data set.) The values used are then so sparse that the probability of a collision becomes infintessimal.

Trusting that the set of values generated will be collision-free depends (very much) on the randomness of the hash values output. One needs something very good, to make testing for collisions unnecessary, in practical terms. In 64-bit numbers, one could call it the chance of collsion the chance of fairy dust colliding. (Not likely.) 64-bit numbers are virtually ideal for hashing, because essentially the collision problem goes away. (If.)

Pearson's algorithm is excellent that way. Excellent randomness implies (in 64-bit numbers) no need to test for collisions, unless one is hashing many billion of items, then maybe testing would become necessary, at some point. No need to test for collisions also helps in terms of speed, of course. The argument is simple. Making the number space four billion times larger makes the chance of a collision four billion times smaller. (Such that it's no longer an issue.)

In summary: from Pearson's speedy, simple, little algorithm comes - hash-wise - that third very nice and desirable property: randomness. The values the program shown generates seem to be perfectly random, which means that it seems to be optimal that way. It's like the perfect hashing scheme, if you like. The credit isn't due to me; it's due to Mr Pearson, for inventing it. I just figured out how to make it more useful.

In any event, I would say that the merits of the algorithm (considerable) make 'excellent' an okay comment. That's not 'bias'. How to say this better? It's as good as hashing gets. ??? :-)

For two solid practical reasons, I believe Pearson's algorithm deserves the compliment I offered and the unbiased description excellent. The third reason - simplicity - is aesthetic. It's pleasing intellectually that it's so simple.

I will add a wee bit of personal history, that is relevant. Years ago I wrote my own hashing routine, it became very useful to me, it was excellent in terms of randomness (thus very useable), but rather slow, because it involved a lot of multiplying and adding. I eventually found a much faster routine, called SuperFastHash. I found it very good - much faster than mine, really speedy - but I was able to detect some subtle non-randomness in its output. (Not optimal. Nonetheless I extended it from 32 to 64 bits, and began using it.)

xPear16 is not only faster than SuperFastHash, but it also provides appreciably more random output. Checking digits was part of reaching that conclusion. It is literally the best - the fastest, with the most random output - hashing method that I am aware of. Now, prior to discovering Pearson, in trying to improve my own 'ugly' (brute force) way, that probably had the CPU running rather hot, I investigated a good many hashing schemes (at least a dozen), and speed-compared them all. They are mostly all complicated, logic wise; thus all were appreciably slower than SuperFastHash. But more complex to what benefit? I didn't see it. In practical terms, I don't much care what's 'under the hood', I just want it to go. I wanted very random; I also wanted really fast. At the end of all my investigating, and analyzing, and testing, I ended up with Pearson, and what I have personally devised from it.

My faith in Pearson's beautiful original algorithm is deep. (Deep enough to warrant this little essay on the subject.) I think it's the winner. The best, in three different respects. Incredibly, beautiful simple; really fast (I've pigged it down with the multiple scans, but it's still faster than the fastest hashing method I found and was using). And best of all: first-class output. If admiring an algorithm that is truly worthy makes me biased...so be it. Pearson's nifty hashing scheme will stand on its own merits, no matter what I say about it. :-) Cheers. Stellar-TO (talk) 14:15, 23 May 2013 (UTC)[reply]

Agree. At the time Pearson published his article in Communications of the ACM I was similarly impressed with its beautiful logic for converting the key into a smeared, orthogonal combination of random numbers, one for each bit, in order. I constructed testing programs to compare it to the best algorithms I could find or invent. It did much better on statistical and other measures.

The limit of being a one-byte hash was not a problem for its original application: to write a lexer to classify a word quickly as a C-language keyword or as a user identifier. There were fewer than 255 such keywords. But any software engineer of that period, reading the article, would see ways to increase the bit-width limit beyond 8 bits. The orthogonality of the algorithm made that possible by adding one 256-entry table for each additional byte in key size (or, with some sacrifice in good statistics, by simply changing the salt for each byte in the key, using the same 256-entry table for all bytes). So, what seems like a severe limitation in the original algorithm is really easy to get around: instead of 256**n entries, which is impractical, the table need only have 256*n entries, where n is the desired number of bytes in a keyword. Thus, Pearson's algorithm scales up well to any number of bytes. In many applications, as Stellar-TO stated, 8 byte keywords will in practice never collide.

Similarly, I have extended Pearson's algorithm to create an excellent extendible hashing function, which means that the hash table is extended by one entry for each additional keyword hashed, with minimal rehashing, using published methods. Part of the excellence of Pearson's algorithm is its flexibility and adaptability to a wide variety of applications.

The WP policy of neutrality should not be used to object to praise when praise is warranted. Furthermore, anyone who wishes to balance the praise of Pearson hashing with an equal amount of criticism is going to have to face the lack of such criticism in computer science. It may be possible to find applications for which Pearson hashing is not ideal (it is neither minimal nor perfect, for example), but such examples would not constitute criticism of the algorithm itself, only appropriate limits to using any hashing function.

Finally, anyone thinking they know of a flaw in Pearson's function must find a reliable citation for such a flaw, since WP reports knowledge as it already exists, not original research. All this having been said, if the language used in the article is un-encyclopedic (for example, it is too colloquial), feel free to improve it, to make it more professional-sounding, as is appropriate for an encyclopedia. For example, "The code above is intended to make Pearson's excellent algorithm useful to many" might be rewritten as "The code above is intended to make Pearson's algorithm more accessible to those who might benefit from its excellent freedom from collisions." Just an example--I'm not claiming this is the best possible rewording. David Spector (talk) 00:51, 13 July 2013 (UTC)[reply]

I wrote the above in response to this Talk section before looking at the two most recent edits. I stand behind what I wrote, but I also agree with the two edits. The first, a deletion, was for an extension of Pearson hashing contributed by an editor. While it is obviously useful, it is also obviously original research, and therefore prohibited from WP. There are probably other fora on the Web for such a contribution. The following edit added a statement concerning the requirement for a table of constants. This is a valid objection to the algorithm in certain applications where 256 words of memory are not easy to come by. These days, that may not seem reasonable in view of the very large and cheap memories available, but nevertheless, such limitations may exist in the world of embedded applications. David Spector (talk) 01:05, 13 July 2013 (UTC)[reply]

I just added a reference to an online version of the defining paper, so editors can judge the facts reported in the article for themselves. David Spector (talk) 01:16, 13 July 2013 (UTC)[reply]

64-bit hash C code restored

I was very dismayed to see my contribution deleted. I shouldn't have to, but let me argue it again. There is limited and questionable utility in an (only) 8-bit hash generator. It's lovely in principle, and in the case of the Pearson hash, it even warrants the term elegant. But it's of limited and questionable practical utility. However, with a bit of additional logic wrapped around it, Pearson's algorithm can be *extremely* useful. Many of my programs are riddled with it, and rely heavily on it. That is why my programs tend to be really fast. Reduce most things to numbers, then it's all just number-crunching. Which means they use an extended - my - form of Pearson's method. A very fast, quality (very random) 64-bit/16-byte hash generator *is* really useful. With a bit of tweaking, the code I've contributed could generate *very* long Pearson hashes; even 256 bytes or 1,024 bits long: possible, with the same code, slightly tweaked. It would do it fast, and no need to worry about any repeating values or integer overflow.

To the anonymous IP yonk who stupidly removed the section of the article that explains in clear terms *how* it can be easily done: the article is now back to showing, how very *useful* Pearson's fast, excellent little algorithm is. His method has so many advantages. Given that I've had much experience using it, and have never seen even a hint of a problem, it's undoubtedly the only hashing method I will ever use. But only writing that C program allowed me to realize how simple and easy it was to generate long hashes, using his method. Without a *clear* explanation, how do people understand? The best way is to show them some actual *logic*. Then they can soon go: ahah. And maybe begin using it.

Yes, there are other places I could put it. In fact, I'm sure the web pages devoted to Pearson hashing are legion. But how dumb was your reason - "the Internet is big" - for objecting to my putting that code here, when it gave a clear illustration of why the Pearson algorithm, in being so simple, is great? I say that's downright *dense*, pally. Stellar-TO (talk) 13:12, 16 September 2013 (UTC)[reply]

I read this quickly, so I hope I understand the issues. The Pearson algorithm was originally created to recognize the keywords of a programming language. Since there were never likely to exist more than 256 such keywords, and the algorithm is fast, CACM published the algorithm.

As you correctly point out, using 8 bits for a hash key is impractical for real-world hash coding. Therefore, extensions of Pearson hashing are needed.

However, any such extension you have designed yourself is considered "original research" and is actually banned from Wikipedia. An example reason is that original research has not always gone through public scrutiny, which means that it might contain an error not obvious to the originator of the research. Errors in hashing algorithms are common, since good statistical behavior is needed to prevent unwanted collisions and other problems.

But whether you agree with the original research policy or not, WP requires us editors to remove edits lacking reliable sources (in this case, examples might include textbooks or professional magazines) or referencing original research. Please see Wikipedia:No original research and other articles about WP policies. David Spector (talk) 13:13, 4 June 2023 (UTC)[reply]

Is Pearson a cryptographic hash function?

Does anyone know of the Pearson algorithm having been analyzed as a possible Cryptographic hash function? It is an obvious choice as the basis for a Hash table implementation, but I haven't seen its properties promoted for Cryptography. Even if it is cryptographically excellent in its native 8 bits, in practice longer extensions must be used, and therefore analyzed. (The great speed of the Pearson algorithm can also be a drawback in some cryptographic applications, but an advantage in others.) David Spector (talk) 21:55, 12 April 2014 (UTC)[reply]

I'd like to add my opinion that because of the simplicity of its algorithm, I highly doubt that any extensions beyond 8 bits of key that preserve the way Pearson hashing works would be cryptographically unacceptable. I haven't seen an unacceptability proof myself, but I believe it is likely to be an easy one. David Spector (talk) 13:17, 4 June 2023 (UTC)[reply]

In my haste, I wrote this incorrectly. I meant to say that all extensions based on PH will NOT likely be cryptographically acceptable. The algorithm is great for hashing but very poor for encrypting. David Spector (talk) 19:19, 7 June 2023 (UTC)[reply]

Pearson is not intended as a cryptographic hash function. Generation of many collisions is easy by choosing the output value, x, you want, looking up what value in the permutation produces that output T[y] = x, and then looking up then choosing your last inputs to be b1 and b2 so that T[b1] = b2^y. If you don't know the table T, there are even ways to determine it from observing collisions. David Malone (talk)

Proposal to delete the "Example implementations" section

I propose deleting the Example implementations section. While it is nice to have actual running code in WP, that is not part of WP's mandate; there may be other websites that provide sample PH code that has been tested professionally. We don't know who wrote the current code, whether it was tested other than by the author, or where the code comes from, since it contains no references. WP prohibits including unpublished material; see WP:OR. In addition, WP requires references to be to reliable sources, see WP:RS. I believe it is better to publish nothing than to publish code that may be incorrect or may not work on all versions of available compilers, interpreters, and OSs.

Finally, WP is an encyclopedia (see WP:Wikipedia is an encyclopedia); it is not a place to publish sample source code in multiple languages, whether tested or untested, unless the code was published already in a reliable source and is known not to be copyrighted. David Spector (talk) 21:23, 14 June 2023 (UTC)[reply]