Perfect hash function

Hash function without any collisions

title: "Perfect hash function" type: doc version: 1 created: 2026-02-28 author: "Wikipedia contributors" status: active scope: public tags: ["hashing", "hash-functions", "search-algorithms"] description: "Hash function without any collisions" topic_path: "technology/cryptography" source: "https://en.wikipedia.org/wiki/Perfect_hash_function" license: "CC BY-SA 4.0" wikipedia_page_id: 0 wikipedia_revision_id: 0

::summary Hash function without any collisions ::

::figure[src="https://upload.wikimedia.org/wikipedia/commons/7/71/Hash_table_4_1_1_0_0_0_0_LL.svg" caption="A perfect hash function for the four names shown"] ::

::figure[src="https://upload.wikimedia.org/wikipedia/commons/2/2e/Hash_table_4_1_0_0_0_0_0_LL.svg" caption="A minimal perfect hash function for the four names shown"] ::

In computer science, a perfect hash function h for a set S is a hash function that maps distinct elements in S to a set of m integers, with no collisions. In mathematical terms, it is an injective function.

Perfect hash functions may be used to implement a lookup table with constant worst-case access time. Like any hash function, a perfect hash function can be used to implement hash tables, with the advantage that no collision resolution has to be implemented. In addition, if the keys are not in the data and if it is known that queried keys will be valid, then the keys do not need to be stored in the lookup table, saving space.

A disadvantage of perfect hash functions is that S needs to be known in order to construct them. Non-dynamic perfect hash functions need to be reconstructed if S changes. For frequently changing S, dynamic perfect hash functions may be used at the cost of additional space. The space required to store the perfect hash function is in O(n), where n is the number of keys in the structure.

The important performance parameters for perfect hash functions are the evaluation time, which should be constant, the construction time, and the representation size.

Application

A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a lookup table indexed by the output of the function. One can then test whether a key is present in S, or look up a value associated with that key, by looking for it at its cell of the table. Each such lookup takes constant time in the worst case. With perfect hashing, the associated data can be read or written with a single access to the table (Lu, Yi; Prabhakar, Balaji; Bonomi, Flavio. (2006). "Perfect Hashing for Network Applications". 2006 IEEE International Symposium on Information Theory, pp. 2774–2778. doi:10.1109/ISIT.2006.261567).
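The table lookup described above can be sketched in Python. This is only an illustration, not the method of the cited paper: the helper names (`seeded_hash`, `find_perfect_seed`, `build_table`, `lookup`), the BLAKE2b-based hash, the table size of 8 and the brute-force seed search are all assumptions chosen for brevity, and the search is practical only for very small sets.

```python
import hashlib

def seeded_hash(key: str, seed: int, m: int) -> int:
    """Hypothetical seeded hash into range(m), built on BLAKE2b for determinism."""
    data = seed.to_bytes(8, "little") + key.encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % m

def find_perfect_seed(keys, m, max_tries=100_000):
    """Brute-force a seed for which seeded_hash is injective on keys."""
    for seed in range(max_tries):
        if len({seeded_hash(k, seed, m) for k in keys}) == len(keys):
            return seed
    raise ValueError("no perfect seed found; try a larger m")

def build_table(items: dict, m: int):
    """Place each (key, value) pair in the cell chosen by the perfect hash."""
    seed = find_perfect_seed(list(items), m)
    table = [None] * m
    for key, value in items.items():
        table[seeded_hash(key, seed, m)] = (key, value)
    return seed, table

def lookup(table, seed, key):
    """Single table access: the value stored for key, or None if key is not in S."""
    cell = table[seeded_hash(key, seed, len(table))]
    if cell is not None and cell[0] == key:
        return cell[1]
    return None

# Illustrative data: four names mapped to arbitrary values.
records = {"John Smith": 1, "Lisa Smith": 2, "Sam Doe": 3, "Sandra Dee": 4}
seed, table = build_table(records, m=8)
print(lookup(table, seed, "Sam Doe"))     # 3
print(lookup(table, seed, "Ted Baker"))   # None, since the key is not in S
```

Storing the key next to the value is what allows the membership test. If every queried key is known to be valid, the comparison and the stored keys can be dropped.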

Performance of perfect hash functions

The important performance parameters for perfect hashing are the representation size, the evaluation time, the construction time, and additionally the range requirement m/n (the average number of buckets per key in the hash table). The evaluation time can be as fast as O(1), which is optimal. The construction time must be at least linear in n, because each of the n elements in S needs to be considered. This lower bound can be achieved in practice.

The lower bound for the representation size depends on m and n. Let m = (1+ε)n and let h be a perfect hash function. A good approximation for the lower bound is $\log_2 e - \varepsilon \log_2 \frac{1+\varepsilon}{\varepsilon}$ bits per element. For minimal perfect hashing, ε = 0, the lower bound is $\log_2 e \approx 1.44$ bits per element.

Construction

A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S. The original construction of Fredman, Komlós & Szemerédi (1984) uses a two-level scheme to map a set S of n elements to a range of O(n) indices, and then map each index to a range of hash values. The first level of their construction chooses a large prime p (larger than the size of the universe from which S is drawn), and a parameter k, and maps each element x of S to the index $g(x) = (kx \bmod p) \bmod n$. If k is chosen randomly, this step is likely to have collisions, but the number of elements $n_i$ that are simultaneously mapped to the same index i is likely to be small. The second level of their construction assigns disjoint ranges of $O(n_i^2)$ integers to each index i. It uses a second set of linear modular functions, one for each index i, to map each member x of S into the range associated with g(x) (Fredman, Michael L.; Komlós, János; Szemerédi, Endre. (1984). "Storing a Sparse Table with O(1) Worst Case Access Time". Journal of the ACM 31 (3): 538. doi:10.1145/828.1884).

As Fredman, Komlós & Szemerédi (1984) show, there exists a choice of the parameter k such that the sum of the lengths of the ranges for the n different values of g(x) is O(n). Additionally, for each value of g(x), there exists a linear modular function that maps the corresponding subset of S into the range associated with that value. Both k, and the second-level functions for each value of g(x), can be found in polynomial time by choosing values randomly until finding one that works.

The hash function itself requires storage space O(n) to store k, p, and all of the second-level linear modular functions. Computing the hash value of a given key x may be performed in constant time by computing g(x), looking up the second-level function associated with g(x), and applying this function to x. A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps S into a smaller range of length n + o(n).
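The two-level scheme can be sketched as follows. This is a simplified illustration rather than the authors' exact procedure: the function names `build_fks` and `fks_hash`, the fixed prime `P`, the acceptance threshold of `5 * n` for the first level, and the choice of exactly n_i² cells per bucket are assumptions made here, and keys are assumed to be distinct non-negative integers smaller than `P`.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2**31 - 1; every key must be smaller

def build_fks(keys, p=P, rng=None):
    """Build a two-level perfect hash for distinct integer keys in range(p)."""
    rng = rng or random.Random(0)
    n = len(keys)
    # Level 1: redraw k until the second-level ranges sum to O(n) cells.
    # The bound 5 * n is an illustrative constant, not taken from the source.
    while True:
        k = rng.randrange(1, p)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[(k * x % p) % n].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 5 * n:
            break
    # Level 2: give bucket i a private range of n_i**2 cells and redraw k_i
    # until the keys of the bucket land in distinct cells of that range.
    second, offset = [], 0
    for b in buckets:
        size = len(b) ** 2
        k_i = 0
        while b:
            k_i = rng.randrange(1, p)
            if len({(k_i * x % p) % size for x in b}) == len(b):
                break
        second.append((k_i, offset, size))
        offset += size
    return {"k": k, "p": p, "n": n, "second": second, "range": offset}

def fks_hash(f, x):
    """Evaluate in O(1): one first-level step, then one second-level step."""
    k_i, offset, size = f["second"][(f["k"] * x % f["p"]) % f["n"]]
    if size == 0:  # empty bucket, so x cannot be a member of S
        return None
    return offset + (k_i * x % f["p"]) % size

keys = [12, 7919, 31337, 65536, 1_000_003]
f = build_fks(keys)
print(sorted(fks_hash(f, x) for x in keys))  # five distinct indices below f["range"]
```

The stored description consists of `k`, `p` and one `(k_i, offset, size)` triple per first-level index, which is the O(n) storage mentioned above.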

A more recent method for constructing a perfect hash function is described by Belazzougui, Botelho & Dietzfelbinger (2009) as "hash, displace, and compress". Here a first-level hash function g is also used to map elements onto a range of r integers. An element x ∈ S is stored in the bucket $B_{g(x)}$.

Then, in descending order of size, the elements of each bucket are hashed by one function from a sequence of independent, fully random hash functions ($\Phi_1, \Phi_2, \Phi_3, \ldots$), starting with $\Phi_1$. If the hash function produces no collisions within the bucket, and the resulting values are not yet occupied by elements from other buckets, that function is chosen for the bucket. If not, the next hash function in the sequence is tested.

To evaluate the perfect hash function h(x), one only has to store the mapping σ from the bucket index g(x) to the correct hash function in the sequence, resulting in $h(x) = \Phi_{\sigma(g(x))}(x)$.

Finally, to reduce the representation size, the values $(\sigma(i))_{0 \le i < r}$ are compressed into a form that still allows evaluation in O(1).

This approach needs linear time in n for construction, and constant evaluation time. The representation size is in O(n), and depends on the achieved range. For example, Belazzougui, Botelho & Dietzfelbinger (2009) achieved a representation size between 3.03 bits/key and 1.40 bits/key for their given example set of 10 million entries, with lower values needing a higher computation time. The space lower bound in this scenario is 0.88 bits/key.

Pseudocode

algorithm hash, displace, and compress is
(1) Split S into buckets B_i := g⁻¹({i}) ∩ S, 0 ≤ i < r
(2) Sort buckets B_i in falling order according to size |B_i|
(3) Initialize array T[0...m-1] with 0's
(4) for all i ∈ [r], in the order from (2), do
(5)     for l ← 1, 2, ...
(6)         repeat forming K_i ← {Φ_l(x) | x ∈ B_i}
(6)         until |K_i| = |B_i| and K_i ∩ {j | T[j] = 1} = ∅
(7)     let σ(i) := the successful l
(8)     for all j ∈ K_i let T[j] := 1
(9) Transform (σ_i)_{0 ≤ i < r} into compressed form, retaining O(1) access.
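Steps (1) to (8) of the pseudocode can be tried out with the following Python sketch. It is a hedged illustration, not the published implementation: the names `build_hdc` and `hdc_hash` are hypothetical, BLAKE2b with a seed prefix stands in for the sequence of fully random hash functions (seed 0 plays the role of g, seed l the role of Φ_l), keys are assumed to be distinct strings, and the compression step (9) is left out.

```python
import hashlib

def _hash(key: str, seed: int, modulus: int) -> int:
    """Stand-in for a fully random hash function selected by seed (an assumption)."""
    data = seed.to_bytes(8, "little") + key.encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % modulus

def build_hdc(keys, m: int, r: int):
    """Return sigma, where sigma[i] is the index l of the function chosen for bucket i."""
    if m < len(keys):
        raise ValueError("the range m must be at least the number of keys")
    # (1) split S into r buckets using g, here the hash with seed 0
    buckets = [[] for _ in range(r)]
    for x in keys:
        buckets[_hash(x, 0, r)].append(x)
    # (2) process buckets in descending order of size
    order = sorted(range(r), key=lambda i: len(buckets[i]), reverse=True)
    occupied = [False] * m          # (3) the array T
    sigma = [0] * r                 # 0 marks an empty bucket
    for i in order:                 # (4)
        bucket = buckets[i]
        if not bucket:
            break                   # all remaining buckets are empty
        l = 1
        while True:                 # (5)-(6): try Phi_1, Phi_2, ...
            slots = {_hash(x, l, m) for x in bucket}
            if len(slots) == len(bucket) and not any(occupied[j] for j in slots):
                break
            l += 1
        sigma[i] = l                # (7)
        for j in slots:             # (8)
            occupied[j] = True
    return sigma

def hdc_hash(x: str, sigma, m: int) -> int:
    """h(x) = Phi_{sigma(g(x))}(x), meaningful only for x in S."""
    return _hash(x, sigma[_hash(x, 0, len(sigma))], m)

keys = [f"key-{i}" for i in range(1000)]
m, r = 1250, 250                    # illustrative sizes
sigma = build_hdc(keys, m, r)
print(len({hdc_hash(x, sigma, m) for x in keys}))  # 1000, one slot per key
```

Only the list `sigma` has to be kept to evaluate the function. Because most of its entries are small integers, it is this list that step (9) would compress.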

Space lower bounds

The use of O(n) words of information to store the function of Fredman, Komlós & Szemerédi (1984) is near-optimal: any perfect hash function that can be calculated in constant time requires at least a number of bits that is proportional to the size of S (Fredman, Michael L.; Komlós, János. (1984). "On the size of separating systems and families of perfect hash functions". SIAM Journal on Algebraic and Discrete Methods 5 (1): 61–68. doi:10.1137/0605009).

For minimal perfect hash functions, the information-theoretic space lower bound is $\log_2 e \approx 1.44$ bits/key.

For perfect hash functions, it is first assumed that the range of h is bounded in terms of n as m = (1+ε)n. With the formula given by Belazzougui, Botelho & Dietzfelbinger (2009), and for a universe U ⊇ S whose size tends towards infinity, the space lower bound is $\log_2 e - \varepsilon \log_2 \frac{1+\varepsilon}{\varepsilon}$ bits/key, minus log(n) bits overall.
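To get a feeling for how the bound falls as the range grows, the per-key formula can be evaluated numerically. The sketch below assumes that the range is m = (1+ε)n, that both logarithms are base 2, and it ignores the additive log(n) term. The function name `perfect_hash_lower_bound` is introduced here for illustration.

```python
import math

def perfect_hash_lower_bound(eps: float) -> float:
    """Approximate space lower bound in bits/key for a range of m = (1 + eps) * n."""
    if eps < 0:
        raise ValueError("eps must be non-negative")
    if eps == 0:  # minimal perfect hashing: the second term vanishes
        return math.log2(math.e)
    return math.log2(math.e) - eps * math.log2((1 + eps) / eps)

for eps in (0.0, 0.5, 1.0):
    print(f"eps={eps:.1f}: {perfect_hash_lower_bound(eps):.2f} bits/key")
# eps=0.0: 1.44 bits/key
# eps=0.5: 0.65 bits/key
# eps=1.0: 0.44 bits/key
```

A larger range leaves more freedom in choosing the function, so fewer bits per key are needed to describe it.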

Extensions

Dynamic perfect hashing

Main article: Dynamic perfect hashing

Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated. This is because any modification of the set S may cause the hash function to no longer be perfect for the modified set. Solutions which update the hash function any time the set is modified are known as dynamic perfect hashing (Dietzfelbinger, Martin; Karlin, Anna; Mehlhorn, Kurt; Meyer auf der Heide, Friedhelm; Rohnert, Hans; Tarjan, Robert E. (1994). "Dynamic perfect hashing: upper and lower bounds". SIAM Journal on Computing 23 (4): 738–761. doi:10.1137/S0097539791194094), but these methods are relatively complicated to implement.

Minimal perfect hash function

A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers, usually the numbers from 0 to n − 1 or from 1 to n. A more formal way of expressing this is: let j and k be elements of some finite set S. Then h is a minimal perfect hash function if and only if h(j) = h(k) implies j = k (injectivity) and there exists an integer a such that the range of h is a, a + 1, ..., a + |S| − 1. It has been proven that a general purpose minimal perfect hash scheme requires at least $\log_2 e \approx 1.44$ bits/key (Belazzougui, Djamal; Botelho, Fabiano C.; Dietzfelbinger, Martin. (2009). "Hash, displace, and compress". Algorithms - ESA 2009. Lecture Notes in Computer Science 5757. Berlin: Springer, pp. 682–693. doi:10.1007/978-3-642-04128-0_61. https://cmph.sourceforge.net/papers/esa09.pdf; Esposito, Emmanuel; Mueller Graf, Thomas; Vigna, Sebastiano. (2020). "RecSplit: Minimal Perfect Hashing via Recursive Splitting". 2020 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX), pp. 175–185. doi:10.1137/1.9781611976007.14. arXiv:1910.06416).
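The formal definition translates directly into a small check. The helper below, named `is_minimal_perfect` for this illustration, only tests a given function against a given key set. It does not construct a minimal perfect hash function.

```python
def is_minimal_perfect(h, keys) -> bool:
    """True if h is injective on keys and its image is a, a+1, ..., a+|S|-1."""
    values = sorted(h(key) for key in keys)
    if not values:
        return True
    a = values[0]
    # n sorted values equal to n consecutive integers implies injectivity as well
    return values == list(range(a, a + len(values)))

names = ["a", "bb", "dddd"]
positions = {name: i for i, name in enumerate(names)}
print(is_minimal_perfect(positions.__getitem__, names))  # True: image is 0, 1, 2
print(is_minimal_perfect(len, names))                    # False: image is 1, 2, 4
```

In the second call `len` is still a perfect hash function for the three names, because no two of them share a length, but it leaves a gap at 3 and is therefore not minimal.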

k-perfect hashing

A hash function is k-perfect if at most k elements from S are mapped onto the same value in the range. The "hash, displace, and compress" algorithm can be used to construct k-perfect hash functions by allowing up to k collisions. The changes necessary to accomplish this are minimal, and affect only steps (6) and (8) of the adapted pseudocode below:
(4) for all i ∈ [r], in the order from (2), do
(5)     for l ← 1, 2, ...
(6)         repeat forming K_i ← {Φ_l(x) | x ∈ B_i}
(6)         until |K_i| = |B_i| and K_i ∩ {j | T[j] = k} = ∅
(7)     let σ(i) := the successful l
(8)     for all j ∈ K_i set T[j] ← T[j] + 1
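The adapted steps can be sketched in Python as well. As before this is an illustration under stated assumptions: `build_k_perfect` and `k_perfect_hash` are hypothetical names, BLAKE2b with a seed prefix replaces the fully random hash functions, keys are distinct strings, no bucket may hold more than m keys, and the result is left uncompressed.

```python
import hashlib
from collections import Counter

def _hash(key: str, seed: int, modulus: int) -> int:
    """Stand-in for a fully random hash function selected by seed (an assumption)."""
    data = seed.to_bytes(8, "little") + key.encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % modulus

def build_k_perfect(keys, m: int, r: int, k: int):
    """Return sigma such that no value in range(m) receives more than k keys."""
    if k * m < len(keys):
        raise ValueError("k * m must be at least the number of keys")
    buckets = [[] for _ in range(r)]
    for x in keys:
        buckets[_hash(x, 0, r)].append(x)      # g is the hash with seed 0
    counts = [0] * m                           # T[j]: keys already mapped to j
    sigma = [0] * r
    for i in sorted(range(r), key=lambda i: len(buckets[i]), reverse=True):
        bucket = buckets[i]
        if not bucket:
            break                              # remaining buckets are empty
        l = 1
        while True:
            slots = {_hash(x, l, m) for x in bucket}
            # adapted step (6): reject only slots that already hold k keys
            if len(slots) == len(bucket) and all(counts[j] < k for j in slots):
                break
            l += 1
        sigma[i] = l
        for j in slots:                        # adapted step (8): count, not mark
            counts[j] += 1
    return sigma

def k_perfect_hash(x: str, sigma, m: int) -> int:
    return _hash(x, sigma[_hash(x, 0, len(sigma))], m)

keys = [f"key-{i}" for i in range(1000)]
m, r, k = 300, 250, 4                          # illustrative sizes
sigma = build_k_perfect(keys, m, r, k)
load = Counter(k_perfect_hash(x, sigma, m) for x in keys)
print(max(load.values()) <= k)                 # True
```

Compared with the perfect case, the range can be smaller than the number of keys, since each value may be shared by up to k of them.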

Order preservation

A minimal perfect hash function F is order preserving if keys are given in some order $a_1, a_2, \ldots, a_n$ and for any keys $a_j$ and $a_k$, j < k implies $F(a_j) < F(a_k)$. In this case, the function value is just the position of each key in the sorted ordering of all of the keys. A simple implementation of order-preserving minimal perfect hash functions with constant access time is to use an (ordinary) perfect hash function to store a lookup table of the positions of each key. This solution uses $O(n \log n)$ bits, which is optimal in the setting where the comparison function for the keys may be arbitrary. However, if the keys $a_1, a_2, \ldots, a_n$ are integers drawn from a universe $\{1, 2, \ldots, U\}$, then it is possible to construct an order-preserving hash function using only $O(n \log \log \log U)$ bits of space. Moreover, this bound is known to be optimal.
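The simple lookup-table implementation mentioned above might look like the following sketch. The names `seeded_hash` and `build_order_preserving`, the BLAKE2b-based hash, the table size of 2n and the brute-force seed search are assumptions made for illustration. The search is suitable only for small key sets, and the returned function is meaningful only for keys that belong to the set.

```python
import hashlib

def seeded_hash(key: str, seed: int, m: int) -> int:
    """Hypothetical seeded hash into range(m), built on BLAKE2b for determinism."""
    data = seed.to_bytes(8, "little") + key.encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % m

def build_order_preserving(keys):
    """keys arrive in the order to preserve; returns F with F(keys[j]) == j."""
    m = 2 * len(keys)  # size of the position table, an illustrative choice
    # ordinary perfect hash: first seed that is collision-free on the keys
    seed = next(s for s in range(1_000_000)
                if len({seeded_hash(k, s, m) for k in keys}) == len(keys))
    ranks = [None] * m
    for position, key in enumerate(keys):
        ranks[seeded_hash(key, seed, m)] = position   # about log2(n) bits per entry
    return lambda key: ranks[seeded_hash(key, seed, m)]

keys = ["apple", "banana", "cherry", "date", "fig"]
F = build_order_preserving(keys)
print([F(k) for k in keys])  # [0, 1, 2, 3, 4]
```

Each table entry stores a position between 0 and n − 1, which is where the O(n log n) bits of this solution come from.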

Related constructions

While well-dimensioned hash tables have amortized average O(1) time for lookups, insertions, and deletions, most hash table algorithms suffer from possible worst-case times that are much longer. A worst-case O(1) time (constant time even in the worst case) would be better for many applications, including network routers and memory caches (Davis, Timothy A. "Chapter 5 Hashing", subsection "Hash Tables with Worst-Case O(1) Access").

Few hash table algorithms support worst-case O(1) lookup time (constant lookup time even in the worst case). The few that do include: perfect hashing; dynamic perfect hashing; cuckoo hashing; hopscotch hashing; and extendible hashing.

A simple alternative to perfect hashing, which also allows dynamic updates, is cuckoo hashing. This scheme maps keys to two or more locations within a range (unlike perfect hashing which maps each key to a single location) but does so in such a way that the keys can be assigned one-to-one to locations to which they have been mapped. Lookups with this scheme are slower, because multiple locations must be checked, but nevertheless take constant worst-case time (Pagh, Rasmus; Rodler, Flemming Friche. (2004). "Cuckoo hashing". Journal of Algorithms 51 (2): 122–144. doi:10.1016/j.jalgor.2003.12.002).
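A minimal sketch of a cuckoo hash set with two tables shows where the constant worst-case lookup comes from. The class name `CuckooSet`, the BLAKE2b-based slot function, the `max_kicks` limit and the policy of doubling the capacity on failure are illustrative assumptions, not details taken from the cited paper. Keys are assumed to be strings.

```python
import hashlib

class CuckooSet:
    """Set with two tables; every key lives in one of its two candidate slots."""

    def __init__(self, capacity: int = 16, max_kicks: int = 32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.seed = 0
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, key: str, which: int) -> int:
        data = f"{self.seed}:{which}:{key}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.capacity

    def __contains__(self, key: str) -> bool:
        # at most two probes, regardless of how many keys are stored
        return any(self.tables[w][self._slot(key, w)] == key for w in (0, 1))

    def add(self, key: str) -> None:
        if key in self:
            return
        which = 0
        for _ in range(self.max_kicks):
            i = self._slot(key, which)
            # place the key and pick up whatever occupied the slot before
            key, self.tables[which][i] = self.tables[which][i], key
            if key is None:
                return
            which = 1 - which      # the evicted key moves to its other table
        self._rehash(key)          # too many evictions: rebuild with new hashes

    def _rehash(self, pending: str) -> None:
        items = [k for t in self.tables for k in t if k is not None] + [pending]
        self.capacity *= 2
        self.seed += 1
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k in items:
            self.add(k)

s = CuckooSet(capacity=4)
for i in range(100):
    s.add(f"key-{i}")
print("key-42" in s, "absent" in s)  # True False
```

Insertions may trigger a chain of evictions or a full rebuild, so only their expected cost is constant. The lookup never inspects more than two slots.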

References

(2001). "Efficient Minimal Perfect Hashing in Nearly Minimal Space". Springer Berlin Heidelberg.
"minimal-perfect-hash". GitHub. https://github.com/iwiwi/minimal-perfect-hash
Jenkins, Bob. (14 April 2009). "Dictionary of Algorithms and Data Structures". U.S. National Institute of Standards and Technology.
(July 1991). "Order-preserving minimal perfect hash functions and information retrieval". ACM.
(November 2008). "Theory and practice of monotone minimal perfect hashing". Journal of Experimental Algorithmics.
(January 2023). "Tight Bounds for Monotone Minimal Perfect Hashing". Society for Industrial and Applied Mathematics.

::callout[type=info title="Wikipedia Source"] This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page. ::