About Hash Algorithms

An introduction of Hash function

Introduction

  • a set of N items

    • we can sort them with binary search in O(log N) time using a structure(i.e, an array) of O(N) space .

    • while a hashing technique can consume O(N) space, and answers a dictionary query in O(1) expected time.

  • We can reference key-value pairs using arrays by doing arithmetic operations to transform keys into array indices. Thus there are two steps

    • compute a hash function that transfers the search key into an array index

    • collision-resolution process that deals with two or more different keys (may hash to the same array index)

hash functions

  • hashing

    • the hash function h maps the university U of keys into the $T[0...m-1]$

      • $h : U -> T = {0, 1, ..., m-1}$
    • uniform hashing assumption: assume that any given element in U is equally likely to hash into any of the m slots, independently of where any other element has hashed to .

      • For any diff integers $k_1$ and $k_2$, $Pr(h(k_1)=h(k_2)) \leq 1/m$
  • a good hash function

    • deterministic: equal keys produce the same hash value
    • efficient to compute
    • uniformly distribute the keys among the integer values [0, m-1]
      • each key is equally likely to be hashed to one of m indices
      • The reason for this requirement is that the cost of hashing-based methods goes up sharply as the number of collisions-pairs of inputs that are mapped to the same hash value- increases. If some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries. [2]
      • [Note]: the random is uniform
  • hashing by division h(k) = k mod m

    • map a key k into one of m slots

    • if we know that the keys are random real numbers k independently and uniformly distributed in the range $0 \leq k < 1$ then, the hash function: $h(k) = floor(k_m)$

    • the division method can give good results, assumimg that a prime number k that is unrelated to any patterns in the distribution of keys .

  • hashing by multiplication

  • universal hashing

    • TODO

for the collision-resolution process

  • collision: two keys may hash to the same slot

  • hashing with separate chaining (link structure)

    • linked list of the key-value pairs whose keys hash to that index

    • the basic idea is to choose m to be sufficiently large that the lists are sufficiently short to enable efficient search through a two-step process

  • hashing with linear probing

    • open-addressing: store N key-value pairs in a hash table of size m>N

    • the idea is that rather than using memory space for references in linked lists, we use it for the empty entries in the hash table, which mark the ends of probe sequences

hashtables vs other dictionary implementations

  • choosing a good capacity m

    • We can choose the table size m to be sufficiently small that we do not waste a huge area of contiguous memory with empty chains but sufficiently large that we donot waste time searching through long chains.
  • hashtables are sometimes good because

    • the keys do not have to come from an ordered type
    • in practice we can set things up for constant-time performance
  • hashtables aren't always the best choice because

    • you cannot easily read things out in a sorted order, assuming these was such an order
    • you can't really get a good hash function that be can computed quickly, so they're usally bad for small dictionaries
    • resizing is very expensive, and probably can't be used in most real-time systems

reference

  • Juraj Hromkovic - ch3 "Design and Analysis of Randomized Algorithms"
  • Robert Sedgewick and Kevin Wayne - ch3.4, "Algorithms"
  • hashtable
  • hash function