Hashing¶
Interpolation Search
Find key
in a sorted list list[l].key
, list[l+1].key
, ..., list[u].key
.
- 如果
key
和下标的关系是线性的
- 非线性
Seach by Formula
General Idea¶
Symbol Table: <name, attribute>
,相当于字典
Oxford Dictionary
name
:word
attribute
:definition
,pronunciation
,part of speech
,example
Compiler
name
: identifierattribute
: type, scope, address
Symbol Table ADT¶
- Objects: A set of name-attribute pairs, where the names are unique.
- Operations:
Attribute Find(name)
: Find the attribute of a name.
Hash Table¶
每一行都是一个桶,列数就是桶的槽数
For each identifier \(x\), we define a hash function \(f(x)\).
$f(x) = $ position of \(x\) in ht[]
(i.e. the index of the bucket that contains \(x\))
- $T = $ total number of distinct possible values of \(x\)
- $n = $ total number of identifiers in
ht[]
- identifier density \(= n / T\)
- loading density \(\lambda = n / (sb)\)
Collision
A collision occurs when we hash two different identifiers into the same bucket, i.e. \(f(i_1) = f(i_2)\) when \(i_1 \neq i_2\).
Overflow
An overflow occurs when we hash a new identifier into a full bucket.
Example
Mapping \(n = 10\) C library functions into a hash table ht[]
with \(b = 26\) buckets and \(s = 2\) slots per bucket.
\(\quad\) | Slot 0 | Slot 1 |
---|---|---|
0 | \(\quad\) | \(\quad\) |
1 | \(\quad\) | \(\quad\) |
2 | \(\quad\) | \(\quad\) |
3 | \(\quad\) | \(\quad\) |
4 | \(\quad\) | \(\quad\) |
5 | \(\quad\) | \(\quad\) |
6 | \(\quad\) | \(\quad\) |
\(\ldots\) | \(\quad\) | \(\quad\) |
25 | \(\quad\) | \(\quad\) |
Loading density \(\lambda = 10 / (26 \times 2) = 0.19\).
To map the letter a ~ z to \(0\) ~ \(25\), we may define $f(x) = $ x[0] - 'a'
.
ceil
,clock
,ctime
...- Overflow!!
Without overflow,
Hash Function¶
Properties of a good hash function \(f\):
- \(f\) must be easy to compute and minimizes the number of collisions.
- \(f\) should be unbiased. That is, for \(\forall x,i\), we have \(P(f(x) = i) = 1/b\). Such kind of a hash function is called a uniform hash function.
密码学中的哈希算法,用来作文件完整性校验
- MD5(已被破解)
- SHA-1
- SHA-256
\(f(x) = x \mod \text{TableSize}\) if \(x\) is an integer.
TableSize = 10? Bad!
\(f(x) = (\sum_i \text{ASCII}(x[i]) \mod \text{TableSize})\) if \(x\) is a string.
TableSize = 10007 and string length \(\leq\) 8. If \(x[i] \in [0, 127]\), then \(f(x) \in [0, 127 \times 8]\).
but \(T = 128^8\). Many collisions!
\(f(x) = (x[0] + x[1] \times 27 + x[2] \times 27^2) \mod \text{TableSize}\) if \(x\) is a string.
\(f(x) = (\sum x[N - i - 1] \times 32^i) \mod \text{TableSize}\)
选 \(32\) 是因为在计算机中好表示,左移 \(5\) 位即可
Index Hash3(const char *x, int TableSize)
{
unsigned int HashVal = 0;
while (*x != '\0')
{
HashVal = (HashVal << 5) + *x++;
}
return HashVal % TableSize;
}
If \(x\) is too long (i.e. street address), the early characters will be left-shifted out of place.
- Carefully select some characters
Separate Chaining¶
keep a list of all keys that hash to the same value
- Create an empty table
HashTable InitTable(int TableSize)
{
HashTable H;
int i;
if (TableSize < MinTableSize)
{
Error("Table size too small");
return NULL;
}
H = malloc(sizeof(struct HashTbl)); // allocate table
if (H == NULL)
FatalError("Out of space!!!");
H->TableSize = NextPrime(TableSize);
H->TheLists = malloc(sizeof(List) * H->TableSize); // array of lists
if (H->TheLists == NULL)
FatalError("Out of space!!!");
for (i = 0; i < H->TableSize; i++)
H->TheLists[i] = NULL; // initialize all lists to NULL
return H;
}
- Find a key from a hash table
Position Find(ElementType Key, HashTable H)
{
Position P;
List L;
L = H->TheLists[Hash(Key, H->TableSize)];
P = L->Next;
while (P != NULL && P->Element != Key) // probably need `strcmp`
P = P->Next;
return P;
}
- Insert a key into a hash table
void Insert(ElementType Key, HashTable H)
{
Position Pos, NewCell;
List L;
Pos = Find(Key, H);
if (Pos == NULL) // not found
{
NewCell = malloc(sizeof(struct ListNode));
if (NewCell == NULL)
FatalError("Out of space!!!");
else
{
L = H->TheLists[Hash(Key, H->TableSize)]; // 重复计算!!
NewCell->Next = L->Next;
NewCell->Element = Key; // probably need `strcpy`
L->Next = NewCell;
}
}
}
Tip
Make the TableSize \(\approx\) the number of distinct keys. \(\lambda \approx 1\).
Open Addressing¶
开放寻址法:find another empty cell to solve collision (avoiding pointers)
/* Algorithm: insert key into an array of hash table */
{
index = hash(key);
initialize i = 0;
while (collision at index)
{
index = (hash(key) + f(i)) % TableSize; // f(i) is the collision resolving function
if (table is full)
break;
else
i++;
}
if (table is full)
ERROR("Table is full");
else
insert key at index;
}
Tip: Loading density \(\lambda < 0.5\).
关键:\(f(i)\) 的设计
Linear Probing¶
\(f(i) = i\).
Analysis
Analysis of linear probing show that the expected number of probes:
Worst case can be LARGE
Primary clustering
Any key that hashes into the cluster will add to the cluster after several attempts to resolve the collision.
Quadratic Probing¶
\(f(i) = i^2\).
Theorem
If quadratic probing is used, and the table size is a prime number, then a new element can always be inserted into the table as long as the loading density \(\lambda < 0.5\).
Proof.
We only need to show that the first \(\lfloor \text{TableSize} / 2 \rfloor\) alternative locations are distinct. That is, for \(\forall \, 0 < i \neq j \leq \lfloor \text{TableSize} / 2 \rfloor\), we have
Proof by contradiction.
Assume that
then
which implies that
Since \(\text{TableSize}\) is prime,
The first case implies \(i = j\), which contradicts the assumption. The second case implies \(i + j = \text{TableSize}\), which is impossible since \(i, j \leq \lfloor \text{TableSize} / 2 \rfloor\). Contradiction!
Note
If the table size is a prime number of the form \(4k + 3\), then quadratic probing \(f(i) = \pm i^2\) can probe the entire table.
Position Find(ElementType Key, HashTable H)
{
Position CurrPos;
int Collisions = 0;
CurrPos = Hash(Key, H->TableSize);
while (H->Cells[CurrPos].Info != Empty && H->Cells[CurrPos].Element != Key) // (1)
{
CurrPos += 2 * ++Collisions - 1; // Quadratic probing: f(i) = f(i-1) + 2i - 1
/* *2 is fast for computers, just a bit shift */
if (CurrPos >= H->TableSize)
CurrPos -= H->TableSize; // faster than mod
}
return CurrPos;
}
- 交换两个条件的顺序,会增加执行次数。
Cells
是结构体数组,成员有Info
和Element
。
void Insert(ElementType Key, HashTable H)
{
Position Pos;
Pos = Find(Key, H);
if (H->Cells[Pos].Info != Legitimate) // (1)
{
H->Cells[Pos].Info = Legitimate;
H->Cells[Pos].Element = Key; // probably need `strcpy`
}
else
Error("Duplicate key");
}
- 为什么是
Legitimate
而不是Empty
? 如果某个 cell 被删除变成empty
,会破坏连续性,之后二次探测如果途径这个 cell 就会停下来,后面的 cell 就无法找不到了。
所以 Info
需要有三个状态:
Empty
: 该 cell 还没有被使用Legitimate
: 该 cell 已经被使用Deleted
: 该 cell 已经被删除
使用枚举变量类型来实现
Double Hashing¶
\(f(i) = i \cdot h_2(x)\), where \(h_2(x)\) is a secondary hash function.
Tip
\(h_2(x) = R - (x \mod R)\), where \(R\) is a prime number and \(R < \text{TableSize}\), works well.
Note
- If double hashing is correctly implemented, simulations imply that the expected number of probes is almost the same as that for a random collision resolution strategy.
- Quadratic probing does not require the use of a secondary hash function. Thus, it is likely to be faster and simpler in practice.
Rehashing¶
- Build another table that is about twice as large as the original table.
- Scan down the entire original table
- Use a new hash function to rehash each non-deleted key into the new table
When to rehash?
- As soon as the table is half full.
- When an insertion fails.
- When the loading density \(\lambda\) exceeds a certain threshold.
Note
Usually there should have been \(N/2\) insertions before rehashing, so \(\mathcal{O}(N)\) rehashing time only adds a constant cost to each insertion.
However, in an interactive system, the unfortunate user whose insertion caused a rehash could see a slowdown.