The PDB Serialized Hash Table Format | |
==================================== | |
.. contents:: | |
:local: | |
.. _hash_intro: | |
Introduction | |
============ | |
One of the design goals of the PDB format is to provide accelerated access to | |
debug information, and for this reason there are several occasions where hash | |
tables are serialized and embedded directly to the file, rather than requiring | |
a consumer to read a list of values and reconstruct the hash table on the fly. | |
The serialization format supports hash tables of arbitrarily large size and | |
capacity, as well as value types and hash functions. The only supported key | |
value type is a uint32. The only requirement is that the producer and consumer | |
agree on the hash function. As such, the hash function can is not discussed | |
further in this document, it is assumed that for a particular instance of a PDB | |
file hash table, the appropriate hash function is being used. | |
On-Disk Format | |
============== | |
.. code-block:: none | |
.--------------------.-- +0 | |
| Size | | |
.--------------------.-- +4 | |
| Capacity | | |
.--------------------.-- +8 | |
| Present Bit Vector | | |
.--------------------.-- +N | |
| Deleted Bit Vector | | |
.--------------------.-- +M ─╮ | |
| Key | │ | |
.--------------------.-- +M+4 │ | |
| Value | │ | |
.--------------------.-- +M+4+sizeof(Value) │ | |
... ├─ |Capacity| Bucket entries | |
.--------------------. │ | |
| Key | │ | |
.--------------------. │ | |
| Value | │ | |
.--------------------. ─╯ | |
- **Size** - The number of values contained in the hash table. | |
- **Capacity** - The number of buckets in the hash table. Producers should | |
maintain a load factor of no greater than ``2/3*Capacity+1``. | |
- **Present Bit Vector** - A serialized bit vector which contains information | |
about which buckets have valid values. If the bucket has a value, the | |
corresponding bit will be set, and if the bucket doesn't have a value (either | |
because the bucket is empty or because the value is a tombstone value) the bit | |
will be unset. | |
- **Deleted Bit Vector** - A serialized bit vector which contains information | |
about which buckets have tombstone values. If the entry in this bucket is | |
deleted, the bit will be set, otherwise it will be unset. | |
- **Keys and Values** - A list of ``Capacity`` hash buckets, where the first | |
entry is the key (always a uint32), and the second entry is the value. The | |
state of each bucket (valid, empty, deleted) can be determined by examining | |
the present and deleted bit vectors. | |
.. _hash_bit_vectors: | |
Present and Deleted Bit Vectors | |
=============================== | |
The bit vectors indicating the status of each bucket are serialized as follows: | |
.. code-block:: none | |
.--------------------.-- +0 | |
| Word Count | | |
.--------------------.-- +4 | |
| Word_0 | ─╮ | |
.--------------------.-- +8 │ | |
| Word_1 | │ | |
.--------------------.-- +12 ├─ |Word Count| values | |
... │ | |
.--------------------. │ | |
| Word_N | │ | |
.--------------------. ─╯ | |
The words, when viewed as a contiguous block of bytes, represent a bit vector with | |
the following layout: | |
.. code-block:: none | |
.------------. .------------.------------. | |
| Word_N | ... | Word_1 | Word_0 | | |
.------------. .------------.------------. | |
| | | | | | |
+N*32 +(N-1)*32 +64 +32 +0 | |
where the k'th bit of this bit vector represents the status of the k'th bucket | |
in the hash table. |