Version-Controlled Databases with Prolly Trees: A Practical Guide for Developers

Overview

Modern databases and filesystems rely heavily on B-trees for efficient sorted key-value storage on block devices. However, traditional B-trees are not inherently version-controlled. The Dolt project (Apache 2.0-licensed) is built on a variant called a Prolly tree, a probabilistic, content-addressed B-tree that originated in the Noms project and enables full version control for an entire database. This guide explains the core concepts, walks through implementation ideas step by step, and lists common pitfalls to help you understand and apply Prolly trees in your own projects.

Prerequisites

Basic understanding of tree data structures (binary trees, B-trees)
Familiarity with version control concepts (commits, branches, merges) as used in Git
Some experience with key-value stores and simple hashing (SHA-256)
Ability to read pseudocode or simple Python-like code

Step-by-Step Implementation Guide

Step 1: Understand the B‑Tree Foundation

Start with a standard B‑tree that stores sorted keys and values. Each internal node holds separator keys and pointers to its children, and leaf nodes hold the actual key-value pairs. The height stays at O(log_m n), where n is the number of keys and m is the branching factor. This is efficient for disk I/O, but every update modifies the tree in place, so no history is preserved.
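To get a feel for how shallow such a tree is, the sketch below computes the number of levels needed to hold n keys at branching factor m. The function name `btree_height` is made up for illustration, and the calculation assumes every node is completely full.

```python
def btree_height(n_keys: int, branching: int) -> int:
    """Smallest h with branching**h >= n_keys (assumes completely full nodes)."""
    if n_keys < 1 or branching < 2:
        raise ValueError("need n_keys >= 1 and branching >= 2")
    height, capacity = 1, branching
    while capacity < n_keys:
        capacity *= branching
        height += 1
    return height


print(btree_height(1_000_000, 100))  # 3
print(btree_height(1_000_000, 2))    # 20
```

With a fan-out of 100, a million keys sit only three levels deep, which is why B‑tree lookups need so few disk reads.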

Step 2: Introduce Content Addressing

Instead of modifying nodes in place, compute a cryptographic hash (e.g., SHA‑256) of each node’s content. The node is now addressed by its hash (content-addressed storage), and a parent refers to its children by their hashes. When you update a key, you create new nodes along the path from root to leaf, but you reuse unchanged subtrees by pointing to their existing hashes. Because a parent’s content includes its children’s hashes, any change propagates upward and yields a new root hash. This is structural sharing, similar to Git’s model for files.

Pseudocode for update:

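def update(node, key, value):
    # Returns a NEW node; the input node is never modified.
    if node.is_leaf:
        # Copy the leaf with the key inserted or overwritten.
        return node.copy_with_new_kv(key, value)
    # Descend into the single child whose key range covers `key`.
    child_index = find_child(node, key)
    new_child = update(node.children[child_index], key, value)
    # Copy this node with the new child swapped in; other children are reused.
    return node.copy_with_updated_child(child_index, new_child)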

Each returned node includes its hash. Only the affected path is new; all other subtrees are shared by reference.
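The sharing can be made concrete with a tiny in-memory node store. This is a minimal sketch, not how any particular database serializes its nodes: the `store` dictionary, the `put`/`get` helpers, and the JSON node layout are all assumptions chosen for brevity.

```python
import hashlib
import json

store = {}  # hash (hex string) -> serialized node bytes


def put(node: dict) -> str:
    """Serialize a node canonically, store it under its SHA-256, return the hash."""
    data = json.dumps(node, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest


def get(digest: str) -> dict:
    """Load a node by its hash."""
    return json.loads(store[digest])


# Version 1: two leaves and a root that refers to them by hash.
left = put({"leaf": True, "entries": [["a", 1], ["b", 2]]})
right = put({"leaf": True, "entries": [["m", 3], ["z", 4]]})
root_v1 = put({"leaf": False, "children": [left, right]})

# Version 2: update key "z". Only the right leaf and the root are rewritten.
new_right = put({"leaf": True, "entries": [["m", 3], ["z", 99]]})
root_v2 = put({"leaf": False, "children": [left, new_right]})

# The left leaf is shared between both versions.
print(get(root_v1)["children"][0] == get(root_v2)["children"][0])  # True
print(len(store))  # 5
```

Two complete versions need only five stored nodes, because the untouched left leaf is written once and referenced from both roots.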

Step 3: Make It Probabilistically Balanced (Prolly)

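In a pure B‑tree, we enforce strict invariants (e.g., every node must be at least half full) through explicit splits and merges, and the resulting node boundaries depend on the order of insertions. Prolly trees relax this with a content-defined rule: whether a node ends at a given point is decided by a hash of the content. The outcome looks random, so node sizes follow a predictable distribution and the tree stays balanced with high probability, yet the rule is fully deterministic: the same set of key-value pairs always produces the same tree, and therefore the same root hash, regardless of the order in which edits were applied. This eliminates the need for explicit rebalancing traversals, and the shape cannot be degraded by an unlucky or adversarial insertion order.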

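Key idea: The split decision is derived from a hash of the content, typically the hash of an individual entry or of a rolling window over the serialized entries. If the hash falls below a threshold, the current node ends there and a new one begins. The threshold sets the split probability and therefore the average node size, so the tree stays roughly balanced no matter how many insertions it has seen.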
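A simplified version of this boundary rule fits in a few lines. The sketch below hashes each key and closes the current leaf whenever the hash falls below a threshold; `is_boundary`, `chunk`, and the target size of 4 are illustrative assumptions. Real implementations typically use more elaborate rules, such as rolling hashes or size-aware thresholds.

```python
import hashlib


def is_boundary(key: str, target_size: int = 4) -> bool:
    """True with probability ~1/target_size, decided only by the key's hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big")
    return value < (2**32) // target_size


def chunk(sorted_keys: list[str], target_size: int = 4) -> list[list[str]]:
    """Group sorted keys into leaf nodes, closing a node after each boundary key."""
    chunks, current = [], []
    for key in sorted_keys:
        current.append(key)
        if is_boundary(key, target_size):
            chunks.append(current)
            current = []
    if current:  # trailing entries that did not hit a boundary
        chunks.append(current)
    return chunks


keys = [f"key{i:03d}" for i in range(40)]
before = chunk(keys)
after = chunk(sorted(keys + ["key017x"]))

# Inserting one key disturbs at most the chunk around it; the rest are identical.
unchanged = [c for c in before if c in after]
print(len(before), len(after), len(unchanged))
```

Because the boundaries depend only on the keys themselves, building the tree from scratch or by a series of edits gives the same chunks, and an insert leaves every chunk except the one it lands in untouched.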

Step 4: Build a Version Graph

Every update produces a new root hash. Keep a data structure (e.g., a commit object) that stores the root hash, the parent commit hash (two parents for a merge commit), a timestamp, and a message. Each commit points to a specific version of the entire database, and together the commits form a directed acyclic graph (DAG) of versions.

Commit = pointer to a Prolly tree root + metadata
Branch = moving reference to a commit
Clone = copy all reachable nodes
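These definitions translate almost directly into code. The following is a minimal sketch with hypothetical names (`Commit`, `commits`, `branches`, `commit`); it keeps a single parent per commit for brevity, whereas merge commits would need a list of parents.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    root_hash: str          # hash of the Prolly tree root for this version
    parent: Optional[str]   # hash of the parent commit, None for the first one
    timestamp: float
    message: str

    def digest(self) -> str:
        """Content address of the commit itself."""
        data = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(data).hexdigest()


commits = {}   # commit hash -> Commit
branches = {}  # branch name -> commit hash


def commit(branch: str, root_hash: str, message: str, timestamp: float) -> str:
    """Record a new commit on top of the branch tip and advance the branch."""
    c = Commit(root_hash, branches.get(branch), timestamp, message)
    commits[c.digest()] = c
    branches[branch] = c.digest()
    return c.digest()


first = commit("main", "root-hash-1", "initial import", 1.0)
second = commit("main", "root-hash-2", "update row", 2.0)
branches["feature"] = branches["main"]  # creating a branch copies one pointer

print(commits[second].parent == first)  # True
```

Creating a branch costs a single dictionary entry, since the data it refers to is already stored under the commit's root hash.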

Step 5: Diff Two Versions

To diff commit A and commit B, recursively compare their trees by hash, starting at the root nodes. If two hashes are equal, the subtrees are identical and can be skipped entirely. Otherwise, walk down to find the differing leaf entries and classify them as added, removed, or changed key-value pairs. The work is roughly proportional to the size of the change rather than the size of the database, which is much faster than comparing every record.
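One way to sketch this is a level-by-level walk that drops any node hash present in both trees before descending. The `store` layout and the `put`, `expand`, and `diff` helpers are assumptions made for this example; a production diff would more likely walk both trees in key order with cursors.

```python
import hashlib
import json

store = {}  # hash -> node


def put(node: dict) -> str:
    data = json.dumps(node, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = node
    return digest


def expand(hashes: set, entries: dict) -> set:
    """Collect entries from leaves and return the children of internal nodes."""
    children = set()
    for h in hashes:
        node = store[h]
        if node["leaf"]:
            entries.update(node["entries"])
        else:
            children.update(node["children"])
    return children


def diff(root_a: str, root_b: str):
    """Return (added, removed, changed) going from version A to version B."""
    frontier_a, frontier_b = {root_a}, {root_b}
    entries_a, entries_b = {}, {}
    while frontier_a or frontier_b:
        shared = frontier_a & frontier_b  # identical subtrees: skip them
        frontier_a = expand(frontier_a - shared, entries_a)
        frontier_b = expand(frontier_b - shared, entries_b)
    added = {k: entries_b[k] for k in entries_b.keys() - entries_a.keys()}
    removed = {k: entries_a[k] for k in entries_a.keys() - entries_b.keys()}
    changed = {k: (entries_a[k], entries_b[k])
               for k in entries_a.keys() & entries_b.keys()
               if entries_a[k] != entries_b[k]}
    return added, removed, changed


leaf1 = put({"leaf": True, "entries": {"a": 1, "b": 2}})
leaf2 = put({"leaf": True, "entries": {"m": 3, "z": 4}})
leaf2b = put({"leaf": True, "entries": {"m": 3, "q": 7, "z": 5}})
root_a = put({"leaf": False, "children": [leaf1, leaf2]})
root_b = put({"leaf": False, "children": [leaf1, leaf2b]})

print(diff(root_a, root_b))  # ({'q': 7}, {}, {'z': (4, 5)})
```

The shared leaf `leaf1` is never opened: its hash appears on both sides, so its entries are known to be equal without reading them.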

Step 6: Implement Three‑Way Merge

When merging two branches (base + two diverged versions), you need to reconcile changes. Use the common ancestor (base) commit. For each key:

If only one branch changed it, take that change.
If both changed it to the same value, take it.
If both changed it differently, mark as conflict.

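Finding the base version is cheap: follow the commit DAG back to the nearest common ancestor of the two branches. The Prolly tree’s hash structure then keeps the merge itself cheap, because only keys that differ from the base need to be examined.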
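The per-key rules can be sketched on plain dictionaries. This is a simplified illustration: the function name `three_way_merge` and the `_MISSING` sentinel are assumptions, and a real implementation would visit only the keys reported by diffing each branch against the base instead of scanning every key.

```python
_MISSING = object()  # marks a key that is absent from a version


def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two versions against their common ancestor.

    Returns (merged, conflicts), where conflicts lists the keys that both
    sides changed in different ways.
    """
    merged, conflicts = {}, []
    for key in sorted(base.keys() | ours.keys() | theirs.keys()):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:        # both sides agree (including both deleting the key)
            result = o
        elif o == b:      # only theirs changed it
            result = t
        elif t == b:      # only ours changed it
            result = o
        else:             # both changed it differently
            conflicts.append(key)
            result = o    # keep ours as a placeholder until resolved
        if result is not _MISSING:
            merged[key] = result
    return merged, conflicts


base = {"a": 1, "b": 2, "c": 3}
ours = {"a": 10, "b": 2, "c": 4}
theirs = {"a": 1, "b": 20, "c": 5, "d": 6}

print(three_way_merge(base, ours, theirs))
# ({'a': 10, 'b': 20, 'c': 4, 'd': 6}, ['c'])
```

Keeping "ours" for conflicting keys is only a placeholder policy; the important output is the conflict list, which the caller has to resolve.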

Common Mistakes

Mistake 1: Ignoring Hash Collisions

With a weak or heavily truncated hash, two different nodes could collide, and the store would silently treat different subtrees as the same one. Always use a well-established cryptographic hash such as SHA‑256 with sufficient bit length.

Mistake 2: Tuning Node Size Poorly

Nodes that are too small lead to many tree levels; nodes that are too large waste storage, because every small change rewrites an entire node. Typical B‑tree fan‑outs (e.g., hundreds of entries) work well. For Prolly trees, the probabilistic split threshold should be chosen so that average node sizes resemble those of a B‑tree.
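The relationship between threshold and node size can be checked empirically. In this sketch, which assumes a simple per-key boundary rule with split probability 1/target, the average node size should land near the target; the helper `boundary_count` and the synthetic keys are made up for the experiment.

```python
import hashlib


def boundary_count(n_keys: int, target_size: int) -> int:
    """Count keys whose hash falls below the 1/target_size threshold."""
    threshold = (2**32) // target_size
    count = 0
    for i in range(n_keys):
        digest = hashlib.sha256(f"key{i}".encode("utf-8")).digest()
        if int.from_bytes(digest[:4], "big") < threshold:
            count += 1
    return count


n = 100_000
for target in (16, 64, 256):
    nodes = boundary_count(n, target)
    # Average entries per node; expected to be close to `target`.
    print(target, round(n / nodes, 1))
```

Individual node sizes still vary widely around that average, which is one reason an implementation may add minimum and maximum size limits on top of the threshold.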

Mistake 3: Storing Node Content Separately

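If you store a node’s content and its hash in different places and maintain them independently, the two can drift apart and you break the content‑addressable invariant. Always derive the hash from the serialized content itself, and make sure every node is retrievable solely by its hash.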
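A simple way to protect this invariant is to compute the hash inside the store on write and to re-check it on read. The `ChunkStore` class below is a hypothetical in-memory sketch of that idea, not the API of any existing library.

```python
import hashlib


class ChunkStore:
    """In-memory content-addressed store: the key is always sha256(content)."""

    def __init__(self):
        self._chunks = {}

    def put(self, data: bytes) -> str:
        # The caller never supplies the hash; it is derived from the bytes.
        digest = hashlib.sha256(data).hexdigest()
        self._chunks[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        data = self._chunks[digest]
        # Re-hash on read so that corruption is detected instead of propagated.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"corrupt chunk {digest}")
        return data


chunks = ChunkStore()
address = chunks.put(b"serialized node bytes")
print(chunks.get(address) == b"serialized node bytes")  # True
```

Because `put` is the only way to obtain an address, a hash can never refer to content other than the bytes it was computed from.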

Mistake 4: Not Handling Large Keys/Values

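If values are huge, storing them directly in leaf nodes is inefficient, because every change to a neighboring entry rewrites them along with the rest of the leaf. Consider storing large values in a separate blob store and keeping only their hash (or another reference) in the leaf.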
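As a rough illustration, the leaf entry can carry either the value itself or a reference to it. Everything here is an assumption for the example: the 64-byte `INLINE_LIMIT`, the `blobs` dictionary standing in for a blob store, and the `encode_value`/`decode_value` helpers.

```python
import hashlib

INLINE_LIMIT = 64  # bytes; hypothetical cutoff, tune for your workload
blobs = {}         # stand-in for a separate content-addressed blob store


def encode_value(value: bytes) -> tuple:
    """Return the entry to place in the leaf: the value or a reference to it."""
    if len(value) <= INLINE_LIMIT:
        return ("inline", value)
    digest = hashlib.sha256(value).hexdigest()
    blobs[digest] = value
    return ("blob", digest)


def decode_value(entry: tuple) -> bytes:
    """Resolve a leaf entry back to the original value."""
    kind, payload = entry
    return payload if kind == "inline" else blobs[payload]


small = encode_value(b"short value")
large = encode_value(b"x" * 10_000)

print(small[0], large[0])                   # inline blob
print(len(decode_value(large)))             # 10000
```

The leaf stays small regardless of value size, and an unchanged large value is stored once even if the entries around it are rewritten many times.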

Summary

Prolly trees combine the efficiency of B‑trees with cryptographic content addressing and probabilistic balancing to create a version-controlled database structure. By understanding B‑tree foundations, adding content addressing, and building a commit graph, you can implement a system that supports diffing, branching, and merging, as Dolt does. Avoiding common pitfalls such as hash collisions and poor node sizing helps keep your implementation both correct and performant.