Principles of operation¶
DistKV relies on the fact that on most KV storage systems, any given record is rarely (if ever) changed by more than one entity at the same time. Thus, a simple gossip protocol is sufficient for distributing data.
To recover from missed changes, each node in a DistKV network maintains a
change counter (“tick”). All data records (distkv.model.Entry
) are
tagged with a chain of events (distkv.model.NodeEvent
), consisting
of the n
most recent (node, tick)
values which changed this
entry. Nodes do not appear in a chain more than once. Dropped ticks
are added to a per-node list of “known”(-to-have-been-superseded) counter
values.
The maximum chain length is determined by the number of partitions a DistKV network might split into. Thus the network guarantees that it is possible which side of a split modified a record when the split is healed.
If both sides did, the conflict is resolved deterministically. TODO: when this happens, send a notification to clients.
After a network split, a four-step protocol re-synchronizes the participants:
- broadcast the current counters
- broadcast known-value and known-deleted lists
- broadcast a list of missing node events
- broadcast the missed data
DistKV does not have a master node, much less a consensus-based election
system (Raft, Paxos, …). Instead, DistKV uses an asyncserf Actor
to
compile a short list of available servers that’s broadcast every few
seconds.
When a partitioned network is re-joined, the current housekeepers are responsible for driving and monitoring the re-sync protocol.
Storage¶
DistKV is intended to be used in a mostly-RAM architecture. There is no disk-based storage backend; snapshots and event logs are used to restore a system, if necessary. Feeding old snapshots to a running system is mostly benign, but see below.
DistKV is based on the gossip system provided by Hashicorp’s Serf. It supports all data types that can be transmitted by MsgPack <https://github.com/msgpack/msgpack/blob/master/spec.md>.
TODO: MsgPack has extension types, so constructing Python objects is possible.
Record Deletion¶
Deleting data records is when DistKV’s synchronization protocol breaks down, because DistKV can’t attach chains to records which no longer exist.
DistKV fixes this by keeping a separate record of deleted entries, or rather their chain links. This works well for mostly-static storages but becomes a problem on more dynamic systems.
Thus, periodic clean-up is required. This is achieved by creating a separate “Delete” Actor group which contains every system with persistent storage plus one system per network that’s not already covered.
When every node of this group is online, they periodically broadcast a
tuple of tock
values: one which signals that deletions with earlier
tock
s may safely be flushed, and a high-water limit for the next
round.
A node that receives this tuple compares the received first value with the last transmission’s second. If it’s higher, deletions may have been missed, most likely due to a network outage between that node and the closest Delete member. Since the records are now gone, the node will connect to one of the Delete group members and send a list of each entry’s last-change chain links. The recipient will re-broadcast any misses as “new” deletions.