Introduction

Distributed, content-addressed version control for large ML artifacts

No cloud bills, no central bottlenecks. Local-first, protocol-first, peer-to-peer.


Shard is a protocol-first, local-first, peer-to-peer version control system for ML artifacts — models, datasets, checkpoints. It provides Git-like ergonomics, content-addressed storage, signed commits, and direct P2P transfers.

Why Shard?

Machine learning artifacts are often too large for traditional version control systems like Git, so developers fall back on cloud object storage (S3, GCS) or Git LFS, which still requires a centralized server.

Shard solves this by providing:

  • True Peer-to-Peer Distribution: No central server required. Discover peers and sync artifacts directly.
  • Content-Addressed Storage: Built-in deduplication through fixed-size or Rabin content-defined chunking.
  • Git-like Ergonomics: Familiar CLI commands (add, commit, checkout, log).
  • Cryptographic Provenance: Every commit is signed using ed25519 keys for verifiable history.
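
To illustrate why content-defined chunking helps deduplication, here is a minimal sketch that compares it with fixed-size chunking after a one-byte insertion. It is not Shard's chunker: a simple polynomial rolling hash stands in for Rabin fingerprinting, BLAKE2b from the Python standard library stands in for Blake3, and the window, mask and size limits are hypothetical values chosen for the demo.

```python
import hashlib
import random


def digest(chunk: bytes) -> str:
    # Stand-in hash: BLAKE2b from the standard library (Shard itself uses Blake3).
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
               min_size: int = 16, max_size: int = 256) -> list:
    """Cut where a rolling hash of the last `window` bytes matches a bit pattern."""
    base, mod = 257, (1 << 31) - 1
    drop = pow(base, window, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * base + byte + 1) % mod
        if i >= window:
            h = (h - (data[i - window] + 1) * drop) % mod
        length = i - start + 1
        # Cut on a hash match (once min_size is reached) or when max_size forces it.
        if (length >= min_size and (h & mask) == mask) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(old: list, new: list) -> int:
    """Number of distinct chunks in `new` that are already stored for `old`."""
    return len({digest(c) for c in old} & {digest(c) for c in new})


rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(16384))
v2 = b"\x00" + v1  # same artifact with one byte inserted at the front

print("fixed:", shared(fixed_chunks(v1), fixed_chunks(v2)), "chunks reused")
print("cdc:  ", shared(cdc_chunks(v1), cdc_chunks(v2)), "chunks reused")
```

With fixed-size chunks, the insertion shifts every boundary, so almost nothing is reused. With content-defined chunks, boundaries follow the data and realign shortly after the edit.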

Comparison

| Feature | Git | Shard |
| --- | --- | --- |
| Primary use | Source code | ML artifacts (models, datasets, checkpoints) |
| Chunking | None (whole-file blobs, delta-compressed in packfiles) | Rabin or fixed-size (configurable) |
| Compression | zlib (default) | Zstd or Zlib (runtime selectable) |
| Hashing | SHA-1 / SHA-256 | Blake3 |
| P2P | Remote-centric | Native P2P (mDNS, Kademlia, Gossipsub) |
| Storage backend | Flat files + packfiles | Sled or SQLite indexed store |
| Signing | GPG (optional) | ed25519 (built-in, every commit) |
| Transport | SSH/HTTPS | libp2p TCP + Noise + Yamux |

Performance

Shard is designed for large artifacts (100 MB – 100 GB). Key performance characteristics:

  • Chunking throughput: ~1 GB/s (fixed), ~500 MB/s (Rabin CDC)
  • Compression: Zstd level 3 — ~500 MB/s compress, ~2 GB/s decompress
  • Parallel pulls: chunks are requested concurrently, with the aim of saturating available bandwidth
  • Memory: bounded by configurable concurrency cap, not artifact size
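
The last point can be pictured with a small sketch: if at most N chunk payloads are in flight at once, peak memory scales with N times the chunk size rather than with the artifact size. The function and parameter names below (`pull_chunks`, `fetch`, `sink`, `max_in_flight`) are hypothetical and are not taken from Shard's code.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def pull_chunks(hashes, fetch, sink, max_in_flight=4):
    """Fetch chunks concurrently while holding at most `max_in_flight` payloads."""
    slots = threading.BoundedSemaphore(max_in_flight)

    def worker(chunk_hash):
        try:
            sink(chunk_hash, fetch(chunk_hash))  # hand the payload off immediately
        finally:
            slots.release()

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = []
        for chunk_hash in hashes:
            slots.acquire()  # blocks while the cap is reached
            futures.append(pool.submit(worker, chunk_hash))
        for future in futures:
            future.result()  # re-raise any worker error


# Demo with a fake peer that records how many fetches overlap.
lock = threading.Lock()
in_flight = {"now": 0, "max": 0}
received = {}


def fake_fetch(chunk_hash):
    with lock:
        in_flight["now"] += 1
        in_flight["max"] = max(in_flight["max"], in_flight["now"])
    time.sleep(0.001)  # simulated network latency
    with lock:
        in_flight["now"] -= 1
    return chunk_hash.encode()


def store(chunk_hash, payload):
    with lock:
        received[chunk_hash] = payload


pull_chunks([f"chunk-{i}" for i in range(20)], fake_fetch, store, max_in_flight=3)
print(len(received), "chunks,", "peak in flight:", in_flight["max"])
```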

Getting Started

Install

Linux & macOS (one-liner)

curl -fsSL https://raw.githubusercontent.com/sandy-sachin7/shard/main/scripts/install.sh | bash

Windows (PowerShell)

irm https://raw.githubusercontent.com/sandy-sachin7/shard/main/scripts/install.ps1 | iex

Cargo

cargo install shard-cli

Build from source

git clone https://github.com/sandy-sachin7/shard.git
cd shard
cargo build --release
./target/release/shard --help

Quick start

# Initialize a repository
shard init

# Add files (staged for commit)
shard add model.pt
shard add dataset/           # recursive directory add

# Commit with a message
shard commit -m "v1 checkpoint" --author "Alice"

# View history
shard log
shard log --json             # machine-readable

# Check out files from a commit
shard checkout <commit_id>

# Share with peers
shard share                  # announce on P2P network

# Pull from a peer
shard pull /ip4/192.168.1.2/tcp/9876 <commit_id>

# Verify integrity and signature
shard verify <commit_id>

# Branching and merging
shard branch create experiment
shard checkout experiment
shard add model.pt
shard commit -m "experimental changes"
shard checkout main
shard merge experiment -m "merge experiment" --author "Alice"

# Backup and recovery
shard backup /tmp/repo-backup.tar.gz
shard restore /tmp/repo-backup.tar.gz
shard export <commit_id> /tmp/reconstructed
shard import /tmp/datasets -m "imported dataset" --author "Alice"

CLI Reference

Shard exposes a unified CLI with Git-like ergonomics. All commands support a --json flag for machine-readable output.

| Command | What it does | Key flags or subcommands |
| --- | --- | --- |
| init | Initialize a repository | --private, --db flat\|sled\|sqlite, --compression zstd\|zlib\|none, --chunker fixed\|rabin, --passphrase |
| add <path> | Stage files for commit | (recursive for directories) |
| commit | Create a signed commit | -m <msg>, --author <name> |
| log | Show commit history | --json |
| checkout <commit> | Restore files from commit | --json |
| status | Show working tree state | --json |
| verify <commit> | Verify integrity + signature | --json |
| diff <commit1> <commit2> | Compare two commits | --json |
| prune | Remove unreachable objects | --json |
| tag | Manage commit tags | add, list, delete |
| branch | Manage branches | create, delete, list |
| merge <branch> | Merge branch into current HEAD | -m <msg>, --author <name> |
| config | View/edit configuration | get, set |
| share | Announce commits to P2P network | --json |
| sync | Discover + fetch from peers | --json |
| pull <peer> <commit> | Pull commit from specific peer | --json |
| push <peer> | Push commits to peer | --json |
| peer add <multiaddr> | Add a known peer | --public-key <hex> |
| backup <output> | Create a tar.gz backup | --json |
| restore <backup> | Restore repo from backup | --json |
| export <commit> <dir> | Reconstruct commit to directory | --json |
| import <dir> | Ingest directory as commit | -m <msg>, --author <name> |
| recover | Recover from WAL crash | --json |
| health | Show repository diagnostics + metrics | --json |
| serve | Start HTTP API server | --addr <host:port> |
| unlock | Cache passphrase for session | --passphrase |
| relay | Start P2P relay node | --listen <multiaddr> |
| transfer | Manage P2P transfer queue | list, remove |
| key | Manage signing keys | rotate, list, verify, add-authorized, remove-authorized, list-authorized |
| completions | Generate shell completions | bash, zsh, fish, elvish, powershell |

Global flags

| Flag | Effect |
| --- | --- |
| --json | Machine-readable JSON output |
| --log-format | Log output format: plain (default) or json |
| --verbose | Debug-level logging |

Architecture

Shard is structured as a collection of decoupled crates, tied together by the main CLI entrypoint.

Component Diagram

graph TD
    CLI["shard CLI<br/>(clap argument parsing)"]
    
    subgraph Core ["core crate"]
        OpQueue["Operation Queue<br/>(read/write locking)"]
        Config["Config System<br/>(env overrides + validation)"]
        GC["Garbage Collector<br/>(DAG reachability scan)"]
        Tracing["Distributed Tracing<br/>(thread-local trace IDs)"]
        Metrics["Runtime Metrics<br/>(atomic counters)"]
        Chunker["Chunker<br/>(Fixed / Rabin)"]
        Compression["Compression<br/>(Zstd / Zlib)"]
        Store["Store<br/>(Sled / SQLite)"]
        CommitDAG["Commit DAG<br/>Manifest / Index / WAL<br/>Branch / Merge / Push"]
        API["HTTP API<br/>(axum /api/v1/*)"]
    end
    
    subgraph Net ["net crate"]
        LibP2P["libp2p Node<br/>TCP+Noise+Yamux<br/>mDNS / Kademlia<br/>Gossipsub<br/>Relay / DCUtR / AutoNAT<br/>Rate Limiting"]
    end
    
    subgraph Crypto ["crypto crate"]
        KeyGen["ed25519 key generation<br/>Signing & Verification<br/>Passphrase encryption<br/>Key rotation"]
    end
    
    subgraph Storage ["storage crate"]
        SledBackend["Sled Backend"]
        SQLiteBackend["SQLite Backend<br/>(r2d2 connection pool)"]
    end
    
    CLI --> Core
    CLI --> Net
    Core --> Crypto
    Core --> Storage
    Net --> Crypto
    Core -.-> OpQueue
    Core -.-> Config
    Core -.-> GC
    Core -.-> Tracing
    Core -.-> Metrics
    Core -.-> Chunker
    Core -.-> Compression
    Core -.-> Store
    Core -.-> CommitDAG
    Core -.-> API

Storage Layout

Shard stores all its data in a .shard/ directory at the root of the repository.

graph LR
    Root[".shard/"] --> Objects["objects/"]
    Objects --> Prefix["<2-prefix>/"]
    Prefix --> Hash["<hash> (content-addressed chunks)"]
    
    Root --> HEAD["HEAD (current commit reference)"]
    Root --> Config["config.json (repository config)"]
    Root --> Index["index (staging area)"]
    Root --> Wal["wal.log (crash recovery)"]
    
    Root --> Keys["keys/"]
    Keys --> Sec["secret.key"]
    Keys --> Pub["public.key"]
    
    Root --> Refs["refs/heads/ (branch pointers)"]
    Root --> Auth["authorized_keys (P2P auth whitelist)"]
    Root --> Peers["peers.json (known P2P peers)"]
    Root --> Tags["tags.json (named commit pointers)"]
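
As a rough illustration of the objects/ layout, the sketch below writes a chunk under objects/<2-prefix>/<hash> and reads it back with an integrity check. It rests on several assumptions: the file name is taken to be the full hex digest, BLAKE2b from the Python standard library stands in for Blake3, compression is ignored, and the helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def chunk_hash(chunk: bytes) -> str:
    # Stand-in hash: BLAKE2b from the standard library (Shard itself uses Blake3).
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()


def object_path(shard_dir: Path, digest: str) -> Path:
    # Assumed layout: objects/<first two hex chars>/<full hex digest>
    return shard_dir / "objects" / digest[:2] / digest


def put_chunk(shard_dir: Path, chunk: bytes) -> str:
    """Store a chunk under its own hash; identical chunks map to the same file."""
    digest = chunk_hash(chunk)
    path = object_path(shard_dir, digest)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(chunk)
    return digest


def get_chunk(shard_dir: Path, digest: str) -> bytes:
    """Read a chunk back and confirm that its content still matches its address."""
    data = object_path(shard_dir, digest).read_bytes()
    if chunk_hash(data) != digest:
        raise ValueError(f"corrupt object {digest}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    shard_dir = Path(tmp) / ".shard"
    first = put_chunk(shard_dir, b"weights-block-0")
    second = put_chunk(shard_dir, b"weights-block-0")  # deduplicated, no second file
    stored = [p for p in (shard_dir / "objects").rglob("*") if p.is_file()]
    round_trip = get_chunk(shard_dir, first)
    relative = object_path(shard_dir, first).relative_to(shard_dir).parts

print(relative[0], relative[1], len(stored))
```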

Key Design Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Chunking | Rabin (default) or Fixed | Rabin CDC improves dedup across versions; fixed gives predictable chunk sizes |
| Compression | Zstd or Zlib | Runtime selection; Zstd is typically faster than Zlib at comparable or better ratios |
| Hashing | Blake3 | Fast cryptographic hash, SIMD-accelerated |
| Signatures | ed25519 | Proven, fast, small signatures (64 bytes) |
| Storage | Sled, SQLite, or Flat file | Sled/SQLite for indexed queries; flat for portability |
| P2P | libp2p TCP+Noise+Yamux | Mature, NAT traversal via relay/DCUtR/AutoNAT |
| Wire format | JSON / CBOR | Serde over request-response + Gossipsub |
| Concurrency | Per-repo read-write queue | Reads run in parallel, writes are exclusive; no global lock |
| Config | JSON + env var overrides | 12-factor friendly; SHARD_* env vars take precedence |
| Tracing | Thread-local trace IDs | Correlate logs across operations |
| GC | DAG reachability scan | Marks everything reachable from HEAD/branches/tags/index, prunes the rest |
| Metrics | Static atomic counters | Runtime operation counters with JSON snapshot |
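
The GC row describes a mark-and-sweep pass over the commit DAG. A minimal sketch of that idea follows. The in-memory commit structure (a dict with `parents` and `chunks`) and the function names are hypothetical simplifications, not Shard's actual types.

```python
def mark(roots, commits):
    """Walk the DAG from the given refs and collect every reachable chunk hash."""
    seen_commits, live_chunks = set(), set()
    stack = list(roots)
    while stack:
        commit_id = stack.pop()
        if commit_id in seen_commits or commit_id not in commits:
            continue
        seen_commits.add(commit_id)
        live_chunks.update(commits[commit_id]["chunks"])
        stack.extend(commits[commit_id]["parents"])
    return seen_commits, live_chunks


def sweep(all_chunks, roots, commits, staged=()):
    """Return the chunk hashes that no ref and no staged file still needs."""
    _, live = mark(roots, commits)
    live |= set(staged)  # the index also keeps chunks alive
    return set(all_chunks) - live


commits = {
    "c1": {"parents": [], "chunks": ["a", "b"]},
    "c2": {"parents": ["c1"], "chunks": ["b", "c"]},
    "c3": {"parents": ["c1"], "chunks": ["d"]},  # no branch or tag points here
}
roots = ["c2"]  # HEAD, branch tips and tags would all be listed here
garbage = sweep({"a", "b", "c", "d", "e", "s"}, roots, commits, staged={"s"})
print(sorted(garbage))
```

In this example, chunk "d" belongs only to the unreferenced commit c3 and "e" belongs to nothing, so both would be pruned. The staged chunk "s" survives.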

Protocol Specification

This document details the P2P wire protocol and data synchronization primitives used in Shard.

Discovery

Nodes in the Shard network must discover each other before exchanging data. We rely on standard libp2p protocols:

  • Primary: libp2p Kademlia DHT.
  • LAN fallback: mDNS.
  • Manual bootstrap: shard peer add <multiaddr>.

Data Model (Deterministic and Canonical)

All metadata in Shard is canonicalized (sorted keys) and signed using ed25519.

  • Blob (chunk): Raw bytes saved in objects/<2prefix>/<hash>. Hash is computed using Blake3(chunk).
  • Manifest: Artifact descriptor containing filename, content type, compression flag, chunk list (ordered), merkle root, size, created_by, and created_at.
  • Commit node: JSON with commit_id (hash of canonical commit JSON), parents:[], manifests:[], author, message, timestamp, signature.
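
A minimal sketch of canonical hashing for a commit node follows. It assumes the ID is computed over the commit with its `commit_id` and `signature` fields left out, since neither can be part of its own input. That exclusion, the compact separators, and the use of BLAKE2b as a standard-library stand-in for Blake3 are assumptions of this example. Signing is omitted because ed25519 is not in the Python standard library.

```python
import hashlib
import json


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def commit_id(commit: dict) -> str:
    """Hash the canonical form of a commit, ignoring derived fields."""
    body = {k: v for k, v in commit.items() if k not in ("commit_id", "signature")}
    # Stand-in hash: BLAKE2b from the standard library (Shard itself uses Blake3).
    return hashlib.blake2b(canonical_json(body), digest_size=32).hexdigest()


commit = {
    "parents": [],
    "manifests": ["<manifest hash>"],
    "author": "Alice",
    "message": "v1 checkpoint",
    "timestamp": 1700000000,
}
reordered = dict(reversed(list(commit.items())))  # same fields, different key order

print(commit_id(commit) == commit_id(reordered))
```

Because keys are sorted before hashing, two peers that build the same commit with different field orders arrive at the same ID, and a signature can then be computed over those canonical bytes.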

Announcements

When a repository is updated, nodes announce the new commit over the P2P network.

  • Topic: shard:ann (global) and /shard/repo/<repo_id> (repo-specific)
  • Payload: JSON-serialized Announcement:
{
  "commit_id": "<blake3 hash>",
  "file_count": 42,
  "total_size": 1073741824,
  "repo_name": "my-model",
  "peer_multiaddr": "/ip4/192.168.1.2/tcp/9876"
}

Nodes subscribe to both topics. Announcements are published on initial share, on a 5-second heartbeat, and when new connections are established.
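
The payload above can be modelled as a small typed record. The sketch below encodes and decodes it as JSON and derives the two topic names. The field names and topics come from this section, while the class, the helper function and the validation rules are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

GLOBAL_TOPIC = "shard:ann"


def repo_topic(repo_id: str) -> str:
    return f"/shard/repo/{repo_id}"


@dataclass(frozen=True)
class Announcement:
    commit_id: str
    file_count: int
    total_size: int
    repo_name: str
    peer_multiaddr: str

    def encode(self) -> bytes:
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def decode(cls, payload: bytes) -> "Announcement":
        fields = json.loads(payload)
        if not isinstance(fields, dict):
            raise ValueError("announcement must be a JSON object")
        ann = cls(**fields)  # raises TypeError on missing or unknown fields
        for name in ("file_count", "total_size"):
            value = getattr(ann, name)
            if not isinstance(value, int) or value < 0:
                raise ValueError(f"{name} must be a non-negative integer")
        return ann


ann = Announcement(
    commit_id="<blake3 hash>",
    file_count=42,
    total_size=1073741824,
    repo_name="my-model",
    peer_multiaddr="/ip4/192.168.1.2/tcp/9876",
)
decoded = Announcement.decode(ann.encode())
print(decoded == ann, [GLOBAL_TOPIC, repo_topic("my-model")])
```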

Rate Limiting

  • Gossipsub: Custom message_id_fn (blake3 content-hash dedup), max_messages_per_rpc(100)
  • Announcements: Per-peer per-commit dedup, max 5 unique (peer,commit_id) pairs per 60s window
  • Requests: Max 50 per-peer requests per window; request-response timeout of 60s
  • Reset: All rate counters reset every 5s interval tick
  • Deduplication: Duplicate messages silently dropped via content-hash message IDs
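
The per-peer limits can be sketched as a fixed-window counter. The list above does not spell out how the 60s window and the 5s reset tick interact, so this example takes the window length as a parameter and resets all counters when a window elapses. The class and method names are hypothetical.

```python
class PeerLimiter:
    """Fixed-window limits for requests and announcements, tracked per peer."""

    def __init__(self, max_requests=50, max_announcements=5, window=60.0):
        self.max_requests = max_requests
        self.max_announcements = max_announcements
        self.window = window
        self._window_start = None
        self._requests = {}   # peer -> request count in the current window
        self._announced = {}  # peer -> commit IDs announced in the current window

    def _roll(self, now):
        # Start a fresh window (and drop all counters) once the old one expires.
        if self._window_start is None or now - self._window_start >= self.window:
            self._window_start = now
            self._requests.clear()
            self._announced.clear()

    def allow_request(self, peer, now):
        self._roll(now)
        count = self._requests.get(peer, 0)
        if count >= self.max_requests:
            return False
        self._requests[peer] = count + 1
        return True

    def allow_announcement(self, peer, commit_id, now):
        self._roll(now)
        seen = self._announced.setdefault(peer, set())
        if commit_id in seen:  # duplicate (peer, commit_id) pair: drop silently
            return False
        if len(seen) >= self.max_announcements:
            return False
        seen.add(commit_id)
        return True


limiter = PeerLimiter()
requests = [limiter.allow_request("peerA", now=0.0) for _ in range(51)]
after_reset = limiter.allow_request("peerA", now=60.0)

ann_limiter = PeerLimiter()
unique = [ann_limiter.allow_announcement("peerA", f"c{i}", now=1.0) for i in range(6)]
duplicate = ann_limiter.allow_announcement("peerA", "c0", now=2.0)
print(requests.count(True), after_reset, unique, duplicate)
```

Time is passed in explicitly rather than read from a clock, which keeps the limiter deterministic and easy to test.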

Fetch Flow

The core mechanism for retrieving artifacts from peers operates via a parallelized fetch flow.

sequenceDiagram
    participant Peer A (Client)
    participant Peer B (Announcer)
    
    Peer B--)Peer A: 1. PubSub Announcement (commit_id)
    Peer A->>Peer B: 2. Request Manifest (commit_id)
    Peer B-->>Peer A: 3. Return Signed Manifest
    
    Note over Peer A: Compares chunk list<br/>to local index
    
    par Chunk Requests
        Peer A->>Peer B: 4a. Request Chunk (hash1)
        Peer B-->>Peer A: 5a. Return Chunk payload + header
    and
        Peer A->>Peer B: 4b. Request Chunk (hash2)
        Peer B-->>Peer A: 5b. Return Chunk payload + header
    end
    
    Note over Peer A: 6. Assemble artifact & verify Merkle root

The same flow, step by step:

  1. Announcement: A peer sees an announcement and requests the manifest, either via the DHT or over a direct stream to the announcer.
  2. Manifest: The announcer returns the signed manifest.
  3. Diff: The peer compares the manifest's chunk list to its local index and requests only the missing chunks, in parallel, over libp2p streams.
  4. Chunks: Each chunk arrives with a piece header {hash, offset, size} and a signed payload.
  5. Assemble: Once all chunks are present, the client assembles the artifact and verifies the Merkle root and final digest.
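
The diff and assemble steps can be sketched as follows. The Merkle construction here (pairwise hashing, duplicating the last node on odd levels) is an assumption, since this section does not define the tree shape. BLAKE2b from the Python standard library stands in for Blake3, and all function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Stand-in hash: BLAKE2b from the standard library (Shard itself uses Blake3).
    return hashlib.blake2b(data, digest_size=32).digest()


def merkle_root(leaves: list) -> bytes:
    """Pairwise-hash the ordered chunk hashes up to a single root."""
    if not leaves:
        return h(b"")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed rule: duplicate the odd node out
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def missing_chunks(manifest_chunks: list, local_index) -> list:
    """Chunk hashes listed in the manifest that are not yet stored locally."""
    seen, todo = set(), []
    for chunk_hash in manifest_chunks:
        if chunk_hash not in local_index and chunk_hash not in seen:
            seen.add(chunk_hash)
            todo.append(chunk_hash)
    return todo


def assemble(manifest_chunks: list, expected_root: bytes, store: dict) -> bytes:
    """Verify every chunk and the Merkle root, then concatenate in manifest order."""
    parts = []
    for chunk_hash in manifest_chunks:
        payload = store[chunk_hash]
        if h(payload) != chunk_hash:
            raise ValueError("chunk payload does not match its hash")
        parts.append(payload)
    if merkle_root(manifest_chunks) != expected_root:
        raise ValueError("Merkle root mismatch")
    return b"".join(parts)


payloads = [b"aa", b"bb", b"cc"]
chunk_ids = [h(p) for p in payloads]
root = merkle_root(chunk_ids)
remote = dict(zip(chunk_ids, payloads))   # what the announcing peer holds
local = {chunk_ids[0]: payloads[0]}       # what the client already has

todo = missing_chunks(chunk_ids, local)
for chunk_hash in todo:                   # stands in for the parallel chunk requests
    local[chunk_hash] = remote[chunk_hash]
artifact = assemble(chunk_ids, root, local)
print(len(todo), artifact)
```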

Resilience & Security

  • Resilience: Parallel downloads, per-chunk retries, and resumable transfers (partial chunks are persisted to .partial/).
  • Security: Metadata is signed and payload integrity is verified; payloads can optionally be encrypted with per-repo symmetric keys. Key rotation is recorded as a signed revocation commit in the DAG.
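
One way to picture resumable transfers is a directory of verified partial chunks that is consulted before any chunk is requested again. The sketch below assumes one file per chunk, named by its hash, inside a .partial/ directory. That naming, the atomic-rename step, the helper names, and the use of BLAKE2b as a standard-library stand-in for Blake3 are all assumptions of this example.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def digest(data: bytes) -> str:
    # Stand-in hash: BLAKE2b from the standard library (Shard itself uses Blake3).
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def save_partial(partial_dir: Path, chunk_hash: str, payload: bytes) -> None:
    """Persist one received chunk; the rename makes the write all-or-nothing."""
    partial_dir.mkdir(parents=True, exist_ok=True)
    tmp = partial_dir / (chunk_hash + ".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, partial_dir / chunk_hash)


def remaining(partial_dir: Path, wanted: list) -> list:
    """Chunks that still need fetching: absent on disk or failing verification."""
    todo = []
    for chunk_hash in wanted:
        path = partial_dir / chunk_hash
        if path.is_file() and digest(path.read_bytes()) == chunk_hash:
            continue
        todo.append(chunk_hash)
    return todo


payloads = [b"part-0", b"part-1", b"part-2"]
wanted = [digest(p) for p in payloads]

with tempfile.TemporaryDirectory() as tmp:
    partial_dir = Path(tmp) / ".partial"
    save_partial(partial_dir, wanted[0], payloads[0])
    save_partial(partial_dir, wanted[1], payloads[1])
    # The transfer is interrupted here; on restart only the third chunk is needed.
    after_restart = remaining(partial_dir, wanted)
    # A chunk damaged on disk fails verification and is fetched again.
    (partial_dir / wanted[0]).write_bytes(b"garbage")
    after_corruption = remaining(partial_dir, wanted)

print(len(after_restart), len(after_corruption))
```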