Introduction
Distributed, content-addressed version control for large ML artifacts
No cloud bills, no central bottlenecks. Local-first, protocol-first, peer-to-peer.
Shard is a protocol-first, local-first, peer-to-peer version control system for ML artifacts — models, datasets, checkpoints. It provides Git-like ergonomics, content-addressed storage, signed commits, and direct P2P transfers.
Why Shard?
Machine Learning artifacts are often too large for traditional version control systems like Git, leading developers to rely on cloud storage solutions (S3, GCS) or Git LFS (which still requires centralized servers).
Shard solves this by providing:
- True Peer-to-Peer Distribution: No central server required. Discover peers and sync artifacts directly.
- Content-Addressed Storage: Deduplication built-in through advanced chunking (Fixed or Rabin).
- Git-like Ergonomics: Familiar CLI commands (
add,commit,checkout,log). - Cryptographic Provenance: Every commit is signed using
ed25519keys for verifiable history.
Comparison
| Feature | Git | Shard |
|---|---|---|
| Primary use | Source code | ML artifacts (models, datasets, checkpoints) |
| Chunking | CDC (git fast-import) | Rabin + Fixed + configurable |
| Compression | zlib (default) | Zstd or Zlib (runtime selectable) |
| Hashing | SHA-1 / SHA-256 | Blake3 |
| P2P | Remote-centric | Native P2P (mDNS, Kademlia, Gossipsub) |
| Storage backend | Flat files + packfiles | Sled or SQLite indexed store |
| Signing | GPG (optional) | ed25519 (built-in, every commit) |
| Transport | SSH/HTTPS | libp2p TCP + Noise + Yamux |
Performance
Shard is designed for large artifacts (100 MB – 100 GB). Key performance characteristics:
- Chunking throughput: ~1 GB/s (fixed), ~500 MB/s (Rabin CDC)
- Compression: Zstd level 3 — ~500 MB/s compress, ~2 GB/s decompress
- Parallel pulls: concurrent chunk requests — saturates available bandwidth
- Memory: bounded by configurable concurrency cap, not artifact size
Getting Started
Install
Linux & macOS (one-liner)
curl -fsSL https://raw.githubusercontent.com/sandy-sachin7/shard/main/scripts/install.sh | bash
Windows (PowerShell)
irm https://raw.githubusercontent.com/sandy-sachin7/shard/main/scripts/install.ps1 | iex
Cargo
cargo install shard-cli
Build from source
git clone https://github.com/sandy-sachin7/shard.git
cd shard
cargo build --release
./target/release/shard --help
Quick start
# Initialize a repository
shard init
# Add files (staged for commit)
shard add model.pt
shard add dataset/ # recursive directory add
# Commit with a message
shard commit -m "v1 checkpoint" --author "Alice"
# View history
shard log
shard log --json # machine-readable
# Check out files from a commit
shard checkout <commit_id>
# Share with peers
shard share # announce on P2P network
# Pull from a peer
shard pull /ip4/192.168.1.2/tcp/9876 <commit_id>
# Verify integrity and signature
shard verify <commit_id>
# Branching and merging
shard branch create experiment
shard checkout experiment
shard add model.pt
shard commit -m "experimental changes"
shard checkout main
shard merge experiment -m "merge experiment" --author "Alice"
# Backup and recovery
shard backup /tmp/repo-backup.tar.gz
shard restore /tmp/repo-backup.tar.gz
shard export <commit_id> /tmp/reconstructed
shard import /tmp/datasets -m "imported dataset" --author "Alice"
CLI Reference
Shard exposes a unified CLI with Git-like ergonomics. All commands support a --json flag for machine-readable output.
| Command | What it does | Key flags |
|---|---|---|
init | Initialize a repository | --private, --db flat|sled|sqlite, --compression zstd|zlib|none, --chunker fixed|rabin, --passphrase |
add <path> | Stage files for commit | (recursive for directories) |
commit | Create a signed commit | -m <msg>, --author <name> |
log | Show commit history | --json |
checkout <commit> | Restore files from commit | --json |
status | Show working tree state | --json |
verify <commit> | Verify integrity + signature | --json |
diff <commit1> <commit2> | Compare two commits | --json |
prune | Remove unreachable objects | --json |
tag | Manage commit tags | add, list, delete |
branch | Manage branches | create, delete, list |
merge <branch> | Merge branch into current HEAD | -m <msg>, --author <name> |
config | View/edit configuration | get, set |
share | Announce commits to P2P network | --json |
sync | Discover + fetch from peers | --json |
pull <peer> <commit> | Pull commit from specific peer | --json |
push <peer> | Push commits to peer | --json |
peer add <multiaddr> | Add a known peer | --public-key <hex> |
backup <output> | Create a tar.gz backup | --json |
restore <backup> | Restore repo from backup | --json |
export <commit> <dir> | Reconstruct commit to directory | --json |
import <dir> | Ingest directory as commit | -m <msg>, --author <name> |
recover | Recover from WAL crash | --json |
health | Show repository diagnostics + metrics | --json |
serve | Start HTTP API server | --addr <host:port> |
unlock | Cache passphrase for session | --passphrase |
relay | Start P2P relay node | --listen <multiaddr> |
transfer | Manage P2P transfer queue | list, remove |
key | Manage signing keys | rotate, list, verify, add-authorized, remove-authorized, list-authorized |
completions | Generate shell completions | bash, zsh, fish, elvish, powershell |
Global flags
| Flag | Effect |
|---|---|
--json | Machine-readable JSON output |
--log-format | Log output format: plain (default) or json |
--verbose | Debug-level logging |
Architecture
Shard is structured as a collection of decoupled crates, tied together by the main CLI entrypoint.
Component Diagram
graph TD
CLI["shard CLI<br/>(clap argument parsing)"]
subgraph Core ["core crate"]
OpQueue["Operation Queue<br/>(read/write locking)"]
Config["Config System<br/>(env overrides + validation)"]
GC["Garbage Collector<br/>(DAG reachability scan)"]
Tracing["Distributed Tracing<br/>(thread-local trace IDs)"]
Metrics["Runtime Metrics<br/>(atomic counters)"]
Chunker["Chunker<br/>(Fixed / Rabin)"]
Compression["Compression<br/>(Zstd / Zlib)"]
Store["Store<br/>(Sled / SQLite)"]
CommitDAG["Commit DAG<br/>Manifest / Index / WAL<br/>Branch / Merge / Push"]
API["HTTP API<br/>(axum /api/v1/*)"]
end
subgraph Net ["net crate"]
LibP2P["libp2p Node<br/>TCP+Noise+Yamux<br/>mDNS / Kademlia<br/>Gossipsub<br/>Relay / DCUtR / AutoNAT<br/>Rate Limiting"]
end
subgraph Crypto ["crypto crate"]
KeyGen["ed25519 key generation<br/>Signing & Verification<br/>Passphrase encryption<br/>Key rotation"]
end
subgraph Storage ["storage crate"]
SledBackend["Sled Backend"]
SQLiteBackend["SQLite Backend<br/>(r2d2 connection pool)"]
end
CLI --> Core
CLI --> Net
Core --> Crypto
Core --> Storage
Net --> Crypto
Core -.-> OpQueue
Core -.-> Config
Core -.-> GC
Core -.-> Tracing
Core -.-> Metrics
Core -.-> Chunker
Core -.-> Compression
Core -.-> Store
Core -.-> CommitDAG
Core -.-> API
Storage Layout
Shard stores all its data in a .shard/ directory at the root of the repository.
graph LR
Root[".shard/"] --> Objects["objects/"]
Objects --> Prefix["<2-prefix>/"]
Prefix --> Hash["<hash> (content-addressed chunks)"]
Root --> HEAD["HEAD (current commit reference)"]
Root --> Config["config.json (repository config)"]
Root --> Index["index (staging area)"]
Root --> Wal["wal.log (crash recovery)"]
Root --> Keys["keys/"]
Keys --> Sec["secret.key"]
Keys --> Pub["public.key"]
Root --> Refs["refs/heads/ (branch pointers)"]
Root --> Auth["authorized_keys (P2P auth whitelist)"]
Root --> Peers["peers.json (known P2P peers)"]
Root --> Tags["tags.json (named commit pointers)"]
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Chunking | Rabin (default) or Fixed | Rabin CDC improves dedup across versions; fixed for predictable sizes |
| Compression | Zstd or Zlib | Runtime selection; zstd is faster with better ratios |
| Hashing | Blake3 | Fastest cryptographic hash, SIMD-accelerated |
| Signatures | ed25519 | Proven, fast, small signatures (64 bytes) |
| Storage | Sled, SQLite, or Flat file | Sled/SQLite for indexed queries; flat for portability |
| P2P | libp2p TCP+Noise+Yamux | Mature, NAT traversal via relay/DCUtR/AutoNAT |
| Wire format | JSON / CBOR | Serde over request-response + Gossipsub |
| Concurrency | Per-repo read-write queue | Reads parallel, writes exclusive; no global lock |
| Config | JSON + env var overrides | 12-factor friendly; SHARD_* env vars take precedence |
| Tracing | Thread-local trace IDs | Correlate logs across operations |
| GC | DAG reachability scan | Marks all reachable from HEAD/branches/tags/index, prunes rest |
| Metrics | Static atomic counters | Runtime operation counters with JSON snapshot |
Protocol Specification
This document details the P2P wire protocol and data synchronization primitives used in Shard.
Discovery
Nodes in the Shard network must discover each other before exchanging data. We rely on standard libp2p protocols:
- Primary: libp2p DHT + Kademlia.
- LAN fallback: mDNS.
- Manual bootstrap:
shard peer add <multiaddr>.
Data Model (Deterministic and Canonical)
All metadata in Shard is canonicalized (sorted keys) and signed using ed25519.
- Blob (chunk): Raw bytes saved in
objects/<2prefix>/<hash>. Hash is computed using Blake3(chunk). - Manifest: Artifact descriptor containing filename, content type, compression flag, chunk list (ordered), merkle root, size, created_by, and created_at.
- Commit node: JSON with
commit_id(hash of canonical commit JSON),parents:[],manifests:[],author,message,timestamp,signature.
Announcements
When a repository is updated, nodes announce the new commit over the P2P network.
- Topic:
shard:ann(global) and/shard/repo/<repo_id>(repo-specific) - Payload: JSON-serialized
Announcement:
{
"commit_id": "<blake3 hash>",
"file_count": 42,
"total_size": 1073741824,
"repo_name": "my-model",
"peer_multiaddr": "/ip4/192.168.1.2/tcp/9876"
}
Nodes subscribe to both topics. Announcements are published on initial share, on a 5-second heartbeat, and when new connections are established.
Rate Limiting
- Gossipsub: Custom
message_id_fn(blake3 content-hash dedup),max_messages_per_rpc(100) - Announcements: Per-peer per-commit dedup, max 5 unique (peer,commit_id) pairs per 60s window
- Requests: Max 50 per-peer requests per window; request-response timeout of 60s
- Reset: All rate counters reset every 5s interval tick
- Deduplication: Duplicate messages silently dropped via content-hash message IDs
Fetch Flow
The core mechanism for retrieving artifacts from peers operates via a parallelized fetch flow.
sequenceDiagram
participant Peer A (Client)
participant Peer B (Announcer)
Peer B--)Peer A: 1. PubSub Announcement (commit_id)
Peer A->>Peer B: 2. Request Manifest (commit_id)
Peer B-->>Peer A: 3. Return Signed Manifest
Note over Peer A: Compares chunk list<br/>to local index
par Chunk Requests
Peer A->>Peer B: 4a. Request Chunk (hash1)
Peer B-->>Peer A: 5a. Return Chunk payload + header
and
Peer A->>Peer B: 4b. Request Chunk (hash2)
Peer B-->>Peer A: 5b. Return Chunk payload + header
end
Note over Peer A: 6. Assemble artifact & verify Merkle root
- Announcement: Peer sees announcement → requests manifest via DHT or direct stream to announcer.
- Manifest: Manifest returned (signed).
- Diff: Peer compares chunk list to local index → requests missing chunks via parallel piece requests (libp2p streams).
- Chunks: Chunks transferred with piece headers
{hash, offset, size}and signed payload. - Assemble: On complete, client assembles artifact and verifies Merkle root and final digest.
Resilience & Security
- Resilience: Parallel downloads, chunk retries, resumable transfer (persists partial chunks to
.partial/). - Security: Metadata signed; payloads integrity verified; optional encryption with per-repo symmetric keys. Key rotation is handled via a signed revocation commit in the DAG.