feat: configurable datastores with S3 sync and distributed locking#669
Review Summary
This is a well-designed PR that adds configurable datastores with S3 sync and distributed locking. The implementation follows domain-driven design principles and adheres to the project's code style guidelines.
Architecture (DDD Perspective)
The architecture properly separates concerns across layers:
- Domain layer (`src/domain/datastore/`): pure interfaces (`DistributedLock`, `DatastorePathResolver`, `DatastoreVerifier`) and value objects (`LockInfo`, `DatastoreConfig`)
- Infrastructure layer (`src/infrastructure/persistence/`): concrete implementations (`FileLock`, `S3Lock`, `S3Client`, `S3CacheSyncService`)
- CLI layer: thin adapters that wire domain/infrastructure and handle presentation
Code Quality Checks
- TypeScript strict mode - all types properly defined
- Named exports used throughout
- AGPLv3 copyright headers present on all new files
- The `type AnyOptions = any` pattern in CLI commands follows established codebase convention (60+ files use it)
Test Coverage
Comprehensive unit tests provided:
- `resolve_datastore_test.ts`: config resolution logic
- `datastore_pattern_matcher_test.ts`: gitignore-style glob patterns
- `distributed_lock_test.ts`: `LockTimeoutError` behavior
- `file_lock_test.ts`: FileLock acquire/release/timeout/stale handling
- `s3_lock_test.ts`: S3Lock with mock client
- `default_datastore_path_resolver_test.ts`: path resolution logic
Security Review
- Atomic file operations for lock creation (`createNew: true`)
- S3 conditional writes for lock acquisition (`If-None-Match: *`)
- Lock TTL + heartbeat prevents indefinite locks
- Force-release requires explicit `--force` flag
- Proper error handling without credential leakage
- No OWASP vulnerabilities identified
No Blocking Issues
The PR is well-structured and ready to merge.
Suggestions (non-blocking)
- `S3CacheSyncService` (~500 lines) could be split into smaller components in future iterations if it grows further
- Consider adding a dedicated test file for `datastore_migration_service.ts` in a future PR (migration is currently tested indirectly via setup commands)
- The design document in `design/datastores.md` is excellent - great reference for future maintainers
LGTM! Excellent work on this feature.
There was a problem hiding this comment.
Adversarial Review
Critical / High
- CRITICAL - Path Traversal in S3 Cache Sync (`s3_cache_sync.ts:139-143`)

  S3 object keys can contain path traversal sequences (`../`). When pulling files from S3, keys are passed directly to `path.join()`, which normalizes `..` segments, allowing escape from the cache directory.

  Breaking example:
  1. Attacker controls the S3 bucket (compromised credentials, malicious team member, social engineering)
  2. Attacker creates an S3 object with key `../../../home/user/.bashrc`
  3. User runs `swamp datastore sync --pull`
  4. Code executes `join(cachePath, "../../../home/user/.bashrc")` → `/home/user/.bashrc`
  5. Attacker content overwrites the user's `.bashrc` (arbitrary file write)

  Affected code paths:
  - `pullFile()` line 141: `const localPath = join(this.cachePath, relativePath);`
  - `pullChanged()` uses keys from the S3 index
  - `pullAll()` uses keys from `listAllObjects()`

  Fix: validate that the resolved path stays within the cache directory:

  ```ts
  const localPath = join(this.cachePath, relativePath);
  if (!localPath.startsWith(this.cachePath + "/")) {
    throw new Error(`Path traversal detected: ${relativePath}`);
  }
  ```
- HIGH - Race Condition in Stale Lock Force-Acquire (`s3_lock.ts:106-121`, `file_lock.ts:109-119`)

  The stale lock detection has a TOCTOU race between checking staleness and deleting.

  Breaking example:
  1. Process A reads the stale lock and decides to force-acquire (line 112)
  2. Process B legitimately acquires a fresh lock (replacing the stale one)
  3. Process A deletes Process B's fresh lock (line 115)
  4. Process A and Process C both acquire "the lock"
  5. Two processes hold the lock simultaneously → data corruption

  Fix: use a conditional delete. For S3, check that the ETag matches the stale lock before deleting. For the filesystem, verify via the lockfile mtime that it hasn't changed.
- HIGH - Heartbeat Race on Lock Release (`s3_lock.ts:161-168`, `file_lock.ts:164-170`)

  The heartbeat's `extend()` method can race with `release()`.

  Breaking example:
  1. Heartbeat interval fires, `extend()` starts executing
  2. `extend()` passes the `if (!this.held)` guard (held is still true)
  3. `release()` is called, sets `this.held = false`, deletes the lock
  4. Process B acquires the lock
  5. `extend()` completes its `await writeTextFile`/`putObject`, overwriting Process B's lock

  Fix: use a mutex, or check `this.held` again after the async write and delete the lock if it was released while extending.
- HIGH - Degraded Mode Silently Allows Concurrent Corruption (`datastore_sync_coordinator.ts:77-84`)

  If lock acquisition fails (network issue, permissions), the code logs a warning and continues without a lock. This silently degrades to an unsafe mode where concurrent commands can corrupt data.

  Breaking example:
  1. S3 bucket has restrictive permissions, lock acquire fails
  2. User runs two `swamp model run` commands concurrently
  3. Both proceed without locks → interleaved writes → data corruption
  4. User only sees a warning in logs, not an error

  Fix: lock acquisition failure should be fatal for data-mutating operations, not a warning.
Medium
- MEDIUM - Hidden Files Excluded from pushChanged() (`s3_cache_sync.ts:305-309`)

  `pushChanged()` skips ALL files starting with `.` via `rel.startsWith(".")`, but `pullAll()` only skips explicit metadata files. This asymmetry means:
  - Legitimate hidden files (`.gitkeep`, hidden output files) won't sync to S3
  - Other machines won't receive these files

  Fix: only skip the specific metadata files, not all hidden files.
- MEDIUM - No Validation of S3 Bucket Name (`resolve_datastore.ts:63-72`)

  The `SWAMP_DATASTORE=s3:...` parsing accepts any string as a bucket name without validation. Invalid bucket names (uppercase, special characters, wrong length) fail later with cryptic AWS errors.

  Fix: validate the bucket name format before use.
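  A minimal validation sketch (hypothetical helper, not the PR's actual code) covering the core AWS bucket-naming rules might look like:

  ```typescript
  // Hypothetical validator covering the main S3 bucket naming rules:
  // 3-63 chars, lowercase letters/digits/dots/hyphens, must start and
  // end with a letter or digit, no adjacent periods, not IP-formatted.
  function isValidBucketName(name: string): boolean {
    if (name.length < 3 || name.length > 63) return false;
    if (!/^[a-z0-9][a-z0-9.-]*[a-z0-9]$/.test(name)) return false;
    if (name.includes("..")) return false;
    if (/^\d+\.\d+\.\d+\.\d+$/.test(name)) return false;
    return true;
  }
  ```

  Rejecting these early turns a cryptic AWS 400 into an immediate, actionable CLI error.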
- MEDIUM - response.Body! Non-Null Assertion (`s3_client.ts:98`)

  `return new Uint8Array(await response.Body!.transformToByteArray());` - the `!` assertion throws a confusing error if Body is null/undefined.

  Fix: check for null and throw a descriptive error.
Low
- LOW - Migration Doesn't Preserve mtime (`datastore_migration_service.ts:103-105`)

  `Deno.copyFile` doesn't preserve modification times. Since sync uses mtime for change detection, this could cause spurious sync operations after migration.

- LOW - SIGINT Handler Doesn't Await Release (`datastore_sync_coordinator.ts:72-74`)

  `Deno.exit(130)` is called immediately without awaiting `lock.release()`. The release may not complete before process exit.

- LOW - Multiple requireInitializedRepo Calls Would Deadlock (latent)

  If `requireInitializedRepo` is called twice in the same process (nested commands, bugs), the second call creates a new lock instance and tries to acquire it, blocking on the first lock. It eventually times out after 60s, but the UX is poor.
Verdict
FAIL - The path traversal vulnerability (finding #1) is a critical security issue that allows arbitrary file write when syncing from a malicious/compromised S3 bucket. The lock race conditions (#2, #3) undermine the distributed locking guarantees that are the core feature of this PR. These must be fixed before merge.
## Summary
Swamp's runtime data (model data, workflow runs, outputs, audit logs, telemetry, secrets, bundles) has always lived in .swamp/ inside the repo. This PR makes the storage backend configurable, adding S3 as a first-class option for team collaboration and
introducing distributed locking across both backends to prevent data corruption from concurrent CLI invocations.
- Two backends: filesystem (default, backward-compatible) and S3 (with local cache + automatic sync)
- Distributed locking: both backends now acquire a lock before command execution and release it after — filesystem uses advisory lockfiles, S3 uses conditional writes
- New swamp datastore command group: setup, status, sync, and lock management
- Automatic migration: swamp datastore setup migrates existing data to the new backend
- Concurrent S3 transfers: pull/push operations run in batches of 10 to minimize sync latency
## User Impact
### Zero breaking changes for existing users
Existing repos with no datastore configuration continue to work identically. The default datastore is filesystem at {repoDir}/.swamp/, which is exactly where data lives today. The only behavioral difference is that a .datastore.lock file will now be
created inside .swamp/ during command execution — this is transparent and auto-cleaned.
### New: S3 datastore for team collaboration
Teams can now share runtime data via S3:
swamp datastore setup s3 --bucket my-bucket --prefix my-project --region us-east-1
This pushes existing local data to S3 and updates .swamp.yaml. Every subsequent command automatically pulls changes before execution and pushes changes after. The local cache at ~/.swamp/repos/{repoId}/ is fully disposable — deleting it or cloning the
repo on a new machine repopulates from S3 on the next command.
### New: external filesystem datastore
Move runtime data outside the repo (e.g., shared NFS mount):
swamp datastore setup filesystem --path /mnt/shared/swamp-data
## Migration paths
| From | To | Command |
|----------------------|---------------------|--------------------------------------------------------|
| Default .swamp/ | External filesystem | swamp datastore setup filesystem --path /path |
| Default .swamp/ | S3 | swamp datastore setup s3 --bucket name |
| External filesystem | S3 | swamp datastore setup s3 --bucket name |
| S3 | External filesystem | swamp datastore setup filesystem --path /path |
| Any backend | Default .swamp/ | Edit .swamp.yaml, remove datastore field |
Each setup command: verifies the target is accessible, migrates existing data, updates .swamp.yaml, and cleans up the source. Use --skip-migration to skip the data copy.
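The setup sequence above can be sketched as an ordered pipeline. This is an illustrative sketch with stand-in step names, not the PR's actual API:

```typescript
// Illustrative pipeline for `swamp datastore setup`: verify the target,
// migrate data (unless skipped), update config, then clean up the source.
type Step = "verify" | "migrate" | "updateConfig" | "cleanup";

async function runSetup(
  steps: Record<Step, () => Promise<void>>,
  opts: { skipMigration?: boolean } = {},
): Promise<Step[]> {
  const executed: Step[] = [];
  const run = async (step: Step) => {
    await steps[step]();
    executed.push(step);
  };
  await run("verify"); // target must be reachable before touching data
  if (!opts.skipMigration) {
    await run("migrate"); // copy existing data to the new backend
  }
  await run("updateConfig"); // point .swamp.yaml at the new backend
  await run("cleanup"); // remove data from the old source
  return executed;
}
```

Running verification first means a misconfigured target fails fast, before any data has moved.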
### CI/CD: environment variable override
Override the datastore without modifying .swamp.yaml:
export SWAMP_DATASTORE=s3:my-bucket/my-prefix
export SWAMP_DATASTORE=filesystem:/tmp/swamp-data
Resolution priority: SWAMP_DATASTORE env var > --datastore CLI arg > .swamp.yaml > default.
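The stated priority reduces to a first-defined-wins lookup. A sketch (field names and the default spec string are illustrative, not the PR's code):

```typescript
// Illustrative resolver: the first defined source wins, matching the
// documented priority (SWAMP_DATASTORE env var > --datastore CLI arg >
// .swamp.yaml > default filesystem datastore).
function resolveDatastoreSpec(sources: {
  envVar?: string; // SWAMP_DATASTORE
  cliArg?: string; // --datastore
  yamlConfig?: string; // datastore field in .swamp.yaml
}): string {
  const defaultSpec = "filesystem:.swamp"; // stand-in for {repoDir}/.swamp/
  return sources.envVar ?? sources.cliArg ?? sources.yamlConfig ?? defaultSpec;
}
```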
### New: lock management commands
If a process crashes without releasing the datastore lock, the lock auto-expires after 30 seconds. For immediate recovery:
swamp datastore lock status # Show who holds the lock
swamp datastore lock release --force # Force-release stuck lock
## Design Choices
### Why a lock for filesystem too?
Without locking, two concurrent swamp model run invocations on the same repo could corrupt shared state (overlapping writes to data versions, workflow runs, audit logs). Previously this was a silent race condition. Both backends now acquire a lock at
command start and release at command end, making concurrent access safe. The filesystem lock uses advisory lockfiles with Deno.open({ createNew: true }) for atomic creation.
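The atomic-create idea can be demonstrated with an exclusive-create open. This Node-flavoured sketch uses the `wx` flag, the analogue of Deno's `createNew: true`; the lockfile payload shape is illustrative:

```typescript
import { closeSync, openSync, unlinkSync, writeFileSync } from "node:fs";

// Exclusive create: 'wx' fails with EEXIST if the lockfile already exists,
// so exactly one process can create it - acquisition is atomic at the
// filesystem level (analogous to Deno.open(path, { createNew: true })).
function tryAcquireLock(lockPath: string, holder: string): boolean {
  try {
    const fd = openSync(lockPath, "wx");
    writeFileSync(fd, JSON.stringify({ holder, acquiredAt: Date.now() }));
    closeSync(fd);
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false; // already held
    throw err;
  }
}

function releaseLock(lockPath: string): void {
  unlinkSync(lockPath);
}
```

Because the create-or-fail decision happens inside a single syscall, there is no window where two processes can both believe they created the lockfile.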
### Why not lazy S3 pull?
We considered three approaches for S3 sync performance:
1. Lazy pull (fetch on cache miss) — best latency but complicates offline behavior and requires intercepting all file reads throughout the codebase
2. Directory-scoped pull (each command declares which subdirectories it needs) — lower transfer volume but fragile annotation maintenance
3. Concurrent full pull (batch downloads in parallel) — simplest, no behavioral changes
We chose option 3 (concurrent transfers in batches of 10). It's the lowest-risk improvement and doesn't close any doors. Options 1 and 2 can be layered on later if repos grow large enough to warrant them.
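The batch-of-10 pattern can be sketched with a generic helper (illustrative, not the PR's code): slice the work into fixed-size batches and settle each batch before starting the next, so at most `batchSize` transfers are in flight.

```typescript
// Illustrative: run async transfers in fixed-size batches so no more
// than `batchSize` requests are in flight at once. allSettled keeps
// one failed transfer from aborting the rest of its batch.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.allSettled(batch.map(task))));
  }
  return results;
}
```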
### Why size + mtime instead of content hashing?
Change detection compares stat.size and stat.mtime rather than computing content hashes. This avoids reading every file on every sync. The write paths (atomicWriteTextFile, Deno.writeFile) always update mtime, so mtime changes reliably detect rewrites
even when file size doesn't change. Under the global lock, there are no concurrent writers to create ABA problems.
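The comparison itself is tiny. A sketch with assumed index-entry field names:

```typescript
// Assumed index entry shape; actual field names in the PR may differ.
interface IndexEntry {
  size: number;
  mtimeMs: number;
}

// A file needs syncing if it was never indexed, or if either its size
// or its mtime differs from the last synced state.
function hasChanged(
  current: IndexEntry,
  indexed: IndexEntry | undefined,
): boolean {
  if (indexed === undefined) return true; // new file, never synced
  return current.size !== indexed.size || current.mtimeMs !== indexed.mtimeMs;
}
```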
### Shared S3Client
A single S3Client instance is shared between the lock and the sync service for each command invocation. This avoids duplicate credential resolution and connection overhead.
### Sync coordinator supports lock-only mode
The datastore_sync_coordinator.ts accepts optional service and lock parameters independently. Filesystem datastores register lock-only (no pull/push). S3 datastores register both. This keeps the coordinator generic without backend-specific branching.
### Lock heartbeat and TTL
Both lock implementations use a background heartbeat that refreshes the lock every TTL/3 (default: every 10 seconds for a 30-second TTL). This prevents long-running commands from having their lock expire. If a process crashes, the lock self-expires
after the TTL. The SIGINT handler provides best-effort release on Ctrl-C.
### Error-path lock release
flushDatastoreSync() is called on both the success path and the error path in runCli(). This ensures locks are released even when commands fail, preventing stuck locks from cascading into subsequent command failures.
## Architecture
Follows domain-driven design with clean layer separation:
Domain layer (src/domain/datastore/):
- DatastoreConfig — discriminated union type for filesystem and S3 configs
- DatastorePathResolver — interface for routing file paths to the correct tier
- DatastorePatternMatcher — gitignore-style glob compiler for exclude patterns
- DatastoreVerifier — interface for health checks
- DatastoreMigrationService — file copy and verification for backend migration
- DistributedLock — interface with acquire(), release(), inspect(), withLock()
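A discriminated union along these lines (field names assumed, not verified against the PR) lets call sites narrow on the `type` tag without backend-specific branching elsewhere:

```typescript
// Assumed shape of the DatastoreConfig discriminated union; the PR's
// actual field names may differ. The `type` tag drives narrowing.
type DatastoreConfig =
  | { type: "filesystem"; path: string }
  | { type: "s3"; bucket: string; prefix: string; region?: string };

function describeDatastore(config: DatastoreConfig): string {
  switch (config.type) {
    case "filesystem":
      return `filesystem:${config.path}`;
    case "s3":
      return `s3:${config.bucket}/${config.prefix}`;
  }
}
```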
Infrastructure layer (src/infrastructure/persistence/):
- DefaultDatastorePathResolver — path resolver with pre-compiled exclude patterns
- S3Client — AWS S3 SDK wrapper (GetObject, PutObject, DeleteObject, HeadBucket, ListObjects)
- S3CacheSyncService — local cache management, index-based change detection, concurrent transfers
- S3Lock — conditional-write lock with heartbeat
- FileLock — advisory lockfile with heartbeat
- DatastoreSyncCoordinator — global singleton managing lock + sync lifecycle
CLI layer (src/cli/):
- resolve_datastore.ts — config resolution from env/CLI/yaml/default
- repo_context.ts — wires lock and sync into the repo lifecycle
- commands/datastore_*.ts — status, setup, sync, lock commands
- presentation/output/datastore_output.ts — log + JSON rendering
### Other changes in this branch
This PR also includes changes from earlier commits on this branch (help command, skill updates, extension improvements, bug fixes). The datastore-specific files are listed above; all other changes are from previously-merged work.
## Test plan
- deno check passes (type checking)
- deno lint passes
- deno fmt passes (formatting)
- deno run test passes (2794 tests, 0 failures)
- deno run compile succeeds
- Existing repos with no datastore config work identically (default filesystem backend)
- swamp datastore status reports health for default filesystem datastore
- swamp datastore setup filesystem --path /tmp/test-ds migrates data and updates .swamp.yaml
- swamp datastore setup s3 --bucket <bucket> migrates data to S3 and updates .swamp.yaml
- swamp datastore sync performs bidirectional sync with S3
- swamp datastore lock status shows lock holder or null
- swamp datastore lock release --force force-releases a stuck lock
- SWAMP_DATASTORE=filesystem:/tmp/test overrides .swamp.yaml config
- Concurrent CLI invocations block on lock rather than corrupting data
- Ctrl-C during a command releases the lock (best-effort SIGINT handler)
Code Review: Configurable Datastores with S3 Sync and Distributed Locking
This is an excellent, well-architected PR that adds significant functionality while maintaining clean code and following project conventions. No blocking issues found.
Domain-Driven Design Assessment
The implementation demonstrates excellent adherence to DDD principles:
- Clean layer separation: domain interfaces (`DistributedLock`, `DatastorePathResolver`, `DatastoreVerifier`) are properly abstracted with infrastructure implementations
- Value objects: `DatastoreConfig` is a proper discriminated union type; `LockInfo` and `LockOptions` are well-structured
- Domain errors: `LockTimeoutError` extends Error properly with useful metadata
- Repository pattern: only aggregate roots get repositories, following DDD guidance
Code Quality
Positive:
- All files include required AGPLv3 license headers
- Named exports used throughout (no default exports)
- All commands support both `log` and `json` output modes as required
- TypeScript strict mode followed (no new type violations)
- S3 bucket name validation (`resolve_datastore.ts:38-47`)
- Path traversal prevention in `s3_cache_sync.ts:37-48` with `assertSafePath()`
- Thorough design document at `design/datastores.md`
Test Coverage:
- `FileLock` - 7 test cases covering acquire, release, timeout, stale locks
- `S3Lock` - 7 test cases with mock S3Client
- `DatastorePatternMatcher` - 10 test cases for glob patterns
- `DefaultDatastorePathResolver` - 10 test cases
- `resolveDatastoreConfig` - 5 test cases covering priority resolution
- `LockTimeoutError` - 3 test cases
Security
- S3 conditional writes (`If-None-Match: *`) for atomic lock acquisition
- TTL-based lock expiration prevents stuck locks from crashed processes
- SIGINT handler for best-effort lock release on Ctrl-C
- Helpful error messages for common AWS credential/permission issues
Architecture Highlights
- Shared S3Client instance between lock and sync service avoids duplicate credential resolution
- Lock-only mode for filesystem datastores keeps the coordinator generic
- Concurrent transfers in batches of 10 balances performance with rate limits
- Size+mtime change detection avoids expensive content hashing
Suggestions (non-blocking)
The `type AnyOptions = any` pattern in CLI commands follows the established codebase convention (used in 64+ existing command files) - not a concern for this PR.
Overall, this is production-quality code with thoughtful design decisions, comprehensive testing, and excellent documentation. The PR is ready to merge.
Adversarial Review
Critical / High
- HIGH: Lock theft via unconditional heartbeat overwrites (`s3_lock.ts:168-175`, `file_lock.ts:174-180`)

  Both `S3Lock.extend()` and `FileLock.extend()` use unconditional writes (`s3.putObject`/`Deno.writeTextFile`) to refresh the lock. This creates a race condition where a recovered process can overwrite another process's legitimately acquired lock.

  Breaking scenario:
  1. T=0: Process A acquires lock
  2. T=5: Process A becomes unresponsive (network partition, GC pause, etc.)
  3. T=40: Process B sees the lock is stale (TTL=30s expired), deletes it, acquires with a conditional write
  4. T=40.2: Process A recovers, heartbeat fires, `extend()` runs
  5. `extend()` checks `this.held` (still true - A never released), then does an unconditional `putObject`/`writeTextFile` - A's lock info overwrites B's lock
  6. Both A and B now believe they own the lock and proceed to modify data concurrently

  Data corruption result: two processes simultaneously hold the "distributed" lock, defeating the entire purpose of locking.

  Fix: use conditional writes for heartbeat extension. For S3, store the ETag on acquire and use `IfMatch` for extend. For the filesystem, use file descriptor locking or atomic rename with a nonce. Alternatively, `extend()` should verify the lock content still belongs to this process before overwriting.
HIGH: S3Lock stale detection TOCTOU allows lock theft (s3_lock.ts:107-127)
The stale lock detection has a time-of-check-time-of-use race:
const recheck = await this.s3.headObject(this.lockKey); if (recheck.lastModified.getTime() === head.lastModified.getTime()) { await this.s3.deleteObject(this.lockKey); // <-- TOCTOU gap here }
Between the
headObjectrecheck and thedeleteObject, another process's heartbeat could refresh the lock. The delete then removes a valid lock, and the subsequentputObjectConditionalsucceeds, stealing the lock.Fix: Use S3 Object Lock or versioning with conditional deletes, or accept that stale lock cleanup is best-effort and increase TTL tolerance.
Medium
- MEDIUM: Partial push failure leaves index inconsistent (`s3_cache_sync.ts:350-369`)

  In `pushChanged()`, if some files fail to push (e.g., transient S3 errors), the code still updates the index for the successful files and pushes the index to S3. The index then claims files exist in S3 that weren't actually pushed.

  Breaking scenario: push 10 files, 3 fail due to a network blip. The index says all 10 exist. Later, `pullChanged()` on another machine fails silently when fetching the 3 missing files (error swallowed at lines 188-201).

  Fix: only update the index after all pushes succeed, or track failed pushes and retry before updating the index.
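  One way to implement that fix (a sketch with hypothetical stand-ins for the real S3 operations):

  ```typescript
  // Sketch: only commit index updates when every push in the batch
  // succeeded; otherwise leave the index alone so a retry re-pushes
  // the failed files. `pushFile`/`pushIndex` are hypothetical stand-ins.
  interface Entry {
    size: number;
    mtimeMs: number;
  }

  async function pushChangedSafely(
    changed: Map<string, Entry>,
    pushFile: (key: string) => Promise<void>,
    pushIndex: (index: Map<string, Entry>) => Promise<void>,
    index: Map<string, Entry>,
  ): Promise<{ pushed: string[]; failed: string[] }> {
    const keys = [...changed.keys()];
    const results = await Promise.allSettled(keys.map((k) => pushFile(k)));
    const failed = keys.filter((_, i) => results[i].status === "rejected");
    if (failed.length > 0) {
      // Index untouched: the next sync re-detects and re-pushes everything.
      return { pushed: keys.filter((k) => !failed.includes(k)), failed };
    }
    for (const [k, e] of changed) index.set(k, e);
    await pushIndex(index);
    return { pushed: keys, failed: [] };
  }
  ```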
- MEDIUM: getCachePath lacks path traversal protection (`s3_cache_sync.ts:470-472`)

  `getCachePath(relativePath)` does a simple `join(cachePath, relativePath)` without calling `assertSafePath()`. If any caller passes user-controlled or S3-derived paths through this method and then writes to the result, path traversal is possible. Currently it appears to be used only for informational purposes, but it's a landmine for future code.

  Fix: apply `assertSafePath()` in `getCachePath()` or document it as internal-only.
MEDIUM: SIGINT handler may hang indefinitely (datastore_sync_coordinator.ts:72-76)
The signal handler awaits
lock.release()with no timeout:signalHandler = () => { lock.release().catch(() => {}).finally(() => { Deno.exit(130); }); };
If S3 is unreachable,
deleteObjectmay hang for the full HTTP timeout (could be 60+ seconds), leaving the user stuck on Ctrl-C.Fix: Add a timeout (e.g., 5 seconds) before force-exiting.
- MEDIUM: Force release can delete active lock (`datastore_lock.ts:97-124`)

  `datastoreLockReleaseCommand` does `inspect()` then unconditionally deletes. Between the inspect and the delete, another process could acquire the lock legitimately; the force release then deletes their active lock. This is documented as a "breakglass" operation, but it should at least verify that the holder info matches what was inspected before deleting.
Low
- LOW: hasSymlinks recursion may fail on permission errors (`datastore_setup.ts:91-103`)

  If any subdirectory in models/workflows/vaults is unreadable, `readDir` throws and fails the entire setup, even if the unreadable dir doesn't contain symlinks.

- LOW: Index modification during concurrent pulls (`s3_cache_sync.ts:192-197`)

  `pullChanged()` modifies `this.index.entries[rel]` inside `Promise.allSettled` batches. While currently safe (each file writes to a different key), this pattern is fragile if the index structure changes.

- LOW: Missing null check on S3 ListObjects keys (`s3_client.ts:198-199`)

  `.map((obj) => obj.Key!)` uses a non-null assertion. While unlikely, malformed S3 responses could cause crashes.
Verdict
FAIL - The unconditional heartbeat writes (Finding #1) allow two processes to simultaneously believe they hold the lock, completely defeating the distributed locking mechanism and enabling data corruption. This is a fundamental design flaw that must be fixed before merge.
The stale detection TOCTOU (Finding #2) compounds this issue by providing another path to lock theft.
## Summary
- Replaces `gh pr merge --auto --squash` with `gh pr merge --squash` in the CI auto-merge job
- The `--auto` flag enables GitHub's persistent auto-merge, which merges based on **branch protection rules** rather than the YAML `needs:` dependencies
- This caused PR #669 to auto-merge after the normal review passed but before the adversarial review completed

## Root cause
The `auto-merge` job has `needs: [test, deps-audit, claude-review, claude-adversarial-review]`, but `--auto` tells GitHub "merge this PR whenever branch protection requirements are met." If `claude-adversarial-review` isn't a required status check in branch protection settings, GitHub merges as soon as the other checks pass — ignoring the YAML dependency.

## Fix
Remove `--auto` so the merge command executes directly when the job runs. Since the job is gated by `needs:`, it only runs after all four dependencies (including adversarial review) succeed.

## Test plan
- [ ] Verify CI workflow passes on this PR
- [ ] Confirm PRs no longer merge before adversarial review completes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Fixes: #492
Fixes: #511