
feat: configurable datastores with S3 sync and distributed locking #669

Merged
stack72 merged 1 commit into main from datastores-implementation
Mar 10, 2026

Conversation

Contributor

@stack72 stack72 commented Mar 10, 2026

Fixes: #492
Fixes: #511

Summary

Swamp's runtime data (model data, workflow runs, outputs, audit logs, telemetry, secrets, bundles) has always lived in .swamp/ inside the repo. This PR makes the storage backend configurable, adding S3 as a first-class option for team collaboration and introducing distributed locking across both backends to prevent data corruption from concurrent CLI invocations.

  • Two backends: filesystem (default, backward-compatible) and S3 (with local cache + automatic sync)
  • Distributed locking: both backends now acquire a lock before command execution and release it after — filesystem uses advisory lockfiles, S3 uses conditional writes
  • New swamp datastore command group: setup, status, sync, and lock management
  • Automatic migration: swamp datastore setup migrates existing data to the new backend
  • Concurrent S3 transfers: pull/push operations run in batches of 10 to minimize sync latency

User Impact

Zero breaking changes for existing users

Existing repos with no datastore configuration continue to work identically. The default datastore is filesystem at {repoDir}/.swamp/, which is exactly where data lives today. The only behavioral difference is that a .datastore.lock file will now be created inside .swamp/ during command execution — this is transparent and auto-cleaned.

New: S3 datastore for team collaboration

Teams can now share runtime data via S3:

swamp datastore setup s3 --bucket my-bucket --prefix my-project --region us-east-1

This pushes existing local data to S3 and updates .swamp.yaml. Every subsequent command automatically pulls changes before execution and pushes changes after. The local cache at ~/.swamp/repos/{repoId}/ is fully disposable — deleting it or cloning the repo on a new machine repopulates from S3 on the next command.

New: external filesystem datastore

Move runtime data outside the repo (e.g., shared NFS mount):

swamp datastore setup filesystem --path /mnt/shared/swamp-data

Migration paths

| From | To | Command |
|---|---|---|
| Default .swamp/ | External filesystem | swamp datastore setup filesystem --path /path |
| Default .swamp/ | S3 | swamp datastore setup s3 --bucket name |
| External filesystem | S3 | swamp datastore setup s3 --bucket name |
| S3 | External filesystem | swamp datastore setup filesystem --path /path |
| Any backend | Default .swamp/ | Edit .swamp.yaml, remove datastore field |

Each setup command: verifies the target is accessible, migrates existing data, updates .swamp.yaml, and cleans up the source. Use --skip-migration to skip the data copy.

CI/CD: environment variable override

Override the datastore without modifying .swamp.yaml:

export SWAMP_DATASTORE=s3:my-bucket/my-prefix
export SWAMP_DATASTORE=filesystem:/tmp/swamp-data

Resolution priority: SWAMP_DATASTORE env var > --datastore CLI arg > .swamp.yaml > default.
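
As a rough sketch of the env-var format (hypothetical helper; the real parsing lives in resolve_datastore.ts and may differ in details):

```typescript
// Hypothetical sketch of SWAMP_DATASTORE parsing; not the PR's actual code.
type DatastoreConfig =
  | { type: "filesystem"; path: string }
  | { type: "s3"; bucket: string; prefix: string };

export function parseDatastoreEnv(value: string): DatastoreConfig {
  if (value.startsWith("filesystem:")) {
    return { type: "filesystem", path: value.slice("filesystem:".length) };
  }
  if (value.startsWith("s3:")) {
    const rest = value.slice("s3:".length);
    const slash = rest.indexOf("/");
    // "s3:bucket" means no prefix; "s3:bucket/a/b" keeps the remainder as prefix
    if (slash === -1) return { type: "s3", bucket: rest, prefix: "" };
    return { type: "s3", bucket: rest.slice(0, slash), prefix: rest.slice(slash + 1) };
  }
  throw new Error(`Unrecognized SWAMP_DATASTORE value: ${value}`);
}
```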

New: lock management commands

If a process crashes without releasing the datastore lock, the lock auto-expires after 30 seconds. For immediate recovery:

swamp datastore lock status # Show who holds the lock
swamp datastore lock release --force # Force-release stuck lock

Design Choices

Why a lock for filesystem too?

Without locking, two concurrent swamp model run invocations on the same repo could corrupt shared state (overlapping writes to data versions, workflow runs, audit logs). Previously this was a silent race condition. Both backends now acquire a lock at command start and release at command end, making concurrent access safe. The filesystem lock uses advisory lockfiles with Deno.open({ createNew: true }) for atomic creation.
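
The atomic-create pattern can be sketched as follows. The PR uses Deno.open({ createNew: true }); to keep this sketch self-contained it uses Node's equivalent "wx" flag, and the lockfile name and payload shape are illustrative only:

```typescript
// Sketch of advisory-lockfile acquisition via exclusive create.
// "wx" fails with EEXIST if the file already exists, making the
// create-or-fail step atomic (same idea as Deno's createNew: true).
import { writeFileSync, rmSync } from "node:fs";

export function tryAcquireFileLock(lockPath: string, holder: string): boolean {
  try {
    writeFileSync(lockPath, JSON.stringify({ holder, acquiredAt: Date.now() }), { flag: "wx" });
    return true; // we created the lockfile, so we hold the lock
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false; // someone else holds it
    throw err; // genuine I/O error, not contention
  }
}

export function releaseFileLock(lockPath: string): void {
  rmSync(lockPath, { force: true });
}
```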

Why not lazy S3 pull?

We considered three approaches for S3 sync performance:

  1. Lazy pull (fetch on cache miss) — best latency but complicates offline behavior and requires intercepting all file reads throughout the codebase
  2. Directory-scoped pull (each command declares which subdirectories it needs) — lower transfer volume but fragile annotation maintenance
  3. Concurrent full pull (batch downloads in parallel) — simplest, no behavioral changes

We chose option 3 (concurrent transfers in batches of 10). It's the lowest-risk improvement and doesn't close any doors. Options 1 and 2 can be layered on later if repos grow large enough to warrant them.
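
The batching scheme can be sketched as below; transfer is a stand-in for a single S3 object transfer, and the helper names are hypothetical:

```typescript
// Split a work list into fixed-size batches (10 per the PR).
export function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Run transfers batch by batch; within a batch they run concurrently.
export async function transferInBatches<T>(
  items: T[],
  transfer: (item: T) => Promise<void>,
  batchSize = 10,
): Promise<void> {
  for (const batch of chunk(items, batchSize)) {
    // allSettled lets one failed object not abort the rest of its batch
    const results = await Promise.allSettled(batch.map(transfer));
    const failed = results.filter((r) => r.status === "rejected");
    if (failed.length > 0) {
      throw new Error(`${failed.length} transfer(s) failed in this batch`);
    }
  }
}
```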

Why size + mtime instead of content hashing?

Change detection compares stat.size and stat.mtime rather than computing content hashes. This avoids reading every file on every sync. The write paths (atomicWriteTextFile, Deno.writeFile) always update mtime, so mtime changes reliably detect rewrites even when file size doesn't change. Under the global lock, there are no concurrent writers to create ABA problems.
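
A minimal sketch of this comparison (field and helper names are illustrative, not the PR's actual index schema):

```typescript
// Sketch of size+mtime change detection between a local stat snapshot
// and the last synced index.
export interface FileMeta {
  size: number;
  mtimeMs: number;
}

export function changedFiles(
  local: Record<string, FileMeta>,
  index: Record<string, FileMeta>,
): string[] {
  const changed: string[] = [];
  for (const [path, meta] of Object.entries(local)) {
    const prev = index[path];
    // New file, or size/mtime differs from what the index last recorded
    if (!prev || prev.size !== meta.size || prev.mtimeMs !== meta.mtimeMs) {
      changed.push(path);
    }
  }
  return changed;
}
```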

Shared S3Client

A single S3Client instance is shared between the lock and the sync service for each command invocation. This avoids duplicate credential resolution and connection overhead.

Sync coordinator supports lock-only mode

datastore_sync_coordinator.ts accepts its service and lock parameters independently, and each is optional: filesystem datastores register lock-only (no pull/push), while S3 datastores register both. This keeps the coordinator generic without backend-specific branching.

Lock heartbeat and TTL

Both lock implementations use a background heartbeat that refreshes the lock every TTL/3 (default: every 10 seconds for a 30-second TTL). This prevents long-running commands from having their lock expire. If a process crashes, the lock self-expires after the TTL. The SIGINT handler provides best-effort release on Ctrl-C.
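
The TTL arithmetic, sketched with hypothetical helper names:

```typescript
// Refresh the lock every TTL/3; a 30s TTL means a heartbeat every 10s,
// so up to two heartbeats can be missed before the lock expires.
export function heartbeatIntervalMs(ttlMs: number): number {
  return Math.floor(ttlMs / 3);
}

// A lock is stale once its last refresh is older than the TTL.
export function isLockStale(lastRefreshMs: number, nowMs: number, ttlMs: number): boolean {
  return nowMs - lastRefreshMs > ttlMs;
}
```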

Error-path lock release

flushDatastoreSync() is called on both the success path and the error path in runCli(). This ensures locks are released even when commands fail, preventing stuck locks from cascading into subsequent command failures.

Architecture

Follows domain-driven design with clean layer separation:

Domain layer (src/domain/datastore/):

  • DatastoreConfig — discriminated union type for filesystem and S3 configs
  • DatastorePathResolver — interface for routing file paths to the correct tier
  • DatastorePatternMatcher — gitignore-style glob compiler for exclude patterns
  • DatastoreVerifier — interface for health checks
  • DatastoreMigrationService — file copy and verification for backend migration
  • DistributedLock — interface with acquire(), release(), inspect(), withLock()

Infrastructure layer (src/infrastructure/persistence/):

  • DefaultDatastorePathResolver — path resolver with pre-compiled exclude patterns
  • S3Client — AWS S3 SDK wrapper (GetObject, PutObject, DeleteObject, HeadBucket, ListObjects)
  • S3CacheSyncService — local cache management, index-based change detection, concurrent transfers
  • S3Lock — conditional-write lock with heartbeat
  • FileLock — advisory lockfile with heartbeat
  • DatastoreSyncCoordinator — global singleton managing lock + sync lifecycle

CLI layer (src/cli/):

  • resolve_datastore.ts — config resolution from env/CLI/yaml/default
  • repo_context.ts — wires lock and sync into the repo lifecycle
  • commands/datastore_*.ts — status, setup, sync, lock commands
  • presentation/output/datastore_output.ts — log + JSON rendering

Other changes in this branch

This PR also includes changes from earlier commits on this branch (help command, skill updates, extension improvements, bug fixes). The datastore-specific files are listed above; all other changes are from previously-merged work.

Test plan

  • deno check passes (type checking)
  • deno lint passes
  • deno fmt passes (formatting)
  • deno run test passes (2794 tests, 0 failures)
  • deno run compile succeeds
  • Existing repos with no datastore config work identically (default filesystem backend)
  • swamp datastore status reports health for default filesystem datastore
  • swamp datastore setup filesystem --path /tmp/test-ds migrates data and updates .swamp.yaml
  • swamp datastore setup s3 --bucket <bucket> migrates data to S3 and updates .swamp.yaml
  • swamp datastore sync performs bidirectional sync with S3
  • swamp datastore lock status shows lock holder or null
  • swamp datastore lock release --force force-releases a stuck lock
  • SWAMP_DATASTORE=filesystem:/tmp/test overrides .swamp.yaml config
  • Concurrent CLI invocations block on lock rather than corrupting data
  • Ctrl-C during a command releases the lock (best-effort SIGINT handler)

@stack72 stack72 force-pushed the datastores-implementation branch 3 times, most recently from d783c8a to 9237ef5 Compare March 10, 2026 21:23
@stack72 stack72 marked this pull request as ready for review March 10, 2026 22:08

@github-actions github-actions bot left a comment


Review Summary

This is a well-designed PR that adds configurable datastores with S3 sync and distributed locking. The implementation follows domain-driven design principles and adheres to the project's code style guidelines.

Architecture (DDD Perspective)

The architecture properly separates concerns across layers:

  • Domain layer (src/domain/datastore/): Pure interfaces (DistributedLock, DatastorePathResolver, DatastoreVerifier) and value objects (LockInfo, DatastoreConfig)
  • Infrastructure layer (src/infrastructure/persistence/): Concrete implementations (FileLock, S3Lock, S3Client, S3CacheSyncService)
  • CLI layer: Thin adapters that wire domain/infrastructure and handle presentation

Code Quality Checks

  • TypeScript strict mode - all types properly defined
  • Named exports used throughout
  • AGPLv3 copyright headers present on all new files
  • The type AnyOptions = any pattern in CLI commands follows established codebase convention (60+ files use it)

Test Coverage

Comprehensive unit tests provided:

  • resolve_datastore_test.ts - Config resolution logic
  • datastore_pattern_matcher_test.ts - Gitignore-style glob patterns
  • distributed_lock_test.ts - LockTimeoutError behavior
  • file_lock_test.ts - FileLock acquire/release/timeout/stale handling
  • s3_lock_test.ts - S3Lock with mock client
  • default_datastore_path_resolver_test.ts - Path resolution logic

Security Review

  • Atomic file operations for lock creation (createNew: true)
  • S3 conditional writes for lock acquisition (If-None-Match: *)
  • Lock TTL + heartbeat prevents indefinite locks
  • Force-release requires explicit --force flag
  • Proper error handling without credential leakage
  • No OWASP vulnerabilities identified

No Blocking Issues

The PR is well-structured and ready to merge.

Suggestions (non-blocking)

  1. S3CacheSyncService (~500 lines) could be split into smaller components in future iterations if it grows further
  2. Consider adding a dedicated test file for datastore_migration_service.ts in a future PR (migration is currently tested indirectly via setup commands)
  3. The design document in design/datastores.md is excellent - great reference for future maintainers

LGTM! Excellent work on this feature.


@github-actions github-actions bot left a comment


Adversarial Review

Critical / High

  1. CRITICAL - Path Traversal in S3 Cache Sync (s3_cache_sync.ts:139-143)

    S3 object keys can contain path traversal characters (../). When pulling files from S3, keys are used directly in path.join() which normalizes .. sequences, allowing escape from the cache directory.

    Breaking example:

    • Attacker controls S3 bucket (compromised credentials, malicious team member, social engineering)
    • Attacker creates S3 object with key: ../../../home/user/.bashrc
    • User runs swamp datastore sync --pull
    • Code executes join(cachePath, "../../../home/user/.bashrc"), which normalizes to /home/user/.bashrc
    • Attacker content overwrites user's .bashrc (arbitrary file write)

    Affected code paths:

    • pullFile() line 141: const localPath = join(this.cachePath, relativePath);
    • pullChanged() uses keys from S3 index
    • pullAll() uses keys from listAllObjects()

    Fix: Validate that the resolved path is within the cache directory:

    const localPath = join(this.cachePath, relativePath);
    if (!localPath.startsWith(this.cachePath + "/")) {
      throw new Error(`Path traversal detected: ${relativePath}`);
    }
  2. HIGH - Race Condition in Stale Lock Force-Acquire (s3_lock.ts:106-121, file_lock.ts:109-119)

    The stale lock detection has a TOCTOU race between checking staleness and deleting:

    Breaking example:

    1. Process A reads stale lock, decides to force-acquire (line 112)
    2. Process B legitimately acquires a fresh lock (replaces the stale one)
    3. Process A deletes Process B's fresh lock (line 115)
    4. Process A and Process C both acquire "the lock"
    5. Two processes hold the lock simultaneously → data corruption

    Fix: Use conditional delete. For S3, check the ETag matches the stale lock before deleting. For filesystem, use the lockfile mtime to verify it hasn't changed.

  3. HIGH - Heartbeat Race on Lock Release (s3_lock.ts:161-168, file_lock.ts:164-170)

    The heartbeat's extend() method can race with release():

    Breaking example:

    1. Heartbeat interval fires, extend() starts executing
    2. extend() passes the if (!this.held) guard (held is still true)
    3. release() is called, sets this.held = false, deletes lock
    4. Process B acquires the lock
    5. extend() completes its await writeTextFile/putObject, overwriting Process B's lock

    Fix: Use a mutex or check this.held again after the async write, then delete if the lock was released while extending.

  4. HIGH - Degraded Mode Silently Allows Concurrent Corruption (datastore_sync_coordinator.ts:77-84)

    If lock acquisition fails (network issue, permissions), the code logs a warning and continues without a lock. This silently degrades to unsafe mode where concurrent commands can corrupt data.

    Breaking example:

    • S3 bucket has restrictive permissions, lock acquire fails
    • User runs two swamp model run commands concurrently
    • Both proceed without locks → interleaved writes → data corruption
    • User only sees a warning in logs, not an error

    Fix: Lock acquisition failure should be fatal for data-mutating operations, not a warning.

Medium

  1. MEDIUM - Hidden Files Excluded from pushChanged() (s3_cache_sync.ts:305-309)

    pushChanged() skips ALL files starting with . via rel.startsWith("."), but pullAll() only skips explicit metadata files. This asymmetry means:

    • Legitimate hidden files (.gitkeep, hidden output files) won't sync to S3
    • Other machines won't receive these files

    Fix: Only skip the specific metadata files, not all hidden files.

  2. MEDIUM - No Validation of S3 Bucket Name (resolve_datastore.ts:63-72)

    The SWAMP_DATASTORE=s3:... parsing accepts any string as a bucket name without validation. Invalid bucket names (uppercase, special chars, wrong length) will fail later with cryptic AWS errors.

    Fix: Validate bucket name format before use.

  3. MEDIUM - Response.Body! Non-Null Assertion (s3_client.ts:98)

    return new Uint8Array(await response.Body!.transformToByteArray()); - the ! assertion will throw a confusing error if Body is null/undefined.

    Fix: Check for null and throw a descriptive error.

Low

  1. LOW - Migration Doesn't Preserve mtime (datastore_migration_service.ts:103-105)

    Deno.copyFile doesn't preserve modification times. Since sync uses mtime for change detection, this could cause spurious sync operations after migration.

  2. LOW - SIGINT Handler Doesn't Await Release (datastore_sync_coordinator.ts:72-74)

    Deno.exit(130) is called immediately without awaiting lock.release(). The release may not complete before process exit.

  3. LOW - Multiple requireInitializedRepo Calls Would Deadlock (latent)

    If requireInitializedRepo is called twice in the same process (nested commands, bugs), the second call creates a new lock instance and tries to acquire, blocking on the first lock. Eventually times out after 60s, but poor UX.

Verdict

FAIL - The path traversal vulnerability (finding #1) is a critical security issue that allows arbitrary file write when syncing from a malicious/compromised S3 bucket. The lock race conditions (#2, #3) undermine the distributed locking guarantees that are the core feature of this PR. These must be fixed before merge.
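
As an aside on finding #1's suggested fix: a raw startsWith prefix check can be fooled by a sibling directory such as /cache-evil when the cache is /cache. A stricter containment sketch (hypothetical helper, not the PR's code):

```typescript
// Resolve both paths, then verify the relative path doesn't climb out of
// the cache directory. An empty relative path means the cache root itself,
// which we also reject as a write target.
import { resolve, relative, isAbsolute, sep } from "node:path";

export function isInside(cacheDir: string, candidate: string): boolean {
  const rel = relative(resolve(cacheDir), resolve(candidate));
  return rel !== "" && rel !== ".." && !rel.startsWith(`..${sep}`) && !isAbsolute(rel);
}
```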

@stack72 stack72 enabled auto-merge (squash) March 10, 2026 22:16
@stack72 stack72 force-pushed the datastores-implementation branch from 9237ef5 to 408f2af Compare March 10, 2026 22:27

@github-actions github-actions bot left a comment


Code Review: Configurable Datastores with S3 Sync and Distributed Locking

This is an excellent, well-architected PR that adds significant functionality while maintaining clean code and following project conventions. No blocking issues found.

Domain-Driven Design Assessment

The implementation demonstrates excellent adherence to DDD principles:

  • Clean layer separation: Domain interfaces (DistributedLock, DatastorePathResolver, DatastoreVerifier) are properly abstracted with infrastructure implementations
  • Value objects: DatastoreConfig is a proper discriminated union type; LockInfo and LockOptions are well-structured
  • Domain errors: LockTimeoutError extends Error properly with useful metadata
  • Repository pattern: Only aggregate roots get repositories, following DDD guidance

Code Quality

Positive:

  • All files include required AGPLv3 license headers
  • Named exports used throughout (no default exports)
  • All commands support both log and json output modes as required
  • TypeScript strict mode followed (no new type violations)
  • S3 bucket name validation (resolve_datastore.ts:38-47)
  • Path traversal prevention in s3_cache_sync.ts:37-48 with assertSafePath()
  • Thorough design document at design/datastores.md

Test Coverage:

  • FileLock - 7 test cases covering acquire, release, timeout, stale locks
  • S3Lock - 7 test cases with mock S3Client
  • DatastorePatternMatcher - 10 test cases for glob patterns
  • DefaultDatastorePathResolver - 10 test cases
  • resolveDatastoreConfig - 5 test cases covering priority resolution
  • LockTimeoutError - 3 test cases

Security

  • S3 conditional writes (If-None-Match: *) for atomic lock acquisition
  • TTL-based lock expiration prevents stuck locks from crashed processes
  • SIGINT handler for best-effort lock release on Ctrl-C
  • Helpful error messages for common AWS credential/permission issues

Architecture Highlights

  • Shared S3Client instance between lock and sync service avoids duplicate credential resolution
  • Lock-only mode for filesystem datastores keeps the coordinator generic
  • Concurrent transfers in batches of 10 balances performance with rate limits
  • Size+mtime change detection avoids expensive content hashing

Suggestions (non-blocking)

The type AnyOptions = any pattern in CLI commands follows the established codebase convention (used in 64+ existing command files) - not a concern for this PR.

Overall, this is production-quality code with thoughtful design decisions, comprehensive testing, and excellent documentation. The PR is ready to merge.

@stack72 stack72 merged commit 153e19a into main Mar 10, 2026
4 checks passed
@stack72 stack72 deleted the datastores-implementation branch March 10, 2026 22:33

@github-actions github-actions bot left a comment


Adversarial Review

Critical / High

  1. HIGH: Lock theft via unconditional heartbeat overwrites (s3_lock.ts:168-175, file_lock.ts:174-180)

    Both S3Lock.extend() and FileLock.extend() use unconditional writes (s3.putObject / Deno.writeTextFile) to refresh the lock. This creates a race condition where a recovered process can overwrite another process's legitimately acquired lock:

    Breaking scenario:

    • T=0: Process A acquires lock
    • T=5: Process A becomes unresponsive (network partition, GC pause, etc.)
    • T=40: Process B sees lock is stale (TTL=30s expired), deletes it, acquires with conditional write
    • T=40.2: Process A recovers, heartbeat fires, extend() runs
    • extend() checks this.held (still true - A never released), then does unconditional putObject/writeTextFile
    • A's lock info overwrites B's lock
    • Both A and B now believe they own the lock and proceed to modify data concurrently

    Data corruption result: Two processes simultaneously hold the "distributed" lock, defeating the entire purpose of locking.

    Fix: Use conditional writes for heartbeat extension. For S3, store the ETag on acquire and use IfMatch for extend. For filesystem, use file descriptor locking or atomic rename with a nonce. Alternatively, extend() should verify the lock content still belongs to this process before overwriting.

  2. HIGH: S3Lock stale detection TOCTOU allows lock theft (s3_lock.ts:107-127)

    The stale lock detection has a time-of-check-time-of-use race:

    const recheck = await this.s3.headObject(this.lockKey);
    if (recheck.lastModified.getTime() === head.lastModified.getTime()) {
      await this.s3.deleteObject(this.lockKey);  // <-- TOCTOU gap here
    }

    Between the headObject recheck and the deleteObject, another process's heartbeat could refresh the lock. The delete then removes a valid lock, and the subsequent putObjectConditional succeeds, stealing the lock.

    Fix: Use S3 Object Lock or versioning with conditional deletes, or accept that stale lock cleanup is best-effort and increase TTL tolerance.

Medium

  1. MEDIUM: Partial push failure leaves index inconsistent (s3_cache_sync.ts:350-369)

    In pushChanged(), if some files fail to push (e.g., transient S3 errors), the code still updates the index for successful files and pushes the index to S3. This means the index claims files exist in S3 that weren't actually pushed.

    Breaking scenario: Push 10 files, 3 fail due to network blip. Index says all 10 exist. Later pullChanged() on another machine fails silently when fetching the 3 missing files (error swallowed at line 188-201).

    Fix: Only update index after all pushes succeed, or track failed pushes and retry before updating index.

  2. MEDIUM: getCachePath lacks path traversal protection (s3_cache_sync.ts:470-472)

    getCachePath(relativePath) does a simple join(cachePath, relativePath) without calling assertSafePath(). If any caller passes user-controlled or S3-derived paths through this method and then writes to the result, path traversal is possible.

    Currently this appears to only be used for informational purposes, but it's a landmine for future code.

    Fix: Apply assertSafePath() in getCachePath() or document it as internal-only.

  3. MEDIUM: SIGINT handler may hang indefinitely (datastore_sync_coordinator.ts:72-76)

    The signal handler awaits lock.release() with no timeout:

    signalHandler = () => {
      lock.release().catch(() => {}).finally(() => {
        Deno.exit(130);
      });
    };

    If S3 is unreachable, deleteObject may hang for the full HTTP timeout (could be 60+ seconds), leaving the user stuck on Ctrl-C.

    Fix: Add a timeout (e.g., 5 seconds) before force-exiting.

  4. MEDIUM: Force release can delete active lock (datastore_lock.ts:97-124)

    datastoreLockReleaseCommand does inspect() then unconditionally deletes. Between inspect and delete, another process could acquire the lock legitimately. The force release deletes their active lock.

    This is documented as a "breakglass" operation, but should at least verify the holder info matches what was inspected before deleting.

Low

  1. LOW: hasSymlinks recursion may fail on permission errors (datastore_setup.ts:91-103)

    If any subdirectory in models/workflows/vaults is unreadable, readDir throws and fails the entire setup, even if the unreadable dir doesn't contain symlinks.

  2. LOW: Index modification during concurrent pulls (s3_cache_sync.ts:192-197)

    pullChanged() modifies this.index.entries[rel] inside Promise.allSettled batches. While currently safe (each file writes to a different key), this pattern is fragile if the index structure changes.

  3. LOW: Missing null check on S3 ListObjects keys (s3_client.ts:198-199)

    .map((obj) => obj.Key!) uses a non-null assertion. While unlikely, malformed S3 responses could cause crashes.

Verdict

FAIL - The unconditional heartbeat writes (Finding #1) allow two processes to simultaneously believe they hold the lock, completely defeating the distributed locking mechanism and enabling data corruption. This is a fundamental design flaw that must be fixed before merge.

The stale detection TOCTOU (Finding #2) compounds this issue by providing another path to lock theft.

stack72 added a commit that referenced this pull request Mar 10, 2026
## Summary

- Replaces `gh pr merge --auto --squash` with `gh pr merge --squash` in
the CI auto-merge job
- The `--auto` flag enables GitHub's persistent auto-merge, which merges
based on **branch protection rules** rather than the YAML `needs:`
dependencies
- This caused PR #669 to auto-merge after the normal review passed but
before the adversarial review completed

## Root cause

The `auto-merge` job has `needs: [test, deps-audit, claude-review,
claude-adversarial-review]`, but `--auto` tells GitHub "merge this PR
whenever branch protection requirements are met." If
`claude-adversarial-review` isn't a required status check in branch
protection settings, GitHub merges as soon as the other checks pass —
ignoring the YAML dependency.

## Fix

Remove `--auto` so the merge command executes directly when the job
runs. Since the job is gated by `needs:`, it only runs after all four
dependencies (including adversarial review) succeed.

## Test plan

- [ ] Verify CI workflow passes on this PR
- [ ] Confirm PRs no longer merge before adversarial review completes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

  • Populate .gitattributes with linguist-generated markers during repo init
  • Include models/ and vaults/ symlink dirs in managed .gitignore section
