
feat: configurable datastores with S3 sync and distributed locking #669

Merged
stack72 merged 1 commit into main from datastores-implementation
Mar 10, 2026

Conversation

Contributor

@stack72 stack72 commented Mar 10, 2026

Fixes: #492
Fixes: #511

Summary

Swamp's runtime data (model data, workflow runs, outputs, audit logs, telemetry, secrets, bundles) has always lived in .swamp/ inside the repo. This PR makes the storage backend configurable, adding S3 as a first-class option for team collaboration and introducing distributed locking across both backends to prevent data corruption from concurrent CLI invocations.

  • Two backends: filesystem (default, backward-compatible) and S3 (with local cache + automatic sync)
  • Distributed locking: both backends now acquire a lock before command execution and release it after — filesystem uses advisory lockfiles, S3 uses conditional writes
  • New swamp datastore command group: setup, status, sync, and lock management
  • Automatic migration: swamp datastore setup migrates existing data to the new backend
  • Concurrent S3 transfers: pull/push operations run in batches of 10 to minimize sync latency

User Impact

Zero breaking changes for existing users

Existing repos with no datastore configuration continue to work identically. The default datastore is filesystem at {repoDir}/.swamp/, which is exactly where data lives today. The only behavioral difference is that a .datastore.lock file will now be created inside .swamp/ during command execution — this is transparent and auto-cleaned.

New: S3 datastore for team collaboration

Teams can now share runtime data via S3:

swamp datastore setup s3 --bucket my-bucket --prefix my-project --region us-east-1

This pushes existing local data to S3 and updates .swamp.yaml. Every subsequent command automatically pulls changes before execution and pushes changes after. The local cache at ~/.swamp/repos/{repoId}/ is fully disposable — deleting it or cloning the repo on a new machine repopulates from S3 on the next command.

New: external filesystem datastore

Move runtime data outside the repo (e.g., shared NFS mount):

swamp datastore setup filesystem --path /mnt/shared/swamp-data

Migration paths

| From | To | Command |
|---|---|---|
| Default .swamp/ | External filesystem | swamp datastore setup filesystem --path /path |
| Default .swamp/ | S3 | swamp datastore setup s3 --bucket name |
| External filesystem | S3 | swamp datastore setup s3 --bucket name |
| S3 | External filesystem | swamp datastore setup filesystem --path /path |
| Any backend | Default .swamp/ | Edit .swamp.yaml, remove datastore field |

Each setup command: verifies the target is accessible, migrates existing data, updates .swamp.yaml, and cleans up the source. Use --skip-migration to skip the data copy.

CI/CD: environment variable override

Override the datastore without modifying .swamp.yaml:

export SWAMP_DATASTORE=s3:my-bucket/my-prefix
export SWAMP_DATASTORE=filesystem:/tmp/swamp-data

Resolution priority: SWAMP_DATASTORE env var > --datastore CLI arg > .swamp.yaml > default.
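
As a rough sketch of the env-var format (hypothetical helper; the real parsing lives in resolve_datastore.ts and may differ in details):

```typescript
// Hypothetical sketch of SWAMP_DATASTORE parsing; not the PR's actual code.
type DatastoreConfig =
  | { type: "filesystem"; path: string }
  | { type: "s3"; bucket: string; prefix: string };

export function parseDatastoreEnv(value: string): DatastoreConfig {
  if (value.startsWith("filesystem:")) {
    return { type: "filesystem", path: value.slice("filesystem:".length) };
  }
  if (value.startsWith("s3:")) {
    const rest = value.slice("s3:".length);
    const slash = rest.indexOf("/");
    // "s3:bucket" means no prefix; "s3:bucket/a/b" keeps the remainder as prefix
    if (slash === -1) return { type: "s3", bucket: rest, prefix: "" };
    return { type: "s3", bucket: rest.slice(0, slash), prefix: rest.slice(slash + 1) };
  }
  throw new Error(`Unrecognized SWAMP_DATASTORE value: ${value}`);
}
```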

New: lock management commands

If a process crashes without releasing the datastore lock, the lock auto-expires after 30 seconds. For immediate recovery:

swamp datastore lock status # Show who holds the lock
swamp datastore lock release --force # Force-release stuck lock

Design Choices

Why a lock for filesystem too?

Without locking, two concurrent swamp model run invocations on the same repo could corrupt shared state (overlapping writes to data versions, workflow runs, audit logs). Previously this was a silent race condition. Both backends now acquire a lock at command start and release at command end, making concurrent access safe. The filesystem lock uses advisory lockfiles with Deno.open({ createNew: true }) for atomic creation.
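
The atomic-create pattern can be sketched as follows. The PR uses Deno.open({ createNew: true }); to keep this sketch self-contained it uses Node's equivalent "wx" flag, and the lockfile name and payload shape are illustrative only:

```typescript
// Sketch of advisory-lockfile acquisition via exclusive create.
// "wx" fails with EEXIST if the file already exists, making the
// create-or-fail step atomic (same idea as Deno's createNew: true).
import { writeFileSync, rmSync } from "node:fs";

export function tryAcquireFileLock(lockPath: string, holder: string): boolean {
  try {
    writeFileSync(lockPath, JSON.stringify({ holder, acquiredAt: Date.now() }), { flag: "wx" });
    return true; // we created the lockfile, so we hold the lock
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false; // someone else holds it
    throw err; // genuine I/O error, not contention
  }
}

export function releaseFileLock(lockPath: string): void {
  rmSync(lockPath, { force: true });
}
```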

Why not lazy S3 pull?

We considered three approaches for S3 sync performance:

  1. Lazy pull (fetch on cache miss) — best latency but complicates offline behavior and requires intercepting all file reads throughout the codebase
  2. Directory-scoped pull (each command declares which subdirectories it needs) — lower transfer volume but fragile annotation maintenance
  3. Concurrent full pull (batch downloads in parallel) — simplest, no behavioral changes

We chose option 3 (concurrent transfers in batches of 10). It's the lowest-risk improvement and doesn't close any doors. Options 1 and 2 can be layered on later if repos grow large enough to warrant them.
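
The batching scheme can be sketched as below; transfer is a stand-in for a single S3 object transfer, and the helper names are hypothetical:

```typescript
// Split a work list into fixed-size batches (10 per the PR).
export function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Run transfers batch by batch; within a batch they run concurrently.
export async function transferInBatches<T>(
  items: T[],
  transfer: (item: T) => Promise<void>,
  batchSize = 10,
): Promise<void> {
  for (const batch of chunk(items, batchSize)) {
    // allSettled lets one failed object not abort the rest of its batch
    const results = await Promise.allSettled(batch.map(transfer));
    const failed = results.filter((r) => r.status === "rejected");
    if (failed.length > 0) {
      throw new Error(`${failed.length} transfer(s) failed in this batch`);
    }
  }
}
```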

Why size + mtime instead of content hashing?

Change detection compares stat.size and stat.mtime rather than computing content hashes. This avoids reading every file on every sync. The write paths (atomicWriteTextFile, Deno.writeFile) always update mtime, so mtime changes reliably detect rewrites even when file size doesn't change. Under the global lock, there are no concurrent writers to create ABA problems.
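
A minimal sketch of this comparison (field and helper names are illustrative, not the PR's actual index schema):

```typescript
// Sketch of size+mtime change detection between a local stat snapshot
// and the last synced index.
export interface FileMeta {
  size: number;
  mtimeMs: number;
}

export function changedFiles(
  local: Record<string, FileMeta>,
  index: Record<string, FileMeta>,
): string[] {
  const changed: string[] = [];
  for (const [path, meta] of Object.entries(local)) {
    const prev = index[path];
    // New file, or size/mtime differs from what the index last recorded
    if (!prev || prev.size !== meta.size || prev.mtimeMs !== meta.mtimeMs) {
      changed.push(path);
    }
  }
  return changed;
}
```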

Shared S3Client

A single S3Client instance is shared between the lock and the sync service for each command invocation. This avoids duplicate credential resolution and connection overhead.

Sync coordinator supports lock-only mode

datastore_sync_coordinator.ts accepts its service and lock parameters independently, and each is optional: filesystem datastores register lock-only (no pull/push), while S3 datastores register both. This keeps the coordinator generic without backend-specific branching.

Lock heartbeat and TTL

Both lock implementations use a background heartbeat that refreshes the lock every TTL/3 (default: every 10 seconds for a 30-second TTL). This prevents long-running commands from having their lock expire. If a process crashes, the lock self-expires after the TTL. The SIGINT handler provides best-effort release on Ctrl-C.
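
The TTL arithmetic, sketched with hypothetical helper names:

```typescript
// Refresh the lock every TTL/3; a 30s TTL means a heartbeat every 10s,
// so up to two heartbeats can be missed before the lock expires.
export function heartbeatIntervalMs(ttlMs: number): number {
  return Math.floor(ttlMs / 3);
}

// A lock is stale once its last refresh is older than the TTL.
export function isLockStale(lastRefreshMs: number, nowMs: number, ttlMs: number): boolean {
  return nowMs - lastRefreshMs > ttlMs;
}
```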

Error-path lock release

flushDatastoreSync() is called on both the success path and the error path in runCli(). This ensures locks are released even when commands fail, preventing stuck locks from cascading into subsequent command failures.

Architecture

Follows domain-driven design with clean layer separation:

Domain layer (src/domain/datastore/):

  • DatastoreConfig — discriminated union type for filesystem and S3 configs
  • DatastorePathResolver — interface for routing file paths to the correct tier
  • DatastorePatternMatcher — gitignore-style glob compiler for exclude patterns
  • DatastoreVerifier — interface for health checks
  • DatastoreMigrationService — file copy and verification for backend migration
  • DistributedLock — interface with acquire(), release(), inspect(), withLock()

Infrastructure layer (src/infrastructure/persistence/):

  • DefaultDatastorePathResolver — path resolver with pre-compiled exclude patterns
  • S3Client — AWS S3 SDK wrapper (GetObject, PutObject, DeleteObject, HeadBucket, ListObjects)
  • S3CacheSyncService — local cache management, index-based change detection, concurrent transfers
  • S3Lock — conditional-write lock with heartbeat
  • FileLock — advisory lockfile with heartbeat
  • DatastoreSyncCoordinator — global singleton managing lock + sync lifecycle

CLI layer (src/cli/):

  • resolve_datastore.ts — config resolution from env/CLI/yaml/default
  • repo_context.ts — wires lock and sync into the repo lifecycle
  • commands/datastore_*.ts — status, setup, sync, lock commands
  • presentation/output/datastore_output.ts — log + JSON rendering

Other changes in this branch

This PR also includes changes from earlier commits on this branch (help command, skill updates, extension improvements, bug fixes). The datastore-specific files are listed above; all other changes are from previously-merged work.

Test plan

  • deno check passes (type checking)
  • deno lint passes
  • deno fmt passes (formatting)
  • deno run test passes (2794 tests, 0 failures)
  • deno run compile succeeds
  • Existing repos with no datastore config work identically (default filesystem backend)
  • swamp datastore status reports health for default filesystem datastore
  • swamp datastore setup filesystem --path /tmp/test-ds migrates data and updates .swamp.yaml
  • swamp datastore setup s3 --bucket <bucket> migrates data to S3 and updates .swamp.yaml
  • swamp datastore sync performs bidirectional sync with S3
  • swamp datastore lock status shows lock holder or null
  • swamp datastore lock release --force force-releases a stuck lock
  • SWAMP_DATASTORE=filesystem:/tmp/test overrides .swamp.yaml config
  • Concurrent CLI invocations block on lock rather than corrupting data
  • Ctrl-C during a command releases the lock (best-effort SIGINT handler)

@stack72 stack72 force-pushed the datastores-implementation branch 3 times, most recently from d783c8a to 9237ef5 Compare March 10, 2026 21:23
@stack72 stack72 marked this pull request as ready for review March 10, 2026 22:08

@github-actions github-actions bot left a comment


Review Summary

This is a well-designed PR that adds configurable datastores with S3 sync and distributed locking. The implementation follows domain-driven design principles and adheres to the project's code style guidelines.

Architecture (DDD Perspective)

The architecture properly separates concerns across layers:

  • Domain layer (src/domain/datastore/): Pure interfaces (DistributedLock, DatastorePathResolver, DatastoreVerifier) and value objects (LockInfo, DatastoreConfig)
  • Infrastructure layer (src/infrastructure/persistence/): Concrete implementations (FileLock, S3Lock, S3Client, S3CacheSyncService)
  • CLI layer: Thin adapters that wire domain/infrastructure and handle presentation

Code Quality Checks

  • TypeScript strict mode - all types properly defined
  • Named exports used throughout
  • AGPLv3 copyright headers present on all new files
  • The type AnyOptions = any pattern in CLI commands follows established codebase convention (60+ files use it)

Test Coverage

Comprehensive unit tests provided:

  • resolve_datastore_test.ts - Config resolution logic
  • datastore_pattern_matcher_test.ts - Gitignore-style glob patterns
  • distributed_lock_test.ts - LockTimeoutError behavior
  • file_lock_test.ts - FileLock acquire/release/timeout/stale handling
  • s3_lock_test.ts - S3Lock with mock client
  • default_datastore_path_resolver_test.ts - Path resolution logic

Security Review

  • Atomic file operations for lock creation (createNew: true)
  • S3 conditional writes for lock acquisition (If-None-Match: *)
  • Lock TTL + heartbeat prevents indefinite locks
  • Force-release requires explicit --force flag
  • Proper error handling without credential leakage
  • No OWASP vulnerabilities identified

No Blocking Issues

The PR is well-structured and ready to merge.

Suggestions (non-blocking)

  1. S3CacheSyncService (~500 lines) could be split into smaller components in future iterations if it grows further
  2. Consider adding a dedicated test file for datastore_migration_service.ts in a future PR (migration is currently tested indirectly via setup commands)
  3. The design document in design/datastores.md is excellent - great reference for future maintainers

LGTM! Excellent work on this feature.


@github-actions github-actions bot left a comment


Adversarial Review

Critical / High

  1. CRITICAL - Path Traversal in S3 Cache Sync (s3_cache_sync.ts:139-143)

    S3 object keys can contain path traversal characters (../). When pulling files from S3, keys are used directly in path.join() which normalizes .. sequences, allowing escape from the cache directory.

    Breaking example:

    • Attacker controls S3 bucket (compromised credentials, malicious team member, social engineering)
    • Attacker creates S3 object with key: ../../../home/user/.bashrc
    • User runs swamp datastore sync --pull
    • Code executes join(cachePath, "../../../home/user/.bashrc"), which normalizes to /home/user/.bashrc
    • Attacker content overwrites user's .bashrc (arbitrary file write)

    Affected code paths:

    • pullFile() line 141: const localPath = join(this.cachePath, relativePath);
    • pullChanged() uses keys from S3 index
    • pullAll() uses keys from listAllObjects()

    Fix: Validate that the resolved path is within the cache directory:

    const localPath = join(this.cachePath, relativePath);
    if (!localPath.startsWith(this.cachePath + "/")) {
      throw new Error(`Path traversal detected: ${relativePath}`);
    }
  2. HIGH - Race Condition in Stale Lock Force-Acquire (s3_lock.ts:106-121, file_lock.ts:109-119)

    The stale lock detection has a TOCTOU race between checking staleness and deleting:

    Breaking example:

    1. Process A reads stale lock, decides to force-acquire (line 112)
    2. Process B legitimately acquires a fresh lock (replaces the stale one)
    3. Process A deletes Process B's fresh lock (line 115)
    4. Process A and Process C both acquire "the lock"
    5. Two processes hold the lock simultaneously → data corruption

    Fix: Use conditional delete. For S3, check the ETag matches the stale lock before deleting. For filesystem, use the lockfile mtime to verify it hasn't changed.

  3. HIGH - Heartbeat Race on Lock Release (s3_lock.ts:161-168, file_lock.ts:164-170)

    The heartbeat's extend() method can race with release():

    Breaking example:

    1. Heartbeat interval fires, extend() starts executing
    2. extend() passes the if (!this.held) guard (held is still true)
    3. release() is called, sets this.held = false, deletes lock
    4. Process B acquires the lock
    5. extend() completes its await writeTextFile/putObject, overwriting Process B's lock

    Fix: Use a mutex or check this.held again after the async write, then delete if the lock was released while extending.

  4. HIGH - Degraded Mode Silently Allows Concurrent Corruption (datastore_sync_coordinator.ts:77-84)

    If lock acquisition fails (network issue, permissions), the code logs a warning and continues without a lock. This silently degrades to unsafe mode where concurrent commands can corrupt data.

    Breaking example:

    • S3 bucket has restrictive permissions, lock acquire fails
    • User runs two swamp model run commands concurrently
    • Both proceed without locks → interleaved writes → data corruption
    • User only sees a warning in logs, not an error

    Fix: Lock acquisition failure should be fatal for data-mutating operations, not a warning.

Medium

  1. MEDIUM - Hidden Files Excluded from pushChanged() (s3_cache_sync.ts:305-309)

    pushChanged() skips ALL files starting with . via rel.startsWith("."), but pullAll() only skips explicit metadata files. This asymmetry means:

    • Legitimate hidden files (.gitkeep, hidden output files) won't sync to S3
    • Other machines won't receive these files

    Fix: Only skip the specific metadata files, not all hidden files.

  2. MEDIUM - No Validation of S3 Bucket Name (resolve_datastore.ts:63-72)

    The SWAMP_DATASTORE=s3:... parsing accepts any string as a bucket name without validation. Invalid bucket names (uppercase, special chars, wrong length) will fail later with cryptic AWS errors.

    Fix: Validate bucket name format before use.

  3. MEDIUM - Response.Body! Non-Null Assertion (s3_client.ts:98)

    return new Uint8Array(await response.Body!.transformToByteArray()); - the ! assertion will throw a confusing error if Body is null/undefined.

    Fix: Check for null and throw a descriptive error.

Low

  1. LOW - Migration Doesn't Preserve mtime (datastore_migration_service.ts:103-105)

    Deno.copyFile doesn't preserve modification times. Since sync uses mtime for change detection, this could cause spurious sync operations after migration.

  2. LOW - SIGINT Handler Doesn't Await Release (datastore_sync_coordinator.ts:72-74)

    Deno.exit(130) is called immediately without awaiting lock.release(). The release may not complete before process exit.

  3. LOW - Multiple requireInitializedRepo Calls Would Deadlock (latent)

    If requireInitializedRepo is called twice in the same process (nested commands, bugs), the second call creates a new lock instance and tries to acquire, blocking on the first lock. Eventually times out after 60s, but poor UX.

Verdict

FAIL - The path traversal vulnerability (finding #1) is a critical security issue that allows arbitrary file write when syncing from a malicious/compromised S3 bucket. The lock race conditions (#2, #3) undermine the distributed locking guarantees that are the core feature of this PR. These must be fixed before merge.
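
As an aside on finding #1's suggested fix: a raw startsWith prefix check can be fooled by a sibling directory such as /cache-evil when the cache is /cache. A stricter containment sketch (hypothetical helper, not the PR's code):

```typescript
// Resolve both paths, then verify the relative path doesn't climb out of
// the cache directory. An empty relative path means the cache root itself,
// which we also reject as a write target.
import { resolve, relative, isAbsolute, sep } from "node:path";

export function isInside(cacheDir: string, candidate: string): boolean {
  const rel = relative(resolve(cacheDir), resolve(candidate));
  return rel !== "" && rel !== ".." && !rel.startsWith(`..${sep}`) && !isAbsolute(rel);
}
```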

@stack72 stack72 enabled auto-merge (squash) March 10, 2026 22:16
@stack72 stack72 force-pushed the datastores-implementation branch from 9237ef5 to 408f2af Compare March 10, 2026 22:27

@github-actions github-actions bot left a comment


Code Review: Configurable Datastores with S3 Sync and Distributed Locking

This is an excellent, well-architected PR that adds significant functionality while maintaining clean code and following project conventions. No blocking issues found.

Domain-Driven Design Assessment

The implementation demonstrates excellent adherence to DDD principles:

  • Clean layer separation: Domain interfaces (DistributedLock, DatastorePathResolver, DatastoreVerifier) are properly abstracted with infrastructure implementations
  • Value objects: DatastoreConfig is a proper discriminated union type; LockInfo and LockOptions are well-structured
  • Domain errors: LockTimeoutError extends Error properly with useful metadata
  • Repository pattern: Only aggregate roots get repositories, following DDD guidance

Code Quality

Positive:

  • All files include required AGPLv3 license headers
  • Named exports used throughout (no default exports)
  • All commands support both log and json output modes as required
  • TypeScript strict mode followed (no new type violations)
  • S3 bucket name validation (resolve_datastore.ts:38-47)
  • Path traversal prevention in s3_cache_sync.ts:37-48 with assertSafePath()
  • Thorough design document at design/datastores.md

Test Coverage:

  • FileLock - 7 test cases covering acquire, release, timeout, stale locks
  • S3Lock - 7 test cases with mock S3Client
  • DatastorePatternMatcher - 10 test cases for glob patterns
  • DefaultDatastorePathResolver - 10 test cases
  • resolveDatastoreConfig - 5 test cases covering priority resolution
  • LockTimeoutError - 3 test cases

Security

  • S3 conditional writes (If-None-Match: *) for atomic lock acquisition
  • TTL-based lock expiration prevents stuck locks from crashed processes
  • SIGINT handler for best-effort lock release on Ctrl-C
  • Helpful error messages for common AWS credential/permission issues

Architecture Highlights

  • Shared S3Client instance between lock and sync service avoids duplicate credential resolution
  • Lock-only mode for filesystem datastores keeps the coordinator generic
  • Concurrent transfers in batches of 10 balances performance with rate limits
  • Size+mtime change detection avoids expensive content hashing

Suggestions (non-blocking)

The type AnyOptions = any pattern in CLI commands follows the established codebase convention (used in 64+ existing command files) - not a concern for this PR.

Overall, this is production-quality code with thoughtful design decisions, comprehensive testing, and excellent documentation. The PR is ready to merge.

@stack72 stack72 merged commit 153e19a into main Mar 10, 2026
4 checks passed
@stack72 stack72 deleted the datastores-implementation branch March 10, 2026 22:33

@github-actions github-actions bot left a comment


Adversarial Review

Critical / High

  1. HIGH: Lock theft via unconditional heartbeat overwrites (s3_lock.ts:168-175, file_lock.ts:174-180)

    Both S3Lock.extend() and FileLock.extend() use unconditional writes (s3.putObject / Deno.writeTextFile) to refresh the lock. This creates a race condition where a recovered process can overwrite another process's legitimately acquired lock:

    Breaking scenario:

    • T=0: Process A acquires lock
    • T=5: Process A becomes unresponsive (network partition, GC pause, etc.)
    • T=40: Process B sees lock is stale (TTL=30s expired), deletes it, acquires with conditional write
    • T=40.2: Process A recovers, heartbeat fires, extend() runs
    • extend() checks this.held (still true - A never released), then does unconditional putObject/writeTextFile
    • A's lock info overwrites B's lock
    • Both A and B now believe they own the lock and proceed to modify data concurrently

    Data corruption result: Two processes simultaneously hold the "distributed" lock, defeating the entire purpose of locking.

    Fix: Use conditional writes for heartbeat extension. For S3, store the ETag on acquire and use IfMatch for extend. For filesystem, use file descriptor locking or atomic rename with a nonce. Alternatively, extend() should verify the lock content still belongs to this process before overwriting.

  2. HIGH: S3Lock stale detection TOCTOU allows lock theft (s3_lock.ts:107-127)

    The stale lock detection has a time-of-check-time-of-use race:

    const recheck = await this.s3.headObject(this.lockKey);
    if (recheck.lastModified.getTime() === head.lastModified.getTime()) {
      await this.s3.deleteObject(this.lockKey);  // <-- TOCTOU gap here
    }

    Between the headObject recheck and the deleteObject, another process's heartbeat could refresh the lock. The delete then removes a valid lock, and the subsequent putObjectConditional succeeds, stealing the lock.

    Fix: Use S3 Object Lock or versioning with conditional deletes, or accept that stale lock cleanup is best-effort and increase TTL tolerance.

Medium

  1. MEDIUM: Partial push failure leaves index inconsistent (s3_cache_sync.ts:350-369)

    In pushChanged(), if some files fail to push (e.g., transient S3 errors), the code still updates the index for successful files and pushes the index to S3. This means the index claims files exist in S3 that weren't actually pushed.

    Breaking scenario: Push 10 files, 3 fail due to network blip. Index says all 10 exist. Later pullChanged() on another machine fails silently when fetching the 3 missing files (error swallowed at line 188-201).

    Fix: Only update index after all pushes succeed, or track failed pushes and retry before updating index.

  2. MEDIUM: getCachePath lacks path traversal protection (s3_cache_sync.ts:470-472)

    getCachePath(relativePath) does a simple join(cachePath, relativePath) without calling assertSafePath(). If any caller passes user-controlled or S3-derived paths through this method and then writes to the result, path traversal is possible.

    Currently this appears to only be used for informational purposes, but it's a landmine for future code.

    Fix: Apply assertSafePath() in getCachePath() or document it as internal-only.

  3. MEDIUM: SIGINT handler may hang indefinitely (datastore_sync_coordinator.ts:72-76)

    The signal handler awaits lock.release() with no timeout:

    signalHandler = () => {
      lock.release().catch(() => {}).finally(() => {
        Deno.exit(130);
      });
    };

    If S3 is unreachable, deleteObject may hang for the full HTTP timeout (could be 60+ seconds), leaving the user stuck on Ctrl-C.

    Fix: Add a timeout (e.g., 5 seconds) before force-exiting.

  4. MEDIUM: Force release can delete active lock (datastore_lock.ts:97-124)

    datastoreLockReleaseCommand does inspect() then unconditionally deletes. Between inspect and delete, another process could acquire the lock legitimately. The force release deletes their active lock.

    This is documented as a "breakglass" operation, but should at least verify the holder info matches what was inspected before deleting.

Low

  1. LOW: hasSymlinks recursion may fail on permission errors (datastore_setup.ts:91-103)

    If any subdirectory in models/workflows/vaults is unreadable, readDir throws and fails the entire setup, even if the unreadable dir doesn't contain symlinks.

  2. LOW: Index modification during concurrent pulls (s3_cache_sync.ts:192-197)

    pullChanged() modifies this.index.entries[rel] inside Promise.allSettled batches. While currently safe (each file writes to a different key), this pattern is fragile if the index structure changes.

  3. LOW: Missing null check on S3 ListObjects keys (s3_client.ts:198-199)

    .map((obj) => obj.Key!) uses a non-null assertion. While unlikely, malformed S3 responses could cause crashes.

Verdict

FAIL - The unconditional heartbeat writes (Finding #1) allow two processes to simultaneously believe they hold the lock, completely defeating the distributed locking mechanism and enabling data corruption. This is a fundamental design flaw that must be fixed before merge.

The stale detection TOCTOU (Finding #2) compounds this issue by providing another path to lock theft.

stack72 added a commit that referenced this pull request Mar 10, 2026
## Summary

- Replaces `gh pr merge --auto --squash` with `gh pr merge --squash` in
the CI auto-merge job
- The `--auto` flag enables GitHub's persistent auto-merge, which merges
based on **branch protection rules** rather than the YAML `needs:`
dependencies
- This caused PR #669 to auto-merge after the normal review passed but
before the adversarial review completed

## Root cause

The `auto-merge` job has `needs: [test, deps-audit, claude-review,
claude-adversarial-review]`, but `--auto` tells GitHub "merge this PR
whenever branch protection requirements are met." If
`claude-adversarial-review` isn't a required status check in branch
protection settings, GitHub merges as soon as the other checks pass —
ignoring the YAML dependency.

## Fix

Remove `--auto` so the merge command executes directly when the job
runs. Since the job is gated by `needs:`, it only runs after all four
dependencies (including adversarial review) succeed.

## Test plan

- [ ] Verify CI workflow passes on this PR
- [ ] Confirm PRs no longer merge before adversarial review completes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

  • Populate .gitattributes with linguist-generated markers during repo init
  • Include models/ and vaults/ symlink dirs in managed .gitignore section
