
Conversation

@tsivaprasad (Contributor) commented Feb 3, 2026

Summary

This PR adds documentation for partial failure recovery (quorum intact) and improves node unavailability detection during force host removal operations.

Changes

  • Add comprehensive disaster recovery guide for partial failure scenarios (docs/disaster-recovery/partial-recovery.md)
  • Add ErrNodeUnavailable sentinel error for precise node unavailability detection (a minimal sketch of this pattern is shown right after this list)
  • Update WaitForService to detect stuck tasks due to unavailable nodes instead of relying on generic timeouts
  • Update postgres_service.go to use the new sentinel error during service deletion
  • Add CleanupInstance activity to clean up orphaned instance records when force-removing hosts
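
The following is a minimal, self-contained sketch of the sentinel-error pattern these changes rely on. Apart from the ErrNodeUnavailable name and the wrapped message, the helpers here are illustrative stand-ins, not the actual code in server/internal/docker/docker.go.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeUnavailable mirrors the sentinel added by this PR: it marks the case
// where service tasks cannot stop because their Swarm nodes are down or
// unreachable. Everything else in this file is an illustrative stand-in.
var ErrNodeUnavailable = errors.New("node unavailable")

// waitForService stands in for the real WaitForService: instead of letting a
// generic timeout expire, it wraps ErrNodeUnavailable when it detects tasks
// stuck on unavailable nodes, so callers can identify the condition precisely.
func waitForService(stuckNodeIDs []string) error {
	if len(stuckNodeIDs) > 0 {
		return fmt.Errorf("%w: tasks cannot stop because their nodes are down", ErrNodeUnavailable)
	}
	return nil
}

func main() {
	err := waitForService([]string{"node-1"})
	// errors.Is unwraps the %w chain, so the sentinel matches even though the
	// error carries extra context.
	if errors.Is(err, ErrNodeUnavailable) {
		fmt.Println("node unavailable detected:", err)
	}
}
```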

Testing

  • Manually validated the partial failure recovery (quorum intact) workflow on a 3-node Docker Swarm cluster.
  • Simulated permanent loss of host-3 while etcd quorum stayed intact (2 healthy hosts remained)

Phase 1: Failed Host Removal (Verified)

✅ Step 1.1: Force Remove Host from Control Plane

Executed:

curl -X DELETE "http://192.168.105.3:3000/v1/hosts/host-3?force=true"

Observed:

  • remove_host task created and completed
  • Automatic database update task triggered for storefront
  • All instances and subscriptions belonging to host-3 were removed
  • Control Plane remained available throughout

Verified via:

curl /v1/hosts/host-3/tasks/<task-id>
curl /v1/databases/storefront/tasks/<task-id>

Matches documented behavior: forced host removal + automatic database cleanup

✅ Step 1.3: Docker Swarm Cleanup

On a healthy manager (host-1):

docker node ls
docker node demote lima-host-3
docker node rm lima-host-3 --force

Observed:

  • Failed node removed from Swarm
  • Swarm manager quorum preserved
  • Remaining managers stayed Ready

Confirms documented Swarm cleanup procedure

Phase 2: Reduced-Capacity Operation (Verified)
✅ Step 2.1: Host Status Verification
curl http://192.168.105.3:3000/v1/hosts

Observed:

  • Only host-1 and host-2 listed
  • Both hosts healthy
  • etcd quorum intact

✅ Step 2.2: Database Health Verification
curl http://192.168.105.3:3000/v1/databases/storefront

Observed:

Database state: available

  • Instances:
    n1 on host-1
    n2 on host-2

  • No references to host-3

✅ Step 2.3: Data Replication Verification

Executed:

  • Inserted data on n2
  • Verified visibility on n1

Observed:

  • Writes succeeded
  • Data replicated correctly

Confirms cluster remained fully operational with reduced capacity

Phase 3: Restored Host Preparation (Verified)
✅ Step 3.1: Clean Old Control Plane Data

On restored host-3:

sudo rm -rf /data/control-plane/*
ls -la /data/control-plane/

Observed:

  • Directory fully empty
  • No stale etcd state

✅ Step 3.2: Rejoin Docker Swarm (Manager)

From host-1:
docker swarm join-token manager

On host-3:
docker swarm join --token <TOKEN> 192.168.104.1:2377

Observed:

  • Host rejoined Swarm successfully as manager

✅ Step 3.4: Redeploy Control Plane Stack
docker stack deploy -c /tmp/stack.yaml control-plane

Observed:

  • Control Plane service started on host-3
  • Service reached Running state

Verified via:

  docker service ps control-plane_host-3
  docker service logs control-plane_host-3

Phase 4: Control Plane Cluster Rejoin (Verified)
✅ Step 4.1: Generate Join Token
curl http://192.168.105.3:3000/v1/cluster/join-token

Response included:

  • token
  • server_url

✅ Step 4.2: Join Cluster from Restored Host

curl -X POST http://192.168.105.5:3000/v1/cluster/join \
  -H 'Content-Type: application/json' \
  --data '<join-token-json>'

Observed:

  • Host successfully joined Control Plane cluster

✅ Step 4.3: Host Verification
curl http://192.168.105.3:3000/v1/hosts

Observed:

  • host-3 present
  • Status: healthy

Phase 5: Database Capacity Restoration (Verified)
✅ Step 5.1: Add Database Node Back

curl -X POST http://192.168.105.3:3000/v1/databases/storefront \
  -H 'Content-Type: application/json' \
  --data '{
    "spec": {
      "nodes": [
        {"name":"n1","host_ids":["host-1"]},
        {"name":"n2","host_ids":["host-2"]},
        {"name":"n3","host_ids":["host-3"]}
      ]
    }
  }'

Observed:

  • New instance storefront-n3-* created
  • Patroni + Spock configured automatically
  • Database state transitioned modifying → available

✅ Step 5.3: Final Replication Verification

Executed:

  • Inserted data on n3
  • Verified data on n2

Observed:

  • Data fully consistent across all three nodes

Verification Summary:

  • Host failure handled without downtime
  • Quorum preserved
  • Database continued serving traffic
  • Failed host safely removed and re-added
  • Database capacity fully restored
  • Replication integrity verified at every stage

Checklist

PLAT-313

@tsivaprasad marked this pull request as draft February 3, 2026 18:41
coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Adds a comprehensive partial-recovery guide, a CleanupInstance activity for removing orphaned DB instances during host-removal Delete flows, docker-level node-unavailability detection (ErrNodeUnavailable), and updates Postgres service scaling logic to explicitly handle not-found and node-unavailable cases before service removal.

Changes

Grouped by cohort / file(s):

  • Partial Recovery Documentation (changes/unreleased/Added-20260204-000115.yaml, docs/disaster-recovery/partial-recovery.md)
    Adds a multi-phase partial failure recovery guide for quorum-intact clusters with prerequisites, step-by-step commands, API examples, task monitoring, warnings, and troubleshooting notes.
  • CleanupInstance Activity & Registration (server/internal/workflows/activities/cleanup_instance.go, server/internal/workflows/activities/activities.go)
    Introduces CleanupInstanceInput/CleanupInstanceOutput, the activity implementation (cleanup of orphaned DB instances), an executor method, and registers the activity in the activities registry.
  • Host-removal workflow integration (server/internal/workflows/common.go)
    When handling a host-removed Delete event for resources of type Instance, constructs and executes a CleanupInstance task to remove the instance record from etcd and surfaces cleanup errors.
  • Docker node availability & Postgres service handling (server/internal/docker/docker.go, server/internal/orchestrator/swarm/postgres_service.go)
    Adds exported ErrNodeUnavailable and node-status checks in WaitForService to detect tasks blocked by down/unreachable nodes; PostgresService.Delete now explicitly handles docker.ErrNotFound and docker.ErrNodeUnavailable before proceeding with service removal.
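
To make the Delete-path handling concrete, here is a rough, self-contained sketch of the branching described above. The helper functions (scaleDown, serviceRemove) and the decision to continue with removal on a node-unavailable error are assumptions for illustration, not the actual PostgresService.Delete code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Stand-ins for the sentinels exported by the docker package in this PR.
var (
	ErrNotFound        = errors.New("not found")
	ErrNodeUnavailable = errors.New("node unavailable")
)

// scaleDown pretends the scale-down got stuck because the task's node is down.
func scaleDown(ctx context.Context, name string) error {
	return fmt.Errorf("%w: tasks cannot stop because their nodes are down", ErrNodeUnavailable)
}

// serviceRemove stands in for the Swarm ServiceRemove call.
func serviceRemove(ctx context.Context, name string) error { return nil }

// deleteService shows the branching described in the walkthrough: not-found is
// treated as success, node-unavailable lets a forced removal proceed (assumed
// behaviour), and any other error aborts the removal.
func deleteService(ctx context.Context, name string) error {
	err := scaleDown(ctx, name)
	switch {
	case errors.Is(err, ErrNotFound):
		// Service is already deleted; nothing to do.
		return nil
	case errors.Is(err, ErrNodeUnavailable):
		// The node is unreachable; during forced host removal, continue with
		// removal anyway (assumption).
	case err != nil:
		return fmt.Errorf("scaling down postgres service: %w", err)
	}
	return serviceRemove(ctx, name)
}

func main() {
	fmt.Println(deleteService(context.Background(), "storefront-n3"))
}
```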

Poem

🐰 I hop through logs at break of day,
Cleaning orphans tucked away,
Nodes that nap or vanish fast,
I tidy up and bind the past,
Six paws steady — clusters mend their way.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly and concisely summarizes the main changes: adding a partial recovery guide and orphaned instance cleanup during host failure recovery.
  • Description check: ✅ Passed. The description covers all required template sections: Summary, Changes, Testing (comprehensive manual validation), and Checklist. Changelog entry is documented.


coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 179-183: Replace the placeholder comments with concrete psql
commands that show how to insert test data on the primary and verify it on the
replica; for example, add a psql INSERT on n1 targeting the example database and
test_table (e.g., INSERT INTO test_table (id, value) VALUES (...)) and a psql
SELECT on n2 to confirm the replicated row (e.g., SELECT * FROM test_table WHERE
value = 'recovery_test'); ensure you include host/port/user flags (-h, -p, -U)
and use the same database name (example) and table name (test_table) as shown in
the suggested improvement so readers can run the steps end-to-end.
- Around line 413-416: Replace the placeholder comments "# Insert on n1" and "#
Verify on n3" with concrete psql commands: on the primary node run a psql
command to INSERT a test row into the example database (e.g., use psql -h host-1
-p 5432 -U admin -d example -c "INSERT INTO test_table (id, value) VALUES
(...);"), and on the restored node run a psql SELECT to verify the row exists
(e.g., psql -h host-3 -p 5432 -U admin -d example -c "SELECT * FROM test_table
WHERE value = '...';"); ensure the SQL matches the test_table schema and uses a
unique value (e.g., 'full_recovery_test') so the verification step is
unambiguous.

In `@server/internal/orchestrator/swarm/postgres_service.go`:
- Around line 171-185: The timeout branch currently falls through and lets
ServiceRemove proceed even though scale-down may not have completed; update the
isTimeoutError(err) case in the same function (the scale-down/ServiceRemove
sequence) to return a descriptive error instead of continuing, e.g., return an
error indicating the scale-down timed out (wrap the original err), so callers
must handle the timeout (until the TODO RemovingHosts/context distinction is
implemented); reference isTimeoutError, the scale-down logic, and the subsequent
ServiceRemove call when making the change.
🧹 Nitpick comments (1)
server/internal/workflows/activities/cleanup_instance.go (1)

27-31: MaxAttempts: 1 means no retries on transient failures.

With MaxAttempts: 1, any transient failure (network blip, etcd leader election) will fail the cleanup permanently. Consider allowing a small number of retries with backoff for resilience.

Optional: Allow limited retries
 	options := workflow.ActivityOptions{
 		Queue: utils.HostQueue(a.Config.HostID),
 		RetryOptions: workflow.RetryOptions{
-			MaxAttempts: 1,
+			MaxAttempts: 3,
 		},
 	}

Comment on lines 171 to 185
	switch {
	case errors.Is(err, docker.ErrNotFound):
		// Service is already deleted.
		return nil
	case isTimeoutError(err):
		// NOTE: This assumes timeout always indicates an unreachable host.
		// This is currently intended for remove-host --force operations.
		// If timeout occurs during normal operations (e.g., temporary network
		// issues), this could incorrectly proceed with removal.
		//
		// TODO: Add RemovingHosts to resource.Context to safely distinguish
		// between force-removal scenarios and transient network issues.
	case err != nil:
		return fmt.Errorf("failed to scale down postgres service before removal: %w", err)
	}

⚠️ Potential issue | 🟠 Major

Timeout case falls through to ServiceRemove without guaranteed scale-down completion.

When a timeout occurs, the code proceeds to ServiceRemove without confirmation that containers have stopped. The comment on lines 161-164 explicitly states the scale-down is performed "to guarantee that containers have stopped before this function returns." The timeout case bypasses this guarantee.

While the TODO acknowledges this is intended for --force operations, without the proposed RemovingHosts context field, there's no way to distinguish force-removal from transient network issues. This could lead to:

  1. Attempting ServiceRemove while containers are still running
  2. Resource cleanup errors downstream

Consider returning an error for timeout cases until the RemovingHosts context is implemented, or explicitly document that callers must handle potential race conditions.

Alternative: Return error until context distinction is available
 	case isTimeoutError(err):
-		// NOTE: This assumes timeout always indicates an unreachable host.
-		// This is currently intended for remove-host --force operations.
-		// If timeout occurs during normal operations (e.g., temporary network
-		// issues), this could incorrectly proceed with removal.
-		//
-		// TODO: Add RemovingHosts to resource.Context to safely distinguish
-		// between force-removal scenarios and transient network issues.
+		// TODO: Once RemovingHosts is added to resource.Context, allow
+		// proceeding for force-removal scenarios only.
+		return fmt.Errorf("timeout scaling down postgres service; host may be unreachable: %w", err)

@tsivaprasad marked this pull request as ready for review February 4, 2026 17:30
coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 271-273: Fix the typo and endpoint mismatches: remove the extra
slash in the curl invocation used to set JOIN_TOKEN so it calls
"/v1/cluster/join-token" (the line containing JOIN_TOKEN="$(curl
http://host-1:3000//v1/cluster/join-token)") and update the summary table
entries so they exactly match the routes used earlier—use
"/v1/cluster/join-token" for fetching the join token and "/v1/cluster/join" for
joining the cluster (also correct the same occurrences noted around lines
388-390).

In `@server/internal/docker/docker.go`:
- Around line 424-449: The checkNodesUnavailable function currently treats only
NodeStateDown and NodeStateUnknown as unavailable; update the availability check
in Docker.checkNodesUnavailable to also treat swarm.NodeStateDisconnected as
unavailable (i.e., include node.Status.State == swarm.NodeStateDisconnected in
the conditional alongside NodeStateDown and NodeStateUnknown) and update the
function comment to reflect "(status 'down', 'unknown' or 'disconnected')";
ensure you reference the NodeList call and the loop over nodes in
checkNodesUnavailable when making the change.
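
For reference, a minimal sketch of what the suggested check could look like is given below. It operates on a node list already fetched from the Docker API (e.g. via the client's NodeList call) and is based only on the description above; the function name and shape are assumptions, not the real checkNodesUnavailable.

```go
package main

import (
	"fmt"

	"github.com/docker/docker/api/types/swarm"
)

// nodesUnavailable reports whether every listed node that matches nodeIDs is
// in a state where its tasks cannot make progress. As suggested above,
// "disconnected" is treated the same as "down" and "unknown".
func nodesUnavailable(nodes []swarm.Node, nodeIDs []string) bool {
	wanted := make(map[string]bool, len(nodeIDs))
	for _, id := range nodeIDs {
		wanted[id] = true
	}
	for _, node := range nodes {
		if !wanted[node.ID] {
			continue
		}
		switch node.Status.State {
		case swarm.NodeStateDown, swarm.NodeStateUnknown, swarm.NodeStateDisconnected:
			// Unavailable; keep checking the remaining nodes.
		default:
			return false // at least one relevant node is still reachable
		}
	}
	return true
}

func main() {
	nodes := []swarm.Node{{
		ID:     "l0l51d890edg3f0ccd0xppw06",
		Status: swarm.NodeStatus{State: swarm.NodeStateDown},
	}}
	fmt.Println(nodesUnavailable(nodes, []string{"l0l51d890edg3f0ccd0xppw06"}))
}
```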
🧹 Nitpick comments (1)
docs/disaster-recovery/partial-recovery.md (1)

131-136: Add language identifiers to output blocks (markdownlint MD040).

The example output fences are missing a language. Adding text (or console) will satisfy MD040.

Example tweak
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x *   lima-host-1   Ready     Active         Leader
 959g9937i62judknmr40kcw9r     lima-host-2   Ready     Active         Reachable
 l0l51d890edg3f0ccd0xppw06     lima-host-3   Down      Active         Unreachable

-```
+```text
 docker swarm join --token SWMTKN-1-xxx...xxx 10.0.0.1:2377

Also applies to: 222-224

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@server/internal/docker/docker.go`:
- Around line 424-432: The current loop swallows errors from
checkNodesUnavailable by continuing, which hides Docker API failures; change the
behavior in the block that calls d.checkNodesUnavailable(ctx, stuckNodeIDs) to
return the error (or wrap it with context) instead of continue so upstream sees
the real failure; update the call site that currently checks unavailable, err :=
d.checkNodesUnavailable(...) to handle err by returning fmt.Errorf("checking
node availability: %w", err) (or similar) and keep the existing
ErrNodeUnavailable return path when unavailable is true.
🧹 Nitpick comments (1)
docs/disaster-recovery/partial-recovery.md (1)

131-136: Add language identifiers to fenced output blocks (MD040).
Use text (or similar) to satisfy linting.

📝 Example fix
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x *   lima-host-1   Ready     Active         Leader
 959g9937i62judknmr40kcw9r     lima-host-2   Ready     Active         Reachable
 l0l51d890edg3f0ccd0xppw06     lima-host-3   Down      Active         Unreachable

Also applies to: 222-224

Comment on lines +424 to +432
// If we have tasks that should be stopping, check if their nodes are unavailable
if len(stuckNodeIDs) > 0 {
unavailable, err := d.checkNodesUnavailable(ctx, stuckNodeIDs)
if err != nil {
continue
}
if unavailable {
return fmt.Errorf("%w: tasks cannot stop because their nodes are down", ErrNodeUnavailable)
}

⚠️ Potential issue | 🟡 Minor

Don’t swallow node-list errors.
Continuing on checkNodesUnavailable errors hides persistent Docker API failures and turns them into a timeout instead of a root-cause error.

🔧 Suggested fix
-				if err != nil {
-					continue
-				}
+				if err != nil {
+					return fmt.Errorf("failed to check node availability: %w", err)
+				}
