Backup and Disaster Recovery Plan

Version: 1.0
Last Updated: 2026-02-16
Review Frequency: Quarterly
Test Frequency: Quarterly

1. Executive Summary

This document outlines the backup and disaster recovery (DR) procedures for Agentic Trust to ensure business continuity in the event of:
  • Data loss or corruption
  • Infrastructure failure
  • Security incidents (ransomware, data breach)
  • Natural disasters
  • Human error
Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): 24 hours

2. System Architecture Overview

2.1 Infrastructure Components

| Component | Provider | Criticality | Backup Strategy |
|-----------|----------|-------------|-----------------|
| PostgreSQL Database | Neon | CRITICAL | Automated daily backups |
| File Storage (blob) | Vercel Blob | HIGH | Redundant across regions |
| Application Code | Vercel | HIGH | Git repository + deployments |
| Secrets/Config | Vercel Env Vars | CRITICAL | Encrypted backup in 1Password |
| DNS | Vercel | CRITICAL | Configuration documented |

2.2 Data Criticality

Tier 1 - Critical (Cannot operate without):
  • Customer database (products, organizations, users)
  • API keys database
  • Conversation and message data
Tier 2 - Important (Degraded operation):
  • Knowledge base embeddings
  • Uploaded files
  • Workflow configurations
Tier 3 - Nice to Have (Can be recreated):
  • Logs (Sentry)
  • Cache data (Redis)
  • Agent debug logs

3. Backup Strategy

3.1 Database Backups (Neon PostgreSQL)

Provider: Neon
Backup Type: Automated continuous backup + point-in-time recovery (PITR)
Schedule:
  • Continuous WAL archiving: Real-time
  • Full snapshots: Daily at 3:00 AM UTC
  • Retention:
    • Last 7 days: Restore to any point in time
    • Last 30 days: Daily snapshots
    • Last 90 days: Weekly snapshots
    • Last 1 year: Monthly snapshots
Backup Location: Neon managed storage (S3-compatible, encrypted at rest)
Recovery Capability:
  • Point-in-time recovery (PITR) to any moment in last 7 days
  • Snapshot recovery for older backups
  • Estimated recovery time: 15-30 minutes for full database
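
The retention schedule above determines which recovery method is available for a given restore point. A minimal sketch of that mapping follows; the function name `recovery_method_for_age` and the labels it prints are illustrative assumptions, not part of any Neon tooling.

```shell
# Map the age of the desired restore point (whole days) to the recovery
# method available under the retention schedule in Section 3.1.
# Hypothetical helper: the name and output labels are illustrative.
recovery_method_for_age() {
  age_days=$1
  if [ "$age_days" -le 7 ]; then
    echo "pitr"              # any point in time
  elif [ "$age_days" -le 30 ]; then
    echo "daily-snapshot"
  elif [ "$age_days" -le 90 ]; then
    echo "weekly-snapshot"
  elif [ "$age_days" -le 365 ]; then
    echo "monthly-snapshot"
  else
    echo "none"              # older than the retention period
  fi
}
```

For example, `recovery_method_for_age 45` reports that only a weekly snapshot can serve a restore point 45 days back.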
Manual Backup Process:
# Create on-demand backup via Neon CLI
npx neonctl branches create --name "manual-backup-$(date +%Y%m%d-%H%M%S)"

# Or via Neon console:
# 1. Go to Neon console
# 2. Select project
# 3. Click "Branches" → "Create Branch"
# 4. Name: "backup-YYYY-MM-DD-reason"
# 5. Create from: Current point in time
Verification: See Section 4.2
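
To keep manual branch names consistent with the "backup-YYYY-MM-DD-reason" convention, the name can be generated rather than typed. This is a sketch only: `backup_branch_name` is a hypothetical helper, and the slug rules (lowercase letters, digits, hyphens) are an assumption rather than a documented Neon requirement.

```shell
# Build a branch name of the form backup-YYYY-MM-DD-reason.
# Hypothetical helper; the slug rules are an assumption.
# Usage: backup_branch_name "<reason>" [YYYY-MM-DD]
backup_branch_name() {
  reason=$1
  day=${2:-$(date -u +%Y-%m-%d)}
  # Lowercase the reason, turn every run of other characters into a
  # single hyphen, then trim a leading or trailing hyphen.
  slug=$(printf '%s' "$reason" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//')
  echo "backup-${day}-${slug}"
}
```

The result can be passed to the CLI command shown above, for example `npx neonctl branches create --name "$(backup_branch_name "pre migration")"`.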

3.2 File Storage Backups (Vercel Blob)

Provider: Vercel
Backup Type: Multi-region redundancy
Strategy:
  • Files automatically replicated across multiple AWS regions
  • No manual backup is taken today; provider redundancy covers hardware failure but not accidental deletion or overwrite
  • Versioning not enabled (cost optimization)
Recovery Capability:
  • Provider guarantees 99.9% durability
  • Lost files cannot be recovered without separate backup
  • Consider enabling S3 versioning for critical files
Recommendation: Implement weekly snapshot of critical files to separate S3 bucket

3.3 Application Code & Configuration

Code Repository: GitHub (primary)
Backups:
  • Git repository provides version history
  • GitHub maintains its own backups and redundancy
  • Deploy to Vercel from git (reproducible builds)
Configuration (Environment Variables):
  • Stored in Vercel dashboard
  • CRITICAL: Maintain encrypted backup in 1Password or similar
  • Documented in .env.example (without sensitive values)
Backup Process:
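# Export production environment variables from Vercel
# (without --environment, `vercel env pull` fetches the development values)
vercel env pull .env.backup --environment=production

# Encrypt (writes .env.backup.gpg), then remove the plaintext copy
gpg --encrypt --recipient security@company.com .env.backup
rm -f .env.backup

# Store the encrypted file (.env.backup.gpg) in 1Password secure notes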
Frequency: After any environment variable change
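
Before encrypting an export, it can help to confirm that every variable documented in .env.example is actually present in it. The following is a minimal sketch under the assumption that both files use plain `KEY=value` lines; `env_keys` and `missing_env_keys` are hypothetical helper names.

```shell
# Print the variable names defined in an env file (KEY=value lines), sorted.
env_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# Print keys documented in the example file but absent from the backup.
# Usage: missing_env_keys .env.example .env.backup
missing_env_keys() {
  example_keys=$(mktemp)
  backup_keys=$(mktemp)
  env_keys "$1" > "$example_keys"
  env_keys "$2" > "$backup_keys"
  # comm -23 keeps lines that appear only in the first (sorted) file.
  comm -23 "$example_keys" "$backup_keys"
  rm -f "$example_keys" "$backup_keys"
}
```

An empty result means the export covers every documented variable; any name printed should be investigated before the backup is stored.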

3.4 Secrets and API Keys

Storage: 1Password Teams / Vault
Items to back up:
  • WorkOS API keys
  • Anthropic API key
  • OpenAI API key
  • Sentry DSN and auth token
  • Database connection strings
  • HMAC secrets (per product - store in encrypted export)
  • Deployment keys
Backup Process:
  • All secrets stored in 1Password shared vault
  • 1Password has enterprise-grade backup
  • Export vault monthly to encrypted archive
  • Store archive in separate secure location (offline storage)
Recovery: Re-import secrets to new environment from 1Password

4. Verification and Testing

4.1 Backup Verification

Automated Checks (Weekly):
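#!/bin/bash
# File: scripts/verify-backups.sh
# Exits non-zero when any check fails so a scheduler can alert on it.

failures=0
fail() { echo "ERROR: $1" >&2; failures=$((failures + 1)); }

echo "Verifying Neon backups..."
npx neonctl branches list | grep -q "main" || fail "Cannot access Neon"

echo "Verifying Vercel Blob access..."
# -f makes curl fail on HTTP 4xx/5xx responses, which plain -I does not
curl -fsSI https://your-domain.vercel.app/uploads/test.txt > /dev/null || fail "Blob storage issue"

echo "Verifying git repository..."
git ls-remote https://github.com/your-org/agentic-trust.git > /dev/null || fail "Git access issue"

echo "Backup verification complete ($failures failure(s))"
exit "$failures"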
Manual Verification (Monthly):
  1. Review Neon backup dashboard
  2. Confirm backup completion and size
  3. Check backup age (last backup < 24 hours)
  4. Verify backup storage usage
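
The backup-age check in step 3 is simple enough to script once the timestamp of the last backup is known. A minimal sketch, assuming timestamps are supplied as Unix epoch seconds; `backup_is_fresh` is a hypothetical helper, and how the last-backup time is obtained from Neon is left out.

```shell
# Succeed if the last backup is younger than the allowed age.
# Usage: backup_is_fresh <last_backup_epoch> <now_epoch> [max_age_hours]
# Hypothetical helper; timestamps are Unix epoch seconds.
backup_is_fresh() {
  last_backup_epoch=$1
  now_epoch=$2
  max_age_hours=${3:-24}   # default matches the 24-hour check above
  age_seconds=$((now_epoch - last_backup_epoch))
  # A negative age means a clock or input error, so treat it as stale.
  [ "$age_seconds" -ge 0 ] && [ "$age_seconds" -lt $((max_age_hours * 3600)) ]
}
```

A typical call would be `backup_is_fresh "$last_backup" "$(date +%s)" || echo "Backup older than 24 hours"`.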

4.2 Restore Testing

Quarterly DR Test (Required for SOC 2):
Test Procedure:
  1. Preparation (Monday)
    • Schedule test during low-traffic period
    • Notify team of test
    • Document current state
  2. Create Isolated Environment (Monday)
    • Create new Vercel project (“agentic-trust-dr-test”)
    • Create new Neon branch from backup
    • Configure environment variables
    • Deploy application code
  3. Restore Database (Monday)
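    # Create branch from a restore point 7 days back (the edge of the PITR window)
    # neonctl takes the restore point as a timestamp in --parent; GNU date syntax shown
    npx neonctl branches create \
      --name dr-test-$(date +%Y%m%d) \
      --parent "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)"
    
    # Get connection string (the branch is a positional argument)
    npx neonctl connection-string dr-test-$(date +%Y%m%d)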
    
  4. Verify Data Integrity (Tuesday)
    • Login with test account
    • Verify products and API keys visible
    • Check conversation history
    • Verify file uploads accessible
    • Run data consistency checks:
      -- Check for orphaned records
      SELECT COUNT(*) FROM "ApiKey" WHERE "productId" NOT IN (SELECT id FROM "Product");
      
      -- Check conversation counts
      SELECT "productId", COUNT(*) FROM "Conversation" GROUP BY "productId";
      
  5. Functional Testing (Tuesday)
    • Test chat widget
    • Test API endpoints
    • Test dashboard access
    • Verify embeddings and knowledge base
  6. Document Results (Wednesday)
    • Record time to restore: _____ minutes
    • Record any data discrepancies
    • Document issues encountered
    • Update runbook with lessons learned
  7. Cleanup (Wednesday)
    • Delete test Vercel project
    • Delete test Neon branch
    • Archive test results
Success Criteria:
  • Restoration completed within RTO (4 hours)
  • No data loss beyond RPO (24 hours)
  • Application fully functional
  • All critical features working
Test Log Template: See Appendix A
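
The first two success criteria are numeric and can be checked mechanically when filling in the test log. The sketch below assumes the measured restore time is recorded in minutes and the data-loss window in whole hours; `dr_test_verdict` and its output format are illustrative, not an existing tool.

```shell
# Targets from Section 1 of this plan.
RTO_MINUTES=240   # 4 hours
RPO_HOURS=24

# Compare measured values against the RTO/RPO targets.
# Usage: dr_test_verdict <restore_minutes> <data_loss_hours>
# Prints any missed target, then PASS or FAIL; returns non-zero on FAIL.
dr_test_verdict() {
  restore_minutes=$1
  data_loss_hours=$2
  verdict="PASS"
  if [ "$restore_minutes" -gt "$RTO_MINUTES" ]; then
    echo "RTO missed: ${restore_minutes} min > ${RTO_MINUTES} min"
    verdict="FAIL"
  fi
  if [ "$data_loss_hours" -gt "$RPO_HOURS" ]; then
    echo "RPO missed: ${data_loss_hours} h > ${RPO_HOURS} h"
    verdict="FAIL"
  fi
  echo "$verdict"
  [ "$verdict" = "PASS" ]
}
```

The remaining criteria (application fully functional, critical features working) still need the manual checks in steps 4 and 5.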

5. Disaster Recovery Scenarios

5.1 Scenario: Database Corruption

Trigger: Data corruption detected in production database
Response (RTO: 2 hours):
  1. Immediate Actions (0-15 min)
    • Notify incident commander
    • Stop all write operations (set app to read-only mode)
    • Identify scope of corruption
    • Identify last known good state
  2. Assessment (15-30 min)
    • Determine if corruption is fixable in place (for example, REINDEX for index corruption)
    • If not fixable, proceed to restore
    • Determine restore point (latest clean backup)
  3. Restore (30-60 min)
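    # Create new branch from the last known good point in time
    # (timestamp passed via --parent; GNU date shown, adjust the offset based on analysis)
    npx neonctl branches create \
      --name recovery-$(date +%Y%m%d-%H%M) \
      --parent "$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)"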
    
    # Update application to use new branch
    vercel env rm DATABASE_URL production
    vercel env add DATABASE_URL production  # Paste new connection string
    
    # Redeploy
    vercel --prod
    
  4. Validation (60-90 min)
    • Run data integrity checks
    • Verify application functionality
    • Check with affected users
  5. Post-Incident (After resolution)
    • Document root cause
    • Update monitoring to detect similar issues
    • Post-mortem review

5.2 Scenario: Complete Database Loss

Trigger: Neon region outage, account compromise, or catastrophic failure
Response (RTO: 4 hours):
  1. Immediate Actions (0-30 min)
    • Declare Severity 1 incident
    • Assemble DR team
    • Contact Neon support (if provider issue)
    • Notify customers (status page)
  2. Recovery (30-120 min)
    • Create new Neon project in different region
    • Restore from latest backup:
      # Option 1: Neon-managed history (only if the original project is still reachable)
      npx neonctl branches create --name recovery-main --parent main
      
      # Option 2: Manual backup (if maintained), loaded into the new project
      psql "$NEW_DATABASE_URL" < backup-latest.sql
      
    • Update connection strings in Vercel
    • Redeploy application
  3. Data Loss Assessment (120-180 min)
    • Identify data loss window (up to RPO of 24 hours)
    • Notify affected customers
    • Offer assistance in recreating recent data
  4. Validation (180-240 min)
    • Full system test
    • User acceptance testing
    • Monitor for issues
Data Loss: Up to 24 hours (RPO) in worst case
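
The data loss window in step 3 is the gap between the restore point and the moment of the incident. A minimal sketch of that calculation, assuming both times are available as Unix epoch seconds; `data_loss_window_hours` is a hypothetical helper name.

```shell
# Report the data loss window in whole hours, rounded up so that a
# partial hour is never under-reported.
# Usage: data_loss_window_hours <restore_point_epoch> <incident_epoch>
data_loss_window_hours() {
  restore_point_epoch=$1
  incident_epoch=$2
  window_seconds=$((incident_epoch - restore_point_epoch))
  echo $(( (window_seconds + 3599) / 3600 ))
}
```

A result above 24 means the RPO was exceeded and should be called out in the customer notification.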

5.3 Scenario: Ransomware Attack

Trigger: Ransomware detected, database encrypted or deleted
Response (RTO: 4 hours):
  1. Immediate Actions (0-15 min)
    • Isolate affected systems (revoke API keys, rotate passwords)
    • Notify security team and legal
    • Do NOT pay ransom
    • Preserve evidence for forensics
  2. Containment (15-60 min)
    • Identify infection vector
    • Rotate all secrets and API keys
    • Force password reset for all users
    • Enable MFA for all accounts
  3. Recovery (60-240 min)
    • Restore from clean backup (pre-infection)
    • Deploy to new infrastructure (fresh Vercel project)
    • Update DNS to point to new infrastructure
  4. Validation (240-300 min, after service is restored within the 4-hour RTO)
    • Security scan of restored environment
    • Verify no backdoors or persistence
    • Monitor for re-infection
  5. Post-Incident
    • Full security audit
    • Penetration testing
    • Customer notification (if PII exposed)
    • Law enforcement contact (if required)

5.4 Scenario: Human Error (Accidental Deletion)

Trigger: Accidental DROP TABLE, DELETE without WHERE, etc.
Response (RTO: 30 minutes):
  1. Immediate Actions (0-5 min)
    • Stop all write operations
    • Identify what was deleted and when
    • Check whether the deletion falls within the 7-day PITR window (Section 3.1); older deletions require snapshot recovery
  2. Recovery (5-20 min)
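    # Point-in-time recovery to just before the deletion
    # (timestamp passed via --parent; GNU date shown, adjust to the actual deletion time)
    npx neonctl branches create \
      --name recovery-before-deletion \
      --parent "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)"
    
    # Export only the deleted data (mixed-case table names need inner double quotes)
    pg_dump --table='"Product"' --data-only "$RECOVERY_BRANCH_URL" > deleted_products.sql
    
    # Restore to main database (after a DROP TABLE, omit --data-only above so the table is recreated)
    psql "$MAIN_DATABASE_URL" < deleted_products.sql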
    
  3. Validation (20-30 min)
    • Verify restored data
    • Check for conflicts or duplicates
    • Confirm with user who triggered deletion
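
Rather than guessing an offset such as "10 minutes ago", the restore point can be derived from the time of the deletion once it is known. The sketch below assumes GNU `date` and a deletion time given in Unix epoch seconds; `pitr_timestamp` and `within_pitr_window` are hypothetical helpers, and the 60-second safety margin is an arbitrary default.

```shell
# Print a UTC timestamp shortly before the deletion, suitable as a restore point.
# Usage: pitr_timestamp <deletion_epoch> [margin_seconds]
pitr_timestamp() {
  deletion_epoch=$1
  margin_seconds=${2:-60}   # assumed default safety margin
  # GNU date: "-d @N" interprets N as seconds since the Unix epoch.
  date -u -d "@$((deletion_epoch - margin_seconds))" +%Y-%m-%dT%H:%M:%SZ
}

# Succeed if the deletion is still inside the 7-day PITR window.
# Usage: within_pitr_window <deletion_epoch> <now_epoch>
within_pitr_window() {
  deletion_epoch=$1
  now_epoch=$2
  [ $((now_epoch - deletion_epoch)) -le $((7 * 24 * 3600)) ]
}
```

If `within_pitr_window` fails, point-in-time recovery is no longer an option and the older snapshots from Section 3.1 have to be used instead.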

6. Communication Plan

6.1 Internal Communication

Incident Channels:
Escalation:
  1. On-call engineer → Incident Commander
  2. Incident Commander → CTO
  3. CTO → CEO (for major incidents)

6.2 External Communication

Status Page: status.agentictrust.com
Update Frequency: Every 30 minutes during active incident
Customer Notification (Major outage):
Subject: Service Disruption - We're Working to Restore Access

We are currently experiencing a service disruption affecting [scope].

WHAT HAPPENED:
[Brief description]

WHAT WE'RE DOING:
[Recovery steps]

EXPECTED RESOLUTION:
[Estimated time]

We apologize for the inconvenience and will provide updates every 30 minutes.

The Agentic Trust Team

7. Recovery Runbooks

Runbook 1: Database Restore from Backup

When to Use: Data loss, corruption, or need to roll back
Prerequisites:
  • Neon CLI installed: npm install -g neonctl
  • Neon API key configured
  • Access to Vercel dashboard
Steps:
# 1. List available backups
npx neonctl branches list

# 2. Create restore branch from a specific time (UTC timestamp passed via --parent)
npx neonctl branches create \
  --name restore-YYYYMMDD-HHMM \
  --parent "2024-01-15T14:30:00Z"

# 3. Get connection string (the branch is a positional argument)
export RESTORE_DB_URL=$(npx neonctl connection-string restore-YYYYMMDD-HHMM)

# 4. Verify restore
psql $RESTORE_DB_URL -c "SELECT COUNT(*) FROM \"Product\";"

# 5. If restore looks good, promote to production:
# Option A: Update Vercel env var to point to restore branch
vercel env rm DATABASE_URL production
vercel env add DATABASE_URL production  # Paste new connection string

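# Option B: Copy data from restore to main
# (--clean --if-exists drops existing objects first so the load does not collide with them)
pg_dump --clean --if-exists "$RESTORE_DB_URL" | psql "$MAIN_DB_URL"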

# 6. Redeploy
vercel --prod

# 7. Verify production
curl https://api.agentictrust.com/api/health

# 8. Clean up (Option B only, after verification; under Option A production now runs on this branch)
npx neonctl branches delete restore-YYYYMMDD-HHMM
Estimated Time: 30-45 minutes
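
A single health check right after a redeploy can fail while the deployment is still warming up, so step 7 may be easier to run as a retry loop. This is a sketch only: `wait_until_healthy` is a hypothetical helper, and the attempt count and delay are arbitrary values to tune.

```shell
# Run a check command until it succeeds or the attempts are used up.
# Usage: wait_until_healthy <attempts> <delay_seconds> <command> [args...]
wait_until_healthy() {
  attempts=$1
  delay_seconds=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Pause between attempts, but not after the final one.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay_seconds"
    fi
  done
  return 1
}
```

For the production check this could be invoked as `wait_until_healthy 10 30 curl -fsS https://api.agentictrust.com/api/health`, where `-f` makes curl report HTTP error responses as failures.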

Runbook 2: Secrets Recovery

When to Use: Lost access to Vercel, need to rebuild environment
Steps:
  1. Access 1Password vault: “Agentic Trust Production Secrets”
  2. Retrieve required secrets:
    • DATABASE_URL
    • WORKOS_CLIENT_ID
    • WORKOS_API_KEY
    • WORKOS_COOKIE_PASSWORD
    • ANTHROPIC_API_KEY
    • OPENAI_API_KEY
    • SENTRY_DSN
    • SENTRY_AUTH_TOKEN
  3. Create new Vercel project or update existing:
    vercel env add DATABASE_URL production
    vercel env add WORKOS_CLIENT_ID production
    # ... repeat for all secrets
    
  4. Deploy:
    vercel --prod
    
  5. Verify:
    curl https://your-app.vercel.app/api/health
    
Estimated Time: 15-20 minutes
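
Step 3 repeats one command per secret, which makes it easy to miss a name under pressure. One hedged option is to generate the command list from the names in step 2 and work through it line by line; `print_env_add_commands` is a hypothetical helper that only prints the commands and does not run them.

```shell
# Secret names taken from step 2 of this runbook.
required_secrets="DATABASE_URL WORKOS_CLIENT_ID WORKOS_API_KEY WORKOS_COOKIE_PASSWORD ANTHROPIC_API_KEY OPENAI_API_KEY SENTRY_DSN SENTRY_AUTH_TOKEN"

# Print one `vercel env add` command per required secret.
# Usage: print_env_add_commands [environment]
print_env_add_commands() {
  target_env=${1:-production}
  for name in $required_secrets; do
    echo "vercel env add $name $target_env"
  done
}
```

Printing rather than executing keeps secret values out of scripts and shell history; each command still prompts for its value when run by hand.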

8. Responsibilities

| Role | Responsibilities |
|------|------------------|
| Incident Commander | Overall DR coordination, decision-making |
| Database Administrator | Database restore, data validation |
| DevOps Engineer | Infrastructure recovery, deployment |
| Security Lead | Security assessment, forensics (if needed) |
| CTO | Executive decisions, resource allocation |
| Customer Success | Customer communication, support |

9. Maintenance Schedule

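| Activity | Frequency | Owner | Last Completed |
|----------|-----------|-------|----------------|
| Backup verification | Weekly | DevOps | - |
| DR plan review | Quarterly | Security | - |
| DR test (full restore) | Quarterly | DevOps | - |
| Update contact list | Monthly | All teams | - |
| Secrets backup | After each change | DevOps | - |
| Documentation update | After each DR event | Incident Commander | - |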

10. Appendices

Appendix A: DR Test Log Template

# DR Test Log - [Date]

## Test Details
- **Date**: YYYY-MM-DD
- **Tester**: [Name]
- **Scenario**: [Database restore / Full recovery / etc.]
- **Backup Age**: [How old was the backup used?]

## Timeline
| Time | Action | Status | Notes |
|------|--------|--------|-------|
| 10:00 | Test initiated | ✅ | - |
| 10:15 | Backup identified | ✅ | Used 7-day-old backup |
| 10:30 | Restore started | ✅ | - |
| 10:45 | Restore completed | ✅ | 15 minutes total |
| 11:00 | Data validation | ✅ | All checks passed |
| 11:30 | Functional testing | ⚠️ | Widget slow to load |
| 12:00 | Test completed | ✅ | - |

## Metrics
- **RTO Achieved**: 2 hours (Target: 4 hours) ✅
- **RPO Achieved**: 24 hours (Target: 24 hours) ✅
- **Data Loss**: None detected ✅
- **Issues Found**: 1 (widget performance)

## Issues Encountered
1. Widget loaded slowly after restore (cold cache)
   - Resolution: Warmed up cache manually
   - Action item: Add cache warming to runbook

## Action Items
- [ ] Update runbook to include cache warming step
- [ ] Investigate widget performance optimization
- [ ] Review monitoring alerts (none triggered)

## Conclusion
Test PASSED. System recovered successfully within RTO/RPO targets.

## Sign-off
- Tester: [Signature] [Date]
- Reviewer: [Signature] [Date]

Appendix B: Contact List

Emergency Contacts (Available 24/7):

| Role | Name | Phone | Email |
|------|------|-------|-------|
| Incident Commander | [Name] | [Phone] | [Email] |
| Backup IC | [Name] | [Phone] | [Email] |
| Database Admin | [Name] | [Phone] | [Email] |
| DevOps Lead | [Name] | [Phone] | [Email] |
| CTO | [Name] | [Phone] | [Email] |

Vendor Support:

| Vendor | Support Channel | SLA |
|--------|-----------------|-----|
| Neon | support@neon.tech | 4 hours (business) |
| Vercel | Enterprise support portal | 1 hour (P1) |
| WorkOS | support@workos.com | 4 hours |

Appendix C: RTO/RPO by Service

| Service | RTO | RPO | Justification |
|---------|-----|-----|---------------|
| Chat API | 4 hours | 24 hours | Critical revenue impact |
| Dashboard | 8 hours | 24 hours | Can use API directly |
| Widget Embed | 4 hours | 24 hours | Customer-facing |
| Knowledge Base | 24 hours | 24 hours | Can recreate from source |
| Admin Panel | 8 hours | 24 hours | Lower priority |
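
The per-service RTOs above also imply a restoration order during a full recovery: services with the shortest RTO come first. A minimal sketch of that lookup follows; `service_rto_hours` and `restore_order` are hypothetical helpers that hard-code the values from the table.

```shell
# Look up the RTO (in hours) for a service listed in Appendix C.
service_rto_hours() {
  case "$1" in
    "Chat API"|"Widget Embed") echo 4 ;;
    "Dashboard"|"Admin Panel") echo 8 ;;
    "Knowledge Base")          echo 24 ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Print the services in restoration order, shortest RTO first.
restore_order() {
  for svc in "Chat API" "Dashboard" "Widget Embed" "Knowledge Base" "Admin Panel"; do
    printf '%s\t%s\n' "$(service_rto_hours "$svc")" "$svc"
  done | sort -n -k1,1 | cut -f2
}
```

Keeping the values in one function means a change to the table only needs to be mirrored in a single place.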

Document Approval

| Role | Name | Signature | Date |
|------|------|-----------|------|
| CTO | [Name] | | |
| Security Lead | [Name] | | |
| DevOps Lead | [Name] | | |

Next Test Date: [3 months from today]
Next Review Date: [3 months from today]