Backup and Disaster Recovery Plan
Version: 1.0
Last Updated: 2026-02-16
Review Frequency: Quarterly
Test Frequency: Quarterly
1. Executive Summary
This document outlines the backup and disaster recovery (DR) procedures for Agentic Trust to ensure business continuity in the event of:
- Data loss or corruption
- Infrastructure failure
- Security incidents (ransomware, data breach)
- Natural disasters
- Human error
2. System Architecture Overview
2.1 Infrastructure Components
| Component | Provider | Criticality | Backup Strategy |
|---|---|---|---|
| PostgreSQL Database | Neon | CRITICAL | Automated daily backups |
| File Storage (blob) | Vercel Blob | HIGH | Redundant across regions |
| Application Code | Vercel | HIGH | Git repository + deployments |
| Secrets/Config | Vercel Env Vars | CRITICAL | Encrypted backup in 1Password |
| DNS | Vercel | CRITICAL | Configuration documented |
2.2 Data Criticality
Tier 1 - Critical (Cannot operate without):
- Customer database (products, organizations, users)
- API keys database
- Conversation and message data
- Knowledge base embeddings
- Uploaded files
- Workflow configurations
- Logs (Sentry)
- Cache data (Redis)
- Agent debug logs
3. Backup Strategy
3.1 Database Backups (Neon PostgreSQL)
Provider: Neon
Backup Type: Automated continuous backup + point-in-time recovery (PITR)
Schedule:
- Continuous WAL archiving: Real-time
- Full snapshots: Daily at 3:00 AM UTC
- Retention:
  - Last 7 days: Restore to any point in time
  - Last 30 days: Daily snapshots
  - Last 90 days: Weekly snapshots
  - Last 1 year: Monthly snapshots
- Point-in-time recovery (PITR) to any moment in last 7 days
- Snapshot recovery for older backups
- Estimated recovery time: 15-30 minutes for full database
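The retention schedule above determines which restore method is available for data of a given age. The following is a minimal sketch of that lookup, not part of any Neon tooling: the function name `recovery_option` is hypothetical, and treating "1 year" as 365 days is an assumption.

```python
from datetime import timedelta
from typing import Optional

# Retention tiers from the schedule above, newest first.
# Assumption: "1 year" is treated as 365 days.
RETENTION_TIERS = [
    (timedelta(days=7), "point-in-time recovery"),
    (timedelta(days=30), "daily snapshot"),
    (timedelta(days=90), "weekly snapshot"),
    (timedelta(days=365), "monthly snapshot"),
]


def recovery_option(age: timedelta) -> Optional[str]:
    """Return the finest-grained restore option for data of the given age.

    Returns None when the data is older than the longest retention tier.
    """
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    for limit, option in RETENTION_TIERS:
        if age <= limit:
            return option
    return None
```

For example, `recovery_option(timedelta(days=45))` falls in the 90-day tier, so only a weekly snapshot is available and up to a week of changes may separate the restore point from the target moment.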
3.2 File Storage Backups (Vercel Blob)
Provider: Vercel
Backup Type: Multi-region redundancy
Strategy:
- Files automatically replicated across multiple AWS regions
- No manual backup required (provider handles redundancy)
- Versioning not enabled (cost optimization)
- Provider guarantees 99.9% durability
- Lost files cannot be recovered without separate backup
- Consider enabling S3 versioning for critical files
3.3 Application Code & Configuration
Code Repository: GitHub (primary)
Backups:
- Git repository provides version history
- GitHub maintains its own backups and redundancy
- Deploy to Vercel from git (reproducible builds)
- Environment variables: Stored in Vercel dashboard
- CRITICAL: Maintain encrypted backup in 1Password or similar
- Documented in .env.example (without sensitive values)
3.4 Secrets and API Keys
Storage: 1Password Teams / Vault
Items to back up:
- WorkOS API keys
- Anthropic API key
- OpenAI API key
- Sentry DSN and auth token
- Database connection strings
- HMAC secrets (per product - store in encrypted export)
- Deployment keys
- All secrets stored in 1Password shared vault
- 1Password has enterprise-grade backup
- Export vault monthly to encrypted archive
- Store archive in separate secure location (offline storage)
4. Verification and Testing
4.1 Backup Verification
Automated Checks (Weekly):
- Review Neon backup dashboard
- Confirm backup completion and size
- Check backup age (last backup < 24 hours)
- Verify backup storage usage
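The weekly checks above (completion, size, and age under 24 hours) could be scripted along the following lines. This is a sketch only: `BackupRecord` and its fields are hypothetical, and how the values are obtained from the Neon dashboard or API is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class BackupRecord:
    """Hypothetical summary of the most recent backup."""
    completed: bool
    size_bytes: int
    finished_at: datetime


def verify_backup(
    record: BackupRecord,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not record.completed:
        problems.append("backup did not complete")
    if record.size_bytes <= 0:
        problems.append("backup is empty")
    if now - record.finished_at >= max_age:
        problems.append("backup is older than the maximum allowed age")
    return problems
```

Returning a list of problems rather than a single boolean lets the weekly report state which check failed.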
4.2 Restore Testing
Quarterly DR Test (Required for SOC 2):
Test Procedure:
- Preparation (Monday)
  - Schedule test during low-traffic period
  - Notify team of test
  - Document current state
- Create Isolated Environment (Monday)
  - Create new Vercel project (“agentic-trust-dr-test”)
  - Create new Neon branch from backup
  - Configure environment variables
  - Deploy application code
- Restore Database (Monday)
- Verify Data Integrity (Tuesday)
  - Login with test account
  - Verify products and API keys visible
  - Check conversation history
  - Verify file uploads accessible
  - Run data consistency checks
- Functional Testing (Tuesday)
  - Test chat widget
  - Test API endpoints
  - Test dashboard access
  - Verify embeddings and knowledge base
- Document Results (Wednesday)
  - Record time to restore: _____ minutes
  - Record any data discrepancies
  - Document issues encountered
  - Update runbook with lessons learned
- Cleanup (Wednesday)
  - Delete test Vercel project
  - Delete test Neon branch
  - Archive test results
Success Criteria:
- Restoration completed within RTO (4 hours)
- No data loss beyond RPO (24 hours)
- Application fully functional
- All critical features working
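The pass conditions listed above (restoration within the 4-hour RTO, data loss within the 24-hour RPO, a fully functional application, and working critical features) can be evaluated directly from the values recorded during the test. A minimal sketch, assuming a hypothetical `DrTestResult` record filled in from the test log:

```python
from dataclasses import dataclass
from datetime import timedelta

RTO = timedelta(hours=4)   # recovery time objective from this plan
RPO = timedelta(hours=24)  # recovery point objective from this plan


@dataclass
class DrTestResult:
    """Hypothetical record of one quarterly DR test."""
    restore_duration: timedelta
    data_loss_window: timedelta
    application_functional: bool
    critical_features_working: bool


def dr_test_passed(result: DrTestResult) -> bool:
    """A test passes only if every criterion holds."""
    return (
        result.restore_duration <= RTO
        and result.data_loss_window <= RPO
        and result.application_functional
        and result.critical_features_working
    )
```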
5. Disaster Recovery Scenarios
5.1 Scenario: Database Corruption
Trigger: Data corruption detected in production database
Response (RTO: 2 hours):
- Immediate Actions (0-15 min)
  - Notify incident commander
  - Stop all write operations (set app to read-only mode)
  - Identify scope of corruption
  - Identify last known good state
- Assessment (15-30 min)
  - Determine if corruption is fixable (run VACUUM, REINDEX)
  - If not fixable, proceed to restore
  - Determine restore point (latest clean backup)
- Restore (30-60 min)
- Validation (60-90 min)
  - Run data integrity checks
  - Verify application functionality
  - Check with affected users
- Post-Incident (After resolution)
  - Document root cause
  - Update monitoring to detect similar issues
  - Post-mortem review
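Choosing the restore point during assessment comes down to picking the most recent backup taken before the corruption began. The sketch below illustrates that selection; the function name and the list of backup timestamps are hypothetical inputs, not values returned by any Neon API.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def latest_clean_backup(
    backups: Iterable[datetime],
    corruption_started_at: datetime,
) -> Optional[datetime]:
    """Return the newest backup taken strictly before the corruption began.

    Returns None when every available backup postdates the corruption.
    """
    clean = [taken_at for taken_at in backups if taken_at < corruption_started_at]
    return max(clean) if clean else None
```

If the corruption start time is uncertain, passing the earliest plausible time keeps the selected backup on the safe side, at the cost of a larger data loss window.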
5.2 Scenario: Complete Database Loss
Trigger: Neon region outage, account compromise, or catastrophic failure
Response (RTO: 4 hours):
- Immediate Actions (0-30 min)
  - Declare Severity 1 incident
  - Assemble DR team
  - Contact Neon support (if provider issue)
  - Notify customers (status page)
- Recovery (30-120 min)
  - Create new Neon project in different region
  - Restore from latest backup
  - Update connection strings in Vercel
  - Redeploy application
- Data Loss Assessment (120-180 min)
  - Identify data loss window (up to RPO of 24 hours)
  - Notify affected customers
  - Offer assistance in recreating recent data
- Validation (180-240 min)
  - Full system test
  - User acceptance testing
  - Monitor for issues
5.3 Scenario: Ransomware Attack
Trigger: Ransomware detected, database encrypted or deleted
Response (RTO: 4 hours):
- Immediate Actions (0-15 min)
  - Isolate affected systems (revoke API keys, rotate passwords)
  - Notify security team and legal
  - Do NOT pay ransom
  - Preserve evidence for forensics
- Containment (15-60 min)
  - Identify infection vector
  - Rotate all secrets and API keys
  - Force password reset for all users
  - Enable MFA for all accounts
- Recovery (60-240 min)
  - Restore from clean backup (pre-infection)
  - Deploy to new infrastructure (fresh Vercel project)
  - Update DNS to point to new infrastructure
- Validation (240-300 min)
  - Security scan of restored environment
  - Verify no backdoors or persistence
  - Monitor for re-infection
- Post-Incident
  - Full security audit
  - Penetration testing
  - Customer notification (if PII exposed)
  - Law enforcement contact (if required)
5.4 Scenario: Human Error (Accidental Deletion)
Trigger: Accidental DROP TABLE, DELETE without WHERE, etc.
Response (RTO: 30 minutes):
- Immediate Actions (0-5 min)
  - Stop all write operations
  - Identify what was deleted and when
  - Check if within 90-day grace period
- Recovery (5-20 min)
- Validation (20-30 min)
  - Verify restored data
  - Check for conflicts or duplicates
  - Confirm with user who triggered deletion
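The conflict and duplicate check during validation matters when rows recovered from a backup are merged back into a table that has continued to receive writes. A minimal sketch of that check, assuming rows are represented as dictionaries with a primary key field named `id` (both the representation and the function name are illustrative):

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_conflicts(
    restored: List[Row],
    current: List[Row],
    key: str = "id",
) -> Tuple[List[Row], List[Row]]:
    """Separate restored rows into those safe to re-insert and those that collide.

    A restored row collides when a row with the same key already exists
    in the current table.
    """
    existing_keys = {row[key] for row in current}
    safe = [row for row in restored if row[key] not in existing_keys]
    conflicts = [row for row in restored if row[key] in existing_keys]
    return safe, conflicts
```

Rows in the conflicts list need manual review before anything is overwritten, since the current copy may contain changes made after the backup was taken.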
6. Communication Plan
6.1 Internal Communication
Incident Channels:
- Slack: #incidents
- Email: incidents@company.com
- Phone: On-call engineer (PagerDuty)
Escalation Path:
- On-call engineer → Incident Commander
- Incident Commander → CTO
- CTO → CEO (for major incidents)
6.2 External Communication
Status Page: status.agentictrust.com
Update Frequency: Every 30 minutes during active incident
Customer Notification (Major outage):
7. Recovery Runbooks
Runbook 1: Database Restore from Backup
When to Use: Data loss, corruption, or need to roll back
Prerequisites:
- Neon CLI installed: npm install -g neonctl
- Neon API key configured
- Access to Vercel dashboard
Runbook 2: Secrets Recovery
When to Use: Lost access to Vercel, need to rebuild environment
Steps:
- Access 1Password vault: “Agentic Trust Production Secrets”
- Retrieve required secrets:
  - DATABASE_URL
  - WORKOS_CLIENT_ID
  - WORKOS_API_KEY
  - WORKOS_COOKIE_PASSWORD
  - ANTHROPIC_API_KEY
  - OPENAI_API_KEY
  - SENTRY_DSN
  - SENTRY_AUTH_TOKEN
- Create new Vercel project or update existing
- Deploy
- Verify
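Before deploying a rebuilt environment, it helps to confirm that every secret named in this runbook is actually set. The following is a sketch under the assumption that the secrets are exposed as environment variables; the function name `missing_secrets` is hypothetical.

```python
import os
from typing import List, Mapping

# Secret names listed in this runbook.
REQUIRED_SECRETS = (
    "DATABASE_URL",
    "WORKOS_CLIENT_ID",
    "WORKOS_API_KEY",
    "WORKOS_COOKIE_PASSWORD",
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "SENTRY_DSN",
    "SENTRY_AUTH_TOKEN",
)


def missing_secrets(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required secret names that are unset or blank in env."""
    return [name for name in REQUIRED_SECRETS if not env.get(name, "").strip()]
```

Calling `missing_secrets()` with no arguments checks the current process environment; passing a dictionary makes it possible to check values exported from the vault before they are uploaded. The function reports names only and never prints secret values.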
8. Responsibilities
| Role | Responsibilities |
|---|---|
| Incident Commander | Overall DR coordination, decision-making |
| Database Administrator | Database restore, data validation |
| DevOps Engineer | Infrastructure recovery, deployment |
| Security Lead | Security assessment, forensics (if needed) |
| CTO | Executive decisions, resource allocation |
| Customer Success | Customer communication, support |
9. Maintenance Schedule
| Activity | Frequency | Owner | Last Completed |
|---|---|---|---|
| Backup verification | Weekly | DevOps | - |
| DR plan review | Quarterly | Security | - |
| DR test (full restore) | Quarterly | DevOps | - |
| Update contact list | Monthly | All teams | - |
| Secrets backup | After each change | DevOps | - |
| Documentation update | After each DR event | Incident Commander | - |
10. Appendices
Appendix A: DR Test Log Template
Appendix B: Contact List
Emergency Contacts (Available 24/7):
| Role | Name | Phone | Email |
|---|---|---|---|
| Incident Commander | [Name] | [Phone] | [Email] |
| Backup IC | [Name] | [Phone] | [Email] |
| Database Admin | [Name] | [Phone] | [Email] |
| DevOps Lead | [Name] | [Phone] | [Email] |
| CTO | [Name] | [Phone] | [Email] |
Vendor Support Contacts:
| Vendor | Support Channel | SLA |
|---|---|---|
| Neon | support@neon.tech | 4 hours (business) |
| Vercel | Enterprise support portal | 1 hour (P1) |
| WorkOS | support@workos.com | 4 hours |
Appendix C: RTO/RPO by Service
| Service | RTO | RPO | Justification |
|---|---|---|---|
| Chat API | 4 hours | 24 hours | Critical revenue impact |
| Dashboard | 8 hours | 24 hours | Can use API directly |
| Widget Embed | 4 hours | 24 hours | Customer-facing |
| Knowledge Base | 24 hours | 24 hours | Can recreate from source |
| Admin Panel | 8 hours | 24 hours | Lower priority |
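One way to use the table above during a multi-service outage is to recover services in order of their RTO, shortest first. The sketch below encodes the RTO column and derives that order; treating RTO as the only prioritization input is an assumption, and the function name is hypothetical.

```python
from typing import Dict, List

# RTO in hours per service, copied from the table above.
SERVICE_RTO_HOURS: Dict[str, int] = {
    "Chat API": 4,
    "Dashboard": 8,
    "Widget Embed": 4,
    "Knowledge Base": 24,
    "Admin Panel": 8,
}


def recovery_order(rto_hours: Dict[str, int] = SERVICE_RTO_HOURS) -> List[str]:
    """Order services by RTO, shortest first; ties keep the table order."""
    return sorted(rto_hours, key=lambda name: rto_hours[name])
```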
Document Approval
| Role | Name | Signature | Date |
|---|---|---|---|
| CTO | [Name] | | |
| Security Lead | [Name] | | |
| DevOps Lead | [Name] | | |