Service Status and Diagnostics

Operational playbook for monitoring system health, diagnosing service failures, and maintaining platform uptime.

Maintaining high availability for VeriWorkly requires active monitoring of its distributed components, including the primary API, background workers, and persistent data stores. This guide outlines the procedures for health assessment and technical diagnostics.


1. Automated Health Monitoring

The most reliable mechanism for assessing system state is the backend health endpoint. This route performs active probes of the PostgreSQL database and the Redis connection rather than simply confirming that the API process is running.

Endpoint: GET /api/v1/health

Nominal Operational State (200 OK)

When all integrated systems are functioning within defined parameters, the API returns a success response:

{
  "success": true,
  "message": "Server is healthy",
  "data": {
    "status": "ok",
    "database": "connected",
    "redis": "connected",
    "timestamp": "2026-04-26T10:00:00.000Z"
  }
}

Degraded Operational State (503 Service Unavailable)

If a dependency failure is detected, the API returns a service unavailable status:

{
  "success": false,
  "message": "Service Unavailable",
  "data": {
    "status": "degraded",
    "database": "disconnected",
    "redis": "connected",
    "timestamp": "2026-04-26T10:05:00.000Z"
  }
}

Implementation Recommendation: Integrate this endpoint with monitoring services such as Datadog, UptimeRobot, or Better Stack to facilitate automated alerting upon state transitions.
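For ad hoc checks between monitoring intervals, the status field can be extracted directly from the health payload. The sketch below uses the sample degraded response from above so the parsing logic is visible; in a live environment you would capture the payload with curl against your deployment's base URL (the `localhost:3000` address in the comment is an assumption, not a documented default).

```shell
#!/bin/sh
# Minimal health-probe sketch. In production, capture the payload with:
#   response=$(curl -s http://localhost:3000/api/v1/health)
# Here a sample degraded payload stands in for a live response.
response='{"success":false,"message":"Service Unavailable","data":{"status":"degraded","database":"disconnected","redis":"connected","timestamp":"2026-04-26T10:05:00.000Z"}}'

# Extract the overall status field without external JSON tooling.
status=$(printf '%s' "$response" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')

if [ "$status" = "ok" ]; then
  echo "health: ok"
else
  echo "health: ${status:-unknown} - paging on-call"
fi
```

With the sample payload this prints `health: degraded - paging on-call`; wiring the non-ok branch to your alerting channel turns it into a lightweight cron-based monitor.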


2. Containerized Environment Diagnostics

In scenarios where health checks fail or containers exhibit unstable behavior (e.g., restart loops), utilize the following diagnostic commands.

Service State Assessment

Verify the operational status of all containerized services defined in the orchestration configuration:

docker compose ps
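The raw `ps` output can be filtered to surface only services that are not running. Recent Compose versions accept a `--format` template (verify against your installed version); the sketch below applies the filter to sample output, with placeholder service names, so the logic is testable without a live Docker daemon.

```shell
# Sketch: flag services that are not in the "running" state.
# Live usage would pipe real output, e.g.:
#   docker compose ps --format '{{.Name}} {{.State}}'
# The service names below are illustrative placeholders.
ps_output='api running
web running
redis restarting
postgres running'

unhealthy=$(printf '%s\n' "$ps_output" | awk '$2 != "running" {print $1}')

if [ -n "$unhealthy" ]; then
  echo "Services needing attention: $unhealthy"
fi
```

A `restarting` state that persists across several checks usually indicates a crash loop; proceed to the log analysis steps below for that service.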

Resource Utilization Analysis

Document rendering via Playwright is memory-intensive. Monitor resource consumption to identify potential Out-of-Memory (OOM) conditions:

docker stats

Significant memory consumption (approaching 100% of the allocated limit) necessitates either additional host resources or a reduction in concurrent rendering job limits.
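This threshold check can be automated by parsing the memory-percentage column. The sketch below operates on sample `docker stats` output (the 90% cutoff and container names are assumptions to adjust for your environment); live input would come from `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'`.

```shell
# Sketch: alert when a container's memory usage crosses a threshold.
# Live input:  docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'
stats='api 91.30%
redis 12.50%'

alerts=$(printf '%s\n' "$stats" | while read -r name mem; do
  pct=${mem%\%}          # strip trailing percent sign
  pct=${pct%.*}          # keep integer part for POSIX arithmetic
  [ "$pct" -ge 90 ] && echo "OOM risk: $name at $mem"
done)
echo "$alerts"
```

Running this periodically and alerting on non-empty output gives early warning before the kernel OOM killer terminates a rendering worker mid-job.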


3. Log Analysis and Troubleshooting

Technical diagnostics require analysis of service-specific logs. Execute these commands from the root directory containing the compose.yaml file.

Backend API Logs

docker compose logs --tail=100 -f api

Key Indicators: Uncaught exceptions, Prisma connectivity errors, or Zod validation failures for environment variables.
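When triaging a large log tail, these indicators can be filtered out with a single pattern. The sample lines below are illustrative stand-ins for real log output; in production you would pipe `docker compose logs --tail=500 api` into the same `grep`.

```shell
# Sketch: filter API logs for the failure signatures listed above.
# In production:  docker compose logs --tail=500 api | grep -E "$pattern"
logs='api  | PrismaClientInitializationError: Cannot reach database server
api  | request completed in 12ms
api  | ZodError: DATABASE_URL is required'

pattern='PrismaClient|ZodError|UnhandledPromiseRejection'
matches=$(printf '%s\n' "$logs" | grep -E "$pattern")
echo "$matches"
```

Against the sample input this isolates the Prisma connectivity error and the Zod validation failure while discarding routine request lines.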

Redis Service Logs

docker compose logs --tail=100 -f redis

Key Indicators: Memory allocation failures or persistence (RDB/AOF) errors.

Frontend Application Logs

docker compose logs --tail=100 -f web

Key Indicators: Fetch errors or internal network resolution failures (ECONNREFUSED).


4. Common Operational Failure Modes

| Issue | Potential Cause | Resolution Strategy |
| --- | --- | --- |
| Database Connectivity | Invalid DATABASE_URL or network firewall restrictions. | Verify credentials and ensure the API has network access to the PostgreSQL instance. |
| Redis Memory Exhaustion | Backed-up job queue or excessive data caching. | Increase container memory limits or optimize job processing frequency. |
| CORS Policy Violations | Misconfigured ALLOWED_ORIGINS variable. | Ensure ALLOWED_ORIGINS in the backend environment includes the exact public frontend origin. |
| Document Rendering Delay | Hanging Playwright processes or zombie Chromium instances. | Restart the API service to clear the process tree and re-initialize the rendering workers. |
| Zod Validation Failure | Missing or malformed environment variables in production. | Review the startup logs to identify the specific variable failing validation. |
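The CORS misconfiguration in the table is the easiest to pre-check, since it reduces to a string-membership test. The sketch below assumes ALLOWED_ORIGINS is a comma-separated list (verify against your backend's parsing) and uses illustrative placeholder values for both variables.

```shell
# Sketch: verify the public frontend origin appears in ALLOWED_ORIGINS.
# Both values are illustrative; substitute your real configuration.
ALLOWED_ORIGINS='https://app.example.com,https://staging.example.com'
FRONTEND_ORIGIN='https://app.example.com'

# Wrap both sides in commas so the match is exact, not a substring hit.
case ",$ALLOWED_ORIGINS," in
  *",$FRONTEND_ORIGIN,"*) result="origin allowed" ;;
  *)                      result="origin MISSING from ALLOWED_ORIGINS" ;;
esac
echo "$result"
```

Note the exact-match requirement: a trailing slash or an http/https scheme mismatch in either value will fail the browser's origin comparison even though the hostnames agree.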
