Troubleshooting

This troubleshooting guide helps you resolve common issues you may encounter while using Qovery. Use the sections below to find solutions quickly.

Service Deployment Issues

Find solutions for the most common deployment errors and issues you may encounter when deploying services on Qovery.

Connection Refused Error

Symptom: Your deployment fails with a “connection refused” error during health checks.Common Causes:

Port mismatch - The port in your Qovery configuration doesn’t match the port your application listens on
Localhost binding - Your application listens on localhost (127.0.0.1) instead of all interfaces (0.0.0.0)

Solutions:

Verify Port Configuration

Check that your Qovery service port matches your application’s listening port.

If your app listens on port 8080, configure port 8080 in Qovery
Check your application logs to see which port it’s actually using

Update Application Binding

Ensure your application binds to 0.0.0.0 (all interfaces) instead of localhost:Node.js Example:

// ✅ Correct - Binds to all interfaces
app.listen(8080, '0.0.0.0');

// ❌ Wrong - Only binds to localhost
app.listen(8080, 'localhost');

Python Flask Example:

# ✅ Correct
app.run(host='0.0.0.0', port=8080)

# ❌ Wrong
app.run(host='127.0.0.1', port=8080)

Most web frameworks default to localhost in development. Always explicitly set 0.0.0.0 for containerized environments.

Not Enough Resources

Symptom: Deployment fails with “insufficient resources” or pods remain in “Pending” state.Cause: Your cluster doesn’t have enough CPU or memory capacity to run your service.Solutions:

Reduce Service Resources
Upgrade Instance Type
Increase Node Count

Lower the resource requests for your service if they’re set too high:

Go to your service Settings → Resources
Reduce CPU or Memory requests
Redeploy the service

Start with minimum resources and scale up as needed. Most applications don’t need as much as you think!

If your services legitimately need more resources:

Go to Cluster Settings → Node Pools
Select larger instance types
Update the cluster

Example: Upgrade from t3.medium (2 vCPU, 4 GB RAM) to t3.large (2 vCPU, 8 GB RAM)

Application is Crashing

Symptom: Your service deploys but immediately crashes or restarts repeatedly.Solution: Debug using the Qovery Shell

Access the Container

Use the Qovery CLI to access your container:

qovery shell

This opens an interactive shell inside your running container.

Investigate the Issue

Once inside:

Check environment variables: env
Test your startup command manually
Review application configuration files
Check for missing dependencies

For Rapidly Crashing Apps

If your app crashes too fast to shell into:

Remove the port temporarily from service settings (this prevents Kubernetes from restarting it)

Modify your Dockerfile to use a sleep command:

# Comment out your entrypoint
# ENTRYPOINT ["npm", "start"]

# Add sleep to keep container running
ENTRYPOINT ["sleep", "infinity"]

Deploy with this change
Use qovery shell to debug
Fix the issue and restore the original entrypoint

Remember to restore your port configuration and entrypoint after debugging!

SSL/TLS Certificate Issues

Symptom: SSL certificates aren’t being generated for your custom domain.Cause: DNS records are not properly configured for your custom domain.Solution:

Identify the Problem

Check the Qovery Console for which domain is failing certificate generation. You’ll see an error indicator next to the domain.

Verify DNS Configuration

Your domain should have a CNAME record pointing to your Qovery cluster URL.

Verify DNS resolution:

dig your-domain.com CNAME

You should see a CNAME pointing to your Qovery cluster domain.

Fix and Redeploy

Update your DNS CNAME record with your domain provider
Wait for DNS propagation (can take up to 48 hours, usually minutes)
Redeploy your application in Qovery
Certificate generation should succeed

DNS changes can take time to propagate. Use DNS Checker to verify propagation globally.

Docker Build Timeout

Symptom: Your build fails with a timeout error after 30 minutes.Cause: The default Docker build timeout is 1800 seconds (30 minutes). Complex builds (like compiling large codebases) may exceed this limit.Solution:

Increase Build Timeout

Go to your service Settings → Advanced Settings
Find the build.timeout_max_sec parameter
Increase the value (e.g., 3600 for 1 hour)
Save and redeploy

Optimize Your Build (Recommended)

Consider optimizing your Dockerfile:

Use multi-stage builds
Leverage build caching effectively
Only copy necessary files
Install dependencies before copying source code

Example Multi-stage Dockerfile:

# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY package*.json ./
RUN npm ci --production
CMD ["node", "dist/index.js"]

Git Submodule Errors

Symptom: Build fails when trying to clone private Git submodules.Cause: Private submodules require authentication, which isn’t available during the build.Solutions:

Make Submodule Public (Recommended)
Use Git Credential Helper
Use SSH Keys

If possible, make your submodule repository public. This is the simplest solution.

Embed basic authentication in your .gitmodules file:

[submodule "my-private-module"]
    path = my-private-module
    url = https://username:token@github.com/org/private-repo.git

Be careful not to commit plain-text credentials! Use environment variables or secrets management.

Configure SSH keys in your build:

Add your private SSH key as a secret environment variable

Configure SSH in your Dockerfile:

RUN mkdir -p ~/.ssh && \
    echo "${SSH_PRIVATE_KEY}" > ~/.ssh/id_rsa && \
    chmod 600 ~/.ssh/id_rsa && \
    ssh-keyscan github.com >> ~/.ssh/known_hosts

Lifecycle Jobs & Cronjobs Execution Failed

Symptom: Your Lifecycle Job or Cronjob fails to complete successfully.Common Causes:

Code exceptions - Errors in your application code
Out of memory - Job exceeds memory limits
Execution timeout - Job takes longer than configured maximum duration

Solutions:

Check Job Logs

Go to your Job service
Click Logs tab
Look for error messages or stack traces
Identify the root cause (exception, OOM, timeout)

Fix Based on Cause

For Code Exceptions:

Fix the bug in your code
Redeploy the job

For Out of Memory:

Increase memory allocation in Settings → Resources
Optimize your code to use less memory

For Timeouts:

Go to Settings → Max Duration
Increase the timeout value
Or optimize your job to run faster

For long-running jobs, consider breaking them into smaller tasks or using a queue system.

SnapshotQuotaExceeded Error (Database)

Symptom: Database deletion fails with SnapshotQuotaExceeded error.Cause: Qovery automatically creates a snapshot before deleting a database. If you’ve reached your cloud provider’s snapshot quota, this fails.

Solutions:

Delete Old Snapshots
Request Quota Increase

Remove obsolete database snapshots from your cloud provider:AWS RDS:

Go to AWS RDS Console
Navigate to Snapshots
Delete old snapshots you no longer need
Retry database deletion in Qovery

Other Providers:

GCP Cloud SQL: Console
Azure Database: Portal
Scaleway: Console

Service Runtime Issues

Find solutions for common runtime errors and issues you may encounter when operating services on Qovery after successful deployment.

SIGKILL Signal 137 - Memory Exhaustion

Symptom: Your container terminates unexpectedly with exit code 137 or SIGKILL signal.Cause: Your application has exceeded its memory limit. When system resources become constrained, Kubernetes forcibly terminates the container to reclaim memory (Out of Memory Kill - OOMKill).How to Identify:Check your logs for messages like:

Container killed with exit code 137

OOMKilled: true

Solutions:

Increase Memory Allocation

Go to your service Settings → Resources
Increase the Memory limit
Start with a 50% increase (e.g., 512MB → 768MB)
Redeploy and monitor

Watch your memory usage metrics to find the right allocation. Don’t over-allocate unnecessarily!

Investigate Memory Leaks

Before just increasing memory, check if your application has a memory leak:Signs of a Memory Leak:

Memory usage steadily increases over time
Container was fine, then started crashing after recent code changes
Memory never levels off or decreases

Recent Changes to Review:

New dependencies or library updates
Code changes in recent deployments
New features that load data into memory
Caching implementations without expiration

Optimize Memory Usage

Common optimization strategies:

Clear unused variables and objects
Implement pagination for large datasets
Use streaming for file processing
Add proper cache eviction policies
Profile your application to find memory-intensive code

Node.js Example - Memory Profiling:

// Check memory usage
console.log(process.memoryUsage());

// Force garbage collection (requires --expose-gc flag)
if (global.gc) {
  global.gc();
}

Python Example - Memory Profiling:

import tracemalloc

tracemalloc.start()
# Your code here
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

for stat in top_stats[:10]:
    print(stat)

Continuously increasing memory without investigating the root cause will lead to higher costs and may just delay the problem!

Debugging Rapidly Crashing Applications

Symptom: Your application crashes within seconds of starting, making it impossible to connect and debug.Challenge: The container restarts so quickly that you can’t use qovery shell to investigate.Solution:

Temporarily Remove Application Port

Go to your service Settings → Ports
Remove or disable the application port
Deploy the changes

Removing the port prevents Kubernetes from performing health checks and auto-restarting the container.

Modify Dockerfile to Keep Container Running

Update your Dockerfile to override the entrypoint with a sleep command:

# Comment out your normal entrypoint/CMD
# ENTRYPOINT ["npm", "start"]
# CMD ["python", "app.py"]

# Add sleep to keep container alive
ENTRYPOINT ["sleep", "infinity"]

Or for debugging purposes:

# Run a shell instead
ENTRYPOINT ["/bin/sh"]
CMD ["-c", "while true; do sleep 30; done"]

Commit and deploy these changes.

Access the Container

Once deployed, use the Qovery CLI to shell into the container:

qovery shell

Now your container stays running and you can debug interactively!

Debug Manually

Inside the container, you can now:Check environment variables:

env | grep -i app

Test your application manually:

# Node.js
node server.js

# Python
python app.py

# Go
./app

Check for missing dependencies:

# Node.js
npm list

# Python
pip list

# Check system packages
which <command>

Review configuration files:

cat config/app.json
cat .env

Fix and Restore

Identify and fix the issue in your code
Restore the original Dockerfile entrypoint
Re-add the application port
Deploy the fixed version

Don’t forget to restore your port configuration and original entrypoint! The sleep command is only for debugging.

Helm Service Logging Limitations

Symptom: When deploying Helm charts, you can’t see logs or pod status in the Qovery Console.Cause: Qovery requires specific labels and annotations on your Kubernetes resources to enable log access and pod status visibility.Solution:Add Qovery-specific macros to your Helm chart templates:

Add Labels and Annotations

Update your Helm chart’s deployment.yaml, service.yaml, or job.yaml to include Qovery macros:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mychart.fullname" . }}
  labels:
    {{- include "qovery.labels.service" . | nindent 4 }}
  annotations:
    {{- include "qovery.annotations.service" . | nindent 4 }}
spec:
  template:
    metadata:
      labels:
        {{- include "qovery.labels.service" . | nindent 8 }}
      annotations:
        {{- include "qovery.annotations.service" . | nindent 8 }}
    spec:
      containers:
      - name: app
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"

Required Resources

Apply these macros to the following Kubernetes resources:

Deployments - For long-running applications
StatefulSets - For stateful applications
Jobs - For one-time tasks
CronJobs - For scheduled tasks
Services - For networking
Pods - If you create standalone pods

Both labels and annotations are required for full functionality!

Override Values During Deployment

If you can’t modify the Helm chart directly, override the values:

# In your Qovery Helm service overrides
deployment:
  metadata:
    labels:
      qovery.com/service-id: "{{ QOVERY_SERVICE_ID }}"
      qovery.com/service-type: "{{ QOVERY_SERVICE_TYPE }}"
      qovery.com/environment-id: "{{ QOVERY_ENVIRONMENT_ID }}"
    annotations:
      qovery.com/service-version: "{{ QOVERY_SERVICE_VERSION }}"

Redeploy and Verify

Update your Helm chart with the macros
Redeploy the Helm service in Qovery
Verify logs are now visible in the Console
Check that pod status appears correctly

These labels and annotations allow Qovery to identify and track your Helm-deployed resources within the cluster.

High CPU Usage

Symptoms:

Application becomes slow or unresponsive
CPU throttling warnings in logs
Pods getting OOMKilled even with sufficient memory

Quick Checks:

Check CPU metrics in Qovery Console
Review recent code changes that might be CPU-intensive
Look for infinite loops or inefficient algorithms
Check for CPU-intensive operations running on every request

Solutions:

Optimize hot paths in your code
Implement caching for expensive operations
Move heavy processing to background jobs
Increase CPU allocation if legitimately needed
Use profiling tools to identify bottlenecks

Slow Application Response

Possible Causes:

Database connection issues
External API timeouts
Insufficient resources
Inefficient code paths
Network latency

Debugging Steps:

Check application logs for slow queries or timeouts
Review database connection pools
Monitor external API response times
Profile your application to find slow endpoints
Check network policies that might be blocking traffic

Cluster Issues

Find solutions for common errors you might encounter while deploying or updating Qovery clusters.

DependencyViolation Errors During Cluster Deletion

Symptom: When attempting to delete a Qovery cluster, you receive a DependencyViolation error.Cause: Resources managed outside of Qovery remain attached to cluster infrastructure elements, preventing deletion.Example Error:

DeleteError - Unknown error while performing Terraform command
(terraform destroy -lock=false -no-color -auto-approve), here is the error:

Error: deleting EC2 Subnet (subnet-xxx): operation error EC2: DeleteSubnet,
https response error StatusCode: 400, RequestID: xxx, api error DependencyViolation:
The subnet 'subnet-xxx' has dependencies and cannot be deleted.

Solution:

Access Cloud Provider Console

Log into your cloud provider console (AWS, GCP, Azure, or Scaleway).

Navigate to VPC Section

AWS Example:

Go to VPC Console
Find the VPC associated with your Qovery cluster
Look for the subnet mentioned in the error message

Attempt to Delete the Resource

Try to delete the failing resource (subnet, security group, etc.):

Select the resource
Click Delete or Actions → Delete
The cloud provider will show what’s blocking the deletion

Identify Blocking Resources

Review the blocking resources, typically:

Network Interfaces - Check the Type and Description fields
NAT Gateways - May be attached to subnets
Load Balancers - Can block subnet deletion
EC2 Instances - Running or stopped instances
Lambda Functions - With VPC configuration
RDS Instances - In the VPC

AWS network interfaces blocking deletion

Delete Blocking Resources

Delete any resources that were created outside of Qovery:

Only delete resources you created manually! Don’t delete resources managed by other applications.

Note down which resources are blocking
Delete them from the cloud console
Wait for deletion to complete

Retry Cluster Deletion

Return to Qovery and retry the cluster deletion. It should now succeed.

Common culprits: Lambda functions attached to VPC, manually created EC2 instances, or RDS databases not managed by Qovery.

Removing Qovery Resources Without Platform Access

Scenario: You no longer have access to the Qovery platform but need to clean up AWS resources.

Method A: AWS Resource Groups & Tag Editor

Access Resource Groups

Log into AWS Console
Go to Resource Groups & Tag Editor service
Click Create Resource Group

Filter by Qovery Cluster Tag

Choose Tag based group type
Add tag filter:
- Tag key: ClusterLongId
- Tag value: Your Qovery cluster ID (found in cluster settings or URL)
Click Preview group resources

Review and Delete Resources

Review all resources tagged with your cluster ID
Note the resource types (VPC, EC2, RDS, etc.)
Delete resources in this order:
- EC2 instances
- Load balancers
- RDS databases
- NAT gateways
- Internet gateways
- Route tables
- Subnets
- Security groups
- VPC (last)

Method B: AWS CLI Script

Use this bash script to list all resources in a VPC by ID:

#!/bin/bash

# Set your VPC ID
VPC_ID="vpc-xxxxxxxxx"

echo "=== Resources in VPC: $VPC_ID ==="

# EC2 Instances
echo -e "\n=== EC2 Instances ==="
aws ec2 describe-instances \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query "Reservations[].Instances[].[InstanceId,Tags[?Key=='Name'].Value|[0],State.Name]" \
  --output table

# Subnets
echo -e "\n=== Subnets ==="
aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query "Subnets[].[SubnetId,CidrBlock,AvailabilityZone]" \
  --output table

# Security Groups
echo -e "\n=== Security Groups ==="
aws ec2 describe-security-groups \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query "SecurityGroups[].[GroupId,GroupName]" \
  --output table

# NAT Gateways
echo -e "\n=== NAT Gateways ==="
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=$VPC_ID" \
  --query "NatGateways[].[NatGatewayId,State]" \
  --output table

# Internet Gateways
echo -e "\n=== Internet Gateways ==="
aws ec2 describe-internet-gateways \
  --filters "Name=attachment.vpc-id,Values=$VPC_ID" \
  --query "InternetGateways[].[InternetGatewayId]" \
  --output table

# Route Tables
echo -e "\n=== Route Tables ==="
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query "RouteTables[].[RouteTableId,Associations[0].Main]" \
  --output table

# Load Balancers (ALB/NLB)
echo -e "\n=== Load Balancers (ALB/NLB) ==="
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?VpcId=='$VPC_ID'].[LoadBalancerName,LoadBalancerArn,Type]" \
  --output table

# Network Interfaces
echo -e "\n=== Network Interfaces ==="
aws ec2 describe-network-interfaces \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query "NetworkInterfaces[].[NetworkInterfaceId,Description,Status]" \
  --output table

# VPC Endpoints
echo -e "\n=== VPC Endpoints ==="
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query "VpcEndpoints[].[VpcEndpointId,ServiceName,State]" \
  --output table

Usage:

chmod +x list-vpc-resources.sh
./list-vpc-resources.sh

Always double-check you’re deleting the correct resources! Deleting production resources by mistake can cause downtime.

Blocked Cloud Account

Symptom: You receive one of these errors:

"This account is currently blocked by your cloud provider"

"This AWS account is currently blocked and not recognized as a valid account"

Common Causes:

Billing Issues
- Outstanding payment
- Credit card expired
- Payment method declined
Free Tier Restrictions
- Attempting to deploy in regions not supported by free tier
- Exceeding free tier limits
Account Compliance Violations
- Terms of service violations
- Abuse reports
- Security issues

Solution:

Contact Your Cloud Provider

Qovery cannot resolve account-level blocks. You must contact your cloud provider directly:

Verify Billing

Check your billing dashboard
Ensure payment method is valid
Resolve any outstanding payments

Check Account Status

Review your account status page for:

Active incidents
Service health issues
Account restrictions

Retry After Resolution

Once your cloud provider resolves the block:

Wait 5-10 minutes for changes to propagate
Retry your cluster operation in Qovery

Account blocks are typically resolved quickly once you contact your cloud provider. Most issues are billing-related and easy to fix.

Missing SQS Permissions with Karpenter (AWS)

Symptom: Cluster creation fails with a permissions error related to AWS SQS when using Karpenter.Error Message:

Permissions issue. Check your AWS permissions to ensure you have
the necessary authorizations for this action.

Cause: The IAM credentials provided don’t have necessary SQS permissions for Karpenter’s interruption handling.Solution:

Verify IAM Policy

Check the official Qovery IAM policy
Compare with your current IAM user/role permissions

Ensure all SQS permissions are included:

{
  "Effect": "Allow",
  "Action": [
    "sqs:CreateQueue",
    "sqs:DeleteQueue",
    "sqs:GetQueueAttributes",
    "sqs:SetQueueAttributes",
    "sqs:TagQueue"
  ],
  "Resource": "arn:aws:sqs:*:*:qovery*"
}

Check AWS Organization SCPs

If you’re using AWS Organizations, Service Control Policies (SCPs) may be blocking permissions:

Go to AWS Policy Simulator
Select your IAM user or role
Choose SQS service
Test these actions:
- CreateQueue
- DeleteQueue
- GetQueueAttributes
- SetQueueAttributes
- TagQueue
For resource, use: arn:aws:sqs:::qovery*

Interpret Results

Scenario 1: Permissions Allowed (with SCPs enabled)

If permissions show as allowed with SCPs:

Contact Qovery Support
Provide cluster ID and error logs

Scenario 2: Permissions Denied (with SCPs enabled but allowed when disabled)

If denied with SCPs but allowed without:

SCP is blocking Qovery access
Contact your AWS administrator
Request SQS permissions for Qovery resources

Update IAM Policy

If using static credentials:

Go to IAM Console
Find your Qovery IAM user
Update attached policies to include SQS permissions
Retry cluster creation

If using STS Assume Role:

Update the CloudFormation stack
Add missing SQS permissions to the role
Wait for stack update to complete
Retry cluster creation

Karpenter uses SQS to handle EC2 spot instance interruption notifications. Without these permissions, the cluster cannot properly handle node lifecycle events.

Cluster Stuck in 'Deploying' State

Symptoms:

Cluster shows “Deploying” for more than 45 minutes
No progress in deployment logs

Possible Causes:

AWS service quotas exceeded
Region capacity issues
Network connectivity problems
Invalid cluster configuration

Solutions:

Check Cluster Logs:
- Go to Cluster Settings → Logs
- Look for specific error messages
Verify Service Quotas:
- Check AWS Service Quotas for EC2, VPC, ELB
- Request increases if needed
Try Different Region:
- Some regions may have capacity issues
- Try deploying to an alternative region
Contact Support:
- If issue persists > 1 hour, contact support
- Provide cluster ID and deployment logs

Need More Help?

If you don’t find what you need in this troubleshooting guide, we’re here to help:

Community Forum

Read updates and announcements from the Qovery team

Help & Support

View all support options

Documentation

Browse the complete documentation

Quick Links

Service Logs - Learn how to access and analyze service logs
Deployment Statuses - Understand deployment status indicators
Cluster Configuration - Configure your cluster settings
Advanced Settings - Fine-tune service configurations

Get Started

Platform Capabilities

How-to Guides

Tutorials

Troubleshooting

Useful Resources

Troubleshooting

Service Deployment Issues

Service Runtime Issues

Cluster Issues

Method A: AWS Resource Groups & Tag Editor

Method B: AWS CLI Script

Need More Help?

Community Forum

Help & Support

Documentation

Quick Links

Get Started

Platform Capabilities

How-to Guides

Tutorials

Troubleshooting

Useful Resources

Documentation Index

​Service Deployment Issues

​Service Runtime Issues

​Cluster Issues

​Method A: AWS Resource Groups & Tag Editor

​Method B: AWS CLI Script

​Need More Help?

Community Forum

Help & Support

Documentation

​Quick Links

Service Deployment Issues

Service Runtime Issues

Cluster Issues

Method A: AWS Resource Groups & Tag Editor

Method B: AWS CLI Script

Need More Help?

Quick Links