GitLab Geo secondary site unable to refresh status and stuck in unhealthy state due to timeout errors
Overview
GitLab Geo secondary sites may become stuck in an unhealthy state when the Geo metrics collection process (`Geo::MetricsUpdateWorker`) times out while attempting to count large database tables, particularly `ci_job_artifacts`. This results in the secondary site being unable to refresh its status with the primary site.
Description
When running `gitlab-rake geo:status` on a secondary site, the command fails with a `PG::QueryCanceled: ERROR: canceling statement due to statement timeout` error. The `Geo::MetricsUpdateWorker` also fails with a `Geo::Errors::StatusTimeoutError` when attempting to collect metrics for Geo replication status.
Impacted offerings:
- GitLab Self-Managed
Impacted versions:
- All versions with GitLab Geo enabled
Resolution
1. **Clear the Sidekiq deduplication key**: manually clear the stale deduplication key so the worker can retry (first sketch after this list)
2. **Reduce Geo metrics collection frequency**: configure less frequent metrics collection via the GitLab configuration (second sketch below)
3. **Configure longer statement timeouts**: increase the database statement timeout so the counting queries can complete (third sketch below)
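For step 1, a minimal sketch from a Rails console on the secondary. This assumes the deduplication key is stored in the queues Redis instance and follows the naming used by GitLab's Sidekiq deduplication middleware; confirm the exact key (for example from the Sidekiq logs) before deleting anything:

```ruby
# gitlab-rails console (Geo secondary).
# Assumption: the deduplication keys created by the Sidekiq deduplication
# middleware live in the queues Redis instance and contain "duplicate:" in
# their name (typically "resque:gitlab:duplicate:<queue>:<digest>",
# depending on version and namespacing). List candidates first, then delete
# only the key that belongs to Geo::MetricsUpdateWorker.
Gitlab::Redis::Queues.with do |redis|
  redis.scan_each(match: '*duplicate:*').each { |key| puts key }

  # After confirming the exact key (from the listing above or the Sidekiq
  # logs), remove it so the next scheduled run is no longer skipped:
  # redis.del('<confirmed deduplication key>')
end
```

For step 2, the worker's schedule can be relaxed in `/etc/gitlab/gitlab.rb`. This sketch assumes an Omnibus installation and that the `gitlab_rails['geo_metrics_update_worker_cron']` setting is exposed in the `gitlab.rb` template shipped with your version:

```ruby
# /etc/gitlab/gitlab.rb on the Geo secondary (assumed setting name; check
# the gitlab.rb template for your GitLab version).
# Collect Geo metrics every 15 minutes instead of the default schedule,
# then apply with: sudo gitlab-ctl reconfigure
gitlab_rails['geo_metrics_update_worker_cron'] = "*/15 * * * *"
```

For step 3, a sketch for Omnibus installations using the bundled PostgreSQL; the value is in milliseconds and applies to all statements, so raise it with care:

```ruby
# /etc/gitlab/gitlab.rb (Omnibus, bundled PostgreSQL).
# Allow statements to run for up to 2 minutes instead of the default
# (typically 60000 ms). Apply with: sudo gitlab-ctl reconfigure
# (a PostgreSQL restart may also be required).
postgresql['statement_timeout'] = '120000'
```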
Symptoms
- The secondary Geo site shows as unhealthy in the GitLab Admin area
- The `gitlab-rake geo:status` command times out with a `PG::QueryCanceled` error
- `Geo::MetricsUpdateWorker.new.perform` fails with `Geo::Errors::StatusTimeoutError` (see the reproduction sketch below)
- Error backtraces show the timeouts originating in `batch_counter.rb` and in `verification_failed_count` methods
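A minimal way to confirm the symptom is to run the worker inline from a Rails console on the secondary site; the exact error depends on which count times out first:

```ruby
# gitlab-rails console on the Geo secondary site.
# Running the metrics worker inline reproduces the Sidekiq failure: on
# affected instances this raises Geo::Errors::StatusTimeoutError (or the
# underlying PG::QueryCanceled) while counting large tables such as
# ci_job_artifacts.
Geo::MetricsUpdateWorker.new.perform
```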
Root Cause
The issue occurs when `Geo::MetricsUpdateWorker` takes so long to complete that it approaches its deduplication TTL (Time To Live). This typically happens on GitLab instances with very large tables, such as `ci_job_artifacts`, whose rows the metrics worker cannot count within the default database statement timeout.
Additional information
This issue is more likely to occur in environments with:
- Large numbers of CI/CD artifacts
- High database load
- Multiple replication slots active
- Insufficient database performance tuning
Long-term improvements being developed include:
- Making timeouts configurable
- Using approximate counts instead of exact counts (see the sketch below)
- Maintaining cached counts or using analytics databases
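To illustrate the approximate-count idea, PostgreSQL's planner statistics can return a row estimate without scanning the table; a minimal sketch from the Rails console (the estimate comes from `pg_class.reltuples` and is only as fresh as the last `ANALYZE`):

```ruby
# gitlab-rails console. Reads the planner's row estimate for ci_job_artifacts
# instead of issuing an exact COUNT(*); the value is approximate and depends
# on when the table was last (auto)analyzed.
estimate = Ci::JobArtifact.connection.select_value(<<~SQL)
  SELECT reltuples::bigint FROM pg_class WHERE relname = 'ci_job_artifacts'
SQL
puts "approximate ci_job_artifacts rows: #{estimate}"
```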
Related Links
- Issue #414047: Geo status unhealthy and out-of-date for 6 hours after Sidekiq was killed
- Issue #512646: Geo: Metrics collection must scale
- Issue #523536: Geo: Unblock the updating of site status even if some metrics are slow to collect
- Issue #370158: Geo::MetricsUpdateWorker slow total job artifacts count