Postgres Replica Lags
Problem
After a node failure or a network-related issue, the health-check output might, in rare cases, look like the following:
======================================================
Check Routingbot Postgres health status
======================================================
AMBER
Attempt 1/5 failed. Retrying in 30 seconds...
Attempt 2/5 failed. Retrying in 30 seconds...
Attempt 3/5 failed. Retrying in 30 seconds...
Attempt 4/5 failed. Retrying in 30 seconds...
routingbot postgres test...
Checking if postgres is fully up...
creating table at rbpostgres-db.routingbotdb test_table_6616...
creating entry at rbpostgres-db.routingbotdb 6616...
checking entry at rbpostgres-db.routingbotdb...
deleting table at rbpostgres-db.routingbotdb rbdebug...
Checking PostgreSQL replica status...
Replica rbpostgres-db-2 is lagging behind.
If all postgres pods are running, please run below command to manually recover replicas.
paragon-utils common-utils/postgres_replicas_check_and_reinit
In this example, the rbpostgres-db-2 replica of the rbpostgres-db cluster is lagging behind. Sometimes, atom-db can lag behind as well.
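When the health-check output is saved to a file or piped from another command, the lagging replicas can be picked out mechanically. The following is a minimal sketch, not part of the product tooling: the function name lagging_replicas_from_healthcheck is hypothetical, and the message format "Replica <name> is lagging behind" is assumed from the sample output above.

```shell
# Print the name of every replica that the health check reports as lagging.
# Reads health-check output on stdin. The message wording is assumed to match
# the sample output; adjust the pattern if your version prints it differently.
lagging_replicas_from_healthcheck() {
  grep -o 'Replica [^ ]* is lagging behind' | awk '{ print $2 }'
}
```

For example, piping the saved health-check output into lagging_replicas_from_healthcheck prints one replica name per line, or nothing if no replica is reported as lagging.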
Cause
The Postgres replica has not caught up with the leader instance.
Solution
Run the repair command as displayed in the health-check output.
# paragon-utils common-utils/postgres_replicas_check_and_reinit
Verify the Postgres replica status using the health-check command. If the Postgres replica still lags behind, perform the following steps:
Use any one of the rbpostgres-db pods and run the patronictl list command to verify which instance has a problem. Here, as the health-check output also confirms, rbpostgres-db-2 is lagging.
# kubectl exec -it -n routingbotdb rbpostgres-db-0 -- patronictl list
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: rbpostgres-db (7517327816498606139) ----------+----+-----------+
| Member          | Host           | Role    | State     | TL | Lag in MB |
+-----------------+----------------+---------+-----------+----+-----------+
| rbpostgres-db-0 | 10.1.2.3       | Replica | streaming | 26 |         0 |
| rbpostgres-db-1 | 10.1.3.4       | Leader  | running   | 26 |           |
| rbpostgres-db-2 | 10.1.4.5       | Replica | running   | 19 |    272716 |
+-----------------+----------------+---------+-----------+----+-----------+
Log in to the leader instance.
# kubectl exec -it -n routingbotdb rbpostgres-db-1 -- bash
Reinitialize the lagging pod and enter Y when prompted to proceed.
root@rbpostgres-db-1:/home/postgres# patronictl reinit rbpostgres-db rbpostgres-db-2
Wait a few minutes, then run the patronictl list command again. How long the rbpostgres-db-2 pod takes to catch up with the leader instance depends on the size of the data in the database; once it has caught up, it moves to the streaming state.
root@rbpostgres-db-1:/home/postgres# patronictl list
+ Cluster: rbpostgres-db (7517327816498606139) ----------+----+-----------+
| Member          | Host           | Role    | State     | TL | Lag in MB |
+-----------------+----------------+---------+-----------+----+-----------+
| rbpostgres-db-0 | 10.1.2.3       | Replica | streaming | 26 |         0 |
| rbpostgres-db-1 | 10.1.3.4       | Leader  | running   | 26 |           |
| rbpostgres-db-2 | 10.1.4.5       | Replica | streaming | 19 |         0 |
+-----------------+----------------+---------+-----------+----+-----------+
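The manual checks in the steps above (find the leader, find the lagging member, wait for it to reach the streaming state) can be scripted. The following is a minimal sketch under stated assumptions: the function names patroni_leader, patroni_lagging_members, and wait_for_replicas are hypothetical, and the column order (Member, Host, Role, State, TL, Lag in MB) is assumed to match the sample output; other Patroni versions may print different columns.

```shell
# Helpers that read `patronictl list` table output on stdin.
# Assumed column order: Member | Host | Role | State | TL | Lag in MB.

# Print the name of the current leader.
patroni_leader() {
  awk -F'|' 'NF >= 7 {
    m = $2; r = $4
    gsub(/ /, "", m); gsub(/ /, "", r)
    if (r == "Leader") print m
  }'
}

# Print "<member> <state> <lag>" for each replica that is not in the
# streaming state or whose reported lag is greater than zero.
patroni_lagging_members() {
  awk -F'|' 'NF >= 7 && $2 !~ /^ *Member *$/ {
    m = $2; r = $4; s = $5; lag = $7
    gsub(/ /, "", m); gsub(/ /, "", r); gsub(/ /, "", s); gsub(/ /, "", lag)
    if (r == "Replica" && (s != "streaming" || lag + 0 > 0)) print m, s, lag
  }'
}

# Poll until no replica is reported as lagging.
# $1: maximum attempts, $2: seconds between attempts,
# remaining arguments: a command that prints `patronictl list` output.
# Returns 0 once all replicas are streaming with zero lag, 1 otherwise.
# Note: the wfr_* variables are global because POSIX sh has no `local`.
wait_for_replicas() {
  wfr_attempts=$1; wfr_delay=$2; shift 2
  wfr_i=0
  while [ "$wfr_i" -lt "$wfr_attempts" ]; do
    wfr_i=$((wfr_i + 1))
    # Treat a failed or empty listing as "not healthy yet", never as success.
    if wfr_out=$("$@") && [ -n "$wfr_out" ] &&
       [ -z "$(printf '%s\n' "$wfr_out" | patroni_lagging_members)" ]; then
      return 0
    fi
    if [ "$wfr_i" -lt "$wfr_attempts" ]; then
      sleep "$wfr_delay"
    fi
  done
  return 1
}
```

As a usage example, wait_for_replicas 20 30 kubectl exec -n routingbotdb rbpostgres-db-1 -- patronictl list would check every 30 seconds, up to 20 times, after the reinit has been started. The attempt count and delay are illustrative values; choose them according to the size of the database.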