Feb 24, 20184 min read

vPostgres Database Failover Scenarios

Updated: Feb 7, 2021

Master Database constantly streams transactions to Replica servers

Once Master fails, the active and working Replica is ready to proceed with Read-Only requests right away.When the new Master is promoted ( manually or automatically ), all upcoming requests are moved to it.

Up to vRA 7.2 release , manual failover was supported which means that the MASTER is replaced with a Replica selected by Administrator

Administrator needs to trigger the promotion operation and wait for its completion

With vRA 7.3, PostgreSQL database failover is automatic, when three appliances are deployed and synchronous replication is configured

Failover scenarios does not differ from the manual failover scenarios except parts of the scenarios with the steps needed to recover the environment are performed automatically, when automatic failover is on.

It is the ability of the vRA PostgreSQL database component to automatically sustain a Master node alive at any time without further intervention of Administrator.

There are some cases that vRA is unable to perform an automatic failover, for example if more than half of the nodes are off at the same time.Monitoring the https://vra-fqdn:5434/api/status when in sync mode ( automatic failover turned on ) as specified in documentation will notify the client to perform manual failover when the manualFailoverNeeded status flag is true

3 Node ( vRA Appliances ) Failover Scenarios

Database screen under vRA Settings from VAMI Portal

As you can see in the screenshot we do see three nodes

One being MASTER , other being SYNC REPLICA and the last one is POTENTIAL REPLICA

For systems that use synchronous replication, there is one replica node completely in sync with the MASTER known as SYNC REPLICA. In the above screenshot it's vra3.vcloud.local which is SYNC REPLICA

The SYNC REPLICA which is also called as HOT-SYNC node will be the one who would be promoted as MASTER as soon as current MASTER state would be changed to DOWN

Only SYNC REPLICA will be the one in complete SYNC with MASTER node , other's status would be set to POTENTIAL.

In ASYNCHRONOUS replication mode all node's have Sync State set to ASYNC. MASTER does not have any Sync State value.

The Valid column in the screenshot shows that the database instance in the node is healthy

The Priority column shows the position of Replica in relation to the Master node. The Master node has no priority.

SYNC REPLICA's priority would be set to 0 as it's the one next in line to become MASTER. Subsequent POTENTIAL REPLICA's would be having priority set to 1, 2.... and so on

Three Node ( vRA Appliances ) Auto Failover Scenarios

When one of three nodes is down , the AutoFailover will be performed. No more AutoFailover capability till all three nodes are alive

Let's explore few scenarios

Scenario 1: MASTER is DOWN

What Happens

SYNC REPLICA becomes MASTER and picks up the database function automatically ( performed by automatic failover agent )
POTENTIAL REPLICA is now a new SYNC REPLICA
vRA appliances get into READ ONLY mode till automatic failover completes.

Once former MASTER is back

It would be set to REPLICA automatically by failover agent. No manual action is needed.
If former MASTER cannot be fixed, set the PostgreSQL database to asynchronous mode

Scenario 2: SYNC REPLICA is DOWN

What Happens

There would be no change to MASTER, hence no downtime.
You may experience some delay in the database requests for a couple of seconds until Potential Replica becomes the new SYNC REPLICA
Promotion of nodes is automatically done by the PostgreSQL service

Once former SYNC REPLICA is back online

It would be automatically set to POTENTIAL REPLICA

Scenario 3: POTENTIAL REPLICA is DOWN

What happens

Nothing. There is no downtime

When former POTENTIAL REPLICA comes back online

It will automatically be set to POTENTIAL REPLICA
If former POTENTIAL REPLICA does not come online then set the PostgreSQL database to asynchronous mode.

When two out of three nodes are DOWN at the same time, the system starts working in a read only mode until a manual repair is performed. When just one Database on node is available in the system, it should be switched to asynchronous mode.

Scenario 4: MASTER AND POTENTIAL REPLICA are DOWN

What happens

SYNC REPLICA will not be prompted as MASTER automatically
System starts working in a read only mode until the manual promotion is performed

Once Former MASTER and POTENTIAL REPLICA are back online

They should be reset manually to synchronize against MASTER
Then the replication should be turned back to SYNCHRONOUS mode

Scenario 5: SYNC REPLICA and POTENTIAL REPLICA are DOWN

What Happens

MASTER is not able to process R/W transactions
System starts working in READ ONLY mode until a manual repair is performed

When the SYNC REPLICA and POTENTIAL REPLICA are back

They must be reset manually to synchronize against the MASTER
The system should be turned back to SYNCHRONOUS mode

When link failure happens between NODES

Scenario 6: Link Failure between SITE A and SITE B, but all 3 nodes are UP and ONLINE

SITE A has MASTER and POTENTIAL REPLICA

SITE B has SYNC REPLICA

What happens

POTENTIAL REPLICA on SITE A becomes SYNC REPLICA

There is no downtime

When the link is re-established no manual action is needed

Scenario 7: Link failure between SITE A and SITE B and all three nodes are UP and ONLINE

SITE A has MASTER

SITE B has POTENTIAL REPLICA

What happens

SYNC REPLICA becomes MASTER and picks up the DB function automatically, and POTENTIAL REPLICA becomes the new SYNC REPLICA (performed by the automatic failover agent).
System is working in a read only mode until this promotion finishes.

When the link is re-established , former MASTER is reset as REPLICA.

There is no MANUAL action needed

vRA Database Configuration Files and it's log locations

vPostgres configuration file: /var/vmware/vpostgres/current/pgdata/pg_hba.conf
vPostgreslog: /storage/db/pgdata/pg_log/postgresql.csv, /storage/db/pgdata/serverlog
Automatic Failover Agent log: /var/log/vcac/vcac-config.log

#vRealizeAutomation

arunnukula

arunnukula

vPostgres Database Failover Scenarios

Recent Posts