Failover

Failover is the process by which a standby server takes over as the new "master" server.

The master can be configured as a standard server, a commit server, or an edge server.

High Availability and Disaster Recovery

The Failover feature supports two scenarios:

  • High Availability (HA)
    • Typically, the standby server is in the same hardware rack as the master server.

    • Typical use case: scheduled maintenance, although HA failover is also possible when the master hardware fails.

    • Typically, the master participates in the failover process:

      • disabling itself in an orderly fashion

      • waiting for the journalcopy of the remaining transactions to the standby

      • allowing the standby to stop the master

      Note

      If the master server does not participate in the failover, a check is made to ensure that the standby server to which failover is to occur has the mandatory option set. Without the participation of the master server, failing over to a mandatory standby server is required to ensure that the other replicas remain consistent with the new master server after failover. Consistency is assured because during production operations, metadata must be journalcopy'd by all mandatory standby servers before that metadata is replicated to the other replicas. Deploying one or more mandatory standby servers local to the master server is recommended. This is because journalcopy performance of the mandatory standby servers can affect the production replication to the other replicas.

  • Disaster Recovery (DR)
    • Typical use case: due to a sudden catastrophe, the master server and its HA standbys are unable to operate.
    • Contact support for assistance with failing over to a non-mandatory standby server when the master server is inaccessible.

Consistency of the downstream replicas is assured after failover in either of the following cases:

  • the master server participates, in which case:
    • the standby server need not be a "mandatory" standby
    • the standby server's journalcopy, pull -L, and pull -u threads are an integral part of the failover
  • the master server does not participate and the standby server is a "mandatory" standby, in which case only the standby server's pull -L thread is an integral part of the failover

Important

  • The p4 failover command must be run on a server of type standby or forwarding-standby.
  • The server from which failover can occur is usually called the master. However, failover can occur from a server that provides standard, commit-server, or edge-server services.
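
The rules above (where p4 failover may run, and when downstream consistency is assured) can be restated as a small decision function. This is only an illustrative sketch: `assess_failover` and `FailoverAssessment` are hypothetical names, not part of any Perforce API, and the function models the documented rules rather than querying a real server.

```python
from dataclasses import dataclass
from typing import Tuple

# Server types on which "p4 failover" may be run, per the Important note.
FAILOVER_CAPABLE_TYPES = {"standby", "forwarding-standby"}


@dataclass(frozen=True)
class FailoverAssessment:
    allowed: bool                      # can failover proceed with consistent replicas?
    integral_threads: Tuple[str, ...]  # standby threads that are part of the failover
    reason: str


def assess_failover(server_type: str,
                    master_participates: bool,
                    standby_is_mandatory: bool) -> FailoverAssessment:
    """Hypothetical helper that restates the documented consistency rules."""
    if server_type not in FAILOVER_CAPABLE_TYPES:
        return FailoverAssessment(
            False, (), "p4 failover must run on a standby or forwarding-standby")
    if master_participates:
        # The standby does not need the mandatory option in this case.
        return FailoverAssessment(
            True, ("journalcopy", "pull -L", "pull -u"),
            "master participates in the failover")
    if standby_is_mandatory:
        # Without the master, only the pull -L thread is integral.
        return FailoverAssessment(
            True, ("pull -L",),
            "master does not participate, standby is mandatory")
    return FailoverAssessment(
        False, (),
        "master does not participate and standby is not mandatory")
```

For example, `assess_failover("standby", False, False)` reports that failover is not allowed, which corresponds to the case where the documentation advises contacting support.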

Server type: standby or forwarding-standby

For a streamlined failover, consider a dedicated standby.

For situations where failover completion is less time-critical, a forwarding-standby might reduce hardware costs.

High availability with the mandatory server specification option

In the server specification, the Options field can be set to mandatory for a standby (or forwarding-standby) server. This option ensures that no other replica receives metadata that has not yet been journalcopy'd by every mandatory standby (or forwarding-standby) server, so a consistent failover is possible whether or not the original master is available at failover time.

If the server from which failover is to occur is not participating in the failover (because the master is unavailable or the -i option causes the master to be ignored), the p4 failover command returns an error if it is running on a standby (or forwarding-standby) server that is not properly configured with the mandatory option.

For high availability failover, the local standby typically has a server specification with Options set to mandatory.
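
As a rough illustration of checking this option before a failover, the sketch below parses the text of a server spec and reports whether Options contains mandatory. The helper names (`parse_spec`, `is_mandatory`) and the sample spec are hypothetical, and the parser assumes a simplified form in which each field of interest is a single "Field: value" line.

```python
def parse_spec(spec_text: str) -> dict:
    """Parse single-line 'Field: value' entries from spec text (simplified)."""
    fields = {}
    for line in spec_text.splitlines():
        # Skip blank lines, '#' comment lines, and indented continuation lines.
        if not line.strip() or line.startswith("#") or line[0].isspace():
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_mandatory(spec_text: str) -> bool:
    """True only if the Options field lists 'mandatory' as a whole word."""
    options = parse_spec(spec_text).get("Options", "").split()
    # Whole-word match, so 'nomandatory' does not count as 'mandatory'.
    return "mandatory" in options


# Hypothetical spec text for illustration; real specs contain more fields.
SAMPLE_SPEC = """\
# A server spec, trimmed for the example
ServerID:\tstandby-1
Type:\tserver
Services:\tstandby
Options:\tmandatory
ReplicatingFrom:\tmaster-1
"""
```

In practice the spec text would come from the server itself (for instance, the output of `p4 server -o` for the standby's server ID); here a literal string keeps the example self-contained.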

Note

A best practice for deploying a mandatory standby is:

  1. Deploy as nomandatory
  2. Wait for its journalcopy to catch up with the master
  3. On the master, change the server spec for that standby to mandatory

Disaster recovery with the nomandatory server specification option

For disaster recovery failover, the remote standby typically has a server specification with Options set to nomandatory. This is because the journalcopy performance of a mandatory standby can affect the speed of replication to the other replicas of the master.

Potential data loss

Standby is mandatory:

  • Master participates
    • Any commands that were not completed when failover began might need to be executed again on the new master server.
    • There should not be any data loss.

  • Master does not participate
    • The transactions that were done directly on the master prior to the failover, and that had not yet been journalcopy'd to the standby being used for the failover, will be lost.
    • To minimize data loss when the master server does not participate in the failover, the standby used for the failover should be the standby that was the most current with the master at the time of the failover, which is likely the standby that is in the same rack as the master.

The downstream replicas are consistent with the new master server

The downstream replicas will not have data loss relative to the new master server
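
The advice to fail over to the standby that was most current with the master can be sketched as a simple comparison of journal positions. This is a hedged illustration only: it assumes each standby's position is available as a (journal number, byte offset) pair gathered by whatever monitoring you already have, and the names `most_current_standby` and `is_caught_up` are hypothetical.

```python
from typing import Dict, Tuple

# Assumed representation of a journal position: (journal number, byte offset).
JournalPosition = Tuple[int, int]


def most_current_standby(positions: Dict[str, JournalPosition]) -> str:
    """Return the server ID whose journalcopy position is furthest along."""
    if not positions:
        raise ValueError("no standby positions supplied")
    # Tuples compare by journal number first, then by offset within it.
    return max(positions, key=lambda server_id: positions[server_id])


def is_caught_up(master: JournalPosition, standby: JournalPosition) -> bool:
    """True if the standby has copied at least up to the master's position."""
    return standby >= master
```

Given positions such as `{"standby-local": (42, 9000), "standby-remote": (42, 1500)}`, the first function selects `standby-local`, which matches the expectation that a standby in the same rack is usually the most current.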

Failover process

The Failover feature allows the super user to:

  1. Get a report of whether conditions look good for a successful failover.
    Warning

    If the report indicates that the existing master server is still accessible and ignoring that server has been requested with the -i option, this could result in two separate servers, each of which is unaware of the other. This "split-brain" situation can produce inconsistencies that compromise the integrity of your data.

  2. Initiate the failover process.
    1. This automatically stops the standby (or forwarding standby) server that will become the new master.

    2. During the failover process, end users might notice that the master server does not process any new commands.

    3. A verification process ensures that recent file content was correctly replicated to the new master. See the p4 failover command for the -v option.

    4. During the failover process, a new file named statefailover is created in the P4ROOT directory. This file records the last consistency point journalcopy'd by the standby immediately prior to the failover. The new master server deletes this file when it is no longer needed.

  3. Monitor the steps that are reported during the process. If the failover process encounters an error, it is designed to inform the super user and to stop, so that corrective action can be taken and a new attempt can be made.

  4. If an error is encountered after the standby server has stopped the master server, the standby server will not restart the master server.

  5. After a successful failover, verify that the former standby (or forwarding standby) has been restarted as the new master: issue the p4 info command and check that the ServerID is the ServerID of the master server that was failed over.

  6. Following a successful failover, site-specific changes might be needed to use the new master server. For example, it might be necessary to make DNS changes so that users and replicas can connect to the new master server.

The end users can now issue new commands.
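
The verification in step 5 can be sketched as a check on the text printed by p4 info. This is a minimal example under stated assumptions: the function names are hypothetical, the sample output is trimmed and invented for illustration, and the code only inspects text that you pass in rather than contacting a server.

```python
from typing import Optional


def server_id_from_info(info_text: str) -> Optional[str]:
    """Return the value of the 'ServerID:' line from p4 info output, if any."""
    for line in info_text.splitlines():
        # Split on the first colon only, so values such as host:port stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "ServerID":
            return value.strip()
    return None


def failover_verified(info_text: str, failed_over_master_id: str) -> bool:
    """True if the server now reports the ServerID of the failed-over master."""
    return server_id_from_info(info_text) == failed_over_master_id


# Hypothetical, trimmed p4 info output from the new master.
SAMPLE_INFO = """\
User name: super
Server address: p4.example.com:1666
ServerID: master-1
"""
```

A check such as `failover_verified(SAMPLE_INFO, "master-1")` would typically run after any site-specific changes (for example, DNS updates) so that the command reaches the new master.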

Prerequisites

  • Make sure that monitoring (p4 monitor) was enabled when the standby server was started.
    • Monitoring must be enabled at server startup of the standby prior to running the p4 failover command because the monitor subsystem is used to terminate the journalcopy, pull -L, and pull -u threads during the failover sequence.

  • Ensure that all the standby and forwarding-standby servers have a value for the ReplicatingFrom field in their server spec. This will allow the statefailover file in the P4ROOT of the new master server to be automatically deleted when it is no longer needed.

  • If an edge server is being failed over, the service user of the edge server should be logged into the commit (or master) server using the file specified by the P4TICKETS configurable (and likely the P4TRUST configurable) defined for the standby of the edge server.
  • The standby server must be appropriately licensed for its new role following the failover.
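
The prerequisites above lend themselves to a pre-flight checklist. The sketch below is illustrative only: the function names are hypothetical, each server spec is assumed to be already parsed into a dictionary with ServerID, Services, and ReplicatingFrom keys, and the boolean inputs stand in for checks you would perform by other means.

```python
from typing import Dict, List, Optional

# Assumed values of the Services field that identify standby servers.
STANDBY_SERVICES = {"standby", "forwarding-standby"}


def standbys_missing_replicating_from(specs: List[Dict[str, str]]) -> List[str]:
    """Return server IDs of standbys whose ReplicatingFrom field is empty."""
    missing = []
    for spec in specs:
        if spec.get("Services") not in STANDBY_SERVICES:
            continue  # the prerequisite applies only to standby servers
        if not spec.get("ReplicatingFrom", "").strip():
            missing.append(spec.get("ServerID", "<unknown>"))
    return missing


def prerequisite_problems(monitoring_enabled_at_startup: bool,
                          specs: List[Dict[str, str]],
                          licensed_for_new_role: bool,
                          edge_service_user_logged_in: Optional[bool] = None
                          ) -> List[str]:
    """Collect unmet prerequisites; an empty list means none were found."""
    problems = []
    if not monitoring_enabled_at_startup:
        problems.append("monitoring was not enabled when the standby started")
    for server_id in standbys_missing_replicating_from(specs):
        problems.append(f"{server_id}: ReplicatingFrom is not set")
    # None means no edge server is being failed over, so the check is skipped.
    if edge_service_user_logged_in is False:
        problems.append("edge service user is not logged in to the commit server")
    if not licensed_for_new_role:
        problems.append("standby is not licensed for its new role")
    return problems
```

Returning a list of problems, rather than a single boolean, keeps every unmet prerequisite visible in one pass.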

    Configurables affected

    The failover process:

    Configurables and edge server

    When failing over to a standby from an edge (or other replica) server, the updated configurables for the edge server will need to be manually changed on the commit server. This is because the update of the configurables cannot be propagated back to the commit (or upstream) server automatically, given that the edge server might, or might not, be participating in the failover.