Intermediate

 

Failover Setup

This tutorial outlines a failover setup method for two validator-class machines, which allows for seamless software upgrades and failover to the secondary machine when monitoring detects a problem with the primary machine. It requires an additional machine running an etcd cluster to store the tower voting records. The tutorial covers the installation and configuration of the etcd cluster with TLS certificates, as well as the configuration of the primary and secondary validators with the necessary solana-validator parameters to enable tower storage in etcd. It also explains how to generate and use a secondary validator identity, and how to configure the secondary validator to authorize the primary validator.

A simple two machine instance failover method is described here, which allows you to:

  • upgrade your validator software with virtually no down time, and

  • failover to the secondary instance when your monitoring detects a problem with the primary instance without any safety issues that would otherwise be associated with running two instances of your validator.

You will need two validator-class machines for your primary and secondary validator. A third machine for running an etcd cluster, which is used to store the tower voting record for your validator.

Setup

etcd cluster setup

There is ample documentation regarding etcd setup and configuration at , please generally familiarize yourself with etcd before continuing.

It’s recommended that etcd be installed on a separate machine from your primary and secondary validator machines. This machine must be highly available, and depending on your needs you may wish to configure etcd with more than just one node.

First install etcd as desired for your machine. Then TLS certificates must be created for authentication between the etcd cluster and your validator. Here is one way to do this:

With Golang installed, run go install github.com/cloudflare/cfssl/cmd/cfssl@latest. The cfssl program should now be available at ~/go/bin/cfssl. Ensure ~/go/bin is in your PATH by running PATH=$PATH:~/go/bin/.

Now create a certificate directory and configuration file:

mkdir -p certs/
echo '{"CN":"etcd","hosts":["localhost", "127.0.0.1"],"key":{"algo":"rsa","size":2048}}' > certs/config.json

 

then create certificates for the etcd server and the validator:

 
cfssl gencert -initca certs/config.json | cfssljson -bare certs/etcd-ca
cfssl gencert -ca certs/etcd-ca.pem -ca-key certs/etcd-ca-key.pem certs/config.json | cfssljson -bare certs/validator
cfssl gencert -ca certs/etcd-ca.pem -ca-key certs/etcd-ca-key.pem certs/config.json | cfssljson -bare certs/etcd

 

Copy these files to your primary and secondary validator machines:

  • certs/validator-key.pem

  • certs/validator.pem

  • certs/etcd-ca.pem

and these files to the machine running the etcd server:

  • certs/etcd.pem

  • certs/etcd-key.pem

  • certs/etcd-ca.pem

With this configuration, both the validator and etdc will share the same TLS certificate authority and will each authenticate the other with it.

Start etcd with the following arguments:

etcd --auto-compaction-retention 2 --auto-compaction-mode revision \
  --cert-file=certs/etcd.pem --key-file=certs/etcd-key.pem \
  --client-cert-auth \
  --trusted-ca-file=certs/etcd-ca.pem \
  --listen-client-urls=https://127.0.0.1:2379 \
  --advertise-client-urls=https://127.0.0.1:2379

 

and use curl to confirm the etcd TLS certificates are properly configured:

curl --cacert certs/etcd-ca.pem https://127.0.0.1:2379/ --cert certs/validator.pem --key certs/validator-key.pem

 

On success, curl will return a 404 response.

For more information on etcd TLS setup, please refer to

Primary Validator

The following additional solana-validator parameters are required to enable tower storage into etcd:

solana-validator ... \
  --tower-storage etcd \
  --etcd-cacert-file certs/etcd-ca.pem \
  --etcd-cert-file certs/validator.pem \
  --etcd-key-file certs/validator-key.pem \
  --etcd-endpoint 127.0.0.1:2379  # <-- replace 127.0.0.1 with the actual IP address

 

Note that once running your validator will terminate if it’s not able to write its tower into etcd before submitting a vote transaction, so it’s essential that your etcd endpoint remain accessible at all times.

Secondary Validator

Configure the secondary validator like the primary with the exception of the following solana-validator command-line argument changes:

  • Generate and use a secondary validator identity: --identity secondary-validator-keypair.json

  • Add --no-check-vote-account

  • Add --authorized-voter validator-keypair.json (where validator-keypair.json is the identity keypair for your primary validator)

Triggering a failover manually

When both validators are running normally and caught up to the cluster, a failover from primary to secondary can be triggered by running the following command on the secondary validator

$ solana-validator wait-for-restart-window --identity validator-keypair.json \
  && solana-validator set-identity validator-keypair.json

 

The secondary validator will acquire a lock on the tower in etcd to ensure voting and block production safely switches over from the primary validator.

The primary validator will then terminate as soon as it detects the secondary validator using its identity.

Note: When the primary validator restarts (which may be immediate if you have configured your primary validator to do so) it will reclaim its identity from the secondary validator. This will in turn cause the secondary validator to exit. However if/when the secondary validator restarts, it will do so using the secondary validator identity and thus the restart cycle is broken.

Triggering a failover via monitoring

Monitoring of your choosing can invoke the solana-validator set-identity validator-keypair.json command mentioned in the previous section.

It is not necessary to guarantee the primary validator has halted before failing over to the secondary, as the failover process will prevent the primary validator from voting and producing blocks even if it is in an unknown state.

Validator Software Upgrades

To perform a software upgrade using this failover method:

Install the new software version on your primary validator system but do not restart it yet.

Trigger a manual failover to your secondary validator. This should cause your primary validator to terminate.

When your primary validator restarts it will now be using the new software version.

Once the primary validator catches up upgrade the secondary validator at your convenience.

Troubleshooting

There is a #validator-support Discord channel available to reach other testnet participants, .

Can’t find what you’re looking for? Send an email to ryan@solana.com or reach out to @rshea#2622 on Discord.

Blockstore

The validator blockstore rocksdb database can be inspected using the ldb tool. ldb is part of the rocksdb code base and is also available in the rocksdb-tools package.

RocksDB Administration and Data Access Tool

Upgrade

If a new software version introduces a new column family to the blockstore, that new (empty) column will be automatically created. This is the same logic that allows a validator to start fresh without the blockstore directory.

Downgrade

If a new column family has been introduced to the validator blockstore, a subsequent downgrade of the validator to a version that predates the new column family will cause the validator to fail while opening the blockstore during startup.

List column families:

ldb --db=<validator ledger path>/rocksdb/ list_column_families

 

Warning: Please seek guidance on discord before modifying the validator blockstore.

Drop a column family:

ldb --db=<validator ledger path>/rocksdb drop_column_family <column family name>