How to run a core node - Part 2: Publishing History Archives and Running

Intermediate

 


Welcome to Part 2 of our guide on how to run a core node for the Stellar network. This part covers publishing history archives and running your node. We start with why and how to publish a history archive, including caching and the hosting options available, such as a local archive served by nginx or an archive on Amazon S3, and how to backfill an archive so that it holds the complete history of the network. We then walk through starting Stellar Core, interacting with your instance, and joining the network: establishing connections to other peers, observing consensus, and catching up until your node is synced. Finally, we cover logging and validator maintenance, including the recommended maintenance steps and special considerations during quorum set updates. By the end of this guide, you will have a fully functional core node up and running on the Stellar network.

Publishing History Archives

If you want to run a Full Validator or an Archiver, you need to set up your node to publish a history archive. You can host an archive using a blob store such as Amazon’s S3 or DigitalOcean’s Spaces, or you can serve a local archive directly via an HTTP server such as Nginx or Apache. If you’re setting up a Basic Validator, you can skip this section. No matter what kind of node you’re planning to run, make sure it is configured to get history, which is covered in Configuration.

Caching and History Archives

You can significantly reduce the data transfer costs associated with running a public History archive by using common caching techniques or a CDN.

Three simple rules apply to caching the History archives:

  • Do not cache the archive state file .well-known/stellar-history.json (“Cache-Control: no-cache”)

  • Do not cache HTTP 4xx responses (“Cache-Control: no-cache”)

  • Cache everything else for as long as possible (> 1 day)
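As a minimal sketch of the three rules above, the helper below picks a Cache-Control value for a response. The function name and the one-day default are illustrative assumptions (the default mirrors the max-age=86400 value used in the nginx examples in this guide); a CDN or proxy would normally express the same logic in its own configuration language.

```python
# Hypothetical helper illustrating the caching rules; not part of any Stellar tool.
STATE_FILE = ".well-known/stellar-history.json"
ONE_DAY_SECONDS = 86400  # assumption: matches the nginx examples; longer is fine


def cache_control(path: str, status: int, max_age: int = ONE_DAY_SECONDS) -> str:
    """Return a Cache-Control header value for an archive response."""
    # Rule: do not cache HTTP 4xx responses.
    if 400 <= status < 500:
        return "no-cache"
    # Rule: do not cache the archive state file.
    if path.lstrip("/") == STATE_FILE:
        return "no-cache"
    # Rule: cache everything else for as long as possible.
    return f"max-age={max_age}"


print(cache_control("/.well-known/stellar-history.json", 200))
print(cache_control("/history/00/00/00/history-0000003f.json", 200))
```

Checking the status code first means a 404 for any path, including the state file, is never cached.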

 

Local History Archive Using nginx

To publish a local history archive using nginx:

  • Add a history configuration stanza to your /etc/stellar/stellar-core.cfg:

  • Example

[HISTORY.local]
get="cp /mnt/xvdf/stellar-core-archive/node_001/{0} {1}"
put="cp {0} /mnt/xvdf/stellar-core-archive/node_001/{1}"
mkdir="mkdir -p /mnt/xvdf/stellar-core-archive/node_001/{0}"
  • Run new-hist to create the local archive

# sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg new-hist local

This command creates the history archive structure:

  • Example

# tree -a /mnt/xvdf/stellar-core-archive/
/mnt/xvdf/stellar-core-archive
└── node_001
    ├── history
    │   └── 00
    │       └── 00
    │           └── 00
    │               └── history-00000000.json
    └── .well-known
        └── stellar-history.json

6 directories, 2 files
  • Configure a virtual host to serve the local archive (Nginx)

  • Example

server {
  listen 80;
  root /mnt/xvdf/stellar-core-archive/node_001/;

  server_name history.example.com;

  # do not cache 404 errors
  error_page 404 /404.html;
  location = /404.html {
    add_header Cache-Control "no-cache" always;
  }

  # do not cache history state file
  location ~ ^/\.well-known/stellar-history\.json$ {
    add_header Cache-Control "no-cache" always;
    try_files $uri =404;
  }

  # cache the rest of the history archive for 1 day
  # (nginx allows only one "location /" block per server, and
  # try_files needs a fallback as its last argument)
  location / {
    add_header Cache-Control "max-age=86400";
    try_files $uri =404;
  }
}

 

Amazon S3 History Archive

To publish a history archive using Amazon S3:

  • Add a history configuration stanza to your /etc/stellar/stellar-core.cfg:

  • TOML

[HISTORY.s3]
get='curl -sf http://history.example.com/{0} -o {1}' # Cached HTTP endpoint
put='aws s3 cp --region us-east-1 {0} s3://bucket.name/{1}' # Direct S3 access
  • Run new-hist to create the s3 archive

# sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg new-hist s3

  • Serve the archive using an Amazon S3 static site

  • Optionally place a reverse proxy and CDN in front of the S3 static site

  • Example

server {
  listen 80;
  root /srv/nginx/history.example.com;
  index index.html index.htm;

  server_name history.example.com;

  # use google nameservers for lookups
  resolver 8.8.8.8 8.8.4.4;

  # bucket.name s3 static site endpoint
  set $s3_bucket "bucket.name.s3-website-us-east-1.amazonaws.com";

  # do not cache 404 errors
  error_page 404 /404.html;
  location = /404.html {
    add_header Cache-Control "no-cache" always;
  }

  # do not cache history state file
  location ~ ^/\.well-known/stellar-history\.json$ {
    add_header Cache-Control "no-cache" always;
    proxy_intercept_errors on;
    proxy_pass  http://$s3_bucket;
    proxy_read_timeout 120s;
    proxy_redirect off;
    proxy_buffering off;
    proxy_set_header        Host            $s3_bucket;
    proxy_set_header        X-Real-IP       $remote_addr;
    proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header        X-Forwarded-Proto $scheme;
  }

  # cache history archive for 1 day
  location / {
    add_header Cache-Control "max-age=86400";
    proxy_intercept_errors on;
    proxy_pass  http://$s3_bucket;
    proxy_read_timeout 120s;
    proxy_redirect off;
    proxy_buffering off;
    proxy_set_header        Host            $s3_bucket;
    proxy_set_header        X-Real-IP       $remote_addr;
    proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header        X-Forwarded-Proto $scheme;
  }
}

 

Backfilling a history archive

Given the choice, it’s best to configure your history archive prior to your node’s initial sync to the network. That way your validator’s history is published as you join and sync to the network.

However, if you did not publish an archive during the node’s initial sync, it’s still possible to use the stellar-archivist command line tool to mirror, scan, and repair existing archives.

Using the SDF package repositories, you can install stellar-archivist by running apt-get install stellar-archivist.

The steps required to create a History archive for an existing validator — in other words, to upgrade a Basic Validator to a Full Validator — are straightforward:

  • Stop your stellar-core instance (systemctl stop stellar-core)

  • Configure a history archive for the new node

  • TOML

[HISTORY.local]
get="cp /mnt/xvdf/stellar-core-archive/node_001/{0} {1}"
put="cp {0} /mnt/xvdf/stellar-core-archive/node_001/{1}"
mkdir="mkdir -p /mnt/xvdf/stellar-core-archive/node_001/{0}"
  • Run new-hist to create the local archive

# sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg new-hist local

  • Example

# tree -a /mnt/xvdf/stellar-core-archive/
/mnt/xvdf/stellar-core-archive
└── node_001
    ├── history
    │   └── 00
    │       └── 00
    │           └── 00
    │               └── history-00000000.json
    └── .well-known
        └── stellar-history.json

6 directories, 2 files
  • Start your Stellar Core instance (systemctl start stellar-core)

  • Allow your node to join the network and watch it start publishing a few checkpoints to the newly created archive

  • Example

2019-04-25T12:30:43.275 GDUQJ [History INFO] Publishing 1 queued checkpoints [16895-16895]: Awaiting 0/0 prerequisites of: publish-000041ff

At this stage your validator is successfully publishing its history, which enables other users to join the network using your archive. They won’t be able to catch up with CATCHUP_COMPLETE=true, however, because the archive holds only part of the network’s history.

 

Complete History Archive

If you decide to publish a complete archive — which enables other users to join the network from the genesis ledger — it’s also possible to use stellar-archivist to add all missing history data to your partial archive, and to verify the state and integrity of your archive. For example:

  • Example

# stellar-archivist scan file:///mnt/xvdf/stellar-core-archive/node_001
2019/04/25 11:42:51 Scanning checkpoint files in range: [0x0000003f, 0x0000417f]
2019/04/25 11:42:51 Checkpoint files scanned with 324 errors
2019/04/25 11:42:51 Archive: 3 history, 2 ledger, 2 transactions, 2 results, 2 scp
2019/04/25 11:42:51 Scanning all buckets, and those referenced by range
2019/04/25 11:42:51 Archive: 30 buckets total, 30 referenced
2019/04/25 11:42:51 Examining checkpoint files for gaps
2019/04/25 11:42:51 Examining buckets referenced by checkpoints
2019/04/25 11:42:51 Missing history (260): [0x0000003f-0x000040ff]
2019/04/25 11:42:51 Missing ledger (260): [0x0000003f-0x000040ff]
2019/04/25 11:42:51 Missing transactions (260): [0x0000003f-0x000040ff]
2019/04/25 11:42:51 Missing results (260): [0x0000003f-0x000040ff]
2019/04/25 11:42:51 No missing buckets referenced in range [0x0000003f, 0x0000417f]
2019/04/25 11:42:51 324 errors scanning checkpoints

 

As you can tell from the output of the scan command, some history, ledger, transactions, and results are missing from the local history archive.
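To make the hexadecimal ranges in the scan output easier to read, here is a small sketch that enumerates checkpoint ledgers and derives the archive path of each file. It assumes a checkpoint every 64 ledgers (consistent with the 0x3f, 0x7f, 0xbf sequence in the output) and the category/xx/xx/xx/category-xxxxxxxx layout visible in the directory listing and repair log in this guide; the function names are hypothetical.

```python
# Illustrative helpers; the layout is inferred from the listings in this guide.
CHECKPOINT_FREQUENCY = 64  # assumption: one checkpoint every 64 ledgers


def checkpoints_in_range(first: int, last: int) -> list:
    """List checkpoint ledgers from first to last, inclusive."""
    return list(range(first, last + 1, CHECKPOINT_FREQUENCY))


def archive_path(category: str, ledger: int) -> str:
    """Build the relative path of a checkpoint file inside an archive."""
    name = f"{ledger:08x}"  # e.g. 63 -> "0000003f"
    ext = "json" if category == "history" else "xdr.gz"
    return f"{category}/{name[0:2]}/{name[2:4]}/{name[4:6]}/{category}-{name}.{ext}"


missing = checkpoints_in_range(0x3F, 0x40FF)
print(len(missing))                          # number of missing checkpoints
print(archive_path("history", missing[0]))   # first missing history file
print(archive_path("results", 0xEBF))
```

The range [0x0000003f-0x000040ff] contains 260 checkpoints, which matches the "(260)" count reported by the scan.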

You can repair the missing data using stellar-archivist’s repair command combined with a known full archive — such as the SDF public history archive:

# stellar-archivist repair http://history.stellar.org/prd/core-testnet/core_testnet_001/ file:///mnt/xvdf/stellar-core-archive/node_001/

  • Example
2019/04/25 11:50:15 repairing http://history.stellar.org/prd/core-testnet/core_testnet_001/ -> file:///mnt/xvdf/stellar-core-archive/node_001/
2019/04/25 11:50:15 Starting scan for repair
2019/04/25 11:50:15 Scanning checkpoint files in range: [0x0000003f, 0x000041bf]
2019/04/25 11:50:15 Checkpoint files scanned with 244 errors
2019/04/25 11:50:15 Archive: 4 history, 3 ledger, 263 transactions, 61 results, 3 scp
2019/04/25 11:50:15 Error: 244 errors scanning checkpoints
2019/04/25 11:50:15 Examining checkpoint files for gaps
2019/04/25 11:50:15 Repairing history/00/00/00/history-0000003f.json
2019/04/25 11:50:15 Repairing history/00/00/00/history-0000007f.json
2019/04/25 11:50:15 Repairing history/00/00/00/history-000000bf.json
...
2019/04/25 11:50:22 Repairing ledger/00/00/00/ledger-0000003f.xdr.gz
2019/04/25 11:50:23 Repairing ledger/00/00/00/ledger-0000007f.xdr.gz
2019/04/25 11:50:23 Repairing ledger/00/00/00/ledger-000000bf.xdr.gz
...
2019/04/25 11:51:18 Repairing results/00/00/0e/results-00000ebf.xdr.gz
2019/04/25 11:51:18 Repairing results/00/00/0e/results-00000eff.xdr.gz
2019/04/25 11:51:19 Repairing results/00/00/0f/results-00000f3f.xdr.gz
...
2019/04/25 11:51:39 Repairing scp/00/00/00/scp-0000003f.xdr.gz
2019/04/25 11:51:39 Repairing scp/00/00/00/scp-0000007f.xdr.gz
2019/04/25 11:51:39 Repairing scp/00/00/00/scp-000000bf.xdr.gz
...
2019/04/25 11:51:50 Re-running checkpoing-file scan, for bucket repair
2019/04/25 11:51:50 Scanning checkpoint files in range: [0x0000003f, 0x000041bf]
2019/04/25 11:51:50 Checkpoint files scanned with 5 errors
2019/04/25 11:51:50 Archive: 264 history, 263 ledger, 263 transactions, 263 results, 241 scp
2019/04/25 11:51:50 Error: 5 errors scanning checkpoints
2019/04/25 11:51:50 Scanning all buckets, and those referenced by range
2019/04/25 11:51:50 Archive: 40 buckets total, 2478 referenced
2019/04/25 11:51:50 Examining buckets referenced by checkpoints
2019/04/25 11:51:50 Repairing bucket/57/18/d4/bucket-5718d412bdc19084dafeb7e1852cf06f454392df627e1ec056c8b756263a47f1.xdr.gz
2019/04/25 11:51:50 Repairing bucket/8a/a1/62/bucket-8aa1624cc44aa02609366fe6038ffc5309698d4ba8212ef9c0d89dc1f2c73033.xdr.gz
2019/04/25 11:51:50 Repairing bucket/30/82/6a/bucket-30826a8569cb6b178526ddba71b995c612128439f090f371b6bf70fe8cf7ec24.xdr.gz
...

A final scan of the local archive confirms that it has been successfully repaired:

# stellar-archivist scan file:///mnt/xvdf/stellar-core-archive/node_001

  • Example

2019/04/25 12:15:41 Scanning checkpoint files in range: [0x0000003f, 0x000041bf]
2019/04/25 12:15:41 Archive: 264 history, 263 ledger, 263 transactions, 263 results, 241 scp
2019/04/25 12:15:41 Scanning all buckets, and those referenced by range
2019/04/25 12:15:41 Archive: 2478 buckets total, 2478 referenced
2019/04/25 12:15:41 Examining checkpoint files for gaps
2019/04/25 12:15:41 Examining buckets referenced by checkpoints
2019/04/25 12:15:41 No checkpoint files missing in range [0x0000003f, 0x000041bf]
2019/04/25 12:15:41 No missing buckets referenced in range [0x0000003f, 0x000041bf]

Start your stellar-core instance (systemctl start stellar-core), and you should have a complete history archive being written to by your full validator.

Running

Starting Stellar Core

Once you’ve set up your environment, configured your node, set up your quorum set, and selected archives to get history from, you’re ready to start Stellar Core.

Use a command equivalent to:

$ stellar-core run

At this point, you’re ready to observe your node’s activity as it joins the network.

You may want to skip ahead and review the logging section to familiarize yourself with Stellar Core’s output.

Interacting With Your Instance

When your node is running, you can interact with Stellar Core via an administrative HTTP endpoint. Commands can be submitted using command-line HTTP tools such as curl, or by running a command such as:

$ stellar-core http-command <http-command>

That HTTP endpoint is not intended to be exposed to the public internet. It’s typically accessed by administrators, or by a mid-tier application to submit transactions to the Stellar network.

See commands for a description of the available commands.
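For scripting against the administrative endpoint, a minimal Python sketch might look like the following. It assumes the endpoint listens on 127.0.0.1:11626 (check HTTP_PORT in your own configuration) and that the command you call returns JSON; the helper names are hypothetical, and only the standard library is used.

```python
import json
import urllib.request

# Assumption: local admin endpoint on port 11626; adjust to your HTTP_PORT.
BASE_URL = "http://127.0.0.1:11626"


def command_url(command: str, base_url: str = BASE_URL) -> str:
    """Build the URL for an administrative command such as "info"."""
    return f"{base_url.rstrip('/')}/{command.lstrip('/')}"


def run_command(command: str, base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Send a command to the endpoint and decode a JSON response."""
    with urllib.request.urlopen(command_url(command, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(command_url("info"))
```

With a node running locally, run_command("info") would be expected to return the kind of status document excerpted in the rest of this guide. Keep such scripts on the same host or a trusted network, since the endpoint is not meant to be public.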

Joining the Network

Your node will go through the following phases as it joins the network:

Establishing Connection to Other Peers

You should see authenticated_count increase.

  • JSON

"peers" : {
   "authenticated_count" : 3,
   "pending_count" : 4
},

Observing Consensus

Until the node sees a quorum, it will say:

  • JSON

"state" : "Joining SCP"

After the node observes consensus, a new quorum field displays information about network decisions. At this point the node switches to “Catching up”:

  • JSON

"quorum" : {
   "qset" : {
      "ledger" : 22267866,
      "agree" : 5,
      "delayed" : 0,
      "disagree" : 0,
      "fail_at" : 3,
      "hash" : "980a24",
      "missing" : 0,
      "phase" : "EXTERNALIZE"
   },
   "transitive" : {
      "intersection" : true,
      "last_check_ledger" : 22267866,
      "node_count" : 21
   }
},
"state" : "Catching up",

 

Catching up

This is a phase where the node downloads data from archives. The state will start with something like:

  • JSON

"state" : "Catching up",
"status" : [ "Catching up: Awaiting checkpoint (ETA: 35 seconds)" ]

It then goes through the various phases of downloading and applying state, such as:

  • JSON

"state" : "Catching up",
"status" : [ "Catching up: downloading ledger files 20094/119803 (16%)" ]
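If you want to track catchup from a script, the counters in a status line like the one above can be extracted with a regular expression. This is a sketch that assumes the "done/total (NN%)" shape shown in the example; other status messages may be worded differently, in which case the helper simply returns None.

```python
import re

# Matches counters such as "20094/119803 (16%)" in a status line.
_PROGRESS = re.compile(r"(\d+)/(\d+) \((\d+)%\)")


def catchup_progress(status_line: str):
    """Return (done, total, fraction), or None if the line has no counter."""
    match = _PROGRESS.search(status_line)
    if match is None:
        return None
    done = int(match.group(1))
    total = int(match.group(2))
    fraction = (done / total) if total else 0.0
    return done, total, fraction


line = "Catching up: downloading ledger files 20094/119803 (16%)"
print(catchup_progress(line))
```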

You can specify how far back your node goes to catch up in your config file. If you set CATCHUP_COMPLETE to true, your node will replay the entire history of the network, which can take weeks. SatoshiPay offers a parallel catchup script to speed up the process, but you only need to replay the complete network history if you’re setting up a Full Validator. Otherwise, you can specify a starting point for catchup using CATCHUP_RECENT. See the complete example configuration for more details.

Synced

When the node is done catching up, its state will change to:

  • JSON

"state" : "Synced!"
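The phases described in this section can be summarized in a small helper that labels a node's current phase. It is only a sketch: it assumes you pass in a dictionary containing the "state" and "peers" fields exactly as they appear in the excerpts above, and the returned labels are invented for illustration.

```python
# Hypothetical helper; field names are taken from the JSON excerpts in this guide.
def describe_phase(info: dict) -> str:
    """Map the state and peer count of a node to a human-readable phase."""
    state = info.get("state", "")
    peers = info.get("peers", {}).get("authenticated_count", 0)
    if state == "Synced!":
        return "synced"
    if state == "Catching up":
        return "catching up"
    if state == "Joining SCP":
        # Without authenticated peers the node cannot observe consensus yet.
        return "observing consensus" if peers > 0 else "connecting to peers"
    return "unknown"


print(describe_phase({"state": "Joining SCP", "peers": {"authenticated_count": 3}}))
print(describe_phase({"state": "Synced!"}))
```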

 

Logging

Stellar Core sends logs to standard output and to stellar-core.log by default. The log file location is configurable with LOG_FILE_PATH.

Log messages are classified by progressive priority levels: TRACE, DEBUG, INFO, WARNING, ERROR and FATAL. The logging system only emits those messages at or above its configured logging level.
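The "at or above" rule can be illustrated with a few lines of Python. This is a sketch of the filtering idea only, using the level names listed above in ascending priority; it does not reflect how Stellar Core implements logging internally.

```python
# Levels in ascending priority, as listed in this section.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "FATAL"]


def is_emitted(message_level: str, configured_level: str) -> bool:
    """True if a message at message_level passes the configured threshold."""
    return LEVELS.index(message_level.upper()) >= LEVELS.index(configured_level.upper())


def emitted_levels(configured_level: str) -> list:
    """All levels that are emitted for a given configured level."""
    return LEVELS[LEVELS.index(configured_level.upper()):]


print(emitted_levels("INFO"))
print(is_emitted("DEBUG", "INFO"))
```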

The log level can be set in the configuration file, with the -ll command-line flag, or adjusted dynamically through administrative (HTTP) commands. To change it while your system is running, run:

$ stellar-core http-command "ll?level=debug"

Log levels can also be adjusted on a partition-by-partition basis through the administrative interface. For example, to set the history system to DEBUG-level logging on a running system, run:

$ stellar-core http-command "ll?level=debug&partition=history"

The default log level is INFO, which is moderately verbose and should emit progress messages every few seconds under normal operation.

Validator maintenance

Maintenance here refers to anything that involves taking your validator temporarily out of the network (to apply security patches, system upgrades, etc.).

As an administrator of a validator, you must ensure that the maintenance you are about to apply to the validator is safe for the overall network and for your validator.

Safe means that the other validators that depend on yours will not be affected too much when you turn off your validator for maintenance and that your validator will continue to operate as part of the network when it comes back up.

If you are changing settings that may affect network-wide settings, such as the protocol version, review the section on network configuration.

If you’re changing your quorum set configuration, also read the section on special considerations during quorum set updates.

We recommend performing the following steps in order (repeat sequentially as needed if you run multiple nodes).

  1. Advertise your intention to others that may depend on you. Some coordination is required to avoid situations where too many nodes go down at the same time.

  2. Dependencies should assess the health of their quorum; refer to the section “Understanding quorum and reliability”.

  3. If there is no objection, take your instance down.

  4. When done, start your instance, which should then rejoin the network.

  5. The instance will be completely caught up when it’s both Synced and there is no backlog in uploading history.

Special considerations during quorum set updates

Sometimes an organization needs to make changes that impact others’ quorum sets:

  • taking a validator down for a long period of time

  • adding new validators to their pool

In both cases, it’s crucial to stage the changes to preserve quorum intersection and general good health of the network:

  • removing too many nodes from your quorum set before the nodes are taken down: if different people remove different sets of nodes, the remaining sets may not overlap between nodes, which can cause network splits

  • adding too many nodes to your quorum set at the same time: if not done carefully, those nodes can overpower your configuration

The recommended approach is for the entity that adds or removes nodes to make the change across its own nodes first, and then for others to reflect those changes gradually (over several rounds) in their quorum configurations.
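The risk of removing different sets of nodes can be illustrated with a deliberately simplified check: given the validators each operator still lists, report any pair of operators whose lists no longer share a node. This is not the quorum intersection analysis that Stellar Core performs (which also accounts for thresholds and transitive dependencies); the operator and node names are made up, and the sketch only shows why staged, coordinated changes matter.

```python
from itertools import combinations


def disjoint_pairs(validator_sets: dict) -> list:
    """Return pairs of operators whose listed validators do not overlap at all."""
    return [
        (name_a, name_b)
        for (name_a, set_a), (name_b, set_b) in combinations(sorted(validator_sets.items()), 2)
        if not set(set_a) & set(set_b)
    ]


# Uncoordinated removal: each operator drops a different half of {A, B, C, D}.
uncoordinated = {"org1": {"A", "B"}, "org2": {"C", "D"}}
# Staged removal: each operator drops one node per round, keeping an overlap.
staged = {"org1": {"A", "B", "C"}, "org2": {"B", "C", "D"}}

print(disjoint_pairs(uncoordinated))
print(disjoint_pairs(staged))
```

In the uncoordinated case the two operators end up with nothing in common, which is the kind of situation that can split the network; in the staged case an overlap is preserved at every step.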