
My Server Setup Part 5: Eliminating the Downtime

In this series of posts I am describing the infrastructure that hosts all of my websites and development projects.

Summary

How Nomad enables this solution

When a Docker-based Nomad job is started, a number of directories are created and auto-mounted into the containers. One of these, known as the alloc directory, is shared across all tasks in a group. This directory is what enables this solution to work.
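
As a rough sketch rather than an exact listing, the mounts inside each Docker task look something like this; the environment variables are what the job blocks below reference:

/alloc     # shared by every task in the group  ($NOMAD_ALLOC_DIR)
/local     # private to this task               ($NOMAD_TASK_DIR)
/secrets   # private, for sensitive material    ($NOMAD_SECRETS_DIR)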

Web server configuration

The setup still uses statigo as the web server. Since the last post a number of improvements have been made, specifically around metrics and logging, but fundamentally it’s still the small web server it was last time I discussed it.

The difference now is that the web server container is used without any changes: the stock server runs for every site hosted on Nomad. You can see this in the config section of the webserver task in the Nomad job:

config {
  image = "stut/statigo:v7"
  ports = ["http"]
  args = [
    "--root-dir", "${NOMAD_ALLOC_DIR}/www",
  ]
}

The root directory is set to the www folder inside the alloc directory. Nomad creates this directory when it creates the allocation, which is a group of tasks that must run together on the same node. Essentially this folder is shared across all tasks in the group.

In this case the group consists of two tasks: statigo and syncer. The statigo task is simply the web server pointed at that shared directory. The syncer task is where the “magic” happens.
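
Stripped down to a skeleton, the group looks something like the following. The group name and port definitions here are illustrative; only the task bodies shown throughout this post come from the real job.

group "blog" {
  network {
    port "http" {}
    port "http_syncer" {}
  }

  # Prestart sidecar: clones the repo into the shared alloc directory and keeps pulling.
  task "syncer" {
    driver = "docker"
    # ...
  }

  # Main task: serves ${NOMAD_ALLOC_DIR}/www.
  task "webserver" {
    driver = "docker"
    # ...
  }
}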

Syncer configuration

The syncer task runs a new component, oddly called syncer. This component is responsible for cloning a git repository and then periodically pulling changes. This mechanism is what now updates the content of the static website with next to zero downtime.

The syncer Docker container can be found on and used directly from Docker Hub, and the source is available on GitHub. The syncer Nomad task is configured as follows, broken up with some explanation.

task "syncer" {
  driver = "docker"

  lifecycle {
    hook = "prestart"
    sidecar = true
  }

The lifecycle settings are important: they ensure the syncer starts before the other task(s) in the group, and specifically that the initial clone is done before the statigo task is started. If that were not the case the root directory for statigo would not exist when it starts, so it would fail. Setting sidecar = true keeps the syncer running alongside statigo so it can continue pulling updates.

  config {
    image = "stut/syncer:v7"
    ports = ["http_syncer"]
  }

Nothing too complicated here. As with statigo, we get the version for syncer from a Terraform local variable.
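
For reference, the version pinning on the Terraform side looks roughly like this; the variable names are illustrative rather than lifted from the real configuration.

locals {
  statigo_version = "v7"
  syncer_version  = "v7"
}

# Interpolated into the jobspec when it is rendered, e.g.
#   image = "stut/syncer:${local.syncer_version}"

The next block in the task is the environment template.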

  template {
    data = <<EOF
SYNCER_SOURCE="git@github.com:stut/blog.git"
SYNCER_DEST="{{ env "NOMAD_ALLOC_DIR" }}/www"
SYNCER_GIT_BRANCH="deploy"
SYNCER_SSH_KEY_FILENAME="{{ env "NOMAD_SECRETS_DIR" }}/ssh_key"
SYNCER_UPDATE_INTERVAL="1h"
EOF
    destination = "local/env"
    env         = true
    change_mode = "restart"
  }

This sets the environment variables that configure the syncer instance; they should be self-explanatory. One thing to note in this configuration is that the branch, deploy, only contains the output from running Jekyll on the repository. This is a GitHub action covered later in this article. In your configuration the site to serve may be in a sub-directory of the repository, in which case you'll need to adjust the root directory passed to statigo.
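
For example, if the deploy branch kept the built site under a public/ sub-directory (a hypothetical layout, not what this blog uses), the statigo args would become something like:

args = [
  "--root-dir", "${NOMAD_ALLOC_DIR}/www/public",
]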

The change_mode is largely unnecessary since none of the variables used will change unless the task is already being restarted, but I’ve put it there just in case I ever move these variables into Consul or Vault.

  template {
    data = <<EOF
[INSERT SSH PRIVATE KEY HERE, IDEALLY FROM A SECRETS MANAGEMENT SYSTEM SUCH AS VAULT]
EOF
    destination = "secrets/ssh_key"
    change_mode = "restart"
  }

This is the private key that will be used when cloning the git repository. If the repository does not require it (i.e. it's a public repo) it can be omitted, but don't forget to remove the SSH key setting from the environment template above.
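
For a public repository the environment template might shrink to something like this, with an HTTPS URL and no SSH key variable; a sketch rather than something I've run.

  template {
    data = <<EOF
SYNCER_SOURCE="https://github.com/stut/blog.git"
SYNCER_DEST="{{ env "NOMAD_ALLOC_DIR" }}/www"
SYNCER_GIT_BRANCH="deploy"
SYNCER_UPDATE_INTERVAL="1h"
EOF
    destination = "local/env"
    env         = true
    change_mode = "restart"
  }

With that aside out of the way, the rest of the task covers resources and logging.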

  resources {
    cpu    = 4
    memory = 10
  }

  logs {
    max_files     = 1
    max_file_size = 10
  }

Resource requirements are pretty low for this task, but be sure to check the logs when you run it for the first time. The memory in particular is determined primarily by the amount required to perform the initial clone. For a large repository this may need to be increased.

service {
  port = "http_syncer"
  meta {
    metrics_port = "${NOMAD_HOST_PORT_http_syncer}"
  }
  check {
    type     = "http"
    port     = "http_syncer"
    path     = "/health"
    interval = "1m"
    timeout  = "2s"
  }
}

Finally we configure a service and associated health check. We add a meta variable that allows Prometheus within the cluster to automatically pick up the new metrics source, and the syncer's HTTP server responds successfully to requests for /health so that path can be used as the health check.

Group startup process

The Nomad user interface displays the group lifecycle in a pretty straightforward manner:

The prestart tasks start first, and only once they're running and healthy will the main tasks start. In this instance there are no poststart or poststop tasks, but you can imagine when and why those would be started.

For the syncer/statigo group this translates to the syncer starting and cloning the repo into the shared directory, then the statigo task is started with its root pointed at that directory.

Both the syncer and statigo have a health check endpoint, so if the syncer can't perform the initial clone the statigo process won't start, and if the statigo task can't access the shared directory for some reason it will fail to start.

Updates

The syncer process will periodically pull on the clone, which updates the website content. Due to the way git updates the working tree when pulling, this results in only milliseconds of "downtime" per changed file in the repo. This is exactly what we want!

What happens if…?

Experience so far

At the time of writing this arrangement has been running my static sites for a couple of weeks, and so far it's been working perfectly. If you can see this post, that's a testament to that!

There are certainly some parts of this architecture that could be improved and/or made more resilient but for now it works well enough for my purposes.

Tell me what you think

Contact me on Twitter (@stut) or by email.