A Jenkins Master, with a Jenkins Master, with a ...

How to set up Jenkins as an active/passive cluster in order to increase reliability and prevent downtime

by Sebastian Sucker, Gregor Jahn, Arne Schreiber | August 17, 2018

Some time ago, when kicking off a new project to integrate yet another Jenkins-based CI setup for one of our customers, they asked us for a rather unusual feature: a second Jenkins master the system can fail over to in case of malfunction, which would ultimately reduce the risk of downtime. What we came up with, why we got there and how we implemented it is what we're sharing here with you.

NOTE: The contents of this post assume CentOS/RHEL to be the underlying operating system

Buzzword Bingo

Minimizing the risk of downtime typically aims to prevent at least one of these two situations:

(A) an instance running some service(s) reaching its load capacity
(B) an instance or service either malfunctioning or stopping completely

Introducing redundancy to the equation is a common approach to address those cases. It is done by adding instances to the system that have the same capabilities as the existing one. The result is usually referred to as a cluster. These clustered machines are then supposed to act together as one. But to ensure high availability, or even make this system scale horizontally, further effort is needed, either programmatically or manually.

One approach would be to integrate the existing setup with some orchestration technology. This is basically another instance sitting next to the cluster, taking care of dynamic resource allocation and management according to a set of constraints. But this naturally increases complexity, and it was not yet clear whether that complexity was necessary. So, in this case, we decided to start rather simple by just setting up a second Jenkins master node and putting it right next to the existing one.

Challenges

Jenkins, as an application, is stateful. Its state is stored directly on the filesystem (${JENKINS_HOME}). Due to the nature of how a filesystem works, adding a second Jenkins master node might raise the problem of concurrent file access. As of now, Jenkins is neither aware of such an issue nor does it address it. It simply assumes that it is the only process writing to its home directory, and thus has no ability to run as a cluster out of the box. Even its frontend (webserver & web UI) is tightly coupled to the rest of the software and not able to run separately. Its agents, however, might continue to function until their master comes back online.

In that sense, the approach of distributing workload provided by those agents, in a way, already implements concepts similar to technologies that address topics like cluster orchestration. This can be illustrated, for example, by the following responsibilities a master has:

- delegating jobs & monitoring statuses
- functioning like an agent (running jobs by itself)
- aggregating build logs
- storing artifacts

One might even be tempted to say that the master is the “etcd” of Jenkins, but instead of utilizing a storage middleware, Jenkins directly interacts with the filesystem API to store its state. As a consequence, though, concepts like distribution & split-brain prevention, which are no-brainers to etcd, needed to be addressed and added for Jenkins.

Two different approaches derive from that:

(A) migrate the Jenkins state into a database and write a middleware/translator that also takes care of reading/writing from multiple sources
(B) make the Jenkins home directory, which represents the entire state, available to all master nodes

Obviously, the first one (A) is a very complex task and affects the very foundation of Jenkins. It would result in a significant amount of organisational work and ongoing work as a maintainer of such a contribution. That is why we opted for the latter (B). But it's also far from being as trivial as it may seem at first. Due to the concurrent file access problem mentioned above, it needs to be ensured that only one master instance at a time is writing to the home directory. Otherwise the whole Jenkins setup might end up in an inconsistent state.

So, let's take a look at how we tried to solve that issue in order to implement a highly available Jenkins setup.

Ingredients

We identified two cornerstones to make this work. The first one is a distributed, or at least remote/shareable, persistence layer. The idea here is to provide a more flexible way for Jenkins master nodes to access the state. Depending on the technology at hand, it might also allow us to utilize data duplication as a passive backup strategy. With this approach we are inherently bound to an active/passive implementation. That means only one of the available master nodes hosts the running Jenkins service; everything else is on cold standby. The other implementation, called active/active (hot standby), ultimately boils down to load balancing and is not an option for the reasons described above.

Going down the cold-standby road inevitably leads us to the second ingredient, something an orchestration software already has built in: a component that monitors and controls all Jenkins master nodes. To be more specific, it needs to (1) monitor and measure the health state of each node, and (2) if required, switch over automatically to another available node, hence failover.

Architecture

Moving from a rather theoretical and abstract perspective to a more concrete architectural design: ideally, the whole setup runs in a dedicated sub-net, encapsulated and with a proxy node in front of everything, which can be used as a jump host and accommodate the monitoring/controlling unit as well. The minimum number of Jenkins master nodes that quorum demands is two. Because of the active/passive design, adding more Jenkins nodes to the setup would merely increase expenses without having an actual effect on the level of reliability.

The last missing piece is the persistence layer for the ${JENKINS_HOME} directory, which is going to be provided as network storage: easily made available to any node in the same network, mounted into the local filesystem, and accessible out of the box through standard commands. Although different implementations have different features, it essentially comes down to the following three design options:

(A) one node provides the entire volume
(B) two nodes enable general redundancy; either each of them stores a complete replica of the volume, or it gets distributed among them so that every node only contains a fraction of Jenkins' state
(C) one storage node as well as every Jenkins master node stores a replica; the volume is mounted through localhost on each master node

Jenkins HA Setup - Concept

As of now, we went with option (B), because it provides some flexibility in case the persistence layer must be relocated to get shared among multiple setup instances. Also, it's nicely self-contained, encapsulating all the data it needs. But there is some elegance to (C) as well, because if the storage software is able to detect whether the volume is actually on the same node, the network overhead could be saved.

In this project, scalability is not so much a technical question, but more of an organisational concern. So, instead of adding more machines and splitting the load across them once the capacity of one setup instance is reached, a new instance is created and the subsequent projects will be hosted on that one.

Implementation

The first task was to set up a proper development environment that bootstraps a minimal testable cluster on a local machine. We wanted to be able to test our ideas very quickly and then eventually get to a stable state by improving iteratively, so we went with Vagrant and VirtualBox. Not only because it's easy to use and widely adopted, but primarily because it's very convenient to just throw everything away and start from scratch without burning any resources. With that, one can pick almost any configuration management tool. As long as we were in a phase of prototyping and experimenting we wanted to avoid unnecessary overhead, which resulted in Bash being the lowest common denominator for implementing the bootstrapping; it also builds a good foundation when migrating to Salt modules, recipes, or resources later on. And with hard-coded version pinning we got at least a bit of determinism and reproducibility, e.g. in case of debugging.
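
To illustrate the idea of hard-coded version pinning in such a Bash bootstrapper, here is a minimal sketch of what a Vagrant shell provisioner for a master node might look like. The package versions are placeholders, not the ones we actually used; the repository URLs follow the standard Jenkins RPM installation instructions.

```bash
#!/usr/bin/env bash
# bootstrap-master.sh -- hypothetical Vagrant shell provisioner for a Jenkins master node
set -euo pipefail

# Pin versions explicitly so every "vagrant up" produces the same result (placeholder versions)
JENKINS_VERSION="2.138.1"
GLUSTERFS_VERSION="4.1.5"

# Add the Jenkins upstream repository for CentOS/RHEL
curl -fsSL -o /etc/yum.repos.d/jenkins.repo https://pkg.jenkins.io/redhat-stable/jenkins.repo
rpm --import https://pkg.jenkins.io/redhat-stable/jenkins.io.key

# Install the pinned packages
yum install -y "jenkins-${JENKINS_VERSION}" \
               "glusterfs-fuse-${GLUSTERFS_VERSION}" \
               java-1.8.0-openjdk

systemctl enable jenkins
```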

GlusterFS became the storage technology of choice. From a user perspective, it feels like any other network storage technology (e.g. NFS), except the mount-type is glusterfs and high availability is built right in. This is done by introducing another abstraction layer called a brick. A mounted volume consists of one or more bricks distributed across nodes. Those bricks can store anything from just some parts of the volume contents up to a complete replica. The GlusterFS client acts as a proxy itself and uses FUSE underneath to mount Gluster volumes. If a brick, as part of a volume, is not available, the client seamlessly redirects all requests to those bricks that are still online.
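
As a rough sketch of what that looks like in practice, the following commands create a two-node replicated volume and mount it as ${JENKINS_HOME}. The host names, volume name and paths are made up for illustration.

```bash
# On one of the storage nodes: form the trusted pool and create a replicated volume
# (storage1/storage2 and the brick paths are hypothetical)
gluster peer probe storage2
gluster volume create jenkins_home replica 2 \
    storage1:/data/glusterfs/jenkins/brick1 \
    storage2:/data/glusterfs/jenkins/brick1
gluster volume start jenkins_home

# On the active Jenkins master: mount the volume via the FUSE client
mount -t glusterfs storage1:/jenkins_home /var/lib/jenkins

# Or persist the mount in /etc/fstab
echo "storage1:/jenkins_home /var/lib/jenkins glusterfs defaults,_netdev 0 0" >> /etc/fstab
```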

The proxy node mentioned earlier only has to host an HAProxy installation and therefore doesn't need a lot of resources. Because of its straightforward and readable configuration format, HAProxy was chosen over Nginx, although they are interchangeable, since both support the necessary features (reverse proxying and load balancing). The primary responsibility of the proxy is to provide a static entry point, no matter which Jenkins master node with its own IP is currently running, namely for the web UI and the JNLP port.
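
A minimal sketch of such an HAProxy configuration could look like the following; the node names, IPs, ports (8080 for the web UI, 50000 for JNLP) and timeout values are assumptions for illustration.

```bash
# Sketch of /etc/haproxy/haproxy.cfg on the proxy node: one static entry point for the
# web UI and one for the JNLP port, with health checks deciding which master gets traffic.
cat > /etc/haproxy/haproxy.cfg <<'EOF'
global
    daemon
    maxconn 256

defaults
    timeout connect 10s
    timeout client  5m
    timeout server  5m

frontend jenkins_http
    mode http
    bind *:80
    default_backend jenkins_masters_http

frontend jenkins_jnlp
    mode tcp
    bind *:50000
    default_backend jenkins_masters_jnlp

backend jenkins_masters_http
    mode http
    option httpchk GET /login
    server master1 10.0.0.11:8080 check
    server master2 10.0.0.12:8080 check backup

backend jenkins_masters_jnlp
    mode tcp
    server master1 10.0.0.11:50000 check
    server master2 10.0.0.12:50000 check backup
EOF

systemctl restart haproxy
```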

In order to make this work, aside from the usual HAProxy configuration, at least two parameters in Jenkins itself need to be configured:

1. ${JENKINS_HOME}/config.xml --> <slaveAgentPort/> (JNLP)  
2. ${JENKINS_HOME}/jenkins.model.JenkinsLocationConfiguration.xml --> <jenkinsUrl/> (publicly exposed IP of the proxy node or a fully qualified domain name resolving to the proxy)
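
A hypothetical way to pin both values from a bootstrap script is sketched below; the JNLP port and the URL are placeholders, and Jenkins needs a restart for them to take effect.

```bash
JENKINS_HOME=/var/lib/jenkins

# Fix the JNLP port so HAProxy can forward agent traffic to a known port
sed -i 's|<slaveAgentPort>.*</slaveAgentPort>|<slaveAgentPort>50000</slaveAgentPort>|' \
    "${JENKINS_HOME}/config.xml"

# Point Jenkins at the proxy's public address instead of the master's own IP
sed -i 's|<jenkinsUrl>.*</jenkinsUrl>|<jenkinsUrl>https://jenkins.example.com/</jenkinsUrl>|' \
    "${JENKINS_HOME}/jenkins.model.JenkinsLocationConfiguration.xml"
```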

TLS termination can be handled either by HAProxy, which has the disadvantage that unencrypted web traffic is sent between the proxy and the Jenkins master node, or by the active master node itself, which enables end-to-end transport encryption. The drawback here is that the private key for the certificate needs to be copied over to all existing master nodes (e.g. through the shared ${JENKINS_HOME}), and some effort is required to convert the certificate chain and key from PEM format into a Java KeyStore and make it accessible to Jenkins (${JENKINS_HTTPS_KEYSTORE}).
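
For the second variant, the conversion might look roughly like this; the file names, alias and passwords are placeholders, and the exact Jenkins startup options depend on the distribution's service configuration.

```bash
# Bundle the PEM certificate chain and private key into a PKCS#12 container
openssl pkcs12 -export \
    -in /etc/pki/tls/certs/jenkins-fullchain.pem \
    -inkey /etc/pki/tls/private/jenkins.key \
    -name jenkins -out /tmp/jenkins.p12 -passout pass:changeit

# Import it into a Java KeyStore that lives on the shared volume
keytool -importkeystore \
    -srckeystore /tmp/jenkins.p12 -srcstoretype PKCS12 -srcstorepass changeit \
    -destkeystore /var/lib/jenkins/jenkins.jks -deststorepass changeit

# Point Jenkins at the keystore, e.g. via /etc/sysconfig/jenkins on CentOS/RHEL:
#   JENKINS_HTTPS_PORT="8443"
#   JENKINS_HTTPS_KEYSTORE="/var/lib/jenkins/jenkins.jks"
#   JENKINS_HTTPS_KEYSTORE_PASSWORD="changeit"
```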

By using a proxy node, connecting Jenkins agents to a master becomes trivial again. Any available option to make an agent known to the master works out of the box. In case the whole setup is encapsulated in a sub-network, the traffic to establish a connection could actually be routed through the proxy's local network interface and would therefore remain inside. It also works with the Swarm Plugin, for example, which we used to enable Jenkins agent nodes to discover the master on their own. This way, it was no effort to include them right from the beginning, when we started to automate the bootstrapping process. In general, we were not concerned about Jenkins agents working with our design, but it was very helpful when testing some basic behaviour on the fly.
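
A sketch of how such a self-registering agent could be started is shown below; the download URL, version, host name and credentials are placeholders, and in practice an API token should be preferred over a plain password.

```bash
# Fetch the Swarm client (version is a placeholder) and connect to the master through the proxy
curl -fsSL -o swarm-client.jar \
    https://repo.jenkins-ci.org/releases/org/jenkins-ci/plugins/swarm-client/3.14/swarm-client-3.14.jar

java -jar swarm-client.jar \
    -master http://jenkins.example.com \
    -username agent-user \
    -password "${AGENT_TOKEN}" \
    -name build-agent-01 \
    -executors 2 \
    -labels "linux docker"
```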

Schrödinger’s Filesystem

Getting a Jenkins master running on top of a ${JENKINS_HOME} directory mounted with type = glusterfs was pretty straightforward, but after putting an HAProxy node in front of the master, we encountered an odd issue. Whenever somebody asked Jenkins to fetch the list of all available plugins, the browser would show 504 - Gateway Time-out after some time. First, we suspected latencies in the network routing, depending on the environment the setup was running in, which at that time was a local development environment on our machines and a VPC hosted on AWS in Ireland (EU West 1). After increasing the timeout parameters in the HAProxy configuration, the time-out error stopped occurring, though the loading time remained unusually high. We tried to remove HAProxy from the equation, but it had no impact on the loading time.

Eventually, the waiting time for installing new plugins became very annoying, and we had another look into this issue. Neither traceroute nor curl http://updates.jenkins.io/update-center.actual.json showed any sign of delay similar to how the web interface responded. So, networking was not the reason. While working on the failover mechanics, we, almost coincidentally, discovered that the glusterfs process spikes in its CPU usage while Jenkins is retrieving the list of available plugins. This would also explain the noticeable difference in loading time between the local setup and AWS, because those nodes have significantly more resources. So, whatever glusterfs is computing, bigger and more CPUs need less time for it. Yet another symptom, sure, but we were getting closer.

We obtained some insight with gluster's monitoring tools, tried out various mount options for GlusterFS and FUSE, and took a deeper look into Jenkins' codebase to find out what exactly happens when the plugin list gets requested, all without any luck. So, we went for a more systematic approach and mounted the volume with the glusterfs CLI directly in debug mode, to actually see what's going on. We also thought of switching to nfs as the mount-type, which GlusterFS supports as well. But then we would lose the proxy feature of GlusterFS in case a brick becomes unavailable.

When stepping through the log output, a lot of messages included the error file or directory does not exist. Then, suddenly, it dawned on us. It seems that GlusterFS tries to process and synchronise all the temporary files Jenkins creates and changes during this request, but by the time it does, those filesystem entries don't exist anymore. In other words: the Gluster client, or more specifically the FUSE translator, can't keep up with Jenkins, and therefore with the Java engine, but it doesn't stop trying, which the client is not handling very well. To solve this, our thinking was to "slow down" GlusterFS, meaning to delay the recognition of changes in the filesystem. We found a mount parameter in the advanced options for the FUSE kernel module called negative-timeout, whose default is 0. And voilà, increasing the value to something greater than 0 was the fix we were searching for. With that, an unfounded assumption turned into a strong speculation that still holds today.
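
For reference, a sketch of how the option can be set, both for a manual debug mount and as a regular mount option; the server, volume name and the value of 1 second are illustrative, not tuned recommendations.

```bash
# Debug mount via the glusterfs CLI to watch what the FUSE client is doing
glusterfs --volfile-server=storage1 --volfile-id=jenkins_home \
          --negative-timeout=1 --debug /mnt/jenkins-debug

# Regular mount with the same option
mount -t glusterfs -o negative-timeout=1 storage1:/jenkins_home /var/lib/jenkins
```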

Making it failoverable

The active/passive implementation, which we went for, comes with the main challenge of ensuring that the Jenkins master nodes don't end up in a split-brain state, or, to be more precise, that the ${JENKINS_HOME} directory doesn't get corrupted. This means we need to implement a completely deterministic procedure that prevents a master node, which might have been active before and is now presumably dead, from changing the state (aka the contents of the ${JENKINS_HOME} directory) any further.

For the first approach we initially developed a small piece of software that simply checks whether the Jenkins web interface of the master in question is available via HTTP. If not, it ensures that the Jenkins process on the unresponsive master node gets stopped. Afterwards it starts the process on the currently inactive master node. This was great to showcase the design as a proof of concept, but that implementation was far from being production-ready. We were looking for more robust and feature-rich tooling which we wouldn't need to maintain ourselves. Through the Linux-HA project we discovered ClusterLabs. It provides a stack of tools that can be used to clusterize any arbitrary service, and, originating from Linux-HA, it has a long history and a proven track record in production operation.
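
The proof of concept boiled down to something like the following loop; the host names, SSH access and interval are hypothetical, and it obviously ignores all the hard parts (fencing, unmounting the shared volume, avoiding flapping).

```bash
#!/usr/bin/env bash
# poc-failover.sh -- naive active/passive watchdog (illustration only, not production-ready)
set -euo pipefail

ACTIVE="master1"     # node currently expected to run Jenkins (hypothetical host names)
PASSIVE="master2"

while true; do
    # Consider the active master healthy if its login page answers within 10 seconds
    if ! curl -fsS --max-time 10 "http://${ACTIVE}:8080/login" > /dev/null; then
        echo "$(date -Is) ${ACTIVE} unresponsive, failing over to ${PASSIVE}"

        # Make sure the old master really stops writing to ${JENKINS_HOME} ...
        ssh "root@${ACTIVE}" "systemctl stop jenkins" || true

        # ... before the standby node takes over
        ssh "root@${PASSIVE}" "systemctl start jenkins"

        # Swap roles for the next iteration
        tmp="${ACTIVE}"; ACTIVE="${PASSIVE}"; PASSIVE="${tmp}"
    fi
    sleep 30
done
```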

One of the two core components is Pacemaker. It acts as the resource manager, supports various cluster architecture designs, and takes care of stopping rogue nodes or activating new ones. This is done over a messaging layer, which is the second core component. This layer not only provides communication channels across all nodes, it also supplies the resource manager with information reflecting each node's state. Based on this data, Pacemaker can then take appropriate action if necessary. ClusterLabs supports two implementations of the messaging layer: Heartbeat and Corosync. Whilst the former has been growing old, the latter has become the default.

As a counterpart to the resource manager, and located on the cluster nodes, there are the resource agents. They implement an interface to a specific stateful part of a node and abstract a small set of functionality, e.g. to gather information about the resource or to start/stop it. As the starting point in the project context we combined a systemd-based service resource for Jenkins with a filesystem resource for the shared GlusterFS storage. To integrate the web UI check from the early proof of concept we were thinking about creating a small custom resource that would reside either on the HAProxy node or on the master's localhost.
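
On CentOS/RHEL the Pacemaker/Corosync stack is usually driven through pcs; a sketch of the two resources described above, plus the constraints tying them together, might look like this (resource names, the storage device and paths are assumptions).

```bash
# Filesystem resource: mount the GlusterFS volume that holds ${JENKINS_HOME}
pcs resource create jenkins_home ocf:heartbeat:Filesystem \
    device="storage1:/jenkins_home" directory="/var/lib/jenkins" fstype="glusterfs"

# Service resource: the Jenkins systemd unit itself
pcs resource create jenkins_service systemd:jenkins

# Both resources have to run on the same node, and the mount must be up before Jenkins starts
pcs constraint colocation add jenkins_service with jenkins_home INFINITY
pcs constraint order jenkins_home then jenkins_service
```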

The ClusterLabs stack also comes with its own tool to share and sync filesystem state across nodes. But since in our case the decision on that matter had already been made, not to forget the existing experience with GlusterFS, we left that tool, called DRBD, aside. Though, it might be (at least) equally suitable, because it comes with its own kernel module instead of relying on FUSE. Furthermore, we found a resource agent called ocf:heartbeat:IPaddr2. This makes it possible to implement a floating cluster IP, which could then replace one major purpose of our HAProxy node.
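
Such a floating IP is essentially a one-liner with pcs; the address and netmask below are made up, and the IP would have to be colocated with the Jenkins resource to actually follow the active master.

```bash
# Virtual IP that always points at whichever node currently runs Jenkins (address is illustrative)
pcs resource create jenkins_vip ocf:heartbeat:IPaddr2 ip=10.0.0.100 cidr_netmask=24

# Keep the IP on the same node as the Jenkins service
pcs constraint colocation add jenkins_vip with jenkins_service INFINITY
```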

As mentioned before, by design the state of a Jenkins service is reflected in ${JENKINS_HOME}. This means that whatever phase the service is in, it can be transitional - thus unstable. So, we not only have to ensure that just one single master node is writing to the filesystem, but more importantly we must initiate the activation of another node only after being certain that the Jenkins service on the former node has been stopped successfully and completely. This leads us to a feature known by the beautiful acronym STONITH ("Shoot The Other Node In The Head", aka fencing). The ClusterLabs stack supports that mechanism, but if you are in an IaaS environment you might gain rather little from implementing it, because most of the building blocks there are software-defined, the network for example. If the traffic doesn't reach the active node anymore because some network parts are malfunctioning, the node is declared unavailable from the resource manager's perspective. Before enabling another node, fencing tasks get triggered, but they'll never reach the node in question. That being said, there is a great number of fence agents available to access the outer layer, e.g. cloud providers like AWS or Google Cloud Engine as well as all major hypervisors, and many more. They use the available APIs in order to perform actions reflecting what STONITH means in that specific context. While this approach might be sufficient in most cases, if you ultimately want to make sure that the presumably unavailable node doesn't keep changing the persisted state, you need to separate the STONITH-related infrastructure physically from the cluster environment (e.g. a dedicated device for IPMI).
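
As an illustration of such a cloud fence agent, a STONITH resource on AWS could be declared roughly as follows; the region, credential handling and instance mapping are placeholders and assume the fence_aws agent is installed.

```bash
# Fence agent that stops the misbehaving node via the EC2 API (all values are placeholders)
pcs stonith create fence_master1 fence_aws \
    region=eu-west-1 \
    access_key="${AWS_ACCESS_KEY_ID}" secret_key="${AWS_SECRET_ACCESS_KEY}" \
    pcmk_host_map="master1:i-0123456789abcdef0" \
    pcmk_host_list="master1"

pcs property set stonith-enabled=true
```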

For further information on ClusterLabs and its stack components, please have a look at the documentation. The Clusters from Scratch tutorial is also a great starting point and a nice way to get a feeling for what the stack can do for you.

Perspective / Future work

For this specific assignment and according to the requirements we were confronted with, scalability - unlike availability - was not one of them, at least not from a technical point of view. It was an issue better solved by establishing a formal process that allocates new resources on demand or based on the company structure. Meaning, instead of just adding more machines and splitting the load across them because some resource capacity has been reached, it's also conceivable to define other indicators, like the number of repositories a Jenkins setup is responsible for, or to give each company department its own instance, including the redundant masters, to preserve high availability within their own scope. And, using an FQDN for every setup instance would automatically define a nice and unique namespace.

A word about containers. When implementing redundancy nowadays, the usual approach would be to containerize everything and let the orchestration software handle it. For the project at hand we might have gone down that road if orchestration had already been in place. Instead, we just had a set of empty mid-sized VMs available. Admittedly, the ephemeral concept would be a convenient way to deal with broken Jenkins masters, but on the other hand this approach really shines when it comes to scalability, for which there was no need, as mentioned above. So, if the setup is designed to separate the components properly, there really is no need for orchestration and containers in this case. On the contrary, it would add unnecessary complexity. Though, equipping Jenkins agents to support containers makes sense, especially since most build jobs, in this context, need to run in a Windows context.

Even though larger problems often require custom solutions that fit right into their particular scope and environment, a commercial SaaS solution that is capable of handling most cases might already do the trick. CloudBees, one of the more prominent vendors, for example, offers increased reliability with a High Availability Plugin, which is accessible through their enterprise product. Another interesting development is Jenkins X. It's an extended Jenkins enabled to run as a cluster in Kubernetes (and thereby in OpenShift as well) with the help of some Kubernetes configurations, including components like a Docker registry. These kinds of movements show that open source projects are starting to acknowledge that native cluster support has become a mandatory ability.

A fully working prototype can be found on github.com. Please take a look at the README.md for further instructions on how to get started on your local machine.