Habitat Enterprise Packaging: CrateDB

June 22, 2017

Habitat is a new open source project built by the team at Chef. Habitat aims to bring together the different participants in a DevOps process by acting as a one-stop solution for building, deploying and configuring your applications.

Recently, Endocode has been working within the Habitat Community to develop support for crucial technologies that others may wish to use as dependencies in their own plans. In this blog post we describe how we went about building one of these enterprise plans: CrateDB.

“Enterprise readiness” is an intangible concept which needs to be defined carefully for each application domain; “enterprise-ready” does not mean the same thing for a database as it does for a CMS. In the case of CrateDB we aimed for the following:

  • Highly available
  • Resilient to node failure
  • Easily scalable
  • Health-checked
  • Identity-checked

This blog post will go into the specifics of how Endocode’s CrateDB package supports all this.

CrateDB

CrateDB is a distributed SQL database aimed at machine-data (particularly IoT) applications. Unlike traditional SQL databases, which scale vertically by being deployed on meatier hardware, CrateDB scales horizontally by being deployed across multiple machines, sharing the load between them.

Somewhat Masterless

In general, CrateDB follows a shared-nothing architecture: each node in a CrateDB cluster is equal to all others. This is one of CrateDB’s most subtly powerful features; the architecture is masterless (in DB terms) and so clients can concurrently read/write from any node. In most database technologies there are dedicated read/write “master” nodes that are a single point of failure in the architecture.

We say CrateDB is “somewhat” masterless because, while it is masterless for data read/writes, CrateDB elects a node to maintain the canonical version of the overall state; all other nodes simply keep their own copy.
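
To make this concrete, the short sketch below issues SQL against two different nodes via CrateDB’s HTTP endpoint; the node addresses are placeholders and the default HTTP port 4200 is assumed. A write accepted by one node is visible when reading from another.

# Write through one node, read through another (node1/node2 are placeholders,
# 4200 is CrateDB's default HTTP port)
curl -sS -XPOST 'http://node1:4200/_sql' \
  -d '{"stmt": "CREATE TABLE readings (id INT, value DOUBLE)"}'
curl -sS -XPOST 'http://node1:4200/_sql' \
  -d '{"stmt": "INSERT INTO readings (id, value) VALUES (1, 23.5)"}'

# CrateDB refreshes tables periodically; force a refresh so the read below
# is guaranteed to see the new row
curl -sS -XPOST 'http://node2:4200/_sql' \
  -d '{"stmt": "REFRESH TABLE readings"}'
curl -sS -XPOST 'http://node2:4200/_sql' \
  -d '{"stmt": "SELECT id, value FROM readings"}'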

The Habitat Plan

The Habitat plan file for CrateDB is largely unsurprising and there is not too much to comment on: we simply download the upstream tarball for CrateDB and then run it using the Habitat core JRE. That said, it is also in the plan that we check the identity of the tarball’s creator.
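
For orientation, here is a condensed sketch of roughly what such a plan.sh looks like. It is illustrative rather than the published plan: the version is a placeholder and core/server-jre is our assumed name for the Habitat core JRE package.

# Illustrative plan.sh sketch (not the full published plan)
pkg_name=crate
pkg_origin=endocode
pkg_version="x.y.z"   # placeholder; the real plan pins a specific CrateDB release
pkg_source="https://cdn.crate.io/downloads/releases/${pkg_name}-${pkg_version}.tar.gz"
pkg_shasum="8f22b6531b3d1c8602a880779bbe09e5295ef0959a30aff0986575835aadc937"
pkg_deps=(core/server-jre)   # assumed name for the Habitat core JRE dependency

do_build() {
  # Nothing to compile; the upstream release is shipped as-is
  return 0
}

do_install() {
  # Copy the unpacked release tree into the package prefix
  cp -r ./* "${pkg_prefix}/"
}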

Habitat And Source Trust

By default, Habitat knows it is packaging the correct thing because we provide the sha256 sum for the file we are downloading. For example, from the CrateDB plan:

pkg_source="https://cdn.crate.io/downloads/releases/${pkg_name}-${pkg_version}.tar.gz"
pkg_shasum="8f22b6531b3d1c8602a880779bbe09e5295ef0959a30aff0986575835aadc937"
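
If you are writing a plan of your own, this checksum can be computed locally before pinning it in the plan, for example:

# Download the release tarball and compute its SHA-256 digest
wget "https://cdn.crate.io/downloads/releases/crate-<version>.tar.gz"
sha256sum "crate-<version>.tar.gz"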

Pinning the checksum in this way ensures that we are packaging what we intended to package, but it does not help us with trust of the upstream source. Many projects use GnuPG to sign their source tarballs, thus proving the identity of the publisher. As packagers, we can likewise use GnuPG to ensure that the sources are trustworthy; if they are not, the package build will simply fail.

Implementing this takes a few extra steps in our plan. First, we must download the signature file for the downloaded source:

do_download() {
  # Download the source file, as usual
  do_default_download

  # Now also grab the signature for the source
  # Provide the checksum so that file does not get downloaded with every build
  download_file "https://cdn.crate.io/downloads/releases/${pkg_name}-${pkg_version}.tar.gz.asc" \
    		"${pkg_name}-${pkg_version}.tar.gz.asc" \
		"4e6007a35b99c0da75356cb6cd7aeafd7d380e1a5f5fa26b79a0dfa0a9898924"
}

Next, during the verification step, we can run GnuPG to test the sources:

do_verify() {
  # Firstly perform the standard checksum-based verification
  do_default_verify

  # Now verify the signature file
  verify_file "${pkg_name}-${pkg_version}.tar.gz.asc" \
              "4e6007a35b99c0da75356cb6cd7aeafd7d380e1a5f5fa26b79a0dfa0a9898924"
    
  # Now do the GPG-based verification
  build_line "Verifying crate-${pkg_version}.tar.gz signature"
  export GNUPGHOME="$(mktemp -d -p $HAB_CACHE_SRC_PATH)"
  gpg --keyserver ha.pool.sks-keyservers.net --recv-keys 90C23FC6585BC0717F8FBFC37FAAE51A06F6EAEB
  gpg --batch --verify ${HAB_CACHE_SRC_PATH}/${pkg_name}-${pkg_version}.tar.gz.asc \
                       ${HAB_CACHE_SRC_PATH}/${pkg_name}-${pkg_version}.tar.gz
  rm -r "$GNUPGHOME"
  build_line "Signature verified for ${pkg_name}-${pkg_version}.tar.gz"
}

If either the download of the signature or the checking of that signature fails, the overall build will fail, ensuring that the trustworthiness of the CrateDB sources has been preserved.

Using Habitat Topologies For Leader Election

Habitat supports different deployment topologies for services:

A topology describes the intended relationship between peers within a service group.

One such topology is the LEADER-FOLLOWER topology, which is specifically designed for the case where one peer in the service group has a special status. In the case of CrateDB, this is the node elected to hold the canonical state of the cluster.

Enabling this for CrateDB is achieved by starting it as follows…

  • On the first node:

    hab start endocode/crate --topology leader
    
  • On subsequent nodes:

    hab sup start endocode/crate --topology leader --peer <first node ip>
    

How Does This Work?

CrateDB has a configuration setting (node.master) which marks a node as electable as leader. Once three nodes have joined the cluster with this setting enabled, a leadership election can take place.

With our Habitat plan we do things a little differently. We include the following in the CrateDB config:

{{#if svc.me.leader ~}}
  node.master: true
{{else ~}}
  node.master: false
{{/if ~}}

As we start the CrateDB service using the LEADER topology, Habitat will wait for three peers to join the ring so that it can run its own leader election. Once that election is complete, all the Habitat supervisors write out their service configuration. The code snippet above ensures that only the Habitat leader node allows its CrateDB instance to be electable as leader. This guarantees that the Habitat leader node is always the CrateDB leader node, something which users of the CrateDB plan are likely to expect.
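
You can cross-check which node CrateDB itself currently considers the master by querying its sys tables; the sketch below assumes the default HTTP port 4200 and a reachable node address.

# Ask the cluster which node id holds the master role...
curl -sS -XPOST 'http://<any node ip>:4200/_sql' \
  -d '{"stmt": "SELECT master_node FROM sys.cluster"}'

# ...and map node ids to node names
curl -sS -XPOST 'http://<any node ip>:4200/_sql' \
  -d '{"stmt": "SELECT id, name FROM sys.nodes"}'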

Most importantly, if the state of the cluster changes (the leader node is lost, or additional nodes join the ring), Habitat will conduct a new election and, in the process, elect a new leader for CrateDB.

Scaling The Cluster

Once we have an initial cluster up-and-running we can scale it up/down to whatever size we need. Adding additional nodes to the cluster takes nothing more than starting a new instance of CrateDB inside a different supervisor and connecting the supervisor to the peer ring. Once this has been done, again, Habitat can take care of the rest.
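
For example, adding a fourth node is just a matter of running the same command as before on the new machine, pointing it at any node that is already part of the ring:

# On the new machine: start CrateDB and join the existing ring via any known peer
hab sup start endocode/crate --topology leader --peer <existing node ip>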

Typically, CrateDB only needs to know the IP address of one other node in the cluster to communicate with initially. However, for resiliency, it is best to provide as many IPs as possible. Again, we have handled this in the configuration of our CrateDB plan:

discovery.zen.ping.unicast.hosts:
{{#each svc.members ~}}
  - {{sys.ip}}
{{/each}}

This code snippet in our configuration ensures that the IP address of every node in the CrateDB cluster gets written into the configuration by the Habitat supervisor for each node.
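
You can see the result by inspecting the configuration file the supervisor renders for the service. The path below follows Habitat’s standard /hab/svc/<service>/config layout; crate.yml is an assumption about how the plan names the rendered configuration file.

# Show the discovery host list the supervisor rendered for this node
# (crate.yml is an assumed name for the rendered configuration file)
grep -A 5 'discovery.zen.ping.unicast.hosts' /hab/svc/crate/config/crate.yml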

After scaling the nodes in the cluster, it is highly likely that CrateDB will start displaying errors related to the total number of nodes in the cluster. Whilst benign for the day-to-day running of the database, it is best to update your configuration to make these errors stop. Once you have scaled your cluster and it is running correctly, it is possible to apply new configuration as follows:

HAB_CRATEDB="gateway.expected_nodes=<new node count>" hab config apply cratedb.default 1
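
Depending on your Habitat version, the same update can also be applied by piping a TOML snippet into hab config apply. This is a sketch: it assumes the plan exposes gateway.expected_nodes as a tunable setting, and the incarnation number just has to be higher than any previously applied one, so a timestamp is a convenient choice.

# Alternative: apply the new value as TOML read from stdin
echo 'gateway.expected_nodes = <new node count>' | \
  hab config apply cratedb.default "$(date +%s)"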

Conclusions

CrateDB takes an interesting new approach to the old problem of tabular data storage. By aiding with the deployment and configuration of CrateDB, Habitat makes an excellent partner technology.

For anyone wishing to play with CrateDB in the Habitat context, the details of how to use Endocode’s CrateDB package for Habitat can be found here.