From Rabbit to Pub/Sub

by Sebastian - February 23, 2022

From RabbitMQ to Google Pub/Sub

Endocode specializes in migrations to the Google Cloud Platform (GCP). An important and challenging part of any cloud migration is introducing cloud services to existing applications.

This post is a case study of how we migrated from RabbitMQ (Rabbit) to Google Pub/Sub (Pub/Sub).

Background

Rabbit and Pub/Sub are both message brokers. Message brokers play a key role in event-driven software architecture, which has been heavily embraced in recent years. Companies often dedicate significant time to choosing between the different brokers, and the final decision is often a once-in-an-application-lifetime call.

In software, as in life, what is hard to find is hard to let go of.

In the open-source space, Rabbit is a popular choice for self-hosting. It supports the two main messaging patterns: message queuing and publish/subscribe. In a simplified view, these are one-to-one and one-to-many delivery.

Pub/Sub is, as the name suggests, first and foremost a publish/subscribe message broker. Only in 2020 did GCP release support for message queuing, by exposing a field called ordering keys that allows Pub/Sub to deliver messages sharing the same key in order.

With this release, GCP covers Rabbit's basic message broker functionality, which makes Pub/Sub interesting for companies that depend on Rabbit but want a managed, scalable solution in a cloud environment.

Client specifications and investigation

Our client was in exactly this situation in early 2021. The company was moving from on-premises hosting to GCP, and its backend logistics group had a dependency on Rabbit.

Together with the client's engineers, we investigated how hard it would be to switch to Pub/Sub given the existing codebase's dependencies.

Blockers

The investigation concluded that the migration was not feasible because of two subtler dependencies on Rabbit that had not been considered initially and that the managed Pub/Sub service does not support: infinite retention of unacknowledged messages, and priority queues.

Retention

The retention problem is the easier of the two to understand. The managed Pub/Sub service has a maximum retention time of seven days. If messages are, for some reason, not acknowledged within seven days of their arrival, they are deleted.

Pub/Sub has a mechanism for dealing with this, called dead-letter queues. However, this mechanism is mutually exclusive with ordering keys, whose introduction was the reason Pub/Sub was considered in the first place.

The two are mutually exclusive because of a conflict between the guarantees Pub/Sub gives when ordering keys are used: at-least-once delivery and ordering of messages.

One might argue that this is not a problem in practice, since most messages will probably be consumed within seven days. That is true. However, because the messages are so sensitive for the client's operations, the risk of data loss was not acceptable.

Priority queues

The other issue has to do with the queue data structure. Rabbit queues are priority queues. This additional functionality matters when only some of the messages in a queue cannot be acknowledged. In that case, the priority queue unblocks automatically.

This automatic unblocking is crucial for supporting a larger number of queues.

Resolution

In short, two issues had to be resolved: mitigate the risk of losing not-acknowledged messages, and implement an unblocking mechanism for the queues.

If we look closer, we see that the two problems have a common solution: never not-acknowledge!

Technically, we have moved the problem from the message broker to the consumer, which now holds a message it doesn't know what to do with.

But now we are almost done, because we can simply write these messages to persistent storage with a fast write API, such as Google Datastore.
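The "never not-acknowledge" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the client's production code: `FakeMessage` and `ParkingStore` are hypothetical stand-ins for a Pub/Sub message and for Datastore, and `make_callback` is a made-up helper name. The point it shows is the order of operations: a message that cannot be processed is persisted first and acknowledged afterwards, so the broker never sees a nack.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a Pub/Sub message (only the parts used here)."""
    message_id: str
    data: bytes
    acked: bool = False

    def ack(self) -> None:
        self.acked = True


@dataclass
class ParkingStore:
    """Hypothetical stand-in for persistent storage such as Datastore."""
    rows: List[Dict] = field(default_factory=list)

    def put(self, message_id: str, data: bytes, reason: str) -> None:
        self.rows.append({"id": message_id, "data": data, "reason": reason})


def make_callback(
    handle: Callable[[bytes], None], store: ParkingStore
) -> Callable[[FakeMessage], None]:
    """Wrap a handler so that every message ends up acknowledged."""

    def callback(message: FakeMessage) -> None:
        try:
            handle(message.data)
        except Exception as exc:
            # Park the message before acking. If this write fails, the
            # exception propagates, the message stays unacknowledged and
            # the broker can redeliver it.
            store.put(message.message_id, message.data, repr(exc))
        message.ack()

    return callback
```

With this shape, a failing message no longer blocks the messages behind it in the queue, and it no longer depends on the broker's retention window, because it now lives in the store.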

A sequence diagram of the proposed solution looks something like this:

                ┌────────┐          ┌─────────┐          ┌──────┐          
                │consumer│          │datastore│          │pubsub│          
                └───┬────┘          └────┬────┘          └──┬───┘          
                    │   nacked-message   │                  │              
                    │ <──────────────────│                  │              
                    │                    │                  │              
                    │        nack        │                  │              
                    │ ──────────────────>│                  │              
                    │                    │                  │              
                    │                    │                  │              
      ╔═════════════╪══╤═════════════════╪══════════════════╪═════════════╗
      ║ BLOCKING CALL  │                 │                  │             ║
      ╟────────────────┘             message                │             ║
      ║             │ <──────────────────────────────────────             ║
      ║             │                    │                  │             ║
      ║             │        nack        │                  │             ║
      ║             │ ──────────────────>│                  │             ║
      ╚═════════════╪════════════════════╪══════════════════╪═════════════╝
                ┌───┴────┐          ┌────┴────┐          ┌──┴───┐          
                │consumer│          │datastore│          │pubsub│          
                └────────┘          └─────────┘          └──────┘          

Here the vertical axis is the lifetime of a consumer process.

The only thing we have to keep in mind is to maintain the ordering from Pub/Sub in Datastore. We did this by implementing a linked list, so that the not-acknowledged messages could be consumed in order whenever the application, which is hosted in Kubernetes, was rescheduled. The result is a priority queue with persistent storage.

The Datastore table of nacked messages would then have the following structure:

id    message    pointer
0     data       1
1     data       2
2     data       null
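The table above can be read as a singly linked list: each row points to the id of the row that arrived after it, and the last row points to nothing. A minimal sketch of that structure follows, assuming a plain dictionary in place of the Datastore table. The class name `NackedMessageList` and its methods are hypothetical, and a real implementation would also need to persist the head and tail ids.

```python
from typing import Dict, Iterator, Optional


class NackedMessageList:
    """Linked list of parked messages kept in a key-value table.

    A dict stands in for the Datastore table. Each row holds the message
    and a pointer to the id of the next row (None for the last row).
    """

    def __init__(self) -> None:
        self.rows: Dict[int, Dict] = {}
        self.head: Optional[int] = None
        self.tail: Optional[int] = None
        self._next_id = 0

    def append(self, message: bytes) -> int:
        """Store a message at the end of the list and return its row id."""
        row_id = self._next_id
        self._next_id += 1
        self.rows[row_id] = {"message": message, "pointer": None}
        if self.tail is None:
            self.head = row_id
        else:
            # Link the previous last row to the new one.
            self.rows[self.tail]["pointer"] = row_id
        self.tail = row_id
        return row_id

    def drain(self) -> Iterator[bytes]:
        """Yield messages in arrival order, removing each row as it is handed out."""
        while self.head is not None:
            row = self.rows.pop(self.head)
            self.head = row["pointer"]
            if self.head is None:
                self.tail = None
            yield row["message"]
```

Appending three messages produces exactly the pointer column shown in the table (1, 2, null), and `drain()` returns them in the order they were parked, which is what a rescheduled consumer needs in order to replay them.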

Aftermath

The Python solution that went into production was around 400 lines long, and it took the client's engineers a couple of weeks to introduce the new Google client libraries into the codebase and move the 20 queues across.

Sebastian

CTO @ endocode
