As soon as your technical team reaches a certain size, you usually start to divide your platform into specialized services (aka microservices) that a small dedicated team can manage individually. But those different services still need to work smoothly together to deliver your business goals.
In this article, I describe different approaches to service orchestration and explain why I think Apache Pulsar is an ideal choice for building a modern orchestration engine at scale, based on what we learned building Zenaton. If you are interested in following this work, please subscribe here.
Let’s consider a typical “flight+hotel booking” process with the following steps: checking out the travel cart (payment), booking the flight, then booking the hotel.
Each of those steps is processed by a different service and may (and will) fail for technical reasons (network issues, system failures…) or business reasons (fraud detection, inventory no longer available…). As is often the case, the main difficulties in an actual implementation come from those failures and the need to respond to them gracefully (retries, reimbursements, notifications, booking cancellations…).
A first implementation that comes to mind is to have a coordinator that calls each system and ensures that each step is orchestrated according to business needs. For example, a simple controller sequentially requesting each service through HTTP calls can do the trick. But it’s a brittle implementation, as an issue in any single service will propagate to the entire process.
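Here is a minimal sketch of that naive coordinator (the service URLs and JSON payload are hypothetical, just to make the failure mode concrete):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Naive sequential coordinator: each step blocks on the previous one.
// If any call fails or times out, the whole process fails and any
// partial work (e.g. a charged card) must be compensated by hand.
public class BookingController {
    private final HttpClient http = HttpClient.newHttpClient();

    public void book(String cartJson) throws Exception {
        post("https://payment.internal/checkout", cartJson); // step 1: charge the card
        post("https://flights.internal/book", cartJson);     // step 2: book the flight
        post("https://hotels.internal/book", cartJson);      // step 3: book the hotel
    }

    private void post(String url, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 400) {
            // A single failure breaks the whole chain: the brittleness described above.
            throw new IllegalStateException("Step failed: " + url + " -> " + response.statusCode());
        }
    }
}
```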
Also, when you start having a lot of services and processes, it becomes hard to keep track of who is using a particular service, making any update risky. And each orchestrator needs to know every service to connect to it directly (which servers? which APIs?).
The current practice is to implement an event-driven architecture. Each service is connected to a message bus (hello Kafka!), subscribes to some messages, and publishes others. For example, the payment service will subscribe to the “CheckoutTravelCart” command message and produce a new “PaymentProcessed” event. The latter will be caught by the flight booking service to trigger a “FlightBooked” event, caught in turn by the hotel booking service to trigger the “HotelBooked” event, and so on.
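To make this concrete, here is a sketch of the payment service’s wiring (the topic names follow the example above; the broker address and payment logic are placeholders). Note how the service only knows its own command and event topics, not who reacts downstream:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PaymentService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payment-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Subscribe to the command topic...
            consumer.subscribe(List.of("CheckoutTravelCart"));
            while (true) {
                for (ConsumerRecord<String, String> command : consumer.poll(Duration.ofSeconds(1))) {
                    // ... process the payment here, then publish the resulting event.
                    producer.send(new ProducerRecord<>("PaymentProcessed", command.key(), command.value()));
                }
            }
        }
    }
}
```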
This approach is seen as more decoupled because there are no synchronous interactions between services. If a particular service is down, the process pauses and resumes as soon as the broken service is fixed. Nevertheless, this architecture is far from a silver bullet: the definition of your business process is actually distributed across event subscriptions and publications, making it hard to update and really hard to get a clear picture of the business situation without adding a dedicated service that records and monitors each event.
You can extend the event-driven architecture by adding an orchestrator service. This service is in charge of triggering commands based on the history of past events. The orchestrator maintains a durable state for each business process, making it easy both to implement sophisticated processes and to have a source of truth for each process’s state. Moreover, each service is even more decoupled, as it no longer needs to be aware of other services’ events: it can behave as a simple asynchronous service, receiving its own commands and publishing its own events only.
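The heart of such an orchestrator can be surprisingly small. Here is a sketch of the decision logic for our booking example (event and command names follow the example above; the messaging and persistence layers are left out). The whole process is now defined in one place:

```java
import java.util.List;

// Given the durable history of past events for one booking,
// decide which command to emit next.
public class BookingOrchestrator {

    public enum Event { CART_CHECKED_OUT, PAYMENT_PROCESSED, FLIGHT_BOOKED, HOTEL_BOOKED }

    /** Returns the next command to send, or null when the process is complete. */
    public String nextCommand(List<Event> history) {
        if (!history.contains(Event.PAYMENT_PROCESSED)) return "CheckoutTravelCart";
        if (!history.contains(Event.FLIGHT_BOOKED))     return "BookFlight";
        if (!history.contains(Event.HOTEL_BOOKED))      return "BookHotel";
        return null; // all steps done
    }
}
```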
The main downside of this architecture is that it’s actually a lot of work to implement — you need:
Here is a more detailed technical view of the requirements:
This architecture was actually implemented at Zenaton:
Despite not having that many customers, we were struggling with scalability issues:
In the next section, I’ll show how Apache Pulsar is actually a much better platform to build this type of architecture.
Pulsar is sometimes presented as the “next-generation Kafka”. It’s already used in production at very large scale at Yahoo and Tencent. It has an impressive set of features:
Let’s see how we can benefit from these features to reach this simple architecture:
An ideal event-driven architecture requires both a streaming platform for managing events and a queuing system for job processing. Of course, it’s possible to have only one or the other — at Zenaton we had a single RabbitMQ cluster — but it makes your life more complicated.
Apache Pulsar offers a unified messaging/streaming system. This means you do not need to operate two different clusters: you can use Pulsar to provide both the command queues and the event bus.
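Pulsar exposes this through subscription types. A sketch of the “one cluster, two semantics” idea (topic and subscription names are hypothetical): a Shared subscription spreads messages across competing workers like a job queue, while a Failover subscription keeps an ordered stream consumed by a single active instance.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class UnifiedMessaging {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Command queue: competing consumers, each job handled once.
        Consumer<byte[]> commandQueue = client.newConsumer()
                .topic("persistent://public/default/payment-commands")
                .subscriptionName("payment-workers")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // Event bus: ordered stream with a single active consumer.
        Consumer<byte[]> eventStream = client.newConsumer()
                .topic("persistent://public/default/booking-events")
                .subscriptionName("orchestrator")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();
    }
}
```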
By leveraging BookKeeper, Pulsar provides out-of-the-box horizontal scaling of your storage. BookKeeper is used to store your messages permanently. Of course, you can define retention limits. You can also set up tiered storage to offload long-term storage to S3 or other providers.
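As an illustration, here is how retention and offload policies might be set through the Java admin client (the namespace and thresholds are hypothetical; the tiered-storage driver itself is configured on the brokers):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class StorageSetup {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Keep acknowledged messages for 30 days or up to 10 GB per topic.
        admin.namespaces().setRetention("public/default",
                new RetentionPolicies(30 * 24 * 60, 10 * 1024));

        // Offload older ledgers to the configured long-term store (e.g. S3)
        // once a topic holds more than ~1 GB on BookKeeper.
        admin.namespaces().setOffloadThreshold("public/default", 1024L * 1024 * 1024);
    }
}
```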
Using Pulsar, you do not need additional databases.
As described in the Pulsar documentation, functions are lightweight compute processes that consume messages from one or more Pulsar topics, apply user-supplied processing logic to each message, and publish the results to another topic.
Pulsar lets you “deploy” functions into its own cluster and ensures they are up and running. Pulsar also provides an API to retrieve and store a state in the embedded BookKeeper cluster.
Functions can be used to deploy our orchestrator, our decision service, and even clients (if services can be reached through gRPC or HTTP) directly inside the Pulsar cluster.
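For instance, here is a sketch of a stateful function (the counting logic is purely illustrative). Pulsar runs it inside the cluster and persists the counter in the embedded BookKeeper state store:

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Counts how many times each event type has been seen,
// using the durable per-key counters backed by BookKeeper.
public class EventCounterFunction implements Function<String, Void> {
    @Override
    public Void process(String event, Context context) throws Exception {
        context.incrCounter(event, 1);            // durable state update
        long seen = context.getCounter(event);    // durable state read
        context.getLogger().info("{} seen {} times", event, seen);
        return null;
    }
}
```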
> Note: as of summer 2020, it’s not possible to configure a function to guarantee that no more than one decision is running at a given time. It should be possible in the next release (2.7). Also, stateful functions are still in developer preview: you can expect some bugs, but hopefully they will be fixed soon.
Delayed messages can be used in the job manager to retry failed jobs, but also to implement delays in workflows.
By using this feature, you do not need a Timer service.
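Scheduling a retry then boils down to re-publishing the command with a delay (the topic name and payload below are hypothetical; note that delayed delivery applies to shared subscriptions on the consuming side):

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class RetryScheduler {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/payment-commands")
                .create();

        // Re-enqueue the failed job; Pulsar will deliver it in 5 minutes.
        producer.newMessage()
                .value("{\"command\":\"CheckoutTravelCart\",\"attempt\":2}")
                .deliverAfter(5, TimeUnit.MINUTES)
                .send();

        producer.close();
        client.close();
    }
}
```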
As Pulsar provides a way (Pulsar SQL) to query the data stored in BookKeeper, it’s quite easy to set up an API that a dashboard can use to display what happens in workflows. The cherry on top: this API will also cover the data offloaded to tiered storage such as S3.
Based on my experience at Zenaton, Apache Pulsar appears to be an ideal foundation on which to build a lean but powerful, enterprise-grade, event-driven workflow orchestration platform.
That's what we are doing at Infinitic. Infinitic proposes a "workflow as code" pattern that lets you describe workflows as if you were on a single infallible machine! If you are not familiar with this pattern, please read code is the best DSL for building workflows.
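To give a flavor of the pattern, here is a purely hypothetical sketch (not Infinitic's actual API): the engine replays this method from the durable event history, so it reads as plain sequential code yet survives any crash mid-process.

```java
// Hypothetical "workflow as code" sketch, not Infinitic's actual API.
public class BookingWorkflow {

    // Each interface is implemented by a remote service; calls become
    // commands on the bus and return values come back as events.
    interface PaymentService { String checkout(String cartId); }
    interface FlightService  { void book(String cartId, String receipt); }
    interface HotelService   { void book(String cartId, String receipt); }

    private final PaymentService payment;
    private final FlightService flights;
    private final HotelService hotels;

    public BookingWorkflow(PaymentService payment, FlightService flights, HotelService hotels) {
        this.payment = payment;
        this.flights = flights;
        this.hotels = hotels;
    }

    public void handle(String cartId) {
        String receipt = payment.checkout(cartId); // suspends until "PaymentProcessed"
        flights.book(cartId, receipt);             // then until "FlightBooked"
        hotels.book(cartId, receipt);              // then until "HotelBooked"
    }
}
```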
You can find a more complete description of how it actually works by looking under the hood of an event-driven workflow engine.