As mentioned in my previous post, I recently finished and launched a major systems integration project at work. I'm leaving the specifics vague in order to avoid preempting any official announcements, but as it was the single largest and most consequential bit of programming I've ever done, I wanted to show off a little bit. As Olivia Pope says, you gotta stand in the sun after all.

Olivia Pope (Scandal) standing in the sun

I also find technical postmortems to be interesting to write and often inspiring to read. I've only done a few myself: one on making a "free" website and a bunch of game dev retrospectives.

So in the spirit of trying to share interesting patterns and techniques with the broader community, I present an account of part of the integration project that has been consuming my life for the last nine months and the architecture patterns I picked up along the way.

Project requirements

The general requirements of the project went something like this:

  • Our partner has a platform similar to ours for several features.

  • But they have a more mature system, which abstracts certain things in clever ways and would give us interesting opportunities to scale if we could use it.

  • We cannot expose their platform to our users. It would be terrible UX and also strategically foolish.

  • We also don't want to complicate existing operational workflows any more than necessary.

  • The existing/legacy versions of the features we are integrating must continue to work, whether or not the integration is enabled for a given tenant in our platform.

If you read this list and you have any experience with systems design, a naive implementation is pretty straightforward, especially when you realize both platforms are (like everything else on the planet) essentially REST apps. We just need to copy everything a user does in the other system. So, a v1 plan looks something like this:

  1. Set up a module containing the request (or chain of requests, if necessary) to the partner app for each create, update or delete action we need to replicate. Their SDK saves you from writing your own requests library—just figure out which functions you need and put them in the right order.

public class IntegrationProvider(SdkClient sdkClient)
{
    // When we're done creating the product in our own system,
    // we call IntegrationProvider.CreateShopItem and pass it
    // the new product--the new integration requests module
    // handles the actual procedure of syncing it.
    public async Task CreateShopItem(Product product)
    {
        var shopItem = new ShopItem
        {
            Name = product.Name,
            Description = product.Description
        };
        var shopItemPrice = new ShopPrice
        {
            Amount = product.Price
        };

        var item = await sdkClient.ShopItem.Create(shopItem);
        var price = await sdkClient.ShopPrice.Create(shopItemPrice);
    }
}           
  2. At the end of each function that handles users' requests in our system, check to see if that user's tenant needs to sync with the partner platform. If it does, call the relevant function in your sync module and watch the data appear (or disappear) in the other platform.
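To make step 2 concrete, here's a rough sketch of what the tail end of one of those request handlers might look like. Everything other than CreateShopItem—the ProductService class, the repository, and the HasPartnerIntegration flag—is a made-up stand-in for whatever business logic and feature-flag plumbing you already have.

public class ProductService(ProductRepository products, IntegrationProvider integrationProvider)
{
    public async Task<Product> CreateProduct(Tenant tenant, ProductRequest request)
    {
        // Normal business logic: create the product in our own system first.
        var product = await products.Create(tenant, request);

        // v1 sync: if this tenant has the integration enabled, replicate
        // the change to the partner platform before returning.
        if (tenant.HasPartnerIntegration)
        {
            await integrationProvider.CreateShopItem(product);
        }

        return product;
    }
}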

All glory to MS Paint

There's a fair amount of work involved here in keeping things organized, but at the end of the day you're essentially just adding a couple of extra steps to your existing business logic.

It's never that simple

In this case, the naive implementation is not only naive, it falls flat before it even achieves basic functionality. We have no way to map our objects to theirs. Creating new data is fine—just tell the other system what you want to do. If you want to update or delete something, however, you need a way to figure out exactly which object you want to change in the other system.

There are two ways to do this:

  1. Keep track of which of their objects maps to which of your objects.

  2. Give them the data so they can keep track of your objects.

As it turns out, this is not the first system our platform has integrated with. We already had mapping tables for some of the things we wanted to keep track of; for the rest we just created additional tables using the same pattern. As an added bonus, they're generic, so we made life a little bit easier for anyone doing future integrations. It helps that a mapping table is dead simple to put together. You have four columns (minus any audit stuff):

  • The primary key (PK) of the mapping row

  • The foreign key (FK) to the object you're mapping

  • The FK of the mapped object in the foreign system

  • The enum/ID/lookup of the integration using the mapping. Not strictly necessary, but having it lets you use one table per object for ALL your integrations, from Steam to Stripe to Salesforce.

All glory to MS Excel
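If you prefer code to spreadsheets, here's roughly what one of those rows looks like as an entity class. The names (ObjectMapping, IntegrationKind, and so on) are illustrative, not what we actually shipped.

// One row per synced object. The Integration column is what lets a single
// table serve every integration instead of one table per partner.
public enum IntegrationKind { PartnerPlatform, Steam, Stripe, Salesforce }

public class ObjectMapping
{
    public long Id { get; set; }                      // PK of the mapping row
    public long LocalObjectId { get; set; }           // FK to the object in our system
    public string ForeignObjectId { get; set; } = ""; // ID of the mapped object in the partner system
    public IntegrationKind Integration { get; set; }  // which integration this mapping belongs to
}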

Platforms that give you a way to set custom fields or metadata lend themselves to the second option. The metadata field on objects in Stripe is a perfect (and oft-abused) example of this. Put your key into a metadata field, and you can query your invoices/customers/products based on your own data. You lose a bit of control, but you don't have to maintain the extra tables—which is especially nice for things like Zapier flows or other situations where being stateless is a requirement.

Maintaining a mapping table increases the complexity of the project architecture, but only slightly. Our v2 integration modules now include an interface with our mapping tables. Creating an object now goes like this:

  1. We create the object in our system and cache its ID.

  2. If we need to sync it to the partner system, we call the relevant request module and pass it our ID.

  3. The request module makes the object in their system and gets an ID back.

  4. Finally, it inserts a new row into our mapping table: our ID, their ID, and the name of the integration.
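In code, that turns the CreateShopItem sketch from earlier into something like this (price handling elided). IMappingStore is a stand-in for whatever data access you use, and I'm assuming the SDK hands back the created object with its new ID.

public class IntegrationProvider(SdkClient sdkClient, IMappingStore mappings)
{
    public async Task CreateShopItem(Product product)
    {
        var shopItem = new ShopItem
        {
            Name = product.Name,
            Description = product.Description
        };

        // Create the object in the partner system and get their ID back.
        var item = await sdkClient.ShopItem.Create(shopItem);

        // Record our ID, their ID, and the integration in the mapping table.
        await mappings.Add(new ObjectMapping
        {
            LocalObjectId = product.Id,
            ForeignObjectId = item.Id,
            Integration = IntegrationKind.PartnerPlatform
        });
    }
}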

If we ever need to update or delete that object, we just rearrange the order of operations:

  1. Update/delete the object in our system and cache its ID.

  2. If we need to sync to the partner system, call the relevant request module and pass it our ID.

  3. The request module checks our mapping table and gets the foreign ID that corresponds to our ID.

  4. Finally, it makes a request to the partner system to update or delete the object at the foreign ID.
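The delete path is the mirror image: look up the mapping first, then act on the foreign ID. This would be another method on the same IntegrationProvider; mappings.Find and the SDK's Delete call are assumptions layered on the earlier sketches.

public async Task DeleteShopItem(Product product)
{
    // Look up the foreign ID that corresponds to our ID.
    var mapping = await mappings.Find(product.Id, IntegrationKind.PartnerPlatform);
    if (mapping == null)
    {
        return; // never synced, so there's nothing to delete on their side
    }

    // Tell the partner system to delete the object at that foreign ID.
    await sdkClient.ShopItem.Delete(mapping.ForeignObjectId);
}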

Why do people use UML anyway?

As a bonus, we suddenly have a way to toggle our integration on and off for a given tenant. To enable the integration, we simply go through each of the objects in our system and call the relevant create action in our request module. The module doesn't care when we created our object, it just needs an ID to sync and it'll handle the rest. Rolling back the integration is the same thing—look up all the objects for the tenant, tell the request module to try to delete them. If they have a mapping, the request module will delete the object out of the other platform.
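Which means enabling or rolling back the integration for a tenant can be as blunt as a loop over that tenant's objects, something like this (GetAll and the surrounding service shape are, as before, placeholders):

// Enabling the integration is just a replay: push every existing object
// through the same create path the request module already has.
public async Task EnableIntegration(Tenant tenant)
{
    foreach (var product in await products.GetAll(tenant))
    {
        await integrationProvider.CreateShopItem(product);
    }
}

// Rolling back is the mirror image: anything with a mapping gets deleted
// from the partner platform; anything without one is skipped.
public async Task DisableIntegration(Tenant tenant)
{
    foreach (var product in await products.GetAll(tenant))
    {
        await integrationProvider.DeleteShopItem(product);
    }
}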

Perfect synchronicity. Time to ship.

The real world: internet edition

If only it were that simple. The mapping table gets us from naive-and-broken to just naive. This integration will now work long enough to impress someone at a demo, and then come crumbling down the moment it makes it to production.

It's not that the logic isn't sound. The core pattern is correct: if a user's tenant has the integration enabled, any actions that need to be synced over will make requests in the other system, using our mapping table as the source of shared state.

The problem is that humans...suck.

  • A bug in the requests module can cause the sync request to fail or never get saved to our mapping table.

  • Our partner might introduce a bug into their SDK, causing our requests to fail or not get saved to our mapping table.

  • We cannot guarantee 100% uptime in the partner platform. We don't control their platform or what they do with it.

  • We cannot guarantee 100% uptime in our platform. Shit happens.

If you could guarantee nothing would ever go wrong you'd be done. You'd also be sought after by literally everyone on the planet and would not be reading technical retrospectives about system architecture and design.

For the rest of us, we need a way to handle failures and recover from them.1 The core issue is systems drift. If a user creates something in our platform, e.g., a product, the expectation is that it syncs over to the partner system. If the request fails, any future requests that work with that product will fail. As the errors stack up, the systems will grow further and further apart, a pernicious issue when you're relying on your partner's tech to support your customers. Few things break user trust faster than pressing a button or changing a setting and finding that it doesn't do what they expect.

We could roll the integration forward and backward when we run into problems. Our integration mechanism does allow us to completely resync the entire integration for a given tenant, which lets us reset the state to a specific moment in time. Relying on a human to monitor each tenant (and both platforms) and manually reset after every error is a solution, but I'm not about to volunteer for that job. Even if you could somehow manage to stay on top of it, if your integration has any features that don't use the REST infrastructure you'll be forced to reboot them even when they're working fine. This project falls into that bucket—the REST stuff is the biggest part of the integration, but hardly the only part.

The move is to find a way to isolate the request processing. If user requests can go to our dedicated processor, then perhaps we can manage its failures directly. If the partner system goes down, we don't block our users from making requests—we just tell our processor to wait until the other system comes back up to send them.

The outbox pattern

The transactional outbox pattern is a data-driven messaging broker microservice designed to ensure atomicity and multi-system synchronization and definitely isn't a glorified list wrapped up in résumé buzzwords.

In plain English, the outbox pattern is a queue of transactions that sit and wait to be processed. When we want to sync something to a different system, we don't do it directly. Instead, we put it in a queue and let a dedicated processor take charge of actually sending it to where it needs to go.
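Concretely, an outbox entry can be as boring as a row like this (field names are mine, not gospel):

public enum OutboxStatus { Pending, Done, Failed }

public class OutboxMessage
{
    public long Id { get; set; }              // insertion order doubles as processing order
    public long TenantId { get; set; }        // whose integration this job belongs to
    public string Action { get; set; } = "";  // e.g. "CreateShopItem"
    public string Payload { get; set; } = ""; // serialized arguments for the request module
    public OutboxStatus Status { get; set; }  // Pending until the processor picks it up
    public string? Error { get; set; }        // why it failed, if it failed
}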

AI-free, baby

It's an extra step, an extra service to maintain, an extra abstraction to reason about. But for the cost of a cron job/queue function/lambda, a database table2 and a wrapper class, we separate the "requesting" functionality from the "processing" functionality, which buys us some massive benefits:

  • Integration users making requests are not making them directly to the partner system. They make them to our database, which we can effectively guarantee will be online (if it's not, they probably wouldn't be able to make the request in the first place). An unavailable third party won't block our users.

  • Saving requests to a queue means they are easy to review. Just query the table to see all the activity. Suddenly we have serviceability: we can see exactly which request failed and why.

  • Depending on the implementation, having the list of messages saved means we could theoretically roll the partner system forwards and backwards to specific moments in time if needed.

  • We can turn off the processor independently of the larger system; say, if someone pushed buggy code to it, causing a cascade of messages to our unsuspecting counterpart.3

Call it a message broker, a processing queue, or any of a dozen other names I found for the pattern with a cursory internet search. Each variant has its own specifics and flavors and uses different underlying libraries. Technically speaking, a transactional outbox implementation should only put data into the queue as part of the same database transaction as whatever else it's doing, ensuring that if you complete your request then by definition you will get it into the outbox. Other flavors have different fault tolerances at all stages. Sometimes it's fine if a message doesn't make it to the queue; sometimes it's fine if the other party doesn't receive all the messages or receives them out of order.
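Here's a minimal sketch of that "same transaction" detail, assuming an EF Core-style DbContext where the business object and the outbox table live in the same database (the context, its DbSets, and ProductWriter are all hypothetical):

using System.Text.Json;
using Microsoft.EntityFrameworkCore;

public class AppDbContext : DbContext
{
    public DbSet<Product> Products => Set<Product>();
    public DbSet<OutboxMessage> OutboxMessages => Set<OutboxMessage>();
}

public static class ProductWriter
{
    public static async Task CreateProduct(AppDbContext db, Tenant tenant, Product product)
    {
        db.Products.Add(product);

        // Queue the sync job alongside the product itself. A single
        // SaveChangesAsync runs as one database transaction, so either
        // both rows commit or neither does--no product without its outbox
        // message, and no orphaned message without its product.
        db.OutboxMessages.Add(new OutboxMessage
        {
            TenantId = tenant.Id,
            Action = "CreateShopItem",
            Payload = JsonSerializer.Serialize(new { product.Id }),
            Status = OutboxStatus.Pending
        });

        await db.SaveChangesAsync();
    }
}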

I'll tell you a joke about UDP but you might not get it.

Shipping the work

The specific implementation I shipped had a couple of details that made sense for us specifically. For example, instead of using a built-in cloud queue (e.g., an Azure queue + queue-triggered Azure Function), I opted for a database table + cron job:

  • Cloud queues delete items once they're processed, making it hard to troubleshoot faults.

  • The common pattern for bad requests in a queue is to put them in a "poison queue," but that's not an option here since we require jobs to be processed in order.

  • Using a table meant I could have a column with the tenant ID. So if a job fails in tenant A, the processor can move on to all the outstanding jobs in tenant B-Z without issue. A user messing something up in their tenant doesn't affect what you do in yours.

  • Some jobs require another system to process some data before they can be run. Using a table makes it easier to define statuses distinct from "to-do" and "done"; we can have a "preprocess" status while we wait for the other system to finish running.
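Put together, the cron job itself ends up roughly this shape. It's a sketch over the hypothetical OutboxMessage table from above, not our production code; Dispatch stands in for routing a message's Action and Payload back to the right request-module method.

using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public class OutboxProcessor(AppDbContext db, IntegrationProvider integrationProvider)
{
    // Invoked on a schedule (cron job, timer-triggered function, etc.).
    public async Task RunOnce()
    {
        var pending = await db.OutboxMessages
            .Where(m => m.Status == OutboxStatus.Pending)
            .OrderBy(m => m.Id) // oldest first, so per-tenant ordering holds
            .ToListAsync();

        // Group by tenant so a failure in tenant A doesn't block tenants B-Z.
        foreach (var tenantBatch in pending.GroupBy(m => m.TenantId))
        {
            foreach (var message in tenantBatch)
            {
                try
                {
                    await integrationProvider.Dispatch(message);
                    message.Status = OutboxStatus.Done;
                }
                catch (Exception ex)
                {
                    // Record why it failed and leave it Pending so the next run
                    // retries it; later jobs for this tenant stay queued behind it.
                    message.Error = ex.Message;
                    break;
                }
            }
        }

        await db.SaveChangesAsync();
    }
}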

The main disadvantage is that the processor is a little bit slower in actually sending stuff to our partner system. In practice it's not a big deal; a few seconds' delay is perfectly acceptable for our needs. For systems where latency is a constraint, there are other messaging designs that wouldn't have this problem. Or you can go direct, just like we were doing before we introduced the queue! The outbox pattern is a tool to improve stability and resiliency. If speed is the critical factor and you don't care about things like keeping state consistent, you can still go direct—you just have to accept that if the communication breaks down the system won't retry the request for you. Think of a button that triggers a fire alarm controlled by the foreign system. If their system is down, it's not going to work anyway, so it doesn't matter if your request fails. And you definitely don't want it to finally succeed a few hours later when the fire department clears the building and reconnects the systems.

We've been in production for a few weeks now. There were a few (several) sleepless nights and a long string of 12-hour days right up to launch. Critical patches went out the door literally the night before customers went live; such is the startup life. But it has been working flawlessly4 since then, including the other pieces of the project that don't use this pattern. There are still a few more things to clean up and some edge cases we need to handle as we scale the system. But the thing is out the door, it's working, and any day now I look forward to sleeping easy once again.


UUIDs responsibly sourced from everyuuid.com