October 20, 2022

Event Storm to Production Supergraph: a Domain-Driven Design Approach to GraphQL

Author
Xolvio

Or how to tidy up your system’s domains, so that you get a supergraph schema design for free.

The evolution of GraphQL has spawned the supergraph: a declarative, modular, and agile approach to architecting a unified graph that combines multiple GraphQL APIs. The supergraph architecture lets developers work independently while delivering features faster, improving customer experience, and unlocking data across the org.

Sounds sweet, doesn’t it?

But how do you even start to design something as ambitious as a unified composition layer for your entire company? 🤔

Different departments may have different definitions of what a customer is. More often than not, data and business logic are spread across various silos while APIs are scattered all across the organization.

Then, there’s also the consequence of the rapid pace we achieved with Agile, which forces us to move really fast without considering the bigger picture of architecture and software design. On top of that, breaking up legacy monolithic systems into microservices has made things more complex for devs.

The sheer totality of data, services, legacy, and capabilities that exist within your company can be truly mind-boggling.

It’s no wonder then that enterprise-wide GraphQL adoption is such a huge challenge. The gist of this challenge is that it’s not just a technical problem, but also, and for the most part, an organizational and people problem.

We need to talk

If GraphQL adoption is about people (and we know for a fact that organizational structure and software architecture are closely tied), then surely we need to bring people together in a room or on a Zoom call and get them to talk to each other.

In this article I'm going to show you a fun and engaging workshop technique that will help you achieve consensus on an initial software design in just a few meetings and allow you to begin implementation and iteration immediately. The key benefit of this approach is that it produces a cohesive, shared understanding of the business—a tidy domain model.

You may well ask, what’s that to do with GraphQL? The answer is: if you begin your supergraph journey with a tidy domain model, you practically get your schema design for free. Read on to find out how.

Event storm your way to the supergraph: the basics

This very special yet super simple workshop technique, which we use at Xolvio in our client work all the time, is called event storming. Invented by Alberto Brandolini in the context of domain-driven design, event storming allows you to easily model your system, so that you have a clear, high-level picture of it and a solid basis for schema design.

In the many years I’ve been doing digital transformation work and Agile coaching, I’ve found event storming to be hands down the best way to get everybody on the same page across both business and technology stakeholders. The technique basically consists of:

  1. Inviting all relevant business and technology experts to the session
  2. Using sticky notes (or an online equivalent) to brainstorm business processes on a board

As simple as that. This works equally well for documenting both new and existing systems. The notation is straightforward and flexible and it’s very easy for everyone to take part and learn event storming as they go.

Event storming involves three different modeling levels, which you can think of as layers of an onion:

  1. Big picture: we look at things from a high level.
  2. Process modeling: we dig a little deeper and document a business process in detail.
  3. Software design: we use our modeled processes as the basis for building actual software.

In order to demonstrate and explain the sticky note notation in practice, let’s dive into a realistic example of an imaginary enterprise system.

Taming complexity into a crystal clear picture

We’re not joking around here, so I’m going to walk you through event storming for one of the most complex kinds of businesses imaginable—an airline.

First, let's have a look at the systems involved at our company. We have four systems dedicated to loyalty, and as many as nine different systems that deal with bookings. It turns out the airports of the world do not all use the same software platform, so we have to integrate with all of them independently. As a sour cherry on top, the flight reservations system happens to have been written in the 70s. 😬

Now, let the event storm begin!

As the very name suggests, event storming is going to help us make sense of this bouillabaisse of systems by focusing on what’s called domain events.

We use orange stickies to denote domain events, always expressed as verbs in the past tense, e.g. flight arrived. This is where we want to notate things relevant for domain experts. We’re not trying to get into the weeds on technical stuff; we just focus on the business level.

We start out the event storm by simply throwing domain events on the wall in no particular order. It’s even fine to have duplication and disagreement, since the point is to just get the domain experts going and list all possible events taking place in the business.

The next step is to start organizing this mess. Let’s begin to enforce a timeline first. Are there any stories here we can tell? Is there a user journey maybe? Thinking in such terms allows us to quickly arrange the events into a logical progression. A passenger boarded the plane, the flight departed and then arrived. The passenger is awarded miles to their loyalty account and at some point they achieve platinum status, so that we send them a platinum card.
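To make the timeline step a bit more tangible, here is a minimal TypeScript sketch. The event names mirror the story above, while the `occurredAt` timestamps and the `toTimeline` helper are my own illustrative assumptions, not part of the event storming notation.

```typescript
// Domain events are named in the past tense, just like the orange stickies.
type DomainEventName =
  | "PassengerBoarded"
  | "FlightDeparted"
  | "FlightArrived"
  | "MilesAwarded"
  | "PlatinumStatusAchieved"
  | "PlatinumCardSent";

interface DomainEvent {
  name: DomainEventName;
  occurredAt: string; // ISO 8601 UTC timestamp (assumed field)
}

// Stickies land on the wall in no particular order...
const wall: DomainEvent[] = [
  { name: "MilesAwarded", occurredAt: "2022-10-20T18:05:00Z" },
  { name: "PassengerBoarded", occurredAt: "2022-10-20T09:40:00Z" },
  { name: "PlatinumCardSent", occurredAt: "2022-10-22T08:00:00Z" },
  { name: "FlightArrived", occurredAt: "2022-10-20T18:00:00Z" },
  { name: "FlightDeparted", occurredAt: "2022-10-20T10:15:00Z" },
  { name: "PlatinumStatusAchieved", occurredAt: "2022-10-20T18:06:00Z" },
];

// ...and enforcing a timeline means ordering them by when they happened.
// Same-format ISO timestamps compare correctly as plain strings.
function toTimeline(events: DomainEvent[]): DomainEvent[] {
  return events
    .slice() // copy, so the original wall stays untouched
    .sort((a, b) => (a.occurredAt < b.occurredAt ? -1 : a.occurredAt > b.occurredAt ? 1 : 0));
}

const timeline: DomainEventName[] = toTimeline(wall).map((e) => e.name);
console.log(timeline.join(" -> "));
```

In a real workshop the ordering comes from conversation rather than timestamps, of course. The point is only that a user journey gives the events a natural sequence.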

Once we have arranged our events on a timeline, we begin to see that there are some pivotal events to the story that mark the beginnings and endings of specific “chapters” of our business narrative. Let’s mark them.

Identifying these pivotal events is going to help us with the next step, which is defining the bounded contexts. We often see that different series of events belong together in groups like reservations, flight operations, or loyalty.

Why are we breaking up the events like this? In domain-driven design, a bounded context is defined as a boundary of language consistency. Sure, a flight is a flight, but different groups of people working at our airline are concerned with different aspects of a flight. For example, to the ramp crew a flight is something they need to fuel, load with baggage, or tow. To crew management, on the other hand, a flight is something they need to staff and calculate income tax for, based on where the flight landed or took off.

At the end of the day, which do you think is better: to have an overloaded understanding of the flight that encompasses everyone's view of the flight and start there, or to allow people to have independent perspectives and then join across them?

I’m pretty sure this question already made you connect the dots. Yup, you’re right—the bounded contexts in fact give us guidelines on how to form our subgraphs!
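Here is a rough sketch of what "independent perspectives, joined later" can look like in code. The two interfaces, their field names, and the `joinFlight` helper are hypothetical; I made them up to illustrate the ramp crew and crew management views described above.

```typescript
// Ramp crew context: a flight is something to fuel, load, and tow.
interface RampFlight {
  id: string;
  fuelLoadedKg: number;
  bagsLoaded: number;
}

// Crew management context: a flight is something to staff and tax.
interface CrewFlight {
  id: string;
  crewIds: string[];
  landedInCountry: string;
}

const rampFlights: RampFlight[] = [{ id: "PA101", fuelLoadedKg: 18000, bagsLoaded: 212 }];
const crewFlights: CrewFlight[] = [{ id: "PA101", crewIds: ["c1", "c2"], landedInCountry: "US" }];

// Joining across contexts only needs a shared identity, not a shared model.
function joinFlight(id: string) {
  const ramp = rampFlights.filter((f) => f.id === id)[0];
  const crew = crewFlights.filter((f) => f.id === id)[0];
  return { id, ramp, crew };
}

const joined = joinFlight("PA101");
console.log(joined.ramp.bagsLoaded, joined.crew.crewIds.length);
```

Neither team has to agree on one giant `Flight` type. Each keeps its own vocabulary, and the only contract between them is the flight's identity.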

Diving deeper to flesh out the details

So far, we’ve covered a taste of big picture event storming. Let’s now continue with process modeling and event storm one domain in a bit more detail, that is, loyalty. In order to model how loyalty works in our airline, we’re going to introduce a few more event storming concepts represented in different colored sticky notes.

For a start, our airline isn’t actually sending out loyalty cards themselves, so we need to denote an external system here. We put the external system next to the domain event that it handles.

External systems are represented with pink sticky notes. Being external, they're outside of our control. This could be a system owned by someone who couldn’t be invited to the event storm, or it may be something we can put the blame on, like the weather.

What about the behavior we do control?

Our airline differentiates by customer status when awarding miles, e.g. a platinum card holder gets awarded nine times the miles they traveled. That’s reactive logic, so we need to denote that with a lavender sticky note which represents a policy.

A policy is triggered whenever its associated domain event takes place. Business rules are hidden here. The policy can be automated or it could be manual. We don’t necessarily need to worry about its details and requirements at this point, but it’s important to note there’s some sort of policy here as a placeholder. Policies are also easy to test in a declarative style such as behavior-driven development.
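As a minimal sketch, a policy can be thought of as a function that reacts to a domain event. Only the nine-times platinum multiplier comes from our example; the other status levels, their multipliers, and all type and field names are assumptions. I also return the resulting event directly to keep things short, whereas a fuller model would often have the policy issue a command instead.

```typescript
type StatusLevel = "basic" | "gold" | "platinum";

interface FlightArrived {
  type: "FlightArrived";
  flightId: string;
  passengerId: string;
  milesFlown: number;
}

interface MilesAwarded {
  type: "MilesAwarded";
  passengerId: string;
  miles: number;
}

// Only the platinum multiplier (9x) comes from the article; the rest is assumed.
const MULTIPLIER: Record<StatusLevel, number> = { basic: 1, gold: 2, platinum: 9 };

// "Whenever a flight arrived, award miles according to the passenger's status."
function mileageAwardPolicy(event: FlightArrived, statusLevel: StatusLevel): MilesAwarded {
  return {
    type: "MilesAwarded",
    passengerId: event.passengerId,
    miles: event.milesFlown * MULTIPLIER[statusLevel],
  };
}

const arrival: FlightArrived = { type: "FlightArrived", flightId: "PA101", passengerId: "p1", milesFlown: 1000 };
const awarded = mileageAwardPolicy(arrival, "platinum");
console.log(awarded.miles); // 9000
```

Because the business rule is just data in and data out, it lends itself to the declarative, behavior-driven style of testing mentioned above.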

When the flight arrives, the mileage award policy is triggered, but the policy needs to know the passenger’s status level. Moreover, we need to know how many miles a customer accumulated in the year to check if they’re eligible for platinum status (governed by the status promotion policy). Such information that feeds the conversation and helps make a decision is represented with green stickies.

A green sticky note is called a read model. This is usually like a pure function over the event history. A read model is read-optimized, so you can think of it as a queryable data structure.
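To illustrate the "pure function over the event history" idea, here is a small sketch of the miles-accumulated read model. The `MilesAwardedEvent` shape, the `awardedAt` field, and the sample numbers are assumptions for the sake of the example.

```typescript
interface MilesAwardedEvent {
  passengerId: string;
  miles: number;
  awardedAt: string; // ISO 8601 UTC timestamp (assumed field)
}

// Read model: same events in, same answer out. No hidden state, no side effects.
function milesAccumulatedInYear(events: MilesAwardedEvent[], passengerId: string, year: number): number {
  return events
    .filter((e) => e.passengerId === passengerId)
    .filter((e) => new Date(e.awardedAt).getUTCFullYear() === year)
    .reduce((total, e) => total + e.miles, 0);
}

const eventLog: MilesAwardedEvent[] = [
  { passengerId: "p1", miles: 9000, awardedAt: "2022-03-01T10:00:00Z" },
  { passengerId: "p1", miles: 4500, awardedAt: "2022-09-14T08:30:00Z" },
  { passengerId: "p1", miles: 3000, awardedAt: "2021-12-30T22:15:00Z" },
  { passengerId: "p2", miles: 1200, awardedAt: "2022-05-05T12:00:00Z" },
];

const miles2022 = milesAccumulatedInYear(eventLog, "p1", 2022);
console.log(miles2022); // 13500
```

In production you would likely precompute and store such a projection rather than fold over the whole log on every read, which is what "read-optimized" hints at.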

The read model provides information needed to make a decision, but what kind of a decision are we talking about? In event storming that’s called a command, represented with blue stickies.

Notice how domain events are preceded by commands, so they are actually initiated by the commands. A command represents intent or user decision. Some people like to think of them as requests because they can fail. If a command succeeds, though, it always results in a domain event.
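The "a command can fail, but a successful one always yields an event" rule can be sketched as a handler that returns either a rejection or a domain event. The type names, the `handlePromoteLoyaltyStatus` function, and in particular the 100,000-mile threshold are hypothetical; our airline's real eligibility rule isn't specified here.

```typescript
type Tier = "gold" | "platinum";

interface PromoteLoyaltyStatus {
  kind: "PromoteLoyaltyStatus";
  accountId: string;
  targetTier: Tier;
}

interface LoyaltyStatusPromoted {
  kind: "LoyaltyStatusPromoted";
  accountId: string;
  newTier: Tier;
}

// A command either fails with a reason or succeeds with exactly one domain event.
type CommandResult =
  | { ok: true; event: LoyaltyStatusPromoted }
  | { ok: false; reason: string };

// Hypothetical threshold; the real rule lives in the status promotion policy.
const PLATINUM_THRESHOLD_MILES = 100000;

function handlePromoteLoyaltyStatus(command: PromoteLoyaltyStatus, milesThisYear: number): CommandResult {
  if (command.targetTier === "platinum" && milesThisYear < PLATINUM_THRESHOLD_MILES) {
    return { ok: false, reason: "not enough miles for platinum" };
  }
  return {
    ok: true,
    event: { kind: "LoyaltyStatusPromoted", accountId: command.accountId, newTier: command.targetTier },
  };
}

const promoteCmd: PromoteLoyaltyStatus = { kind: "PromoteLoyaltyStatus", accountId: "acc-1", targetTier: "platinum" };
const rejected = handlePromoteLoyaltyStatus(promoteCmd, 42000);
const accepted = handlePromoteLoyaltyStatus(promoteCmd, 120000);
console.log(rejected.ok, accepted.ok); // false true
```

Note how the miles figure fed into the handler is exactly the kind of information a read model supplies.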

There you have it, a complete business process modeled!

Translating a process model to schema design

What we covered thus far amounts to around 90% of event storming. You now know about domain events, external systems, policies, read models, and commands. Let’s focus on the last two to see how they can inform our GraphQL schema design.

We have our read models, the information we need to make a decision.

We have our commands initiating domain events.

Does this sound familiar in the context of a GraphQL schema?

A read model lines up logically with a query, which provides us with information.

A command lines up logically with a mutation, which is the effect we want to have on the system.

Let’s now map this onto our existing architecture. Even though that architecture is messy, GraphQL allows us to write resolvers to get the data we need. The read model and command stickies are going to inform our work.

The “Miles accumulated in the calendar year” read model is going to fetch flight master data from flight reservations, as well as the loyalty account multiplier and credit card miles from the loyalty accounts system. These could be REST APIs or databases.

In turn, “Promote Loyalty Status” is going to mutate the traveler master data, as well as data in the relevant loyalty systems accordingly. “Flight reservations” may even be just FTP uploads; it doesn’t really matter. The point is, we can map our schema to the backend systems.

This correspondence between read models/queries and commands/mutations is a true game changer for healthy schema design. You can now see that a properly modeled business domain essentially makes schema design a walk in the park, even in a brownfield scenario.
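Here is a rough sketch of that mapping. The schema string, the field names, the in-memory stand-ins for the backends, and the way miles, multiplier, and credit card miles are combined are all assumptions made for illustration. The resolvers use the usual `(parent, args)` shape but are plain functions, so the snippet runs without any GraphQL server.

```typescript
// Schema sketch: the read model becomes a query, the command becomes a mutation.
const loyaltyTypeDefs = `
  type Query {
    milesAccumulatedInCalendarYear(accountId: ID!, year: Int!): Int!
  }
  type Mutation {
    promoteLoyaltyStatus(accountId: ID!, level: String!): Boolean!
  }
`;

// In-memory stand-ins for the legacy backends (REST APIs, databases, even FTP uploads).
const flightReservations = {
  milesFlown: (accountId: string, year: number): number[] =>
    accountId === "acc-1" && year === 2022 ? [1000, 500] : [],
};
const loyaltyAccounts = {
  multiplier: (accountId: string): number => (accountId === "acc-1" ? 9 : 1),
  creditCardMiles: (accountId: string, year: number): number =>
    accountId === "acc-1" && year === 2022 ? 250 : 0,
};
const travelerMasterData: Record<string, string> = { "acc-1": "gold" };

const resolvers = {
  Query: {
    // Assumed formula: flown miles times multiplier, plus credit card miles.
    milesAccumulatedInCalendarYear: (_parent: unknown, args: { accountId: string; year: number }): number => {
      const flown = flightReservations.milesFlown(args.accountId, args.year).reduce((sum, m) => sum + m, 0);
      return flown * loyaltyAccounts.multiplier(args.accountId) + loyaltyAccounts.creditCardMiles(args.accountId, args.year);
    },
  },
  Mutation: {
    promoteLoyaltyStatus: (_parent: unknown, args: { accountId: string; level: string }): boolean => {
      travelerMasterData[args.accountId] = args.level;
      return true;
    },
  },
};

const milesTotal = resolvers.Query.milesAccumulatedInCalendarYear(undefined, { accountId: "acc-1", year: 2022 });
resolvers.Mutation.promoteLoyaltyStatus(undefined, { accountId: "acc-1", level: "platinum" });
console.log(milesTotal, travelerMasterData["acc-1"]); // 13750 platinum
```

The clients only ever see the query and the mutation. Which systems sit behind each resolver is an implementation detail we are free to change later.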

Leveraging the power of the supergraph

Let’s stick to our loyalty use case and see how a new requirement will take us to the supergraph. 🚀

Consider the following scenario. Our airline would like to award extra miles to platinum status travelers whose flight was delayed by more than four hours, except on discounted flights. We have a batch process (Midnight) that runs every night to collect all late arrivals, but it needs to know all the platinum status fliers who traveled on late flights in the past 24 hours.

That’s a new read model, so we need to design a query that’ll fetch the data across different subgraphs. How are we going to do that?

First, let’s have a look at the following bounded contexts to figure out where to find the data we need:

The Reservations domain knows the fare discounts.✅

Flight Operations has the data on arrival time.✅

Loyalty contains the travelers’ status level.✅

Now we know which subgraphs own which parts of our query!

Let’s now investigate the relevant subgraph schemas and see how we are going to join these three sources together.

Notice that we can logically share the same entity “Flight” across all three subgraphs. This is possible thanks to the Apollo Federation-specific @key directive, which defines an entity's primary key as one or more of the type's fields. In our example, the Flight entity's primary key is its “id” field. This makes all instances of Flight uniquely identifiable and allows the Apollo Router to associate field data from different subgraphs with the same entity instance.
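As a rough sketch, the three subgraphs could each declare their own slice of `Flight` along these lines. The field names (`hoursDelayed`, `discounted`, `loyaltyAccounts`, `statusLevel`) are my assumptions and are not copied from the demo repository. The `mergeByKey` function is only a toy stand-in that shows why a shared key is enough to stitch fields together; it is not how the Apollo Router is implemented.

```typescript
// Hypothetical subgraph schemas; each one owns a different slice of Flight.
const flightOpsSdl = `
  type Flight @key(fields: "id") {
    id: ID!
    hoursDelayed: Float!
  }
`;
const reservationsSdl = `
  type Flight @key(fields: "id") {
    id: ID!
    discounted: Boolean!
  }
`;
const loyaltySdl = `
  type Flight @key(fields: "id") {
    id: ID!
    loyaltyAccounts: [LoyaltyAccount!]!
  }
  type LoyaltyAccount {
    id: ID!
    statusLevel: String!
  }
`;

type EntityFields = { id: string; [field: string]: unknown };

// Toy entity merge: rows that share a key are combined into one object.
function mergeByKey(rows: EntityFields[]): Record<string, EntityFields> {
  const merged: Record<string, EntityFields> = {};
  rows.forEach((row) => {
    merged[row.id] = { ...merged[row.id], ...row };
  });
  return merged;
}

const subgraphSdl = [flightOpsSdl, reservationsSdl, loyaltySdl];
const keyedSubgraphs = subgraphSdl.filter((sdl) => sdl.indexOf('@key(fields: "id")') !== -1).length;

const mergedFlights = mergeByKey([
  { id: "PA101", hoursDelayed: 5 }, // from Flight Operations
  { id: "PA101", discounted: false }, // from Reservations
  { id: "PA101", loyaltyAccounts: [{ id: "acc-1", statusLevel: "PLATINUM" }] }, // from Loyalty
]);
console.log(keyedSubgraphs, Object.keys(mergedFlights["PA101"]));
```

Each team only publishes the fields it actually owns, and the key does the rest.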

Consequently, we don't need to have just a flight system or subdomain. Instead, we can allow independent systems to collaborate on the same concept while having their own bounded contexts—their own boundaries of language consistency.

Let’s now take a look at our supergraph introspection schema:

As you can see, we get the number of hours delayed from Flight Ops, the loyalty accounts traveling on that flight from Loyalty, and the discount from Reservations.

And voilà! Here’s our nice and clean federated query right there:
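The demo's actual query lives in the repository linked further down. Purely as an illustration, a query of this shape might look like the string below, with hypothetical field names and a hypothetical `accountsOwedExtraMiles` helper that applies our business rule (delayed more than four hours, not discounted, platinum status) to a sample response.

```typescript
// Hypothetical federated query; field names are assumed for illustration.
const lateFlightsQuery = `
  query PlatinumFliersOnLateFlights {
    flights {
      id
      hoursDelayed        # resolved by Flight Operations
      discounted          # resolved by Reservations
      loyaltyAccounts {   # resolved by Loyalty
        id
        statusLevel
      }
    }
  }
`;

interface LateFlight {
  id: string;
  hoursDelayed: number;
  discounted: boolean;
  loyaltyAccounts: { id: string; statusLevel: string }[];
}

// The Midnight batch rule: platinum fliers on non-discounted flights delayed over 4 hours.
function accountsOwedExtraMiles(flights: LateFlight[]): string[] {
  const ids: string[] = [];
  flights
    .filter((f) => f.hoursDelayed > 4 && !f.discounted)
    .forEach((f) => {
      f.loyaltyAccounts
        .filter((a) => a.statusLevel === "PLATINUM")
        .forEach((a) => ids.push(a.id));
    });
  return ids;
}

// Sample data shaped like the response such a query could return.
const sampleResponse: LateFlight[] = [
  { id: "PA101", hoursDelayed: 5, discounted: false, loyaltyAccounts: [{ id: "acc-1", statusLevel: "PLATINUM" }, { id: "acc-2", statusLevel: "GOLD" }] },
  { id: "PA202", hoursDelayed: 6, discounted: true, loyaltyAccounts: [{ id: "acc-3", statusLevel: "PLATINUM" }] },
  { id: "PA303", hoursDelayed: 2, discounted: false, loyaltyAccounts: [{ id: "acc-4", statusLevel: "PLATINUM" }] },
];

const owed = accountsOwedExtraMiles(sampleResponse);
console.log(owed); // [ 'acc-1' ]
```

From the batch job's point of view this is a single request. It has no idea that three different domains contributed to the answer.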

Notice that we achieved a solution to a complex problem across multiple domain boundaries without the need for any additional work in the systems providing the data. Domain owners can work independently of each other, as well as independently of the frontend developers who need to aggregate information across multiple domains.

You can check out the full supergraph demo featured in this article on GitHub: https://github.com/xolvio/pan-am-supergraph-demo

Mighty thanks to my colleague Samuel Combs for preparing the code demo!

Final words

Congratulations, you’ve made it! You now know:

  1. How to apply the technique of event storming to model your business and use cases
  2. How to translate a process model into healthy schema design
  3. How to create a federated query across multiple subgraphs in a supergraph

I do realize that many engineers who came across domain-driven design in the past may have found it esoteric. Now that you've made your way through the above demonstration, I hope you can see how event storming can rapidly build consensus around a good design. After all, the thorniest problem with GraphQL adoption is the need for coordination and communication across so many business units and teams. Event storming facilitates this greatly, especially since it can actually be a fun activity.

One thing to keep in mind is that all models are wrong, but some are useful. That’s why you want to iterate on event storming and stay agile in your implementation.

I wish you great success with your supergraph, but please remember it’s okay to ask for a helping hand. As the official Apollo GraphQL professional services partner, Xolvio has you covered with a unique methodology for predictable supergraph design and implementation.

Let me know if you have any questions or thoughts in the comments below.
