Vermiculus - Solid Systems for Challenging Times

By: Hannes Edvardson, System Architect, Vermiculus Financial Technology, September 2022

The last three years have been challenging for market operators. The pandemic and the war in Ukraine have increased price volatility and trading activity. Furthermore, the number of retail investors with direct market access keeps increasing, and events like the GameStop rush may well become more common in the future.

Ever-increasing performance requirements

The new trading patterns put high pressure on the post-trade system installations at clearing houses. Two requirements are clear:

Fast-changing market conditions may require new, more advanced functionality, for example, the ability to calculate margin requirements and issue margin calls much more frequently than once a day.

Systems need to be able to handle considerably higher trade volumes, which are unevenly distributed over time and across products.

The latter requirement is a performance requirement, but even for functional requirements, the performance limits of a system often set boundaries for what can be implemented. A clearing house may want to recalculate the initial margin every hour, every minute, or at every new trade, but the system architecture may simply not allow such frequent recalculation without adverse effects on normal trade handling. In short, a high-performance post-trade system is both a direct and an indirect requirement for a modern and capable clearing house.

Historically, several strategies have been applied to increase the performance of a post-trade system:

Vertical scaling (buying faster hardware). However, the current hardware trend is not to increase clock speeds, but to add more CPU cores. If a system has a sequential rather than parallel design, the performance boost from more advanced hardware will eventually flatten out.

Different system installations for different markets served by the same clearing house. This is a common design choice, but it has limitations. If a particular market is very busy, that market may reach the performance limits of the system on its own. There are also functional drawbacks to this approach; it may, for example, become more complex to offer a common collateral pool for all the markets served by the clearing house.

Sharding system components. In this design, several instances of central system components are started, and different clearing members or products are assigned to different instances. This, too, is a common design, but it is vulnerable to sudden changes in trading patterns that make the initial assignment of members or products a bad fit for the actual load, as the sketch below illustrates.
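To make this limitation concrete, the following minimal sketch (member names, instance names, and trade counts are invented for illustration) shows how a static assignment of clearing members to component instances behaves when one member suddenly dominates the flow:

```python
# Minimal sketch of static sharding: members are assigned to component
# instances once, at configuration time. All names and numbers are illustrative.
from collections import Counter

INSTANCES = ["clearing-engine-1", "clearing-engine-2", "clearing-engine-3"]

def assign_instance(member_id: str) -> str:
    # Static assignment: the hash of the member id permanently decides the instance.
    return INSTANCES[hash(member_id) % len(INSTANCES)]

# A sudden change in trading patterns: one member produces most of the trades.
trades = ["MEMBER-A"] * 9_000 + ["MEMBER-B"] * 500 + ["MEMBER-C"] * 500

load = Counter(assign_instance(member) for member in trades)
print(load)  # the instance that happens to host MEMBER-A receives 90% of the
             # load, and the static configuration cannot move it elsewhere
```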

All strategies described thus far share the disadvantage that they rely on a static configuration of hardware resources for different partitions of the market. This results in a direct and visible cost for the clearing house: when deciding to deploy the system for market X on a machine of type Y, you need to leave some headroom in case the system traffic increases unexpectedly. Perhaps a test is made to verify that machine Y can handle twice the current load of market X. While this gives confidence that the system will be able to handle a sudden surge in activity, it also effectively means paying for twice the hardware resources actually needed.

Elastic scalability

Enter elastic scalability. The goal of an elastically scalable system is to be able to add more hardware resources very quickly to a system and have it use the new resources in a meaningful way. Elastic scalability is associated with cloud environments, where new virtual machines can be created with the click of a button and the user only pays for the time a machine has been used. Tools like Docker make it possible to tailor virtual environments for different applications, and tools like Kubernetes make it much easier to monitor systems and start and stop virtual machines.

However, simply deploying a statically sharded system in the cloud does not make it elastically scalable. Building a system that detects the presence of new resources and rebalances itself to use them is a complex undertaking. Building the software infrastructure needed for rebalancing requires expert knowledge in distributed systems theory. Furthermore, building application components where the data in one component can suddenly be split up and run in several components requires a deep understanding of the consistency requirements in post-trade systems. How should the data be split up? When do different components need to synchronize to form a well-defined global state?
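As an illustration only, and not a description of any particular product, one well-known building block for this kind of rebalancing is a consistent-hash ring: when a new instance joins, only a fraction of the keys (accounts, in this hypothetical sketch) move to it, rather than everything being reshuffled.

```python
# Illustrative consistent-hash ring. Adding a node moves only a fraction of
# the keys, which is the property a rebalancing infrastructure relies on.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_node(self, node: str, vnodes: int = 64) -> None:
        for i in range(vnodes):
            self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position on the ring.
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["svc-1", "svc-2", "svc-3"])
keys = [f"account-{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("svc-4")                      # elastic scale-out: a new instance appears
after = {k: ring.node_for(k) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # roughly a quarter, not all of them
```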

Good infrastructure should leverage the proven building blocks for distributed systems that exist in the market today. There are reliable and scalable message brokers, databases, and load-balancing solutions, and only in very special cases should such components be implemented in-house. Of course, combining existing products into a platform that can serve all the different applications a clearing house needs can be a challenging task that requires in-depth knowledge of both business processes and technology.

For the last 15 years or so, the microservices approach has been the most successful design pattern for applications in elastically scalable systems. The basic idea is to split the system into many small services that are developed and deployed individually and co-operate by sending messages to each other. Pathbreakers like Amazon and Netflix have been transparent about their transition from monolithic systems to microservices. In their installations today, thousands of independent services run in parallel, sharing the load. As the system load increases, more instances can be started and allocated a part of the load; when traffic slows down and the load decreases, instances can be stopped. The hardware cost per minute charged by the cloud provider therefore rises and falls with the system load as microservice instances are started and stopped: on a busy day, the hardware bill is higher than on a quiet one. This is exactly the type of behavior that a clearing house needs, given its varying and volatile load patterns.
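The pay-for-what-you-use behavior described above boils down to a very simple control decision. The sketch below is purely illustrative (the thresholds and limits are invented, and real deployments add cool-down periods, draining, and health checks): compare the observed backlog with what one instance can handle and adjust the instance count accordingly.

```python
# Hypothetical scaling decision: size the number of service instances to the
# observed message backlog. All numbers are illustrative.
TARGET_MSGS_PER_INSTANCE = 5_000          # what one instance comfortably handles
MIN_INSTANCES, MAX_INSTANCES = 2, 50      # floor for resilience, ceiling for cost

def desired_instances(queue_depth: int) -> int:
    needed = -(-queue_depth // TARGET_MSGS_PER_INSTANCE)   # ceiling division
    return max(MIN_INSTANCES, min(MAX_INSTANCES, needed))

print(desired_instances(queue_depth=3_000))     # quiet period -> 2 instances
print(desired_instances(queue_depth=120_000))   # activity surge -> 24 instances
```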

Not only does a microservice-based system have good performance characteristics, but the decoupling of different parts of the processing also makes upgrades to a particular part of the system less risky, since services depend not on each other's implementations but solely on the message protocol specification. It also makes it possible to organize development more efficiently and let different teams work and schedule upgrades independently.

Building a microservice-based system

In a database-based system, the most critical design activity is data modelling. In a microservice-based system, data modelling is equally important, but several additional design challenges arise.

The first challenge is to set the consistency requirements. This typically requires a big change of mindset for both business analysts and developers. In a database-based system, there are typically several triggers that check consistency and invariants after each new transaction. For example, after a new deal from an exchange has been added, a trigger can check that all positions and payment instructions are balanced. In a distributed system, there is no global clock, and you cannot view the system processing as a sequence of consistent states. For example, payments for the buyer and seller may be handled by different microservice instances. These two instances do not run in lockstep with each other, and at a given point in time the buyer's payment instruction may have been processed while the seller's has not. The concept of one point in time is not meaningful in a distributed system where many processes run concurrently.
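The buyer/seller example can be illustrated with a deliberately simplified sketch (two threads stand in for two independent microservice instances; all names are invented): an observer who looks at the state in the middle may see one leg of the trade booked but not the other.

```python
# Illustrative only: the buyer's and seller's payment legs are handled by
# independent workers, so an observer may see a state with only one leg booked.
import random
import threading
import time

booked_payments = []              # the "observable state" in this sketch
lock = threading.Lock()

def payment_service(leg: dict) -> None:
    time.sleep(random.uniform(0.0, 0.05))     # independent processing time
    with lock:
        booked_payments.append(leg)

legs = [
    {"trade_id": 42, "member": "MEMBER-A", "side": "pay"},
    {"trade_id": 42, "member": "MEMBER-B", "side": "receive"},
]
workers = [threading.Thread(target=payment_service, args=(leg,)) for leg in legs]
for w in workers:
    w.start()

time.sleep(0.02)                  # look at "one point in time" mid-flight
with lock:
    print("observed now:", booked_payments)   # zero, one, or two legs

for w in workers:
    w.join()
print("eventually:", booked_payments)         # both legs, once everything is processed
```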

This may sound like a chaotic system with unpredictable output, but that need not be the case. Most microservice-based systems are eventually consistent, which means that the system settles into a consistent state once all input data has been processed. For a clearing system, that typically happens at the end of the business day. However, this is often not enough for the business process: it must be possible to produce a consistent state at any point in time during the day. There are many mechanisms for implementing synchronization points in a distributed system, points where the entire system state is consistent and predictable. Nevertheless, such synchronization points are expensive from a throughput perspective, as they typically mean queueing incoming messages while the consistent state is built.
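One way to realise such a synchronization point, sketched here under the strong simplification of a single consumer with an in-memory queue, is to pause consumption, build the snapshot from the state processed so far, and resume afterwards. Messages that arrive in the meantime simply wait, which is the throughput cost mentioned above.

```python
# Sketch of a synchronization point in a single consumer. While the snapshot
# is built, incoming messages stay in the queue; that queueing is the cost.
from queue import Queue

inbox: Queue = Queue()
positions: dict[str, int] = {}

def process_one() -> None:
    msg = inbox.get()
    positions[msg["account"]] = positions.get(msg["account"], 0) + msg["qty"]

def take_snapshot() -> dict[str, int]:
    # Consumption is paused here, so the snapshot reflects one well-defined
    # state of this service; anything still in `inbox` is not included.
    return dict(positions)

for m in [{"account": "A", "qty": 10}, {"account": "B", "qty": -10}]:
    inbox.put(m)
while not inbox.empty():
    process_one()

inbox.put({"account": "A", "qty": 5})    # arrives during the synchronization point
snapshot = take_snapshot()               # ...and is therefore not part of it
print(snapshot, "| pending messages:", inbox.qsize())
```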

Therefore, a critical task for the business analyst is deciding when consistency is needed. For example, when do all positions need to be balanced? When settling variation margin profits and losses between members? Yes, most certainly. When monitoring the margin/collateral balance for members in real time? No. This classification of use cases may feel unusual and uncomfortable, but it is key to building a truly concurrent system.

Another challenge is the message protocol between microservices. A good practice is for a microservice to expose only the minimum information that other microservices need to do their work. While it might be tempting to publish all the information a microservice holds, just in case some other service needs it in the future, this ties the hands of the development team responsible for the microservice. They become less free to change its implementation, since doing so would likely change the message flow, and the result is a tightly coupled system where microservices depend on each other's implementations.
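As a sketch of this practice (all field names are invented), the service below publishes a small, contract-style message and keeps its internal bookkeeping out of the protocol, leaving the team free to change that bookkeeping later.

```python
# Hypothetical illustration: publish a minimal contract, keep internals private.
from dataclasses import asdict, dataclass

@dataclass
class TradeRegistered:
    """The published message: only what other services need to do their work."""
    trade_id: int
    account: str
    instrument: str
    quantity: int
    price: float

# Internal state of the service. None of this is part of the published message,
# so the team can change it without touching the protocol or other services.
internal_record = {
    "trade_id": 1001, "account": "MEMBER-A", "instrument": "ABC-FUT-DEC25",
    "quantity": 25, "price": 101.5,
    "matching_engine_seq": 884_213,     # implementation detail
    "risk_cache_bucket": 7,             # implementation detail
}

event = TradeRegistered(
    trade_id=internal_record["trade_id"],
    account=internal_record["account"],
    instrument=internal_record["instrument"],
    quantity=internal_record["quantity"],
    price=internal_record["price"],
)
print(asdict(event))    # this, not internal_record, is what goes on the wire
```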

Transitioning from monolithic systems to microservices

Microservice-based systems differ in the deepest sense from traditional monolithic, database-based systems, and the transition may appear an impossibly large task. Here, the very concept of microservices comes to the rescue. While a big-bang approach, in which the entire system is replaced at once, makes it possible to streamline the data modelling and processes, the small size of microservices also makes a gradual approach possible. A first step could be to break out trade half matching, stress testing, payment netting, or any other processing step that lies at the beginning or end of the transaction chain. After such an initial project, with a framework for microservices in place, it will likely be much easier to transition more and more business logic from the monolithic system to the microservice-based one.
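To make the gradual approach concrete, here is a deliberately small sketch (event fields and values are invented) of the general shape of a first broken-out service, in this case payment netting: it consumes events the existing system already produces and publishes its own result, without reaching into the monolith's internals.

```python
# Hypothetical first broken-out service: payment netting. It only consumes
# settlement events the existing system already emits and publishes one result.
from collections import defaultdict

def net_payments(settlement_events):
    """Net all cash flows per member and currency."""
    net = defaultdict(float)
    for ev in settlement_events:
        net[(ev["member"], ev["currency"])] += ev["amount"]
    return [
        {"member": member, "currency": ccy, "net_amount": round(amount, 2)}
        for (member, ccy), amount in net.items()
    ]

events = [
    {"member": "MEMBER-A", "currency": "EUR", "amount": 1_000.0},
    {"member": "MEMBER-A", "currency": "EUR", "amount": -250.0},
    {"member": "MEMBER-B", "currency": "EUR", "amount": -750.0},
]
print(net_payments(events))   # one netted instruction per member and currency
```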

Any market operator that feels a system modernisation is due, either because the system is approaching its performance limits or because it has become so complex that modifications and upgrades are unwieldy, should consider the microservice approach. Vendors of modern, microservice-based post-trade systems can help with analysis to find suitable parts to break out of the original system.