This series of posts first examined the importance of the CME to Optiver’s trading activities and the implications of their iLink architecture upgrade. We next examined our high-level approach: combine a thorough understanding of the problem with a disciplined scientific approach. Today we want to dive deeper and examine how we strive for simplicity in our system design and architecture.
To start, we aim to socialize our designs within our technology group. We spend a lot of time seeking feedback from other engineers within our group. We discuss them in meetings. We spend time at whiteboards with traders, developers, and operations engineers. We conduct design reviews with a diverse group of senior engineers from all parts of our technology group. Our goal in all of this is to beat up a design as much as possible in the hopes of exposing any flaws and finding areas to improve.
In the last post I mentioned our CTO’s challenge to design our FPGA with only one risk limit check. One of our most tenured engineers came up with a very clever way of achieving this goal. However, when we exposed this to a broad group of senior engineers in whiteboard discussions and a design review, we arrived at a different consensus. It would be far simpler for both the clients of the FPGA and the FPGA itself to use two counters rather than one. The group determined that the added complexity to the FPGA design was minimal enough that the broader reduction in systemic complexity was far more important. We arrived at a simple design by starting from an extreme constraint, designing a solution, and making some iterative improvements via a larger group discussion about the benefits and drawbacks of various proposals.
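The post does not spell out the winning design, but to make the trade-off concrete, here is a purely hypothetical sketch of why two counters can be simpler for everyone than one: each side of the risk check owns its own counter, and the hot-path check reduces to a single comparison. The class and field names are invented for illustration; this is not Optiver's actual FPGA design.

```python
class RiskCheck:
    """Hypothetical two-counter risk limit: volume allowed vs. volume sent.

    An illustration of the 'two counters' idea only, not Optiver's design.
    """

    def __init__(self, limit: int):
        self.allowed = limit  # counter raised by a (software) risk system
        self.sent = 0         # counter advanced by the order path

    def try_send(self, qty: int) -> bool:
        # The hot-path check is one comparison between the two counters.
        if self.sent + qty > self.allowed:
            return False      # in hardware, the order would be blocked here
        self.sent += qty
        return True
```

With two counters, clients never need to agree with the FPGA on how to fold two quantities into one number; each party simply advances the counter it owns.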
Crucial to these discussions being healthy conversations and not continual religious war is that we operate from a set of shared technical principles. As we stated in our introductory post, this does not mean we follow these principles dogmatically. But it does mean we start from the same place, and thus speak the same “technical language”. We still have strong, differing opinions on the best solution. But we always strive to learn from each other.
All of this raises the question: what are these technical principles? What follows are three principles which apply to many of our discussions.
No Future Proofing
In our last post we mentioned a couple of self-imposed constraints. Avoiding future proofing is a broad class of constraints we enforce on ourselves. Future proofing tends to produce generic systems, and we have found that highly specific systems are easier to code, more robust, and generally simpler than generic systems that can solve a variety of problems. Some examples of this in practice:
Do not start with a generic protocol, especially if you only have one concrete use case.
Do not build in hooks to support multiple order types when you will only support one in your first release.
Avoid optional and nullable fields in protocols, objects, and databases.
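As a concrete illustration of the last point, a message defined with exactly the fields one use case needs is easier to validate and reason about than one riddled with optional fields. A minimal Python sketch, with invented field names (this is not any real exchange protocol):

```python
from dataclasses import dataclass

# Specific: every field is required and validated at construction time.
@dataclass(frozen=True)
class LimitOrder:
    symbol: str
    price_ticks: int  # integer ticks, never a nullable float
    quantity: int

    def __post_init__(self):
        if not self.symbol:
            raise ValueError("symbol is required")
        if self.price_ticks <= 0 or self.quantity <= 0:
            raise ValueError("price and quantity must be positive")

# By contrast, a "future-proofed" message full of optional fields, e.g.
#   order_type: Optional[str], stop_price: Optional[int], ...
# forces every consumer to handle combinations that may never legitimately occur.
```

The specific version fails once, at the boundary; the generic one defers its failures to every downstream consumer.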
In the case of the iLink upgrade, our very first FPGA deployment was merely a TCP passthrough. We then only solved for a subset of the trading signals, order types, and strategies that our old system traded. And in building our system we strictly avoided anticipatory abstractions, templates, and generic structures in favor of specific code tailored to the problem at hand.
Fail Hard and Loud
We initially code our systems to crash loudly when they encounter unexpected situations. We are even open to embracing single points of failure which will bring down a host of other applications when they fail. In addition to being easier to implement, this gives us deterministic behavior in the midst of unanticipated situations. We can leverage this invariant to bring in “the humans”, who tend to be better than computers at handling vague and ambiguous situations, to intervene. We reinforce this principle by eschewing automatic restarts and keeping simple reset buttons out of the hands of our traders.
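A hedged sketch of what failing hard and loud can look like in application code: instead of logging a warning and limping along when an invariant breaks, the process halts with a clear message, leaving a deterministic state for humans to investigate. The function and message fields are illustrative, not from Optiver's codebase.

```python
import sys

def on_exchange_message(msg: dict, expected_session: int) -> int:
    """Process one message; terminate the process on anything unexpected."""
    # Unexpected situations halt the program loudly rather than being papered over.
    if msg.get("session") != expected_session:
        sys.exit(f"FATAL: unexpected session {msg.get('session')!r}; "
                 "halting for human intervention")
    if msg["type"] not in ("ack", "fill"):
        sys.exit(f"FATAL: unhandled message type {msg['type']!r}")
    return msg["seq"]
```

Note what is deliberately absent: no retry loop, no fallback path, no automatic restart. The only recovery mechanism is a human.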
Our new FPGA-based system was no different. In fact, failures became even harder and louder, as there were more dependencies, single points of failure, and cross-application state transitions than in our previous system. Leaning on a hard-fail mentality quickly surfaced problems. And because we had not future-proofed, and had focused on a small, well-contained part of the trading problem, a hard failure in our new system was not as catastrophic to our overall trading activities as it might have been otherwise.
No Tolerance for Errors
Tightly related to the previous point, we have very little tolerance for errors in our system. We have long had an instinct that there is a lot of value to be gained by paying attention when your system logs an error, or more generally does something unexpected. As we have put this philosophy into practice over eight years we have seen it consistently reveal larger problems than we expected.
No Errors in Logs: Diligently examining the errors in our logs has had a massive positive impact on our trading system. We have discovered misconfigurations, uncovered dead code, revealed opportunities for major architectural improvements, and stumbled upon strategic evolutions which improved our trading.
No Dropped Network Packets: Our continual fight against dropped packets has borne much fruit over the years. We have found bugs in network drivers, performance problems in desktop trading applications, idiosyncratic networking differences between allegedly similar versions of operating systems, and much more.
Deterministic Latency: Paying close attention to latency drifts has shown us everything from one-line bugs to important architectural limitations which only surfaced under specific, unanticipated use patterns.
At first glance this principle appears more operational than architectural. In fact, it is both. To have no tolerance for errors, you must first define how you expect your system to behave. So we make specific definitions of intent a first-class concept in system design. Which message types do you normally expect to receive from the exchange? What range of values do you expect in each field? What is the maximum size of your incoming data buffer, and what does it mean when that buffer approaches being full? What is the expected shape and magnitude of a response latency graph?
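Questions like these can be turned directly into executable expectations. A minimal sketch, with invented message types and thresholds: each definition of intent becomes a check that fails loudly the moment reality diverges from it.

```python
EXPECTED_TYPES = {"ack", "fill", "reject"}  # invented set of expected exchange messages
BUFFER_CAPACITY = 4096                      # invented buffer size (bytes)

def check_intent(msg_type: str, price_ticks: int, buffer_used: int) -> None:
    """Raise immediately when observed behavior violates stated intent."""
    if msg_type not in EXPECTED_TYPES:
        raise AssertionError(f"unexpected message type: {msg_type!r}")
    if not (0 < price_ticks < 1_000_000):  # expected field range
        raise AssertionError(f"price out of expected range: {price_ticks}")
    if buffer_used > BUFFER_CAPACITY * 0.9:  # a nearly-full buffer is itself an error
        raise AssertionError(f"buffer nearly full: {buffer_used}/{BUFFER_CAPACITY}")
```

Each violation is an opportunity to learn: either the system is wrong, or the stated intent is, and both outcomes improve the design.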
When designing a new system, you have the opportunity to start with a clean slate and quickly learn from situations where reality does not match your intent. In our preparations we worked to define a number of expected behaviors in our new system. As we rolled out our system, we leveraged a lack of future proofing to narrow the scope of our problem and build tailored, specific solutions. We knew if there was a problem, our system would crash and we could begin investigating the situation immediately. We could then leverage both these facts to define myriad expected behaviors. Whenever those expectations were violated, we would learn from the violation and evolve our system accordingly. Generic, resilient solutions make this sort of approach nearly impossible. There are simply too many possibilities to enumerate, and the mechanics of monitoring for all those possibilities in a system that runs continuously require an overwhelming degree of sophistication.
When you start from these three simple principles, you are more likely to build a simply designed system. When you show that design to other smart engineers and genuinely seek their thoughts and input, you are more likely to produce a design others think is simple as well.
David Kent, Chief of Staff - Technology
David is a Stanford Computer Science alum and spent several years as a developer at Amazon.com. He joined Optiver as a Software Engineering Lead in 2009 and has led many of Optiver’s software development teams. He is presently Chief of Staff for the Optiver US Technology Group.