Optimization strategies for large scale distributed systems
Date: 14/09/2023
Author: Verbowski, Chad
Abstract
Every day we rely on large distributed software services for communication, information, commerce, entertainment, and many other personal and business use cases. These services are globally available, sometimes with billions of users. A single service may use hundreds of thousands of interconnected servers running in datacenters around the world to ensure it is both highly available and able to provide users with sub-second responses. Operating such a service costs hundreds of millions of dollars annually and typically requires thousands of engineers working on millions of lines of software to maintain and evolve it.
The scale and complexity of these systems make their optimization difficult; however, given the costs involved, it is vital that they operate as efficiently as possible. This thesis focuses on two of the most impactful optimization opportunities common to these systems: minimizing their resource allocations, and minimizing the time taken to resolve request attributes into processing parameters. This work was done in conjunction with one of the largest global commercial services to show the effectiveness of the techniques on real-world systems.
First, we focus on capacity planning, which has significant business and environmental impact. More than 1% (292 TWh) of global electricity is currently consumed by datacenters, and that amount continues to increase. The challenge is to determine resource needs for unexpected usage increases or failures in addition to steady-state activity; the goal is to reduce over-provisioning without impacting any service guarantees. This thesis presents a significant improvement over the state of the art via a new iterative black-box capacity planning model that relies only on the relationships between workload, utilization, and quality. Collaborating with one of the largest global service owners, we enabled capacity reductions of between 20% and 40%, saving $50,000,000 (USD) annually. A global datacenter usage reduction at this scale would save enough electricity to offset the annual consumption of 44 million people in the UK, eliminating 34 billion tonnes of CO₂ emissions.
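To make the idea of a black-box capacity planning loop concrete, the following is a minimal sketch of the general shape of such an iteration: an allocation is reduced step by step while a measured quality signal stays within its guarantee at the worst-case workload. All function names, thresholds, and the quality model below are illustrative assumptions, not the thesis implementation.

```python
# Sketch of an iterative black-box capacity planning loop (illustrative only).

def measure_quality(servers: int, workload_rps: float) -> float:
    """Stand-in for an observed quality metric (e.g. fraction of requests
    meeting the latency SLO) at a given allocation and workload.
    Assumes, purely for illustration, ~1000 rps of capacity per server."""
    utilization = workload_rps / (servers * 1000.0)
    return 1.0 if utilization < 0.6 else max(0.0, 1.0 - (utilization - 0.6) * 2.0)

def plan_capacity(current_servers: int, peak_workload_rps: float,
                  quality_slo: float = 0.999, step: int = 50) -> int:
    """Iteratively shrink the allocation while the quality SLO still holds
    at the peak (worst-case) workload, then stop."""
    servers = current_servers
    while servers - step > 0:
        if measure_quality(servers - step, peak_workload_rps) < quality_slo:
            break          # the next reduction would violate the guarantee
        servers -= step    # accept the smaller allocation and iterate
    return servers

if __name__ == "__main__":
    print(plan_capacity(current_servers=10_000, peak_workload_rps=3_500_000))
```

In practice the quality measurement would come from live or replayed traffic rather than a closed-form model; the point of the loop is that only the observed workload, utilization, and quality relationships are needed, with no model of the service's internals.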
Second, we focus on improving request latency by optimizing a component common to all services: resolving request attributes into processing settings. We describe a technique for translating tabular data into code and compiling it into a binary index for point look-ups on sparse data sets. Query performance of a large commercial service improved by 58 times compared to R-tree-based solutions, and by on the order of 10⁸ times compared to commercial databases.
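As a rough illustration of the table-to-code idea (an assumption about its general shape, not the thesis implementation), the sketch below emits source code from a small sparse lookup table and compiles it once, so that each subsequent point look-up is a direct function call rather than a generic index or database query. The table, keys, and settings are invented for the example.

```python
# Translate a sparse lookup table into code, then compile it (illustrative only).

TABLE = {                       # sparse mapping: request attributes -> setting
    ("us", "mobile"): "edge-cache",
    ("us", "desktop"): "regional",
    ("de", "mobile"): "eu-edge",
}

def generate_lookup_source(table) -> str:
    """Emit Python source containing one equality test per table row."""
    lines = ["def lookup(region, device):"]
    for (region, device), setting in table.items():
        lines.append(f"    if region == {region!r} and device == {device!r}:")
        lines.append(f"        return {setting!r}")
    lines.append("    return None   # key absent from the sparse table")
    return "\n".join(lines)

# Compile the generated source once; later look-ups are plain function calls.
namespace = {}
exec(compile(generate_lookup_source(TABLE), "<generated>", "exec"), namespace)
lookup = namespace["lookup"]

print(lookup("us", "mobile"))   # -> 'edge-cache'
print(lookup("fr", "mobile"))   # -> None
```

The thesis compiles the generated representation down to a binary index suitable for very large, sparse data sets; the sketch only conveys the shift from data-driven querying to compiled look-up code.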
Additionally, a new compiler optimization is introduced for handling large blocks of conditional evaluations, motivated by the observation that up to 70% of all processing time is consumed by 10% or fewer of the requests. Machine learning proved effective at guiding the compiler towards the most efficient optimization for each block of conditional evaluations, resulting in a 92-fold reduction in average latency.
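To convey the flavour of profile-guided strategy selection for a block of conditionals, here is a minimal sketch in which a simple rule-based chooser stands in for the learned model. The features, thresholds, and strategy names are illustrative assumptions and do not describe the compiler optimization in the thesis.

```python
# Stand-in for a learned model that picks an evaluation strategy for a block
# of conditionals from simple profile features (illustrative only).

from collections import Counter

def choose_strategy(hit_counts: Counter, num_arms: int) -> str:
    """Pick an evaluation strategy from the observed hit distribution."""
    total = sum(hit_counts.values()) or 1
    top_share = max(hit_counts.values(), default=0) / total
    if top_share > 0.5:
        return "reorder-by-frequency"   # a few hot arms dominate: test them first
    if num_arms > 64:
        return "hash-dispatch"          # many evenly-hit arms: jump-table style
    return "linear-chain"               # small block: keep the plain if/elif chain

profile = Counter({"arm_3": 7_000, "arm_17": 200, "arm_42": 100})
print(choose_strategy(profile, num_arms=128))   # -> 'reorder-by-frequency'
```

The underlying intuition matches the skew noted above: when a small fraction of requests accounts for most of the processing time, choosing the evaluation strategy per block, guided by profile features, pays off far more than any single fixed lowering of the conditionals.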