A customer from the entertainment video streaming business came to us with an interesting problem: their servers did not crash at the same time, and they really wanted to achieve that synchronicity.
The servers that puzzled the customer worked as a backend for storing video files. Essentially, it was a large fleet of nodes holding tens of terabytes of video files, previously transcoded by converters into different resolutions. All these millions of files were then served to the outside world by nginx + kaltura, which repackaged the mp4s on the fly into DASH/HLS segments. This setup withstood even high loads well, since the player received only the segments it needed, without sudden bursts.
Problems arose when it came to geo-redundancy and scaling as the load grew. Servers within the same redundancy group did not die synchronously, since they were a very diverse zoo with different providers, channel widths, disks, and RAID controllers. We had to audit all this beauty and rebuild the monitoring, together with the resource management methodology, almost from scratch.
Architecture Features
The first question that came up during the audit was why a full-fledged commercial CDN was not used. The answer turned out to be very prosaic: it was cheaper. The video storage architecture scaled horizontally quite well, and the relatively stable traffic made renting dedicated servers with good channels more profitable than third-party boxed solutions. A paid CDN was still set up, but at petabyte-scale content delivery it was very expensive for the customer, so it was switched on only in emergencies. That rarely happened, since in most cases growth in visitor numbers was predicted in advance, which made it possible to calmly order another server and add it to the overloaded group.
All video content was divided into dozens of groups, each containing 3-8 servers with identical content. Every group except the last one worked entirely in read-only mode. The scheme looked quite workable, but there were several very unpleasant catches:
- The servers were different. No, not even that: the servers were a ZOO. The diversification policy required having several independent suppliers across many locations around the world. Naturally, the hardware was rented at different times, with different disks, RAID controllers, and even different channel widths.
- For a number of reasons, full load balancing with health checks was impossible at this stage. If one server in a cluster crashed, users still received its IP address from DNS, so until the load was removed manually, some of them kept hitting the dead node and got errors when loading content (a toy illustration of this effect follows the list).
- Since servers with 1 and 10 gigabit/s channels could coexist in the same group, it was completely unclear when it was time to purchase additional capacity.

As it turned out later, some server groups ended up over-provisioned as a result, while others were under-provisioned.
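To make the second point concrete, here is a toy simulation (the IP addresses and group size are invented) of what round-robin DNS without health checks does when one node dies: roughly 1/N of all requests keep landing on it.

```python
# Toy illustration: round-robin A records with no health checks mean a
# dead node keeps receiving ~1/N of the traffic until removed by hand.
import random

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
dead = {"10.0.0.3"}  # node down with a failed disk, still present in DNS

failed = sum(random.choice(ips) in dead for _ in range(100_000))
print(f"~{failed / 1000:.1f}% of requests hit the dead node")
```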
Servers go down one by one
During the analysis, we divided the servers into two large groups:
- A group of servers with a small upload bandwidth limit (1G-2.5G)
- A group of servers with a large upload bandwidth limit (10G)
The problem was that incidents happened from time to time. One of the latest looked like this: a sudden flood of Asian users began pulling tens of gigabits of all available content, and this coincided with one of the powerful nodes in the group being out of service due to a failed disk and a lengthy array rebuild.
The servers with the widest channels were the first to give up, running into their IOPS ceiling. Nodes with narrow channels dropped out later, but because of network capacity limits instead. The customer had tools for redistributing traffic between servers, but they were applied rather empirically: it was very difficult to compare nodes with each other.
Thus, we got two main groups:
WAN-limited server group
This group includes servers with a channel narrower than 2.5G; the threshold was determined empirically by analyzing peak values.
As the load climbs toward critical, these servers never reach critical CPU iowait values: denial of service occurs due to network saturation. Moreover, within this group failure occurs at different absolute loads (a 1G node goes into DoS earlier than a 2G one), so the raw traffic figure cannot be used without normalization.
IOPS-limited server group
This group includes servers with a channel wider than 2.5G.
As the load climbs toward critical, these servers' upload traffic never reaches critical values: denial of service occurs because the disk subsystem is overloaded.
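In code, the split itself is trivial. A minimal sketch, assuming only a link width per node (the 2.5G threshold comes from the analysis above; the function name and the single "limiting resource" label are our shorthand):

```python
# Classify a node by which resource it will exhaust first.
def limiting_resource(link_gbps: float) -> str:
    # Below ~2.5 Gbit/s the uplink saturates before the disks do;
    # above it, the disk subsystem (iowait/IOPS) gives out first.
    return "network" if link_gbps < 2.5 else "disk"

for width in (1.0, 2.0, 10.0):
    print(f"{width}G node is {limiting_resource(width)}-limited")
```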
Let’s normalize!
Since the data comes at different scales, it has to be normalized before anything useful can be done with it. That is why we immediately decided to abandon absolute values on the main dashboard. Instead, we isolated two indicators and mapped each onto a scale from 0 to 100%:
- CPU iowait, which clearly shows overload of the disk subsystem. From historical data we determined empirically that the customer's systems perform acceptably roughly as long as iowait stays below 60%.
- Channel width. Here, too, everything is transparent: as soon as the load reaches the channel limit, users begin to suffer due to restrictions on the provider's side.
We defined 100% as the state in which a server hit its IOPS or channel limit and went into DoS.
Thus the scale became relative, and we began to see when a server was approaching its personal limits. Whether its channel is 1 gigabit or 10 no longer matters at all: the data is now normalized.
A second problem remained: a server can fail for either of the two reasons, and both must be taken into account in the final diagnosis. So we added another layer of abstraction that shows the worst value across the two resources. A minimal sketch of the whole scheme is shown below.
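Here is a minimal sketch of that normalization, assuming the per-node metrics are already collected somewhere. The server names, metric plumbing, and sample numbers are invented; the 60% iowait ceiling and the per-node channel limits come from the text above.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    link_gbps: float    # physical uplink limit (1, 2.5, 10, ...)
    iowait_pct: float   # current CPU iowait, 0-100
    egress_gbps: float  # current outbound traffic

IOWAIT_CEILING = 60.0   # empirically, nodes degrade above ~60% iowait

def disk_load(node: Node) -> float:
    """Disk subsystem load, normalized so that 100% means imminent DoS."""
    return min(node.iowait_pct / IOWAIT_CEILING * 100, 100.0)

def network_load(node: Node) -> float:
    """Uplink load, normalized against the node's own channel width."""
    return min(node.egress_gbps / node.link_gbps * 100, 100.0)

def effective_load(node: Node) -> float:
    """The 'worst resource' layer: whichever limit the node hits first."""
    return max(disk_load(node), network_load(node))

nodes = [
    Node("cheap-1g", link_gbps=1.0, iowait_pct=12.0, egress_gbps=0.9),
    Node("fat-10g", link_gbps=10.0, iowait_pct=48.0, egress_gbps=4.2),
]
for n in nodes:
    print(f"{n.name}: {effective_load(n):.0f}% of its personal limit")
```

Note how the two sample nodes illustrate the point: the 1G node is nowhere near its disk limit but at 90% of its channel, while the 10G node has plenty of spare bandwidth and is at 80% of its iowait ceiling. On an absolute traffic chart they would look incomparable; on this scale they are both simply "close to failure."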
What we got
We ended up with a multi-layered sandwich of formulas that did not show the real absolute load on a server at all, but accurately predicted exactly when it would fail.
Imagine a team with strong and weak sled dogs. It does not matter to us exactly how much horsepower (dog power?) each of them produces. We need the whole sled to move steadily, with each dog loaded equally relative to its own capabilities.
As a result, we got this:
- We automated the entire process of adding new nodes with terraform/ansible, simultaneously generating dashboards whose normalization formulas take the physical limits of each server into account (a hypothetical sketch of this generation step follows the list).
- We built a solid model that made it possible to add new nodes at the right time, avoiding the situation of "three servers are half loaded, and the fourth is already down."
- The servers actually began to fail completely synchronously. The real test came some time later, with a powerful DDoS from parser bots: the load reached its limit on all nodes almost simultaneously, on some nodes in IOPS, on others in channel width.
- The customer was able to keep their model of expanding the infrastructure with hardware of heterogeneous capacity.
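The dashboard generation from the first bullet could look something like this. This is a hypothetical sketch: the real pipeline used terraform/ansible, and the article does not name the monitoring stack, so the Prometheus/node_exporter assumption, the metric expressions, and the inventory layout are all ours.

```python
# Render per-host panel queries from an inventory that records each
# server's physical channel width, so every panel is normalized to
# that node's own limits.
inventory = [
    {"host": "node-a:9100", "link_gbps": 1.0},
    {"host": "node-b:9100", "link_gbps": 10.0},
]

# Percent of the node's own channel width (0-100).
NET_EXPR = (
    'rate(node_network_transmit_bytes_total{{instance="{host}",device="eth0"}}[5m])'
    ' * 8 / {link_bps} * 100'
)
# Percent of the empirical iowait ceiling: 60% iowait reads as 100% here.
DISK_EXPR = (
    'avg(rate(node_cpu_seconds_total{{instance="{host}",mode="iowait"}}[5m]))'
    ' / 0.60 * 100'
)

for node in inventory:
    link_bps = int(node["link_gbps"] * 1e9)
    print(f'--- panels for {node["host"]} ---')
    print(NET_EXPR.format(host=node["host"], link_bps=link_bps))
    print(DISK_EXPR.format(host=node["host"]))
```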
Overall, this was not an ideal solution from an architectural point of view, but it solved the main problems quickly and allowed the existing resources to be utilized as fully as possible. In particular, the new monitoring and normalization gave us the specific cost efficiency of each node in a group. To put it very simply, it became clear how many users each node serves per dollar spent.
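For example (numbers invented purely for illustration), the comparison that the normalized view makes fair is as simple as:

```python
# Back-of-the-envelope "users per dollar": peak concurrent users served
# at the node's personal limit, divided by its monthly rent.
nodes = {"cheap-1g": (2_000, 150), "fat-10g": (9_000, 900)}
for name, (users, cost) in nodes.items():
    print(f"{name}: {users / cost:.1f} concurrent users per dollar")
```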
Down the road, this allowed us to continue the refactoring: we introduced caching with bcache, increased the performance of individual nodes tenfold, and generally threw more than half of the servers out into the cold. But that is a story for another time.
And if you need someone to go through your infrastructure piece by piece, set up monitoring, and automate everything end to end, get in touch with us at WiseOps. We will help.