03/09/23 | Dev

How We Got to the Best-Performing VALORANT Servers Since Launch


Delivering a world-class competitive experience on 128-tick servers for everyone has been a priority for VALORANT since long before launch. To read more about how we achieved this goal prior to launch, check out our article on VALORANT’s 128-Tick Servers.

Maintaining server performance is a constant battle. As new features are introduced, we need to stay hypervigilant about meeting our performance budgets to continue providing 128-tick servers. Over time, minor degradations compound, and occasional major degradations require immediate attention.

In August 2022, we encountered the largest server performance degradation since launch. This issue affected games of VALORANT with more than the standard 10 players (e.g. custom matches and esports matches). Solving this issue required the efforts of a small army of engineers, producers, leadership, and the esports team on the ground in İstanbul right before one of our biggest competitive events of the year.

Our investigation while fixing this degradation revealed several opportunities to reduce server frametime, and acting on them produced large performance improvements. Since patch 5.07, players have been experiencing the most consistent 128-tick servers since VALORANT launched in June 2020.

My name is Aaron Cheney, and I’m an engineer on VALORANT’s Performance Team. Here’s an overview of what this article covers:

  • It describes how the Performance Team catches degradations.
  • It discusses a major performance degradation discovered just before Champions 2022.
  • It reviews our triage process, provides insight into how our teams handle emergent issues, and explains our learnings from this incident.

First, I’ll cover how we catch performance degradations and explain some of the important background needed to understand the severity of the degradation.

How Do We Catch Performance Degradations?

Performance degradations are a reality of game development. Games are massively complex systems with many moving parts, each of which contributes to overall performance. This complexity is magnified as features and systems are added over time. It’s important to accurately detect when degradations occur and attribute them to a specific root cause.

VALORANT has a dedicated Performance Team responsible for this. Our team is charged with monitoring, maintaining, and improving both client and server performance across a variety of hardware. We leverage several tools and processes to catch degradations early and correct them before they ship to players.

Server Performance Targets

Maintaining 128-tick servers means each server frame needs to complete in less than 7.8125 milliseconds (ms). If a frame used that entire window, though, a single game would take up an entire CPU core. To meet our strict 128-tick requirements and operating targets, VALORANT’s servers are optimized to fit 3 frames into ~7.8ms. This means each server frame needs to average less than 2.6ms. In practice, our performance target for server frame time is 2.34ms. This performance target fulfills two important purposes:

  1. It provides necessary headroom. Having extra room in our performance targets helps to absorb longer frames without going under 128-tick. This makes our servers resilient in more situations and allows room for the OS, scheduling, and other software running on the server.
  2. It helps us ensure VALORANT stays financially viable. Server infrastructure is expensive. Although it would be cool if each game of VALORANT ran on a single piece of hardware, that’s not a sustainable business model.
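The arithmetic behind these targets can be sketched in a few lines. This is only an illustration of the numbers quoted above: the function and constant names are made up for this example, and the roughly 10% headroom figure is derived from 2.34 / 2.6 rather than being a number we state anywhere.

```cpp
// Figures taken from the text: 128 ticks per second, 3 frames per tick window.
// All names here are illustrative, not part of any real VALORANT code.
constexpr double kTickRateHz = 128.0;
constexpr int kFramesPerTickWindow = 3;

// Length of one tick window in milliseconds: 1000 / 128 = 7.8125.
constexpr double TickWindowMs() { return 1000.0 / kTickRateHz; }

// Average per-frame ceiling when three frames share one tick window (~2.6 ms).
constexpr double PerFrameCeilingMs() { return TickWindowMs() / kFramesPerTickWindow; }

// Fraction of the per-frame ceiling left unused by a given target frame time.
constexpr double HeadroomFraction(double targetMs) {
    return 1.0 - targetMs / PerFrameCeilingMs();
}
```

Calling `HeadroomFraction(2.34)` gives a value just above 0.10, which is the slack that absorbs the occasional long frame.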


Measuring the Present and Predicting the Future

The Performance Team utilizes two primary sources of data to catch degradations:

  1. Prerelease Data – This is data generated from several sources, including internal playtests, automated tests, and PBE.
  2. Live Data – This is data generated from players in our live environments around the world.

Prerelease Data allows us to predict the future by analyzing unreleased content and features. This data set is much smaller than live data, which means it’s often hard to draw conclusions due to noise and variance; trends don’t show up as clearly with few samples. Corner cases can also go untested because, try as we might, we can’t exhaustively test everything.

On the other hand, Live Data allows us to understand the current player experience by measuring exactly what players see in the game. This data set is much larger than internal data (generating billions of records), which can make it difficult to process. We aggregate this data to make sense of it.

Generally speaking, Prerelease Data empowers us to be proactive with unreleased content while Live Data enables us to be reactive to live conditions. When used together, these two data sets catch the majority of issues.

A Typical Game of VALORANT

The vast majority of VALORANT players engage with the game through the Unrated and Competitive queues. Both queues represent the same game mode with exactly 10 players, just with different stakes. Although there are many other ways to play VALORANT (Replication, Spike Rush, Deathmatch, etc.), we spend a significant amount of resources to ensure the Spike Game Mode is the best possible experience.

While most games of VALORANT have exactly 10 players, some special (and important) situations arise where more than 10 players are needed.

Non-Standard Games of VALORANT

As many as 22 clients are connected to a Spike Game Mode during an esports broadcast. These are custom matches that support the watch experience. In these games, it’s common to see the following breakdown:

  • 10 slots are dedicated to the players (5 for each team).
  • 2 slots are dedicated to coaches (1 for each team).
  • 10 slots are dedicated to observers controlled by our broadcast team. These extra slots enable the watch experience for fans cheering from home or in the arena!

These 22-client games are well outside the typical game of VALORANT. Each connected client in a game must be considered by the server to ensure each client receives necessary information from the server to display the game on screen. This would normally present an issue in our live environment. However, for an esports game, we use dedicated local hardware to maintain gameplay integrity for the athletes.

Performance Targets for Non-Standard Games of VALORANT

Remember the server performance targets from earlier? Our target of 2.34ms per server frame is intended for the typical game of VALORANT with exactly 10 players. For situations where more than 10 players are connected, our expectation is performance will scale linearly with the number of additional players.

For example, with 22 clients the average frame time would scale up by a factor of 2.2, going from 2.34ms to roughly 5.15ms. That is still significantly under the 7.8ms frame time required to maintain 128-tick servers.
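As a quick sanity check, the linear extrapolation can be written out directly. This is a minimal sketch of the expectation described above, not our production tooling; the names are invented for illustration.

```cpp
// Target frame time for a standard 10-client game, as quoted in the text.
constexpr double kTargetFrameMs10Clients = 2.34;
// One 128-tick window in milliseconds (1000 / 128).
constexpr double kTickWindowMs = 1000.0 / 128.0;

// Expected average frame time if cost scales linearly with client count.
constexpr double ExpectedFrameMs(int clients) {
    return kTargetFrameMs10Clients * clients / 10.0;
}

// True when the linear estimate still fits inside a single tick window.
constexpr bool FitsTickWindow(int clients) {
    return ExpectedFrameMs(clients) < kTickWindowMs;
}
```

`ExpectedFrameMs(22)` evaluates to about 5.15, and `FitsTickWindow(22)` is true, which is why a 22-client game on dedicated hardware was not expected to be a problem.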

Based on our experience with non-standard games and our strong understanding of VALORANT’s server performance, this has been an accurate way to extrapolate performance data from standard games to non-standard games.

But what happens when something changes? What happens when server performance no longer scales linearly with the number of connected clients?

The Degradation

Let’s rewind the clock to August 2022. The first sign of trouble appeared during a 22-client playtest. Players in the playtest noticed server tick fluctuations and felt general inconsistencies during gameplay. Video recordings of the game confirmed something was amiss. We were quite lucky to catch this issue because 22-client playtests were not standardized at the time. In the midst of team shuffling and preparations for Champions 2022, one clutch VALORANT developer remembered to request a 22-client playtest from our testing vendors.

Although it wasn’t yet time to sound the alarm, we went to work immediately to validate the issue, quickly triage the problem, and work toward creating a fix.

Validating the Issue

A single data point doesn’t indicate a trend. We began by processing telemetry from our live environment to fully understand the severity and extent of the problem, separating games by the number of clients involved ranging from 10 to 22 clients. What we uncovered was shocking.

Although we didn’t know the exact cause of the issue, the graphs showed an alarming relationship between the number of connected clients and server performance. The issue scaled non-linearly with the number of players.
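To make the idea concrete, here is a rough sketch of the kind of breakdown involved: group frame-time samples by client count, then compare each group’s mean against the linear expectation from the 10-client baseline. The `FrameSample` record and both function names are assumptions for this example; our real telemetry pipeline is not shown in this article.

```cpp
#include <map>
#include <vector>

// Hypothetical telemetry record: one server frame from a game with `clients` connected.
struct FrameSample {
    int clients;
    double frameMs;
};

// Mean server frame time for each client count present in the samples.
std::map<int, double> MeanFrameMsByClients(const std::vector<FrameSample>& samples) {
    std::map<int, double> sums;
    std::map<int, int> counts;
    for (const FrameSample& s : samples) {
        sums[s.clients] += s.frameMs;
        counts[s.clients] += 1;
    }
    std::map<int, double> means;
    for (const auto& entry : sums) {
        means[entry.first] = entry.second / counts[entry.first];
    }
    return means;
}

// Observed mean divided by the linear extrapolation from the baseline group.
// A ratio near 1 matches linear scaling; a ratio well above 1 suggests the
// cost is growing faster than linearly. Throws std::out_of_range if either
// client count has no samples.
double LinearityRatio(const std::map<int, double>& means, int clients,
                      int baselineClients = 10) {
    const double expected = means.at(baselineClients) * clients / baselineClients;
    return means.at(clients) / expected;
}
```

With a healthy build, `LinearityRatio(means, 22)` should sit close to 1. A value like 2 would mean 22-client games cost twice what the linear model predicts.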

This data validated that we had a major issue on our hands, particularly with Champions 2022 around the corner, which was going to be played on patch 5.04. We sounded the alarm and immediately began triaging the problem.

[Chart: server frame times by patch, one line per client count from 10 to 22]

(This chart shows the server frame times by patch. Each line represents the number of clients, ranging from the standard 10-client games to the maximum of 22-client games. As expected, performance gets progressively worse with each connected client, but the spike with patch 5.03 meant our performance was degrading non-linearly.)

Triaging the Degradation

Triaging a problem generally means mitigating the effects of the problem without fully understanding the root cause. This technique is used in medicine to assess the situation and to “stop the bleeding,” and it’s used throughout Riot to quickly respond to emergent issues.

Two simultaneous efforts were undertaken to triage the problem.

Creating Contingency Plans

First, VALORANT leadership worked with our partners in İstanbul to create contingency plans. This was done to determine what would happen if we couldn’t fix the issue before the start of Champions 2022.

Since we understood the problem scaled with the number of connected clients, we were particularly interested in determining the minimum number of observers needed to deliver an uncompromised esports watch experience. Running 22 clients was off the table… so how many clients could we theoretically run under these conditions? Partnering with our friends in esports, we determined that 15 connected clients would deliver stable 128-tick servers while giving esports 5 total observer slots to create the broadcast experience. This was made possible by the headroom built into our standard server frametime.

Determining the Root Cause

Second, VALORANT developers continued investigations to better understand the problem and find the root cause of the degradation. When something this major appears, it’s unlikely to stem from multiple systems.

We thought we might get lucky, quickly find the root cause of the issue, and deploy a fix before the start of Champions 2022. Unfortunately, that didn’t happen.

Fixing the Degradation

The degradation was first identified on August 25, 2022, and was resolved one week later. Over the course of 7 days, many VALORANT engineers were pulled onto the issue to explore multiple potential solutions.

On the first day of investigations, we determined which part of the code base was responsible for the degradation: Replication (no, not the game mode). Replication is the way Unreal ensures consistency between servers and clients by synchronizing properties on Actors and Components. You can read more about Unreal’s Replication system here.

Our initial priority was analyzing changes between 5.02 and 5.03 to find evidence of what may have changed with Replication. This proved difficult since we had just upgraded to Unreal Engine 4.26, meaning several core parts of the engine were modified. (To read more about how VALORANT updates the Unreal Engine, check out this Twitter thread by VALORANT’s Tech Lead, Marcus Reid.)

Investigating Replication was not easy. It’s not a system that requires significant attention from VALORANT developers because we’ve had established best-practices for many years. It’s also not a system created by Riot engineers; we’ve made modifications in the base engine to meet our requirements, but deep expertise on the subject is uncommon.

Next, we systematically broke down various parts of the game to measure differences between 5.02 and 5.03 with the goal of narrowing in on the root cause. We leveraged profiling tools from Unreal and in-house tools at Riot to compare performance characteristics between the two patches. This eventually led us to the conclusion that Characters, Weapons, and Abilities were replicating significantly more often in 5.03 when compared to 5.02.

Two kinds of avenues were explored during this time:

  1. Finding THE degradation and fixing it. This was the ideal scenario we strived to achieve.
  2. Finding other opportunities to improve performance. If we could find other areas for improvement (enough to reclaim the lost frame time), then we could still bring performance back under control even if we couldn’t find and fix the original problem.

Multiple pods of engineers followed threads of investigation to test potential fixes and improvements. Along the way, many engineers gained familiarity and expertise with Replication. Several opportunities for improving performance were discovered during this time, but they were deemed too risky to deploy and were instead recorded for future work.

On September 1, 2022, we found the issue: a single line of code from the UE 4.26 engine upgrade caused Replication to run much more often under certain conditions that occur frequently within VALORANT. This resulted in a performance degradation that got significantly worse as more clients connected to the server. Although the problem was noticeable in standard 10-client games, it made 22-client games unplayable.

The degradation was identified; the fix wasn’t risky; and we had a high degree of confidence. This was the best of all worlds. We didn’t pop the champagne immediately, but we did have a really satisfying change in our code.

// <RGI> We handle this ForcePropertyCompare() call above. Adding it here (outside of the bNewlyReplicatesBlock)
// significantly increases the rate of comparisons, because the FScopedRoleDowngrades in UActorChannel::ReplicateActor() cause us to constantly
// change the bReplicates flag and re-force comparisons.
// Here lies the sanity of 12 VAL devs.
// ForcePropertyCompare();
// </RGI>
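For readers who don’t live in engine code, a toy model may help show why one unguarded call matters. The sketch below is not Unreal Engine code and uses none of its APIs: `ToyActor`, its two setters, and `CountForcedCompares` are hypothetical stand-ins. It also assumes, as a simplification, that the flag is re-applied with the same value once per connected client each frame.

```cpp
// Toy model only: these types and functions are invented for illustration.
struct ToyActor {
    bool replicates = false;
    long forcedCompares = 0;

    // Guarded variant: force a full property comparison only when the actor
    // newly starts replicating (false -> true).
    void SetReplicatesGuarded(bool value) {
        const bool newlyReplicates = value && !replicates;
        replicates = value;
        if (newlyReplicates) {
            ++forcedCompares;
        }
    }

    // Unguarded variant: the extra force-compare sits outside the guard,
    // so it runs on every call regardless of whether anything changed.
    void SetReplicatesUnguarded(bool value) {
        replicates = value;
        ++forcedCompares;
    }
};

// Simulates `frames` server frames in which the flag is re-applied once per
// connected client, and returns how many forced comparisons one actor incurs.
long CountForcedCompares(int clients, int frames, bool guarded) {
    ToyActor actor;
    for (int frame = 0; frame < frames; ++frame) {
        for (int client = 0; client < clients; ++client) {
            if (guarded) {
                actor.SetReplicatesGuarded(true);
            } else {
                actor.SetReplicatesUnguarded(true);
            }
        }
    }
    return actor.forcedCompares;
}
```

In this model the guarded actor is force-compared once in total, while the unguarded one is force-compared on every call, so the count grows with both the frame count and the number of connected clients. The real engine behavior is more involved, but the shape of the problem is the same: work that should happen once was happening constantly.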

After the Degradation

Once the source of the degradation was identified and fixed, we conducted extensive testing to validate that performance was back in line with expectations and to ensure no bugs resulted from the change.

Rolling Out the Fix

Although the fix was identified on September 1st, 2022, the changes weren’t deployed to our esports environment until September 8th, 2022, after the Group stages. This was done to mitigate risk by providing extra time for testing and validation, and to minimize disruption to folks in İstanbul.

Our live environments had the fix deployed in patch 5.05, leading to significant improvements.

Beyond Champions 2022

The fun didn’t stop with 5.05. During the week of investigation into the Replication system, many performance opportunities were identified and scheduled for further development. At the time they were too risky to deploy since they didn’t address the root cause of the problem. After fixing the original degradation, we took advantage of the newly-gained expertise with Replication to address several areas we felt would further improve server performance.

These efforts resulted in significant performance improvements across the board, bringing a 15% reduction in overall server frametime compared to pre-degradation numbers. These changes also reduced variance across all regions, increasing server frametime stability. 

Since patch 5.07, more than 99.3% of server frames have met our strict 128-tick requirements, which means players are enjoying the most stable, consistent servers since VALORANT launched.
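A statistic like this boils down to counting how many frames finish inside a time budget. The sketch below shows that calculation under stated assumptions: the function name is invented, and the budget is left as a parameter because this article doesn’t spell out which threshold the live metric uses.

```cpp
#include <cstddef>
#include <vector>

// Fraction of frames that completed strictly under `budgetMs`.
// Returns 0.0 for an empty sample set.
double FractionWithinBudget(const std::vector<double>& frameMs, double budgetMs) {
    if (frameMs.empty()) {
        return 0.0;
    }
    std::size_t withinBudget = 0;
    for (double ms : frameMs) {
        if (ms < budgetMs) {
            ++withinBudget;
        }
    }
    return static_cast<double>(withinBudget) / static_cast<double>(frameMs.size());
}
```

For example, frame times of 2.0, 2.5, 3.0, and 9.0 ms measured against a 7.8125 ms budget give 0.75, meaning three of the four frames fit within the budget.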

[Chart: server frame times by patch, one line per client count from 10 to 22]

(This chart shows the server frame times by patch. Each line represents the number of clients, ranging from the standard 10-client games to the maximum of 22-client games. After the major degradation in patches 5.03 and 5.04, we fixed the degradation and returned to normal values in patch 5.05. Further significant improvements released with patch 5.07, which saw a 15% improvement to server performance from pre-degradation numbers.)

Lessons Learned

This emergent performance issue took a small army to solve: a dozen engineers, esports partners, testing vendors, producers, team leadership, and the generosity of partner teams giving us advice and guidance. We learned a lot from this incident and found gaps in our processes that we want to address.

  • Monitoring for Non-Standard Games – Most of our attention is directed toward the typical VALORANT experience in a 10-player classic Bomb Game Mode. Our data did not show breakdowns according to the number of connected players, making the scale of the degradation unknown until new charts were created. We have since built new processes to check and understand performance characteristics for non-standard games of VALORANT with more than 10 clients.
  • Increasing Testing Coverage for Non-Standard Games – While the standard 10-player Bomb Game Mode will remain our top priority, we’re looking to build new processes and testing procedures to understand server performance in environments with 22 clients. The sample size for 22-client games is dwarfed by everything else, so gathering this information means regularly running 22-client playtests. It’s now standard to run 22-client playtests with every release instead of just before major tournaments. We’ll likely also need other, more sophisticated testing procedures.
  • Encouraging the Development of Subject Matter Experts (SMEs) – Even when there’s a significant investment of time required, creating SMEs on the team helps fill knowledge gaps and generally levels up everyone. Several engineers gained expertise with Replication in the process of fixing the degradation, which turned out to be a great investment of resources.
  • Minimizing Risk for Tournaments – Playing Champions 2022 on a fresh patch (especially considering we had just finished integrating a new version of the Unreal Engine) put us at risk. While we want to ensure gameplay changes (e.g. balancing, characters, etc.) make it in for major tournaments, future tournaments will likely not be played on the bleeding edge. At the very least we’ll double-check the risks in scenarios where that’s unavoidable.

You can expect us to continue making improvements like this to make it better to be a VALORANT player. Enjoy the 128-tick servers, and happy fragging!
