
Wednesday, December 16, 2020

Avoiding Quality Disasters


On September 23, 1999, NASA's latest planetary probe, the Mars Climate Orbiter, began its orbital insertion maneuver to enter Mars orbit. It was never heard from again. The subsequent investigation determined that the likely cause was that one component of the software system reported measurements in imperial units while the rest of the system expected metric. The discrepancy sent the orbiter too deep into the Martian atmosphere, where it was destroyed instead of settling into orbit. As a NASA executive summed it up: "The problem here was not the error; it was the failure of NASA's systems engineering, and the checks and balances in our processes, to detect the error."

This was not NASA's only mission failure or mishap around that time. Failed rocket launches, a blurry space telescope, and the two shuttle tragedies underscored how the organization had let its quality practices slip. Studies of NASA's culture, most notably sociologist Diane Vaughan's analysis of the Challenger disaster, coined the phrase "normalization of deviance" to describe how a growing tolerance of known problems led to these failures. Managers and teams simply got used to the lowered quality until it was too late.

While the consequences of low quality aren't as significant for game developers and studios, "normalization of deviance" impacts us as well...especially near our own launches. We increasingly tolerate bugs and crashes until there is a disastrous release.

Anyone who has read the reviews of some recently released games knows the impact of compromised quality. Players will ignore ingenious gameplay, technology, and art when bugs get in their way.

Debt

Bugs, glitches, bad art, unfinished mechanics, and the like are all forms of "debt". Ward Cunningham, one of the authors of the Agile Manifesto, likened problems in technology, such as bugs and unmaintainable code, to financial debt: the longer it goes unpaid, the more it costs to pay back.

Debt results in an ever-expanding amount of work that must be addressed before the game is deployed. Numerous forms of debt exist in game development, such as:

  • Technical debt: Bugs and slow, unmaintainable code.
  • Art debt: Assets such as models, textures, and audio needing replacement or tuning.
  • Design debt: Unproven mechanics waiting to be fully integrated into the game to prove their value.
  • Production debt: The amount of content that the game needs to be commercially viable (for example, 12 to 20 hours of gameplay).
  • Optimization debt: Work waiting to be done to run the game at acceptable frame rates on target platforms.

Left unmanaged, the work needed to address debt becomes unpredictable, leading to crunch and compromised quality as key features or assets are dropped to hit a ship or deployment date. Worse yet, testing, which is often deferred to the end of development, is curtailed to meet the scheduled release.

Debt is always going to exist. The quality issues seen in recently released games are all instances of poorly managed debt. This article describes one approach to managing debt effectively.

Managing Debt

Studios usually decide to improve debt management only after it has hurt them badly: a release's sales are low, or a live game is shedding players who won't put up with the problems any longer. At that point, like doctors, we have to employ practices that stabilize the patient, fix some of the root problems, and set up a system where the patient slowly returns to health and stays that way.

The approach I find works the best has two parts:

  • Create a quality strike team to stabilize the game and establish metrics and testing tools.
  • Continuously roll out improved studio-wide practices.

Strike Team, Assemble!

The adage that "quality is everyone's responsibility" is correct. However, building that culture takes time, and without proper metrics and systems in place, it won't happen. Therefore, putting together a team of developers with insight into the technology and a passion for fixing problems is a necessary step. It's useful to add a dedicated Product Owner to prioritize the effort, measure the team's cost, and protect it. Protection is crucial because a common cause of massive debt is the pressure stakeholders put on teams to add features to the game as quickly as possible. This pressure always results in teams skipping some of the practices that improve quality, such as iterating on a mechanic, refactoring messy, unmaintainable code, or polishing an asset.

This isn't a "cleanup crew." The team's responsibility isn't to refactor code, iterate on mechanics, or polish assets. That would be cruel to them and would even encourage developers outside the team not to address debt...they'd just hand their problems off to this team.

The focus of the strike team is to:

  • Implement metrics that clearly show the quality of the current build
  • Build test automation and a test flow that will catch debt quickly
  • Find the root causes of debt and address them

Strike Team Product Ownership Role

A Product Owner for the strike team is essential. The team will have its own backlog, and this PO will work with the other product POs to understand the practices emerging for developers and their impact on short-term velocity. While quality improvements boost productivity over the long term, they cost time in the short term (which is one reason teams often don't do enough of them).

The Initial Product Backlog

A Product Owner's primary responsibility is to maintain a Product Backlog that the team works from. Below are examples of the initial epic goals (goals too big to be completed in a single Sprint/iteration) of a strike team product backlog.

The overall epic user story: "As the Quality Product Owner, I want the current release to be free of priority one issues, so we don't lose existing and potential customers."

Epics are broken down into smaller epics as the team approaches the work on them. An example:

"As the Quality Product Owner, I want a set of metrics that show the quality of the current release."

Acceptance Criteria:

  • The rate of crashes and locked builds per hour, per 1,000 users (a sketch of this calculation follows the list)
  • The rate at which users encounter problems (e.g., microphone issues)
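A minimal sketch of how the first criterion might be computed from telemetry counts is below; the function name and sample numbers are hypothetical, not from a real project.

    # Normalize raw crash/lock-up counts by play time and audience size so that
    # builds with different numbers of players can be compared.
    def crashes_per_hour_per_1000_users(crash_count, total_play_hours, active_users):
        """Crash/lock-up rate per hour of play, per 1,000 active users."""
        if total_play_hours == 0 or active_users == 0:
            return 0.0
        return (crash_count / total_play_hours) / (active_users / 1000.0)

    # Example: 84 crashes across 5,600 play hours from 7,000 active users
    print(crashes_per_hour_per_1000_users(84, 5600, 7000))  # ~0.002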

--

"As the Quality Product Owner, I want a set of automated tests that will "bless" a potential build as free of priority one issues before it is released."

Acceptance Criteria: The app launches in a test suite, and an avatar runs through several tests (e.g., crossing room boundaries, testing audio).

--

"As the Quality Product Owner, I want a definition of done established, with regular retrospectives to catch defects and lead to improved development practices to avoid release issues."

These epic goals will often challenge the team to innovate in new areas. For example, they might ask, "How do we measure quality?" In this case, the team came up with the following solutions:

  • Have the game engine send an email containing debug information every time it crashes.
  • Build up a series of automated tests that catch problems.

Most modern engines and operating systems can catch asserts and use them to trigger emails to an address that collects the debug information (stack traces, player behaviors, hardware information, etc.). Additionally, a watchdog process can determine whether the game is "locked up". Together, these tools give you a reliable metric of stability.
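To make that concrete, here is a hedged sketch of the watchdog idea: a small process that monitors a heartbeat file the game writes every few seconds and emails the latest session log when the heartbeat goes stale. The paths, addresses, and SMTP host below are placeholders; a real engine would usually plug into its own crash or assert handler instead.

    import os
    import time
    import smtplib
    from email.message import EmailMessage

    HEARTBEAT_PATH = "/var/game/heartbeat"      # touched by the game loop every few seconds
    LOG_PATH = "/var/game/latest_session.log"   # debug info to send along
    STALE_AFTER_SECONDS = 30                    # no heartbeat for 30s => assume "locked up"
    REPORT_ADDRESS = "crash-reports@example.com"

    def heartbeat_age():
        """Seconds since the game last touched the heartbeat file."""
        return time.time() - os.path.getmtime(HEARTBEAT_PATH)

    def send_lockup_report():
        """Email the latest session log to the address collecting debug information."""
        msg = EmailMessage()
        msg["Subject"] = "Watchdog: build appears locked up"
        msg["From"] = "watchdog@example.com"
        msg["To"] = REPORT_ADDRESS
        with open(LOG_PATH, "r", errors="replace") as log:
            msg.set_content(log.read())
        with smtplib.SMTP("smtp.example.com") as server:
            server.send_message(msg)

    if __name__ == "__main__":
        while True:
            if heartbeat_age() > STALE_AFTER_SECONDS:
                send_lockup_report()
                break  # one report per lock-up; a real watchdog would also restart the game
            time.sleep(5)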

Create a Target

What is our quality target? Why not shoot for 100%? That's nearly impossible, but that's OK. For those of you who use OKRs, this is called an "aspirational OKR": "Aspirational OKRs express how we'd like the world to look, even though we have no clear idea how to get there and/or the resources necessary to deliver the OKR." - Google OKR Playbook.

Aiming for 100% quality, it took us six months to go from 25% (the percentage of time a build had zero detectable priority one defects) to 95%. From there, it took another six months to reach 98%. The benefits were tremendous: the boost in productivity from working with a stable, performant game was clear.

Strike Team Work

The work the strike team took on from here fell into three categories:

  1. Find and fix root causes
  2. Build test automation
  3. Collaborate with the rest of the developers on improved practices

Find and Fix Root Causes

When collecting metrics and crash data, it's useful to capture what the user was doing at the time of the crash. Using that information, the team can categorize the causes and, using root cause analysis, identify the most impactful culprits. In this example, the team found that poorly named assets were responsible for 25% of the crashes. With a simple fix to a few exporters, they eliminated a quarter of the crashes.
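A small sketch of that triage step, assuming crash reports have already been tagged with a suspected cause (the report format and tags below are invented for illustration):

    from collections import Counter

    def rank_root_causes(crash_reports):
        """crash_reports: iterable of dicts with a 'suspected_cause' field."""
        counts = Counter(report["suspected_cause"] for report in crash_reports)
        total = sum(counts.values())
        # Print the most frequent suspected causes first, with their share of all crashes.
        for cause, count in counts.most_common():
            print(f"{cause:20s} {count:4d} crashes ({100 * count / total:.0f}%)")

    rank_root_causes([
        {"suspected_cause": "asset naming"},
        {"suspected_cause": "asset naming"},
        {"suspected_cause": "out of memory"},
        {"suspected_cause": "network timeout"},
    ])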

Build Test Automation

Test automation is another key to quality. Manual testing can't keep up when a full pass through the game takes a day while developers are committing hundreds of changes.

Test automation should take a layered approach, with each layer catching a different class of problems and taking progressively longer to run.

An example of the layers, ordered from simple/quick to complex/time-consuming (a sketch of how they might be chained follows the list):

  1. Compile all build configurations for all target hardware. This catches compile/link errors.
  2. Localized unit testing that exercises the code around the committed changes.
  3. Asset export/validation. Export assets for all hardware configurations and run checks against naming conventions and other standards (texel density, etc.).
  4. Smoke-test all levels or modes on all platforms and detect whether any of them crash the game.
  5. Scripted gameplay. Have a replay or scripted playthrough run through the entire game (or portions of levels or modes) to find crashes.
  6. Run the full unit test suite for the entire code base. This can take an entire night.
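As a sketch of how these layers might be chained, the script below runs a set of hypothetical stage commands from fastest to slowest and stops at the first failure so feedback arrives as early as possible. The shell script names are placeholders for whatever build and test runners a studio already has.

    import subprocess
    import sys

    # Ordered fastest to slowest; a failure in any stage stops the pipeline.
    STAGES = [
        ("compile all configurations", ["./build_all_configs.sh"]),
        ("localized unit tests",       ["./run_unit_tests.sh", "--changed-only"]),
        ("asset export & validation",  ["./export_and_validate_assets.sh"]),
        ("smoke test all levels",      ["./smoke_test.sh", "--all-platforms"]),
        ("scripted gameplay run",      ["./scripted_playthrough.sh"]),
        ("full unit test suite",       ["./run_unit_tests.sh", "--all"]),
    ]

    for name, command in STAGES:
        print(f"=== {name} ===")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"FAILED at stage: {name}")
            sys.exit(result.returncode)

    print("Build blessed: all layers passed.")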

The number and type of these tests are determined by the frequency of the problems found in the root cause analysis. For example, one team found that code changes often broke Android builds, so running the tests that caught those problems first was a good approach.

Test automation requires not only the cost of writing the tests but also an investment in test servers and, for large AAA games, in improving the studio's network to handle the frequent transfer of huge code and asset files.

Scripted gameplay/replays can find subtle problems. For one racing game, we had AI vehicles race each other in every race; it took most of the night. The test recorded all the finish times and flagged discrepancies. One morning we found that none of the AI players had finished one of the races. Replaying the race, we found that a prop had accidentally been moved into an intersection. The collision with this prop set off a pileup among the AI players that they could not recover from.
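A toy sketch of that overnight check, with run_ai_race standing in for whatever replay or simulation hook the engine exposes (assumed here, not a real API):

    EXPECTED_MAX_SECONDS = 600  # no clean race should take longer than this

    def check_all_races(race_ids, run_ai_race):
        """run_ai_race(race_id) -> dict of driver name to finish time in seconds
        (None means the driver never finished)."""
        problems = []
        for race_id in race_ids:
            finish_times = run_ai_race(race_id)
            for driver, seconds in finish_times.items():
                if seconds is None or seconds > EXPECTED_MAX_SECONDS:
                    problems.append((race_id, driver, seconds))
        return problems  # an empty list means every AI driver finished every race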

Collaborate with Developers on Improved Practices

The most challenging part of this approach is getting development teams to change their practices and culture to support higher quality. Developers often resist changing practices they've spent a career building up, and trying to force change on them will always fail.

The following approaches can better influence the changes needed to improve quality.

Create a shared vision of why these changes are needed

It's not a hard sell to convince developers that making better games with less "death-march crunch" is a good thing. That vision has to be connected with changing the way they work.

Work together to improve practices, starting with outcomes

Scrum has a "definition of done," which every feature considered complete in a Sprint must achieve. After every Sprint, the Product Owner and Development Team meet to discuss the quality issues impacting them and the game and explore improving that definition of done. Those improvements lead to changes the team experiments with over the coming Sprints, keeping the ones that help and abandoning the ones that don't. Keep these changes small so as not to overwhelm the team. Teams typically aim to improve quality by 1-2% every Sprint, which doesn't sound like much, but it compounds to roughly 30-70% improvement over a year of two-week Sprints (1.01^26 ≈ 1.3; 1.02^26 ≈ 1.7).

Explore significant changes with beachhead teams

Some changes, such as implementing unit testing, are significant and harder to roll out incrementally. For these, having a single development team explore the value of, and barriers to, the new practice can be a valuable start. These teams, called beachhead teams after the soldiers who land on an enemy's beach first, can refine the practices to best fit the current development effort and culture. When the changes are rolled out more widely, they also make effective coaches for the other teams.

Summary

Disaster is often the catalyst for dramatic change, but the price paid for that lesson is too high. Years of work on a game can be wasted by a bad launch. Day-one patches are not the solution; they're an admission of failure.

Changing a development culture before a disaster forces it is not easy. It costs money and time. It requires courage, leadership, and patience.