why video encoding for fps games is hard

why fast-paced games like apex legends tend to artifact more than third person games

pov: you are making a video player beg for its life

you know that moment when a stream turns into a pixelated mess of rectangles during an intense firefight in apex? yeah, that's bad encoding doing its thing. i saw this post by theo and it's spot on: video encoding struggles hard with chaos, and as someone who deals with high-motion anime for a living, i can tell you why third-person games stream smoother and how encoding magic works (or doesn't) when everything's moving at once.

theo is right, this is inherently an encoding problem, and it has mostly been solved if you have the time and infra to throw at it. there are plenty of tricks the industry uses to make encoding high-motion video work, but all of them come at a cost. if you have an hour and a couple of gpus, you can make something great. but no one is open to an hour-long delay when watching a livestream. the real problem is doing the fix really fast and keeping the packet sizes really small when it all gets packaged for dash or hls video streaming.

first person vs third person

imagine you have a coloring book. for parts of the page where things don’t change much, like a sky that’s mostly blue, you can just color one section and say, “the rest of this is the same.” but when you’re coloring a picture of a busy playground with kids running around, you have to color each part differently because everything is moving or changing.

now, let’s look at a screenshot from apex legends. suppose we highlight areas:

  • static parts: the health bar, map, and ammo count at the corners of the screen don’t move much between frames. these are like the sky in the coloring book. you don’t have to redraw them every time. the video encoder saves effort by reusing these parts.
  • high motion parts: the character sprinting, bullets flying, explosions, and camera movements are like the busy playground. every frame has something new, so the encoder has to “color” (encode) those changes instead of reusing parts from the previous frame.

high-motion video, like fast gameplay in apex legends, has lots of changing parts. this means the encoder has to work harder to analyze and compress these new details, which takes more computing power and results in larger video files if you want to keep the quality high.
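
to make that concrete, here's a minimal sketch (not a real encoder) of the "reuse what didn't change" idea: split two frames into 16x16 blocks and count how many actually changed. it assumes numpy and fakes the frames with random data, but the shape of the result is the point: a tps-style frame where only a small region moved barely registers, while an fps-style frame where everything moved forces nearly every block to be re-encoded.

```python
import numpy as np

BLOCK = 16  # macroblock size, similar to h.264's 16x16 luma blocks

def changed_block_ratio(prev_frame, curr_frame, threshold=4.0):
    """fraction of 16x16 blocks whose mean absolute difference
    from the previous frame exceeds `threshold`."""
    h, w = prev_frame.shape
    changed = total = 0
    for y in range(0, h - BLOCK + 1, BLOCK):
        for x in range(0, w - BLOCK + 1, BLOCK):
            diff = np.abs(
                curr_frame[y:y+BLOCK, x:x+BLOCK].astype(np.int16)
                - prev_frame[y:y+BLOCK, x:x+BLOCK].astype(np.int16)
            ).mean()
            total += 1
            if diff > threshold:
                changed += 1
    return changed / total

# fake 720p grayscale frames: a "third person" frame where only a small
# region moves vs an "fps" frame where basically everything moves
prev = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
tps = prev.copy()
tps[300:400, 500:700] = np.random.randint(0, 256, (100, 200))
fps = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)

print("tps-like frame:", changed_block_ratio(prev, tps))  # small fraction
print("fps-like frame:", changed_block_ratio(prev, fps))  # close to 1.0
```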

fewer things change in third person

here’s an example of the best third person game ever made, elden ring. in the first shot we highlight the areas the encoder may treat as “sections”. the dots represent the boxes of pixels that can be chunked and processed. in the second image nothing really changes. it’s not like the fps where everything is changing significantly. the only thing that really changes is that the character stopped moving, so there is a lot the encoder can play around with to optimize the video. a significant amount of the frame stays relatively the same. a bonus here is that fromsoftware games have a desaturated, monochromatic color scheme/filter for a lot of the game, so although there is detail in the castle, the fact that there isn’t a huge variance in color hue from pixel to pixel means that a bad encode might go unnoticed. dark horror games also benefit from this (but the truth of a bad encode gets revealed if a light suddenly turns on or something; you’ll start to see artifacting).

third person games aren’t immune

there can be times where a third person game has the same problem. when you turn the camera around while keeping your character still, your character can be optimized and considered still, while the rest of the frame changes significantly. this won’t be at the magnitude of an fps game, since tps games don’t rely as much on fast, twitchy camera movements.

the time constraints: vod-to-live vs live

at crunchyroll i led a lot of the launches for kaiju no 8 last year, the first anime to ever air simultaneously in japan and on an international streaming service. but even then, we would get the anime a week or so before the launch. this is considered a ‘vod-to-live’ process rather than a traditional live streaming setup, since you get the entire video prior to launch, whereas in live streaming you are simultaneously encoding new content and delivering it. because of this, in vod-to-live we have the time to use a very resource-intensive encode and spin up a bunch of server nodes to split up the work of encoding. there is this idea of “slow” encoding where you can make the outputted video as efficient and small as possible, but it takes time. this can take up to a half hour to an hour if we’re encoding something like a 50gb prores video file. streaming services like twitch and youtube live simply cannot do this. they’re realtime.
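
to give a feel for what that “slow” path looks like in practice, here's a rough sketch of a two-pass, slow-preset encode of the kind a vod-to-live pipeline can afford. it assumes ffmpeg with libx264 is installed; the filenames and bitrate are placeholders, not the actual crunchyroll settings.

```python
import subprocess

SRC = "episode_master.mov"   # e.g. a big prores mezzanine file
OUT = "episode_1080p.mp4"

common = ["ffmpeg", "-y", "-i", SRC, "-c:v", "libx264",
          "-preset", "veryslow", "-b:v", "8M"]

# pass 1: analyze the whole video, write stats, throw the output away
subprocess.run(common + ["-pass", "1", "-an", "-f", "null", "/dev/null"], check=True)

# pass 2: use those stats to spend bits where the motion actually is
subprocess.run(common + ["-pass", "2", "-c:a", "aac", OUT], check=True)
```

a live encoder never gets that first pass, which is most of the story here.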

the resource constraints

there’s also the constraint of resources. with twitch you have thousands of people streaming at the same time. when i worked at disney streaming, we had espn+, which would only have like 30-40 live streams at once max. although it was true live, you could allocate a large amount of processing power since the volume wasn’t high. you also have complete control over the broadcasting side, so you still have influence over the encoding profiles used. twitch, or aws for that matter, has very optimized infra designed to use as little of twitch’s compute per stream as possible.

encoding is motion pilled

most encoders, like ffmpeg, or anything similar to what twitch uses, don’t just blindly send frames. they break each frame into macroblocks (like 16x16 grids of pixels), then compare those blocks to the previous frame. the encoder figures out what’s changed and only sends data about those changes (motion vectors) + some residual data for details it couldn’t predict. this is called inter-frame compression, and it’s what makes real-time encoding possible without melting your internet.
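
here's a toy version of that block-matching step, just to show the mechanic: for one 16x16 block, search a small window in the previous frame for the best match (lowest sum of absolute differences) and treat that offset as the motion vector. real encoders do this smarter and often in hardware, so this is purely illustrative; it assumes numpy.

```python
import numpy as np

def motion_vector(prev, curr, bx, by, block=16, search=8):
    """brute-force search: where in `prev` does the block at (bx, by)
    in `curr` come from? returns (dx, dy) and the matching cost."""
    target = curr[by:by+block, bx:bx+block].astype(np.int32)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > prev.shape[0] or x + block > prev.shape[1]:
                continue
            cand = prev[y:y+block, x:x+block].astype(np.int32)
            sad = np.abs(target - cand).sum()  # sum of absolute differences
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad  # low sad = good prediction = tiny residual to send

prev = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
curr = np.roll(prev, shift=(2, 3), axis=(0, 1))  # simulate the frame shifting a bit
print(motion_vector(prev, curr, bx=32, by=32))   # finds (-3, -2) with sad 0
```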

furthermore, twitch isn’t encoding just any content. it’s not traditional movies or sitcom tv shows. it’s mostly gaming.

fps games like apex are chaotic as hell. constant motion means most macroblocks are changing every frame. the encoder’s like, “wtf do i prioritize here?” and with twitch’s 6mbps bitrate limit, it can’t fit all that data in, so it guesses and cuts corners. that’s where you get blocky artifacts, smearing, and blurry textures.

cartoons or slow third-person games are much easier. static backgrounds and predictable movements mean the encoder doesn’t need to work nearly as hard. fewer changes = less data = better-looking streams, even at the same bitrate.

also, twitch’s encoding pipeline is built for everyone. they use h.264, which is compatible with almost everything but isn’t as efficient as newer codecs like av1 or vp9. and they’re stuck with one-pass encoding for live streams. no time for a second pass to analyze and optimize. the encoder just processes the frame, guesses what matters most, and sends it out.
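
for contrast with the slow two-pass encode above, this is roughly the shape of the one-pass, latency-first encode a live pipeline is stuck with: fast preset, no lookahead, hard bitrate cap, whatever the scene is doing. it assumes ffmpeg with libx264; the input file and rtmp url are placeholders, and the 6M cap just mirrors the limit mentioned above.

```python
import subprocess

subprocess.run([
    "ffmpeg", "-re", "-i", "gameplay_capture.mp4",   # -re: feed input in real time
    "-c:v", "libx264",
    "-preset", "veryfast",        # cheap per-frame decisions, no time for better ones
    "-tune", "zerolatency",       # drops b-frames and extra buffering for low delay
    "-b:v", "6M", "-maxrate", "6M", "-bufsize", "12M",  # hard cap, chaos or not
    "-g", "120",                  # keyframe every 2 seconds at 60fps
    "-c:a", "aac",
    "-f", "flv", "rtmp://example.invalid/live/stream_key",
], check=True)
```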

how ffmpeg processes frames

under the hood, ffmpeg and similar encoders split video into groups of pictures (gops). gops have:

  • i-frames (keyframes): a full image, like a jpeg.
  • p-frames: partial frames based on differences from previous frames.
  • b-frames: bidirectional frames that reference both previous and future frames (though live streaming often skips b-frames for lower latency).

when encoding live, the encoder keeps a buffer of recent frames to calculate motion vectors and residuals for p-frames. but with fps games, the sheer amount of change overwhelms the motion estimation algorithms, especially when bitrate is capped. ffmpeg, for example, uses block-matching and other prediction models, but the more change there is, the less accurate it gets.
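
if you want to see what gop structure an encoder actually produced, ffprobe can dump each frame’s picture type. this little sketch (assuming ffprobe is installed; the filename is a placeholder) just counts i/p/b frames; a zerolatency-style live encode will typically show no b-frames at all.

```python
import subprocess
from collections import Counter

out = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "frame=pict_type", "-of", "csv=p=0",
     "stream_recording.mp4"],
    capture_output=True, text=True, check=True,
).stdout

# one line per frame, each just "I", "P", or "B"
print(Counter(line.strip() for line in out.splitlines() if line.strip()))
```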

solutions…

there are a couple of things they can do to further optimize this, but i imagine they’ve tried most of these and there’s probably a good reason they aren’t doing them already:

  • better codecs: switch to av1 or vp9. av1, in particular, offers way better compression efficiency, meaning better quality at the same bitrate. the downside is increased hardware requirements for both encoding and decoding.
  • adaptive bitrate streaming: twitch already supports this to some extent (lowering quality for viewers with slow internet), but they could push for more dynamic encoder settings based on scene complexity, even within a single stream.
  • per-title optimization: platforms like netflix optimize encoding settings for specific shows. twitch could implement presets for game genres—fps, moba, chill streams—that adjust encoder bias based on expected motion complexity.
  • higher bitrate caps: even bumping it to 8mbps for 1080p60 would help a lot. sure, it’s more strain on their cdn, but with modern internet speeds improving, the trade-off might be worth it. but this might mean they lose even more money than they already are
  • educating streamers: twitch could ship new obs profiles tuned to their service. they should really be partnering with obs on a "twitch mode" that optimizes settings automatically. they could also create content around building tools or tutorials for streamers to tweak settings. things like limiting resolution to 900p or using fps caps (e.g., 48fps) for fast games could improve quality without increasing bitrate (there’s a rough sketch of that trade right after this list).
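
this isn’t a real obs profile, just the trade that last bullet is describing, expressed as an ffmpeg filter chain: keep the same bitrate cap but spend it on fewer, smaller frames by scaling to 900p and capping at 48fps. assumes ffmpeg with libx264; the filenames are placeholders.

```python
import subprocess

subprocess.run([
    "ffmpeg", "-i", "apex_1080p60.mp4",
    "-vf", "scale=-2:900,fps=48",   # ~1600x900 at 48fps: fewer pixels per second to encode
    "-c:v", "libx264", "-preset", "veryfast", "-tune", "zerolatency",
    "-b:v", "6M", "-maxrate", "6M", "-bufsize", "12M",
    "-c:a", "copy", "apex_900p48.mp4",
], check=True)
```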

i’m personally learning cuda and low-level languages like c and assembly so that i can push the limits of what’s possible with video. i think there are a lot of ways we can create new algorithms and perhaps use computer vision to make smarter encodes that understand the stylistic aspects of a video at a high level, rather than just relying on encoding profiles tuned to your type of content, since most of these profiles are proprietary and locked behind closed doors.