If you've been watching YouTube lately, the encoding settings for the video you watched might have been selected by MuZero - using less of your bandwidth for the same quality. The task here is rate control: selecting the quantization parameters within the VP9 codec to maximize the quality at a specified bitrate.
This is a constrained RL problem, requiring us to optimize two conflicting objectives of variable difficulty at the same time. To deal with this challenge we introduce a self-competition based reward mechanism, where the reward depends on how successful other recent episodes were at maximizing the quality while staying under the bitrate concern. This allows the agent to improve smoothly at any stage of learning, no matter whether it has not yet learned to stay within the constraints, or whether it is in the final stages of maximizing quality.