If you've been watching YouTube lately, the encoding settings for the video you watched might have been selected by MuZero - using less of your bandwidth for the same quality. The task here is rate control: selecting the quantization parameters within the VP9 codec to maximize the quality at a specified bitrate.
This is a constrained RL problem, requiring us to optimize two conflicting objectives of variable difficulty at the same time. To deal with this challenge we introduce a self-competition based reward mechanism, where the reward depends on how successful other recent episodes were at maximizing the ...