Identifying WebRTC bugs in Google Chrome originating from feature experiments
We at WebinarGeek, like many web app developers, rely heavily on the stability and performance of Google Chrome. Chrome is the dominant browser in our user base and powers our media streaming solution, which is built on WebRTC.
So, we rely on Google Chrome and it is important that things keep working. But what if they don’t? And what if it only happens for a small percentage of your users? And what if you can’t reproduce it? You need some help, some tooling, a lot of time and some luck, but we figured it out.
The issue
Part of our web app relies on functionality within WebRTC that allows a MediaStream to be relayed to another peer. I have a video meeting with person X, and then I relay the video of person X to person Y.
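To make the relay concrete, here is a minimal sketch of the idea, not our production code. The helper name relayStream and the connection names pcFromX and pcToY are made up for illustration, and the small interfaces only exist so the snippet compiles without browser typings.

```typescript
// Minimal structural types so the sketch compiles without DOM typings.
// In a browser, MediaStreamTrack, MediaStream and RTCPeerConnection fit them.
interface TrackLike {
  readonly kind: string;
}
interface StreamLike {
  getTracks(): TrackLike[];
}
interface OutgoingPeer {
  addTrack(track: TrackLike, stream: StreamLike): unknown;
}

// Hypothetical helper: forward every track of a received stream to another
// peer connection. Returns the kinds of the forwarded tracks for logging.
function relayStream(stream: StreamLike, outgoing: OutgoingPeer): string[] {
  const relayedKinds: string[] = [];
  for (const track of stream.getTracks()) {
    outgoing.addTrack(track, stream);
    relayedKinds.push(track.kind);
  }
  return relayedKinds;
}

// Browser usage (assumption: pcFromX and pcToY are RTCPeerConnection objects
// and the sender attached its tracks to a stream):
//   let relayed = false;
//   pcFromX.ontrack = (event) => {
//     if (relayed) return; // ontrack fires once per track
//     relayed = true;
//     relayStream(event.streams[0], pcToY);
//   };
```

After adding the tracks, the connection to person Y still has to go through the usual offer/answer negotiation before any media flows.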
One user experienced an issue with this functionality. The issue is that person Y had no video. Or well, they got the video object, but it had no video frames. So all person Y could see was a black video screen. But they could hear the audio of person X, so that worked fine!
N=1 right? Nope. More users came. Not too many, but they came. Same issue. But when we tried it on our machines, everything worked just fine. So we had to investigate further.
Finding commonalities
Since we couldn’t reproduce the issue we had to look at commonalities among the users with issues. Here is what we found out:
- All users ran on Windows 10.
- All users ran on the latest Google Chrome version 85.0.4183.102 or higher.
Since we were unable to reproduce the issue on Windows 10, we started looking at the specific Windows 10 versions of the affected users. Thanks to our users providing this info, we found out that all of them were on Windows 10 Version 1903 or higher. Unfortunately, still no luck in reproducing.
The rule of exclusion
If we are unable to reproduce an issue, we often start with the rule of exclusion. That is, excluding as many factors from the equation as you can, until the issue no longer occurs. Or in this case, until it no longer occurs for our users. Thanks to a few willing users we could run some of these experiments with them and see whether they made things work. We usually start with a list of assumptions about possible causes and try to prove or disprove them.
What we tried:
- Switching up roles. So Person X would become Person Y etc. That often worked (which also gave us a bit of a workaround, yaay!). Even if both of them ran the same browser and OS.
- Change devices. Different webcam? No luck. Different computer? Yes, that often worked.
- Run Chrome incognito, as sometimes browser plugins can result in weird behaviour. No luck.
- Run Canary and Beta versions of Chrome. No luck: if this is a browser issue, it is also present in upcoming versions.
- Run a different browser based on Chromium, for instance Microsoft Edge. That worked in all cases. That gave us another viable workaround.
But Microsoft Edge and Google Chrome are both built on Chromium, so how can they differ when it comes to such core functionality?
Is it our code, or is it the browser code?
Even though it made no sense at all, it could still be an issue in our code. With WebRTC you rely on a long list of events that have to happen in a specific order for everything to work. Maybe if you change the order somewhere you would get this issue? No luck. We analysed and compared detailed logs of all WebRTC connections and couldn’t find anything strange or different.
Our first goal was to replicate this specific functionality (relaying a MediaStream) outside of our application, so we can rule out anything within the application.
I often first go and look at the official GitHub samples for WebRTC, as there may be a sample that does exactly what our application does. And I found it, an example called Multiple relay.
I shared the example with our users who experienced the issue and all of them confirmed that by clicking on “Insert relay” the screen on the right turned black, whereas if we ran the sample ourselves it just showed our own camera.
So that gave us enough confidence to say the issue was not in our application, but an issue in the browser. An issue in Google Chrome.
Browsing code commits in Chromium
Thanks to Git we have some great tools like bisect, and thanks to this great outline on bisecting browser bugs we could try to narrow down the issue to a specific code commit. However, we failed. That was sort of what we expected, as we already knew that many users with the exact same browser version did not experience the issue. And to bisect, you need to be able to reproduce the issue, which we still couldn’t at this point.
I did scan through all the code changes between Chrome 85.0.4183.83 and 85.0.4183.102 (.83 being the last known version that didn’t have this problem) and stumbled upon this change, which was more or less what I was looking for: a change that stops sending frames if the source is “tainted” (something to do with cross-origin content).
But still, it didn’t sit right, why does it occur for some users and not for all?
Could a Chrome variation be the culprit?
When I’m really lost in the world of WebRTC I send a message to Philipp Hancke, a true WebRTC guru. He told me to check out the Chrome variations.
Variations?
Chrome Variations are a way for Google to try out new features in the real world to assess their usefulness.
To help guide the construction of features that users actually find useful, a subset of users may get a sneak peek at new functionality before it’s launched to the world at large. The field trials that are currently active on your installation of Chrome will be included in all requests sent to Google servers to allow Google to filter logs for only those generated by a given variation of Chrome. This Chrome-Variations header will not contain any personally identifiable information, and will strictly describe the state of the installation of Chrome itself.
The variations active for a given installation are determined by a seed number between 1 and 8192 (13 bits of entropy) which is randomly selected on first run. If you would like to reset your variations seed, run Chrome with the command line flag --reset-variation-state.
Source: the Google Chrome Privacy Whitepaper
Challenge accepted. We asked our affected users to submit the output of the chrome://version?show-internals-cmd page, which lists the variations currently active.
I put them all in a Google Spreadsheet along with a list of variations on machines which were unaffected. Surprisingly, after checking all available variations there was one variation which was present on all affected machines and on none of the unaffected machines: 15e323ed-a79d803f.
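The spreadsheet comparison boils down to a set operation: keep the variations that appear on every affected machine and on no unaffected one. The sketch below shows that logic under a few assumptions. The function name suspectVariations is made up, and apart from 15e323ed-a79d803f all variation IDs in the sample data are invented placeholders.

```typescript
// Returns the variation IDs that are present on every affected machine
// and on none of the unaffected machines.
function suspectVariations(affected: string[][], unaffected: string[][]): string[] {
  const [first] = affected;
  if (first === undefined) return [];

  // Collect everything seen on at least one unaffected machine.
  const onUnaffected = new Set<string>();
  for (const machine of unaffected) {
    for (const id of machine) onUnaffected.add(id);
  }

  const affectedSets = affected.map((machine) => new Set<string>(machine));

  // Start from the first affected machine and keep only the IDs that all
  // affected machines share and that no unaffected machine has.
  const candidates = first.filter(
    (id) => affectedSets.every((ids) => ids.has(id)) && !onUnaffected.has(id)
  );

  // Drop duplicates in case a machine listed an ID twice.
  return candidates.filter((id, index) => candidates.indexOf(id) === index);
}

// Sample data: only the first ID comes from this post, the rest are placeholders.
const affectedMachines = [
  ["15e323ed-a79d803f", "11111111-aaaaaaaa", "22222222-bbbbbbbb"],
  ["15e323ed-a79d803f", "22222222-bbbbbbbb"],
];
const unaffectedMachines = [["22222222-bbbbbbbb", "33333333-cccccccc"]];

const suspects = suspectVariations(affectedMachines, unaffectedMachines);
// suspects is ["15e323ed-a79d803f"]
```

A match like this is only a correlation. With few machines, an unrelated variation can pass the same filter by chance.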
Unfortunately, these variations are not in some public database so we had no clue what this variation was. But we had a strong suspicion this was the issue we’ve been looking for.
There is also no way for a web app to opt out of a variation.
To further strengthen this theory, we started Chrome with the --reset-variation-state flag on one of the affected machines and... the issue was gone! Which also gave us another workaround.
Filing a bug report
With all the information in hand it was time to submit a bug report, hoping Google would soon abandon the experiment that was causing this peer relay to break for some users.
Bisect variations
Unfortunately, Google engineers thought that this variation was not the cause and that something else was. To find out which variation was responsible for the issue, we had to run a bisect_variations Python script. Which is a fun exercise!
The script basically opens up a session of Google Chrome and upon exit asks you whether you could reproduce the issue or not. Based on these responses it will drill down the list of experiments to one experiment.
After answering 10 or so questions the result was that MediaFoundationVP8Decoding was responsible for the issue. It was the only variation that let us reproduce the video without frames.
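The drill-down described above is a binary search over the list of experiments. The following sketch illustrates the principle only. It is not the actual bisect_variations script, the names bisectExperiments and Reproduces are made up, and it assumes exactly one experiment in the list causes the issue.

```typescript
// Stands in for "start Chrome with only these experiments enabled and ask
// the tester whether the issue reproduced". Hypothetical signature.
type Reproduces = (enabled: string[]) => boolean;

interface BisectResult {
  culprit: string | undefined; // undefined only for an empty input list
  questions: number; // how often the tester had to answer
}

// Assumption: exactly one experiment in the list triggers the issue.
function bisectExperiments(experiments: string[], reproduces: Reproduces): BisectResult {
  let candidates = experiments.slice();
  let questions = 0;

  while (candidates.length > 1) {
    const half = Math.ceil(candidates.length / 2);
    const firstHalf = candidates.slice(0, half);
    questions++;
    // If the first half reproduces the issue, the culprit is in there.
    // Otherwise it must be in the remaining candidates.
    candidates = reproduces(firstHalf) ? firstHalf : candidates.slice(half);
  }

  return { culprit: candidates[0], questions };
}
```

Because every answer halves the list, ten questions are enough to narrow roughly a thousand candidates (2^10 = 1024) down to one.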
It turned out that on machines with certain Intel GPUs, decoding remote video with the VP8 codec did not work.
The workaround that didn’t work: more bugs
As VP8 turned out to be key in reproducing this issue, we tried changing the video codec to VP9. That worked fine until we tried it on one of the affected computers, which showed the same issue.
Then we tried H264, but that also didn’t work. H264 failed on every Chrome version we could find and try, so that turned out to be something else entirely, affecting everyone regardless of variations or set-up.
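For readers who want to repeat the codec experiment: one common way to switch the preferred codec is to reorder the codec list and pass it to setCodecPreferences on the video transceiver before creating the offer or answer. This is a sketch of that approach and not necessarily how we switched codecs ourselves. The helper name preferCodec and the variable transceiver are made up.

```typescript
// Minimal shape of a codec capability entry; mimeType is e.g. "video/VP9".
interface CodecLike {
  mimeType: string;
}

// Move all codecs with the given mime type to the front and keep the
// relative order of everything else, so the preferred codec is offered first.
function preferCodec<T extends CodecLike>(codecs: T[], mimeType: string): T[] {
  const wanted = mimeType.toLowerCase();
  const preferred = codecs.filter((codec) => codec.mimeType.toLowerCase() === wanted);
  const rest = codecs.filter((codec) => codec.mimeType.toLowerCase() !== wanted);
  return preferred.concat(rest);
}

// Browser usage (assumption: `transceiver` is the video RTCRtpTransceiver):
//   const capabilities = RTCRtpReceiver.getCapabilities("video");
//   if (capabilities) {
//     transceiver.setCodecPreferences(preferCodec(capabilities.codecs, "video/VP9"));
//   }
```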
So, we submitted another bug report for the VP9 and H264 decoding issues with relaying a peer connection.
The resolution
Google disabled the MediaFoundationVP8Decoding experiment and that solved our issue.
It took a lot of time, research, brainstorming, prototyping and trial and error to get there.
Take-aways
We’ve learned a lot along the way. About Chrome variations, bisecting variations and how important it is to keep testing your web app in upcoming Chrome versions. And how even that does not give you a guarantee things won’t break.
This issue also emphasised how important good customer support is, and how important it is to have customers who are willing to help identify issues. We couldn’t have done this without our customers.