Identifying WebRTC bugs in Google Chrome originating from feature experiments

Like the teams behind many web apps, we at WebinarGeek rely heavily on the stability and performance of Google Chrome. Chrome is the dominant browser in our user base and powers our media streaming solution, which is built on WebRTC.

So, we rely on Google Chrome and it is important that things keep working. But what if they don’t? And what if it only happens for a small percentage of your users? And what if you can’t reproduce it? You need some help, some tooling, a lot of time and some luck, but we figured it out.

The issue

Part of our web app relies on functionality within WebRTC that allows a MediaStream to be relayed to another peer. I have a video meeting with person X, and then I relay the video of person X to person Y.

One user experienced an issue with this functionality. The issue is that person Y had no video. Or well, they got the video object, but it had no video frames. So all person Y could see was a black video screen. But they could hear the audio of person X, so that worked fine!

N=1 right? Nope. More users came. Not too many, but they came. Same issue. But when we tried it on our machines, everything worked just fine. So we had to investigate further.

Finding commonalities

Since we couldn’t reproduce the issue, we had to look for commonalities among the affected users. Here is what we found:

  1. All users were on Windows 10.
  2. All users were on Google Chrome 85.0.4183.102 (the latest version at the time) or higher.

Since we were unable to reproduce the issue on Windows 10 ourselves, we looked at the specific Windows 10 versions of the affected users. Thanks to the users who provided this information, we found that all of them were on Windows 10 Version 1903 or higher. Unfortunately, we still had no luck reproducing it.

The rule of exclusion

When we are unable to reproduce an issue, we often start with the rule of exclusion: removing as many factors from the equation as possible until the issue disappears. In this case it had to disappear for our users rather than for us. Thanks to a few willing users, we could run some of these experiments with them and see whether anything made things work. We usually start with a list of assumptions about possible causes and try to prove or disprove each one.

What we tried:

  1. Switching roles, so that person X became person Y and the other way around. That often worked (which also gave us a bit of a workaround, yaay!), even when both of them ran the same browser and OS.
  2. Changing devices. Different webcam? No luck. Different computer? Yes, that often worked.
  3. Running Chrome in incognito mode, as browser extensions can sometimes cause weird behaviour. No luck.
  4. Running the Canary and Beta versions of Chrome. No luck: if this was a browser issue, it was also present in upcoming versions.
  5. Running a different Chromium-based browser, for instance Microsoft Edge. That worked in all cases and gave us another viable workaround.

But Microsoft Edge and Google Chrome are both built on Chromium, so how can they differ when it comes to such core functionality?

Is it our code, or is it the browser code?

Even though it made no sense at all, it could still be an issue in our code. With WebRTC you rely on a long list of events that have to happen in a specific order for everything to work. Maybe changing the order somewhere would cause this issue? No luck. We analysed and compared detailed logs of all WebRTC connections and couldn’t find anything strange or different.

Our first goal was to replicate this specific functionality (relaying a MediaStream) outside of our application, so we can rule out anything within the application.

I often first go and look at the official GitHub samples for WebRTC, as there may be a sample that does exactly what our application does. And I found it, an example called Multiple relay.

I shared the example with our users who experienced the issue and all of them confirmed that by clicking on “Insert relay” the screen on the right turned black, whereas if we ran the sample ourselves it just showed our own camera.

Camera image on the left, but no video frames on the right. The essence of the problem.

So that gave us enough confidence to say the issue was not in our application, but an issue in the browser. An issue in Google Chrome.

Browsing code commits in Chromium

Thanks to Git we have great tools like bisect, and thanks to this great outline on bisecting browser bugs we could try to narrow the issue down to a specific code commit. However, we failed. That was more or less what we expected, as we already knew that many users on the exact same browser version did not experience the issue. Bisecting also requires being able to reproduce the issue, which we still couldn’t do at this point.

I did scan through all the code changes between Chrome M85 builds .83 and .102 (.83 being the last version known not to have this problem) and stumbled upon this change, which looked like what I was searching for: a change that would stop sending frames if the source was “tainted” (something to do with cross-origin content).

But still, it didn’t sit right: why would it occur for some users and not for all?

Could a Chrome variation be the culprit?

When I’m really lost in the world of WebRTC I send a message to Philipp Hancke, a true WebRTC guru. He told me to check out the Chrome variations.

Variations?

Chrome Variations are a way for Google to try out new features in the real world to assess their usefulness.

To help guide the construction of features that users actually find useful, a subset of users may get a sneak peek at new functionality before it’s launched to the world at large. The field trials that are currently active on your installation of Chrome will be included in all requests sent to Google servers to allow Google to filter logs for only those generated by a given variation of Chrome. This Chrome-Variations header will not contain any personally identifiable information, and will strictly describe the state of the installation of Chrome itself.

The variations active for a given installation are determined by a seed number between 1 and 8192 (13 bits of entropy) which is randomly selected on first run. If you would like to reset your variations seed, run Chrome with the command line flag --reset-variation-state.

Source: the Google Chrome Privacy Whitepaper

Challenge accepted. We asked our affected users to send us the output of the chrome://version?show-internals-cmd page, which lists the variations currently active.

I put them all in a Google Spreadsheet along with a list of variations on machines which were unaffected. Surprisingly, after checking all available variations there was one variation which was present on all affected machines and on none of the unaffected machines: 15e323ed-a79d803f.
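The spreadsheet comparison boils down to a set operation: keep what every affected machine has in common, then drop anything that also shows up on a healthy machine. A minimal sketch in Python, assuming each machine's variations are available as a list of ID strings. The function name and every ID except 15e323ed-a79d803f are made up for illustration.

```python
def suspect_variations(affected, unaffected):
    """Return the variation IDs present on every affected machine
    and on none of the unaffected machines.

    `affected` and `unaffected` are lists of per-machine ID lists
    (a hypothetical input format, not something Chrome exports as-is).
    """
    if not affected:
        return set()
    # IDs shared by all affected machines.
    common = set.intersection(*(set(ids) for ids in affected))
    # IDs seen on at least one healthy machine cannot be the culprit.
    seen_on_healthy = set().union(*(set(ids) for ids in unaffected))
    return common - seen_on_healthy


# Illustrative data: only the first ID comes from the article.
affected = [
    ["15e323ed-a79d803f", "aaaa0001-00000001", "aaaa0002-00000002"],
    ["15e323ed-a79d803f", "aaaa0001-00000001"],
]
unaffected = [
    ["aaaa0001-00000001", "aaaa0003-00000003"],
]
suspects = suspect_variations(affected, unaffected)
```

The more unaffected machines are included, the more shared-but-harmless variations get filtered out, which is why collecting lists from healthy machines matters as much as collecting them from broken ones.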

Unfortunately, these variations are not listed in any public database, so we had no clue what this variation was. But we had a strong suspicion that this was the cause we had been looking for.

Also unfortunately, a web app has no way to opt out of a variation.

To further strengthen this theory, we ran Chrome with the --reset-variation-state flag on one of the affected machines and... the issue was gone! That also gave us another workaround.

Filing a bug report

With all the information in hand, it was time to submit a bug report, hoping Google would soon abandon the experiment that was breaking this peer relay for some users.

Bisect variations

Unfortunately, Google engineers thought that this variation was not the cause and that something else was. To find out which variation was responsible, we had to run a bisect_variations Python script, which is a fun exercise!

The script basically opens up a session of Google Chrome and upon exit asks you whether you could reproduce the issue or not. Based on these responses it will drill down the list of experiments to one experiment.
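The idea behind this kind of bisection can be sketched in a few lines of Python. This is not the actual bisect_variations script, only an illustration of the halving strategy, and it assumes exactly one experiment is responsible. The `reproduces` callback stands in for the manual step of launching the browser with a subset of experiments and answering whether the bug appeared. All names are hypothetical.

```python
def bisect_experiments(experiments, reproduces):
    """Narrow a list of experiments down to the one that triggers the bug.

    `reproduces(subset)` must return True if the bug shows up when only
    `subset` is enabled. Assumes a single experiment is responsible.
    Returns None if the bug does not reproduce with everything enabled.
    """
    candidates = list(experiments)
    if not candidates or not reproduces(candidates):
        return None
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still shows the bug.
        candidates = first if reproduces(first) else second
    return candidates[0]


# Simulated run: the "user" answers yes whenever the culprit is enabled.
questions = []

def simulated_answer(subset):
    questions.append(len(subset))
    return "MediaFoundationVP8Decoding" in subset

experiments = [f"HypotheticalExperiment{i}" for i in range(1000)]
experiments.append("MediaFoundationVP8Decoding")
culprit = bisect_experiments(experiments, simulated_answer)
```

Because each answer halves the remaining candidates, the number of questions grows with the logarithm of the list size, so even a list of around a thousand experiments needs only on the order of ten answers.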

After answering 10 or so questions, the result was MediaFoundationVP8Decoding: the only variation that let us reproduce the video without frames.

It turned out that on some Intel GPUs, decoding remote video with the VP8 codec did not work.

Bisecting variations will give you a folder of command line parameters that can be either ruled out as the problem or identified as the problem. The final output of the command is the variation that reproduces the issue.

The workaround that didn’t work: more bugs

As VP8 turned out to be key to reproducing this issue, we tried changing the video codec to VP9. That worked fine until we tried it on one of the affected computers, which showed the same issue.
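How the codec was switched is not described here. One common way to express a codec preference is to reorder the payload types on the m=video line of the SDP before it is applied, so that the preferred codec comes first. In a browser this is usually done in JavaScript, or through RTCRtpTransceiver.setCodecPreferences where available. The Python below only illustrates the string transformation, and the function name and sample SDP are made up.

```python
import re


def prefer_video_codec(sdp, codec):
    """Move the payload types of `codec` to the front of the first m=video line.

    Returns the SDP unchanged if there is no video section or the codec
    is not offered in it.
    """
    lines = sdp.split("\r\n")
    m_index = None
    in_video = False
    wanted = []
    for i, line in enumerate(lines):
        if line.startswith("m="):
            # Only inspect the first video section.
            in_video = line.startswith("m=video") and m_index is None
            if in_video:
                m_index = i
            continue
        if in_video:
            match = re.match(r"a=rtpmap:(\d+) ([^/]+)/", line)
            if match and match.group(2).upper() == codec.upper():
                wanted.append(match.group(1))
    if m_index is None:
        return sdp
    parts = lines[m_index].split(" ")
    header, payloads = parts[:3], parts[3:]
    preferred = [p for p in wanted if p in payloads]
    if not preferred:
        return sdp
    rest = [p for p in payloads if p not in preferred]
    lines[m_index] = " ".join(header + preferred + rest)
    return "\r\n".join(lines)


# Made-up, heavily shortened SDP for illustration.
sample_sdp = "\r\n".join([
    "v=0",
    "m=audio 9 UDP/TLS/RTP/SAVPF 111",
    "a=rtpmap:111 opus/48000/2",
    "m=video 9 UDP/TLS/RTP/SAVPF 96 97 98",
    "a=rtpmap:96 VP8/90000",
    "a=rtpmap:97 rtx/90000",
    "a=rtpmap:98 VP9/90000",
    "",
])
vp9_first = prefer_video_codec(sample_sdp, "VP9")
```

Reordering only expresses a preference. Whether the other side actually decodes that codec correctly is a separate question.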

Then we tried H264, but that didn’t work either. H264 failed on every Chrome version we could find and try, so that turned out to be something else entirely, affecting everyone regardless of variations or set-up.

So, we submitted another bug report for the VP9 and H264 decoding issues with relaying a peer connection.

The resolution

Google disabled the MediaFoundationVP8Decoding experiment and that solved our issue.

It took a lot of time, research, brainstorming, prototyping and trial and error to get there.

Take-aways

We’ve learned a lot along the way. About Chrome variations, bisecting variations and how important it is to keep testing your web app in upcoming Chrome versions. And how even that does not give you a guarantee things won’t break.

This issue also emphasised how important good customer support is, and how important it is to have customers who are willing to help identify issues. We couldn’t have done this without our customers.
