Audio Stream Category

Authors: Sam Dallstream, Greg Whitworth, Rahul Singh

Status of this Document

This document is intended as a starting point for engaging the community and standards bodies in developing collaborative solutions fit for standardization. As the solutions to problems described in this document progress along the standards-track, we will retain this document as an archive and use this section to keep the community up-to-date with the most current standards venue and content location of future work and discussions.

Introduction

Audio Stream Category is a proposed addition to the mst-content-hint specification that would allow websites to set a contentHint on a MediaStreamTrack indicating that the track is intended for speech recognition by a machine.

The contentHint we are proposing is speechRecognition.

Background

We believe there is a general need to differentiate between streams intended for human consumption and streams meant for transcription by a machine, because the optimizations applied in each scenario differ substantially. Requirements for communication between humans are given in the ETSI TS 126 131 specification and include noise-suppression optimizations, such as the addition of pink noise to increase user satisfaction, that are in direct opposition to the needs of a speech recognition system. A draft of testing methods for speech recognition systems, STQ63-260v0210, outlines some of the differing requirements for those systems.

The proposed solution below was inspired by the categories that Windows offers for audio streams. These categories let you specify what kind of audio stream you want (e.g., “speech” when someone is dictating into a microphone), which gives the operating system a chance to optimize the stream for that type of input. After some research, we found that similar categories exist on Android and iOS as well as Windows.

Proposed Solution

We plan to follow the lead of native applications across Android, iOS, and Windows, and extend the list of content hints that the developer can choose from when working with a stream. We will adapt this to the web by modifying the mst-content-hint API. On operating systems, such as macOS, that do not have one-to-one mappings for these categories, the categories will be applied on a best-effort basis.
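
As an illustration of how a developer might choose from the extended list of hints, the sketch below maps the intended consumer of an audio stream to a contentHint value. The purpose names (call, music, transcription) and the hintForPurpose helper are hypothetical and not part of any specification. The values speech and music are existing audio hints in mst-content-hint, and speechRecognition is the value proposed in this document.

```javascript
// Hypothetical mapping from an app-level purpose to a contentHint value.
// 'speech' and 'music' are existing mst-content-hint audio hints;
// 'speechRecognition' is the value proposed in this explainer.
const HINT_BY_PURPOSE = new Map([
  ['call', 'speech'],                     // audio heard by a human listener
  ['music', 'music'],                     // non-speech audio
  ['transcription', 'speechRecognition'], // audio consumed by a machine
]);

// Returns the hint for a purpose, or '' (no hint) when the purpose is unknown.
function hintForPurpose(purpose) {
  return HINT_BY_PURPOSE.has(purpose) ? HINT_BY_PURPOSE.get(purpose) : '';
}
```

An application could then write track.contentHint = hintForPurpose('transcription') on an audio track, leaving the hint empty for any purpose it does not recognize.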

Proposed API

Add the speechRecognition option to contentHint for audio tracks.

IDL

Extension to MediaStreamTrack

partial interface MediaStreamTrack {
  attribute DOMString contentHint;
};
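
Because contentHint is a plain DOMString, a page cannot tell from the type alone whether a user agent recognizes the new value. The following is a minimal sketch of one way to detect support. It assumes that a user agent leaves contentHint unchanged when it is assigned a value it does not recognize for the track's kind, and the applyContentHint helper name is hypothetical.

```javascript
// Hypothetical helper: tries to apply a hint and reports whether it took effect.
// Assumption: an unrecognized hint is ignored by the setter, so reading the
// attribute back reveals whether the value was accepted.
function applyContentHint(track, hint) {
  if (!('contentHint' in track)) {
    return false; // contentHint is not implemented on this track at all
  }
  const previous = track.contentHint;
  track.contentHint = hint;
  if (track.contentHint === hint) {
    return true; // the user agent accepted the hint
  }
  // Restore the earlier value in case the setter behaved unexpectedly.
  track.contentHint = previous;
  return false;
}
```

A caller might use the return value to decide whether to fall back to its own audio processing path when speechRecognition is not supported.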

Examples

Example 1: Get an audio stream and set its contentHint to “speechRecognition”

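// Audio constraints used when requesting the microphone stream.
const constraints = { volume: 1 };

navigator.mediaDevices.getUserMedia({ audio: constraints })
  .then(handleMediaStreamAcquired, handleMediaStreamAcquiredError);

function handleMediaStreamAcquired(mediaStream) {
  // Mark the first audio track as intended for machine speech recognition.
  const [audioTrack] = mediaStream.getAudioTracks();
  audioTrack.contentHint = 'speechRecognition';
}

function handleMediaStreamAcquiredError(mediaStreamError) {
  // Surface permission and device errors to the developer console.
  console.error(mediaStreamError);
}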