Use the Web Audio API to stream multi-channel audio to Amazon Transcribe

Multi-channel streaming transcription is a feature of Amazon Transcribe that can be used from a web browser. Creating the stream source has its challenges, but with the JavaScript Web Audio API you can connect and combine a variety of audio sources, such as videos, audio files, or hardware like microphones, and obtain transcripts.

In this post, I'll guide you through how to use two microphones as audio sources, merge them into a single dual-channel audio stream, perform the required encoding, and stream it to Amazon Transcribe. The post provides the source code for a Vue.js application that requires two microphones connected to the browser. However, the versatility of this approach goes far beyond this use case: it can be adapted to a wide range of devices and audio sources.

Using this approach, you can get transcripts for two sources within a single Amazon Transcribe session, offering cost savings and other benefits compared to using a separate session for each source.

Issues with two microphones

For this use case, you could send a single-channel stream covering both microphones and rely on the speaker labels that Amazon Transcribe assigns to identify who said what, but there are some considerations:

  • Speaker labels are randomly assigned at the start of the session. This means that after the stream has started, the application must map the results to the actual speakers.
  • Speakers with similar voice tones can be labeled incorrectly.
  • If two speakers talk at the same time over a single audio source, audio duplication may occur.

By using two microphones as separate audio sources, these concerns can be addressed, because each transcription is tied to a fixed input source. By assigning devices to speakers, the application knows in advance which transcript belongs to which speaker. However, if two nearby microphones pick up multiple voices, the audio may still be duplicated. This can be mitigated by using directional microphones, managing volume levels, and checking the word-level confidence scores that Amazon Transcribe returns.

Solution overview

The following diagram illustrates the solution workflow.

Two-microphone application diagram

The solution uses two audio inputs with the Web Audio API. This API allows you to merge the two inputs, microphone A and microphone B, into a single audio source in which the left channel carries microphone A and the right channel carries microphone B.

Next, this merged audio source is converted to PCM (pulse-code modulation) audio. PCM is a common audio format and is one of the formats Amazon Transcribe accepts as streaming input. Finally, the PCM audio is streamed to Amazon Transcribe for transcription.

Prerequisites

To follow along, you need two microphones connected to your computer, the demo application source code, and AWS credentials that are allowed to call Amazon Transcribe streaming, for example through the following IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DemoWebAudioAmazonTranscribe",
      "Effect": "Allow",
      "Action": "transcribe:StartStreamTranscriptionWebSocket",
      "Resource": "*"
    }
  ]
}

Start the application

Complete the following steps to launch the application:

  1. Navigate to the root directory where you downloaded the code.
  2. Create a .env file and set your AWS access keys in it, using the env.sample file as a template (see the hypothetical example after this list).
  3. Install the dependencies with bun install (if you use npm, run npm install instead).
  4. Start the development web server with bun dev (if you use npm, run npm run dev instead).
  5. Open http://localhost:5173/ in your browser.

    Application running on http://localhost:5173 using two connected microphones
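The demo reads its configuration from the .env file. The variable names below are hypothetical and shown only to illustrate the kind of values required; use the names defined in the repository's env.sample file, and never commit real credentials:

# Hypothetical example only - copy env.sample and keep the variable names it defines
VITE_AWS_REGION=us-east-1
VITE_AWS_ACCESS_KEY_ID=<your access key ID>
VITE_AWS_SECRET_ACCESS_KEY=<your secret access key>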

Code Walkthrough

In this section, we examine the key pieces of code in the implementation.

  1. The first step is to list connected microphones using the browser API navigator.mediaDevices.enumerateDevices():
const devices = await navigator.mediaDevices.enumerateDevices()
return devices.filter((d) => d.kind === 'audioinput')
  2. Next, you need a MediaStream object for each connected microphone. You can get these with navigator.mediaDevices.getUserMedia(), which prompts the user for access to their media devices (such as cameras and microphones) and returns a MediaStream carrying the audio or video data from the selected device:
const streams = []

// Request one audio stream per connected microphone
// (audioInputDevices is the filtered device list from the previous step)
for (const device of audioInputDevices) {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      deviceId: device.deviceId,
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
    },
  })

  if (stream) streams.push(stream)
}
  3. To combine the audio from the two microphones, create an AudioContext interface for audio processing. Within this AudioContext, a ChannelMergerNode can be used to merge the audio streams from the different microphones. The connect(destination, src_idx, ch_idx) method arguments are:
    • destination – The destination, in our case the mergerNode.
    • src_idx – The index of the source channel, in our case 0 for both connections (each microphone provides a single-channel audio stream).
    • ch_idx – The index of the destination channel, in our case 0 and 1 respectively, which creates the stereo output.
// instance of audioContext
const audioContext = new AudioContext({
       sampleRate: SAMPLE_RATE,
})
// this is used to process the microphone stream data
const audioWorkletNode = new AudioWorkletNode(audioContext, 'recording-processor', {...})
// microphone A
const audioSourceA = audioContext.createMediaStreamSource(mediaStreams[0]);
// microphone B
const audioSourceB = audioContext.createMediaStreamSource(mediaStreams[1]);
// audio node for two inputs
const mergerNode = audioContext.createChannelMerger(2);
// connect the audio sources to the mergerNode destination.  
audioSourceA.connect(mergerNode, 0, 0);
audioSourceB.connect(mergerNode, 0, 1);
// connect our mergerNode to the AudioWorkletNode
mergerNode.connect(audioWorkletNode);
  4. Microphone data is processed in an audio worklet that emits data messages every defined number of recorded frames. These messages contain the audio data encoded in PCM format, which is what gets sent to Amazon Transcribe. The p-event library can be used to consume the worklet's message events as an async iterator. A more detailed description of this worklet can be found in the next section of this post.
import { pEventIterator } from 'p-event'
...

// Register the worklet
try {
  await audioContext.audioWorklet.addModule('./worklets/recording-processor.js')
} catch (e) {
  console.error('Failed to load audio worklet')
}

//  An async iterator 
const audioDataIterator = pEventIterator<'message', MessageEvent>(
  audioWorkletNode.port,
  'message',
)
...

// AsyncIterableIterator: Every time the worklet emits an event with the message `SHARE_RECORDING_BUFFER`, this iterator will return the AudioEvent object that we need.
const getAudioStream = async function* (
  audioDataIterator: AsyncIterableIterator<MessageEvent>,
) {
  for await (const chunk of audioDataIterator) {
    if (chunk.data.message === 'SHARE_RECORDING_BUFFER') {
      const { audioData } = chunk.data
      yield {
        AudioEvent: {
          AudioChunk: audioData,
        },
      }
    }
  }
}
  5. To start streaming data to Amazon Transcribe, use the iterator you just built and set NumberOfChannels: 2 and EnableChannelIdentification: true to enable dual-channel transcription. For more information, see the AWS SDK StartStreamTranscriptionCommand documentation.
import {
  LanguageCode,
  MediaEncoding,
  StartStreamTranscriptionCommand,
} from '@aws-sdk/client-transcribe-streaming'

const command = new StartStreamTranscriptionCommand({
    LanguageCode: LanguageCode.EN_US,
    MediaEncoding: MediaEncoding.PCM,
    MediaSampleRateHertz: SAMPLE_RATE,
    NumberOfChannels: 2,
    EnableChannelIdentification: true,
    ShowSpeakerLabel: true,
    AudioStream: getAudioStream(audioIterator),
  })
  6. After you send the request, a WebSocket connection is created to exchange audio stream data, and Amazon Transcribe returns transcription results:
const data = await client.send(command)
for await (const event of data.TranscriptResultStream) {
    for (const result of event.TranscriptEvent.Transcript.Results || []) {
        callback({ ...result })
    }
}

The result object contains a ChannelId property that you can use to identify the microphone source, ch_0 and ch_1 respectively.
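Because each microphone is wired to a fixed channel, the application can translate those channel IDs into speakers it already knows about. The following is a minimal sketch; the speaker names and callback shape are illustrative and not part of the demo code:

// Illustrative mapping from Amazon Transcribe channel IDs to known speakers
const speakerByChannel: Record<string, string> = {
  ch_0: 'Speaker A (microphone A)',
  ch_1: 'Speaker B (microphone B)',
}

// Example callback consuming the result objects shown above
const callback = (result: { ChannelId?: string; Alternatives?: { Transcript?: string }[] }) => {
  const speaker = speakerByChannel[result.ChannelId ?? ''] ?? 'Unknown speaker'
  const transcript = result.Alternatives?.[0]?.Transcript ?? ''
  console.log(`${speaker}: ${transcript}`)
}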

Deep Dive: Audio Worklet

Audio worklets run on a separate thread to provide very low-latency audio processing. The source code for this part of the demo implementation is in the public/worklets/recording-processor.js file.

In our case, the worklet performs two main tasks:

  1. Process the mergerNode audio iteratively. This node carries both audio channels and is the input to the worklet.
  2. Encode the mergerNode data bytes as PCM signed 16-bit little-endian audio. Do this on each iteration, or whenever a message payload needs to be emitted to the application.

The general code structure that implements this is:

class RecordingProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super()
  }
  process(inputs, outputs) {...}
}

registerProcessor('recording-processor', RecordingProcessor)

Custom options can be passed to the worklet instance through the processorOptions attribute. In the demo, maxFrameCount: (SAMPLE_RATE * 4) / 10 is set as a guide for how much audio to buffer (about 400 ms) before firing a new message payload. For example, the message is:

this.port.postMessage({
  message: 'SHARE_RECORDING_BUFFER',
  buffer: this._recordingBuffer,
  recordingLength: this.recordedFrames,
  audioData: new Uint8Array(pcmEncodeArray(this._recordingBuffer)), // PCM encoded audio format
})
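Putting these pieces together, the following is a simplified sketch of what the process() method can look like. It is not the exact demo implementation: it assumes maxFrameCount is a multiple of the 128-frame render quantum and that a pcmEncodeArray helper (shown in the next section) is defined in the same worklet file:

// Simplified sketch, not the exact demo code from recording-processor.js
class RecordingProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super()
    // maxFrameCount is passed through processorOptions; assumed to be a multiple of 128
    this.maxFrameCount = options.processorOptions.maxFrameCount
    // Two channels: left = microphone A, right = microphone B
    this._recordingBuffer = [
      new Float32Array(this.maxFrameCount),
      new Float32Array(this.maxFrameCount),
    ]
    this.recordedFrames = 0
  }

  process(inputs) {
    const input = inputs[0] // output of the mergerNode: [left, right]
    if (!input || input.length < 2) return true

    // Copy the current 128-frame render quantum into the recording buffer
    this._recordingBuffer[0].set(input[0], this.recordedFrames)
    this._recordingBuffer[1].set(input[1], this.recordedFrames)
    this.recordedFrames += input[0].length

    // Once enough frames are buffered, emit the PCM-encoded payload and reset
    if (this.recordedFrames >= this.maxFrameCount) {
      this.port.postMessage({
        message: 'SHARE_RECORDING_BUFFER',
        buffer: this._recordingBuffer,
        recordingLength: this.recordedFrames,
        audioData: new Uint8Array(pcmEncodeArray(this._recordingBuffer)), // helper from the next section
      })
      this.recordedFrames = 0
    }
    return true // keep the processor alive
  }
}

registerProcessor('recording-processor', RecordingProcessor)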

PCM encoding for two channels

One of the most important pieces is how to encode PCM for two channels. Following the Amazon Transcribe API Reference, the size of an AudioChunk is defined as: Duration (s) * Sample Rate (Hz) * Number of Channels * 2. For two channels at 16,000 Hz, 1 second of audio is 1 * 16000 * 2 * 2 = 64,000 bytes. Our encoding function looks like this:

// Notice that input is an array, where each element is a channel with Float32 values between -1.0 and 1.0 from the AudioWorkletProcessor.
const pcmEncodeArray = (input: Float32Array[]) => {
  const numChannels = input.length
  const numSamples = input[0].length
  const bufferLength = numChannels * numSamples * 2 // 2 bytes per sample per channel
  const buffer = new ArrayBuffer(bufferLength)
  const view = new DataView(buffer)

  let index = 0

  for (let i = 0; i < numSamples; i++) {
    // Encode for each channel
    for (let channel = 0; channel < numChannels; channel++) {
      const s = Math.max(-1, Math.min(1, input[channel][i]))
      // Convert the 32 bit float to 16 bit PCM audio waveform samples.
      // Max value: 32767 (0x7FFF), Min value: -32768 (-0x8000) 
      view.setInt16(index, s < 0 ? s * 0x8000 : s * 0x7fff, true)
      index += 2
    }
  }
  return buffer
}
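As a quick check of the size calculation above, encoding 1 second of two-channel audio at 16,000 Hz with this function produces 64,000 bytes:

// 1 second of (silent) audio per channel at 16 kHz
const oneSecondStereo = [new Float32Array(16000), new Float32Array(16000)]
const encoded = pcmEncodeArray(oneSecondStereo)
console.log(encoded.byteLength) // 64000 = 1 * 16000 * 2 * 2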

See the AudioWorkletProcessor process() method documentation for details on how the audio data blocks are handled. For more information about PCM format encoding, see the Multimedia Programming Interface and Data Specification 1.0.

Conclusion

In this post, we explored the implementation details of a web application that performs real-time dual-channel transcription using the browser's Web Audio API and Amazon Transcribe streaming. Using a combination of AudioContext, ChannelMergerNode, and AudioWorklet, we processed and encoded the audio data from two microphones and sent it to Amazon Transcribe for transcription. The use of AudioWorklet in particular provides low-latency audio processing, delivering a smooth and responsive user experience.

Based on this demo, you can create more sophisticated real-time transcription applications that cater to a wide range of use cases, from recordings to voice-controlled interfaces.

Try the solution yourself and leave feedback in the comments.


About the author

Jorge Lanzarotti is a Sr. Prototyping SA at Amazon Web Services (AWS) based in Tokyo, Japan. He helps public sector customers by creating innovative solutions to challenging problems.


