Fundamentals

This chapter will describe how to get started with the Web Audio API: which browsers support it, how to detect whether the API is available, what an audio graph is, what audio nodes are, how to connect nodes together, some basic node types, and finally, how to load sound files and play back sounds.

A Brief History of Audio on the Web

The first way of playing back sounds on the web was via the <bgsound> tag, which let website authors automatically play background music when a visitor opened their pages. This feature was only available in Internet Explorer, and was never standardized or picked up by other browsers. Netscape implemented a similar feature with the <embed> tag, providing basically equivalent functionality.

Flash was the first cross-browser way of playing back audio on the Web, but it had the significant drawback of requiring a plug-in to run. More recently, browser vendors have rallied around the HTML5 <audio> element, which provides native support for audio playback in all modern browsers.

Although audio on the Web no longer requires a plug-in, the <audio> tag has significant limitations for implementing sophisticated games and interactive applications. Among its shortcomings: it provides no precise timing controls, imposes a very low limit on the number of sounds that can play at once, offers no reliable way to pre-buffer a sound, and has no facilities for applying real-time effects or analyzing audio.

There have been several attempts to create a powerful audio API on the Web to address some of the limitations I previously described. One notable example is the Audio Data API that was designed and prototyped in Mozilla Firefox. Mozilla’s approach started with an <audio> element and extended its JavaScript API with additional features. This API has a limited audio graph (more on this later in The Audio Context), and hasn’t been adopted beyond its first implementation. It is currently deprecated in Firefox in favor of the Web Audio API.

In contrast with the Audio Data API, the Web Audio API is a brand new model, completely separate from the <audio> tag, although there are integration points with other web APIs (see Integrating with Other Technologies). It is a high-level JavaScript API for processing and synthesizing audio in web applications. The goal of this API is to include capabilities found in modern game engines and some of the mixing, processing, and filtering tasks that are found in modern desktop audio production applications. The result is a versatile API that can be used in a variety of audio-related tasks, from games, to interactive applications, to very advanced music synthesis applications and visualizations.

Games and Interactivity

Audio is a huge part of what makes interactive experiences so compelling. If you don’t believe me, try watching a movie with the volume muted.

Games are no exception! My fondest video game memories are of the music and sound effects. Now, nearly two decades after the release of some of my favorites, I still can’t get Koji Kondo’s Zelda and Matt Uelmen’s Diablo soundtracks out of my head. Even the sound effects from these masterfully-designed games are instantly recognizable, from the unit click responses in Blizzard’s Warcraft and Starcraft series to samples from Nintendo’s classics.

Sound effects matter a great deal outside of games, too. They have been around in user interfaces (UIs) since the days of the command line, where certain kinds of errors would result in an audible beep. The same idea continues through modern UIs, where well-placed sounds are critical for notifications, chimes, and of course audio and video communication applications like Skype. Assistant software such as Google Now and Siri provides rich, audio-based feedback. As we delve further into a world of ubiquitous computing, speech- and gesture-based interfaces that lend themselves to screen-free interactions are increasingly reliant on audio feedback. Finally, for visually impaired computer users, audio cues, speech synthesis, and speech recognition are critically important to create a usable experience.

Interactive audio presents some interesting challenges. To create convincing in-game music, designers need to adjust to all the potentially unpredictable game states a player can find herself in. In practice, sections of the game can go on for an unknown duration, and sounds can interact with the environment and mix in complex ways, requiring environment-specific effects and relative sound positioning. Finally, there can be a large number of sounds playing at once, all of which need to sound good together and render without introducing quality and performance penalties.

The Audio Context

The Web Audio API is built around the concept of an audio context. The audio context is a directed graph of audio nodes that defines how the audio stream flows from its source (often an audio file) to its destination (often your speakers). As audio passes through each node, its properties can be modified or inspected. The simplest audio context is a connection directly from a source node to a destination node (Figure 1-1).

Figure 1-1. The simplest audio context

An audio context can be complex, containing many nodes between the source and destination (Figure 1-2) to perform arbitrarily advanced synthesis or analysis.

Figures 1-1 and 1-2 show audio nodes as blocks. The arrows represent connections between nodes. Nodes can often have multiple incoming and outgoing connections. By default, if there are multiple incoming connections into a node, the Web Audio API simply blends the incoming audio signals together.

The concept of an audio node graph is not new, and derives from popular audio frameworks such as Apple’s CoreAudio, which has an analogous Audio Processing Graph API. The idea itself is even older, originating in the 1960s with early audio environments like Moog modular synthesizer systems.

Figure 1-2. A more complex audio context

Initializing an Audio Context

The Web Audio API is currently implemented by the Chrome and Safari browsers (including MobileSafari as of iOS 6) and is available for web developers via JavaScript. In these browsers, the audio context constructor is webkit-prefixed, meaning that instead of creating a new AudioContext, you create a new webkitAudioContext. However, this will surely change in the future as the API stabilizes enough to ship un-prefixed and as other browser vendors implement it. Mozilla has publicly stated that they are implementing the Web Audio API in Firefox, and Opera has started participating in the working group.

With this in mind, here is a liberal way of initializing your audio context that would include other implementations (once they exist):

var contextClass = (window.AudioContext || 
  window.webkitAudioContext || 
  window.mozAudioContext || 
  window.oAudioContext || 
  window.msAudioContext);
if (contextClass) {
  // Web Audio API is available.
  var context = new contextClass();
} else {
  // Web Audio API is not available. Ask the user to use a supported browser.
}

A single audio context can support multiple sound inputs and complex audio graphs, so generally speaking, we will only need one for each audio application we create. The audio context instance includes many methods for creating audio nodes and manipulating global audio preferences. Luckily, these methods are not webkit-prefixed and are relatively stable. The API is still changing, though, so be aware of breaking changes (see Deprecation Notes).
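As a small illustration (using the context variable created above; the gain node here is just an arbitrary example of a factory method), the same instance both creates nodes and exposes global state such as the sample rate and the running context time:

// One context is enough: it acts as a factory for audio nodes...
var gainNode = context.createGain();
// ...and exposes global audio state.
console.log('Sample rate: ' + context.sampleRate + ' Hz');
console.log('Context time: ' + context.currentTime + ' seconds');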

Types of Web Audio Nodes

One of the main uses of audio contexts is to create new audio nodes. Broadly speaking, there are several kinds of audio nodes:

Source nodes

Sound sources such as audio buffers, live audio inputs, <audio> tags, oscillators, and JS processors

Modification nodes

Filters, convolvers, panners, JS processors, etc.

Analysis nodes

Analyzers and JS processors

Destination nodes

Audio outputs and offline processing buffers

Sources need not be based on sound files, but can instead be real-time input from a live instrument or microphone, redirection of the audio output from an audio element (see Setting Up Background Music with the <audio> Tag), or entirely synthesized sound (see Audio Processing with JavaScript). Though the final destination node is often the speakers, you can also process without sound playback (for example, if you want to do pure visualization) or do offline processing, which results in the audio stream being written to a destination buffer for later use.
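To make these categories concrete, here is a rough sketch (using the context created earlier; the oscillator pitch and filter type are arbitrary choices for illustration) that creates one node of each kind and chains them together:

// Source node: an oscillator (a buffer or live input would work equally well).
var oscillator = context.createOscillator();
oscillator.frequency.value = 440; // arbitrary pitch, in Hz

// Modification node: a low-pass filter.
var filter = context.createBiquadFilter();
filter.type = 'lowpass';

// Analysis node: an analyser we could later poll for visualization data.
var analyser = context.createAnalyser();

// Destination node: the context's default output (usually the speakers).
oscillator.connect(filter);
filter.connect(analyser);
analyser.connect(context.destination);
oscillator.start(0);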

Connecting the Audio Graph

Any audio node’s output can be connected to any other audio node’s input by using the connect() function. In the following example, we connect a source node’s output into a gain node, and connect the gain node’s output into the context’s destination:

// Create the source.
var source = context.createBufferSource();
// Create the gain node.
var gain = context.createGain();
// Connect the source to the gain node, and the gain node to the destination.
source.connect(gain);
gain.connect(context.destination);

Note that context.destination is a special node that is associated with the default audio output of your system. The resulting audio graph of the previous code looks like Figure 1-3.

Figure 1-3. Our first audio graph

Once we have connected up a graph like this, we can dynamically change it. We can disconnect audio nodes from the graph by calling node.disconnect(outputNumber). For example, to reroute the graph so that the source connects directly to the destination, bypassing the intermediate gain node, we can do the following:

source.disconnect(0);
gain.disconnect(0);
source.connect(context.destination);

Power of Modular Routing

In many games, multiple sources of sound are combined to create the final mix. Sources include background music, game sound effects, UI feedback sounds, and in a multiplayer setting, voice chat from other players. An important feature of the Web Audio API is that it lets you separate all of these different channels and gives you full control over each one, or all of them together. The audio graph for such a setup might look like Figure 1-4.

Figure 1-4. Multiple sources with individual gain control as well as a master gain

We have associated a separate gain node with each of the channels and also created a master gain node to control them all. With this setup, it is easy for your players to control the level of each channel separately, precisely the way they want to. For example, many people prefer to play games with the background music turned off.
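A sketch of such a setup might look like the following (musicSource and effectsSource stand in for source nodes you have already created, and the gain values are arbitrary):

// Per-channel gain nodes, plus a master gain node that everything passes through.
var musicGain = context.createGain();
var effectsGain = context.createGain();
var masterGain = context.createGain();

// Route each source through its own channel gain, then through the master gain.
musicSource.connect(musicGain);
effectsSource.connect(effectsGain);
musicGain.connect(masterGain);
effectsGain.connect(masterGain);
masterGain.connect(context.destination);

// For example, mute the background music while leaving everything else audible.
musicGain.gain.value = 0;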

Loading and Playing Sounds

The Web Audio API makes a clear distinction between buffers and source nodes. The idea of this architecture is to decouple the audio asset from the playback state. Taking a record player analogy, buffers are like records and sources are like playheads, except in the Web Audio API world, you can play the same record on any number of playheads simultaneously! Because many applications involve multiple versions of the same buffer playing simultaneously, this pattern is essential. For example, if you want multiple bouncing ball sounds to fire in quick succession, you need to load the bounce buffer only once and schedule multiple sources of playback (see Multiple Sounds with Variations).
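For example, the bouncing ball case might be sketched as follows (bounceBuffer stands in for a buffer that has already been loaded and decoded; loading is covered next):

// Each playback gets its own source node; the decoded buffer is shared.
function playBounce(when) {
  var source = context.createBufferSource();
  source.buffer = bounceBuffer;
  source.connect(context.destination);
  source.start(when);
}

// Fire three bounces in quick succession.
var now = context.currentTime;
playBounce(now);
playBounce(now + 0.3);
playBounce(now + 0.5);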

To load an audio sample into the Web Audio API, we can use an XMLHttpRequest and process the results with context.decodeAudioData. This all happens asynchronously and doesn’t block the main UI thread:

// The decoded audio data will be stored here once it is ready.
var buffer;

var request = new XMLHttpRequest();
request.open('GET', url, true);
request.responseType = 'arraybuffer';

// Decode asynchronously.
request.onload = function() {
  context.decodeAudioData(request.response, function(theBuffer) {
    buffer = theBuffer;
  }, onError);
};
request.send();

Audio buffers are only one possible source of playback. Other sources include direct input from a microphone or line-in device and the output of an <audio> tag (see Integrating with Other Technologies).

Once you’ve loaded your buffer, you can create a source node (AudioBufferSourceNode) for it, connect the source node into your audio graph, and call start(0) on the source node. To stop a sound, call stop(0) on the source node. Note that both of these function calls require a time in the coordinate system of the current audio context (see Perfect Timing and Latency):

function playSound(buffer) {
  var source = context.createBufferSource();
  source.buffer = buffer;
  source.connect(context.destination);
  source.start(0);
}

Games often have background music playing in a loop. However, be careful about being overly repetitive with your selection: if a player is stuck in an area or level and the same sample continuously plays in the background, it may be worthwhile to gradually fade out the track to prevent frustration. Another strategy is to have mixes of various intensities that gradually crossfade into one another depending on the game situation (see Gradually Varying Audio Parameters).
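As a rough sketch of the first idea (musicBuffer stands in for a decoded buffer, and the three-second fade is an arbitrary choice), a looping track can be routed through a gain node and faded out by ramping its gain parameter:

// Loop a background track through a gain node so it can be faded out later.
var bgGain = context.createGain();
var music = context.createBufferSource();
music.buffer = musicBuffer;
music.loop = true;
music.connect(bgGain);
bgGain.connect(context.destination);
music.start(0);

// Later, fade the track out over three seconds to avoid repetition fatigue.
function fadeOutMusic() {
  var now = context.currentTime;
  bgGain.gain.setValueAtTime(bgGain.gain.value, now);
  bgGain.gain.linearRampToValueAtTime(0, now + 3);
}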

Putting It All Together

As you can see from the previous code listings, there’s a bit of setup to get sounds playing in the Web Audio API. For a real game, consider implementing a JavaScript abstraction around the Web Audio API. An example of this idea is the following BufferLoader class. It puts everything together into a simple loader, which, given a set of paths, returns a set of audio buffers. Here’s how such a class can be used:

window.onload = init;
var context;
var bufferLoader;

function init() {
  context = new webkitAudioContext();

  bufferLoader = new BufferLoader(
    context,
    [
      '../sounds/hyper-reality/br-jam-loop.wav',
      '../sounds/hyper-reality/laughter.wav',
    ],
    finishedLoading
  );

  bufferLoader.load();
}

function finishedLoading(bufferList) {
  // Create two sources and play them both together.
  var source1 = context.createBufferSource();
  var source2 = context.createBufferSource();
  source1.buffer = bufferList[0];
  source2.buffer = bufferList[1];

  source1.connect(context.destination);
  source2.connect(context.destination);
  source1.start(0);
  source2.start(0);
}

For a simple reference implementation of BufferLoader, take a look at http://webaudioapi.com/samples/shared.js.
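If you would rather not fetch that file, a minimal version of the class might be sketched as follows; it simply wraps the XMLHttpRequest-plus-decodeAudioData pattern shown earlier and fires its callback once every requested file has been decoded:

function BufferLoader(context, urlList, callback) {
  this.context = context;
  this.urlList = urlList;
  this.onload = callback;
  this.bufferList = [];
  this.loadCount = 0;
}

BufferLoader.prototype.loadBuffer = function(url, index) {
  var loader = this;
  var request = new XMLHttpRequest();
  request.open('GET', url, true);
  request.responseType = 'arraybuffer';
  request.onload = function() {
    loader.context.decodeAudioData(request.response, function(buffer) {
      loader.bufferList[index] = buffer;
      // Invoke the callback once all requested buffers have decoded.
      if (++loader.loadCount == loader.urlList.length) {
        loader.onload(loader.bufferList);
      }
    }, function(error) {
      console.error('decodeAudioData error', error);
    });
  };
  request.send();
};

BufferLoader.prototype.load = function() {
  for (var i = 0; i < this.urlList.length; ++i) {
    this.loadBuffer(this.urlList[i], i);
  }
};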