Fundamentals

This chapter will describe how to get started with the Web Audio API, which browsers are supported, how to detect if the API is available, what an audio graph is, what audio nodes are, how to connect nodes together, some basic node types, and finally, how to load sound files and playback sounds.

A Brief History of Audio on the Web

The first way of playing back sounds on the web was via the <bgsound> tag, which let website authors automatically play background music when a visitor opened their pages. This feature was only available in Internet Explorer, and was never standardized or picked up by other browsers. Netscape implemented a similar feature with the <embed> tag, providing basically equivalent functionality.

Flash was the first cross-browser way of playing back audio on the Web, but it had the significant drawback of requiring a plug-in to run. More recently, browser vendors have rallied around the HTML5 <audio> element, which provides native support for audio playback in all modern browsers.

Although audio on the Web no longer requires a plug-in, the <audio> tag has significant limitations for implementing sophisticated games and interactive applications. The following are just some of the limitations of the <audio> element:

No precise timing controls
Very low limit for the number of sounds played at once
No way to reliably pre-buffer a sound
No ability to apply real-time effects
No way to analyze sounds

There have been several attempts to create a powerful audio API on the Web to address some of the limitations I previously described. One notable example is the Audio Data API that was designed and prototyped in Mozilla Firefox. Mozilla’s approach started with an <audio> element and extended its JavaScript API with additional features. This API has a limited audio graph (more on this later in The Audio Context), and hasn’t been adopted beyond its first implementation. It is currently deprecated in Firefox in favor of the Web Audio API.

In contrast with the Audio Data API, the Web Audio API is a brand new model, completely separate from the <audio> tag, although there are integration points with other web APIs (see Integrating with Other Technologies). It is a high-level JavaScript API for processing and synthesizing audio in web applications. The goal of this API is to include capabilities found in modern game engines and some of the mixing, processing, and filtering tasks that are found in modern desktop audio production applications. The result is a versatile API that can be used in a variety of audio-related tasks, from games, to interactive applications, to very advanced music synthesis applications and visualizations.

Games and Interactivity

Audio is a huge part of what makes interactive experiences so compelling. If you don’t believe me, try watching a movie with the volume muted.

Games are no exception! My fondest video game memories are of the music and sound effects. Now, nearly two decades after the release of some of my favorites, I still can’t get Koji Kondo’s Zelda and Matt Uelmen’s Diablo soundtracks out of my head. Even the sound effects from these masterfully-designed games are instantly recognizable, from the unit click responses in Blizzard’s Warcraft and Starcraft series to samples from Nintendo’s classics.

Sound effects matter a great deal outside of games, too. They have been around in user interfaces (UIs) since the days of the command line, where certain kinds of errors would result in an audible beep. The same idea continues through modern UIs, where well-placed sounds are critical for notifications, chimes, and of course audio and video communication applications like Skype. Assistant software such as Google Now and Siri provide rich, audio-based feedback. As we delve further into a world of ubiquitous computing, speech- and gesture-based interfaces that lend themselves to screen-free interactions are increasingly reliant on audio feedback. Finally, for visually impaired computer users, audio cues, speech synthesis, and speech recognition are critically important to create a usable experience.

Interactive audio presents some interesting challenges. To create convincing in-game music, designers need to adjust to all the potentially unpredictable game states a player can find herself in. In practice, sections of the game can go on for an unknown duration, and sounds can interact with the environment and mix in complex ways, requiring environment-specific effects and relative sound positioning. Finally, there can be a large number of sounds playing at once, all of which need to sound good together and render without introducing quality and performance penalties.

The Audio Context

The Web Audio API is built around the concept of an audio context. The audio context is a directed graph of audio nodes that defines how the audio stream flows from its source (often an audio file) to its destination (often your speakers). As audio passes through each node, its properties can be modified or inspected. The simplest audio context is a connection directly form a source node to a destination node (Figure 1-1).

An audio context can be complex, containing many nodes between the source and destination (Figure 1-2) to perform arbitrarily advanced synthesis or analysis.

Figures Figure 1-1 and Figure 1-2 show audio nodes as blocks. The arrows represent connections between nodes. Nodes can often have multiple incoming and outgoing connections. By default, if there are multiple incoming connections into a node, the Web Audio API simply blends the incoming audio signals together.

The concept of an audio node graph is not new, and derives from popular audio frameworks such as Apple’s CoreAudio, which has an analogous Audio Processing Graph API. The idea itself is even older, originating in the 1960s with early audio environments like Moog modular synthesizer systems.

Figure 1-2. A more complex audio context

Initializing an Audio Context

The Web Audio API is currently implemented by the Chrome and Safari browsers (including MobileSafari as of iOS 6) and is available for web developers via JavaScript. In these browsers, the audio context constructor is webkit-prefixed, meaning that instead of creating a new AudioContext, you create a new webkitAudioContext. However, this will surely change in the future as the API stabilizes enough to ship un-prefixed and as other browser vendors implement it. Mozilla has publicly stated that they are implementing the Web Audio API in Firefox, and Opera has started participating in the working group.

With this in mind, here is a liberal way of initializing your audio context that would include other implementations (once they exist):

var contextClass = (window.AudioContext || 
  window.webkitAudioContext || 
  window.mozAudioContext || 
  window.oAudioContext || 
  window.msAudioContext);
if (contextClass) {
  // Web Audio API is available.
  var context = new contextClass();
} else {
  // Web Audio API is not available. Ask the user to use a supported browser.
}

A single audio context can support multiple sound inputs and complex audio graphs, so generally speaking, we will only need one for each audio application we create. The audio context instance includes many methods for creating audio nodes and manipulating global audio preferences. Luckily, these methods are not webkit-prefixed and are relatively stable. The API is still changing, though, so be aware of breaking changes (see Deprecation Notes).

Types of Web Audio Nodes

One of the main uses of audio contexts is to create new audio nodes. Broadly speaking, there are several kinds of audio nodes:

Source nodes: Sound sources such as audio buffers, live audio inputs, <audio> tags, oscillators, and JS processors
Modification nodes: Filters, convolvers, panners, JS processors, etc.
Analysis nodes: Analyzers and JS processors
Destination nodes: Audio outputs and offline processing buffers

Sources need not be based on sound files, but can instead be real-time input from a live instrument or microphone, redirection of the audio output from an audio element [see Setting Up Background Music with the <audio> Tag], or entirely synthesized sound [see Audio Processing with JavaScript]. Though the final destination-node is often the speakers, you can also process without sound playback (for example, if you want to do pure visualization) or do offline processing, which results in the audio stream being written to a destination buffer for later use.

Connecting the Audio Graph

Any audio node’s output can be connected to any other audio node’s input by using the connect() function. In the following example, we connect a source node’s output into a gain node, and connect the gain node’s output into the context’s destination:

// Create the source.
var source = context.createBufferSource();
// Create the gain node.
var gain = context.createGain();
// Connect source to filter, filter to destination.
source.connect(gain);
gain.connect(context.destination);

Note that context.destination is a special node that is associated with the default audio output of your system. The resulting audio graph of the previous code looks like Figure 1-3.

Once we have connected up a graph like this we can dynamically change it. We can disconnect audio nodes from the graph by calling node.disconnect(outputNumber). For example, to reroute a direct connection between source and destination, circumventing the intermediate node, we can do the following:

source.disconnect(0);
gain.disconnect(0);
source.connect(context.destination);

Power of Modular Routing

In many games, multiple sources of sound are combined to create the final mix. Sources include background music, game sound effects, UI feedback sounds, and in a multiplayer setting, voice chat from other players. An important feature of the Web Audio API is that it lets you separate all of these different channels and gives you full control over each one, or all of them together. The audio graph for such a setup might look like Figure 1-4.

Figure 1-4. Multiple sources with individual gain control as well as a master gain

We have associated a separate gain node with each of the channels and also created a master gain node to control them all. With this setup, it is easy for your players to control the level of each channel separately, precisely the way they want to. For example, many people prefer to play games with the background music turned off.

What Is Sound?

In terms of physics, sound is a longitudinal wave (sometimes called a pressure wave) that travels through air or another medium. The source of the sound causes molecules in the air to vibrate and collide with one another. This causes regions of high and low pressure, which come together and fall apart in bands. If you could freeze time and look at the pattern of a sound wave, you might see something like Figure 1-5.

Figure 1-5. A sound pressure wave traveling through air particles

Mathematically, sound can be represented as a function, which ranges over pressure values across the domain of time. Figure 1-6 shows a graph of such a function. You can see that it is analogous to Figure 1-5, with high values corresponding to areas with dense particles (high pressure), and low values corresponding to areas with sparse particles (low pressure).

Figure 1-6. A mathematical representation of the sound wave in Figure 1-5

Electronics dating back to the early twentieth century made it possible for us to capture and recreate sounds for the first time. Microphones take the pressure wave and convert it into an electric signal, where (for example) +5 volts corresponds to the highest pressure and −5 volts to the lowest. Conversely, audio speakers take this voltage and convert it back into the pressure waves that we can hear.

Whether we are analyzing sound or synthesizing it, the interesting bits for audio programmers are in the black box in Figure 1-7, tasked with manipulating the audio signal. In the early days of audio, this space was occupied by analog filters and other hardware that would be considered archaic by today’s standards. Today, there are modern digital equivalents for many of those old analog pieces of equipment. But before we can use software to tackle the fun stuff, we need to represent sound in a way that computers can work with.

Figure 1-7. Recording and playback

What Is Digital Sound?

We can do this by time-sampling the analog signal at some frequency, and encoding the signal at each sample as a number. The rate at which we sample the analog signal is called the sample rate. A common sample rate in many sound applications is 44.1 kHz. This means that there are 44,100 numbers recorded for each second of sound. The numbers themselves must fall within some range. There is generally a certain number of bits allocated to each value, which is called the bit depth. For most recorded digital audio, including CDs, the bit depth is 16, which is generally enough for most listeners. Audiophiles prefer 24-bit depth, which gives enough precision that people’s ears can’t hear the difference compared to a higher depth.

The process of converting analog signals into digital ones is called quantization (or sampling) and is illustrated in Figure 1-8.

Figure 1-8. Analog sound being quantized, or transformed into digital sound

In Figure 1-8, the quantized digital version isn’t quite the same as the analog one because of differences between the bars and the smooth line. The difference (shown in blue) decreases with higher sample rates and bit depths. However, increasing these values also increases the amount of storage required to keep these sounds in memory, on disk, or on the Web. To save space, telephone systems often used sample rates as low as 8 kHz, since the range of frequencies needed to make the human voice intelligible is far smaller than our full audible-frequency range.

By digitizing sound, computers can treat sounds like long arrays of numbers. This sort of encoding is called pulse-code modulation (PCM). Because computers are so good at processing arrays, PCM turns out to be a very powerful primitive for most digital-audio applications. In the Web Audio API world, this long array of numbers representing a sound is abstracted as an AudioBuffer. AudioBuffers can store multiple audio channels (often in stereo, meaning a left and right channel) represented as arrays of floating-point numbers normalized between −1 and 1. The same signal can also be represented as an array of integers, which, in 16-bit, range from (−2¹⁵) to (2¹⁵ − 1).

Audio Encoding Formats

Raw audio in PCM format is quite large, which uses extra memory, wastes space on a hard drive, and takes up extra bandwidth when downloaded. Because of this, audio is generally stored in compressed formats. There are two kinds of compression: lossy and lossless. Lossless compression (e.g., FLAC) guarantees that when you compress and then uncompress a sound, the bits are identical. Lossy audio compression (e.g., MP3) exploits features of human hearing to save additional space by throwing out bits that we probably won’t be able to hear anyway. Lossy formats are generally good enough for most people, with the exception of some audiophiles.

A commonly used metric for the amount of compression in audio is called bit rate, which refers to the number of bits of audio required per second of playback. The higher the bit rate, the more data that can be allocated per unit of time, and thus the less compression required. Often, lossy formats, such as MP3, are described by their bit rate (common rates are 128 and 192 Kb/s). It’s possible to encode lossy codecs at arbitrary bit rates. For example, telephone-quality human speech is often compared to 8 Kb/s MP3s. Some formats such as OGG support variable bit rates, where the bit rate changes over time. Be careful not to confuse this concept with sample rate or bit depth [see What Is Sound?]!

Browser support for different audio formats varies quite a bit. Generally, if the Web Audio API is implemented in a browser, it uses the same loading code that the <audio> tag would, so the browser support matrix for <audio> and the Web Audio API is the same. Generally, WAV (which is a simple, lossless, and typically uncompressed format) is supported in all browsers. MP3 is still patent-encumbered, and is therefore not available in some purely open source browsers (e.g., Firefox and Chromium). Unfortunately, the less popular but patent-unencumbered OGG format is still not available in Safari at the time of this writing.

For a more up-to-date roster of audio format support, see http://mzl.la/13kGelS.

Loading and Playing Sounds

Web Audio API makes a clear distinction between buffers and source nodes. The idea of this architecture is to decouple the audio asset from the playback state. Taking a record player analogy, buffers are like records and sources are like playheads, except in the Web Audio API world, you can play the same record on any number of playheads simultaneously! Because many applications involve multiple versions of the same buffer playing simultaneously, this pattern is essential. For example, if you want multiple bouncing ball sounds to fire in quick succession, you need to load the bounce buffer only once and schedule multiple sources of playback [see Multiple Sounds with Variations].

To load an audio sample into the Web Audio API, we can use an XMLHttpRequest and process the results with context.decodeAudioData. This all happens asynchronously and doesn’t block the main UI thread:

var request = new XMLHttpRequest();
request.open('GET', url, true);
request.responseType = 'arraybuffer';

// Decode asynchronously
request.onload = function() {
  context.decodeAudioData(request.response, function(theBuffer) {
    buffer = theBuffer;
  }, onError);
}
request.send();

Audio buffers are only one possible source of playback. Other sources include direct input from a microphone or line-in device or an <audio> tag among others (see Integrating with Other Technologies).

Once you’ve loaded your buffer, you can create a source node (AudioBufferSourceNode) for it, connect the source node into your audio graph, and call start(0) on the source node. To stop a sound, call stop(0) on the source node. Note that both of these function calls require a time in the coordinate system of the current audio context (see Perfect Timing and Latency):

function playSound(buffer) {
  var source = context.createBufferSource();
  source.buffer = buffer;
  source.connect(context.destination);
  source.start(0);
}

Games often have background music playing in a loop. However, be careful about being overly repetitive with your selection: if a player is stuck in an area or level, and the same sample continuously plays in the background, it may be worthwhile to gradually fade out the track to prevent frustration. Another strategy is to have mixes of various intensity that gradually crossfade into one another depending on the game situation [see Gradually Varying Audio Parameters].

Putting It All Together

As you can see from the previous code listings, there’s a bit of setup to get sounds playing in the Web Audio API. For a real game, consider implementing a JavaScript abstraction around the Web Audio API. An example of this idea is the following BufferLoader class. It puts everything together into a simple loader, which, given a set of paths, returns a set of audio buffers. Here’s how such a class can be used:

window.onload = init;
var context;
var bufferLoader;

function init() {
  context = new webkitAudioContext();

  bufferLoader = new BufferLoader(
    context,
    [
      '../sounds/hyper-reality/br-jam-loop.wav',
      '../sounds/hyper-reality/laughter.wav',
    ],
    finishedLoading
    );

  bufferLoader.load();
}

function finishedLoading(bufferList) {
  // Create two sources and play them both together.
  var source1 = context.createBufferSource();
  var source2 = context.createBufferSource();
  source1.buffer = bufferList[0];
  source2.buffer = bufferList[1];

  source1.connect(context.destination);
  source2.connect(context.destination);
  source1.start(0);
  source2.start(0);
}

For a simple reference implementation of BufferLoader, take a look at http://webaudioapi.com/samples/shared.js.