Digital Audio with the DFPlayer

My oldest daughter and I recently built one of Mr. Baddeley’s Baby R2s. These units are small, almost cartoonish, radio-controlled R2D2s that are extremely easy to build. And fun! But something missing from the design (at least at the time of this writing) is the ability for these little guys to produce sounds. And what’s an R2D2 without his signature beeps and boops?

For my life-size R2, I incorporated an MP3 Trigger from SparkFun and paired it with an Arduino. But I couldn’t use that here because the MP3 Trigger is too large. The Baby R2s just can’t accommodate it. So I went in search of something else. And that’s when I came across the DFPlayer from DFRobot.

In this article (the first of three), we’ll be exploring the DFPlayer and beginning our journey into ultimately using an RC radio to trigger audio. If that’s not something that interests you, no worries. This article is focused entirely on the DFPlayer.

DFPlayer

DFRobot’s DFPlayer is a tiny (~21mm x 21mm) module that’s capable of playing MP3, WMA, and WAV audio data. It features a micro-SD slot for your audio files. It can be connected directly to a small speaker (< 3 W) or an amplifier. And it can be controlled a few different ways, including via a serial connection, which works well for my particular needs.

Perhaps the biggest selling point for the DFPlayer is its price – $6 as of this writing. Compare that to the SparkFun MP3 Trigger, which comes in at around $50. The DFPlayer is practically a guilt-free impulse buy. In fact, I picked up a 3-pack from Amazon for around $10.

One of the downsides to this module is that there are various models that exhibit different tolerances to electrical noise, which means you might struggle with an audible hiss. Killzone_kid posted a great writeup on the Arduino forums that examines some of the models and makes some recommendations on possible ways to mitigate the hiss. Some of the flavors also apparently have compatibility issues with the officially supported DFPlayer Arduino library, which we’ll look at in a bit.

Here’s the pin diagram for the DFPlayer.

Left-Side Pins

There’s a Vcc pin and a ground pin as you might expect. These pins are used to power up the device. The source voltage must be between 3.2V and 5V.

The Serial RX and TX pins provide a serial interface for controlling the device. It defaults to 9600 baud, 8 data bits, no parity, 1 stop bit, and no flow control. The serial protocol is detailed in the datasheet. But there’s also an official support library for Arduino called DFRobotDFPlayerMini that implements the serial protocol and provides a high-level interface for controlling the DFPlayer.
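
For the curious, here’s a rough sketch of what a single command frame looks like on the wire, based on my reading of the datasheet’s frame format: a 0x7E start byte, version, length, command, feedback flag, two parameter bytes, a two-byte checksum, and a 0xEF end byte. Treat the details as something to verify against your own copy of the datasheet – the library handles all of this for you.

// Sketch only: builds a 10-byte DFPlayer command frame per my reading of the datasheet.
// cmd is the command byte (e.g., 0x03 is "play the specified track") and param is its
// 16-bit argument (e.g., the track number).
void buildDFPlayerFrame(uint8_t cmd, uint16_t param, uint8_t frame[10])
{
    frame[0] = 0x7E;                 // start byte
    frame[1] = 0xFF;                 // version
    frame[2] = 0x06;                 // length (version byte through the parameter bytes)
    frame[3] = cmd;                  // command
    frame[4] = 0x00;                 // 0x00 = don't request an acknowledgement
    frame[5] = (param >> 8) & 0xFF;  // parameter, high byte
    frame[6] = param & 0xFF;         // parameter, low byte

    // The checksum is the two's complement of the sum of bytes 1 through 6.
    uint16_t sum = frame[1] + frame[2] + frame[3] + frame[4] + frame[5] + frame[6];
    uint16_t checksum = (uint16_t)(0 - sum);
    frame[7] = (checksum >> 8) & 0xFF;
    frame[8] = checksum & 0xFF;

    frame[9] = 0xEF;                 // end byte
}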

The Amp Out pins are for connecting to an audio amplifier or headphones.

The Spkr pins are for very small speakers – less than 3 watts. This is actually what I’ll be using for my project. I’m connecting the DFPlayer to a small speaker that I harvested from one of my kids’ annoying…er, I mean, broken toys.

 

Right-Side Pins

On the right-hand side of the pin diagram, you’ll see pairs of pins for I/O and ADKey. These are two other mechanisms for controlling the DFPlayer. We’ll use the I/O pins to test the DFPlayer shortly. But we won’t be using the ADKey pins at all. I won’t be discussing them further. If you want to learn more about them, I advise you to check out the DFPlayer Wiki.

The USB pins allow the DFPlayer to work as a USB device. I haven’t been able to find very much information about this. It’s not discussed much in the DFPlayer manual. Apparently, it provides a mechanism for updating the contents of the SD card from your PC. This could be handy for projects where the DFPlayer ends up hidden away inside of an enclosure that can accommodate a USB connector on the outside. For my project, I don’t need it so I won’t be exploring it. However, if anyone knows where I can find more information about this, please let me know. It could come in handy on another project.

The pin labeled “Playing Status”, referred to as the “Busy” pin, is normally high when the device is idle and goes low when the device is playing audio. The device already has a small LED that lights up when it’s playing files. But if that’s not good enough, you can connect an LED to this pin, or connect it to a microcontroller for more sophisticated behaviors.
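
If you do want a microcontroller to watch the Busy pin, it’s just a matter of reading it as a digital input. Here’s a minimal sketch – the choice of pin 2 is arbitrary, and the active-low behavior is as described above.

// Hypothetical wiring: the DFPlayer's Busy pin connected to Arduino digital pin 2.
const int BUSY_PIN = 2;

void setup()
{
    pinMode(BUSY_PIN, INPUT);
    pinMode(LED_BUILTIN, OUTPUT);
}

void loop()
{
    // The Busy pin sits high when idle and goes low while audio is playing.
    bool isPlaying = (digitalRead(BUSY_PIN) == LOW);
    digitalWrite(LED_BUILTIN, isPlaying ? HIGH : LOW);
}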

Adding Media Files

The DFPlayer manual describes the naming of folders and files on the SD card. For files, it specifies names using 3-digit numbers, such as 001.mp3, 002.wav, etc. Folders can be named similarly. I didn’t actually create any folders on my SD card, and it worked just fine. My file layout looks like so.
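
In other words, just a handful of numbered files sitting in the root of the card, something like this (the exact names here are illustrative, not my actual files):

001.mp3
002.mp3
003.wav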

Testing the DFPlayer

Before doing anything else, I like to do a quick smoke test. This simply involves powering up devices straight out of the box (if feasible) and seeing if any magic blue smoke appears. I also like to note if any LEDs light up, as there’s often a power LED that will indicate the device is at least turning on. In this case, nothing happened. After a bit of reading, I learned that the device’s single LED only lights up when it’s actually playing media. So at this point, I wasn’t sure if it was even powering up. My benchtop power supply said the DFPlayer was pulling a small amount of current, so something was happening.

Next, I wanted to see if I could get some sound out of the device. I plugged in my micro SD card and connected my speaker. The I/O pins (9 and 11) were the key to this. Grounding I/O pin 1 briefly causes the DFPlayer to play the “next” track, while grounding it for a longer duration lowers the audio level. Grounding I/O pin 2 briefly causes the DFPlayer to play the “previous” track, while grounding it for a longer duration raises the audio level.

I grounded I/O pin 2 for a couple of seconds to get the audio level all the way up, then gave I/O pin 1 a quick tap. The device’s LED lit up and I immediately heard some beeps and boops. The device was working. Success!

Now I knew any issues I might have trying to drive it over a serial connection would be limited to my code and/or the serial interface itself.

Connecting the Arduino

For this project, I used a Nano clone from Lavfin. These come with the pins pre-soldered. When I originally bought this, you could get a pack of 3 from Amazon for around $14. The price has since gone up to $28 as of this writing (presumably, because of supply chain issues).

For testing, I’m using a Nano expansion board, which provides convenient screw terminals.

 

I connected the DFPlayer to the Arduino using the serial pins of the DFPlayer and digital pins 10 and 11 of the Arduino. The DFPlayer’s RX pin connects to the Arduino’s digital pin 11. The DFPlayer TX pin connects to the Arduino’s digital pin 10.

Why didn’t I use the Arduino’s serial pins? I could have. But the process of writing new code to the Arduino makes use of the UART. So I’d have to disconnect and reconnect the DFPlayer every time I wanted to update the software.

I don’t recommend connecting the DFPlayer Vcc and ground pins to the Arduino’s 5v and ground pins unless you REALLY need to leverage the Arduino’s voltage regulator. This is how I’ve seen it wired in the online examples. It’s convenient, sure. But the Arduino has current supply limitations. Having the DFPlayer and the Arduino connected independently to the power source is the better option.

This is how my DFPlayer and Nano are wired together.
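
In text form, the connections described above are:

DFPlayer RX  -> Arduino digital pin 11
DFPlayer TX  -> Arduino digital pin 10
DFPlayer Vcc -> power source (not the Arduino’s 5V pin)
DFPlayer GND -> power source ground (common with the Arduino)
DFPlayer Spkr pins -> small speaker (< 3 W)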

The source code for my Arduino/DFPlayer test is as follows.

#include "Arduino.h"
#include "DFRobotDFPlayerMini.h"
#include "SoftwareSerial.h"
 
static SoftwareSerial g_serial(10, 11);
static DFRobotDFPlayerMini g_dfPlayer;
 
/**
 * Called when the Arduino starts up.
 */
void setup()
{
    // We'll use the built-in LED to indicate a communications
    // problem with the DFPlayer.
    pinMode(LED_BUILTIN, OUTPUT);
    digitalWrite(LED_BUILTIN, LOW);
 
    // Let's give the DFPlayer some time to startup.
    delay(2000);
 
    g_serial.begin(9600);
 
    if (!g_dfPlayer.begin(g_serial))
    {
        // There's a problem talking to the DFPlayer. Let's turn on the LED
        // and halt.
        digitalWrite(LED_BUILTIN, HIGH);
        while(true)
        {
            delay(0);
        }
    }
 
    // Valid values for volume go from 0-30.
    g_dfPlayer.volume(20);
    // Plays the first file found on the filesystem.
    g_dfPlayer.play(1);
}
 
// Called over and over as long as the Arduino is powered up.
void loop()
{
    static unsigned long timeLastSoundPlayed = millis();
 
    // We're going to iterate through the sounds using DFRobotDFPlayerMini's next() function,
    // playing a new sound every five seconds.
    if ((millis() - timeLastSoundPlayed) > 5000)
    {
        g_dfPlayer.next();
        timeLastSoundPlayed = millis();
    }
 
    // Consumes any data that might be waiting for us from the DFPlayer.
    // We don't do anything with it. We could check it and report an error via the
    // LED. But we can't really dig ourselves out of a bad spot, so I opted to
    // just ignore it.
    if (g_dfPlayer.available())
        g_dfPlayer.read();
}

To build this sketch, the DFRobotDFPlayerMini library must be installed. This can be downloaded from GitHub –
https://github.com/DFRobot/DFRobotDFPlayerMini. Installing it is as simple as extracting it to the Arduino libraries directory (e.g., C:\Program Files (x86)\Arduino\libraries).

Line #5 creates an instance of the SoftwareSerial class, which I call g_serial. This is used to emulate a UART over digital pins. I reserved the hardware UART so I could update the software on the Arduino while it’s connected to the DFPlayer. If you’d rather use the hardware UART and the Serial global variable, that’ll work fine too. You just have to disconnect the DFPlayer every time you need to update the code. The two arguments to the SoftwareSerial constructor are the pin numbers for RX and TX, respectively.

Line #2 of the source above includes the DFRobotDFPlayerMini header file, which brings the various DFRobotDFPlayerMini types into scope. I then declare a static global instance of DFRobotDFPlayerMini called g_dfPlayer. This will be the object we use to interact with the DFPlayer.

The first thing the setup() function does is configure the Arduino’s on-board LED. Since I’m not actually using the Serial Monitor in the IDE, I wanted to light this up if there were problems initiating communications with the DFPlayer.

I then delay execution for 2 seconds to give the DFPlayer time to start up. This is important to note because I didn’t see it done in any of the sample sketches that came with the DFRobotDFPlayerMini library. Without the delay, things just wouldn’t work for me. I don’t know if it’s because of my particular flavor of DFPlayer and/or Arduino. But 2 seconds seems to be the delay I need to get the devices talking to one another.

I call begin() on g_serial with an argument of 9600 baud, since that’s the baud rate the DFPlayer supports. I then attempt to start communication with the DFPlayer by calling g_dfPlayer’s begin() function, passing it the serial object I want it to use. If an error occurs, I light up the LED and effectively halt. If no error occurs, I crank the volume up to 20 (out of a maximum of 30) and play the first file on the DFPlayer’s file system. If all is well, I should hear a sound.

In the loop() function, we do two things – 1) check to see if it’s been 5 seconds since we last played a sound and, if so, play one and 2) eat up any data sent to us by the DFPlayer. I should point out that I don’t actually know if we need to consume data if we’re not doing anything with it. But without diving too deep into the DFRobotDFPlayerMini code, I’m erring on the side of caution and hoping to keep some buffer somewhere from filling up.

Once this code was compiled and flashed to the Arduino, I had to restart both the Arduino and the DFPlayer. After a couple of seconds, I started hearing more beeps and boops.

Wrapping Up

This code is a good starting point. I encourage you to explore the sample code that accompanies the DFRobotDFPlayerMini library. There are some good nuggets there.

In the next article, I’ll be focusing on interfacing a radio controller with the Arduino. And then a third article will follow that which will tie everything together so that we can trigger our audio with a radio.

Until next time…

Appendix B – Introduction to Windows Core Audio

For a little while now, I’ve been hard at work on a little side project. I almost hesitate to announce it at this point, because it’s still very early. But, what the heck. Why not. I’ve started writing a book. And no, it’s not one full of suspense and intrigue. Nor is it the next young adult break-out series. Turns out, I’m writing a programming book. The working title is “Practical Digital Audio for C++ Programmers”, which I admit is a mouthful. Henceforth (at least as far as this blog entry is concerned), I shall refer to it as PDA4CPP.

When I first started my audio programming journey, I quickly discovered there was a huge hole in the information available to newcomers to the field. There was plenty of material to be found regarding specific audio libraries. And there was even more material that discussed, in very mind-bendy ways, things like audio effects and synthesis that assumed you already had some level of comfort with digital audio programming. But there was very little in-between. And as a complete newb, I found it super discouraging. So I decided to do something about it. PDA4CPP is the fruit of my labor.

As I mentioned, the book is in its infancy. Only one chapter has been completed to date – Appendix B: Introduction to Windows Core Audio. But it’s a beast, coming in at 170 pages. In it, I talk about where Core Audio fits into the Windows story, the Windows audio architecture, device discovery, audio formats, WASAPI, audio rendering, and audio capturing.

Why did I start with Appendix B? Some of it was because of the questions and feedback I received from my blog entry, “A Brief History of Windows Audio APIs”. But mostly, I started with Appendix B because that’s where I needed to start. Most of the book’s code will be implemented around a custom audio library that’s effectively a thin wrapper around platform-specific audio code. The Windows side of things provided as great a starting point as any.

Something I’m going to experiment with is making drafts of the book’s chapters available for purchase as I complete them. Not only will this help motivate me to keep writing, but it will also help me gauge interest. Appendix B is the first chapter available for purchase. Pricing for each chapter will vary based on its size and density. More information can be found on the book’s page, located under the “Pages” menu. An excerpt is available, as well as the chapter’s source code.

If you purchase the chapter and love it, hate it, or have ideas on how to improve it, please email me or leave a comment below.

Thanks!

-Shane

A Brief History of Windows Audio APIs

A few months ago, the audio programming bug bit me pretty hard. I’m not entirely sure why it took so long really. I’ve been recording and mixing music since college. And much of my software development career has been built on a giant mass of C/C++ code. But somehow these worlds never converged. Somehow, with all of the time I’ve spent in front of tools like Cakewalk Sonar and FruityLoops, it never occurred to me that I might be able to learn what these applications are doing under the hood.

Then my nightstand began accumulating books with titles like “The Theory of Sound” and “Who Is Fourier? A Mathematical Adventure”. I absorbed quite a bit of theoretical material in a relatively short amount of time. But as with anything, you must use it to learn it. So I looked to the platform I already use for all of my audio recording – Windows.

What I found was a dark, mysterious corner in the Windows platform. There’s not a lot in the way of introductory material here. As of this writing, I could find no books dedicated to Windows audio programming. Sure, there’s MSDN, but, um, it’s MSDN. I also spent some time digging through back issues of Windows Developer’s Journal, MSDN Magazine, Dr. Dobbs, etc. and the pickings were slim. It seemed the best sources of information were blogs, forums, and StackOverflow. The trick was wading through the information and sorting it all out.

Developers new to Windows audio application development, like me, are often overwhelmed by the assortment of APIs available. I’m not just talking about third party libraries. I’m talking about the APIs baked into Windows itself. This includes weird sounding things like MME, WASAPI, DirectSound, WDM/KS, and XAudio2. There are a lot of different paths a developer could take. But which one makes the most sense? What are the differences between them all? And why are there so many options?

I needed a bit more information and context in deciding how I was going to spend my time. And for this, I had to go back to 1991.

1991 – Windows Multimedia Extensions (aka MME, aka WinMM): Ahhh…1991. That was the year both Nirvana’s “Nevermind” and “Silence of the Lambs” entered pop culture. It was also the year of the first Linux kernel and the very first web browser. Most of us didn’t realize it at the time, but a lot of cool stuff was happening.

Most PCs of this vintage had tiny little speakers that were really only good at producing beeps and bloops. Their forte was square waves. They could be coerced into producing more sophisticated sounds using a technique called Pulse Width Modulation, but the quality wasn’t much to get excited about. That “Groove Is in the Heart” sound file being played through your PC speaker might be recognizable, but it certainly wasn’t going to get anybody on the dance floor.

Sound cards didn’t usually come bundled with name brand PCs, but they were becoming more and more popular all the time. Independently owned computer shops were building and selling homebrew PCs with sound cards from companies like Creative Labs and Adlib. Folks not lucky enough to buy a computer bundled with a sound card could buy an add-on card out of the back of a magazine like PC Computing or Computer Shopper and be up and running in no time.

The ’90s were also the golden age of the demo scene. Programmers pushed the limits of graphics and audio hardware in fewer bytes than most web pages weigh today. Amiga MOD files were a big deal too. They even inspired many audio enthusiasts to build their own parallel port DACs for the best audio experience. And then there were the video games. Game publishers like Apogee and Sierra Entertainment were cranking out awesome game titles, most of which could take advantage of Sound Blaster or Adlib cards if they were available.

Professional audio on the PC existed, but it was usually implemented using external hardware solutions, proprietary software, and proprietary communications protocols. Consumer grade sound card manufacturers were adding MIDI support in the form of a dual purpose joystick port that seemed oddly out of place. It was more of a marketing tactic than a useful feature. Most consumers had no idea what MIDI was.

It was at this point that Microsoft decided to add an audio API for Windows. Windows 3.0 had been out for a year and was in widespread use. So Microsoft released a version of Windows 3.0 called Windows 3.0 with Multimedia Extensions (abbreviated MME, sometimes referred to in software development circles as the waveOut API). MME has both a high-level and a low-level API. The low-level API supports waveform audio and MIDI input/output. It has function names that start with waveIn, waveOut, midiIn, midiStream, etc. The high-level API, the Media Control Interface (MCI), is REALLY high level. MCI is akin to a scripting language for devices.

MME was the very first standard audio API for Windows. It’s evolved a bit over the years, to be sure. But it’s still around. And it works well, but with some caveats.

Latency is a problem with MME. Dynamic, near-real-time audio (e.g., game event sounds, software synthesizers, etc.) is a bit harder to do in a timely fashion. Anything that occurs 10ms later than the brain thinks it should is perceived to be out of sync. So that kind of programming is pretty much out of the question. However, pre-generated content (e.g., music files, ambient sounds, Windows system sounds, etc.) works well with MME. At the time, that was good enough.

MME is still around. Some might even use the word thriving. Historically, support for high quality audio has been a pain point for MME. Parts of the MME API (e.g., anything that deals with the device capability structures WAVEINCAPS and WAVEOUTCAPS) can only describe a maximum of 96kHz, 16-bit audio. However, in modern versions of Windows, MME is built on top of Core Audio (more on this later). You may find that even though a device can’t report itself as capable of higher quality audio, higher sample rates and bit depths work anyway.
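
To make that limitation concrete, here’s a minimal sketch (my own illustration, not code from the original API documentation) that queries the default wave output device’s capabilities with the MME API. The dwFormats member is a bitmask drawn from a fixed set of standard formats, and WAVE_FORMAT_96S16 (96kHz, 16-bit, stereo) is the highest format that bitmask can describe. Build as a Win32 console app and link against winmm.lib.

#include <windows.h>
#include <mmsystem.h>
#include <stdio.h>

int main()
{
    WAVEOUTCAPSA caps = { 0 };

    // WAVE_MAPPER asks for the capabilities of the default wave output device.
    if (waveOutGetDevCapsA(WAVE_MAPPER, &caps, sizeof(caps)) == MMSYSERR_NOERROR)
    {
        printf("Device: %s\n", caps.szPname);

        // dwFormats can only advertise the standard formats it has bits for,
        // topping out at 96kHz/16-bit stereo.
        printf("Claims 96kHz/16-bit stereo support: %s\n",
               (caps.dwFormats & WAVE_FORMAT_96S16) ? "yes" : "no");
    }

    return 0;
}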

1995 – DirectSound (aka DirectX Audio): When Windows 3.1 came out in 1992, MME was officially baked in. But Windows still left game developers uninspired. All of the consumer versions of Windows up to this point were effectively shells on top of DOS. It was in the way. It consumed memory and other resources that games desperately needed. DOS was well known and already a successful platform for games. With DOS, games didn’t have to compete for resources and they could access hardware directly. As a result, most PC games continued to be released as they had been – DOS only.

Along came Windows 95. Besides giving us the infamous “Start” button and the music video for Weezer’s “Buddy Holly”, Windows 95 brought with it DirectX. DirectX was core to Microsoft’s strategy for winning over game developers, whom they saw as important for the success of Windows 95.

DirectX was the umbrella name given to a collection of COM-based multimedia APIs, which included DirectSound. DirectSound distinguished itself from MME by providing things like on the fly sample rate conversion, effects, multi-stream mixing, alternate buffering strategies, and hardware acceleration where available (in modern versions of Windows, this is no longer the case. See the discussion on Core Audio below). Because DirectSound was implemented using VxDs, which were kernel mode drivers, it could work extremely close to the hardware. It provided lower latency and support for higher quality audio than MME.

DirectSound, like the rest of DirectX, wasn’t an instant hit. It took game developers time, and a bit of encouragement on the part of Microsoft, to warm up to it. Game development under DOS, after all, was a well worn path. People knew it. People understood it. There was also a fear that maybe DirectX would be replaced, just as its predecessor WinG (a “high-performance” graphics API) had been. But eventually the gaming industry was won over and DirectX fever took hold.

As it relates to professional audio, DirectSound was a bit of a game changer. There were PC-based DAW solutions before DirectX, to be sure. From a software perspective, most of them were lightweight applications that relied on dedicated hardware to do all of the heavy lifting. And, along with their hardware, those applications did their best to sidestep Windows’ driver system. DirectSound made it practical to interact with hardware through a simple API. This allowed pro-audio applications to decouple themselves from the hardware they supported. The umbilical cord between professional grade audio software and hardware could be severed.

DirectX also brought pluggable, software based audio effects (DX effects) and instruments (DXi Instruments) to the platform. This is similar in concept to VST technology from Steinberg. Because DX effects and instruments are COM based components, they’re easily discoverable and consumable by any running application. This meant effects and software synthesizers could be developed and marketed independently of recording applications. Thanks to VST and DX effects, a whole new market was born that continues to thrive today.

Low latency, multi-stream mixing, high resolution audio, pluggable effects and instruments – all of these were huge wins for DirectSound.

1998 – Windows Driver Model / Kernel Streaming (aka WDM/KS): After the dust settled with Windows 95, Microsoft began looking at their driver model. Windows NT had been around for a few years. And despite providing support for the same Win32 API as its 16-bit/32-bit hybrid siblings, Windows NT had a very different driver model. This meant if a hardware vendor wanted to support both Windows NT and Windows 95, they needed to write two completely independent drivers – drivers for NT built using the Windows NT Driver Model and VxDs for everything else.

Microsoft decided to fix this problem and the Windows Driver Model (WDM) was born. WDM is effectively an enhanced version of the Windows NT Driver Model, which was a bit more sophisticated than the VxDs used by Windows 95 and 3.x. One of the big goals for WDM, however, was binary and source code compatibility across all future versions of Windows. A single driver to rule them all. And this happened. Sort of.

Windows 98 was the first official release of Windows to support WDM, in addition to VxDs. Windows 2000, a derivative of Windows NT, followed two years later and only supported WDM drivers. Windows ME, the butt of jokes for years to come, arrived not long after. But ME was the nail in the coffin for the Windows 9.x product line. The technology had grown stale. So the dream of supporting a single driver model across both the NT and the 9.x lines was short lived. All versions of Windows since have effectively been iterations of Windows NT technology. And WDM has since been the lone driver model for Windows.

So what’s this WDM business got to do with audio APIs? Before WDM came about, Windows developers were using either DirectSound or MME. MME developers were used to dealing with latency issues. But DirectSound developers were used to working a bit closer to the metal. With WDM, both MME and DirectSound audio now passed through something called the Kernel Audio Mixer (usually referred to as the KMixer). KMixer was a kernel mode component responsible for mixing all of the system audio together. KMixer introduced latency. A lot of it. 30 milliseconds, in fact. And sometimes more. That may not seem like a lot, but for a certain class of applications this was a non-starter.

Pro-audio applications, such as those used for live performances and multitrack recording, were loath to embrace KMixer. Many developers of these types of applications saw KMixer as justification for using non-Microsoft APIs such as ASIO and GSIF, which avoided the Windows driver system entirely (assuming the hardware vendors provided the necessary drivers).

Cakewalk, a Boston-based company famous for their DAW software, started a trend that others quickly adopted. In their Sonar product line starting with version 2.2, they began supporting a technique called WDM/KS. The WDM part you know. The KS stands for Kernel Streaming.

Kernel streaming isn’t an official audio API, per se. It’s something a WDM audio driver supports as part of its infrastructure. The WDM/KS technique involves talking directly to the hardware’s streaming driver, bypassing KMixer entirely. By doing so, an application could avoid paying the KMixer performance tax, reduce the load on the CPU, and have direct control over the data delivered to the audio hardware. Latency wasn’t eliminated. Audio hardware introduces its own latency, after all. But the performance gains could be considerable. And with no platform components manipulating the audio data before it reached the hardware, applications could exert finer control over the integrity of the audio as well.

The audio software community pounced on this little trick and soon it seemed like everybody was supporting WDM/KS.

It’s worth noting at this point in the story that, in special circumstances, DirectSound could actually bypass KMixer. If hardware mixing was supported by both the audio hardware and the application, DirectSound buffers could be dealt with directly by the audio hardware. It wasn’t a guaranteed thing, though. And I only mention it here in fairness to DirectSound.

2007 – Windows Core Audio: It was almost 10 years before anything significant happened with the Windows audio infrastructure. Windows itself entered an unusually long lull period. XP came out in 2001. Windows Vista, whose development had begun 5 months before XP had even been released, was fraught with missteps and even a development “reboot”. When Vista finally hit the store shelves in 2007, both users and developers were inundated with a number of fundamental changes in the way things worked. We were introduced to things like UAC, Aero, BitLocker, ReadyBoost, etc. The end user experience of Vista wasn’t spectacular. Today, most people consider it a flop. Some even compare it to Windows ME. But for all of its warts, Vista introduced us to a bevy of new technologies that we still use today. Of interest for this discussion is Windows Core Audio.

Windows Core Audio, not to be confused with OSX’s similarly named Core Audio, was a complete redesign of the way audio is handled on Windows. KMixer was killed and buried. Most of the audio components were moved from kernel land to user land, which had a positive impact on system stability. (Since WDM was accessed via kernel mode operations, WDM/KS applications could easily BSOD the system if not written well.) All of the legacy audio APIs we knew and loved were shuffled around and suddenly found themselves built on top of this new user mode API. This included DirectSound, which at this point lost support for hardware accelerated audio entirely. Sad news for DirectSound applications, but sadder news was to come (more on this in a bit).

Core Audio is actually 4 APIs in one – MMDevice, WASAPI, DeviceTopology, and EndpointVolume. MMDevice is the device discovery API. The API for interacting with all of the software components that exist in the audio path is the DeviceTopology API. For interacting with volume control on the device itself, there’s the EndpointVolume API. And then there’s the audio session API – WASAPI. WASAPI is the workhorse API. It’s where all of the action happens. It’s where the sausage, er, sound gets made.

Along with new APIs came a number of new concepts, such as audio sessions and device roles. Core Audio is much better suited to the modern era of computing. Today we live in an ecosystem of devices. Users no longer have a single audio adapter and a set of speakers. We have headphones, speakers, bluetooth headsets, USB audio adapters, webcams, HDMI connected devices, WiFi connected devices, etc. Core Audio makes it easy for applications to work with all of these things based on use-case.

Another significant improvement Core Audio brings us is the ability to operate in either shared mode or exclusive mode.

Shared mode has some parallels with the old KMixer model. With shared mode, applications write to a buffer that’s handed off to the system’s audio engine. The audio engine is responsible for mixing all applications’ audio together and sending the mix to the audio driver. As with KMixer, this introduces latency.

Exclusive mode is Microsoft’s answer to the pro-audio world. Exclusive mode has many of the same advantages as WDM/KS. Applications have exclusive access to the hardware, and audio data travels directly from the application to the driver to the hardware. You also have more flexibility in audio formats with exclusive mode as compared to shared mode. The audio data format can be whatever the hardware supports – even non-PCM data.
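
To make the two modes a bit more concrete, here’s a bare-bones sketch (my own illustration, with error handling omitted) that uses MMDevice to find the default render endpoint and WASAPI to set up a shared mode session using the engine’s mix format. Swapping AUDCLNT_SHAREMODE_SHARED for AUDCLNT_SHAREMODE_EXCLUSIVE – along with a format the hardware actually accepts – is the essence of an exclusive mode session.

#include <objbase.h>
#include <mmdeviceapi.h>
#include <audioclient.h>

int main()
{
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);

    // MMDevice: grab the default render endpoint for the "console" role.
    IMMDeviceEnumerator* enumerator = nullptr;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), nullptr, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void**)&enumerator);

    IMMDevice* device = nullptr;
    enumerator->GetDefaultAudioEndpoint(eRender, eConsole, &device);

    // WASAPI: activate an audio client on that endpoint.
    IAudioClient* audioClient = nullptr;
    device->Activate(__uuidof(IAudioClient), CLSCTX_ALL, nullptr, (void**)&audioClient);

    // In shared mode, you render in the audio engine's mix format.
    WAVEFORMATEX* mixFormat = nullptr;
    audioClient->GetMixFormat(&mixFormat);

    // Buffer duration is in 100-nanosecond units; 10,000,000 = 1 second.
    audioClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0, 10000000, 0, mixFormat, nullptr);

    // From here you'd call GetService() for an IAudioRenderClient and start feeding it audio.

    CoTaskMemFree(mixFormat);
    audioClient->Release();
    device->Release();
    enumerator->Release();
    CoUninitialize();
    return 0;
}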

At this point, you might assume WDM/KS can go away. Well, it can’t. As I said before, it’s not really an API. It’s part of the WDM driver infrastructure, so it will continue to exist so long as WDM exists. However, there’s no compelling reason to use WDM/KS for modern audio applications. An exclusive mode audio session in Core Audio is safer and just as performant. Plus it has the advantage of being a real audio API.

As of this writing, Windows 10 is the latest version of Windows and Core Audio still serves as the foundation for platform audio.

2008 – XAudio2: Over the years, DirectX continued to evolve. The Xbox, which was built on DirectX technologies, was a significant source of influence in the direction DirectX took. The “X” in Xbox comes from DirectX, after all. When DirectX 10 came out in 2007, it was evident that Microsoft had gone into their latest phase of DirectX development with guns blazing. Many APIs were deprecated. New APIs appeared that started with the letter “X”, such as XInput and XACT3.

XAudio2 appeared in the DirectX March 2008 SDK and was declared the official successor to DirectSound. It was built from the ground up, completely independent of DirectSound. Its origins are in the original XAudio API, which was part of XNA, Microsoft’s managed gaming framework. And while XAudio was considered an Xbox API, XAudio2 was targeted at multiple platforms, including the desktop. DirectSound was given “deprecated” status (this is the sadder news I mentioned earlier).

XAudio2 offers a number of features missing from DirectSound, including support for compressed formats like xWMA and ADPCM, as well as built-in, sophisticated DSP effects. It’s also considered a “cross-platform” API, which really just means it’s supported on the Xbox 360, Windows, and Windows Phone.

It’s worth mentioning that while XAudio2 is considered a low-level API, it’s still built on other technology. For the desktop, XAudio2 sits on top of Core Audio like everything else.

You might read all of this business about XAudio2 and assume that DirectSound is dead. We’re quite a way off from that, I think. There’s still a lot of DirectSound based software out there. Given Microsoft’s commitment to backwards compatibility, some level of DirectSound support/emulation is liable to exist in perpetuity. However, unless you’re determined to support versions of Windows that even Microsoft has stopped supporting, there’s no compelling reason to support DirectSound in modern audio applications.

Honorable Mention – ASIO: There are plenty of third party audio APIs available for Windows that weren’t invented by Microsoft. Some of them, like GSIF used by TASCAM’s (formerly Nemesys) GigaStudio, are tied to specific hardware. Some of them, like PortAudio and JUCE (more than just an audio API), are open-source wrappers around platform specific APIs. Some of them, like OpenAL, are just specifications that have yet to gain widespread adoption. But none has had quite the impact on the audio industry that ASIO has.

Steinberg, the same forward-thinking company that gave us VSTs and Cubase, introduced us to ASIO all the way back in 1997. ASIO was originally a pro-audio grade driver specification for Windows. Its popularity, however, has allowed it to gain some level of support on Linux and OSX platforms. Its primary goal was, and still is, to give applications a high quality, low latency data path direct from application to the sound hardware.

Of course, the power of ASIO relies on hardware manufacturers providing ASIO drivers with their hardware. For applications that can support ASIO, all of the business of dealing with the Windows audio stack can be completely avoided. Conceptually, ASIO provides applications with direct, unfettered access to the audio hardware. Before Windows Vista, this could allow for some potentially significant performance gains. In the Core Audio world, this is less of a selling point.

The real-world performance of ASIO really depends on the quality of driver provided by the manufacturer. Sometimes an ASIO driver might outperform its WDM counterpart. Sometimes it’s the other way around. For that reason, many pro-audio applications have traditionally allowed the user to select their audio driver of choice. This, of course, makes life complicated for end-users because they have to experiment a bit to learn what works best for them. But such is life.

The waters get muddied even further with the so-called “universal” ASIO drivers, like ASIO4ALL and ASIO2KS. These types of drivers are targeted at low cost, consumer-oriented hardware that lack ASIO support out-of-the-box. By installing a universal ASIO driver, ASIO-aware applications can leverage this hardware. In practice, this type of driver merely wraps WDM/KS or WASAPI and only works as well as the underlying driver it’s built on. It’s a nice idea, but it’s really contrary to the spirit of the ASIO driver. Universal drivers are handy, though, if the audio application you’re trying to use only supports ASIO and you’ve got a cheap sound card lacking ASIO support.

ASIO, like MME, is an old protocol. But it’s very much still alive and evolving. Most pro-audio professionals hold it in high regard and still consider it the driver of choice when interfacing with audio hardware.

Conclusion: “Shane, where’s the code?” I know, I know. How do you talk about APIs without looking at code? I intentionally avoided it here in the interest of saving space. And, yet, this article still somehow ended up being long winded. In any case, I encourage you to go out on the Interwebs and look at as much Windows audio source code as you can find. Browse the JUCE and Audacity source repos, look at PortAudio, and peruse the sample code that Microsoft makes available on MSDN. It pays to see what everybody else is doing.

For new developers, the choice of audio API may or may not be clear. It’s tempting to make the following generalization: games should go with XAudio2, pro-audio should go with ASIO and/or Core Audio, and everybody else should probably go with MME. Truth is, there are no rules. The needs of every application are different. Each developer should weigh their options against effort, time, and money. And as we see more often than not, sometimes the solution isn’t a single solution at all.

(Shameless Plug: If you’re interested in learning how to use Core Audio, consider purchasing an early draft of “Appendix B: Introduction to Windows Core Audio” from the book I’m currently working on entitled, “Practical Digital Audio for C++ Programmers.”)