This is a “late” follow-up to the TechNet Wiki Summit presentation on implementing media stream sources for background audio in Windows Phone Runtime. If you don’t like video presentations, you can just read this article instead; it contains pretty much everything that was covered in the presentation, and more.
Now, what’s the deal with Windows Phone Runtime? Why isn’t my previous article of any use anymore? Because the runtime changed. Background audio tasks are now Windows Runtime components, and that brings a whole new set of features and problems you have to take care of.
My last article about media stream sources (which also happens to be my first) presented the solution in C#. This time around we will use C++. Why? Read on.

Memory Management

      Windows Phone Runtime is basically a native API based on COM. Its underlying implementation is C++, so it is best consumed from other unmanaged languages, such as, well, C++. It is not standard C++, though: the C++/CX extensions had to be created in order to consume winRT from C++. Now, a winRT component describes its API in a metadata file (.winmd) which has the same format as .NET assemblies. This allows the Common Language Runtime (CLR) to consume the unmanaged winRT as if it were managed.

      And here is the catch: while developing, you can easily get carried away into thinking that winRT types (everything in the Windows namespace) are managed when used from C#, but they are not. Every time you call a winRT API, the CLR does all the magic behind the scenes to properly marshal things across the fence (also known as the Application Binary Interface, or ABI) between the managed world (the managed heap) and the unmanaged world (the native heap). The ABI dictates how things go from one side to the other. Most types are marshaled as-is, or through some conversion or abstraction done automatically; others are marshaled through a copy (strings, and everything related to strings, such as URIs). How the CLR and winRT agree with one another in all those cases is not the subject of this article.

      When playing sound (or video, for that matter) you will have to pass around buffers containing data. winRT provides a useful interface called “IBuffer” which allows both winRT types and CLR types to expose themselves as buffers. The easiest way to get a buffer in C# is to use the “AsBuffer” extension method on a byte[] array, and that is pretty much the only practical way to do it in C#: you can’t fill winRT’s own Buffer class directly with managed data without a copy. And this leads to our bigger problem.
winRT types use a reference-counting scheme for memory management. Suppose object A references object B: B’s reference count is now 1. If object C references B as well, B’s reference count becomes 2. If objects A and C are somehow destroyed, B’s reference count drops to 0, at which point B is destroyed too. This is all done automatically by the COM infrastructure, so developers don’t have to do anything; winRT essentially gets free, automatic memory management. But there is a catch: reference counting adds some overhead, and it leaks memory on circular references (think object A references B and B references A in return). The CLR also comes with a GC, but it works differently (it is a generational GC): it only triggers when needed and uses a different algorithm to determine which memory can be freed. The generational GC is slower to release memory, but allocating it is very cheap, so in situations where memory pressure is not an issue, C# can actually be faster than C++ at memory allocation.

      Now, when you call a winRT object from C#, the CLR holds a reference to that winRT object, which means its reference count is at least 1. Until the CLR’s GC collects that reference, the winRT object cannot self-destruct. winRT memory management is deterministic: you know that the moment nobody references an object, it is destroyed. The CLR’s is not: collection only happens “when needed”. Mixing C# code with winRT therefore nullifies winRT’s ability to manage its own memory deterministically.
Since the MediaStreamSource API is now a winRT type, it needs winRT buffers to do its magic. If you produce your buffers from C#, a byte[] array gets copied into a winRT type implementing the IBuffer interface, creating a mutant object: half in the managed heap, half in the native heap. What is the problem with that? The winRT side will not release the memory until the GC collects the managed reference, which happens non-deterministically.
If you are dealing with low to medium quality audio (up to 16-bit, 2 channels, up to 96 kHz sampling) this won’t cause any issues. If you go beyond that, memory starts piling up faster than it is released and the pipeline will eventually crash.

If you want no problems with background audio rendering, you need to follow the C++ path of this tutorial.

Implementing Media Stream Sources in Windows Runtime

      To implement a media stream source, one has to go through several stages. First, you need a decoder. This can be a tricky job, especially if you want to support formats that don’t have decoders built into the SDK (such as Ogg Vorbis files, which we will be dealing with today). Luckily, Visual Studio 2013 supports quite a large part of the C++ standard, so recompiling C++ decoders and creating winRT wrappers for them is quite easy.
This guy, Alveochin91, has done a nice job of creating wrappers for several decoders that can be used with Windows Runtime. You can check out his git repository here for examples of how to properly create wrappers. We will use the Ogg Vorbis decoder.

Once you have your decoder, it is time to initialize the media stream source.

      To properly initialize the media stream source, you will need the sampling rate of your stream, the bit depth and the channel count. The latter is usually 2. Windows Phone supports up to 24-bit sound and high sample rates; remember, the bigger the numbers, the bigger the memory consumption. You will also need to attach event handlers to the media stream source’s Starting, Closed and SampleRequested events. You have to set the buffering time to zero to avoid some unpleasant behavior when resuming from pause and to minimize memory consumption.

    souceFile = ref new OggVorbisFile();
    auto x = targetFile->OpenAsync(FileAccessMode::Read);
    stream = create_task(x).get();   // fine off the UI thread; the decoder reads from this stream
    auto info = souceFile->Info(0);
    decodedStreamDuration = souceFile->TimeTotal(-1);
    blockAlign = info->Channels * (16 / 8);             // bytes per frame at 16-bit
    avgBytesPerSec = blockAlign * info->Rate;
    AudioEncodingProperties^ pcmprops = AudioEncodingProperties::CreatePcm((unsigned int)info->Rate, (unsigned int)info->Channels, 16);
    mss = ref new MediaStreamSource(ref new AudioStreamDescriptor(pcmprops));
    mss->CanSeek = true;
    TimeSpan bufferTime;
    bufferTime.Duration = 0;                            // zero buffering, as explained above
    TimeSpan duration;
    duration.Duration = (long long)(decodedStreamDuration * 10000000L);  // seconds -> 100 ns ticks
    mss->BufferTime = bufferTime;
    mss->Duration = duration;
    mss->Closed += ref new Windows::Foundation::TypedEventHandler<Windows::Media::Core::MediaStreamSource ^, Windows::Media::Core::MediaStreamSourceClosedEventArgs ^>(this, &OggMediaStreamSourcery::OnClosed);
    mss->Starting += ref new Windows::Foundation::TypedEventHandler<Windows::Media::Core::MediaStreamSource ^, Windows::Media::Core::MediaStreamSourceStartingEventArgs ^>(this, &OggMediaStreamSourcery::OnStarting);
    mss->SampleRequested += ref new Windows::Foundation::TypedEventHandler<Windows::Media::Core::MediaStreamSource ^, Windows::Media::Core::MediaStreamSourceSampleRequestedEventArgs ^>(this, &OggMediaStreamSourcery::mss_SampleRequested);
    mss->Paused += ref new Windows::Foundation::TypedEventHandler<Windows::Media::Core::MediaStreamSource ^, Object ^>(this, &OggMediaStreamSourcery::MssPause);

      In the media stream source Starting event, all you have to do is set the actual start position on the request argument. You get here either when the user seeks through your media, when resuming from pause, or when the media starts playing. You will need to keep an internal timeline so you can seek and deliver samples at the right moment. (Note: despite its name, secondsPosition in the snippets below holds 100 ns ticks, the unit TimeSpan works in.)

    auto deferral = args->Request->GetDeferral();
    try
    {
        if (args->Request->StartPosition == nullptr)
        {
            // resuming from pause: report the current internal timeline position
            TimeSpan current;
            current.Duration = secondsPosition;
            args->Request->SetActualStartPosition(current);
        }
        else if (args->Request->StartPosition->Value.Duration > 0)
        {
            // seeking: convert the requested 100 ns tick position into a byte offset
            auto byteOffset = avgBytesPerSec * (args->Request->StartPosition->Value.Duration / 10000000L);
            // ... seek the decoder to byteOffset (decoder-specific, elided) ...
            secondsPosition = args->Request->StartPosition->Value.Duration;
            args->Request->SetActualStartPosition(args->Request->StartPosition->Value);
        }
    }
    catch (Exception^ e)
    {
        mss->NotifyError(MediaStreamSourceErrorStatus::Other);
    }
    deferral->Complete();

      In the Closed event, all you have to do is clean up; nothing else.

    delete souceFile;
    souceFile = nullptr;

      In the SampleRequested event, you have to feed the devious buffers mentioned earlier; the decoder should handle the buffer acquisition. There are several things you have to nail properly here. First, you have to set the duration of the sample. You also have to set the sample’s rendering position correctly. And don’t forget to handle exceptions gracefully, otherwise you could get sued for damaging people’s ears.

    auto deferral = args->Request->GetDeferral();
    try
    {
        if (souceFile->IsValid)
        {
            auto x = souceFile->Read(4096);
            if (x->Length > 0)
            {
                TimeSpan SamplePosition;
                SamplePosition.Duration = (long long)secondsPosition;
                MediaStreamSample^ sample = MediaStreamSample::CreateFromBuffer(x, SamplePosition);
                TimeSpan SampleDuration;
                SampleDuration.Duration = GetDurationFromBufferSize(x->Length) * 10000000L;
                sample->Duration = SampleDuration;
                secondsPosition += sample->Duration.Duration;  // advance the internal timeline
                args->Request->Sample = sample;
            }
            else
            {
                secondsPosition = 0;  // end of stream: no sample is set
            }
        }
    }
    catch (Exception^ e) { mss->NotifyError(MediaStreamSourceErrorStatus::DecodeError); }
    deferral->Complete();

You may have noticed that the Closed event and the SampleRequested event are protected by a mutex and, in addition, a synchronization flag is used in the sample request. This makes sure you will not fall victim to a thread race between the two handlers. Each event handler is marshaled onto a different thread, so you can get a Closed event while you are fetching your sample. Since you can’t read from a closed stream, you can see where this is going. A mutex alone is not enough to protect you from the race: the stream can still be closed right before you process your sample, so you have to check for that after acquiring the lock.