missingbytes: The Cocktail Party Effect (Part 1 of 2)

Late at night, when the boys are (finally) asleep, we sometimes like to watch stuff on the television. If the TV is too loud, the boys wake up, and that really breaks immersion. If the TV is too quiet, we can't hear the gripping dialogue, and that breaks immersion too.

Here's how I fixed it in code.

Mono

Okay. Don't judge me, but the first thing I did is to convert the incoming stereo audio down to mono. Most music tends to be well spatialised, while the voice track comes through the center channel. By converting to mono I estimate a ~3dB drop in perceived volume during musical interludes, or, conversely, I can raise the total volume by 3dB without waking up the boys.

Bass

The next thing to do is to cut the low frequencies. The rumbles and the explosions. The ones which reverberate throughout the house and wake up the kids and the neighbors too.

It's relatively easy to find info on constructing a digital high-pass filter on the internet, provided you can cut through the jargon. Generally there tends to be an incoming signal, denoted as x_n. It's the series of samples coming from your video decoder. And then there's an outgoing signal, y_n. It's the series of samples that you send to the sound card.

Then there's some function that links them together. Here's an example of a really simple high-pass filter you can find on the internet :

y_n= 0.2 . x_n + 0.8 . y_n-1

It's a simple matter to turn this into code:

A simple high-pass filter, in C++.

Z-transform

To work out the frequency response, we can use the Z-transform to find the Transfer Function, which maps our filter from the discrete, time domain, into the continuous, frequency domain. In our case:

H(z) = Y(z)/X(z) = 0.2 / (1 + 0.8 . z^-1)

The cool thing is we can use theory to treat z like a complex number, even though in practice, all the x_n and y_n will be floating point numbers. For example, to compute the frequency response at 1000Hz, with a sample rate of 48kHz, we take:

i = √-1

FrequencyResponse(1000Hz) = | H(2π1000/48000 i) |

= | 0.2/(1 + 0.8 / (2π1000i/48000) ) |

= 0.008

If we draw this on a graph, with frequency (Hz) on the X-axis, and magnitude (dB) on the Y-axis, we can get a "Bode Plot" of our filter:

A Bode Plot for a simple filter.

So that's a great start, but now we want more control over which frequencies pass through the filter, and which are blocked, and by how much.

Optimize

Back in the old days, we would have had to endlessly try different filter combinations and painstakingly compute Bode plots to find combinations of filters that might fit our requirements.

But now that we all have super-computers under our desks, we've got much more powerful tools to solve an old problem in a new way.

In pseudo-code, it looks like this :

def GetDesiredFrequencyResponse(frequency):

    // Your desired EQ function goes here

def EvaluateFilter(filter):

    errorSum = 0

    for frequency in 20 .. 22000:

        freq1 = GetDesiredFrequencyResponse(frequency)

        freq2 = filter.CalcFrequencyResponse(frequency)

        errorTerm = freq1 - freq2

        errorSum += errorTerm * errorTerm
    return errorSum

bestFilter = Minimize(FilterFactory, EvaluateFilter)

That's right, lets just use an offline optimizer to compute the optimal co-efficients for us! You could use numpy, or octave for this.

... And Generate the Code

'f32' is a 32-bit float

Now we have our filter coefficients, we need to inject it back into C++.

Here's a function which writes out the C++ directly. We can just copy and paste directly into ProcessAudio(), hit recompile, and hear the results immediately.

Notice the coding style - it's rife with buffer overflows, and makes huge unjustified assumptions about the inputs. You certainly couldn't use this code in a production environment, but as the scaffolding to bootstrap a single-use filter, the iteration speed trumps all other considerations.

Sample

Here's a sample output :

    static float y1 = 0.0f, y2 = 0.0f;
    float sum = 0.211989*x[0]
        -6.794531*x[-1]
        -29.803024*x[-2]
        +0.535980*x[-3]
        +0.215555*x[-4];
    float y0 = (sum -9.433180*y1 +31.500277*y2) / 0.000017;
    y2 = y1;
    y1 = y0;

Coming Soon

So this is part 1 of 2, in the next post I'll go into the details of the actual filters I'm using on my television, as well as some sample video so you can hear before and after. I'll try and include tips for how to ensure convergence, and some gotchas that the Bode plot won't tell you about. And of course, if there's anything you'd like to know more, why not suggest it in the comments below?

4 comments:

Renee27 May 2012 at 05:58
Or maybe we could just turn the volume down a bit and switch the subtitles on :P
Oma Ali27 May 2012 at 21:14
This looks amazing. I am going to try it right away.
Anonymous25 October 2012 at 16:36
Hey Chris, your "really simple high-pass filter" is actually a "really simple low-pass filter"

Steve West
missingbytes25 October 2012 at 16:53
Whoops! Yes, you're right! I should fix that!

Saturday, 26 May 2012

The Cocktail Party Effect (Part 1 of 2)