Edd Mann Developer

Creating a 'Winning' Audio Lambda Service using Serverless, Polly and compiled SOX

Following on from my previous post, which discussed manipulating images, I would now like to look at how you can work with audio using Lambda. To highlight this use-case we will create a simple service which, given a name and an optional voice (provided by Polly), synthesises the name and includes it in a returned ‘And the winner is…’ applause MP3 file. This will demonstrate how to integrate Polly within Lambda, compile and execute native code within Lambda, and return a binary MP3 file to the client.

Compiling SOX for Lambda

As we wish to join our static intro and outro audio files with the dynamically produced Polly response, we need an application that can do this for us. I have decided to use SOX for this task, as it provides a very simple interface for joining multiple files together into a single track. Lambda allows us to execute natively compiled code, provided that it has been compiled for the underlying host operating system. To compile SOX correctly for Lambda, we will use a Docker image which locally replicates the environment as best it can and provides all the necessary build tooling. First, we need to start up a container (of this image) with bash running, so we can compile SOX and its required dependencies.

$ docker run -it lambci/lambda:build bash

This will pull down the required image from Docker Hub and begin a bash interpreter session. Within this session, we will start by compiling libmad (the MPEG Audio Decoder library), which is a dependency of SOX.

$ curl -L -o libmad-0.15.1b.tar.gz "http://downloads.sourceforge.net/project/mad/libmad/0.15.1b/libmad-0.15.1b.tar.gz"
$ tar zxf libmad-0.15.1b.tar.gz && cd libmad-0.15.1b
$ sed -i '/-fforce-mem/d' configure # https://stackoverflow.com/questions/14015747/gccs-fforce-mem-option
$ ./configure --prefix=/usr/libmad-0.15.1b --disable-shared --enable-static
$ make && make install

Next we will compile LAME, which will allow us to encode the desired MP3 audio file within SOX.

$ curl -L -o lame-3.100.tar.gz "https://downloads.sourceforge.net/project/lame/lame/3.100/lame-3.100.tar.gz"
$ tar zxf lame-3.100.tar.gz && cd lame-3.100
$ ./configure --prefix=/usr/lame-3.100 --disable-shared --enable-static
$ make && make install

Finally, we are able to compile SOX, providing locations to the previously compiled libraries.

$ curl -L -o sox-14.4.2.tar.bz2 "http://downloads.sourceforge.net/project/sox/sox/14.4.2/sox-14.4.2.tar.bz2"
$ tar jxf sox-14.4.2.tar.bz2 && cd sox-14.4.2
$ CPPFLAGS="-I/usr/libmad-0.15.1b/include -I/usr/lame-3.100/include" \
  LDFLAGS="-L/usr/libmad-0.15.1b/lib -L/usr/lame-3.100/lib" \
  ./configure --prefix=/usr/sox-14.4.2 --disable-shared --enable-static
$ make && make install

You will notice that we have statically compiled all these applications as we desire to only depend on a single executable within the Lambda service. With SOX now compiled we can open up a host terminal session and copy the newly compiled sox executable from the container.

$ docker ps # displays the running container's ID
$ docker cp {CONTAINER-ID}:/usr/sox-14.4.2/bin/sox ~/sox
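A copied binary is only useful to Lambda if it still carries its execute permission once it has been packaged. The following is a minimal sketch of a local sanity check, assuming Node is available on the host; `isExecutable` is a hypothetical helper and not part of the original project.

```javascript
'use strict';

const fs = require('fs');

// Hypothetical helper: reports whether `path` exists and the current user
// is allowed to execute it.
const isExecutable = path => {
  try {
    fs.accessSync(path, fs.constants.X_OK);
    return true;
  } catch (err) {
    return false;
  }
};

// Example: check the copied binary once it sits in the project directory.
// The './sox' location is an assumption about where you placed it.
console.log(isExecutable('./sox') ? 'sox is executable' : 'sox is missing or not executable');
```

If the check fails for a file that does exist, running `chmod +x` on it before deploying is the usual remedy.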

Creating the Serverless Project

Now with the native executable compiled, we can go about creating the accompanying Lambda service. In a similar manner to the previous blog post, we will first create a skeleton Serverless project template.

$ serverless create --template aws-nodejs --path and-the-winner-is

Running this will create the basic handler and Serverless definition file. Replace the given Serverless definition file with the following.

service: and-the-winner-is

provider:
  name: aws
  runtime: nodejs6.10
  stage: prod
  region: eu-west-1
  environment:
    SOX_EXEC: ./sox
    INTRO_FILE: ./intro.mp3
    OUTRO_FILE: ./outro.mp3
  iamRoleStatements:
    - Effect: Allow
      Action:
        - polly:DescribeVoices
        - polly:SynthesizeSpeech
      Resource: '*'

plugins:
  - serverless-apigw-binary

custom:
  apigwBinary:
    types:
      - '*/*'

functions:
  winner:
    handler: handler.winner
    events:
      - http:
          path: /
          method: get

This configuration defines a single Lambda function which is exposed via a root API Gateway path. It also provides a couple of environment variables which specify the SOX executable location, along with the desired intro and outro audio files. We then use a Serverless plugin to add the desired binary support to the API Gateway. As we wish to use Polly within Lambda, we permit access to both the DescribeVoices and SynthesizeSpeech actions. Before continuing we should include the Serverless plugin we have defined as a development dependency.

$ npm install serverless-apigw-binary --save-dev

Synthesising the Name

With this definition in place we will move on to generating (synthesising) the name given to us by the client using Polly. If the client does not supply a desired voice, we will randomly choose one from the list of available options. After creating a new file called synthesise-name.js, copy the following functions into the file.

'use strict';

const AWS = require('aws-sdk');

const random = arr => arr[Math.floor(Math.random() * arr.length)];

const polly = new AWS.Polly();

const getRandomVoice = () =>
  new Promise((res, rej) => {
    polly.describeVoices({}, function (err, data) {
      // `data` is null on failure, so only read Voices after checking `err`
      if (err) rej(err);
      else res(random(data.Voices).Id);
    });
  });

const synthesiseSpeech = (text, voice) =>
  new Promise((res, rej) => {
    const params = {
      OutputFormat: 'mp3',
      SampleRate: '22050',
      Text: text,
      TextType: 'text',
      VoiceId: voice,
    };

    polly.synthesizeSpeech(params, function (err, speech) {
      if (err) rej(err);
      else res(speech.AudioStream);
    });
  });

module.exports = (name, voice = undefined) =>
  Promise.resolve(voice || getRandomVoice()).then(voice => synthesiseSpeech(name, voice));

We have a couple of helper functions, one of which returns a randomly selected Polly voice (if no voice is supplied) and another to go about generating the audio representation of the supplied name. Combining these two helpers together returns to us an audio buffer stream which we can later use within our response.
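The module above passes a client-supplied voice straight to Polly, so a misspelt voice only fails once the synthesis call is made. One possible refinement, shown here as a sketch rather than as part of the original service, is to check the requested voice against the list returned by describeVoices and fall back to a random one. `chooseVoice` is a hypothetical helper, and the voice list in the example is illustrative sample data.

```javascript
'use strict';

// Picks a random element, mirroring the `random` helper in synthesise-name.js.
const random = arr => arr[Math.floor(Math.random() * arr.length)];

// Hypothetical helper: `voices` is the `Voices` array returned by describeVoices.
// Returns the requested voice Id when it is on offer, otherwise a random Id.
const chooseVoice = (voices, requested) => {
  const ids = voices.map(v => v.Id);
  if (ids.length === 0) throw new Error('No Polly voices available');
  return ids.includes(requested) ? requested : random(ids);
};

// Illustrative sample data; a real list comes from polly.describeVoices.
const sampleVoices = [{ Id: 'Joanna' }, { Id: 'Brian' }];
console.log(chooseVoice(sampleVoices, 'Brian')); // Brian
```

Because the helper is a pure function, it can be exercised without any AWS credentials.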

Joining audio files using SOX

Having synthesised the client's desired name, we now wish to join the multiple audio files together and generate the output track. SOX requires that all audio files have the same sample-rate and channel count to successfully produce a joined file. As Polly returns a mono-channel audio file with a sample-rate of 22050 Hz, the intro and outro files I have provided have been re-sampled to match. After creating a new file called generate-track.js, copy the following logic into the file.

'use strict';

const fs = require('fs');
const tempfile = require('tempfile');
const childProcess = require('child_process');

const { SOX_EXEC, INTRO_FILE, OUTRO_FILE } = process.env;

module.exports = nameAudio => {
  const nameTempFile = tempfile('.mp3');
  fs.writeFileSync(nameTempFile, nameAudio);

  const trackTempFile = tempfile('.mp3');
  childProcess.execFileSync(SOX_EXEC, [INTRO_FILE, nameTempFile, OUTRO_FILE, trackTempFile]);

  return fs.readFileSync(trackTempFile);
};

This function simply takes in the audio buffer stream returned from the Polly service and writes it into a temporary file. We use an external temporary file library to achieve this so we need to include it as a project dependency.

$ npm install tempfile --save

We then supply this file, along with the intro and outro audio files, to the SOX executable to generate the final joined output track. As this output is written into a temporary file, we then read its contents into a buffer which we can later use within our service.
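If you want to use your own intro or outro, it first has to be converted to the same sample-rate and channel count as the Polly output. The sketch below builds the SOX arguments for that conversion, relying on the `-r` (rate) and `-c` (channels) options applying to the output file when placed before it. `resampleArgs` and the file names are hypothetical, and the conversion assumes a SOX build with MP3 support.

```javascript
'use strict';

// Hypothetical helper: builds the SOX arguments that convert `input` into
// `output` with the given sample rate and channel count (defaults match the
// mono 22050 Hz audio requested from Polly).
const resampleArgs = (input, output, rate = 22050, channels = 1) =>
  [input, '-r', String(rate), '-c', String(channels), output];

console.log(resampleArgs('intro-original.mp3', 'intro.mp3').join(' '));
// intro-original.mp3 -r 22050 -c 1 intro.mp3

// To run the conversion with a locally installed sox:
// require('child_process').execFileSync('sox', resampleArgs('intro-original.mp3', 'intro.mp3'));
```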

Wiring it all together

With the two key problems solved, we can now wire the handler together. Replace the sample handler.js file contents with the following.

'use strict';

const synthesiseName = require('./synthesise-name');
const generateTrack = require('./generate-track');

module.exports.winner = (event, context, callback) => {
  const input = event.queryStringParameters || {};

  synthesiseName(input.name || 'All of us', input.voice)
    .then(generateTrack)
    .then(track => {
      callback(null, {
        statusCode: 200,
        headers: { 'Content-Type': 'audio/mpeg' },
        body: track.toString('base64'),
        isBase64Encoded: true,
      });
    })
    .catch(callback); // report failures to Lambda rather than leaving the request unanswered
};

This composes the two functions together, returning the resulting audio track to the client. API Gateway requires that we Base64-encode the binary response, so we do so within the callback.
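The Base64 step can be checked locally without deploying anything. The sketch below pulls the response construction into a hypothetical `audioResponse` helper (the original handler builds the object inline) and confirms that decoding the body gives back the original bytes. The four sample bytes are arbitrary illustrative data, not a valid MP3 file.

```javascript
'use strict';

// Hypothetical helper: wraps a Buffer in the same response shape the handler
// passes to its callback.
const audioResponse = track => ({
  statusCode: 200,
  headers: { 'Content-Type': 'audio/mpeg' },
  body: track.toString('base64'),
  isBase64Encoded: true,
});

// Round trip: decoding the Base64 body must reproduce the original bytes.
const sample = Buffer.from([0xff, 0xfb, 0x90, 0x00]);
const decoded = Buffer.from(audioResponse(sample).body, 'base64');
console.log(decoded.equals(sample)); // true
```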

We are all winners

With the implementation now complete, we can deploy the Lambda service by executing the following command.

$ serverless deploy -v

Finally, we can visit the returned endpoint URL and enjoy creating our own winning audio tracks! You can find the code in its entirety, along with supporting assets in this GitHub repository.
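The handler reads `name` and `voice` from the query string, so a request URL needs both values safely encoded. As a small sketch, the hypothetical `winnerUrl` helper below builds such a URL; the endpoint shown is a placeholder for whatever address your own deployment prints.

```javascript
'use strict';

// Hypothetical helper: appends the optional `name` and `voice` query
// parameters, percent-encoding each value.
const winnerUrl = (endpoint, name, voice) => {
  const params = [];
  if (name) params.push(`name=${encodeURIComponent(name)}`);
  if (voice) params.push(`voice=${encodeURIComponent(voice)}`);
  return params.length ? `${endpoint}?${params.join('&')}` : endpoint;
};

// Placeholder endpoint; substitute the URL returned by your deployment.
console.log(winnerUrl('https://example.invalid/prod/', 'Edd Mann', 'Brian'));
// https://example.invalid/prod/?name=Edd%20Mann&voice=Brian
```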