Mixing For Variation Notes

Navy Boards

Developing Narrative
Experiences for
Amazon Alexa

Mar 30, 2020

So, the title for this week was a bit of a furphy: the work was really a combination of recording and editing for variation. I tried three different experiments to give a sense of variation and life to the point and click adventure, plus one idea that was recorded but not implemented. I’m going to talk about that one too, because it’s a neat idea I’ll be using at some point.

The full script for this point and click adventure is available here, and the code (including audio assets) is available on Bitbucket here.

Experiment 1 - Variation in the Opening Line

As a basic mockup, the skill starts with the line “Alright, I’m in the room, what do you want me to do?” each time it’s opened. I improvised six different variations for this line, and recorded them in full. Each time the skill is opened, it picks one of the six available options.

In some basic testing, it felt like Python’s pseudorandom random.choice wasn’t great at producing outcomes that sounded truly random, so I’m going to look at better random seeding down the track.
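One thing worth noting is that a uniform pick from six options repeats the previous line about one time in six, which can feel less random than it is. A minimal sketch of one way around that, assuming the last line played is remembered somewhere (for example in the skill's persisted attributes), is to exclude it from the next pick. The file names and the `pick_opening` helper are hypothetical, not taken from the skill's code.

```python
import random

# Hypothetical file names for the six recorded opening lines.
OPENING_LINES = [f"opening_{i}.mp3" for i in range(1, 7)]


def pick_opening(last_played=None, rng=random):
    """Pick an opening line at random, never repeating the previous one."""
    candidates = [line for line in OPENING_LINES if line != last_played]
    return rng.choice(candidates)
```

The caller would store the returned name and pass it back in as `last_played` the next time the skill opens.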

This was the simplest experiment to implement, but would be the most costly in terms of a voice acting budget for a full experience - each variation is a full line in the script essentially.

Experiment 2 - Combining With SSML

When the player discovers a key hidden within a jacket when playing the skill, the following line is delivered as a response:

“Hey - there’s a key in here. A small one.”

The line is broken into two parts: signalling the importance of the discovery (“Hey!”) and the information for the player (“There’s a key in here. A small one.”) Taking those two parts individually, nine variations were recorded for the moment of discovery ranging from single words to non-verbals like ‘hmm’. Five variations were recorded for the information part of the line, giving a total of 45 possible combinations.

Taking advantage of Alexa’s ability to play multiple audio files as part of a single response, SSML was used here to splice the two files together, which worked perfectly. The line was a good fit due to the natural pause between signalling the moment of discovery and delivering the rest of the information. While a line like this would likely only be delivered once to a player as a moment of genuine discovery in-game, it made an interesting test case.
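A minimal sketch of how the two parts might be spliced with SSML, assuming each recording is hosted as its own MP3. The base URL, file names and the `build_key_found_ssml` helper are placeholders for illustration, not the skill's actual assets.

```python
import random
from xml.sax.saxutils import quoteattr

# Hypothetical hosting location and file names.
BASE_URL = "https://example.com/audio"
DISCOVERY_CLIPS = [f"{BASE_URL}/discovery_{i}.mp3" for i in range(1, 10)]  # nine takes
INFO_CLIPS = [f"{BASE_URL}/key_info_{i}.mp3" for i in range(1, 6)]  # five takes


def build_key_found_ssml(rng=random):
    """Return SSML that plays one discovery clip followed by one info clip."""
    discovery = rng.choice(DISCOVERY_CLIPS)
    info = rng.choice(INFO_CLIPS)
    # quoteattr adds the surrounding quotes and escapes the URL for XML.
    return (
        "<speak>"
        f"<audio src={quoteattr(discovery)}/>"
        f"<audio src={quoteattr(info)}/>"
        "</speak>"
    )
```

Because the two parts are chosen independently, nine by five recordings give the 45 combinations without any extra audio being rendered.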

Experiment 3 - Looking Around The Room

This was the trickiest and most informative experiment. The player can ask the skill what can be seen in the room being investigated in the point and click experience. There are four objects that are always visible:

Window
Computer
Closet
Filing Cabinet

And three objects that may be visible:

USB key
Jacket
Key

The skill lists the objects that can be seen in a straightforward way, using their standard short descriptions. This was interesting for a couple of reasons:

Shopping Lists

This is what’s sometimes called a ‘shopping list’ in voice over copy - a list of things given one after the other with some differentiation in delivery. We need to tell the listener that more objects are coming if we’re not at the end of the list. Each item in the list before the last needs to ‘float’, finishing in a slight rising inflection to cue a listener that more items are coming, until the last entry ‘lands’ the list with a falling tone.

Implementing this would have been a little easier had I ensured that the first and last items in the standard order were always visible, which would have reduced code complexity. Through sheer luck, though, I did dodge the worst-case scenario of having to record rising and falling variations for the same item.

Standard List Order

Because this dialogue is conveying complex information (a list of more than three items), the order of the list was kept constant, with the exception of whether or not the ‘hidden’ objects could be seen. Through sheer coincidence, the full list of objects hit the magic number that (according to some) the average person can remember, which is seven.

Because the total number of possible audio segments was at least eight (seven objects plus the lead-in dialogue to start the line), this was implemented using the Python library pydub, using the following steps:

four variations were recorded for the start of the line
the description for each object was recorded only once for consistency

The list of objects was then built using pydub's incredibly simple method of concatenating MP3 files (code for the Python script is here) to cater for all possible combinations of objects being hidden or visible.
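As a rough sketch of that build step (not the script linked above), the three optional objects give 2 x 2 x 2 = 8 possible lists, and each one can be rendered ahead of time by adding pydub segments together. The standard order, the directory layout and the `room_list_010.mp3` style of file name are assumptions made for illustration.

```python
import os
from itertools import product

# Assumed standard order: the always-visible objects, then the optional ones.
ALWAYS_VISIBLE = ["window", "computer", "closet", "filing_cabinet"]
OPTIONAL = ["usb_key", "jacket", "key"]


def all_combinations():
    """Yield (output file name, visible objects in order) for all 8 cases."""
    for flags in product([False, True], repeat=len(OPTIONAL)):
        visible = ALWAYS_VISIBLE + [n for n, on in zip(OPTIONAL, flags) if on]
        # One digit per optional object, e.g. "010" means only the jacket shows.
        suffix = "".join("1" if on else "0" for on in flags)
        yield f"room_list_{suffix}.mp3", visible


def render_all(clip_dir="clips", out_dir="out"):
    """Concatenate the per-object clips for every combination (needs ffmpeg)."""
    from pydub import AudioSegment  # imported here so the listing runs without it

    for out_name, visible in all_combinations():
        combined = AudioSegment.empty()
        for name in visible:
            # Adding two AudioSegments joins them end to end.
            combined += AudioSegment.from_mp3(os.path.join(clip_dir, f"{name}.mp3"))
        combined.export(os.path.join(out_dir, out_name), format="mp3")
```

Calling `render_all()` once offline would produce the eight list files, so nothing has to be concatenated while the skill is responding.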

At runtime, the Lambda function chooses one of the four starting variations, then the appropriate object list depending on what objects are visible. The result glued together well, and this is exciting for the future - I’d been hoping pydub would do heavy lifting here, and it does.
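The runtime choice could look something like the sketch below, assuming the object lists were rendered ahead of time and named with one digit per optional object. The file names and helper functions are hypothetical.

```python
import random

# Assumed order of the optional objects within the pre-rendered file names.
OPTIONAL = ["usb_key", "jacket", "key"]
# Hypothetical names for the four recorded lead-ins.
LEAD_INS = [f"look_start_{i}.mp3" for i in range(1, 5)]


def list_clip_for(visible_optional):
    """Map the set of currently visible optional objects to a list file."""
    suffix = "".join("1" if name in visible_optional else "0" for name in OPTIONAL)
    return f"room_list_{suffix}.mp3"


def look_around_clips(visible_optional, rng=random):
    """Return the two clips to play: a random lead-in, then the object list."""
    return [rng.choice(LEAD_INS), list_clip_for(visible_optional)]
```

For example, `look_around_clips({"jacket"})` returns one of the four lead-ins followed by `room_list_010.mp3`.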

Other Discoveries

A few quick bullet points:

Look for the ‘thinking points’, or shifts in gears in dialogue, as good points to glue together variations in delivery
Non-verbals in dialogue stick out the most when a line is repeated. We might repeat the same words, but we rarely, if ever, repeat the same involuntary thinking noise when talking. Make sure non-verbals are randomised or varied.

Moving Beyond

For a fully fledged point and click experience, or for Alexa skills in general, one idea I didn’t get a chance to explore in more detail is signifying repeated information: if a player repeats the same request, having the fictional entity they’re communicating with flag the repetition. For example:

“It’s an… 8 terabyte Seegson thumb drive. Looks well used.”

The first line simulates the fictional character scrutinising the object, as well as flagging that the object is important because of its constant use. The fictional brand (Seegson) is included for flavour. Then, if asked to describe it again:

“It’s a well-used thumb drive”

Shortened to the specifics. Then:

“Like I said, it’s a well-used thumb drive”

A few words flagging the repetition, then the same words verbatim for clarity. This sort of dialogue would need careful writing and delivery, otherwise it could come across as negative, particularly if the player was using the same command to hear the response repeated for clarity, rather than using Alexa’s standard ‘repeat’ intent.

“A worn thumb drive”

For the final variation, a short description that would be the consistently repeated line.

I feel there’s real potential in this device, and I want to experiment more with a queue of tracked responses given to the player (keeping a record of the last 5, or 10, or n responses given) in future prototypes.
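A minimal sketch of what that queue might look like, assuming a fixed-length history of the objects most recently described. The `ResponseHistory` class is hypothetical, and in a real skill the history would need to be saved between requests (for example as a plain list in session attributes).

```python
from collections import deque

# The four variations from the example above, from fullest to shortest.
USB_DESCRIPTIONS = [
    "It's an... 8 terabyte Seegson thumb drive. Looks well used.",
    "It's a well-used thumb drive",
    "Like I said, it's a well-used thumb drive",
    "A worn thumb drive",
]


class ResponseHistory:
    """Remember the last `size` objects described to the player."""

    def __init__(self, size=5):
        self._recent = deque(maxlen=size)  # oldest entries drop off automatically

    def describe(self, object_id, variations):
        # Count how often this object already appears in the recent history.
        repeats = sum(1 for seen in self._recent if seen == object_id)
        self._recent.append(object_id)
        # Step through the variations, then stay on the last (shortest) one.
        return variations[min(repeats, len(variations) - 1)]
```

One consequence of a bounded queue is that once other requests push an object out of the history, the full description is given again, which may or may not be the behaviour wanted.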

More to come. Was this helpful at all? Let me know in the comments.
