Game Design for Voice Devices Like Alexa – Part Two

by Jenn | on February 21, 2018

The Voice Originals team working on their first game, When in Rome

Welcome back to our blog about designing for Alexa and other voice devices. In part one we introduced our This We Believe principles, described the ways you can interact with voice devices and discussed physical components. In this part we will continue with the lessons learned, concentrating on voice AI and voice commands.

Lessons Learnt from Designing for Voice Devices

Using Voice AI

One of the main benefits of using a voice device is the computing logic that comes to bear behind the device. Although you can gain computing power by placing a digital screen in the middle of a tabletop gaming experience, this can make players focus on the screen rather than each other. Further, it can mean that people can’t arrange themselves as they prefer around a table, because they have to sit where they can see the device.

Family playing Jenga together
Fig 5. Screens can disintermediate face-to-face play

Voice devices can become part of a tabletop gaming experience in a way that digital screens cannot. Many people have a device already in their living room, so people can sit naturally around a table. Then the voice device simply becomes another person speaking to players from wherever it is located in the room.

If one player or team is doing much better than another, a voice device can subtly use video game rubber-banding techniques to make it a much more even playing field so that no player wins by an insurmountable margin. This helps players stay engaged with the game right up until it ends.
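As a rough illustration of rubber-banding, the sketch below boosts the points on offer to a team according to how far it trails the leader. The function name, the `strength` parameter and the `max_bonus` cap are assumptions made for this example, not a description of how When in Rome works.

```python
def rubber_band_points(base_points, team_score, leader_score,
                       strength=0.5, max_bonus=2.0):
    """Return the points a team can win this round, boosted if it trails.

    Hypothetical helper: the further behind the leader a team is, the
    larger the multiplier, capped at max_bonus so a lead still matters.
    """
    if leader_score <= 0:
        # Nobody has scored yet, so there is no gap to close.
        return base_points
    gap = max(0, leader_score - team_score)
    multiplier = 1.0 + strength * (gap / leader_score)
    return round(base_points * min(multiplier, max_bonus))
```

Keeping the boost small and capped is what makes the technique subtle: the trailing team gets a slightly better chance to catch up without the leaders feeling punished.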

Voice devices can optionally keep track of score and state for players. This means that players aren’t required to do complex mathematics, keep scraps of paper, remember complex outcomes, or spend a lot of time looking up outcomes in a lengthy book. Keeping track of score also means that people can’t cheat. However, if getting players to track score and state enhances the gameplay, then obviously designers can choose to include it.

Not only can voice devices keep track of score and state within a single session, they can keep track over multiple sessions and extended periods of time. This means that players can play the game for some amount of time, stop, clear the table for dinner and then set up again after dinner. The voice device will remember exactly where players left off and help them set back up again.
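A minimal sketch of carrying state across sessions might look like the following. It assumes the game state fits in a small dictionary and uses a local JSON file as a stand-in for whatever persistent storage the voice platform provides; the function names and the default state are hypothetical.

```python
import json
from pathlib import Path


def save_game(path, state):
    """Write the current game state (scores, turn, etc.) to disk as JSON."""
    Path(path).write_text(json.dumps(state))


def load_game(path):
    """Load a saved game, or return a fresh state if none exists yet."""
    saved = Path(path)
    if not saved.exists():
        # Assumed default state for a brand new game.
        return {"scores": {}, "turn": 0}
    return json.loads(saved.read_text())
```

When players come back after dinner, the skill would call something like `load_game` on launch and use the result to remind them whose turn it is and how the table should be set up.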

A voice device can keep track of complex and deep state trees in a way that would be impossible in a purely tabletop game. It means that the number of rules that players need to track is much smaller, since the voice device will manage the overall adherence to, and explanation of, the rules.

Further, by knowing what state the game is in currently, we can increase the likelihood that the voice device will understand a player’s input. For example, if we ask people a multiple choice question and the device hears “8”, it is much more likely that the player said “A” instead.
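The “8” versus “A” case can be sketched as a small lookup that is only applied when the game state says a multiple choice answer is expected. The sound-alike table and the function name below are assumptions for illustration; a real skill would tune the table from observed misrecognitions.

```python
# Hypothetical table of things a device might hear when a player
# actually said one of the letters A to D.
SOUND_ALIKES = {
    "a": "A", "8": "A", "eight": "A", "hey": "A",
    "b": "B", "be": "B", "bee": "B",
    "c": "C", "see": "C", "sea": "C",
    "d": "D", "dee": "D",
}


def interpret_choice(heard, expecting_choice):
    """Map what the device heard onto a choice letter, using game state.

    If the game is waiting for a multiple choice answer, return the
    matching letter, or None if nothing plausible was heard. Otherwise
    return the cleaned-up input unchanged.
    """
    token = heard.strip().lower()
    if expecting_choice:
        return SOUND_ALIKES.get(token)
    return token
```

Here `interpret_choice("8", True)` yields `"A"`, while the same input outside a multiple choice question is left as `"8"`.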

Although we didn’t use the feature much, voice devices also have access to real-time data from the internet. This means you could create a game relating to real-world weather patterns or update trivia questions based on updated facts and figures.

Voice AI Rules of Thumb:

  • Gain computing power without having to put a device physically between players
  • Use rubber-banding to keep players evenly matched
  • Keep track of score and state, so players don’t have to
  • Use state knowledge to improve device comprehension
  • Use real-time data to keep game up to date

Fig 6. Heuristics for Voice AI

Voice Commands – Basics

The main way that players change what is going on in the game should be by speaking commands to their voice devices. If this is not the case, then the game is becoming too reliant on the physical components and is not utilising everything that a voice device can bring.

On voice devices, when you are not within your specific skill, you can start the skill by requesting to “open” the skill, e.g. “Alexa, open When in Rome”. However, you can also perform a “one shot intent”, where you say the skill’s name and some more specific instructions, e.g. “Alexa, tell When in Rome to Fly me to Tokyo”. This allows the game to load up in the middle of an action, rather than right at the beginning, and means that players get back to playing the game more quickly.

Once you’re within the skill itself, there are two basic ways that you can speak to a voice device. The first is answering, that is, speaking after the device has asked you specifically for input. In this case, the device is actively listening to absolutely everything you say. This places a lot of pressure on players. At this point, they can’t discuss options amongst themselves because the voice device will interpret any audio as a final answer.

The second way is via interruption. That is, players talk over the top of audio to stop the device while it is in the middle of speaking or playing other audio. In this case, the device is only listening for its wake word, so you must say the wake word followed by the specific phrase you want to say.

Interruption is useful for speeding up gameplay, but it is not intuitive for some players. People who use voice devices frequently will understand interruption, so the device owner is likely to be familiar with it. However, guests may not be, so teaching people about saying wake words and interrupting speech is important at this early stage of platform deployment. Once players have become accustomed to interruption, they can get carried away: one possible issue to watch out for is people interrupting speech accidentally and then not being able to replay what they missed.

One use of interruption is to allow plays in the middle of a standard turn flow. In digital variants of tabletop card games, the digital device has to interrupt the flow a lot to ask if you have any special cards to play at specific moments in the game. With voice-enabled games, the player can interrupt and play in a way that matches the seamless tabletop experience.

Once players have worked out that they can use a wake word, they might use it all the time, even when not in the skill. For players there is no visual indication on the device of whether they are inside a game skill or at the top level of the device. Within the game, saying “Alexa, Fly me to Tokyo” means one thing, but if you say that when you’re not in the game, Alexa might attempt to book you flights to Tokyo or ask you when you’d like to leave. Fixing this is mostly about using audio clues to make it clear when you’re in the game and when you’re not, and also educating players that these two states are different.

Until recently, the Amazon Alexa did not recognise multiple voices on the same device. This meant that when designing the game we weren’t able to allow Alexa to recognise which team or individual responded to a question. We hope to explore this ability in the future.

Basic Voice Commands Rules of Thumb

  • Use one shot intents if players need to exit skill & get back to playing quickly
  • Ask players direct questions when you want to put pressure on them
  • Teach everyone how to interrupt audio
  • Interruptions can speed up gameplay
  • Interruptions allow for special cards to be played “out of turn”
  • Watch for: accidental interruptions where content can’t be replayed
  • Give audio clues to help players understand when they are in the skill/game or not

Fig 7. Heuristics for basic voice commands

Voice Commands – Wildcards

Technology is improving all the time and the ability to recognise what players are saying is also improving. However, there are steps that we can take to help improve recognition. The first step is to choose words that are easy for players to say, that aren’t tongue twisters or aren’t pronounced very differently with different accents.

Another option is to only allow a very small number of things that the player can say at any point in the game. Perhaps the player can only say “Yes” and “No” throughout the entire game. You can also encourage players to say catch phrases that voice devices can easily understand. For example, “That is my final answer”, “I’d like to buy a vowel”, “Phone a friend”, “Lock it in” and so on.

When a voice device is interpreting a command from a player, it is attempting to link it to one of several expected sentences. This works much more easily if only one word is a “wildcard”. For example, if the command is “Fly me to Tokyo”, the sentence will be picked up much more easily if the device is looking for “Fly me to _____”. In that case it’s just attempting to work out what city you’re referring to.

Imagine though that the sentence needs to match “____ ____ to _____”. In this case, there are a lot of combinations that could go there, e.g. “Walk Jenn to Tokyo”, “Drive the other team to Timbuktu”. Now, imagine that the order of the words matters as well (i.e. permutations count). In these cases, misunderstanding even one word can lead to drastic changes in what happens in the game.
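To make the single-wildcard idea concrete, here is a minimal sketch that fixes the “Fly me to” part of the sentence and only has to resolve the city. It uses Python’s standard `difflib` for fuzzy matching as a stand-in for real speech recognition; the city list and function name are assumptions for this example.

```python
import difflib

# Assumed list of destinations the game knows about.
KNOWN_CITIES = ["Tokyo", "Rome", "Timbuktu", "Cairo", "Lima"]


def match_fly_command(heard, cities=KNOWN_CITIES):
    """Match "Fly me to <city>" and return the closest known city.

    Only the city is a wildcard, so a slightly misheard city name can
    still be resolved. Returns None if the fixed part of the sentence
    does not match or no city is close enough.
    """
    prefix = "fly me to "
    text = heard.strip().lower()
    if not text.startswith(prefix):
        return None
    spoken_city = text[len(prefix):]
    by_lower = {city.lower(): city for city in cities}
    close = difflib.get_close_matches(spoken_city, list(by_lower), n=1, cutoff=0.6)
    return by_lower[close[0]] if close else None
```

With one wildcard and five known cities there are only five candidates to tell apart. With three wildcards of five options each there would be 5 × 5 × 5 = 125 possible sentences, and a single misheard word lands the player on a different, still valid, command.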

For example, one game we tried to make was about casting spells that took you to different places as different characters. Players needed to say four words in a row, where each individual word was significant. If the voice device misinterpreted you, the game would do something totally unexpected and undesirable. When the game understood exactly what you were trying to say, it did feel like you were casting a magic spell. However, in our very preliminary exploratory prototype, there were so many fail states that by the time it worked it no longer felt magical: you only felt a sense of relief that it had worked at all. We’re confident that with some more effort this could be made to work effectively, but it is something that needs further design and technical exploration.

If a game stores state, then it is handy to use that to improve interpretation of player commands. That is, if you know you’re expecting a player to fly somewhere, then you can be listening for a city name, rather than a number or some other input. But if any command can happen at any time, it will be harder to work out what players are intending to do. One way to deal with these device miscomprehension issues is to lean into the theming and create a main character that is a broken AI. Unsurprisingly, not everyone in the world likes sci-fi AI stories. So instead of limiting games to a specific type of narrative, the key is to work out ways to limit the errors that the voice device can make.

Wildcard Voice Commands Rules of Thumb

  • Use easy to pronounce key words
  • Set up catch phrases that people have fun saying repeatedly and that a voice device can recognise easily
  • Use a minimal number of wildcard spots in sentences
  • Use game state to help device interpret what to expect

Fig 8. Heuristics for wildcard voice commands

And that ends our blog for this time. In our next and final entry we’ll discuss creating atmosphere and writing, and do a deep-dive into our chosen prototype, When in Rome.