Conversational Voice AI for L&D: Coaching, Role Playing, and More

Okay, listen. I wrote this entire blog post about an AI executive coach we’ve been experimenting with and thought, “Anyone reading this will probably just want to try it out first.”

Check it out! It’s not perfect, but it is kind of fun. (We’ll pay for the token usage, so don’t chat all night.) We’ll be leaving the sign-up open for a few days.

When you’re done chatting, come back and read the rest of this post. And with that, back to our regular programming:

Yes, the robots can talk.

OpenAI announced on May 13 that a new conversational mode will be released in the next few weeks.

These improved capabilities becoming ubiquitous will make chatting with AI verbally a regular way we interact with technology. So, what will it mean for L&D? And how did we get here? After all, voice assistants aren’t new. So what’s the big deal?

We conducted a few experiments to find out, gathering input from L&D pros along the way. Join us for a look at the results—and the implications. It’s looking more and more like conversational AI will have a significant impact on some key aspects of L&D.

First, a Quick Technology Summary

Voice assistants like Apple Siri and Amazon Alexa have been around for a while. They use natural language processing (NLP) to take a request and match it to canned responses. This means they’re helpful for checking the weather, but also, as Microsoft CEO Satya Nadella said in 2023, they’re “dumb as a rock.” They don’t have the dynamic or generative abilities of a large language model (LLM) like that used by ChatGPT.

ChatGPT changed the game. In 2022, Whisper was introduced as a complement to ChatGPT-3.5, providing users the ability to convert audio into text. This allowed users to speak to ChatGPT, which could then read responses back at the click of a button. Audio and voice became usable, but the technology still lacked the ability to interrupt, be interrupted, or have a real conversation without strict instructions to take turns.

Newer startups have enabled more conversational interactions on top of LLMs. They introduced the ability to automatically detect turn taking, allowing for interruptions and speaking freely back and forth. They also added nice interjections like “mmm hmms,” which occurred while the AI was listening. Further, they analyzed vocal expressions. Some latency still happened in these experiences as they were essentially a multi-step process; an LLM like ChatGPT-4 creates a response and then a voice agent, a separate technology, speaks the response.

Then this month, OpenAI announced that ChatGPT-4o, the most recent version of the company’s generative AI chatbot, will be able to natively understand and reply conversationally. This means that there will no longer be a voice agent reading responses from the LLM. The LLM will speak. If you tell the AI to “slow down” or “pretend to be a character,” it will. It will also be naturally vocally expressive and understand users’ vocal expressions. Because it will all be built into the same system, it will be faster than anything that’s come before.

To date, a text model of GPT-4o has been released. The most advanced voice capabilities (including a controversial Scarlett Johansson-sounding voice) have not been released as of this publication. That means that right now, today, you can talk to GPT-4o, which is fast, but you have to tell the system if you want to interrupt (because it’s still using the old voice input-readout technology). You can read details of what is currently available.

Whew! That was a lot. Let’s talk about what this could all mean for L&D after the floodgates open.

Our Hypothesis: Faster, More Authentic Interactions for Practice and Reflection

We had previously explored chat-based scenarios (via typing) as a form of practice and role play. These interactions were fun at first, but the effort required to make them feel real (while you knew it was an AI) was hard to sustain. It was also weird to play out scenarios via chat that you’d be more likely to have in a real spoken conversation (like a call).

We wanted to add voice capabilities to see if it would make the scenario feel more real and make it easier to engage.

Experiment No. 1: AI Coach

We used GPT-4 Turbo as our LLM and added a conversational layer on top. We then instructed the assistant to act as an executive coach. Prior research has shown GPT-4 to be the most effective at role playing (from a limited evaluation of other models).

This video captures my first experience with this combination:

As you’ll notice, there’s a bit of latency but the conversational ability is impressive. I’m the one having a harder time putting words together.

I shared the link to try out the AI coach with L&D folks in my network for their feedback.

Overall sentiment was positive:

“Natural”
“Wow!”
“Fantastic!”
“Realistic”
“Smooth”
“Something I could see myself using on a daily basis.”
And then there was my wife’s reaction: “That is really freaky.”

Comments on interacting via voice:

Conversational flow is very good and most human-like yet.
Voice is good for reflection; people are less self-critical over voice because it is linear (they can’t go back and make edits) and they don’t see what they’re out-putting. It felt faster and required less effort.
When asking the coach to slow down (when trying to document recommendations), the coach couldn’t.
Knew it was AI but became less aware over time.
Tone and inflections were good and conversational.
Latency was noticed but also called out as not too bad.

Comments on the usefulness of the AI coach:

The coach offered ideas and recommendations that were helpful.
It prompted real reflection with good questions.
Users found the approach and methodology effective.
The coach has a habit of reflecting back what the user said (mentioned as a positive and negative).
It suggested a role play to practice the recommendation, which was appropriate, but the role play itself felt a bit awkward.

Comments on the user interface:

Needs a way to document recommendations (mentioned several times).
Wasn’t immediately clear how to start the conversation.
May be helpful to have an avatar—to feel like you’re speaking to someone.
Needs ways to pause (for reflection and to just step away).
Needs to communicate expectations about the length of the experience.
Would be helpful to see a transcript, summary, next steps, or resources to revisit at a later time.

Experiment No. 2: AI Role Playing with Expressive Understanding

In this experiment, we wanted to see if expressive understanding and interactions with AI would feel natural. We experimented with a role playing interaction—helping resolve a customer service interaction with an angry customer.

Here’s a short clip:

We haven’t gotten as much feedback on this experience yet, but here are my initial reactions. The role play was effective in that it made me uncomfortable! It was hard. It was stressful. I felt some level of “realness” hearing another person’s upset voice.

But because I knew it was only role play, I knew I could also bail out when I felt stuck or uncomfortable. I would need some level of accountability or assessment to help me persevere. I also learned that I am not cut out for customer service roles!

I had one of our sales leaders try this interaction, and they said that they spent 15 minutes speaking with the AI customer before they got to a good resolution. (It required a change of tactics halfway through.) The sales leader said they felt they had to solve the issue so they could “win.” Sales people are just built differently I guess.

We also tried a coaching interaction with an AI that has expressive understanding, to see if it could detect my emotion without relying on the content of my words. While it was impressive that it could pick up on my sentiment (even if it wasn’t reflected in my words), I wasn’t a fan. Perhaps because I was in testing mode, it felt inauthentic when the AI acted like it understood my feelings. The AI also wasn’t as good at detecting when to jump into the conversation, repeatedly interrupting my discontented ramblings.

Conclusion: Expression analysis is probably more helpful for real, human-to-human interactions.

Experiment No. 3: Faster with GPT-4o

When GPT-4o text mode became available, we decided to revisit the AI coach we created in Experiment No. 1. Text mode is advertised as 50% faster than GPT-4 Turbo, so using it seemed like a great way to find out if we could reduce the latency.

The inclusion of GPT-4o in our AI coach did reduce the latency a bit, as you can see here:

Conclusion: Using GPT-4o, the latency in our AI coach application dropped from an average of 3.6 seconds to 2.2 seconds making the conversation much more natural.

Looking Ahead

We aren’t done experimenting with voice. We’re implementing some of the suggestions we’ve already received from L&D folks about the AI coach (including the inclusion of transcripts, summarization of action items, a better user interface, analysis, and feedback options).

We’ll keep testing the latest LLMs. And, we’ll explore voice for new use cases (maybe something used on the go, something interacted with during meetings, or something to help complete administrative tasks).

Here’s a quick peak at the live transcript work:

Takeaways for L&D

As consumer technology gets more advanced, it puts even more pressure on the experiences L&D creates. With this in mind, what does the advent of conversational voice AI mean for L&D professionals?

Voice AI isn’t great for everything, but it seems well-suited for certain use cases (like skill development). Figure out what those are for your audiences and find appropriate solutions.
Voice AI will let L&D reach more people with better experiences at less cost, but it will likely also create a premium for real, human interactions.
Certainly, effective coaching requires more than what our experiment offered. However, we see AI interactions as a great complement to your learning programs.
GPT-4o will be able to do almost all of the heavy lifting here, but L&D will still likely need a vendor to provide reporting and analysis as well as connections to supplementary workflows.

If you’d like to talk about conversational AI, please send me an email at tblake@degreed.com.

Thanks for experimenting with us!

See all the Degreed Experiments.

Introduction: Degreed Experiments with Emerging Technologies

AI Taxonomies for Skills: Actionable Steps for Career Goals

To find out more about chatbots and L&D, check out our companion blog post Chatbots for Learning: Gateway, Guide, or Destination?

Conversational Voice AI for L&D: Coaching, Role Playing, and More

Yes, the robots can talk.

First, a Quick Technology Summary

Our Hypothesis: Faster, More Authentic Interactions for Practice and Reflection

Experiment No. 1: AI Coach

Experiment No. 2: AI Role Playing with Expressive Understanding

Experiment No. 3: Faster with GPT-4o

Looking Ahead

Takeaways for L&D

See all the Degreed Experiments.

More Articles in Learning

Skills, Automation, & AI: The New Building Blocks of L&D

How Learning Teams Can Jump-Start AI Experiments

As HR Pivots, Make L&D the Hero of the Employee Experience

AI-Generated Content: How Companies Can Avoid the Slop Ahead

Yes, the robots can talk.

First, a Quick Technology Summary

Our Hypothesis: Faster, More Authentic Interactions for Practice and Reflection

Experiment No. 1: AI Coach

Experiment No. 2: AI Role Playing with Expressive Understanding

Experiment No. 3: Faster with GPT-4o

Looking Ahead

Takeaways for L&D

See all the Degreed Experiments.

More Articles in Learning

Skills, Automation, & AI: The New Building Blocks of L&D

How Learning Teams Can Jump-Start AI Experiments

As HR Pivots, Make L&D the Hero of the Employee Experience

AI-Generated Content: How Companies Can Avoid the Slop Ahead

Let’s keep in touch.