One of the more interesting learnings from the past, you know, year and a half of working on this stuff is that the solution to many problems with AI is more AI. And it's somewhat unintuitive, but one of the remarkable properties of large language models is that they're better at detecting errors in their own output than at not making those errors in the first place. Joining us today is Clay Bavor, co-founder of Sierra. Before Clay started Sierra with his longtime friend Bret Taylor, he spent 18 years at Google, where he started and led Google Labs, their AR/VR efforts, and a number of other forward-looking bets for the company. Sierra is allowing every company to elevate its customer experience through AI agents, and there is no one who knows more about what AI agents can do today and what they'll be doing tomorrow than Clay. You'll get to hear about how pictures of avocado chairs helped inspire the founding of Sierra, why the solution to problems with AI is often more AI, and so much more. Please enjoy this incredible episode with my friend Clay Bavor. All right Clay, listen, this is a funny start because we know each other so well, but can you just tell everyone a little bit about yourself and give us some background before we talk about the future of AI and what role Sierra is going to play in it?
First of all, I'm a Bay Area native. I grew up not more than four or five miles from here. So I grew up in the Bay Area, got to see the dot-com bubble grow and then burst, studied computer science, and ended up right out of undergrad at Google, where I was for 18 years, until last March. At Google I worked on really every part of the company. I started in search, and then ads for several years. I ran the product and design teams for what is now Workspace, so Gmail and Google Docs and Google Drive and so on. And then I spent the last really 10 years at Google working on various forward-looking bets for the company, some hardware-related, like virtual and augmented reality, some AI-related, like Google Lens and other applications of AI. And then 15 months ago I left Google to start Sierra with a longtime friend of mine, Bret Taylor.
We met in our early days at Google, where we both started our careers in the associate product management program. He was, I think, class one; I was class three. We met early on and stayed in touch, in particular through a monthly poker group that in a good year would play, like, once. And we met up in December of 2022 and just saw what was happening in and around AI, these fundamentally new building blocks that we thought would enable us to create something really special, and started Sierra out of that. So that's the recap. Actually, I'm curious about that, and we need to get to what Sierra is pretty quickly here, but just for fun: December 2022 is very shortly after the ChatGPT moment. I guess, what was the process like, or how soon after that moment did you have the conviction that this was a sufficiently interesting new technology to build a company around? And I'll introduce one thing that's kind of interesting that I hope you talk about: before the ChatGPT moment, you had been telling me about how everything was going to change. I still remember distinctly him telling me, you don't understand, you're going to be able to talk about a scene that you envision, and they're going to be able to make a movie out of you just talking about it. Do you remember telling me that? Yes. Yeah. And so I'm actually very curious about this too.
Well, I had such a privileged seat at Google to see so much of what came out of that Transformer paper in 2017 and the emergence of early large language models. At Google, one of the first was called Meena, and later LaMDA. There was a paper, I think in 2020, about a conversational chatbot for just about anything. I remember even before that, getting to interact with this thing in a pre-release prototype and having this uncanny sense that there was someone, something, on the other side of it, and that this was different. And another moment, I think it was mid-2022, when we had, I think, the first or second version of PaLM, the Pathways Language Model at Google; it was a 540-billion-parameter model. And we were testing it to see kind of how smart it was. One of the surest signs of intelligence is the ability to think and reason in metaphor and analogy. So we tried a few things, and one, which was pretty straightforward, was we asked PaLM, hey, explain black holes in three words. And it came back without skipping a beat: black holes suck. And we were like, oh, that's a pretty good summary. Also, the model seems to have a sense of humor, which is cool. And the moment that really blew my mind, we asked, and I remember the answer verbatim, we asked PaLM, please explain the 2008 financial crisis using movie references. And again, without skipping a beat: the 2008 financial crisis was like the movie Inception, except instead of dreams within dreams, it was debt within debt. And we all paused: what is this? So it understood, basically, the concept of CDOs, the nestedness of debt; okay, what movie includes nestedness of something else? Inception, nestedness of dreams. So it's like Inception. And we all thought, wow, this is something new and different. And then there were a couple of other moments. I remember when the first DALL-E paper came out and they did a blog post, and people reacted a little bit to it. But for me, one of the stars of the show was that they asked DALL-E to make avocado chairs. And I know this sounds so odd, but here was a set of 10 or 20 images of chairs that look like avocados. It wasn't Photoshop; these images had never existed before, and yet the model seemed to understand, similar to the movie-reference metaphor, the concepts of avocado-ness and chair-ness, and put those together and create these images pixel by pixel. So, we have avocado chairs at Instacart. Yeah? We actually did. Wow. We actually had chairs shaped like avocados.
In related news, there were times where we were burning a little bit too much money. You know, those bags. So, we had a good sense that something was coming. And in fact, the team I was running at Google at the time, Labs, was putting a lot of large language models to use in early applications there. So we had a hunch. ChatGPT certainly clarified that hunch, but I think Bret and I both, for several years, had been tracking what was happening and just seeing it: first it was translation, better-than-human-level translation, then it was some of this language generation. And I think credit to OpenAI for doing the engineering work and data work and much more to turn GPT-3 into ChatGPT, where suddenly you could grasp this thing's full potential without, you know, knowing how to write Python and use their APIs. All right, so we're going to talk about where AI is going. We're talking about agents. We're talking about customer service. Right. But first, can you maybe just tell people a little bit about Sierra and what you and Bret have created? Yeah. So, in a nutshell, Sierra enables any company in the world to create its own branded, customer-facing AI to interact with its customers for anything from customer service to commerce. And the backdrop for this is the observation that any time there's been a really significant change in technology, people interact with computers, with technology, in different ways, and as a consequence, businesses are able to interact with their customers in entirely new ways. You saw this in the 90s: the internet made the website possible, and for the first time a company could have a sort of digital storefront, be present to the world, update its inventory with the click of a button, and so on. In the mid-2000s, say 2005 to 2008, if you were a company, you could all of a sudden, through ubiquitous social networks, interact with your customers at scale and have conversations at scale. And in 2015, right after the rise of smartphones, as a company, you could put kind of a Swiss Army knife version of your company in everyone's pocket.
And so, like, I bet you have your bank's mobile app on your phone, probably on your home screen. The last few years of advances in AI have, for the first time, made it possible to create software that you can speak to, right? Software that can understand language, software that can generate language, and most interestingly, I think, software that can reason and make decisions. And it's made for really delightful conversational experiences, like those we associate with ChatGPT. And so we think this is a big, big deal for how businesses interact with their customers.
And think about the difference between how we do some things today versus what you could do if you could just have a conversation, if you could just have a conversation with the business you're interacting with. Think about shopping. You're in the market for some shoes, right? Or Pat, maybe for you, some new weights or something. You lift very heavy weights. I lift little ones. And you're on the website, and you basically have to imagine how the company's designer would have organized the product catalog.
You click: men's, men's shoes, men's running shoes, men's racing shoes, lightweight, Vaporfly, I can't remember the names, and so on. Instead, conversationally, you could just say, hey, I need some super lightweight running shoes, kind of like those ones I got last time. What do you got? It's almost like, and I'm dating myself a little bit here, the Yahoo! Directory, where you navigate through a hierarchical structure to find what you want.
In contrast, with Google, you explain what you want. And this takes it several steps further. There's a quote from the head of customer experience at one of the companies we work with. She said, I don't want our customers to have to have a master's degree in our product catalog and our corporate processes. And buying shoes is on the easy end of the spectrum of interactions you have with companies.
Imagine adding a new person to your insurance policy. Like, where do you go in the mobile app for that? How do you get that done? Your eyes just glaze over, right? And so the alternative, talking to an AI, and in particular an AI agent, the technology around which we built Sierra, where that AI agent represents your company at its best, we think is really, really powerful. And even at, you know, 15 months old as a company, we've had the privilege of already working with storied brands like Weight Watchers, Sonos, SiriusXM, Olukai; if you're in the market for new flip flops, I strongly recommend Olukai flip flops.
I have two pairs. Very good, excellent. They also make great golf shoes. Oh, really? Oh, yeah, yeah, yeah. You should get some. All right, great. And so for Weight Watchers, we're advising on points and helping members manage their subscriptions. With SiriusXM, we're helping diagnose and fix radio issues and figure out what channel your favorite music is on, and so on. And the results, again, in the first year of the platform being out there: in one case we're resolving more than 70% of all incoming customer inquiries at extremely high customer satisfaction.
And all this leads us to believe that every company is going to need its own AI agent, and we want to be the company that helps every company build its own. In the spirit of the future of these AI agents and what they could mean for customer-facing communications and customer-facing operations: are there any good examples of things that were not possible 18 months ago that are possible today? And then maybe, if we roll the clock forward, things that are still not quite possible today that you think will be possible.
Yeah. 18 months from now? Yeah. First of all, the progress month by month, and over 18 months in particular, is just kind of breathtaking. 18 months ago, GPT-4-class models didn't exist, right? It was still something just coming over the horizon. Agent architectures, cognitive architectures, kind of the way you compose large language models and other supporting pieces of infrastructure, were very, very rudimentary. And so I'd go so far as to say that the idea of putting an AI in front of your customers that could be helpful and, importantly, safe and reliable...
That was just impossible. And so chatbots from even 18 months ago looked a lot like a pile of hard-coded rules that someone cobbled together over months or years that became very brittle. And I think we've all had the experience of talking to chatbots. I'm sorry, I didn't get that. Can you ask in a different way? Or my favorite is when they have the message box and then like the four buttons you can click, but the message box is blanked out and you can't actually use it. And so I can help you with anything so long as it's one of these four buttons.
So most of what I described, right, fixing radios, processing exchanges and returns and so on, wasn't possible, at least in any satisfying way or in a way that led to real business results for companies, 18 months ago. Fast-forwarding 18 months, you know, we could go pretty deep here. I think multimodal models are quite interesting. Something like 80% of all customer service inquiries are on the phone, not on chat or email, so voice will obviously be a huge part of it. Things like returns, exchanges, diagnosing radio issues, and so on are on the simpler end of the spectrum of the total set of tasks you might want an AI agent's help with. And so more advanced models, more sophisticated cognitive architectures, all of those, I hope, will increase kind of the smarts in the agent, the types of problems it can solve. And then trust, safety, reliability, the hallucination problem, I think, is still an unsolved area. We and others have made huge amounts of progress on it.
But I think we can't yet declare victory. How quickly do you think it's going to happen? You guys are doing so much for your customers, not just customer service, but, you know, working all the way through the funnel. But on the customer service side: how long is it going to take for you to become the default, where folks expect they'll be able to have someone, or an AI, available at any time to answer any question? You know, make that real for us.
Yeah, I don't know. And in part, there's a bit of a hole to dig ourselves out of, not as a company, but as an industry. It's like, when was the last time you had a great interaction with a chatbot on a website? And, you know, I think if you polled 100 people and asked, do you like talking to customer service chatbots, probably zero out of 100 would say yes. On the other hand, if you asked 100 people, do you like interacting with ChatGPT, maybe 100 out of 100 would say yes. And so some of the work we've been doing in our product is to educate our customers' customers up front that, hey, this thing's actually really smart and good. One of the interesting specific techniques for doing that is we stream our answers out word by word, similar to how ChatGPT does. People are so used to the canned message, message, message. The streaming answer is something of a visual signature for, oh, there's a really smart AI behind this.
And so what we find is customer satisfaction is extremely high with our AI agents, you know, in the mid-fours, so 4.5 out of 5 stars, which in some cases is higher than customer satisfaction with human agents. And in fairness, the human agents often get the hardest cases, and the cases we hand off because the customer became angry or was especially frustrated or something. But still, those results are really significant. And so my guess is, over just the next few years, people will realize, oh, I can get my issue resolved faster, this thing is actually capable and can not only answer my questions but, and this is one of the things we're really proud of, go far, far beyond just answering questions, and can actually take action and get the job done.
Can you talk a bit about Agent OS and some of the frameworks you put around the foundation models to make everything work? So, it's been such an interesting journey learning what's required to put AI safely, reliably, and helpfully in front of our customers' customers. And a huge part of that, really the first part, is looking at what the challenges with large language models are and how you address or meaningfully mitigate them. So, start with hallucinations. I don't know if you saw it, but there was an example from a few months ago where Air Canada's chatbot, which I think was based on an LLM and apparently not much else, was interacting with a gentleman who had questions about their bereavement policy. I think the person had had someone pass away in his family and was asking about refunds and credits and so on.
And the AI made up a bereavement policy that was quite a bit more generous than Air Canada's actual bereavement policy. And so the man took a screenshot and later claimed the full amount of that refund, and Air Canada said, no, actually, that's not our policy.
And bizarrely, and I don't quite understand this, the case went all the way to court, and Air Canada lost. And our thought was like, hey, it's just like $500, and Canadian dollars at that. But hallucinations are a real challenge. And on top of that, just to enumerate some of the things to overcome, and that we have with Agent OS: no matter how smart GPT-5 or 6 is, it won't know where your order is, right? Or which seats you've booked on the upcoming flight, or whatever.
It's obviously not in the pre-training set. And so you need to be able to safely and reliably and in real time integrate an AI agent in our case with systems of record to look up customer information, order information and so on. And then finally, most customer service processes are actually somewhat complex, right? You go to call centers and there will be flow charts on the wall.
Like, here's how we do this, and if there's an exception, it goes this way, and so on. And as capable as GPT-4- and Gemini 1.5-class models are, they'll often have trouble following complex instructions. We saw one example in an early version of an agent we prototyped, where you'd give it five steps in a returns process or something.
And you'd say, hi, I need to return my order, or whatever. And it would jump straight to step five and call a function to return the shoes with a made-up username, johndoe@example.com, and a made-up order number, 123456. So it would not only hallucinate facts, or bereavement policies, but even function calls and function parameters and so on.
So with Agent OS, what we built is essentially a toolkit and a runtime for building industrial-grade agents. I don't want to say that we've solved every one of these problems, but we've overcome and mitigated the risks to such an extent that you can safely deploy agents at scale, have millions of conversations with them, and so on.
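To make the hallucinated-function-call problem concrete: a standard mitigation, which a runtime like Agent OS presumably implements in a far more sophisticated form, is to validate every tool call the model proposes against a declared parameter schema and against facts the session actually knows, before anything executes. A minimal sketch in Python; the tool name, fields, and session shape are all hypothetical:

```python
# Minimal sketch: reject hallucinated tool calls before execution.
# Tool names, fields, and session structure are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Session:
    # Facts established out-of-band (authentication, DB lookups),
    # never taken from the model's own output.
    customer_email: str
    order_ids: set[str] = field(default_factory=set)

TOOL_SCHEMAS = {
    "process_return": {"required": {"order_id", "refund_method"},
                       "refund_method": {"credit_card", "store_credit"}},
}

def validate_tool_call(session: Session, name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = []
    missing = schema["required"] - args.keys()
    if missing:
        problems.append(f"missing parameters: {sorted(missing)}")
    # The model may only reference orders the session has actually loaded,
    # which blocks invented order numbers like '123456'.
    if "order_id" in args and args["order_id"] not in session.order_ids:
        problems.append(f"order {args['order_id']!r} not found for this customer")
    if args.get("refund_method") not in schema["refund_method"]:
        problems.append("invalid refund_method")
    return problems

session = Session("jane@example.com", order_ids={"A-98231"})
print(validate_tool_call(session, "process_return",
                         {"order_id": "123456", "refund_method": "store_credit"}))
# -> ["order '123456' not found for this customer"]
```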
And it starts at the foundation layer, and I don't mean the foundation model layer, but just the base layer of the platform, where you have to get really important things right, like data governance and the detection, masking, and encryption of personally identifiable information. So we built that into the platform from the ground up, so that our customers' data stays our customers' data, and so that their customers' data is protected. We, for instance, detect and mask or encrypt all PII before we log to durable storage. Knowing that we're going to be touching addresses and phone numbers and so on, we can handle that safely.
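As an illustration of the kind of processing that has to happen before anything reaches durable storage, here is a toy version of PII masking. Production systems would use trained detectors and cover many more entity types; these regexes are deliberately simplistic:

```python
# Toy PII masking before logging. Real systems use ML-based detectors
# and cover far more entity types; these patterns are deliberately simple.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def log_turn(raw_message: str, log_sink: list) -> None:
    # Only the masked form ever reaches durable storage.
    log_sink.append(mask_pii(raw_message))

log: list[str] = []
log_turn("Hi, I'm jane@example.com, call me at 415-555-0199", log)
print(log)  # ["Hi, I'm <EMAIL>, call me at <PHONE>"]
```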
A level up from that, we've developed what we call the Agent SDK, a declarative programming language that's purpose-built for building agents. It enables an agent developer, most of whom sit within the four walls of Sierra today, to express high-level goals and guardrails around agent behavior. So: here's what you're trying to do.
Here are the instructions, here are the steps and a couple of the exception cases, and then here are the guardrails. To give an example of that, one of our customers works in kind of the healthcare-adjacent space. They want to be able to talk about the full range of their products without dispensing medical advice, right? So how do you create those additional guardrails?
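Sierra's Agent SDK isn't public, so the syntax below is invented, but a declarative goals-plus-guardrails definition in the spirit being described might look roughly like this, with the developer stating the what and the runtime deciding the how:

```python
# Invented, illustrative syntax only; Sierra's actual Agent SDK is not public.
# The idea: the developer declares goals, steps, and guardrails (the "what");
# the runtime decides how many model calls, which models, what prompts (the "how").
agent = {
    "name": "acme_health_support",
    "goals": [
        "Answer questions about the full product range",
        "Process returns within the posted policy",
    ],
    "steps": {
        "return_item": [
            "verify identity",
            "look up order",
            "confirm item and reason",
            "issue refund via refund_tool",
        ],
        "exceptions": ["escalate to a human if the order cannot be found"],
    },
    "guardrails": [
        "never dispense medical advice",        # enforced by a supervisor check
        "never reveal internal policy exceptions",
        "hand off on request or on repeated frustration",
    ],
}
```

The point of declaring guardrails as data rather than burying them in prompts is that a runtime can then enforce each one in several places at once: in prompts, in supervisor checks, and in post-hoc filters.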
So you can define kind of the behavior and scaffolding for complex tasks for agents with the Agent SDK. We also have SDKs for integrating with contact centers, for when we need to hand off, and for integrating with systems of record, like the order management system and so on.
And then, finally, for integrating our chat experience directly into a customer's mobile app or website: iOS, Android, web, and so on. And then, once you've defined the agent using the Agent SDK, we have a runtime where we abstract away what happens under the hood from the developers, so that they can define what the agent should do. Define the what, and then Agent OS takes care of the how. And so for some skills, there might not be one LLM call but five, six, seven, ten separate LLM calls to different LLMs with different prompts.
In other cases, we might retrieve documents to support answering a question accurately, and so on. And Agent OS, in the spirit of an actual operating system, abstracts away a lot of that complexity, kind of the equivalent of I/O and resource utilization and so on. So it makes the whole process of building and then deploying an AI agent much faster, much safer, and more reliable. And when you think about what you just said, Clay, of when you call multiple LLMs: is that in a supervisory capacity sometimes too, where you end up having, like, a supervisor agent reviewing the work of a lower-level one? Yeah.
One of the more interesting learnings from the past, you know, year and a half of working on this stuff is that the solution to many problems with AI is more AI. It's somewhat unintuitive, but one of the remarkable properties of large language models is that they're better at detecting errors in their own output than at not making those errors in the first place. It's kind of like if you or I were to draft an email quickly and then say, okay, let me pause, let me proofread this. Does this make sense? Do these points hang together? Oh, actually, no, I missed this. And even more powerfully, you can prompt LLMs to take on, in essence, a different persona. A supervisor's persona.
And it seems that with that, you can elicit more discerning behavior and a closer read of the work being reviewed. So to your question, Ravi: yes, in addition to building the agent itself, we have a number of these supervisory agents. It's like a little Jiminy Cricket agent looking over the shoulder of the primary agent. Is this factual? Is this medical advice? Is this financial advice? Is the customer trying to prompt-inject and attack the agent and get it to say something it shouldn't? All of these things. And it's through layering all of these, the goals, the guardrails, the task scaffolding using the Agent SDK, with these supervisory layers, that we're able both to get to the performance levels we do, 70-plus percent resolution rates, and to do that really safely and reliably.
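A minimal sketch of that supervisor pattern, with a hypothetical call_llm function standing in for whatever chat-model API is in use. The key idea is that the check runs as a separate model call whose prompt adopts a reviewer persona:

```python
# Sketch of a supervisor layer. `call_llm(system, user)` is a stand-in for
# any chat-model API and is assumed to return the model's text response.

SUPERVISOR_PERSONA = (
    "You are a strict customer-experience supervisor. Review the draft reply. "
    "Answer PASS or FAIL, then one line per problem. Check: is every claim "
    "supported by the provided policy text? Is there medical or financial "
    "advice? Does the reply comply with the guardrails?"
)

def supervise(draft_reply: str, policy_context: str, call_llm) -> tuple[bool, str]:
    """Run one supervisory check; returns (approved, reviewer notes)."""
    verdict = call_llm(
        system=SUPERVISOR_PERSONA,
        user=f"Policy:\n{policy_context}\n\nDraft reply:\n{draft_reply}",
    )
    return verdict.strip().upper().startswith("PASS"), verdict

def respond(message: str, context: str, call_llm, max_attempts: int = 3):
    """Draft, check, and redraft until the supervisor approves or we hand off."""
    for _ in range(max_attempts):
        draft = call_llm(system="You are the support agent.", user=message)
        approved, notes = supervise(draft, context, call_llm)
        if approved:
            return draft
        message += f"\n\n[Reviewer feedback, fix and retry: {notes}]"
    return None  # hand off to a human agent
```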
That's one of the cooler things I've heard: just tell it to have a different persona, and all of a sudden it behaves differently. I remember when I first saw it in ChatGPT: when it doesn't help you with something, just tell it it's really good at it, and then it's more likely to help you. It's a remarkable situation. It's very strange. And one of the weirdest adjustments over the past, you know, 15 months of building these things is: we're programming with the English language. And we can give it the same English language and it can say something entirely different. And on prompting techniques, I mean, it's fascinating.
Even with no new models coming out, right, given a fixed model, you can elicit better and better performance from it simply by improving how you prompt it. There was a paper that came out three or four months ago suggesting that, like, emotional manipulation of the large language model would get better results. So the prompt suffix they figured out was: you say, hey, I need you to perform this task, you define the steps and so on, and you end with, it's very important to my career that you get this right. And the performance goes up. You're like, what is this? What are computers now? For the record, we don't use that prompt. At least not that I know.
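Since the prompt is effectively the program here, these techniques compose as literal strings. A toy illustration of the suffixes under discussion, including the emotional one from that paper, which, per the above, Sierra doesn't use:

```python
# Prompting techniques as composable strings. The task is arbitrary;
# the suffixes are the published techniques being discussed.
TASK = "Classify this support message as BILLING, TECHNICAL, or OTHER:\n{msg}"

PERSONA = "You are an expert support-triage specialist.\n"
CHAIN_OF_THOUGHT = "\nLet's think step by step before giving the final label."
EMOTIONAL = "\nIt's very important to my career that you get this right."

prompt = PERSONA + TASK.format(msg="I was charged twice for my Sonos order")
prompt += CHAIN_OF_THOUGHT   # elicits intermediate reasoning before the answer
# prompt += EMOTIONAL        # measurably helps per the paper; unused here
print(prompt)
```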
But things like chain of thought, "think step by step," "let's take this step by step," right, elicit better reasoning, for very interesting reasons. Other methods of task decomposition, kind of narrowing the set of things the LLM needs to keep in mind at the same time, improve reasoning if you're precise about what you want it to do. So all of these techniques are ones we've applied and built into Agent OS. And actually, we have a small but mighty research team, and our head of research, Karthik Narasimhan, was... That was incredible pronunciation. Oh, look, his grandmother would have been so perfectly happy with how you pronounced it. Thank you. Well, thank you. Soft T? Yeah, soft T. Nicely done. Yeah, it's not a T and it's also not a D.
That's right. It's right in between. Thank you very much. He helped write the ReAct paper, one of the first agent frameworks. One of our researchers wrote the Reflexion paper, where you can have the agent pause, reflect on what it's done, and think through, am I doing this right, before proceeding. And so these are all things we've been able to incorporate in quite a direct way. You should talk about the most recent research, tau-bench. Oh, tau-bench? Yeah, yeah. It took me a while, when I was trying to send the email saying I liked the paper, to find the tau-bench URL on my computer. No, it took Ravi a while because he's, to this date, never actually read a research paper. I read this one. That's great. No, no, no. He had to figure out how to put it in ChatGPT and say, please write a paragraph that makes it sound like I read this research paper. Well, look, either you or ChatGPT did a great job on that email. Thank you. So we're a team. Yeah. So, tau-bench is our first research paper.
First of all, tau is a Greek letter, it's spelled T-A-U, and it stands for tool-agent-user benchmark. What we observed was that the benchmarks out there for measuring the performance of AIs, AI agents in particular, were pretty limited, in that basically they would present a single task: here is something we need you to do, and here are some tools you can use. Do you do the job or not? And the reality is, interactions with an AI agent in the real world are way messier than that. They take place in the space of natural language, where customers can say literally anything, or describe whatever they're trying to do in any number of ways. They happen over a series of messages. The AI agent needs to be able to interact with the user to ask clarifying questions, gather information, and then use tools in a reliable way. And it needs to be able to do this a million times, reliably. So we found the benchmarks out there really lacking in measuring the very thing we were trying to be the best at. And so our research team set out to create a benchmark that measures, we think, the real-world performance of an agent interacting with real users, using tools, with all the messiness I just described.
And the big-picture approach we took is pretty interesting. You have an AI agent that you're trying to test. You have another, separate agent that acts as the user, basically a user simulator. And the AI agent you're testing has access to a set of tools it can use; think of these as functions to call. A simple one would be, I'm going to do some math using a calculator tool. A more complex one might be, hey, I'm going to process a return of this order with the following parameters: this order number, credit to credit card or store credit, or whatever. And then you basically run a simulation where the agent has a conversation with the user-simulating agent, and at the end, we're able to test in a deterministic way: were the functions used in the right way? The way we do that is we basically have a mock database that those tools interact with and modify. So: was it modified in the correct way? What's neat about this is you can initialize the conversation so that the user has many different personas. They could be grumpy, they could be confused, they could know what they want to do but speak about it in a clumsy way.
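The paper has the real harness, but the shape of the idea fits in a few lines: a user simulator and an agent exchange messages, tools mutate a mock database, and success is judged only by the database's final state. A sketch, not the actual tau-bench code:

```python
# Shape of a tau-bench-style evaluation loop (not the actual benchmark code).
# `agent` and `user_sim` stand in for LLM-backed callables that take the
# conversation so far and return the next message; the agent may also
# propose tool calls as (name, args) pairs.
import copy

def run_episode(agent, user_sim, tools, db, expected_db, max_turns=30):
    db = copy.deepcopy(db)            # mock database that the tools mutate
    messages = [user_sim([])]         # the user opens the conversation
    for _ in range(max_turns):
        reply, tool_calls = agent(messages)
        for name, args in tool_calls:
            tools[name](db, **args)   # tools are ordinary functions over db
        messages.append(reply)
        user_msg = user_sim(messages)
        if user_msg == "###DONE###":  # the simulated user ends the episode
            break
        messages.append(user_msg)
    # Deterministic grading: only the final database state matters,
    # so any conversational path to the right outcome counts as a pass.
    return db == expected_db
```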
And so it doesn't really matter what path the AI agent takes to get to the correct solution, so long as it gets to the correct solution. Now, what came out of this was pretty interesting, and I think it strongly motivates the development of things like Agent OS and frameworks and cognitive architectures for building these agents. The upshot is that LLMs on their own do just an absolutely terrible job at this task. Even the frontier models, on something as simple as processing a return, and mind you, the instructions given to the agent being tested are quite detailed, the functions, the tools it can use, are quite well documented, and so on. And yet, on average, the best-performing LLM on its own got to the end of the conversation correctly 61% of the time. That was in returns; the other domain we simulated was modifying an airline reservation.
The best results there were 35%. Now, what's interesting is, we all know that if you take a number less than one to the nth power, it quickly gets very small. And so we developed a metric we call pass^k, which is: okay, run the simulation k times, say eight, and, remember, you can make use of the non-determinism of LLMs to have the user simulator be different every time, so you can vary that. Well, if each conversation only succeeds 61% of the time, the chance of getting all eight right collapses; measured pass^k at k equals eight drops to roughly 25%. So then imagine, what if you're having a thousand of these conversations? You're so far off from being able to rely on this thing. So the upshot is that much more sophisticated agent architectures are needed to be able to safely and reliably put an agent in front of really anyone. And that's the very thing we're building with Agent OS and a lot of the tooling around it.
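For what it's worth, the pass^k idea is easy to compute. With n trials of a task, of which c succeed, an unbiased estimate of the probability that k fresh attempts all succeed is C(c,k)/C(n,k), the all-must-pass analogue of the familiar pass@k. A short sketch, including the independence baseline that shows how fast reliability decays:

```python
# pass^k: probability that ALL k attempts at a task succeed.
# Unbiased estimator from n trials with c successes: C(c, k) / C(n, k).
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Independence baseline: a 61% per-conversation success rate compounds badly.
p = 0.61
print(p ** 8)                      # ~0.019: under 2% if trials were independent
# Measured pass^k tends to sit above that baseline because a model's failures
# correlate across trials of the same task, but it still falls steeply with k.
print(pass_hat_k(n=8, c=6, k=4))   # example: 6 of 8 trials passed, k=4 -> ~0.21
```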
How much of that do you think is an engineering task, and how much of it is a research task? And I guess maybe the question behind the question is: what's the time frame to having useful agents deployed at scale across broad domains of tasks? Yeah, well, I think the short answer is it's both, but I'll say more concretely, I'm very optimistic about it being in large part an engineering task. And that's not to say that the next wave of models and improvements in the frontier models won't make a difference; I believe they will. In particular, we're seeing techniques like better fine-tuning for function calling, agent-oriented fine-tunes of foundation models or of some of the open-source models. Those will help. But the approach we've taken in building Agent OS, and kind of the foundations of Sierra, is really treating building AI agents as first and foremost an engineering challenge, where we are composing foundation models; we are composing fine-tuned open-source models that we've post-trained with our own proprietary data sets; we're composing multiple models in interesting ways; we're supplementing what LLMs can do on their own with retrieval systems, like retrieval-augmented generation, to improve grounding and factuality; and we're supplementing the inbuilt reasoning capabilities of LLMs with a kind of reasoning scaffolding that lives outside the models, where you're composing planning, task-generation steps, draft responses, the supervisors we talked about, and doing that outside the context of the LLM. We've been able to put AI agents in front of a huge number of our customers' customers safely and reliably, and so I don't think it's something over the horizon; it's already here. Looking ahead, I think there are a few different avenues where we'll see progress. One is in the foundation models; we talked about that, and as the capabilities grow, agents will get smarter. And we've architected Agent OS in such a way, we talked about abstracting the what from the how, that we'll be able to swap in the next frontier model, and everyone's agent will just get a bit smarter, will get, like, an IQ upgrade.
By the way, similarly and interestingly, we can swap in less broadly capable models, but models that are more capable in a specific area. So for instance, for triaging a case or coming up with a plan and so on, we can use much smaller models that are actually better, faster, and cheaper; "choose three," you know, all at once. And then I think we're seeing progress literally week by week on the engineering of these agents, building in not only new and better components under the hood in the architecture, but new approaches and tooling around basically teaching these agents to do the job better and better. So we built something we call the Experience Manager for customer experience teams, which is a pretty interesting thread on its own.
Clay, say you had a high-value customer; like, you're a company now, you're not running Sierra, you're running a company that has a high-value customer. What today, with a Sierra agent, or with an excellently designed agent, could you trust an AI agent to go do in front of your customers? What are some of those tasks, and then what will they be, pick your time frame, in the future? Because we've talked about this, and I like your language that they already don't have to just be on the help center; they can already be on the home page, right? What are some of the tasks that you can rely on an agent for today, if it is well designed, with a high tau-bench score?
Yeah. You see, "a high tau-bench score," that's from a thoughtful and detailed reading. You must have read the paper. Yeah. Thanks. You noticed that. Strong. Yeah. Strong. What would its pass^k score be, though? Yeah. So, a pretty broad range, even today. Simple things like getting answers to questions, that's kind of the left end of the spectrum.
To the right of that are things like helping you with something complex, like, hey, I got these shoes or this item of clothing and it didn't quite fit. And then, branching off that, what do you recommend that might fit better? And so it starts to get into something that's not a like-for-like replacement; the agent actually needs to make sense of styles, of sizing, of differences between, you know, why the narrow fit, and so on. A click up from that is something like troubleshooting.
So with Sonos, for instance, we help their customers troubleshoot if they can't connect to their system, or they're setting up a new system. And you can imagine it gets pretty sophisticated pretty quickly, where it's basically a process of elimination, trying to understand, is it a Wi-Fi thing, is it a configuration thing, and narrowing down the set of problems it could be, just as a sophisticated level-two or level-three technical customer service person would, and getting the music back on. And I think that's a really good example. You used the word trust: what would you trust an AI agent to do?
One of the things we're really proud of is that several of our customers are actually trusting us with customers who call in and may want to cancel or downgrade their subscription. Helping those customers understand, hey, how are you using the service today? Is there a different plan we could put you on? So it's value discovery. It's putting an offer, sometimes a series of different offers, in front of their customers in the right way, positioning the value of those offers correctly given the customer's history, given the plan they're on, and so on. And, you know, the difference between keeping a customer from churning or not...
Yeah, that is hugely consequential, right. AI for customer service has obvious cost-savings benefits, and I think customer experience benefits in particular: you're never going to wait on hold. But boy, revenue preservation, revenue generation, is something else entirely, and so that's really at the right end of the spectrum, and we're really proud of how well our agents are performing in those circumstances. And it's interesting: by being consistent, by taking the time to understand what's driving someone to potentially leave the service, asking the follow-up questions that an impatient or, you know, improperly measured customer service agent in a call center somewhere might not...
We can be much more nuanced in understanding what's driving the decision, and what might be a good match for this person in terms of a plan that would be quite valuable given how they're using the service, and then put that in front of them. So that's the right end of the spectrum. Where does it go from here? You know, I think we've yet to see a process too complex for us to be able to model and scale up using Agent OS and our agent architecture. And so, yeah, I'm sure we'll get punched in the face by something that's especially complex, but I'm excited about where this goes directionally. We started with service for two reasons. One, the ROI case is just unequivocally awesome.
The average cost of a call is something like twelve or thirteen dollars. And yet, despite the expense, most people don't like customer service calls very much, right? So here's something that's actually really important to businesses, that's really expensive, and that's not very good. And so, because of that, and because of the relative simplicity of at least a pretty broad set of service tasks today, we started there. But we've already been pulled by our customers into upsell and cross-sell, and, like, hey, can we just put you on the product page and have you answer questions about our products?
And so, I mentioned, you know, you're returning something and need advice on a different model or size or whatever; how far can that go? I love the idea of an agent being along for the journey, from pre-purchase consideration, to helping you get the thing that's right for you, to helping you set it up and activate it and get the most out of it. It's great for the company; it's great for the person. And then when things do go wrong, being there to help. And in all of this, I think customer service, getting help in a very direct and conversational...
...way, is going to be much less of a thing that you kind of go over there to do, and much more woven throughout the fabric of the experience. As a consequence, I think there's a really interesting and powerful opportunity for companies to build connection with their customers and to reinforce their brand values. You can imagine a company really appreciating being able to use exactly the company's voice; you know, the CMO and head of communications saying, this is how we talk, this is how we are, these are our values, this is our vibe, in every digital interaction they have. And that's the promise in this stuff. And so I think both greater complexity and ubiquity throughout the customer journey are kind of the two main directions of travel.
One thing for me that I think about a lot is, we've come to expect and accept certain metrics for conversion on the mobile web or in the mobile app. We've come to expect and accept certain retention numbers. What could those numbers be if you actually had an excellent experience every time, throughout the journey? It really could be very different from what we've all been like, oh, okay, that's just the number, that's just what it is. Yeah, I think that's exactly right, and we don't know. We're a few months in, but it certainly seems like there's a lot of headroom, in retention, in use in the first 30 days, in all of the leading metrics of a healthy business. And so I think that's exactly right.
The other thought experiment to do: companies are judicious in using things that have a cost to them. So, as a consequence, companies make it actually really hard to get a hold of someone on the phone to ask some questions. I think there are whole websites devoted to uncovering the secret 800 numbers that companies have hidden away in the depths of their help centers. Well, think about not only what would happen if those interactions were better. By the way, interestingly, the number one reason people report a poor interaction with customer service is that it took too long: when it's a negative interaction, 65% of the time it's, it took too long, I had to wait, I was put on hold, and so on. And the second most common is, I had a bad interaction with an agent. And we've heard some pretty dicey anecdotes, like one agent who had consistently...
...consistently low ratings, but strangely so: like, one in three conversations was a one-out-of-five CSAT, and the other two were fine. And it turned out that in the low-CSAT ones, this agent was meowing. Like that. You know. Yeah, you're midway through the call and the agent is meowing. So anyway, back to it: what would happen if, in contrast to making it near impossible to have a conversation with us and get help, companies were providing five or ten times the amount of fluent, flexible, helpful, conversation-based support? I don't know; I think a lot of products and experiences with companies would look quite different, and much more delightful than they do today.
Yeah, okay, now here's a question for you. About that meowing? Yeah, just the random meowing. I do actually have a question, though, although I do like the meow story as well. So, we've talked a little bit tech-out, in terms of what you guys have built, the cognitive architecture and all that good stuff. We've talked a little bit customer-back, what the experience is like. Can we connect it in the middle for a minute? I'm just curious what the reality of deploying AI to customers today is. And I'm thinking about things like, you mentioned earlier, getting the brand voice just right; making sure you actually have the right sort of business logic encapsulated, and whatever training manuals exist for the sake of customer support are reflected; making sure that everybody is comfortable with deploying this. What are some of the less-sexy-technology, just practical, considerations for deploying this stuff today? It's such an interesting space, and we've learned so much over the past 15 months about it.
The first insight is that AI agents represent a totally new and different type of software. Traditional software you write with a programming language, and it basically does what you expect it to do. You give it an input, it gives you an output; you give it the same input, it gives you the same output. In contrast, LLMs are non-deterministic, and we talked about some of the funniness around prompts; and remember that in the context of a conversation with a customer, the customer may say anything, in any way. So you've gone from programming languages to using prompts and these non-deterministic models. You've gone from structured input to messy human language. And under the hood: you upgrade a database, right, it stores data, it's maybe a little bit faster, but it fundamentally works the same way. You upgrade a large language model, and it may just speak in a different way, or get smarter, or be different. And so the precursor to deploying these is to have built what we call the agent development life cycle.
And it's a new approach to building these things. We talked about using this declarative programming language to define them. It's a new approach to testing: what's the equivalent of a unit test or an integration test? So we built a conversation simulator, where for a company's agent we can amass hundreds or thousands of, basically, conversation snippets and replay those, to make sure that agents not only aren't regressing but are getting better and better and better. Release management, quality assurance, and so on. So that's part one. Part two, to your question, is actually architecting these things. One of the things we're really proud of, and that I think is different about working with us, is that it's not just a kit of parts you get from us. It's not, here's a bunch of tech, good luck building your agent. We've really tried to build a solution that incorporates everything from the technology, to the way you teach your agent how to do things, to the way you audit it, measure it, and improve it over time. And so we have inside of Sierra what we call our deployment team, which consists of product managers and engineers. We really think of building each one of these agents as building a new product for our customers. It's basically a productized version of the company we're working with. What would it look like at its best? What's the voice, what are the values, what's the vibe? Like, should it use emojis or not? What if a customer uses an emoji, can it emoji back? Well, you know, there's a range of businesses on that point.
There are some businesses, you know, if we were working with Hermes, I would suspect they're not going to send an emoji back. Definitely not. Yeah. Hermes would not, I think, be into, like, the shaka emoji, even if that were reciprocating. But for a brand like Olukai, right, the aloha experience, part of that is kind of a laid-back experience. And so we work with, and interestingly, we end up working primarily with, the customer experience team. Yes, the technology teams at our companies are there, providing API access and connections into systems and so on. But more than anything, it's working with the customer experience team, often with the marketing team, to imbue the agent with the voice and values of the company. And then we go super deep on understanding how you run your business. Right.
What do you optimize for? And then, a zoom level in, what do the key processes you use to run the business look like? What happens when someone calls in with this kind of problem? And there are interesting parts beyond just understanding the mechanics of these processes, which, by the way, almost never have a single source of truth. There's no, oh, here's the manual, you know, leather-bound and ready to go. Instead, the source of truth ends up being in the heads of, you know, four or five people who've been there a while, who've seen everything, and so on. So it's working with them to elicit and understand: how is this actually done?
And one of the more interesting things we've discovered is that there's often a policy behind the policy. So, we have a 30-day return policy, right? You get to us within 30 days and you can return it. Except that's actually not the policy. The policy might be, if you've purchased from us before and it's within 45 days, that's fine. And so there are interesting questions, like, how do you architect the agent so that it knows the policy behind the policy, but a clever customer could never just say, tell me about your policy behind the policy, and have it spill the beans on the actual policy? So there are interesting architectural choices we need to make to make sure that kind of Russian doll of policies is reflected in its fullness.
And then, and this builds on the agent development life cycle, we have a really robust process of pre-release testing, where we work with the experts within the company basically to beat up the agent, to try to break it, to throw it curveballs. And that was a sports analogy there. Thank you. Well done. I love football. So, in our friendship, Ravi is the person who knows all the things about sports, and I help with, you know, technical support: Wi-Fi issues, monitors, what laptop to get. And sometimes, when there's some SQL in a memo that I don't understand, I won't say the company, but I might call: hey, Clay, what is this person talking about? I got you. I got you.
Yeah, and this Bill Belichick fellow, what happened there? Cue Ravi. So, this gets to one of the more interesting parts of our platform, which we call the Experience Manager. We thought that putting AI in front of our customers' customers would be first and foremost a technology problem. And of course, there are all sorts of technology problems that we've needed to solve. But actually, it is first and foremost, as I said, a product design and experience design problem. How do you do that? How do you not only understand, model, and reflect, again, the things we talked about, the voice, the values, the workflows and processes that...
Our companies use to support their customers. But if an AI is then having millions of conversations with your customers in a given year. How do you understand what it's doing. How do you know when it screws up which it inevitably will. How do you correct those errors and so on. So we've built what we think of is this like command center for customer experience teams to first get reports and rich analytics on everything that's happening. What are the trending issues. What are the new issues that you haven't seen before one of the things we're really proud of. Is we've actually spotted issues that are customers were having were were about to have before they knew about them. So a shipping depot outage right where orders weren't being shipped. We spotted that probably eight or ten hours before one of our customers would have a brewing PR crisis.
An app-crashing issue with another. So it starts with analytics and reporting on what's happening, and of course that includes things like resolution rate, customer satisfaction, and so on. Where it gets really interesting is that we can apply different sampling techniques to identify a set of conversations for the customer experience team to review and give feedback on, and we can bias that sample so that the conversations are much more likely than average to contain problems. There's no value in looking at a hundred great conversations; like, good job, Sierra, thanks. That's not of value to our customers. We can bias the sampling in such a way that you're surfacing the problem cases.
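Mechanically, that biasing can be as simple as scoring each conversation on problem signals and sampling proportional to the score. A toy version; the signal names are invented, and a real system would score conversations with trained classifiers:

```python
# Toy problem-weighted sampling for human review. Signal names are invented;
# a production system would score conversations with trained classifiers.
import random

def problem_score(convo: dict) -> float:
    score = 0.1                           # every conversation keeps some mass
    if convo.get("csat") is not None and convo["csat"] <= 2:
        score += 3.0                      # explicit bad rating
    if convo.get("handed_off"):
        score += 2.0                      # escalated, or customer demanded a human
    if convo.get("frustration_detected"):
        score += 1.5                      # e.g., a sentiment classifier fired
    return score

def review_sample(conversations: list[dict], k: int) -> list[dict]:
    weights = [problem_score(c) for c in conversations]
    return random.choices(conversations, weights=weights, k=k)

batch = review_sample(
    [{"id": 1, "csat": 5}, {"id": 2, "csat": 1, "handed_off": True}], k=1
)
print(batch)  # heavily favors the problematic conversation, id 2
```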
And then, in the Experience Manager, we've made it possible for customer experience teams to give feedback, basically coaching moments. I wouldn't have done it that way, right: this is too many exclamation points, too enthusiastic for the tone we're going for. Or, the user was clearly frustrated here, and you did not express empathy and apologize for the problem; do that next time. Or, with real consequence: hey, your reading of the warranty policy was incorrect here, for this reason; do it this way instead next time. And so all of this wisdom, knowledge, and coaching we're able to capture in the Experience Manager and then reflect back into the agent, back through the agent development life cycle. Every time we make one of these improvements, we create a new test, so that forever in the future we can see, great, it's getting the warranties right; we're able to re-simulate that conversation.
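That every-fix-becomes-a-test loop is recognizably ordinary regression testing, just over conversations instead of functions. A minimal sketch of what re-simulating a coached conversation might look like; the snippet, the agent_reply stand-in, and the 24-month warranty term are all invented for illustration:

```python
# Sketch: a coaching correction captured as a permanent regression test.
# `agent_reply` stands in for re-running the agent over a recorded snippet;
# the 24-month warranty term is invented for the example.

WARRANTY_SNIPPET = [
    ("user", "My speaker died after 14 months, is that covered?"),
    # ...recorded turns leading up to the previously-wrong answer...
]

def test_warranty_reading(agent_reply):
    """Fails forever after if the agent regresses to the old misreading."""
    reply = agent_reply(WARRANTY_SNIPPET)
    assert "not covered" not in reply.lower(), "old misreading came back"
    assert "24-month" in reply, "should cite the 24-month warranty term"
```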
So, zooming out, what all of this looks like is a really deep engagement with our customers. We're really proud to be proper partners to our customers, where yes, on the one hand, we're a vendor and a supplier of technology; on the other hand, we understand their businesses really well. Like, I think I know as much about the SiriusXM satellite radio refresh process as anyone on the planet, and ditto for various processes of our other customers. And so, in conversations about how to use not just Sierra's AI agents but AI more broadly, we're in those conversations, and they're not just with the customer experience team, but with the CEO, and in some cases with the board. Because, again, back to the things we're doing: we can save enormous cost, we can improve the experience, and we're right in the flow of keeping a customer from churning and driving top-line revenue. And so it's a really important and privileged place to be, and something that we're really grateful for.
I was struck, when you were talking: you know, you mentioned you have a research group, but you also have some very real enterprise software sales, you have... Oh yeah, deployment. One of the things people would ask sometimes when I was at Instacart is, well, are we a software company, are we engineering-led, or are we something else? And I would always say, well, it only works if it all works, right? And so you would try to avoid answering the question, because you didn't want to create different classes. How do you do that at Sierra, where everyone realizes the value they're providing?
But you guys have a very specific, you know, company that covers a lot of stuff. Yeah. I mean, to abstract a bit: a company, almost by definition, is a system for creating happy customers. It's a machine for creating happy customers. Again, to be a bit abstract about it, Bret and I really think about what we're building with Sierra as a company, a system, a machine, for producing reliable, high-quality, massively ROI-positive AI agents that enable our customers to be at their very best in every customer interaction.
And, as a consequence, for producing happy customers, who we hope will be with us for decades to come. And when you articulate it that way, right, anyone can see: well, an automobile is a system, a machine for getting from point A to point B. Are we, you know, engine-led or tires-led? It's like, what are you talking about? All of these things need to come together in order to create that kind of outcome. So, are we engineering-led? Yes, of course; we're building some of the most sophisticated software in the world, software that does something really important for our customers and needs to be reliable and safe. So yes, engineering matters a lot.
Are we research-led? Yes, we are at the absolute frontier of agent architectures, cognitive architectures, composing LLMs, modeling procedural knowledge, grounding, factuality, and so on. So are we research-led? Yeah, there's an element of that. Are we go-to-market-led? Yes; enterprise software needs selling. And what is selling? It's helping a customer with a problem understand that what you have built is far and away the best solution to that problem. It's a communication challenge. It's a connection challenge. It's a matchmaking and problem-solving challenge. And so that's part of it.
And then, okay, if we've built the right thing and someone wants to buy it, how do we ensure, especially given this stuff is so new, that they're successful with it? And so we have a deployment team. So are we deployment-led? Yes. All of these are components in this system, in this machine, for producing AI agents and ultimately happy customers and, we hope, a really significant business. Awesome. That was a better answer than the one I would give at Instacart. You know, it only works if it all works.
Yeah, that was very good. Yeah, choose one. No, I mean, it's just more complicated. And I think, you know, Brett and I, by virtue of having worked for a while and seen a few movies before, we're able to see that, and we've really tried to imbue that mentality in the company. And by the way, what is the machine behind the machine that produces AI agents and so on? That's a company's culture, a company's values. And so one of the values we hold is craftsmanship, and part of that is continuously self-reflecting to self-improve, and that goes both individually and as a company.
Whenever we screw something up, we do the post-mortem that week if not that day, and everyone's in on it. What can we learn? How can we do better? How can we do this better next time? We have a Slack channel internally called "learn from losses," and it covers any form of loss. How do we learn, how do we get better, how do we get stronger? So that's about kaizen, self-improvement, the self-improving machine. How could we make this more efficient? Our deployment team, we joke, and it's not a joke: their first job is to build and deploy successful AI agents that make a massive difference for our customers. Their second job, in a way their more important job, is to automate themselves out of a job, to build the tooling and the documentation and the know-how to make that job ten times faster
and more impactful. One of the other Sierra values is intensity, and so I get that they have... Yeah, there is a certain intensity. We've thought about having t-shirts printed with something that kind of looks like a national parks seal with Sierra on it: "I like to work." Brett and I both like to work a lot, and so does the team. Well, one thing: you're selling something very different. We said that there were some similarities to enterprise software, but it's actually really different, because you're selling...
A resolution. You're selling a totally different thing. Yeah, a problem solved. Yeah, how do you price "problem solved"? Yeah, this is one of the more interesting things we've had to figure out. We charge in what we call a resolution-based, or outcome-based, pricing model, and what that means is we only charge our customers when we fully solve their customer's problem for them. And what's interesting about it is that our incentives are deeply aligned with our customers'. We want to get better at resolving cases at high customer satisfaction, and they want to send us as many cases to resolve as possible, because we cost a fraction of what it would cost to have someone on the phone taking a 20-minute phone call.
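(To make that concrete, here is a minimal sketch of how resolution-based billing might compare to an all-human baseline. The volumes, rates, resolution percentage, and function names below are entirely hypothetical; nothing here reflects Sierra's actual pricing.)

```python
# Hypothetical sketch of resolution-based (outcome-based) pricing.
# None of these numbers are Sierra's; they are made up for illustration.

def ai_cost(cases: int, resolution_rate: float, price_per_resolution: float) -> float:
    """The customer pays only for cases the AI agent fully resolves."""
    return cases * resolution_rate * price_per_resolution

def human_cost(cases: int, cost_per_contact: float) -> float:
    """Baseline: a person handles every case (e.g., a 20-minute call)."""
    return cases * cost_per_contact

cases = 10_000  # monthly support volume (hypothetical)
baseline = human_cost(cases, cost_per_contact=10.00)
agents = ai_cost(cases, resolution_rate=0.70, price_per_resolution=2.00)
escalations = human_cost(int(cases * 0.30), cost_per_contact=10.00)  # unresolved 30% go to people

print(f"All-human baseline: ${baseline:,.0f}")                          # $100,000
print(f"AI + escalations:   ${agents + escalations:,.0f}")              # $44,000
print(f"Monthly savings:    ${baseline - agents - escalations:,.0f}")   # $56,000
```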
And so it's been this really nice model where, again, all of the incentives line up quite neatly and it's very simple to explain. It also makes the ROI calculation easy: what is our cost per contact today? What will it be with Sierra? Oh, that is a lot lower. Oh, I will save a lot of money on that. Oh, and our CSAT may go up. Should I do this or not? Let me think... no, this seems great. We like it because it really reflects what I think AI, and in particular AI agents, represent. If you think about traditional software and tools today, they're things that help you get a job done more efficiently.
AI agents, the whole point is they're just going to get the job done. Here's the problem, please solve it. And so really we think about it as charging our customers for the problem resolved, the job done, the work finished. It feels quite natural. And there's no guesswork in it. How many seats do I need? I don't know. How many licenses do I... no, no, no. However many customer issues come our way, we will handle a large fraction of those, and you only pay for the ones that we do. All right. Last question. What are you most excited about in the world of AI over the next five years or so?
I mean, first of all, five years is a long time horizon. Just look at what has happened in the last 18 months. I'm still kind of catching up from the last five years of AI. I read a bunch of science fiction books when I was a kid. There's one book by Robert Heinlein, The Moon Is a Harsh Mistress, and the premise is basically the American Revolution, but the moon is the colonies and the earth is Great Britain, and it turns out the main character in the whole thing is a mainframe computer that one day, after getting an additional memory chip or something, wakes up. And it starts talking. It wants to develop a sense of humor, so it asks the computer technician to coach it on its jokes.
Later it has to create a photorealistic, real-time video of itself giving a speech as the political movement's leader. And I remember reading this as a teen and thinking, well, I'll never live to see any of that. That sounds crazy. But in a very real sense, everything I just described has kind of happened in the last five years. Right? You can now just talk to a computer. It understands not just the content but the context. You can tell a computer, make me a picture of anything, make me a movie of anything. Sora, I think, is just unbelievable. And I think we're probably not more than a couple of years from the first feature-length film being, quote, "filmed" entirely with AI.
And so you extrapolate where all of this is going and what's going to be exciting, and I think there are a couple of things. One, I love technology. I love computers. And so getting to see, from a front-row seat, how this stuff evolves, I think, is fascinating. It's fascinating looked at through the lens of how we think and how computers think. It has been astonishing the extent to which anthropomorphizing, reasoning about how humans think, works for getting machines to think better. "Let's take this step by step and show your work." It is astonishing that that works on large language models.
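(A minimal sketch of the prompting pattern he's describing. The `ask` helper below is a stand-in for whatever LLM API you use; it is hypothetical, not a real library call.)

```python
# Sketch of "step by step" (chain-of-thought-style) prompting.
# `ask` is a placeholder; wire it to your LLM provider's chat endpoint.

def ask(prompt: str) -> str:
    return "<model response here>"  # stand-in so the script runs as-is

question = "A jacket costs $120 after a 25% discount. What was the original price?"

# Direct prompt: the model answers in one shot.
direct = ask(question)

# The anthropomorphic nudge: the same instruction you'd give a student.
stepwise = ask(question + "\nLet's take this step by step and show your work.")

print(direct)
print(stepwise)
```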
And so what other things like that are we going to uncover, and conversely, what will we learn about our own thinking from observing the way AI thinks? I think that's just fascinating. The other thing, and this extends what's happened with video and Sora and so on: I've always had an interest in computer graphics, and in this idea that you could use computers to create objects that never existed, worlds that never existed. And I think we're not far from being able to describe, in a few sentences, an entire world that you would like to realize and just have a computer do it for you. And so what even are computer graphics? What is rendering? Even a couple of years out, everything is going to look way different from the toolchains of today, the RenderMans and Mayas and so on.
But zooming out, I think of technology as fundamentally a force multiplier for people. For companies and organizations, I think the impact will be really profound. What will it be like if a company can be at its best in everything it does? And that's not only in the customer-facing context that we've talked about. What if, for every regional sales forecast a large company does, it has figured out the very best way to do that, and can distill that, bottle that, and run that very best forecast a thousand times, in every region and sub-region? How much more capable could the great organizations of the world be with that?
And similarly, we've talked about this: what if in every call with your customers you had the equivalent of your most knowledgeable, grizzled veteran support person, who's seen everything and yet is still patient and friendly, and the sales associate who knows everything about your products because he or she has followed your company for two decades, including the history of the products themselves? I think that's pretty neat. And then for individuals, I think it will be just incredible to have this kind of new set of tools as a creative force multiplier. AI, I think, represents this fast path from having something in your head that you want to exist in the world to making it exist.
And I see that even today in my own personal life, where with my eight-year-old, in 75 minutes, using Copilot, ChatGPT, and so on to help me brush up on the JavaScript syntax that has bit-rotted in my own head, I can build a game from scratch with him. And I wrote my sister a personalized song for her birthday using AI in 45 seconds. So what will this look like extrapolated over the next five years? I think it will just dramatically accelerate this path from idea to creation, to having something manifested in the world. And that to me is its promise, and I consider it a real privilege to get to be alive to see all of this amazing stuff unfold.
Well, we share your enthusiasm and we also feel very privileged to be on the journey with you guys. So thank you for coming here. Thank you. Thank you. Thanks for having me. It's a pleasure.