Siri :: Apple's Digital Assistant :: The Technology
Siri Primer :: Part 2 of 3 :: The Technology
(See Part 1 here, if you missed it)
Siri is a notable implementation of several technologies: Nuance Communications' voice recognition and text-to-speech (TTS) technology, Siri's own AI-like natural language processing engine, and Apple-hosted backend services (i.e., processing capabilities plus access to data and other resources). A useful simplification is that Siri has three layers: voice processing, a grammar-analysis/context engine, and services. To these I would add learning capabilities to complete Siri's technology milieu.
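To make this three-layer view concrete, here is a deliberately toy sketch of how the layers might hand off to one another. Everything here - the function names, the "intent" dictionary, the canned weather answer - is my own illustration, not Apple's actual design.

```python
# Toy illustration of the three layers described above. All names and
# behavior are my own invention; the real system is vastly more complex.

def voice_layer(audio: str) -> str:
    """Layer 1: speech-to-text. Here we pretend the audio is already text."""
    return audio.lower().strip()

def context_layer(text: str) -> dict:
    """Layer 2: grammar analysis / context engine. A toy intent parser."""
    if "weather" in text:
        return {"intent": "get_weather"}
    return {"intent": "unknown", "utterance": text}

def services_layer(request: dict) -> str:
    """Layer 3: services. Dispatch the parsed request to a data source."""
    if request["intent"] == "get_weather":
        return "It is 72 degrees and sunny."  # stand-in for a real lookup
    return "Sorry, I don't understand."

print(services_layer(context_layer(voice_layer("What's the weather?"))))
```

The point of the sketch is the separation of concerns: each layer can be improved (or, in Siri's case, hosted in a different place) without disturbing the others.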
At the iPhone level, it appears that Apple has integrated Nuance's voice technology with iOS to translate spoken words into text and to give text a voice. Nuance has been in the voice technology business since 1994 and, interestingly, was another spin-out from the same lab as Siri (SRI International's STAR Lab). It is fair to say that Siri's actual voice interface is really Nuance technology…but it's the backend magic of Siri that really makes things pop.
What Siri appears to do better than any other mobile voice solution is process natural language. This is not voice activation, where users learn to speak set commands. Nor is it traditional speech recognition, where enunciation is essential and tasks are accomplished only when words are spoken clearly enough. Siri goes far beyond this by understanding language, modeling knowledge and applying logic.
With Siri, there are no pre-defined ways of asking Siri to do something or answer a question - it simply understands what a user wants to do. Importantly, Siri not only understands spoken words, it understands context. Understanding context requires deciphering natural language and then adroitly tapping the resources at Siri's disposal to perform tasks or correctly answer basic, and even certain complex, questions.
To make this concrete, consider what happens when you ask Siri to "Book a table at Beachfire in San Clemente for 5PM." Siri determines that "Beachfire" is likely the name of a place, that "San Clemente" is likely a location, and that it still needs to confirm the reservation is for today. This is natural language processing, and as you can imagine, it is incredibly difficult to get right.
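To illustrate the core of the problem in a deliberately toy way (Apple has published nothing about Siri's internals, and the pattern below is purely my own), the task amounts to extracting an intent and its slots from free-form text:

```python
import re

# Toy slot extraction for the "Book a table..." example above. A single
# hand-written regular expression is my own illustration; real natural
# language processing uses statistical models, not brittle patterns.
PATTERN = re.compile(
    r"book a table at (?P<venue>.+?) in (?P<city>.+?)"
    r" for (?P<time>\d{1,2}\s*(?:AM|PM))",
    re.IGNORECASE,
)

def parse_booking(utterance: str) -> dict:
    match = PATTERN.search(utterance)
    if not match:
        return {"intent": "unknown"}
    slots = match.groupdict()
    # The date was never stated, so a real assistant must ask a follow-up
    # question ("Is that for today?") rather than guess.
    slots["date"] = None
    return {"intent": "book_table", **slots}

print(parse_booking("Book a table at Beachfire in San Clemente for 5PM"))
# {'intent': 'book_table', 'venue': 'Beachfire', 'city': 'San Clemente',
#  'time': '5PM', 'date': None}
```

Notice what even this toy version surfaces: the missing date is not an error but a prompt for a follow-up question, which is exactly the conversational behavior described below.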
Siri's architecture, like that of the voice features in Android's Ice Cream Sandwich (ICS), relies upon backend processing and data access, largely because of mobile processor, memory and storage constraints. Viewed by most as a limitation (Siri doesn't understand "Call home" without network access), Apple appears to be banking on the belief that network access (WiFi, WiMAX, cellular, etc.) will become faster, broader and more reliable over time, and that the heavy computational lifting natural language processing requires is best handled by powerful data center servers with fat-pipe access to resources and services. An advantage of this architecture is that the platform's services can be extended centrally and made available simultaneously to all users.
Siri also learns. At the user/handset level, Siri has routines that help it better understand the subtleties of each individual user's accent and voice characteristics. At a macro level, Siri's backend sifts through the millions of requests (think: Google Search or Apple's Genius) and finds things to improve upon. For example, when Siri first launched it voiced "Tee X", but within a week it began saying "Texas".
What really sets Siri apart, though it is more a design decision than a technology, is Siri's "friendly edginess" and humor - its persona. Siri tries hard to be both witty and useful. This is very difficult to pull off but critical, as this "personality" is what has captured the imagination of the market. When merited, Siri delights users with clever, cheeky and laughter-provoking responses. I very much doubt Siri would be getting all the attention it has if it gave accurate but boring responses every time.
How Siri Works
Though the following needs to be confirmed, it appears that once the Siri microphone button is touched, whatever is said is turned into text and sent to Apple's data centers, where Apple hosts Siri's AI-like natural language processing engine. Siri then figures out what has been said. Depending upon the inquiry, Siri either sends a text response back to the iPhone or performs queries and sends both a text response and data back to the handset. The iPhone's Siri digital assistant is given "life" by vocalizing answers via the Nuance text-to-speech service and, if merited, displaying information obtained through Siri's backend services (e.g., Wolfram|Alpha data) or the user's iPhone resources (e.g., the Maps app).
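Put in rough pseudocode, the round trip looks something like the sketch below. This is my reconstruction from observed behavior, not a confirmed design; every function name here is hypothetical, and the stubs exist only so the sketch runs end to end.

```python
# Rough reconstruction of the round trip described above. Everything here
# is a guess from observed behavior; Apple has not documented this flow.

def speech_to_text(audio: bytes) -> str:
    return audio.decode()                 # pretend the audio is already text

def send_to_apple_backend(text: str) -> dict:
    # Stand-in for the data center: NLP engine plus service queries.
    return {"spoken_text": f"You said: {text}", "display_data": None}

def speak(text: str) -> None:
    print(f"[Siri says] {text}")          # stand-in for text-to-speech

def show(data: object) -> None:
    print(f"[Screen shows] {data}")       # e.g. Wolfram|Alpha data or a map

def handle_button_press(audio: bytes) -> None:
    text = speech_to_text(audio)          # on the handset (Nuance-style)
    reply = send_to_apple_backend(text)   # heavy lifting in the data center
    speak(reply["spoken_text"])
    if reply.get("display_data"):
        show(reply["display_data"])

handle_button_press(b"What is the weather in San Clemente?")
```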
Importantly, Siri also somehow manages conversations - multi-part exchanges between a user's iPhone and Siri's data center resources. When Siri needs more information to fulfill a request, it asks the user for it without forgetting what was originally asked. This ability to maintain a "conversation" across multiple exchanges is essential to making a digital assistant conversational and to making Siri work as smoothly as it does today.
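A minimal sketch of what "not forgetting" might mean in code, assuming some kind of per-session state (again, my own toy model - Siri's actual session handling is unpublished):

```python
# Toy dialog state: a pending request survives across exchanges, so a
# follow-up answer can complete it. Purely illustrative, not Apple's design.

class Conversation:
    def __init__(self):
        self.pending = None  # a request waiting on missing information

    def handle(self, utterance: str) -> str:
        if self.pending and self.pending.get("missing") == "date":
            self.pending["date"] = utterance    # fill the gap, keep the rest
            self.pending["missing"] = None
            return f"Booked {self.pending['venue']} for {self.pending['date']}."
        if utterance.startswith("Book a table at"):
            venue = utterance.removeprefix("Book a table at ").split(" for ")[0]
            self.pending = {"venue": venue, "missing": "date"}
            return "For what day?"
        return "Sorry, I didn't get that."

c = Conversation()
print(c.handle("Book a table at Beachfire for 5PM"))  # -> "For what day?"
print(c.handle("Today"))                   # -> "Booked Beachfire for Today."
```

The second exchange only makes sense because the first one was remembered; that carried context is the whole trick.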
Apple has integrated Siri with iOS 5, but what does this actually mean? Turning speech into text can be done without extraordinary processing power. However, using AI-like technology to understand what has been said is generally a computationally intensive task. By integrating the means to turn speech into text and text into speech at the handset level, while running the context engine on big iron at a data center, Siri succeeds. I also think that by integrating parts of Siri with iOS, Apple is controlling how third parties can use Siri, helping to ensure that what sets Siri apart remains a significantly Apple-controlled technology.
Siri was architected as an extensible platform, so new services can be added without extraordinary development effort. Should Apple open Siri to developers - and I am not 100% certain they will do so soon, or at all - there is virtually no limit to the breadth of domains Siri could eventually support.
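One common way to build that kind of extensibility is a registry that maps intents to service handlers, so adding a domain means registering one more handler. To be clear, this is the general pattern, not Siri's actual mechanism; the intents and handlers below are invented for illustration.

```python
# A generic plug-in registry: the usual way platforms stay extensible.
# This shows the general pattern only, not Apple's implementation.

HANDLERS = {}

def register(intent: str):
    def decorator(fn):
        HANDLERS[intent] = fn
        return fn
    return decorator

@register("weather")
def weather(slots: dict) -> str:
    return f"Forecast for {slots['city']}: sunny."  # stand-in for a service

@register("restaurants")
def restaurants(slots: dict) -> str:
    return f"Found 3 tables in {slots['city']}."    # another pluggable domain

def dispatch(intent: str, slots: dict) -> str:
    handler = HANDLERS.get(intent)
    return handler(slots) if handler else "No service supports that yet."

print(dispatch("weather", {"city": "San Clemente"}))
```

Under a scheme like this, "opening Siri to developers" would largely mean letting third parties register handlers - which is precisely where the persona and quality-control problems discussed next come in.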
Apple has designated the initial version of Siri a beta, a rarity for Apple, which is, in part, why they have not opened Siri up to the Apple development community. Apple has not announced if or when a Siri API/SDK will be released. Some predict Apple will release the SDK with the launch of iPad 3 in early 2012, but I believe it could be well over a year away, if it happens at all. The challenge is that Siri has a distinct persona that is controlled by one group within Apple. How can Apple open this up and maintain that efficient, compliant, engaging and uncomplaining "voice with an attitude" if thousands are developing for the platform?
I believe Apple sees Siri's vast potential and has decided to take more time to expand Siri to fully support non-US markets, improve it, and extend it to other Apple products. For example, rumors abound that Apple will begin selling HDTVs with a Siri interface in late 2012 or 2013, and I'd wager that Apple TV will also include Siri so that any TV can be controlled via the Siri voice interface.
Fanboys and Apple bashers alike agree that Siri is the best implementation of voice technology to date; even "Androidians" concede that Siri leaves anything on the Android platform in the dust. Given the technology's potential to sell more Siri-capable Apple products, it is not unreasonable to suggest that quite some time could pass before Apple feels the solution is ready to be opened up to the general Apple dev community. But again, this may never happen.
Siri isn't yet a "ready for prime-time" Apple product, which is why it is technically a beta release. The biggest drawback is that it is not a global product. While the initial version supports American and UK English, French and German, Siri's full functionality only works with American English in the US. Apple still needs to expand its data centers in Europe and Asia to give folks there the full flower of the Siri experience.
Another limitation, at present at least, is that Siri operates in a closed ecosystem: it doesn't work with any apps or services other than those Apple has connected on the backend. This is just one of the challenges Apple faces in letting others into Siri's world. Beyond technology, Apple will likely need to develop new economic models or hybrid usage/licensing schemes, and until it figures this out, developers will not be able to tap into Siri.
Apple's reliance on backend servers to do much of the heavy computational lifting exposes other limitations: network availability (Siri simply doesn't work when the Internet is unreachable) and Apple's data center resources. If Siri proves wildly successful, Apple will need to scale server resources to keep pace, which is expensive and tricky. Siri's volume in the first month after release was reportedly 10 times what Apple predicted, and there have already been complaints that Siri sometimes just stops working. Pundits attribute at least some of these outages to Siri traffic exceeding the capacity of Apple's hosted services.
In the third and final installment of this Siri Primer, I offer insights into Siri's potential and competition, and present some conclusions. Part 3 also includes select research links that I tapped in writing this Primer.
Have you had a chance to use Siri? What are your thoughts?