Hot Pod Insider

The Robots are Coming

Notes on Pocket's New Text-to-Speech Feature.

In the past year or so, I’ve become more and more interested in the boundaries of the term “podcast.” It’s already a flexible word, which can signify different things at different times for different people: anything from an internet-first audio creation delivered through an RSS feed into a smartphone app to a listen-again version of a radio broadcast accessed through a smart speaker. To wit: it feels significant to me that the BBC’s latest version of their smartphone app for audio is called “BBC Sounds,” not “iPlayer Radio” as before. The corporation is, as far as I understand at the moment, seeking to smooth over the distinctions between live radio broadcasts, catch-up services and podcast-first content to create one place where all their audio can be accessed.

These questions of definition came up again for me this week when I read about the new version of Pocket, the read-it-later app that allows users to save links to articles and videos into an offline queue for later consumption. The redesign includes a revamped listening feature, which can stream an audio version of a saved article to create “a personal podcast that you curate on your own,” according to founder and CEO Nate Weiner. Pocket has had a text-to-speech feature since 2012, but it wasn’t emphasised in the design and had an extremely robotic-sounding voice. This new version is still automated, but the sound is more humanish (that’s a depressing phrase to type) and the listening feature is now really at the fore of the app’s interface.

Weiner sold Pocket (which was called “Read It Later” when it was founded in 2007) to Mozilla in 2017 for an undisclosed sum, but according to Bloomberg the app has an audience of over 30 million. That’s a lot of people who are now exposed to this automated text-to-speech vision of “podcasting.” In the last couple of years, Weiner has been talking a lot in interviews about Pocket’s potential to be a better version of the Facebook news feed and to offer readers a haven from “the fast-churning hellbroth of the daily news cycle”. Adding listening to the mix is an obvious extension of this idea, since podcasts are often touted as a way to “escape the news”.

Of course, such text-to-speech services have existed for a long time; for almost as long as I can remember it’s been possible to get a computer to bark out bits of text, mostly for accessibility reasons. But this capability is now being specifically rebranded as being for entertainment, rather than for functionality. As such, I think that for the casual listener it appears to exist on a spectrum that also contains the highly produced, narrative storytelling shows more traditionally described as podcasts. Crucially, by associating the robotic reading service with podcasting, Pocket can tap into the existing expectations around advertising in audio. Listeners won’t balk at the idea of an audio ad that runs in between articles in their queue; they’re so used to midroll adverts in podcasts anyway. By adding audio, Pocket just massively increased their ad inventory.

The rapid adoption of smart speakers intersects neatly with this mainstreaming of automated, processed audio. Pocket’s new version also includes their first Alexa skill, which enables you to have the articles in your queue read out by your speaker. People are rapidly getting used to interacting verbally with their devices — a podcast that is built out of articles chosen by the user isn’t that far a cry from some of the “choose your own adventure” style experiments in immersive fiction that have been made for smart speakers.

According to The Verge, Weiner has said that Pocket “might” start using real people to record articles in the future, but that for now everything you hear in the app will be automated. There are already several companies that offer human readings of articles: SpokenLayer is one, as is the Y Combinator alum Audm. High quality narration from professional voice actors is a point of differentiation for these startups, which work on a revenue-share basis with the publishers whose stories they convert into audio. The Economist was an early player in the field of full “audio editions” — they record and release a complete professional reading of all the articles in their magazine for paying subscribers. The Guardian produces an “Audio Long Reads” feed that comprises straight readings of their longer articles, as does the digital science magazine Mosaic. There’s plenty of “printed text to spoken word” translation going on already — I mean, audiobooks, duh — but so far it’s pretty much all been read out by real people.

Another production that I think is relevant to this grey area of narrated articles and audio editions is The Paris Review Podcast. This show tries to do something different to the usual magazine podcast: rather than interviews or analysis by its writers, it presents readings of fiction and poetry from the archive. “We’re just going to let the writing speak for itself, the way it always has in the magazine,” former editor Lorin Stein says at the start of the first episode. I attended a session at this year’s Third Coast led by Stitcher’s John Delore all about how he produces this podcast and “brings the written word to life.” He talked about directing voice actors to infuse a reading with greater intention and meaning; about using acted recreations of scenes to enliven a long passage of exposition; and about inserting music and sound effects to create an audible version of the atmosphere the original writer was trying to evoke.

I came away with a strong impression of the meticulous, painstaking, highly skilled work that Delore and his team put into the audio versions of these old Paris Review pieces. I don’t think anyone would think twice before calling what they make “a podcast”, but it isn’t so very far away from what the Pocket app is now doing too, in a fundamental sense. I don’t think there’s any cause for a “the robots are taking the audio jobs” panic (yet!), but I do wonder how the widespread acceptance of automated readings would change listener and advertiser expectations. If the engagement for the ads running on Pocket is as good as those running on a show like the one Delore works on — and I will want to see a lot of data before I accept this to be the case, just to be clear — the already-blurred lines between what is a podcast and what is not will get even harder to make out.

Nick Quah: There is a relationship that we can build, I think, between this prospect of text-to-speech futures and the notion of speed-listening: both are oriented towards a view of the presented material as an informational commodity, and both endeavor to streamline (and capture and monetize) the flow of that commodity. Generally speaking, I don’t value one higher than the other; in my eyes, podcasts-as-an-experience and podcasts-as-information-vessels are two different types of offering that fall within the larger audio ecosystem. If we imagine the possibilities of an audio-first internet, the two things will find room to occupy different spaces within that internet, even if they are unable to be sorted out within the discourse.

My thing with text-to-speech stuff, whether it’s the rudimentary robot voice or the SpokenLayer/Audm-style human-powered frameworks, has always been a wariness about the ensuing uncanny valley. It continues to feel weird when a robot voice trundles over sarcasm. It’s a different kind of weird when a voice actor assumes another person’s sarcasm. (This, presumably, is why the act of directing is a really important thing, albeit one that’s hard to properly display.) That said, I’m one of those people who believes in the human experience being a largely (but not completely) plastic thing: when exposed and inured enough, you can get a person used to anything. (See: the New York subway system. Also: Stockholm Syndrome.)

If I was a betting man, I’d put my money on automated text-to-speech getting better over the human-powered frameworks being able to scale properly. Two reasons: (1) It’s already shocking to hear how much better the automated voices provided by Amazon Polly, which is the technology at the heart of Pocket’s text-to-speech functionality, sound compared to the automated voices of, say, five years ago. I really won’t be surprised if the technology continues to develop in such a way that it could pass a version of the Turing Test a few years from now. (2) I’m the last person that would bet against Amazon, perhaps on anything.