
The Modern Challenge of Voice Identity and Anonymity

When someone wants to talk on the record about something that could get them in trouble, they might opt to conceal their identity. For a written piece, they could use a pseudonym (or simple anonymity, like being referred to as “someone familiar with the matter”); for something filmed, they could obscure their face.

In theory, audio offers the best of both these approaches. It separates a person’s face from their words, as written quotes do, while retaining their tone and some expression of their personality. In articles and forums for those considering this move, the major point up for debate is identification: Will they use a middle name? Opt out of a name altogether? In either case, audio anonymity is generally presented as available to anyone who wants it. But focusing on naming alone is insufficient, because it underestimates how much is still revealed by what remains: the voice, which humans and, increasingly, computers are great at recognizing.

Imagine Barack Obama and Donald Trump, says Taylor Abel, a neurosurgeon who leads a research group studying voice and speech perception. The two former presidents, having filled the same role, could have hypothetically been captured speaking in the exact same context, perhaps saying the exact same words. Yet, if you were to listen to parallel recordings, the contrast of their voices would be too extreme for you to mix them up. And even if you took them out of a presidential context and shuffled them in with other people’s voices, odds are you’d still be able to identify them, since, as Abel said over email, “these individuals are unmistakable from the characteristics of their voice alone.”

To recognize a voice, you have to have heard it before, and this ability gets stronger the more you’ve heard of a voice, says Neeraj Sharma, a postdoctoral researcher at the Indian Institute of Science. This makes Obama or Trump an extreme example, since being famous and constantly recorded means it’s pretty likely that listeners could pick someone out. It’s less likely that they’d recognize, say, the voice of an Alabama-based banker that the reporter Camille Petersen interviewed for a radio piece on remote workplace surveillance. But the prospect that someone could recognize him should still give you pause.

When I first listened to Petersen’s report, which she published in September 2020 for The Pulse on WHYY, I found myself questioning the source’s decision to be recorded, since he also requested to remain anonymous. (Disclaimer: I’ve also contributed reporting to The Pulse.) Here was this man, requesting that his name not be included “out of fear of punishment from his employer,” not only giving his general location and field of work but — perhaps more importantly — his voice.

In Petersen’s impression, what the source chose to disclose “gives a decent amount of privacy.” His job security would be most at risk if he were recognized by upper-level managers at his company, who presumably don’t interact with him enough to recognize him by voice alone. On the other hand, “if someone really knows you” and is “embedded” in your life, says Petersen, “I don’t think it’s really any privacy.” By my estimation, this is in line with Sharma’s understanding.

Stripping away a name does remove a direct line between an online presence (e.g., social media profiles) and a recording, though that isn’t completely airtight, either. Even without a source’s name, there is still the matter of searchable keywords. Modern audio recordings often come with digital assets, such as transcripts or articles adapted from the episodes for those who prefer to read instead of listen, and these make a recording findable by search. (This is what you see on the lower part of the page housing Petersen’s piece.)

These text-based components are a great way for producers to get in front of more audiences, who can stumble upon shows just by Googling, and written assets in particular accommodate people with impaired hearing. They become a liability, however, when the goal is to be the opposite of discoverable. Petersen’s story in particular was about surveilling remote workers; say the source’s employer suspected workplace grumblings about the practice, set out searching “Alabama” and “bossware,” found Petersen’s piece, and recognized the source’s voice. What then?

Extra care is taken to scrub identifiers from written assets for the podcast Polycurious. This is less for legal reasons than it is for social ones: On the show, host and producer Fernanda and co-host Mariah discuss their experiences with non-monogamous relationships, and “even though part of the mission of the podcast is to destigmatize those things,” says Fernanda, “I cannot ignore the reality that some people are turned off by that.”

“Fernanda” and “Mariah” are not their legal names; they chose to alter how they refer to themselves given existing stigmas about polyamory, and they extend this to written materials for the show. Fernanda’s primary concern is that the family of her boyfriend, whom she describes as very religious, wouldn’t approve of her relationship’s openness, so she wants to decrease the likelihood that his relatives could look her up and find this project. Mariah’s primary concern is stigmatization from colleagues, so, in addition to hiding her name, she doesn’t disclose the general field she works in, out of fear that someone within her industry would be able to narrow it down.

The two have collectively taken more steps than Petersen’s source did, though what they still can’t control for is the continued recognizability of their own voices, which they don’t alter in any way. “I have an accent and also an uncommon name,” says Fernanda, for whom “Fernanda” is her real first name. “I wouldn’t be surprised if people who knew me [already] knew who I was if they listened to it.”

Ironically, it’s Mariah who’s been sniffed out, even though she takes more precautions by not using her real first name or disclosing work details (unlike Fernanda, who says in the first episode that she “came to New York about four years ago to do my master’s in journalism”): In the first episode of Polycurious, Mariah mentions her partner, and after that, she says, someone she knew reached out.

“I didn’t even know it was you until you mentioned [his] name,” Mariah recalls them saying. To figure it out, all they needed was a tangential detail — paired, of course, with the sound of her voice.

Many voices aren’t all that different, yet our brains perceive the minute distinctions between them, says Sharma, and exactly how they do that isn’t completely understood. Scientists know that sonic differences signal one speaker to be taller, heavier, or a different sex than the other, but to us, once we take in these differences and process them, they just mean that one speaker is a stranger and the other is our mom. And since scientists don’t exactly know how we get from A to B, it’s hard to make computers replicate the process. But, Sharma says, “that’s where machine learning comes in.”

Programs for machine learning, which is a category of artificial intelligence, do indeed learn. In a process referred to as “training,” says Sharma, you continually feed a program recordings of one person’s voice and reinforce that the voice belongs to that particular person; eventually, upon “hearing” a novel recording of that same voice, the program would tell you, with pretty high certainty, who it belongs to.
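To make the idea of “training” concrete, here is a minimal sketch in Python. It is a toy, not the method Sharma or any real voice-recognition system uses: it assumes each recording has already been boiled down to two made-up numbers (think of them as a rough pitch and a rough resonance value), and it swaps in a very simple technique, a nearest-centroid classifier, for the neural networks real systems rely on. The speaker labels, the numbers, and the helper names (`train`, `identify`, `fake_clip`) are all hypothetical.

```python
import math
import random


def train(labeled_clips):
    """Average each speaker's feature vectors into one centroid per speaker."""
    sums, counts = {}, {}
    for speaker, features in labeled_clips:
        acc = sums.setdefault(speaker, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[speaker] = counts.get(speaker, 0) + 1
    return {s: [v / counts[s] for v in acc] for s, acc in sums.items()}


def identify(model, features):
    """Return the enrolled speaker whose centroid is closest to the new clip."""
    return min(model, key=lambda s: math.dist(model[s], features))


# Hypothetical "true" voices: two invented numbers per speaker.
TRUE_VOICES = {"speaker_a": [110.0, 500.0], "speaker_b": [210.0, 650.0]}

rng = random.Random(0)  # fixed seed so the toy data is repeatable


def fake_clip(center):
    """Simulate one recording: the speaker's typical values plus some noise."""
    return [c + rng.gauss(0, 5.0) for c in center]


# "Training": feed the program many clips, each labeled with its speaker.
training_set = [
    (name, fake_clip(center))
    for name, center in TRUE_VOICES.items()
    for _ in range(20)
]
model = train(training_set)

# A novel recording the program has never seen before.
novel = fake_clip(TRUE_VOICES["speaker_b"])
guess = identify(model, novel)
print(guess)
```

The sketch also shows the limit Sharma describes: the program can only name voices it has been fed. A clip from someone outside the training set would still be matched to whichever known speaker happens to be closest.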

For now, Sharma says, “there is no open-access website where you can take a clip and drop, and it will tell you this voice is of this person, but I won’t be surprised if somebody makes it available.”

What’s more, databases already exist to help voice-recognition technology get better at this skill, he says — and guess where those databases pull recordings from? TikTok. Eventually, someone won’t need to have ever heard your voice to be able to verify that it’s yours; a computer will have already heard it, and a person could just ask.

The melding of human and computer processes is chilling, and at its core is an overwhelmingly complex ability that we already possess. The acquaintance of Mariah who found her out, after all, wasn’t using machine learning; they were using their ear. And they’re not the only one.

“I got a Facebook message the other day from a teacher I had in high school who said, ‘I heard you on NPR. I heard your voice!’” says Petersen. “That’s interesting,” she recalls thinking. “You haven’t talked to me in ten years.”