Voice Design

By Sarah Wilson (and Claude)

Describe a character in plain English. Get a voice back.

The code takes a description (e.g. "stone giant") and generates a voice. It maps your words to acoustic properties such as pitch, breathiness, and texture, using mappings drawn from my MSc dissertation, which statistically analysed the voices of robots and aliens in TV and film.

Ten years ago I didn't have the technical skills to build this, but I've been using Claude Code to bring it to life.

Try the demo (note: it can take a while to load the first time) or play the example voices below.

Hear some examples

"happy female ghost"

"elderly evil creature"

"small curious tree"

Some things to try

The code works by matching words to acoustic traits — it's not a language model (yet).

Descriptions where the traits agree with each other

  • "ancient stone giant" — Deep, slow, rough-textured. Think something that has been around since before plumbing.
  • "warm charismatic digital assistant" — Smooth, measured, and oddly likeable.
  • "young nervous robot" — Higher-pitched and slightly too fast. Definitely overthinking it.
  • "angry ghost" — Breathy and volatile. Not ideal for customer service, but atmospheric.

Descriptions that create more ambiguous results

  • "large young" — size pulls pitch down, youth pulls it back up. You'll land somewhere in the middle.
  • "warm evil" — warmth and menace fight over the same acoustic quality (voice clarity). The result sits awkwardly between the two.
  • "charismatic ghost" — charisma needs a clean, resonant voice; a ghost is defined by breathiness and absence. One will win.
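
One way to picture why these descriptions land in the middle: if each recognised word contributes a signed nudge to a trait and the nudges are averaged, opposing words partly cancel. The sketch below is a hypothetical illustration, not the project's actual code. The rule table, the semitone values, and the choice of averaging are all assumptions.

```python
# Hypothetical pitch nudges in semitones; the values are invented for illustration.
PITCH_SHIFT = {"large": -4.0, "giant": -6.0, "young": 3.0, "small": 4.0}


def blended_pitch_shift(description: str) -> float:
    """Average the pitch nudges of every recognised word (0.0 if none match)."""
    words = description.lower().split()
    shifts = [PITCH_SHIFT[w] for w in words if w in PITCH_SHIFT]
    return sum(shifts) / len(shifts) if shifts else 0.0


print(blended_pitch_shift("large"))        # -4.0
print(blended_pitch_shift("large young"))  # -0.5: size and youth mostly cancel
```

Averaging is only one possible policy. A rule where the stronger trait overrides the weaker one would behave more like the "one will win" case.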

Stick to concrete nouns and clear adjectives. Words like stone, metal, ancient, nervous, giant trigger specific acoustic rules. Abstract or literary descriptions ("melancholy Victorian librarian", "the feeling of autumn") won't match anything and will fall back to the default voice. It also doesn't know how specific people sound, so asking for "Morgan Freeman" won't do much.
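
To make the fallback behaviour concrete, here is a minimal sketch of keyword matching, assuming a simple lookup table. The keywords echo the examples above, but the trait names, the numbers, and the function itself are hypothetical and do not come from the project.

```python
import re

# Hypothetical rule table: keyword -> trait adjustments (values are invented).
RULES = {
    "stone": {"roughness": 0.5},
    "ancient": {"rate": -0.25, "pitch": -0.25},
    "nervous": {"rate": 0.25, "pitch": 0.25},
    "giant": {"pitch": -0.5},
}
DEFAULT_VOICE = {"pitch": 0.0, "rate": 0.0, "roughness": 0.0}


def match_traits(description: str) -> tuple:
    """Return (traits, matched_words); unmatched input yields the default voice."""
    words = re.findall(r"[a-z]+", description.lower())
    traits = dict(DEFAULT_VOICE)  # copy so the default is never mutated
    matched = []
    for word in words:
        if word in RULES:
            matched.append(word)
            for trait, delta in RULES[word].items():
                traits[trait] += delta
    return traits, matched


print(match_traits("ancient stone giant"))
print(match_traits("the feeling of autumn"))  # no keywords, so the default voice
```

Because nothing here understands meaning, a phrase with no listed keyword changes nothing, which is why literary descriptions end up at the default.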

A few other caveats

  • Cold start: The page itself may take a minute or more to build before you can interact with it. Once it is live, the first generation takes a further 20–40 seconds while the speech model loads ~1 GB of weights. Subsequent generations are faster. I haven't paid for faster compute.
  • Three base voices: All output starts from one of three presets: male, female, and a more neutral but still male-leaning option. Unless you specify gender, descriptions default to the neutral preset. True gender neutrality is a known challenge for open-source TTS models.
  • Small research sample: The acoustic mappings come from a dataset of 93 characters, most of them male, so the mappings are likely less reliable for other kinds of voice.
  • Text-to-audio artefacts: The model (Bark) produces slightly different output each run and occasionally adds unexpected sounds or trails off mid-sentence.
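
The base-voice rule described above could be sketched roughly as follows. The preset labels and the gender word lists are placeholders, since the page does not say which Bark speaker presets the demo uses.

```python
# Placeholder preset labels; the demo's real Bark presets are not stated.
PRESETS = {
    "male": "preset_male",
    "female": "preset_female",
    "neutral": "preset_neutral",
}
# Hypothetical trigger words for each gendered preset.
FEMALE_WORDS = {"female", "woman", "girl", "she"}
MALE_WORDS = {"male", "man", "boy", "he"}


def choose_preset(description: str) -> str:
    """Pick a base voice; fall back to the neutral preset if no gender is given."""
    words = set(description.lower().split())
    if words & FEMALE_WORDS:
        return PRESETS["female"]
    if words & MALE_WORDS:
        return PRESETS["male"]
    return PRESETS["neutral"]


print(choose_preset("happy female ghost"))  # preset_female
print(choose_preset("small curious tree"))  # preset_neutral
```

Matching whole words rather than substrings matters here, because "female" contains "male".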

About the research

The acoustic mappings come from my MSc dissertation at the University of Sheffield, which analysed 93 fictional character voices from TV and film. A few things the data showed that weren't obvious going in:

  • Evil characters don't just have lower-pitched voices. Their voices are measurably noisier — higher shimmer, jitter, and harmonic distortion. The chaos is in the signal, not just the frequency.
  • Personality (good, evil, or neutral) could be predicted from acoustic measurements alone with up to 67% accuracy.
  • The acoustic difference between a robot voice and an alien voice is not statistically significant. They sound more similar than you'd expect.
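
For readers unfamiliar with the terms: jitter measures cycle-to-cycle variation in pitch period, and shimmer measures cycle-to-cycle variation in amplitude. The sketch below uses the common "local" definitions (mean absolute difference between consecutive cycles, divided by the mean). Whether the dissertation used exactly these variants is an assumption, and the sample numbers are invented.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to the mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def jitter(periods_s):
    """Local jitter from a sequence of glottal cycle durations in seconds."""
    return local_perturbation(periods_s)


def shimmer(amplitudes):
    """Local shimmer from a sequence of per-cycle peak amplitudes."""
    return local_perturbation(amplitudes)


steady = [0.010, 0.010, 0.010, 0.010, 0.010]  # invented: perfectly regular cycles
noisy = [0.010, 0.012, 0.009, 0.011, 0.008]   # invented: irregular cycles

print(jitter(steady))                  # 0.0
print(jitter(noisy) > jitter(steady))  # True: the irregular voice has more jitter
```

A "noisier" voice in this sense can have the same average pitch as a clean one; what differs is how much each cycle deviates from its neighbour.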

The research was published in the proceedings of the 1st International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots (2017): Robot, Alien and Cartoon Voices: Implications for Speech-Enabled Systems (p.42).

Python · Bark TTS · WORLD vocoder · Gradio · Hugging Face Spaces