Beyond ALL CAPS: The Multilingual Challenge of Google Expressive Captions
Jul 12, 2025
As a product designer passionate about accessible experiences, I was thrilled to discover Expressive Captions in Google Pixel's June update. After testing it extensively on my Pixel phone and diving into the inspiring behind-the-scenes interview with Angana Ghosh, I realized this feature represents something profound: the evolution from functional accessibility to emotional inclusion.
Having used live captioning applications before Pixel's built-in Live Caption feature, I've witnessed firsthand how system-level accessibility features can transform user experiences. But Expressive Captions goes beyond utility—it's about allowing users with hearing loss to feel the collective emotions that make content truly engaging.
As I tested this feature, one question dominated my thoughts: How would this work across different languages and cultures? Interestingly, this was exactly the challenge Angana discussed in her interview.
This post explores two key areas: first, the current state of Expressive Captions and its technical foundation; then, the complex challenge of scaling emotional expression across multilingual contexts—drawing from my own experiences with non-alphabetic languages.
Part I: The Technology Behind Emotional Expression
Expressive Captions builds upon Google's Live Caption foundation with a clear mission: enhance understanding by capturing emotions, feelings, and context that traditional captions miss. Google's development team collaborated with theater artists, speech-language pathologists, and deaf and hard-of-hearing community members to identify crucial audio elements that current captioning overlooks.
Key Features
The initial launch includes three core capabilities:
Intensity of Speech: High-energy speech like yelling appears in ALL CAPS to convey intensity
Example: "I can't believe it!" becomes "I CAN'T BELIEVE IT!" when shouted
Vocal Bursts: Non-verbal sounds like sighs and gasps are captured and described
Example: [sighs] or [gasps] appear in brackets to indicate emotional responses
Ambient Sounds: Background elements like music are included in captions
Example: [upbeat music playing] or [phone ringing] provide environmental context
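To make the three cues concrete, here is a minimal sketch of how they might compose into a single caption string. The function name, parameters, and the intensity threshold are my own illustrative assumptions, not Google's actual implementation:

```python
# Hypothetical sketch of rendering the three expressive cues as caption
# text. Names and the 0.8 intensity threshold are illustrative
# assumptions, not Google's actual implementation.
from typing import Optional

def render_caption(text: str, intensity: float = 0.0,
                   vocal_burst: Optional[str] = None,
                   ambient: Optional[str] = None) -> str:
    """Render one caption segment with expressive cues."""
    parts = []
    if ambient:
        parts.append(f"[{ambient}]")      # ambient sound, e.g. [phone ringing]
    if vocal_burst:
        parts.append(f"[{vocal_burst}]")  # vocal burst, e.g. [sighs]
    if text:
        # High-energy speech (e.g. yelling) is rendered in ALL CAPS.
        parts.append(text.upper() if intensity > 0.8 else text)
    return " ".join(parts)

print(render_caption("I can't believe it!", intensity=0.95))
# → I CAN'T BELIEVE IT!
print(render_caption("", vocal_burst="gasps"))
# → [gasps]
print(render_caption("Welcome back", ambient="phone ringing"))
# → [phone ringing] Welcome back
```

Even this toy version shows why the English design is elegant: all three cues compose into plain text with no extra UI.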

Expressive Captions uses three features to express the emotional moment (Source)
This represents Google's commitment to inclusive design for all—while rooted in accessibility, Expressive Captions proves useful for everyone, whether in noisy environments, multilingual contexts, or multitasking scenarios.
Part II: The Global Challenge: Designing for Every Language
Following the English launch, the real expansion challenge begins; as Angana noted, "getting multilingual right is really hard." As an international designer, I find the multilingual implications the most fascinating part.
The challenge extends beyond translation: while some expressive elements like sighs transcend language, the model is tailored to English contexts, and emotional expressions are deeply cultural.
Most critically, how do you convey emphasis in non-alphabetic languages where "ALL CAPS" doesn't apply?
This question kept me up at night. As someone who regularly communicates in both alphabetic and non-alphabetic languages, I started paying closer attention to how I naturally express emphasis in daily digital conversations. I found myself thinking beyond text messages to all kinds of multimedia in multilingual contexts.
Here's what I discovered through this reflection and research:
Chinese (中文) - Visual Emphasis Strategies
From my observations of daily Chinese digital communication, several patterns emerge naturally:
Repeated Punctuation: People instinctively use "太好了!!!" (Great!!!) or "真的嗎?!" (Really?!) for emotional intensity—it feels natural and immediately conveys excitement or surprise
Sound-Imitating Characters: Repeating characters like "啊啊啊" (ahhhhh) or "嗚嗚嗚" (wooooo) for vocal bursts—these literally represent duration and intensity
Character Spacing: "這真的太 厲 害 了" (This is really amazing) to indicate emphasis—mimicking how we might slow down speech or emphasize each syllable
Wavy Line Symbol: Using "~" to represent stretched sounds and tone like "真的好好吃哦~~" (It's so tasty!)
Japanese (日本語) - Cultural Adaptation
What strikes me about Japanese digital expression is its elegance and subtlety:
Emphasis Dots (傍点/圏点): Traditional Japanese text emphasis using small dots beside characters—elegant and culturally specific. While less common in digital interfaces, they represent a beautiful cultural approach to textual emphasis
Punctuation Repetition: 「すごい!!!」(Amazing!!!) for strong emotion—similar to Chinese, but with distinctly Japanese punctuation conventions
Korean (한국어) - Intuitive Expression
Korean digital communication has its own fascinating patterns:
Repeated Punctuation: "정말요???" (Really???) for emotional intensity
Interspersed Dots: "진.짜.개.힘.들.다." (This is really freaking hard) with dots between characters for emphasis—mimicking how someone might speak each syllable with emphasis
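The patterns above could be sketched as language-specific emphasis renderers. This is a deliberately naive toy to illustrate the design idea; real captioning would need culturally tuned models, and the language codes and rules here are my own assumptions:

```python
# Toy sketch: language-specific ways to render "high-intensity" speech,
# based on the digital-communication patterns described above.
# Purely illustrative assumptions, not a proposed implementation.

def emphasize(text: str, lang: str) -> str:
    """Apply a culture-specific emphasis style to caption text."""
    if lang == "en":
        return text.upper()          # English: ALL CAPS
    if lang == "zh":
        return " ".join(text)        # Chinese: character spacing, 厲 害
    if lang == "ja":
        return text + "!!!"          # Japanese: repeated punctuation
    if lang == "ko":
        return ".".join(text) + "."  # Korean: interspersed dots, 진.짜.
    return text                      # unknown language: pass through

print(emphasize("amazing", "en"))      # → AMAZING
print(emphasize("真的太厲害了", "zh"))  # → 真 的 太 厲 害 了
print(emphasize("すごい", "ja"))        # → すごい!!!
print(emphasize("진짜힘들다", "ko"))    # → 진.짜.힘.들.다.
```

Even this crude mapping makes the design problem visible: the English branch is one string method, while every other branch encodes a cultural convention that a model would first have to learn is appropriate for that moment.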
Opportunities for Global Expression Design
What became clear through this exploration is that each language has developed sophisticated ways of conveying emotion in digital text. The challenge isn't just technical—it's deeply cultural and intuitive.
Cultural Adaptation of "High Energy"
Each language needs re-trained models that understand what constitutes "high energy" speech in that specific cultural context. What sounds intense in English may be normal conversation volume in another culture.
Organic Visual Cues
Rather than forced translations, we need culturally intuitive visual cues that feel natural within each language context, based on how people already express emphasis digitally.
Balanced Information Hierarchy
Captions must maintain readability in live streams where users process information quickly. The emphasis should enhance understanding without stealing the spotlight from the content.
Cross-Media Inspiration
If we think beyond traditional captioning conventions, subcultures have already solved similar emotional expression challenges in fascinating ways. Manga, for instance, has developed sophisticated visual languages—enlarged text for shouting, speed lines for urgency, varied fonts for different emotional states. Live streaming platforms use danmaku (bullet comments) where real-time emotions are expressed through creative spacing, symbol repetition, and playful typography. While these approaches may not directly translate to captions, they offer valuable inspiration for how different cultures have organically developed emotional visual languages that feel native to their media consumption habits.
Looking Ahead: The Future of Immersive Experience
By building Expressive Captions at the system level, Google is leading users toward more immersive experiences that truly embody "hear what you hear, see what you see." This represents a fundamental shift from treating accessibility as an afterthought to making it a core part of the user experience.
Beyond eagerly awaiting multilingual versions of Expressive Captions, I'm excited about potential deeper integration with the app ecosystem. Currently, captions are overlaid on top of media apps, sometimes blocking content. Imagine captions that could sense the content layout and avoid covering it, or interact seamlessly with different applications to enhance rather than interrupt the viewing experience. The opportunities are endless and exciting.