The Future of Closed Captioning
Abstract and Keywords
This chapter looks to the future of closed captioning by discussing three areas in terms of universal design: video search, search engine optimization, and findability; pedagogy, interactive transcripts, and fully searchable lecture archives; and mainstream uses of captioning and subtitling that suggest a much wider role for captioning. In addition to interactive transcripts and search engine optimization, closed captioning enters the mainstream through a variety of channels: enhanced episodes, easter eggs, caption fails, animated GIFs, parody videos, creative or humorous captioning, fictional captions and Internet memes, occasional English subtitles, direct “fourth wall” references to the captions, and animated captions.
Reading Sounds grew out of my fascination with closed captioning, both as a daily user and a scholar interested in reading and writing practices. Closed captioning remains an endless source of curiosity, delight, and, sometimes, frustration to me. I set out to break new ground in caption studies. Much work still remains. I hope this book can serve as a roadmap for future scholars and others interested in contributing to caption studies and, specifically, to our understanding of how closed captions create meaning, interact with the soundscape, and interface with readers. Our current definitions of caption quality, which have focused on accuracy, completeness, and formal issues such as typeface preferences and screen placement, only scratch the surface. While accuracy remains a key concern for captioning advocates (e.g., see YouTube’s heavily criticized automatic speech recognition and transcription technology), we shouldn’t stop short of trying to make sense of how captions transform the soundscape into a new text, just as subtitling theorists have explored the ways in which foreign language subtitling creates new texts through translation (cf. Nornes 2007, 15).
Closed captioning has too often served as an afterthought. The task of captioning a movie or television show is typically handed off to an independent company or team. This disconnect between producer and captioner does a disservice to everyone involved but especially to those readers who depend on quality closed captioning.
Reading Sounds complements existing books on closed captioning, including Gregory John Downey’s (2008) historical study, Karen Peltz Strauss’s (2006) legislative and economic history of telecommunications access, and Gary D. Robson’s (2004) technical account of captioning equipment, standards, and specifications. It offers a humanistic rationale that moves us beyond captioning as a set of prescriptions or standards and toward a more flexible view that recognizes the creative work of captioners, the mediating influence of the captions, and the interpretative power of readers. Closed captioning is a rhetorical practice because it involves human choices about the best course of action to take under specific contexts and constraints of space and time. The most important decisions about meaning, context, and significance can’t be reduced to a list of decontextualized prescriptions. Likewise, an ideology that presents captioning as simple transcription—so easy, even speech recognition software can do it!—dissuades us from exploring the complex nature of translating sounds into written texts for the purposes of timed reading. This complexity is reflected, as I have argued in these pages, in the degree to which captions contextualize, clarify, formalize, equalize, linearize, time-shift, and distill the soundscape.
My own perspective has been admittedly partial and biased. As a hearing advocate and parent of a deaf child, I have come to embody the perspective of universal design that, regardless of hearing status, closed captioning can potentially benefit all viewers. I wanted to focus this book on some of these benefits, because the dominant view (captioning as mere transcription) has left little room to discuss the differences between listening and reading. I’ve avoided discussing glaring errors or so-called “caption fails,” even though the topic of accuracy is more important than ever in the era of autocaptioning. I haven’t devoted much space to late, ill-timed captions either, even though delays of five to seven seconds are standard for live programming such as television news. Late captions are the flip side of what I have called “captioned irony” (chapter 6), but I wanted to focus on captioned irony over caption fails and slow captions because reading ahead and having advance knowledge are new contributions to our understanding of how captions make meaning.
Even if we don’t watch closed captions, closed captions are watching us. The future of Internet video is being built on a foundation of robust closed captioning (or, more specifically, subtitling). Because closed captions on the Web are stored as separate plain text files, they can be fed to search engines and retrieved by keyword searches. Search engines are not very good at indexing the content of audio or video files. Indeed, Google’s search engine has been metaphorically compared to both a blind and a deaf user (see Chisholm and May, 2009, 14). But search engines thrive on plain text: tags, keywords, text descriptions, text transcripts, and, of course, closed captions (Ballek 2010; Stelter 2010; Sizemore 2010). Google can process and index closed captions, allowing users to search for content inside YouTube videos that is difficult for search engines to process without the textual information that captions provide. No wonder search engine optimization (SEO) consultants promote closed captioning as a way for their clients to increase their search engine rankings and bring more visitors to their sites (e.g., Ballek 2010; Sizemore 2010). In education, closed captioning allows students to search for content inside recorded lectures. Without the added benefit of searchable text captions, students would have to manually scan lecture videos looking for that one example, anecdote, or solution that they vaguely remembered from class but cannot locate quickly or easily in the recorded video lectures.
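The mechanism behind this indexability is worth making concrete: because caption formats such as SRT are plain text, the searchable content can be recovered simply by stripping away cue numbers and timestamps. A minimal sketch in Python (the function name and sample captions are my own illustration, not any search engine's actual pipeline):

```python
def srt_to_plain_text(srt: str) -> str:
    """Strip cue numbers and timestamp lines from an SRT caption file,
    leaving only the caption text that a search engine could index."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line:
            continue            # blank separator between cues
        if line.isdigit():
            continue            # cue number
        if "-->" in line:
            continue            # timestamp line
        kept.append(line)
    return " ".join(kept)

sample = """1
00:00:01,000 --> 00:00:04,000
Welcome to the lecture on closed captioning.

2
00:00:04,500 --> 00:00:08,000
Captions transform the soundscape into text."""

print(srt_to_plain_text(sample))
```

Every word a speaker utters in the video thus becomes ordinary text, which is exactly what keyword search is built on.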
Interactive transcripts raise the value of captions further by allowing users to click on a single word in a video transcript and be transported to that moment in the accompanying video where that word is spoken. I first became aware of, and then immediately recognized the immense game-changing power of, interactive transcripts on TED.com. Each word in an interactive transcript is time-stamped and clickable. The transcript is fully searchable and automatically scrolls in time with the video. Individual words are highlighted as they are spoken. Because captions on TED.com are crowdsourced out to regular users (TED 2014), many of the videos on the site are available in an impressive number of languages. One could, for example, listen to Aimee Mullins (2009) speaking in English, read the captions in a second language such as French, and browse the interactive transcript in a third language such as Japanese. (Or one could simply load captions and interactive transcript in English, which is what I do.) In the case of Mullins’s (2009) TED talk, (p.294) users can choose from thirty-two languages. YouTube also offers interactive transcripts for the captioned videos in its collection (Chitu 2010). Companies such as 3Play Media and ProTranscript also provide, as part of their regular video transcription service, a video player plugin that serves up interactive, clickable transcripts alongside closed captions. In addition, 3Play Media supports “archive searching” across a website’s video collection, and has developed a “clipping plugin” that allows users to “[c]lip video segments simply by highlighting the text. Rearrange clips from multiple sources and create your own video montages” (3Play Media). The video clipping plugin will output a web link for sharing montages with other users. 
When these users are college students attending the same university, or enrolled in the same course, the video montage—fully accessible because it is built on closed captions—could be a powerful, accessible learning tool indeed. In these ways, then, captioning serves as a potential common ground upon which video indexing and retrieval are possible. Without captions, users are not able to search inside videos or make connections between videos based on keyword searches. Captioning solves the problem of video indexing by transforming video into something that search engines thrive on: plain text. Even if we don’t watch web video with closed captioning, and even if some web videos are not yet captioned, captioning is already playing a vital role in creating the future of web video search. Captioning advocates must ensure that autocaptions generated by imperfect speech recognition technologies are not used to index the content of web videos. A fully searchable web must be built on accurate captions, not automated and less-than-perfect ones.
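The data structure underlying an interactive transcript is simple: each cue carries a start time, and clicking the transcript text seeks the player to that time. A hedged sketch, assuming standard SRT input (`parse_transcript` and `seek_time_for` are illustrative names, not the API of TED, YouTube, or 3Play Media):

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def parse_transcript(srt: str):
    """Parse SRT blocks into a list of (start_seconds, text) cues."""
    cues = []
    for block in srt.strip().split("\n\n"):
        start, text_lines = None, []
        for line in block.splitlines():
            line = line.strip()
            if "-->" in line:
                h, m, s, ms = map(int, TIMESTAMP.match(line).groups())
                start = h * 3600 + m * 60 + s + ms / 1000
            elif line and not line.isdigit():
                text_lines.append(line)
        if start is not None:
            cues.append((start, " ".join(text_lines)))
    return cues

def seek_time_for(cues, query: str):
    """Return the start time of the first cue containing the query --
    the moment a click on that word would seek the video player to."""
    q = query.lower()
    for start, text in cues:
        if q in text.lower():
            return start
    return None

sample = """1
00:00:01,000 --> 00:00:04,000
Welcome to the lecture.

2
00:00:04,500 --> 00:00:08,000
Captions transform the soundscape."""

cues = parse_transcript(sample)
print(seek_time_for(cues, "soundscape"))  # → 4.5
```

Time-stamping at the level of individual words, as TED.com does, is the same idea at a finer grain: more timestamps, smaller units of text.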
Interactive transcripts already provide users with an excellent way to search for and find information within a single video. As video becomes more popular and captioning technology provides a way to index large databases of video content, students will be able to search the video collection of an entire course, or even across all of the videos produced in all of the courses of a department, college, or university. In this near-future learning environment, captions will enable students to use keywords not only to find and review course content across multiple videos but also to insert their own “margin” notes, which could take the form of time-stamped text comments or pop-up idea bubbles (as at BubblyPly.com), their own video responses or notes produced on the fly with their web cams, links to other related video moments in the course’s video collection, links to external web resources, and comments from other students that have been made public. This added content may or may not be searchable/captioned, but it would at least be tagged and easier to find as visible nodes in the student’s personalized video stream. The instructor’s lecture videos would thus be transformed into the student’s personalized study guide and an opportunity for collaborative learning. In addition, keyword searches would not return simply a list of matching video clips but also, perhaps, a single mash-up composed of all the clips that satisfied the search query, plus any accompanying student commentary. The inherent limitations of uncaptioned video would thus be addressed by a robust video captioning and search system that allows students to personalize and reconfigure the content of a course according to their needs. The promise of universal design could be achieved, in other words, by an accessible system that levels the playing field for all students—deaf, hard of hearing, and hearing.
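In search-engine terms, the cross-course search imagined here amounts to building an inverted index over caption text: each word maps to the videos, and moments within them, where it is spoken. A minimal sketch under that assumption (the video IDs and captions below are invented for illustration):

```python
from collections import defaultdict

def build_index(captioned_videos):
    """Map each word to the (video_id, start_time) moments where it occurs.
    `captioned_videos` is {video_id: [(start_seconds, caption_text), ...]}."""
    index = defaultdict(list)
    for video_id, cues in captioned_videos.items():
        for start, text in cues:
            for word in text.lower().split():
                index[word.strip(".,!?")].append((video_id, start))
    return index

def search(index, keyword):
    """Return every matching clip across the collection -- the raw
    material for a keyword-driven montage of course videos."""
    return index.get(keyword.lower(), [])

# Invented example data: two lecture videos with time-stamped captions.
videos = {
    "lecture-01": [(12.0, "Recursion solves problems by self-reference."),
                   (45.0, "Here is a recursion example.")],
    "lecture-02": [(30.0, "Dynamic programming builds on recursion.")],
}
index = build_index(videos)
print(search(index, "recursion"))
```

A query returns clips from across the collection in one pass; stitching those clips together, plus any tagged student commentary, is the “mash-up” described above.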
We need to continue to push for and applaud advances in captioning technology that will leverage the power of searchable text to provide a more inclusive, more accessible learning environment for our students. While it is naïve to think that a fully accessible library of university lecture videos is cheap or easy to achieve, it is nevertheless important for web accessibility advocates to continue to promote all of the reasons (ethical, legal, business, user-centered, etc.) that accessibility makes sense for our students and our pedagogies. As the number of distance learning, video-enriched courses grows on our college campuses, educators and students will require solutions that combine the richness of video with the data mining benefits of text-based captions and transcripts.
When closed captioning is seen as a widely desirable and vital component of the digital infrastructure—even if it sometimes operates below the surface, as in video search optimization—it becomes naturalized, marked not merely as a special accommodation but as a universal benefit and right. In this way, closed captioning can potentially “trouble the binary between normal and assistive technologies” (Palmeri 2006, 58) by showing us how captioning can serve all of us. To put it another way, when captioning supports universal design principles, it reminds us that “all technologies [are] assistive” (58). In addition to interactive transcripts and search engine optimization, closed captioning enters the mainstream through a variety of channels: enhanced episodes, easter eggs, caption fails, animated GIFs, parody videos, creative or humorous captioning, fictional captions and Internet memes, occasional English subtitles, direct “fourth wall” references to the captions, and animated captions.
Enhanced episodes and “pop-up” bubbles on television shows make use of onscreen text (similar in form to subtitles) to provide additional information to viewers about an episode or music video. VH1’s “Pop Up Video” program is perhaps the best-known use of onscreen textual enhancement, followed by enhanced episodes of the TV show Lost. An example of an enhanced caption on Lost: “And based on the iteration count in the tower the survivors have been on the island for 91 days” (Lostpedia n.d.). These enhancements are not closed captions because they are not intended to stand in for audio content but rather to supplement it with additional information. Nevertheless, enhanced episodes place similar cognitive demands on readers and may raise awareness of the needs of readers who depend on closed captioning. Media: http://ReadingSounds.net/chapter9/#enhanced.
Easter eggs, while rare in captioning, may nevertheless increase the level of interest in captioning among the general public. Dawn Jones, who runs the “I Heart Subtitles” blog, writes about a 2013 episode of BBC’s Sherlock that included, in the upper left corner of the subtitle track, “letters that acted as clues to viewers and was part of the promotion to encourage repeated viewing and speculation about the new series” (Jones 2013). As the letters H-I-S appeared one at a time over the course of the episode, they spelled out a hint that only viewers who were watching with captions could see. Media: http://ReadingSounds.net/chapter9/#eastereggs.
Caption fails enter the mainstream through discussions of the limits of autocaptioning. For example, Rhett and Link’s “caption fail” videos on YouTube average one million views each. Rhett and Link, two self-described “Internetainers,” describe their comedic experiment with autocaptioning this way: “… we use YouTube’s audio transcription tool, which doesn’t always do the best job translating. We write a script. Act it out, then upload it. Let the tool translate it. Then make that into a script. Act it out. Upload it. Then let that be made into a script” (Rhett and Link 2011). The results are always hilarious and remind us that autocaptioning is still in its infancy and should never be a substitute for human-generated captions. People who wouldn’t otherwise be aware of closed captioning, let alone autocaptioning, are introduced to it through an entertaining experiment that also contains an informative lesson about the imperfect state of speech recognition technology. Media: http://ReadingSounds.net/chapter9/#captionfails.
Animated GIFs, when they include text captions from TV shows and movies, require readers to read lips and read captions at the same time, thus mimicking in a small way how deaf and hard-of-hearing people process information on the screen. An animated GIF is an image format composed of a series of frames that simulate the movement of video when the frames play in sequence and loop automatically. The GIF file format does not support sound. When short clips from TV shows or movies are made into animated GIFs, the official closed captions may be included. Alternatively, verbatim text can be added by the GIF author in the style of meme text (Impact typeface, black stroke with white fill, sometimes set in all caps). For example, see the popular animated meme of Ron Burgundy (Will Ferrell) saying/mouthing “I don’t believe you” in Anchorman (2004). As discussed in chapter 6, animated GIFs encourage a kind of lip-reading culture that positions closed captions as vital to the process of making sense of animated GIFs for all viewers. Media: http://ReadingSounds.net/chapter9/#animatedgifs.
In parody videos, subtitling may play an instrumental role. In “Bad Lip Reading” videos, for example, new audio is dubbed over the original audio and synched up roughly with characters’ lip movements. One such video, “The Walking (and Talking) Dead—A Bad Lip Reading of The Walking Dead” (Bad Lip Reading 2013), dubs a new audio track in which characters appear to be saying such ridiculous things as “Hey, do you remember that costume party? / You went as a penguin / And I went as a pink shark.” Subtitles assist in delivering the parody, which is why they are enabled or burned in by default. Other examples of parody that rely on subtitling include “literal music videos,” in which new lyrics are written and performed to accompany the visuals in the official music video. The original literal music video was Dustin McLean’s (2008) rewrite of A-ha’s “Take on Me.” Countless imitators and new literal music videos have followed, always with open subtitles. The wildly popular Downfall parody videos should be included here too. Each parody video presents new English subtitles for a scene from Downfall (2004), a movie about Hitler’s final days in which Hitler (Bruno Ganz), speaking in German, goes on an animated rant. The Downfall parodies have become “so ubiquitous on YouTube that they have even spawned self-referential meta-parodies—jokes about Hitler learning about his internet fame” (“Hitler Downfall Parodies” 2009). Video parody and subtitles seem to fit together seamlessly. I’ve tried my hand at subtitle-delivered parody in a playful analysis of the limited vocabulary of one talk show host. Media: http://ReadingSounds.net/chapter9/#parody.
Creative or humorous captioning grabs the attention of viewers who may then be compelled to post screenshots on social media sites such as Reddit. I shared one such example in chapter 5, originally posted to Reddit, of a single closed caption from The Tudors that lingered on the screen during a commercial for IKEA (jeredhead 2013). The poster’s screenshot captured the ironic tension between a lingering adult-oriented caption, [Sexual moaning], and the image of a young boy riding his trike in an IKEA kitchen. Adolescent humor may be perceived as more novel when it is encountered in closed captioning, at least for those viewers who are not accustomed to watching TV with captions. A Reddit user (secularflesh 2013) describes waking his girlfriend with his “adolescent giggling” after reading a captioned description of a bodily function in a documentary: (COLOBUS MONKEYS FART). Other popular captions posted to social media sites include [silence] and [silence continues] from The Artist (2011). Silence captions call into question our assumptions about the very nature of captioning itself and thus raise awareness of captioning’s potential complexity and value. To this category I would also add experimental forms of textual representation. For example, Accurate Secretarial, a transcription and web captioning company, has been experimenting with new forms of textual representation on their YouTube channel, “The Closed Captioning Project,” including new notational systems for music captioning. Media: http://ReadingSounds.net/chapter9/#creativecaptioning.
Fictional captions and internet memes call attention to the cognitive work that captions perform. The “[Intensifies]” meme can help us think through the problem of how to adequately describe modulating noise (see Know Your Meme, “[Intensifies]”). But the “[Intensifies]” meme, which often makes use of a vibrating GIF image and a static subtitle, describes actions that aren’t clearly linked with sounds at all. For example, the caption [Shrecking Intensifies] (sic) accompanies a vibrating GIF image of Shrek and Donkey from the Shrek movies. The animation culminates in an increasingly distorted image of the two characters. Another example of this meme is [TARDIS INTENSIFIES], a subtitle that describes a vibrating image of the time machine police call box from Doctor Who. Fictional captions continue this theme of describing actions and playing on characters’ personality quirks or traits: an image of a distraught Spock (Leonard Nimoy) from Star Trek [SOBBING MATHEMATICALLY]; an image of an emotional Eleventh Doctor (Matt Smith) from Doctor Who as he [ANGRILY FIXES BOW TIE]; and an animated image of a seething John Dorian (Zach Braff) from Scrubs as he [screams internally]. These examples are intended to be playful and even absurd. In some cases, they don’t provide access to sounds at all. In others, they exploit characters’ personality or sartorial traits. Media: http://ReadingSounds.net/chapter9/#memes.
Occasional English subtitles in English programming remind hearing viewers that onscreen transcriptions of what people are saying can benefit all viewers regardless of hearing status. Consider three examples: On the DVD for Snatch (2001), viewers can select a “pikey” track, which only displays subtitled translations of the hilariously thick English accent of Mickey O’Neil (Brad Pitt). On Swamp People (2013), an American reality TV series that follows a group of alligator hunters in Louisiana, the English but accented speech of these Cajun hunters is sometimes accompanied with hard-coded subtitles. On season 12 of Project Runway (2013), one of the contestants, Justin LeBlanc, wears a cochlear implant, identifies himself as deaf, and is accompanied on the show by a sign-language interpreter. His speech is also subtitled early in the season, much to the confusion of some viewers who didn’t feel his clear English speech warranted English subtitles. One blog writer, in a recap of an early episode in this season, directed a question to the show’s producers: “Why are you subtitling Justin when Sandro [another contestant] is much harder to understand?” (Toyouke 2013). One could ask the same question of the producers who unnecessarily subtitled the clear English speech in Swamp People. Later in the Project Runway season, after the producers presumably stopped subtitling LeBlanc’s speech, LeBlanc himself tweets: “Yay for no subtitles under me! haha @ProjectRunway” (LeBlanc 2013). The examples from Swamp People and Project Runway suggest that decisions about subtitles may be driven, in part, by the mere perception of difference rather than any real need to provide access to accented English speech. These subtitles reinforce differences by marking certain speakers as not normal. Media: http://ReadingSounds.net/chapter9/#englishsubs.
Medium awareness applies to captioning when speakers make explicit references to the subtitle track, something that fictional characters aren’t supposed to be aware of. For example, in a 2012 Dairy Queen commercial, the speaker literally hops on the subtitle track and rides it as it chugs off the screen. “These aren’t just subtitles,” he says. “These are subtitles I like to ride on.” This example and others elevate subtitles to a topic of discussion. Subtitles become integral, meaningful elements of the program in their own right. They don’t support or translate the primary meaning of the program, or try to sit unobtrusively at the bottom of the screen. Instead, they make their own meaning. We are asked to look at them, not merely look through them. These examples break through the so-called fourth wall. The imaginary fourth wall separates the audience from the action on the screen or stage. When the audience suspends its disbelief, the events are taken as real and believable. When fictional characters show an awareness of the medium (e.g., by talking directly into the camera, commenting on the soundtrack, bumping into or referring to the subtitles, etc.), they break through the fourth wall that enables the audience’s suspension of disbelief. Put simply, fictional characters are not supposed to see subtitles. When they do, it’s usually in the service of a joke. Media: http://ReadingSounds.net/chapter9/#mediumawareness.
Animated and enhanced captioning, similar to creative captioning, draws our attention to innovative and experimental forms, including kinetic typography and visual captions. For example, Raisa Rashid et al. (2008) have tested animated text captions with hearing and hard-of-hearing participants, comparing traditional captioning with “enhanced” and “extreme” forms. In the enhanced condition, kinetic type animates select words in the caption to signal emotional content. In the extreme condition, “some of the text was animated dynamically around the screen while static captions were displayed at the bottom of screen” (509). Participants found the enhanced captions to be preferable to both the traditional and extreme forms (516). Quoc Vy and Deborah Fels (2009) have explored the use of visual captions as a way to aid readers in identifying who is speaking (see also Vy 2012). Specifically, Vy and Fels (2009) tested the use of headshots (or avatars) decked out with color-coded borders. The avatars were placed alongside their respective captions, and traditional screen placement techniques were used as well. Results were mixed. Some participants felt “overwhelmed with the amount of information available on screen” (919). Nevertheless, these experiments need to continue if only because “closed captioning guidelines have remained virtually unchanged since the early days, and these guidelines actually discourage the use of colors, animation, and mixed case lettering in captions” (Rashid et al. 2008, 506). The traditional all-caps and center-aligned environment needs to be infused with alternatives grounded in healthy critique and a willingness to test out new forms. I’ve explored animated captioning in a revision of one scene from The Three Musketeers (2011) in which a character is drugged and experiences the speech of others as distorted and echoic. How might we visualize sonic distortion in the captions themselves?
Using Adobe After Effects, I applied a small number of text treatments to the captions in this scene, while being mindful of the need to make the captions legible above all else. Finally, caption studies might address other representational challenges, such as how to embody sonic directionality and perspective when expressed, for example, in stereo or surround-sound. In one scene from BloodRayne 2: Deliverance (2007), someone or something is moving quickly outside a family’s cabin. The mother follows the sound with her head, first turning her head from the viewer’s right to left as the sound is captioned as (wind rushing), and then turning her head back again as the sound is captioned as (rushing continuing). The children also follow the sound with their heads, thus providing a number of visual clues to the sound’s movement. The rushing wind turns out to be vampiric Billy the Kid. But these captions don’t capture the directionality of the whooshing force as it moves from the listener’s right ear to left ear and back again. The wind isn’t rushing; it’s moving with intentionality. Captioners must account for the ways in which film sound is dimensional and stereophonic and seek to counter the monophonic world suggested by the typical caption file, a world in which every sound is centered, static, fully present, and equally loud. Media: http://ReadingSounds.net/chapter9/#animated.
When captioning enters the mainstream, even if an author’s intentions are satiric or absurd, captioning becomes more natural and less strange, more universal and less marginal, more central to our theories, pedagogies, and viewing habits and less likely to be overlooked or forgotten. In short, the more often we see or hear about captioning in the mainstream, the less often it becomes something we can write off as the purview of a seemingly narrow group. My hope is that closed captioning will be increasingly folded into and inform our scholarship on reading and writing practices, multimodality, and the future of Internet video. Accessibility is more than a transcript, afterthought, legal requirement, or set of prescriptions.