Captioning

Last updated: 10/29/2024

Caption quality can have a significant impact on a user's ability to receive audio information equivalently through captions. With that in mind, the Web Content Accessibility Guidelines (WCAG) link to two caption guides, intended to help developers provide users with the best possible caption experience.

While you're encouraged to explore those resources in depth, we've also captured their principal guidance here for your convenience, first as a checklist summary and then in the more detailed explanations that follow.

Summary

Note: some checklist items contain links to corresponding, detailed information provided later in this page; you'll also find select BBC, DCMP and WCAG citations provided as links within the checklist and detailed explanations.

  • 100% accurate
  • Includes non-speech audio that is intended to convey information
  • Includes speaker identification as appropriate
  • No caption segment should be shorter than ~1.33 seconds or capture more than 6 seconds of audio information
    • I.e., segments should display long enough to be read comfortably, but not so long that they inhibit readability or a viewer's ability to connect specific audio to specific moments in the media
  • No more than 2 lines of text in any caption segment (#3, DCMP: Caption Placement)
    • However, 3 lines of text is acceptable if the first line is speaker identification and no important visual information or user interface components are blocked/obstructed
  • Try to divide caption segments (i.e., determine where one segment ends and the next begins) by punctuation and/or natural pauses in speech
  • Do not include part of one sentence and part of another sentence in a single caption segment
    • Unless it's necessary to avoid an especially short sentence displaying for less than ~1.33 seconds
  • When determining where to manually insert line breaks within caption segments, try to keep parts of speech together
    • E.g., try to keep subjects and their verbs, modifiers and what they modify, etc. in the same line; try not to separate parts of prepositional phrases (emphasis on try; this may not always be possible)
  • Do not include gaps of less than 1-1.5 seconds between caption segments (as it creates a flicker/strobe effect)
    • If a gap of less than 1-1.5 seconds would exist between caption segments, extend the first segment so it ends right when the latter segment begins
  • Captions should not block/obstruct important visual information or user interface components within a video or eCourse slide (4th note, WCAG captions definition)

In More Detail

Accuracy

WCAG does not provide much detail on required caption accuracy. However, the FCC has interpreted this regulatory language on closed captioning of television video programming to apply to some online video clips as well, and that language covers caption accuracy expectations and requirements in more detail.

Non-speech Audio

If you've ever watched a horror movie, you know non-speech audio can be intended to convey information. Examples include an especially discordant background score, intended to build tension, or footsteps from the floor above, intended to convey that the protagonists aren't as alone in the supposedly abandoned mansion as they thought they were.

These audio cues need to be captured in the captions as well, so users who are relying on captions can have the full audience experience.

For sound effects, include a descriptive word or succinct phrase that conveys the sound heard, and sound source if known, in brackets, all lowercase:

[dog barking]

You can even include commonly understood sounds and onomatopoeias:

[alarm clock]
beep, beep, beep

For music, if it's simply the mood or effect that's important, convey that, in italics if the source is offscreen or not seen:

[discordant strings]

or...

[elegant waltz]

If the artist or song were chosen for a particular reason, you should reference them specifically:

[Louis Armstrong sings 
"Stardust"]

If the song was chosen for its lyrics, include those in the captions, with a musical note icon (♪) at the beginning and end of each lyric segment, except the last segment, which should end with two notes (♪♪).

♪Though I dream in vain
In my heart it will remain♪

♪My stardust melody
The memory of love's refrain♪♪
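
If lyric segments are prepared programmatically, the note placement described above could be sketched roughly as follows. This is a minimal illustration, not part of the BBC or DCMP guidance; the function name `mark_lyrics` and the list-of-lines input shape are assumptions made for the example.

```python
NOTE = "♪"


def mark_lyrics(segments):
    """Wrap each lyric segment in musical notes.

    `segments` is a list of segments, each a list of text lines (a
    hypothetical input shape). Every segment starts with one note; the
    last segment ends with two notes and the others end with one.
    """
    marked = []
    for index, lines in enumerate(segments):
        is_last = index == len(segments) - 1
        closing = NOTE * 2 if is_last else NOTE
        marked.append(NOTE + "\n".join(lines) + closing)
    return marked


lyrics = [
    ["Though I dream in vain", "In my heart it will remain"],
    ["My stardust melody", "The memory of love's refrain"],
]
for segment in mark_lyrics(lyrics):
    print(segment)
    print()
```

Run on the two segments shown above, this reproduces the same text, with the doubled note only at the end of the final segment.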

Speaker Identification

Identifying speakers in captions is helpful in all circumstances, but it's especially important in videos that have multiple speakers. It is critically important when viewers who do not rely on captions can tell who is speaking by voice alone, without seeing the person speaking: unless the speaker is identified, users who rely on captions have no way of knowing who is speaking and thus do not have an equivalent experience, as required.

Identify speakers by including their name or title in parentheses at the start of the first caption segment that conveys their speech, in its own line of text if possible (i.e., if it doesn't cause important visual information or user interface components to be blocked/obstructed; remember, this is one of the few situations where it's appropriate to have three lines of text in a caption segment).

If the speaker has not yet been identified by name, use a descriptive title or label and be consistent in using that same title or label for that speaker until such time that they are named: e.g., "(pedestrian #1)" or "(mysterious voice)."

Do not include the speaker's name, title or label at the start of any other caption segment while they are speaking continuously. For example:

00:00:00.304 --> 00:00:03.052

(narrator)
The Common Rule
is another name

00:00:03.052 --> 00:00:06.649

for the Federal Policy for the
Protection of Human Subjects,

When the speaker changes, include the new speaker’s name/title/label in the first caption segment that conveys their speech, in its own line of text if possible. Do this whenever the speaker changes. For example:

00:00:00.250 --> 00:00:04.000

(Jane)
Would you like
to go to the movies?

00:00:04.000 --> 00:00:06.740

(Jordan)
Yes, Jane, I would.

00:00:06.740 --> 00:00:11.570

I've heard such good things
about the movie that is playing.

00:00:11.570 --> 00:00:15.830

(Jane)
Great!
I'll buy us tickets.

00:00:15.830 --> 00:00:15.830

(Jordan)
And I'll get the popcorn!
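
The labeling rule illustrated above (label a segment only when the speaker changes) can be sketched in a few lines. This is an illustrative helper under assumed inputs: `label_speakers` and the `(speaker, text)` tuples are hypothetical names and shapes, not part of any captioning tool.

```python
def label_speakers(utterances):
    """Return caption text for each (speaker, text) pair.

    A "(speaker)" line is added only to the first segment after a
    speaker change; continuing segments are left unlabeled.
    """
    captions = []
    previous = None
    for speaker, text in utterances:
        if speaker != previous:
            captions.append(f"({speaker})\n{text}")
        else:
            captions.append(text)
        previous = speaker
    return captions


dialogue = [
    ("Jane", "Would you like\nto go to the movies?"),
    ("Jordan", "Yes, Jane, I would."),
    ("Jordan", "I've heard such good things\nabout the movie that is playing."),
    ("Jane", "Great!\nI'll buy us tickets."),
]
for caption in label_speakers(dialogue):
    print(caption)
    print()
```

Note that Jordan's second segment carries no label because the speaker has not changed, while Jane's return is labeled again.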

Segmenting

"Segment" refers to an individual block of captions displayed at any given moment in a video, like the "Our patients have that same fundamental right" segment below that displays from 0m:35.25s to 0m:38.20s in our Introduction to Health Care Privacy video (YouTube).

Still from 36-second mark in Introduction to Health Care Privacy video

Segmenting best practices are primarily intended to achieve three goals:

  1. Avoiding excessive eye and neck movement, as they may cause repetitive stress injuries or exacerbate existing conditions: segmenting achieves this by limiting the amount of text per segment and limiting the screen area in which captions display
  2. Facilitating faster reading: this is also achieved by limiting the screen area in which captions display, allowing viewers to consume more text in any given moment, leaving more time for them to consume other aspects of the video/media
  3. Keeping ideas and parts of speech together as much as possible, so the information being conveyed is as easy as possible to understand, especially with complex ideas that span multiple caption segments

If ever in doubt when captioning, go the route that best achieves the above goals.

Avoiding Excessive Eye and Neck Movement

Width and duration restrictions serve to avoid excessive eye and neck movement.

The BBC: Line Length (i.e., caption width) rules dictate that no caption segment should have a width exceeding:

  • For 16:9 media, 68% of the media width
  • For 4:3 media, 90% of the media width
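
As a quick worked example of these percentages, the sketch below converts them to pixel widths. The function name and the sample media widths (1920 and 640 pixels) are assumptions chosen for illustration; integer arithmetic is used so the result is rounded down to whole pixels.

```python
# Maximum caption width as a whole-number percentage of the media width,
# taken from the two limits listed above.
MAX_WIDTH_PERCENT = {"16:9": 68, "4:3": 90}


def max_caption_width_px(media_width_px, aspect_ratio):
    """Return the widest allowed caption in whole pixels (rounded down)."""
    return media_width_px * MAX_WIDTH_PERCENT[aspect_ratio] // 100


print(max_caption_width_px(1920, "16:9"))  # 1305
print(max_caption_width_px(640, "4:3"))    # 576
```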

DCMP’s guideline that no caption segment exceed 6 seconds of audio (DCMP: Caption Duration), when combined with their line division guidance (DCMP: Line Division), essentially achieves the same effect and purpose.

Some caption standards use 32- or 42-characters-per-line limits as their means to avoid excessive eye and neck movement. These limits tend to be informed by broadcast television compatibility and may take less priority than other goals, like keeping ideas/parts of speech together, for media that will not be broadcast on television.

Minimum Display Time

Segments need to display for a minimum amount of time so their text can reliably be read. BBC suggests a minimum display time of 0.3 seconds per word in a segment (BBC: Target minimum timing), but DCMP’s equivalent guideline of a minimum display time of 1.33 seconds (40 frames at 30 fps; DCMP: Caption Duration) arguably creates a better user experience and thus should be prioritized if possible.
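
For captions stored with `HH:MM:SS.mmm --> HH:MM:SS.mmm` timing lines like the examples earlier on this page, a duration check along these lines may help catch segments that are too short or too long. It is a rough sketch: the function names are hypothetical, the thresholds are the figures quoted on this page converted to milliseconds, and the word count is a simple whitespace split rather than anything BBC or DCMP prescribe.

```python
import re

# Thresholds from the guidance on this page, in milliseconds.
DCMP_MIN_MS = 1330      # ~1.33 seconds (40 frames at 30 fps)
BBC_MS_PER_WORD = 300   # 0.3 seconds per word
MAX_MS = 6000           # no more than 6 seconds per segment

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})")


def to_ms(timestamp):
    """Convert an HH:MM:SS.mmm timestamp to milliseconds."""
    match = TIMESTAMP.fullmatch(timestamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def duration_problems(timing_line, spoken_text):
    """Return a list of duration problems for one segment (empty if none)."""
    start, end = (to_ms(part) for part in timing_line.split("-->"))
    duration = end - start
    problems = []
    if duration < DCMP_MIN_MS:
        problems.append("shorter than ~1.33 seconds")
    if duration < BBC_MS_PER_WORD * len(spoken_text.split()):
        problems.append("shorter than 0.3 seconds per word")
    if duration > MAX_MS:
        problems.append("longer than 6 seconds")
    return problems


print(duration_problems("00:00:00.304 --> 00:00:03.052",
                        "The Common Rule is another name"))
print(duration_problems("00:00:04.000 --> 00:00:04.900",
                        "And I'll get the popcorn!"))
```

The first call reports no problems (2.748 seconds for six words); the second reports both minimum-time problems, since 0.9 seconds is below the DCMP minimum and below 0.3 seconds per word for five words.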

Keep Ideas and Parts of Speech Together

Do not include parts of multiple sentences in an individual caption segment (e.g., the latter half of one sentence and the first half of the next sentence). This practice makes it more difficult to track and understand the ideas conveyed by each sentence.

Instead, when a sentence ends, its caption segment should end, and the next sentence should begin in the next caption segment (BBC 3.2 and BBC 3.4).

For sentences that are too long to be conveyed in a single caption segment, try to divide the sentence between segments at a spot that keeps ideas and parts of speech together as much as possible.

Breaking caption segments into two lines of text isn’t always necessary to avoid the width restrictions covered earlier, but it helps keep captions isolated to a certain segment of the screen, which in turn helps facilitate faster reading and avoidance of excess eye and neck movement. Furthermore, manually controlling where line breaks occur may prevent your media platform from implementing line divisions on its own, which may occur in a way that inhibits understanding.

When caption segments are broken into two lines of text, again, try to keep parts of speech together, such as not separating a modifier from what it modifies or not breaking apart a prepositional phrase (DCMP: Line Division).

In most text editing programs and platforms, pressing Shift+Enter creates a line break (a.k.a. a soft break or soft return).

Gaps Between Caption Segments

Avoid gaps of less than 1-1.5 seconds between caption segments (BBC: Gaps): when one segment disappears and another replaces it within 1.5 seconds, the result is a jarring flicker effect that can disrupt the user experience. If a gap of more than 1.5 seconds between segments is not possible, have the earlier segment end at the same time the next segment starts.
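
The gap rule can be applied mechanically to a list of segments. The sketch below assumes segments are `(start_ms, end_ms, text)` tuples sorted by start time and uses 1.5 seconds (the upper end of the range above) as the threshold; both the data shape and the name `close_small_gaps` are illustrative assumptions.

```python
MIN_GAP_MS = 1500  # assumed threshold: the upper end of the 1-1.5 second range


def close_small_gaps(cues, min_gap_ms=MIN_GAP_MS):
    """Extend a segment to the next segment's start when the gap is too small.

    `cues` is a list of (start_ms, end_ms, text) tuples sorted by start
    time. Gaps of min_gap_ms or more are left untouched.
    """
    adjusted = []
    for index, (start, end, text) in enumerate(cues):
        if index + 1 < len(cues):
            next_start = cues[index + 1][0]
            if 0 < next_start - end < min_gap_ms:
                end = next_start  # remove the short gap to avoid flicker
        adjusted.append((start, end, text))
    return adjusted


cues = [
    (0, 3000, "first"),
    (3500, 6000, "second"),   # 500 ms after "first": gap gets closed
    (8000, 9500, "third"),    # 2000 ms after "second": gap is kept
]
print(close_small_gaps(cues))
```

Extending a segment this way can push it past the 6-second limit discussed earlier, so it is worth rechecking durations afterward.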
