Tutorial · April 5, 2026 · 9 min read

How to Add Captions to Screen Recordings (Step-by-Step)

Learn how to add captions to your screen recordings using AI transcription, SRT/VTT files, and built-in tools. Covers Whisper, styling, and multi-language support.

Captions are no longer optional. They are expected. Studies consistently show that the majority of social media videos are watched without sound, and captions increase viewer engagement by 40% or more. Beyond engagement, captions make your content accessible to deaf and hard-of-hearing viewers, non-native speakers, and anyone watching in a noisy environment.

Despite their importance, adding captions to screen recordings has traditionally been tedious. You either paid for a transcription service, manually typed out every word, or struggled with auto-generated captions that butchered technical terminology. That has changed dramatically with AI transcription tools like Whisper. This guide walks you through every method for adding captions to your screen recordings, from fully automated to manual.

Understanding Caption Formats

Before diving into tools and workflows, it helps to understand the two main approaches to captions:

Open Captions (Burned In)

Open captions are rendered directly into the video pixels. They are always visible and cannot be turned off by the viewer. This is the approach used on most social media content because platforms like Instagram, TikTok, and LinkedIn autoplay videos without sound.

Pros:

  • Always visible regardless of platform or player
  • Consistent styling across all devices
  • Work in any context (embedded, downloaded, screen-shared)

Cons:

  • Cannot be toggled off
  • Cannot be resized by the viewer for accessibility
  • Require re-rendering the entire video to fix a typo

Closed Captions (Sidecar Files)

Closed captions are stored in separate text files (SRT, VTT, or similar formats) that video players display as an overlay. YouTube, Vimeo, and most video hosting platforms support uploading caption files alongside your video.

Pros:

  • Viewers can toggle captions on or off
  • Easy to edit without re-rendering the video
  • Viewers can adjust size and styling
  • Searchable by platforms (YouTube indexes caption text for search)

Cons:

  • Depend on the player or platform supporting them
  • Styling varies across platforms
  • May not display correctly when the video is downloaded or embedded outside the original platform

Which Should You Use?

For social media content (LinkedIn, Twitter/X, Instagram, TikTok): use open captions. Most viewers will see your video without sound and without caption toggle options.

For YouTube, course platforms, and documentation: use closed captions (SRT/VTT files). This gives viewers control and improves discoverability through search.

For internal team recordings: either approach works. Open captions are simpler; closed captions are better if you are hosting on a platform like Loom or Notion that supports them.

Method 1: AI-Powered Automatic Captions

AI transcription is the fastest path from recording to captioned video. The technology has improved dramatically, and modern tools handle technical vocabulary, multiple speakers, and varied accents far better than they did even two years ago.

Using Whisper (Open Source)

OpenAI's Whisper is an open-source speech recognition model that runs locally on your machine. It supports 99 languages and produces remarkably accurate transcriptions, even for technical content with programming terminology.

Step-by-step with Whisper CLI:

  1. Install Whisper. If you have Python installed:

    pip install openai-whisper
    
  2. Extract audio from your screen recording (if you do not already have a separate audio file):

    ffmpeg -i recording.mp4 -vn -acodec pcm_s16le audio.wav
    
  3. Run Whisper on the audio file:

    whisper audio.wav --model medium --output_format srt
    

    This generates an audio.srt file with timestamped captions.

  4. Review and correct the output. Open the SRT file in a text editor and fix any transcription errors. Pay special attention to proper nouns, technical terms, and abbreviations.

  5. Use the SRT file. Upload it to YouTube alongside your video, or burn it into the video using FFmpeg:

    ffmpeg -i recording.mp4 -vf subtitles=audio.srt output_with_captions.mp4
    

Whisper model sizes and accuracy:

Model    Size      Relative Speed   Accuracy
tiny     39 MB     Fastest          Good for simple content
base     74 MB     Fast             Decent for most content
small    244 MB    Moderate         Good accuracy
medium   769 MB    Slow             Very good accuracy
large    1.55 GB   Slowest          Best accuracy

For screen recordings with narration, the medium model hits the sweet spot between accuracy and speed. Use large for content with heavy technical jargon or multiple languages.
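If you prefer Whisper's Python API over the CLI, the transcription result includes timed segments you can format into SRT yourself. The sketch below assumes Whisper's documented segment fields (start, end, text); the segments_to_srt helper is our own illustration, not part of the whisper package:

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timecode SRT expects."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Convert Whisper-style segments (dicts with start/end/text) to SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_timestamp(seg['start'])} --> "
            f"{format_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Typical usage with the whisper package (downloads the model on first run):
#   import whisper
#   result = whisper.load_model("medium").transcribe("audio.wav")
#   open("audio.srt", "w").write(segments_to_srt(result["segments"]))
```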

Using One Rec's Built-in Captions

One Rec integrates Whisper directly into the recording studio, eliminating the multi-step workflow above. After recording, you enable AI captions in the editor, and One Rec transcribes your audio locally using Whisper, generates timed caption segments, and renders them directly into your video with customizable styling.

The key advantage is that everything happens locally. Your audio is never sent to a cloud service, which matters for recordings that contain sensitive information like internal tools, customer data, or proprietary workflows.

Step-by-step with One Rec:

  1. Record your screen (or import an existing recording).
  2. Open the recording in One Rec's editor.
  3. Enable AI captions from the caption panel.
  4. Wait for the local transcription to complete (typically 30-90 seconds for a 5-minute recording).
  5. Review the generated captions and make any corrections.
  6. Customize the caption style (font, size, color, position, background).
  7. Export your video with captions burned in.

Using Cloud-Based Transcription Services

If you do not want to run Whisper locally, several cloud services offer transcription:

  • Rev — Professional-quality transcription with human review options. Supports SRT and VTT export.
  • Otter.ai — Real-time transcription with a free tier. Good for meetings and presentations.
  • Descript — Transcription integrated with a full video editor. Edit your video by editing the transcript text.

Cloud services are convenient but come with privacy tradeoffs. Your audio is uploaded to and processed on third-party servers. For sensitive content, local tools like Whisper or One Rec are safer.

Method 2: Manual Caption Creation

For high-stakes content where accuracy is critical (training videos, legal content, medical tutorials), manual captioning ensures every word is correct.

Creating SRT Files Manually

SRT (SubRip Subtitle) is the most widely supported caption format. The file structure is straightforward:

1
00:00:01,000 --> 00:00:04,500
Welcome to this tutorial on setting up
your development environment.

2
00:00:05,200 --> 00:00:08,800
First, we will install the required
dependencies from the terminal.

3
00:00:09,500 --> 00:00:13,000
Open your terminal and run the following
command to get started.

Each caption block has:

  • A sequential number
  • A timecode range (start --> end) in HH:MM:SS,mmm format
  • One or two lines of text

Tips for manual SRT creation:

  • Keep each segment to 1-2 lines of text and 3-5 seconds of duration.
  • Align segment boundaries with natural speech pauses.
  • Use a text editor with monospace font for easier alignment.
  • Tools like Subtitle Edit (Windows/Linux) or Jubler (cross-platform) provide a visual timeline that makes timing much easier.

Creating VTT Files

WebVTT (VTT) is the web-native caption format used by HTML5 video players. It is very similar to SRT with minor syntax differences:

WEBVTT

00:00:01.000 --> 00:00:04.500
Welcome to this tutorial on setting up
your development environment.

00:00:05.200 --> 00:00:08.800
First, we will install the required
dependencies from the terminal.

VTT uses periods instead of commas in timecodes and starts with a WEBVTT header. It also supports styling directives, though platform support for VTT styling varies.
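Because the differences are so mechanical, converting SRT to VTT is easy to automate: prepend the header, swap commas for periods in the timecodes, and drop the sequence numbers (VTT treats cue identifiers as optional). A sketch, assuming plain well-formed SRT input:

```python
import re

TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add header, fix timecode separators."""
    out = ["WEBVTT", ""]
    for line in srt_text.strip().splitlines():
        if TIMECODE.match(line):
            out.append(line.replace(",", "."))
        elif line.strip().isdigit():
            continue  # drop SRT sequence numbers (assumes no all-digit captions)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```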

Method 3: Hybrid Approach (AI + Manual Review)

The most practical approach for most people is to use AI transcription as a starting point and then manually review the output. This combines the speed of automation with the accuracy of human judgment.

Workflow:

  1. Generate captions with Whisper or One Rec's built-in AI.
  2. Export the captions as an SRT file (or review them in the editor).
  3. Read through the entire transcript while watching the video.
  4. Fix misheard words, especially proper nouns and technical terms.
  5. Adjust timing for any segments that feel off (too early, too late, or too short).
  6. Re-import the corrected captions or update them in your editor.

This workflow typically takes 5-10 minutes for a 5-minute video, compared to 30-45 minutes for fully manual captioning.
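Timing fixes in step 5 often amount to shifting every cue by a constant offset, for example after trimming dead air from the start of a recording you already transcribed. A small sketch of such a shift (our own illustration):

```python
import re

TIMECODE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_srt(srt_text: str, offset_seconds: float) -> str:
    """Shift every SRT timecode by a constant offset (may be negative)."""
    def shift(match):
        h, m, s, ms = (int(g) for g in match.groups())
        total = max(0, h * 3_600_000 + m * 60_000 + s * 1000 + ms
                    + round(offset_seconds * 1000))  # clamp at 00:00:00,000
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    return TIMECODE.sub(shift, srt_text)
```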

Styling Captions for Professional Results

How your captions look matters almost as much as their accuracy. Poorly styled captions can obscure important content, strain readability, or clash with your video's aesthetic.

Font and Size

  • Use a clean sans-serif font like Inter, Helvetica, or Arial. Avoid decorative fonts.
  • Size should be large enough to read on mobile without squinting. For 1080p video, 28-36px is a good range.
  • Bold weight improves readability, especially over busy backgrounds.

Background and Contrast

  • Add a semi-transparent dark background behind light text (or vice versa). This ensures readability regardless of what is happening in the video behind the captions.
  • Alternatively, use a text outline or shadow for a cleaner look.
  • Avoid fully opaque backgrounds that block too much of the video.

Position

  • Lower third (bottom center) is the standard position for most content.
  • For screen recordings where the important content is at the bottom, move captions to the top to avoid covering UI elements.
  • Ensure captions do not overlap with your cursor, click indicators, or other visual elements.

Animation

  • Subtle word-by-word highlighting (where each word lights up as it is spoken) increases engagement and comprehension.
  • Avoid overly flashy animations that distract from the content.
  • One Rec supports animated caption styles that highlight words in sync with your speech.

Adding Captions in Multiple Languages

If your audience spans multiple languages, multi-language captions dramatically expand your reach.

Approaches:

  • Whisper multilingual mode: Whisper can detect the spoken language automatically and transcribe in that language. You can also force a specific language with the --language flag.
  • Translation after transcription: Generate captions in the original language, then translate the SRT/VTT file. Tools like DeepL or Google Translate work for a first pass, but professional translation is worth the investment for important content.
  • Multiple SRT files: For platforms like YouTube, upload separate SRT files for each language. YouTube lets viewers switch between them.

Common languages for screen recording content:

  • English (primary for most tech content)
  • Spanish, Portuguese, and French (large developer communities)
  • Japanese, Korean, and Chinese (significant tech audiences)
  • German (strong engineering community)

Common Captioning Mistakes to Avoid

  • Segments that are too long. Keep each caption segment under 7 seconds and 2 lines. Longer segments are hard to read.
  • Misaligned timing. Captions that appear too early or too late break the connection between audio and text. Always preview your captions with the video playing.
  • Ignoring technical terms. AI transcription often struggles with programming languages, framework names, and acronyms. Always review these manually.
  • Inconsistent formatting. Decide on capitalization (sentence case vs. title case), punctuation, and number formatting, then stick with it throughout.
  • Covering important content. Position captions so they do not obscure the UI elements you are demonstrating.
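Some of these mistakes can be caught automatically before you publish. A minimal lint pass over an SRT file, sketched below using the 7-second and 2-line limits from the guidance above (our own illustration, not a standard tool):

```python
import re

TIMECODE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_ms(timecode: str) -> int:
    """Convert an HH:MM:SS,mmm timecode to milliseconds."""
    h, m, s, ms = (int(g) for g in TIMECODE.match(timecode).groups())
    return h * 3_600_000 + m * 60_000 + s * 1000 + ms

def lint_srt(text: str, max_seconds=7, max_lines=2):
    """Flag SRT cues that run too long or use too many lines."""
    problems = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        index = lines[0].strip()
        start, end = (t.strip() for t in lines[1].split("-->"))
        if to_ms(end) - to_ms(start) > max_seconds * 1000:
            problems.append(f"cue {index}: longer than {max_seconds}s")
        if len(lines) - 2 > max_lines:
            problems.append(f"cue {index}: more than {max_lines} lines")
    return problems
```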

Final Thoughts

Adding captions to screen recordings has gone from a tedious chore to a quick, semi-automated process. AI tools like Whisper handle the heavy lifting, and integrated solutions like One Rec's built-in captions reduce the workflow to a few clicks.

The most important thing is to actually do it. Even imperfect captions are dramatically better than no captions. Start with AI-generated captions, review them quickly for obvious errors, and publish. You can always refine your captioning workflow over time as you learn what your audience needs.

Start recording better videos today with One Rec.
