Automatic Subtitles with Whisper and Remotion: A Complete Guide

Written by Francesco Di Donato • July 16, 2025 • 9 minute read
Handy and Versatile
Whisper and Remotion are really handy tools. What you’ll see in this post, automatic subtitle creation, is just one of the many possible use cases for these open-source technologies. Think of it as a gym for learning the fundamentals, which you can then apply to your own projects.
Are you a podcaster? You could use this stack to generate promotional videos with your synchronized audio waveform, taking advantage of Remotion’s ready-made templates.
Are you a creative coder or a digital artist? Imagine being able to finally render your WebGL creations, your shaders, and your polygons into a clean MP4 video, without going crazy with screen capture software.
Maybe you give talks and need a custom-made video-slide template.
This workflow will give you the tools to do all of that. Now, let’s start with our concrete example.
In my case, I want to add subtitles to my videos. Mainly because statistics show me that some viewers have their volume set to zero. However, although various apps and TikTok’s own editor are excellent for starting out, they still require considerable operating time and manual labor on my part. Time is a precious resource, so it’s time to automate.
We want:
- Input: some `.mp4` where someone is talking (☝️🤓). The editing of the scenes that make up the video has already been done with other specific tools. Here, we're just adding the captions.
- To extract text and translations with the highest possible accuracy.
- To assemble everything into a final video with unique animations and style that no traditional editor can give you. And we'll do it by writing code you probably already know.
- To apply your format, your colors. It can both automate recurring visualizations and enhance your personal branding. It's up to you how you ~~implement it~~ have an LLM generate it for you.
Basically:
- Stop editing
- Start building
Whisper
Everyone thinks Whisper is a transcription tool. Wrong. Or rather, incomplete. To understand its power, we first need to understand how a traditional ASR (Automatic Speech Recognition) works and why Whisper is different.
A classic ASR is an assembly line that processes sound in stages. The first step is the most abstract and crucial: the Acoustic Model.
What does an Acoustic Model do? It breaks down sound into “Phonemes.”
Imagine phonemes as the LEGO® bricks of spoken language. They are the smallest units of sound that distinguish one word from another. They aren’t letters, but the actual sounds.
- Cat vs. Bat: The only thing that changes is the initial sound. The hard /k/ sound and the /b/ sound are two different phonemes.
- Ride vs. Road: Here, the vowel sound in the middle changes the entire meaning. The /aɪ/ sound and the /oʊ/ sound are two other distinct phonemes.
A traditional ASR must first “listen” to the continuous sound wave of your speech and break it down into a sequence of these sound-bricks. Only then, with other models, does it try to reconstruct the words and sentences. It’s a rigid, specialized, step-by-step process.
How it differs
Whisper throws away this assembly line. It’s a single end-to-end Transformer model, much more similar to an LLM like GPT than to an ASR. And its secret lies in tokens.
How does Whisper really work? The token game.
Whisper doesn’t “listen and transcribe” in the classic sense. Instead, it transforms the audio into a numerical representation and then asks its decoder, “Okay, starting from this audio and these commands, what words should I generate?”.
The “commands” are special tokens we give it as a prompt to guide the result. The flexibility is total.
Example 1: Simple Transcription in Italian
```
<|startoftranscript|> <|it|> <|transcribe|>
```

- `startoftranscript`: Starts the process.
- `it`: Specifies the language (Italian).
- `transcribe`: Sets the transcription task.
Example 2: Translation to English
```
<|startoftranscript|> <|it|> <|translate|>
```

- `translate`: This single token change tells the model not to transcribe, but to translate the Italian audio directly into English text.
Example 3: Controlling Timestamps
```
<|startoftranscript|> <|it|> <|transcribe|> <|notimestamps|>
```

- `notimestamps`: If you don't want the timing data for each word, add this token. If you omit it, you'll get timestamps by default.
Example 4: Language Identification
```
<|startoftranscript|>
```

- By providing only the start token, the first token Whisper will generate is the one for the language it has detected (e.g., `<|en|>` or `<|es|>`).
See the trick? Whisper doesn't have different functions; it has a single generative engine that you can steer with tokens. This makes it incredibly versatile, but it's also why it can "hallucinate": if it's not sure, it tries to generate the most plausible text sequence, just like an LLM would.
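To make the token game concrete, here's a tiny sketch in plain JavaScript that assembles the prompt sequences from the four examples above. The helper and its parameter names are mine, purely illustrative; they are not part of any Whisper API.

```javascript
// Sketch: assemble Whisper's special-token prompt as a string.
// The token names come from the examples above; this helper is
// illustrative and not part of any Whisper library.
const buildWhisperPrompt = ({ language, task, timestamps = true } = {}) => {
  const tokens = ["<|startoftranscript|>"];
  if (language) tokens.push(`<|${language}|>`); // e.g. "it", "en"
  if (task) tokens.push(`<|${task}|>`); // "transcribe" or "translate"
  if (!timestamps) tokens.push("<|notimestamps|>");
  return tokens.join(" ");
};

// Example 1: transcription in Italian
console.log(buildWhisperPrompt({ language: "it", task: "transcribe" }));
// Example 2: translation to English
console.log(buildWhisperPrompt({ language: "it", task: "translate" }));
// Example 4: language identification — only the start token
console.log(buildWhisperPrompt());
```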
Licenses
Always check the licenses yourself. This post is here to stay, and things can always change over time.
Whisper’s dual nature remains a key point.
- Open-Source (MIT License): You download it, run it wherever you want, even in commercial projects. The only obligation is to maintain the original attribution. Maximum freedom, maximum responsibility.
- OpenAI’s API (Terms of Service): You pay for a service. Simpler, more scalable, but you’re bound by OpenAI’s rules. You are a customer, not a user of free software.
Remotion
Basically, Remotion is a thingy that (at the time of writing) you install with `npx create-video@latest`, and it comes with well-made templates, which are excellent starting points. Like the template for TikTok.
The Golden Cages
The video editors integrated into social platforms (CapCut, the TikTok editor, Instagram Reels) are great for starting out. They are, by definition, a subset of the possibilities. What you can do with them depends on what their developers were paid to implement. You work within their boundaries, which isn’t necessarily a bad thing. It depends on whether you currently have the capacity to invest time in automating some aspect of video editing that might be useful to you now or later.
The Power of the Web Stack
This is not an exaggeration. It means you can use:
- CSS: Want to use a specific font from Google Fonts? Done. Want a complex text gradient, an artistic `text-shadow`, or a `clip-path` to reveal text in an original way? ~~You can do it~~ You can prompt it properly to have it do it for you. Democratization of brand identity(?)
- SVG: Import complex vector logos, like your company's, as a watermark.
- WebGL: For the more ~~daring~~ nerdy, this opens the door to 3D and GPU-accelerated effects with libraries like `three.js`. Jokes aside, making 3D stuff is art in every sense, literally painting with math. It's cool, come on.
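As a taste of what "just CSS" means for captions, here's a minimal sketch of an inline style object you might spread onto a caption element inside a Remotion component. Every value here is my own arbitrary pick, not a template default.

```javascript
// Sketch: a caption style using plain CSS features — gradient text
// clipped to the glyphs, a heavy weight, and a custom font stack.
// All values are arbitrary examples, not template defaults.
const captionStyle = {
  fontFamily: "'Montserrat', sans-serif",
  fontWeight: 900,
  fontSize: 64,
  // Gradient text: paint a gradient, then clip it to the text shape.
  backgroundImage: "linear-gradient(90deg, #ff6b6b, #feca57)",
  WebkitBackgroundClip: "text",
  backgroundClip: "text",
  color: "transparent",
};

// In a Remotion component you would use it like:
// <div style={captionStyle}>{caption.text}</div>
```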
But there are many other useful templates. Maybe Audiogram, to put a visualization of your podcast's audio track or music on screen. The animated code template is super interesting (dependency: Code Hike).
Licenses
Always check the licenses yourself. This post is here to stay, and things can always change over time.
Free for small teams (up to 3 people), non-profits, and freelancers. Paid for larger companies (4 employees and up). A fair model that encourages adoption.
Workflow
I recommend starting from the official Remotion template for TikTok-style videos: it already integrates `whisper.cpp` and has everything you need to get started.
**Create the Project.** Open your terminal and run this command. It will create a new folder with a pre-configured Remotion project for generating subtitles.

```shell
npx create-video@latest --template tiktok
```
**Configure Whisper.** Before running any scripts, open the `src/whisper-config.mjs` file. Here you need to specify the Whisper model to use and the language of your audio. For a good balance, you could use:

- `WHISPER_MODEL`: `large-v3` (very accurate) or `medium` (faster).
- `WHISPER_LANG`: `en` for English (or `it` for Italian, etc.).
Note: The first time you run the transcription script, `whisper.cpp` will be downloaded and compiled automatically. Be patient for a few minutes; subsequent runs will be instantaneous.

```javascript
await installWhisperCpp({ ... });
await downloadWhisperModel({ ... });
```
**Transcribe the Audio.** Move your video file (e.g., `my-video.mp4`) into the project's `public/` folder. Then, run the transcription script:

```shell
node sub.mjs public/my-video.mp4
```
The template uses the following code to extract the audio from the video at the sample rate Whisper expects (16 kHz):

```javascript
const extractToTempAudioFile = (fileToTranscribe, tempOutFile) => {
  // Extract the audio from the mp4 and save it as a 16 kHz wav file
  execSync(
    `npx remotion ffmpeg -i "${fileToTranscribe}" -ar 16000 "${tempOutFile}" -y`,
    { stdio: ["ignore", "inherit"] },
  );
};
```
It saves it in a `temp` folder. Immediately after, it uses that file to get the `captions`:

```javascript
const whisperCppOutput = await transcribe({ ... });
const { captions } = toCaptions({ whisperCppOutput });
```
It will save the result (words and timestamps) in a JSON file with the same name as the video, but a different extension.
```json
[
  {
    "text": "Example",
    "startMs": 0,
    "endMs": 330,
    "timestampMs": null,
    "confidence": 0.90491
  }
]
```
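Word-level captions like these usually get grouped into short "pages" before rendering; the TikTok template handles this for you with Remotion's caption helpers. As a library-free sketch of the idea, assuming the JSON shape shown above (the thresholds are my own arbitrary choices):

```javascript
// Sketch: group word-level captions into pages, starting a new page
// whenever the gap since the previous word exceeds `maxGapMs` or the
// page grows past `maxWords`. Thresholds are arbitrary examples.
const groupCaptions = (captions, { maxGapMs = 500, maxWords = 5 } = {}) => {
  const pages = [];
  for (const caption of captions) {
    const page = pages[pages.length - 1];
    const startNew =
      !page ||
      page.words.length >= maxWords ||
      caption.startMs - page.endMs > maxGapMs;
    if (startNew) {
      pages.push({ startMs: caption.startMs, endMs: caption.endMs, words: [caption] });
    } else {
      page.words.push(caption);
      page.endMs = caption.endMs;
    }
  }
  return pages;
};

const words = [
  { text: "Stop", startMs: 0, endMs: 300 },
  { text: " editing.", startMs: 300, endMs: 700 },
  { text: "Start", startMs: 1600, endMs: 1900 }, // long pause → new page
  { text: " building.", startMs: 1900, endMs: 2400 },
];
console.log(groupCaptions(words).length); // 2 pages
```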
**Translate the Transcript (Optional).** You are free to evolve this script as you see fit. In my case, I introduced an additional translation step, again thanks to Whisper.

```javascript
const englishWhisperOutput = await transcribe({
  ...
  translateToEnglish: true,
});
```
It is very important to provide the entire audio to the `transcribe` function. Whisper's output improves if it has the full context available. If you split the input into chunks and process them separately, they will be treated independently of each other, and it will not be able to use the context for a more accurate transcription and translation.

**Customize and Preview.** Launch the Remotion Studio to see a live preview in your browser:

```shell
pnpm dev
```
Open the `src/CaptionVideo.tsx` file. This is the React component where you can modify the appearance of the subtitles: change the font, colors, position, animations. Every change will be reflected in real-time in the Studio.

**Render the Final Video.** When you're satisfied with the result you see in the preview, stop the Studio (`Ctrl+C`) and run the final rendering command:

```shell
npx remotion render
```
Remotion will process the video frame by frame, applying your custom subtitles, and save the final MP4 file in the `out/` folder. Alternatively, you can trigger the render directly from the `localhost` interface, where you can also select different settings.
Conclusion
The Whisper + Remotion combo is a game-changer. It transforms video creation from a manual and restrictive process to a programmatic, automated, and creatively limitless one.
You are literally building your videos as if they were interactive web applications. And this not only saves you an enormous amount of time but also drastically raises the quality and uniqueness of your final product.
Giving you more free time, too.