
CutCaption

A subtitle workflow product with a full media pipeline, a real editor, and consistent preview-to-export rendering.

Django · Vue 3 · Docker · FFmpeg · Whisper · Stripe

CutCaption is a subtitle production suite built around a real media pipeline, a workflow-sensitive editor, and consistent preview-to-export rendering. It is the first product from OtsoLabs, my Wyoming-registered LLC, and the work has required decisions across media ingest, editor state, rendering, export, billing, and workspace behavior.

What exists today

The product has four distinct surfaces: public marketing and pricing routes; auth flows covering login, signup, email confirmation, password reset, and TOTP-based two-factor authentication; a workspace dashboard for project management and billing; and a dedicated editor environment for upload, transcription, editing, styling, and export.

The editor is a three-panel layout — captions list, video preview, waveform timeline — with five named layout modes including one optimized for vertical video. The surrounding product layer connects auth recovery, billing state, background job status, and export-intent resumption into one system.

Selected engineering decisions

Media ingest and transcription pipeline

Uploads are validated with ffprobe before entering a multi-phase Celery pipeline:

  1. Source validation — confirms the file is valid media before committing resources
  2. Proxy generation — produces a lower-bitrate, browser-playable file from the original
  3. Waveform generation — peak data for timeline visualization
  4. Sprite sheet generation — thumbnail grid for scrub preview
  5. Transcription — GPU Whisper runtime in production; a deterministic mock runtime for development, running on a separate worker queue
  6. Alignment — word-level timestamps for Tier 1 languages (English, French, German, Spanish, and six others); Tier 2 languages (Japanese, Chinese, Korean, Arabic, Hindi) get segment-level timing only
  7. Caption normalization — strips malformed word-alignment entries without invalidating the segment; segment text and timing remain authoritative
  8. Caption persistence — saves the versioned caption document to the project

The two transcription runtimes (GPU Whisper and CPU mock) run on separate Celery queues, so the GPU worker stays isolated from development traffic.

Editor state and document model

Each subtitle is a versioned document entry containing: plain text, a Quill delta for rich-text formatting, float start/end timestamps, per-word alignment data when available, a spatial dict for per-subtitle position and style overrides, and a stable UUID.
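
As a rough illustration, the entry shape might look like the TypeScript interface below. Field names here are hypothetical stand-ins, not the actual schema.

```typescript
// Hypothetical shape of one subtitle entry in the versioned caption document.
// Field names are illustrative, not the real schema.
interface WordAlignment {
  word: string;
  start: number; // seconds, float
  end: number;   // seconds, float
}

interface SubtitleEntry {
  id: string;                        // stable UUID that survives edits
  text: string;                      // plain text used for search and export
  delta: { ops: object[] };          // Quill delta for rich-text formatting
  start: number;                     // float start timestamp (seconds)
  end: number;                       // float end timestamp (seconds)
  words?: WordAlignment[];           // per-word alignment, Tier 1 languages only
  spatial?: Record<string, unknown>; // per-subtitle position/style overrides
}
```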

The editor manages:

  • Autosave via Pinia action subscriptions
  • An undo/redo history stack with merge-key deduplication for continuous edits (slider drags produce one history entry rather than one per tick)
  • In-editor full-text search
  • Split and merge, with timing proportionally resolved from word boundaries when alignment data is present
  • A re-split operation that replaces the subtitle list from the backend
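
The merge-key idea can be sketched as follows; the class and field names are hypothetical, but the behavior matches the description above: consecutive edits that share a merge key collapse into the most recent history entry instead of stacking up.

```typescript
// Hypothetical history stack with merge-key deduplication. Continuous edits
// (e.g. dragging a timing slider) pass the same mergeKey, so only one entry
// is kept for the whole gesture.
interface HistoryEntry<T> {
  state: T;
  mergeKey?: string; // e.g. "timing:<uuid>" while a slider drag is in progress
}

class History<T> {
  private past: HistoryEntry<T>[] = [];
  private future: HistoryEntry<T>[] = [];

  push(state: T, mergeKey?: string): void {
    const last = this.past[this.past.length - 1];
    if (mergeKey && last && last.mergeKey === mergeKey) {
      // Same continuous gesture: replace the previous entry instead of appending.
      last.state = state;
    } else {
      this.past.push({ state, mergeKey });
    }
    this.future = []; // any new edit invalidates the redo stack
  }

  undo(): T | undefined {
    const entry = this.past.pop();
    if (entry) this.future.push(entry);
    return this.past[this.past.length - 1]?.state;
  }
}
```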

For performance, the editor auto-selects a rendering mode at load time: Normal (below 250 subtitles, full DOM), Hybrid (250–499, lightweight shells outside the active area that upgrade on scroll with a debounce), and Performance (500 and above, virtual scrolling with fixed row height). Users can override manually.
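
The thresholds reduce to a small selection function; this sketch uses the counts stated above, with hypothetical names for the modes.

```typescript
// Hypothetical auto-selection of the list rendering mode by subtitle count,
// using the thresholds described above. Users can still override the result.
type RenderMode = "normal" | "hybrid" | "performance";

function autoSelectRenderMode(subtitleCount: number): RenderMode {
  if (subtitleCount < 250) return "normal";      // full DOM
  if (subtitleCount < 500) return "hybrid";      // lightweight shells, upgrade on scroll
  return "performance";                          // virtual scrolling, fixed row height
}
```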

Preview-to-export rendering

The browser preview uses JASSUB — a WASM-compiled libass build — rendering ASS subtitle tracks on a canvas element overlaid on the video. Export flows through FFmpeg with the ASS filter on the backend.

To keep these consistent, fonts are registered into both libass and the browser’s FontFace API so layout metrics stay aligned between environments. Each font is also present in the export worker’s shared font directory.
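
A sketch of the browser half of that registration, assuming a font manifest of family names and URLs: the FontFace calls are standard browser API, while handing the same manifest to the JASSUB instance depends on its constructor options and is only indicated in a comment here.

```typescript
// Hypothetical font manifest: family name -> URL served to the browser.
// The same files are assumed to exist in the export worker's shared font dir.
const fontManifest: Record<string, string> = {
  // "SomeFamily": "/static/fonts/SomeFamily-Regular.woff2", ...
};

async function registerEditorFonts(): Promise<void> {
  for (const [family, url] of Object.entries(fontManifest)) {
    // Register with the browser so preview layout metrics match the family
    // names referenced in the generated ASS styles.
    const face = new FontFace(family, `url(${url})`);
    await face.load();
    document.fonts.add(face);
  }
}

// The same manifest would also be passed to the JASSUB instance at
// construction time so libass resolves the identical font files (the exact
// option name depends on the JASSUB version and is not shown here).
```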

The renderer distinguishes patch and rebuild operations: single subtitle text or timing edits update JASSUB events in place using an event-index map; structural mutations (add, delete, split, merge, replace-range) trigger a full track rebuild. This avoids unnecessary rebuilds on every keystroke while keeping the track state correct after structural changes.
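
Roughly, the dispatch might look like the sketch below; the operation names, index map, and helper functions are illustrative stand-ins, not the actual JASSUB integration.

```typescript
// Hypothetical dispatcher separating in-place patches from full rebuilds.
type EditOp =
  | { kind: "text" | "timing"; subtitleId: string }                  // patchable
  | { kind: "add" | "delete" | "split" | "merge" | "replaceRange" }; // structural

const eventIndexBySubtitleId = new Map<string, number>(); // subtitle UUID -> ASS event index

function patchEvent(index: number, op: EditOp): void { /* update one event in place */ }
function rebuildTrack(): void { /* regenerate the full ASS track */ }

function applyEdit(op: EditOp): void {
  if (op.kind === "text" || op.kind === "timing") {
    const idx = eventIndexBySubtitleId.get(op.subtitleId);
    if (idx !== undefined) {
      patchEvent(idx, op); // cheap per-keystroke path, no rebuild
      return;
    }
  }
  // Structural mutation (or a missing index entry): rebuild the whole track
  // so event ordering and the index map stay correct.
  rebuildTrack();
}
```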

Karaoke and word-level effects

Four karaoke modes are available for AI-transcribed content:

  • Sweep — uses ASS \kf fill-sweep tags; words transition with a smooth color sweep at their word boundary
  • Instant — uses ASS \k tags; words snap to the highlight color at the word boundary
  • Word flash — each word is rendered as a separate timed subtitle event that appears and disappears; subtitle file downloads export matching word-level cue windows
  • Word style — each active word gets its own per-word ASS style override (color, outline, background, font size, bold/italic/underline) that activates at the word boundary; the active word has its own secondary style panel independent of the global subtitle style

Eligibility is gated by language tier (word-level alignment required for word-flash and word-style), project type (AI-generated only), and video duration (word-flash and word-style capped at 4 minutes due to the number of events those modes generate). When some subtitles have word timings and others do not, a warning appears in the mode menu and untimed subtitles render normally.
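
To make the sweep and instant modes concrete, here is a sketch of turning word timings into ASS karaoke tags; \kf and \k durations are in centiseconds, and the function and type names are hypothetical.

```typescript
// Hypothetical generation of ASS karaoke tags from word-level timings.
// \kf produces a fill sweep across the word; \k snaps the highlight at the
// word boundary. ASS karaoke durations are expressed in centiseconds.
interface TimedWord {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

function karaokeText(words: TimedWord[], mode: "sweep" | "instant"): string {
  const tag = mode === "sweep" ? "\\kf" : "\\k";
  return words
    .map((w) => {
      const centiseconds = Math.max(1, Math.round((w.end - w.start) * 100));
      return `{${tag}${centiseconds}}${w.word}`;
    })
    .join(" ");
}

// Example: three words of one second each (sweep mode) ->
// "{\kf100}one {\kf100}two {\kf100}three"
```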

Product layer around the editor

Export-intent resumption stores an ExportIntent record when the export flow is interrupted by auth or billing issues; the user resumes from a banner in the editor captions panel. Export workers apply billing debits at submission time and issue a refund on worker failure.

The workspace model supports three member roles (Owner, Admin, Editor), per-workspace billing authority resolved through the workspace owner’s account, and invite/remove/role-change flows. Billing events flow through Stripe webhooks with reconciliation logic for missing cycle grants.