Audio Flamingo Next Captioner: Long-Form Audio Captioning for Speech, Sound, and Music

Upload audio or paste a YouTube URL to generate dense captions, timestamp-aware summaries, and detailed scene descriptions with AF-Next-Captioner.

Authors: Sreyan Ghosh¹,², Arushi Goel¹, Kaousheik Jayakumar², Lasha Koroshinadze², Nishit Anand², Zhifeng Kong¹, Siddharth Gururani¹, Sang-gil Lee¹, Jaehyeon Kim¹, Aya Aljafari¹, Chao-Han Huck Yang¹, Sungwon Kim¹, Ramani Duraiswami², Dinesh Manocha², Mohammad Shoeybi¹, Bryan Catanzaro¹, Ming-Yu Liu¹, Wei Ping¹

¹NVIDIA, CA, USA | ²University of Maryland, College Park, USA

Correspondence: sreyang@umd.edu, arushig@nvidia.com

This Model Is Best For

Detailed audio captions that integrate speech, sound effects, ambience, and music in one response
Captioning prompts that also ask for precise transcription of all spoken content by all speakers
Long-form summaries, timestamp-aware event descriptions, and dense scene breakdowns
Showcase-style outputs where you want to see how the audio evolves over time

If you need standard QA, assistant-style chat, ASR, or AST / speech translation, use Audio Flamingo Next.

If you need explicit reasoning traces or multi-step timestamp-grounded analysis, use Audio Flamingo Next Think.

Prompting note: AF-Next-Captioner is strongest when you ask explicitly for a dense caption, timestamped scene summary, lyrics, or a speaker-aware breakdown.

Prompt Guide

Task	Prompt	Recommended Checkpoint(s)
ASR	`Transcribe the input speech.`	`Instruct`, `Think`
AST	`Translate any speech you hear from <src_lang> into <tgt_lang>.`	`Instruct`, `Think`
Short Audio Captioning	`Generate a caption for the input audio.`	`Captioner`, `Think`
Long Audio Captioning	`Generate a detailed caption for the input audio. In the caption, transcribe all spoken content by all speakers in the audio precisely.`	`Captioner`, `Think`
Music Captioning	`Summarize the track with precision: mention its musical style, BPM, key, arrangement, production choices, and the emotions or story it conveys.`	`Captioner`, `Instruct`, `Think`
Lyrics	`Generate a lyrics transcription from the input song.`	`Instruct`, `Captioner`, `Think`
QA	`What precise description did the commentator use for the punch that ended the fight?`	`Instruct`, `Think`
Timestamped Multi-Talker ASR	`Transcribe the input audio. If multiple speakers are present, provide diarized transcripts with speaker labels.` `[Speaker 1] ...` `[Speaker 2] ...`	`Instruct`, `Think`
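
The prompt guide above can be kept as a small lookup table when scripting requests. The following is a minimal sketch, not part of the released tooling: the `PROMPT_GUIDE` dictionary, its task keys, and the `build_prompt` helper are hypothetical names introduced here for illustration, while the prompt strings and checkpoint labels are copied from the table.

```python
# Prompt templates and recommended checkpoints, copied from the Prompt Guide table.
# PROMPT_GUIDE, its keys, and build_prompt are hypothetical names for illustration.
PROMPT_GUIDE = {
    "asr": ("Transcribe the input speech.", ["Instruct", "Think"]),
    "ast": (
        "Translate any speech you hear from <src_lang> into <tgt_lang>.",
        ["Instruct", "Think"],
    ),
    "short_caption": (
        "Generate a caption for the input audio.",
        ["Captioner", "Think"],
    ),
    "long_caption": (
        "Generate a detailed caption for the input audio. In the caption, "
        "transcribe all spoken content by all speakers in the audio precisely.",
        ["Captioner", "Think"],
    ),
    "music_caption": (
        "Summarize the track with precision: mention its musical style, BPM, "
        "key, arrangement, production choices, and the emotions or story it "
        "conveys.",
        ["Captioner", "Instruct", "Think"],
    ),
    "lyrics": (
        "Generate a lyrics transcription from the input song.",
        ["Instruct", "Captioner", "Think"],
    ),
    # Only the instruction sentence of the multi-talker row is stored here;
    # the "[Speaker 1] ..." lines in the table illustrate the output format.
    "multi_talker_asr": (
        "Transcribe the input audio. If multiple speakers are present, "
        "provide diarized transcripts with speaker labels.",
        ["Instruct", "Think"],
    ),
}


def build_prompt(task, src_lang=None, tgt_lang=None):
    """Return (prompt, recommended_checkpoints) for a task key."""
    template, checkpoints = PROMPT_GUIDE[task]
    if "<src_lang>" in template:
        # The AST template has two placeholders that must both be filled.
        if not src_lang or not tgt_lang:
            raise ValueError("AST needs both src_lang and tgt_lang")
        template = template.replace("<src_lang>", src_lang)
        template = template.replace("<tgt_lang>", tgt_lang)
    # Return a copy so callers cannot mutate the shared table.
    return template, list(checkpoints)


prompt, checkpoints = build_prompt("ast", src_lang="German", tgt_lang="English")
print(prompt)
print(checkpoints)
```

The QA row is left out of the table because its prompt is an example question about one specific recording rather than a reusable template.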

Audio Input

Upload Audio File

YouTube URL

Paste any YouTube URL, and we'll extract high-quality audio automatically.

Prompt
