Audio Flamingo Next Captioner: Long-Form Audio Captioning for Speech, Sound, and Music

Upload audio or paste a YouTube URL to generate dense captions, timestamp-aware summaries, and detailed scene descriptions with AF-Next-Captioner.

Authors: Sreyan Ghosh¹,², Arushi Goel¹, Kaousheik Jayakumar², Lasha Koroshinadze², Nishit Anand², Zhifeng Kong¹, Siddharth Gururani¹, Sang-gil Lee¹, Jaehyeon Kim¹, Aya Aljafari¹, Chao-Han Huck Yang¹, Sungwon Kim¹, Ramani Duraiswami², Dinesh Manocha², Mohammad Shoeybi¹, Bryan Catanzaro¹, Ming-Yu Liu¹, Wei Ping¹

¹NVIDIA, CA, USA | ²University of Maryland, College Park, USA

Correspondence: sreyang@umd.edu, arushig@nvidia.com

This Model Is Best For

Detailed audio captions that integrate speech, sound effects, ambience, and music in one response
Captioning prompts that also ask for precise transcription of all spoken content by all speakers
Long-form summaries, timestamp-aware event descriptions, and dense scene breakdowns
Showcase-style outputs where you want to see how the audio evolves over time

If you need standard QA, assistant-style chat, ASR, or AST / speech translation, use Audio Flamingo Next.

If you need explicit reasoning traces or multi-step timestamp-grounded analysis, use Audio Flamingo Next Think.

Audio Input

Upload Audio File

YouTube URL

Paste any YouTube URL - we'll extract high-quality audio automatically
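
Before pasting a link, it can help to check that it has a shape the demo is likely to accept. The snippet below is a minimal sketch of such a check, not the demo's actual extraction code: the function name `extract_youtube_id` and the set of accepted URL shapes are assumptions made for illustration, and it uses only the Python standard library.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_youtube_id(url: str) -> Optional[str]:
    """Return the video ID from a common YouTube URL shape, or None.

    Hypothetical helper: the accepted shapes are an assumption,
    not a description of what the demo itself supports.
    """
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()

    # Normalize the common "www." and mobile "m." prefixes.
    if host.startswith("www."):
        host = host[4:]
    elif host.startswith("m."):
        host = host[2:]

    candidate = ""
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "music.youtube.com"):
        if parsed.path == "/watch":
            # Standard links carry the ID in the "v" query parameter.
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/shorts/", "/embed/", "/live/")):
            # Path-style links: /shorts/<id>, /embed/<id>, /live/<id>
            candidate = parsed.path.split("/")[2]

    return candidate or None
```

A URL without a scheme (for example `youtube.com/watch?v=...`) has no parsed hostname, so this sketch returns `None` for it; add `https://` in front before checking.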

Prompt

Model Caption