Audio Flamingo Next Captioner: Long-Form Audio Captioning for Speech, Sound, and Music

Upload audio or paste a YouTube URL to generate dense captions, timestamp-aware summaries, and detailed scene descriptions with AF-Next-Captioner.

Authors: Sreyan Ghosh (1, 2), Arushi Goel (1), Kaousheik Jayakumar (2), Lasha Koroshinadze (2), Nishit Anand (2), Zhifeng Kong (1), Siddharth Gururani (1), Sang-gil Lee (1), Jaehyeon Kim (1), Aya Aljafari (1), Chao-Han Huck Yang (1), Sungwon Kim (1), Ramani Duraiswami (2), Dinesh Manocha (2), Mohammad Shoeybi (1), Bryan Catanzaro (1), Ming-Yu Liu (1), Wei Ping (1)

(1) NVIDIA, CA, USA | (2) University of Maryland, College Park, USA

Correspondence: sreyang@umd.edu, arushig@nvidia.com

This Model Is Best For

  • Detailed audio captions that integrate speech, sound effects, ambience, and music in one response
  • Captioning prompts that also ask for precise transcription of all spoken content by all speakers
  • Long-form summaries, timestamp-aware event descriptions, and dense scene breakdowns
  • Showcase-style outputs where you want to see how the audio evolves over time

If you need standard QA, assistant-style chat, automatic speech recognition (ASR), or automatic speech translation (AST), use Audio Flamingo Next.

If you need explicit reasoning traces or multi-step timestamp-grounded analysis, use Audio Flamingo Next Think.
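The guidance above amounts to a small routing rule, which can be sketched in Python as follows. This is only an illustrative sketch: the task labels and the `pick_model` helper are hypothetical names invented for this example, and only the three model names are taken from the text above.

```python
# Hypothetical task labels (assumptions for illustration, not part of any API).
CAPTIONER_TASKS = {
    "dense_caption",            # speech, sound effects, ambience, and music together
    "caption_with_transcript",  # caption plus transcription of all speakers
    "long_form_summary",
    "timestamped_events",
}
GENERAL_TASKS = {"qa", "chat", "asr", "speech_translation"}
THINK_TASKS = {"reasoning_trace", "timestamp_grounded_analysis"}


def pick_model(task: str) -> str:
    """Return the model variant suggested by the guidance above for a task label."""
    if task in CAPTIONER_TASKS:
        return "Audio Flamingo Next Captioner"
    if task in GENERAL_TASKS:
        return "Audio Flamingo Next"
    if task in THINK_TASKS:
        return "Audio Flamingo Next Think"
    raise ValueError(f"Unknown task label: {task!r}")


if __name__ == "__main__":
    for label in ("dense_caption", "asr", "reasoning_trace"):
        print(label, "->", pick_model(label))
```

The lookup raises an error for unrecognized labels rather than guessing, since the guidance above does not say which variant to prefer outside the listed use cases.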
