Text to speech

taaft.com/text-to-speech 265,347 subscribers

There are 7 free AI tools for text to speech.

Number of models: 100

Also used for Text to speech 7

Spritefy

Generate game-ready assets with AI in seconds.

2,405 spritefy.com

🇩🇪 Germany
Released 16d ago
Free + from $12

2,674
5
PodcastorAI

Turn any content into engaging podcasts instantly.

Yawin Lin

🙏 3 karma

Jun 10, 2026

@PodcastorAI

As a student, I regularly work with research papers and long PDFs. PodcastorAI has been a helpful way to turn that content into something I can listen to while commuting or walking. I like that it creates a structured podcast rather than simply reading the text aloud. The dialogue format is surprisingly engaging and makes dense material easier to get through. I still rely on the original documents for deeper study, but for review and knowledge retention, it's been a genuinely useful tool.

Released 24d ago
Free + from $9.9/mo

464
10
4.0
CiteVerse v2.0

Real-time AI-powered Scripture display and note taker that responds to voice.

15,247 citeverse.live

Released 1mo ago
Free + from $22.50/mo

16,022
7
5.0
Trainn AI v1.4

Turn one screen recording into videos, interactive tours, and product guides.

6,299 trainn.co

Released 3mo ago
Free + from $19/mo

7,665
10
5.0
Whimsy v2.0

A personalized audio story gift starring any child, delivered instantly by email.

🇺🇸 United States
Released 3mo ago
Free + from $19.99

5,613
32
5.0
KikiVoice

Clone any voice in seconds with 99% similarity.

CatCat01

🙏 5 karma

Jan 26, 2026

@KikiVoice

KikiVoice does a great job with voice cloning — the results sound very natural and close to the original, and there’s no sign-up required.

Released 5mo ago
100% Free

6,893
98
4.5
FastlyConvert

Transcribe audio & video with Whisper. Export TXT/SRT/VTT. Auto-delete 24h.

Released 5mo ago
Free + from $14.99/mo

1,095
3
3.6

Related Tasks

Speech to image1 0

Models 100

Gen 4 Seed

Seed Audio 1.0

By ByteDance

Seed Audio 1.0 is ByteDance's universal audio generation model that creates voice, music, sound effects, and ambient soundscapes from text prompts. It supports zero-shot voice cloning from short audio references, multi-character dialogue generation in a single pass, and cross-lingual synthesis without fine-tuning. Accessible via Volcano Engine API.

🔊Advanced audio generation 🔊Text to speech 🗣️Voice cloning 🎶Music generation

New · Audio

Released 8d ago
Gen 3

ZONOS2

By Zyphra AI

ZONOS2 is Zyphra’s open-source real-time text-to-speech model with MoE architecture, high-fidelity zero-shot voice cloning, and multilingual expressive speech generation.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers

New · Multimodal

Released 19d ago
Gen 4

Gemini 3.5 Live Translate

By Google DeepMind

Gemini Audio is Google DeepMind’s closed-source native audio model family for low-latency live dialogue, controllable speech generation, audio understanding, and voice-first applications.

🎧Audio translation 🔊Text to speech 🗣️Voice cloning 🗒Transcription

New · Audio

Released 22d ago
Gen 4

Higgs Audio v3 TTS

By Boson AI

Higgs Audio v3 TTS is Boson AI’s text-to-speech model for expressive conversational voice agents across 100+ languages with zero-shot voice cloning and inline speech controls.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🗣Dialogue generation

New · Audio

Released 27d ago
Gen 4

Miso TTS 8B

By Miso Labs

MisoTTS is Miso Labs’ open-weight 8B text-and-audio-conditioned speech generation model for expressive, context-aware, emotive TTS and dialogue voice output.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🗣Dialogue generation

New · Audio

Released 28d ago
Gen 3

MAI Voice 2 Flash

By Microsoft

MAI-Voice-2-Flash is Microsoft AI’s upcoming lower-cost, ultra-efficient variant of MAI-Voice-2 for speech generation.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers

New · Multimodal

Released 29d ago
Gen 3

MAI Voice 2

By Microsoft

MAI-Voice-2 is Microsoft AI’s speech generation model for natural-sounding voice output across 15 languages with short-sample voice adaptation.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎧Audiobooks

New · Multimodal

Released 29d ago
Gen 4

Phonon

By Gradium

Phonon is Gradium’s private-beta 100M-parameter on-device text-to-speech model for low-latency, offline, privacy-sensitive voice generation.

🔊Text to speech 🗣️Voice cloning

New · Video

Released 1mo ago
Gen 3

Realtime TTS 2

By Inworld AI

Realtime TTS-2 is Inworld AI’s realtime conversational text-to-speech model. It is built for live voice interaction rather than narration, with conversational awareness from prior audio turns, natural-language voice direction, cross-lingual voice identity across 100+ languages, and prompt-based voice design.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers

New · Multimodal

Released 1mo ago
Gen 3

Sonic 3.5

By Cartesia

Sonic 3.5 is Cartesia’s fastest and most natural text-to-speech model, built for low-latency conversational voice generation across 42 languages.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🗣Dialogue generation

New · Multimodal

Released 1mo ago
Gen 3

p video avatar

By Pruna AI

p-video-avatar is Pruna’s talking-head video generation model for creating speaking avatar videos from a single portrait image. It takes either a text script or an audio file, then generates a realistic head-and-shoulders speaking video, with support for multiple voices, languages, and 720p or 1080p output.

🎥Video avatars 🔊Text to speech 🎤Lip sync videos 🎨Portrait animation

New · Multimodal

Released 2mo ago
Gen 3

sarashina 2.2 tts

By SB Intuitions

sarashina2.2-tts is SB Intuitions’ Japanese-centric large-language-model-based text-to-speech system. It supports Japanese and English, is designed for high pronunciation accuracy, naturalness, and stability across diverse speaking styles, and includes zero-shot voice generation.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers

New · Multimodal

Released 2mo ago
Gen 3

Soniox Text to Speech

By Soniox

Soniox Text-to-Speech is Soniox’s multilingual TTS model and API for precise, low-latency speech generation. It is built for production voice systems, supports 60+ languages, and emphasizes accurate pronunciation, faithful reading of structured text like emails and phone numbers, natural code-switching, and streaming output for real-time voice apps.

🔊Text to speech 🎙️Voiceovers 🎤Voice agents

New · Multimodal

Released 2mo ago
Gen 3

OpenRouter Text to Speech

By OpenRouter

OpenRouter TTS is OpenRouter’s unified text-to-speech interface for accessing multiple speech-generation models through one API layer. It standardizes voice generation across providers, supporting streaming audio output, customizable voices, and multimodal workflows without provider-specific integration complexity.

🔊Text to speech 🎙Voice chatting 🎙️Voiceovers

New · Multimodal

Released 2mo ago
Gen 3

StepAudio 2.5 TTS

By StepFun

StepAudio 2.5 TTS is StepFun’s contextual text-to-speech model with performance-oriented vocal control. It combines global and inline context guidance with zero-shot voice cloning so generated speech can follow broader style instructions as well as local delivery details, rather than just reading text flatly.

🔊Text to speech 🗣️Voice cloning 🗣️Dialect simulation

New · Multimodal

Released 2mo ago
Gen 4

MAI Voice 1

By Microsoft

MAI-Voice-1 is Microsoft’s top-tier text-to-speech model for natural, expressive voice generation. It is built to preserve clarity, intent, speaker identity, emotional nuance, and pacing across long-form speech, and supports custom voice creation from only a few seconds of audio. Microsoft positions it for voice experiences, voice agents, and expressive spoken content at high speed and low cost.

🔊Text to speech 🎤Voice changing 🗣️Voice cloning 🎧Audiobooks 🔊Advanced audio generation

New · Audio

Released 2mo ago
Gen 3

MOSS TTS Nano

By OpenMOSS

MOSS-TTS-Nano is an open-source multilingual tiny speech generation model from MOSI.AI and OpenMOSS. With only 0.1B parameters, it is built for real-time TTS, can run directly on CPU without a GPU, and keeps deployment simple enough for local demos, web serving, and lightweight product integration.

🔊Text to speech 🗣️Voice cloning 🌐Multilingual communication

New · Multimodal

Released 2mo ago
Gen 4

Pocket TTS Spokenword

By danneauxs

Pocket-TTS-Spokenword is an enhanced version of Kyutai’s Pocket TTS built for emotionally expressive audiobook generation from plain text. It adds AI emotion analysis, smart text chunking, voice adaptation, and voice cloning, while staying lightweight enough to run on CPU-only systems without requiring a GPU.

🔊Text to speech 🗣️Voice cloning 🎧Audiobooks

New · Audio

Released 2mo ago
Gen 3

VoxCPM2

By OpenBMB

VoxCPM2 is OpenBMB’s open-source tokenizer-free multilingual text-to-speech model for natural speech generation, voice design, and controllable voice cloning. It is a 2B-parameter model trained on over 2 million hours of speech, supports 30 languages, and produces 48 kHz studio-quality audio with real-time streaming capability.

🔊Text to speech 🎤Voice changing 🗣️Voice cloning 🔊Audio

New · Multimodal

Released 2mo ago
Gen 3

OmniVoice

By Xiaomi

OmniVoice is a multilingual zero-shot text-to-speech model built for voice cloning, voice design, and general speech synthesis at massive language scale. It supports more than 600 languages, uses a diffusion language model-style architecture, and is positioned for high-quality speech generation with fast inference.

🔊Text to speech 🎤Voice changing 🗣️Voice cloning 🌐Multilingual communication

New · Multimodal

Released 2mo ago
Gen 4

LongCat AudioDiT 3.5B

By Meituan

LongCat-AudioDiT-3.5B is Meituan LongCat’s diffusion-based text-to-speech model built directly in waveform latent space rather than mel-spectrogram space. It is designed for high-fidelity speech generation and zero-shot voice cloning, supports Chinese and English, and is positioned as a top-performing open model on the Seed benchmark for speaker similarity and intelligibility.

🔊Text to speech 🗣️Voice cloning 🔊Voice enhancement

Audio

Released 3mo ago
Gen 3

Voxtral TTS

By Mistral AI

Voxtral TTS is Mistral’s new open-source text-to-speech model for building voice agents and enterprise speech applications. According to TechCrunch, it supports 9 languages, can clone a voice from under 5 seconds of audio, preserves accents and speaking style, and is optimized for real-time use on edge devices like phones, laptops, and wearables.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎤Voice agents

Audio

Released 3mo ago
Gen 4

Lightning v3

By Smallest AI

Lightning is Smallest.ai’s low-latency text-to-speech system for real-time voice agents, voiceovers, and voice cloning.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎤Voice agents

Audio

Released 3mo ago
Gen 4

MOSS TTS Local Transformer v1.5

By OpenMOSS

MOSS-TTS-Local-Transformer-v1.5 is a 5B-parameter text-to-speech model supporting 31 languages with zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, code-switching, and 48 kHz stereo audio output via MOSS-Audio-Tokenizer-v2.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎧Audiobooks

Audio

Released 3mo ago
Gen 3

Xiaomi MiMo V2 TTS

By Xiaomi

MiMo-V2-TTS is Xiaomi’s large-scale speech synthesis model built for expressive agent voice, aiming for natural, emotionally aware speech.

🔊Text to speech 🎤Voice changing 🎙️Voiceovers 🎤Singing

Audio

Released 3mo ago
Gen 4

TADA 1B

By Hume AI

TADA-1B is a unified speech-language model checkpoint that aligns text tokens and speech representations 1-to-1 for fast, reliable text-to-speech generation.

🔊Text to speech 🗣️Voice cloning 🗣️Speech to speech 🎙️Voiceovers

Audio

Released 3mo ago
Gen 3

TADA 3B ML

By Hume AI

TADA-3B-ml is a multilingual TADA checkpoint built for fast, reliable speech generation using the same 1-to-1 text-acoustic alignment framework.

🔊Text to speech 🗣️Voice cloning 🗣️Speech to speech 🎙️Voiceovers 🌐Multilingual communication

Multimodal

Released 3mo ago
Gen 3

Qwen3 TTS

By Alibaba

Qwen3-TTS is a speech generation model family designed for high-quality, human-like TTS with voice cloning and natural-language control over voice style.

🔊Text to speech 🎤Voice changing 🗣️Voice cloning 🎙️Voiceovers

Audio

Released 4mo ago
Gen 3

GPT Realtime 1.5

By OpenAI

gpt-realtime-1.5 is OpenAI’s flagship real-time voice model for audio-in, audio-out use cases like voice agents and customer support, with support for text, audio, and image inputs and text and audio outputs.

🎙Voice chatting 🔊Text to speech 🎤Voice assistants

Multimodal

Released 4mo ago
Gen 2

CSM

By Sesame AI Labs

Conversational speech generation model that generates audio codes from text and audio inputs for dialogue-style speech output.

🔊Text to speech

Coding

Released 4mo ago
Gen 3 Seed

Seed 2.0

By ByteDance

Seed 2.0 is described publicly only as a new ByteDance Seed language model for Doubao; no reliable, detailed public technical description of its architecture, context length, or training data is available yet.

🎬Video editing 🔊Text to speech 📰News analysis

Multimodal

Released 4mo ago
Gen 3 Seedream

Seedream 5.0 Lite

By ByteDance

Seedream 5.0 Lite is ByteDance Seed's multimodal image generator with deep reasoning and built-in web search, built for precise text-to-image generation and image editing that follow complex, real-time instructions with tight layout and style control.

🖼️Image generation 🔊Text to speech 🖌️Image editing 🖼️Logos

Image

Released 4mo ago
Gen 4

Zyphra

By Zyphra AI

Zonos-v0.1 is Zyphra’s open-weight text-to-speech family, two 1.6B models trained on 200k+ hours of multilingual speech, offering expressive, real-time TTS and high-quality voice cloning.

🔊Text to speech 🗣️Voice cloning

Audio

Released 4mo ago
Gen 4

LuxTTS

By ysharma3501

ZipVoice-based voice cloning TTS that generates 48 kHz speech at up to 150x real time, fitting in about 1 GB VRAM for local, high-quality synthesis.

🔊Text to speech 🗣️Voice cloning

Audio

Released 4mo ago
Gen 3

MOVA

By OpenMOSS

Open-source foundation model that jointly generates video and audio in one pass, achieving tightly synchronized lip movements and environment-aware sound effects.

🎥Videos 🔊Text to speech 🎵Music 🎬Animations

Video

Released 5mo ago
Gen 4 Qwen

Qwen3 ForcedAligner 0.6B

By Alibaba

Multilingual forced alignment model that aligns speech and transcripts in 11 languages, predicting timestamps for arbitrary units in up to 5 minutes of audio with accuracy surpassing previous end-to-end aligners.

🗒Transcription 🔊Text to speech 🌐Text translation 🔍SEO content

Audio

Released 5mo ago
Gen 4 MiniMax

Minimax Speech 2.8 Turbo

By MiniMax

Latency-optimized sibling of Speech-2.8-HD, trading a bit of ultimate fidelity for faster, cheaper generation while keeping multilingual, emotional, voice-cloning strengths.

🔊Text to speech 🗣️Voice cloning

Audio

Released 5mo ago
Gen 4 MiniMax

Minimax Speech 2.8 HD

By MiniMax

High-definition MiniMax TTS model focused on studio-grade, multilingual speech, rich emotion control, interjections and voice cloning for premium voiceovers and production audio.

🔊Text to speech 🗣️Voice cloning

Audio

Released 5mo ago
Gen 2

D4RT

By Google

D4RT is DeepMind’s unified 4D scene reconstruction and tracking model that turns ordinary videos into a fast, queryable representation of 3D geometry and motion, solving tracking, depth and pose up to hundreds of times faster than prior work.

🌍3D images 🔊Text to speech 🎮Game creation 🎬Video editing

Coding

Released 5mo ago
Gen 4 Qwen

Qwen3 TTS 12Hz 1.7B CustomVoice

By Alibaba

Qwen’s open text-to-speech model supporting multilingual speech generation with custom voice capability.

🔊Text to speech 🗣️Voice cloning

Audio

Released 5mo ago
Gen 3

Manzano

By Apple

Manzano is Apple’s unified multimodal model that shares a hybrid vision tokenizer for both image understanding and text-to-image generation, using one autoregressive LLM plus a diffusion decoder to reach state-of-the-art unified performance.

🖼️Image generation 🔊Text to speech 🔍SEO content 🎮Game creation

Multimodal

Released 5mo ago
Gen 7

Chroma 1.0

By Flash Labs

FlashLabs Chroma 1.0 is a real-time spoken dialogue model that interleaves text and audio tokens to enable sub-second, end-to-end conversations with personalized voice cloning and high speaker similarity.

🔊Text to speech 🗣️Voice cloning

Text

Released 5mo ago
Gen 4 FLUX

FLUX.2 [klein] 4B

By Black Forest Labs

FLUX.2-klein-4B is Black Forest Labs’ 4B-parameter rectified-flow image model, unifying fast text-to-image and image-editing with multi-reference support, distilled for sub-second generation on consumer GPUs under Apache 2.0.

🖼️Image generation 🔊Text to speech 🔍SEO content 🖌️Image editing

Image

Released 5mo ago
Gen 7

Seed Prover 1.5

By ByteDance

Seed-Prover 1.5 is ByteDance Seed’s formal theorem-proving model for Lean, trained with agentic RL and test-time scaling to solve most undergraduate and many graduate-level competition problems.

📚Academic research 🔊Text to speech 🎨NFT art

Text

Released 5mo ago
Gen 4

Kyutai Pocket TTS 1.6B

By Kyutai

Kyutai TTS 1.6B is Kyutai's open-source streaming text-to-speech model for English and French, using delayed streams modeling to start speaking before the full text is read, enabling ultra-low-latency, high-quality voices for assistants and real-time apps.

🔊Text to speech 🔍SEO content 🎥Video generation

Audio

Released 5mo ago
Gen 4 Qwen

Qwen3 TTS VC Flash

By Alibaba

Qwen3-TTS-VC-Flash is Qwen’s VoiceClone voice-conversion model that clones any speaker from about 3 seconds of audio, then revoices speech in that identity across 10 languages with low word-error rates.

🗣️Voice cloning 🔊Text to speech

Audio

Released 6mo ago
Gen 4 Qwen

Qwen3 TTS VD Flash

By Alibaba

Qwen3-TTS-VD-Flash is Alibaba Qwen's voice-design TTS model that creates fully custom voices from natural-language instructions, letting users control timbre, rhythm, emotion and persona for expressive, multilingual speech via the Qwen API.

🔊Text to speech

Audio

Released 6mo ago
Gen 4

VoxCPM 1.5

By OpenBMB

VoxCPM1.5 is OpenBMB’s tokenizer-free TTS model that generates expressive, context-aware speech and realistic zero-shot voice clones in Chinese and English, with real-time streaming and open-source weights that support full and LoRA fine-tuning.

🔊Text to speech 📚Database Q&A 🗣️Voice cloning

Audio

Released 6mo ago
Gen 4

VibeVoice

By Microsoft

VibeVoice is Microsoft’s open-source frontier TTS framework that turns long text into expressive multi-speaker conversational audio, generating podcast-style speech with natural turn-taking in English and Mandarin.

🔊Text to speech 🌐Websites 📚Book writing 📚Academic writing

Audio

Released 6mo ago
Gen 4 Kling

Kling 2.6

By Kuaishou Technology

Kling Video 2.6 is Kling AI's latest video model that natively generates video plus dialogue, music and sound effects in one step, turning text or images into 5-10 second 1080p clips with tightly synced audio-visual storytelling for creators and advertisers.

🎥Videos 🔊Text to speech 🎵Music

Video

Released 6mo ago