Text to speech
taaft.com/text-to-speech
264,873 subscribers
There are 7 Free AI tools for Text to speech.
Also used for Text to speech 7
- Generate game-ready assets with AI in seconds. 2,274 · spritefy.com
- Turn any content into engaging podcasts instantly. Released 22d ago · Free + from $9.9/mo · 457104.0
  Review by Yawin Lin (Jun 10, 2026, @PodcastorAI): "As a student, I regularly work with research papers and long PDFs. PodcastorAI has been a helpful way to turn that content into something I can listen to while commuting or walking. I like that it creates a structured podcast rather than simply reading the text aloud. The dialogue format is surprisingly engaging and makes dense material easier to get through. I still rely on the original documents for deeper study, but for review and knowledge retention, it's been a genuinely useful tool."
- Real-time AI-powered Scripture display and note taker that responds to voice. 15,127 · citeverse.live · Released 1mo ago · Free + from $22.50/mo · 15,89775.0
- Turn one screen recording into videos, interactive tours, and product guides. 6,292 · trainn.co · Released 3mo ago · Free + from $19/mo · 7,658105.0
- A personalized audio story gift starring any child, delivered instantly by email.
- Clone any voice in seconds with 99% similarity. Released 5mo ago · 100% Free · 6,836984.5
  Review: "KikiVoice does a great job with voice cloning; the results sound very natural and close to the original, and there’s no sign-up required."
- Transcribe audio & video with Whisper. Export TXT/SRT/VTT. Auto-delete 24h. Released 5mo ago · Free + from $14.99/mo · 1,09233.6
Models 100
- By ByteDance. Seed Audio 1.0 is ByteDance's universal audio generation model that creates voice, music, sound effects, and ambient soundscapes from text prompts. It supports zero-shot voice cloning from short audio references, multi-character dialogue generation in a single pass, and cross-lingual synthesis without fine-tuning. Accessible via Volcano Engine API. New · Audio · Released 6d ago
- By Zyphra AI. ZONOS2 is Zyphra’s open-source real-time text-to-speech model with MoE architecture, high-fidelity zero-shot voice cloning, and multilingual expressive speech generation. New · Multimodal · Released 17d ago
- Gemini Audio is Google DeepMind’s closed-source native audio model family for low-latency live dialogue, controllable speech generation, audio understanding, and voice-first applications. New · Audio · Released 20d ago
- By Boson AI. Higgs Audio v3 TTS is Boson AI’s text-to-speech model for expressive conversational voice agents across 100+ languages with zero-shot voice cloning and inline speech controls. New · Audio · Released 25d ago
- By Miso Labs. MisoTTS is Miso Labs’ open-weight 8B text-and-audio-conditioned speech generation model for expressive, context-aware, emotive TTS and dialogue voice output. New · Audio · Released 26d ago
- By Microsoft. MAI-Voice-2-Flash is Microsoft AI’s upcoming lower-cost, ultra-efficient variant of MAI-Voice-2 for speech generation. New · Multimodal · Released 27d ago
- By Microsoft. MAI-Voice-2 is Microsoft AI’s speech generation model for natural-sounding voice output across 15 languages with short-sample voice adaptation. New · Multimodal · Released 27d ago
- By Gradium. Phonon is Gradium’s private-beta 100M-parameter on-device text-to-speech model for low-latency, offline, privacy-sensitive voice generation. New · Video · Released 1mo ago
- By Inworld AI. Realtime TTS-2 is Inworld AI’s realtime conversational text-to-speech model. It is built for live voice interaction rather than narration, with conversational awareness from prior audio turns, natural-language voice direction, crosslingual voice identity across 100+ languages, and prompt-based voice design. New · Multimodal · Released 1mo ago
- By Cartesia. Sonic 3.5 is Cartesia’s fastest and most natural text-to-speech model, built for low-latency conversational voice generation across 42 languages. New · Multimodal · Released 1mo ago
- By Pruna AI. p-video-avatar is Pruna’s talking-head video generation model for creating speaking avatar videos from a single portrait image. It takes either a text script or an audio file, then generates a realistic head-and-shoulders speaking video, with support for multiple voices, languages, and 720p or 1080p output. New · Multimodal · Released 1mo ago
- sarashina2.2-tts is SB Intuitions’ Japanese-centric large-language-model-based text-to-speech system. It supports Japanese and English, is designed for high pronunciation accuracy, naturalness, and stability across diverse speaking styles, and includes zero-shot voice generation. New · Multimodal · Released 2mo ago
- By Soniox. Soniox Text-to-Speech is Soniox’s multilingual TTS model and API for precise, low-latency speech generation. It is built for production voice systems, supports 60+ languages, and emphasizes accurate pronunciation, faithful reading of structured text like emails and phone numbers, natural code-switching, and streaming output for real-time voice apps. New · Multimodal · Released 2mo ago
- By OpenRouter. OpenRouter TTS is OpenRouter’s unified text-to-speech interface for accessing multiple speech-generation models through one API layer. It standardizes voice generation across providers, supporting streaming audio output, customizable voices, and multimodal workflows without provider-specific integration complexity. New · Multimodal · Released 2mo ago
- By StepFun. StepAudio 2.5 TTS is StepFun’s contextual text-to-speech model with performance-oriented vocal control. It combines global and inline context guidance with zero-shot voice cloning so generated speech can follow broader style instructions as well as local delivery details, rather than just reading text flatly. New · Multimodal · Released 2mo ago
- By Microsoft. MAI-Voice-1 is Microsoft’s top-tier text-to-speech model for natural, expressive voice generation. It is built to preserve clarity, intent, speaker identity, emotional nuance, and pacing across long-form speech, and supports custom voice creation from only a few seconds of audio. Microsoft positions it for voice experiences, voice agents, and expressive spoken content at high speed and low cost. New · Audio · Released 2mo ago
- By OpenMOSS. MOSS-TTS-Nano is an open-source multilingual tiny speech generation model from MOSI.AI and OpenMOSS. With only 0.1B parameters, it is built for real-time TTS, can run directly on CPU without a GPU, and keeps deployment simple enough for local demos, web serving, and lightweight product integration. New · Multimodal · Released 2mo ago
- By danneauxs. Pocket-TTS-Spokenword is an enhanced version of Kyutai’s Pocket TTS built for emotionally expressive audiobook generation from plain text. It adds AI emotion analysis, smart text chunking, voice adaptation, and voice cloning, while staying lightweight enough to run on CPU-only systems without requiring a GPU. New · Audio · Released 2mo ago
- By OpenBMB. VoxCPM2 is OpenBMB’s open-source tokenizer-free multilingual text-to-speech model for natural speech generation, voice design, and controllable voice cloning. It is a 2B-parameter model trained on over 2 million hours of speech, supports 30 languages, and produces 48 kHz studio-quality audio with real-time streaming capability. New · Multimodal · Released 2mo ago
- By Xiaomi. OmniVoice is a multilingual zero-shot text-to-speech model built for voice cloning, voice design, and general speech synthesis at massive language scale. It supports more than 600 languages, uses a diffusion language model-style architecture, and is positioned for high-quality speech generation with fast inference. New · Multimodal · Released 2mo ago
- By Meituan. LongCat-AudioDiT-3.5B is Meituan LongCat’s diffusion-based text-to-speech model built directly in waveform latent space rather than mel-spectrogram space. It is designed for high-fidelity speech generation and zero-shot voice cloning, supports Chinese and English, and is positioned as a top-performing open model on the Seed benchmark for speaker similarity and intelligibility. New · Audio · Released 2mo ago
- By Mistral AI. Voxtral TTS is Mistral’s new open-source text-to-speech model for building voice agents and enterprise speech applications. According to TechCrunch, it supports 9 languages, can clone a voice from under 5 seconds of audio, preserves accents and speaking style, and is optimized for real-time use on edge devices like phones, laptops, and wearables. Audio · Released 3mo ago
- By Smallest AI. Lightning is Smallest.ai’s low-latency text-to-speech system for real-time voice agents, voiceovers, and voice cloning. Audio · Released 3mo ago
- By OpenMOSS. MOSS-TTS-Local-Transformer-v1.5 is a 5B-parameter text-to-speech model supporting 31 languages with zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, code-switching, and 48 kHz stereo audio output via MOSS-Audio-Tokenizer-v2. Audio · Released 3mo ago
- By Xiaomi. MiMo-V2-TTS is Xiaomi’s large-scale speech synthesis model built for expressive agent voice, aiming for natural, emotionally aware speech. Audio · Released 3mo ago
- By Hume AI. TADA-1B is a unified speech-language model checkpoint that aligns text tokens and speech representations 1-to-1 for fast, reliable text-to-speech generation. Audio · Released 3mo ago
- By Hume AI. TADA-3B-ml is a multilingual TADA checkpoint built for fast, reliable speech generation using the same 1-to-1 text-acoustic alignment framework. Multimodal · Released 3mo ago
- By Alibaba. Qwen3-TTS is a speech generation model family designed for high-quality, human-like TTS with voice cloning and natural-language control over voice style. Audio · Released 4mo ago
- By OpenAI. gpt-realtime-1.5 is OpenAI’s flagship real-time voice model for audio-in, audio-out use cases like voice agents and customer support, with support for text, audio, and image inputs and text and audio outputs. Multimodal · Released 4mo ago
- Conversational speech generation model that generates audio codes from text and audio inputs for dialogue-style speech output. Coding · Released 4mo ago
- By ByteDance. Seed 2.0 is described publicly only as a new ByteDance Seed language model for Doubao; no reliable, detailed public technical description of its architecture, context length, or training data is available yet. Multimodal · Released 4mo ago
- By ByteDance. Seedream 5.0 Lite is ByteDance Seed's multimodal image generator with deep reasoning and built-in web search, built for precise text-to-image generation and image editing that follow complex, real-time instructions with tight layout and style control. Image · Released 4mo ago
- By Zyphra AI. Zonos-v0.1 is Zyphra’s open-weight text-to-speech family, two 1.6B models trained on 200k+ hours of multilingual speech, offering expressive, real-time TTS and high-quality voice cloning. Audio · Released 4mo ago
- By ysharma3501. ZipVoice-based voice cloning TTS that generates 48 kHz speech at up to 150x real time, fitting in about 1 GB VRAM for local, high-quality synthesis. Audio · Released 4mo ago
- By OpenMOSS. Open-source foundation model that jointly generates video and audio in one pass, achieving tightly synchronized lip movements and environment-aware sound effects. Video · Released 4mo ago
- By Alibaba. Multilingual forced alignment model that aligns speech and transcripts in 11 languages, predicting timestamps for arbitrary units in up to 5 minutes of audio with accuracy surpassing previous end-to-end aligners. Audio · Released 5mo ago
- By MiniMax. Latency-optimized sibling of Speech-2.8-HD, trading a bit of ultimate fidelity for faster, cheaper generation while keeping multilingual, emotional, voice-cloning strengths. Audio · Released 5mo ago
- By MiniMax. High-definition MiniMax TTS model focused on studio-grade, multilingual speech, rich emotion control, interjections, and voice cloning for premium voiceovers and production audio. Audio · Released 5mo ago
- By Google. D4RT is DeepMind’s unified 4D scene reconstruction and tracking model that turns ordinary videos into a fast, queryable representation of 3D geometry and motion, solving tracking, depth, and pose up to hundreds of times faster than prior work. Coding · Released 5mo ago
- By Alibaba. Qwen’s open text-to-speech model supporting multilingual speech generation with custom voice capability. Audio · Released 5mo ago
- By Apple. Manzano is Apple’s unified multimodal model that shares a hybrid vision tokenizer for both image understanding and text-to-image generation, using one autoregressive LLM plus a diffusion decoder to reach state-of-the-art unified performance. Multimodal · Released 5mo ago
- By Flash Labs. FlashLabs Chroma 1.0 is a real-time spoken dialogue model that interleaves text and audio tokens to enable sub-second, end-to-end conversations with personalized voice cloning and high speaker similarity. Text · Released 5mo ago
- FLUX.2-klein-4B is Black Forest Labs’ 4B-parameter rectified-flow image model, unifying fast text-to-image and image-editing with multi-reference support, distilled for sub-second generation on consumer GPUs under Apache 2.0. Image · Released 5mo ago
- By ByteDance. Seed-Prover 1.5 is ByteDance Seed’s formal theorem-proving model for Lean, trained with agentic RL and test-time scaling to solve most undergraduate and many graduate-level competition problems. Text · Released 5mo ago
- By Kyutai. Kyutai TTS 1.6B is Kyutai's open-source streaming text-to-speech model for English and French, using delayed streams modeling to start speaking before the full text is read, enabling ultra-low-latency, high-quality voices for assistants and real-time apps. Audio · Released 5mo ago
- By Alibaba. Qwen3-TTS-VC-Flash is Qwen’s VoiceClone voice-conversion model that clones any speaker from about 3 seconds of audio, then revoices speech in that identity across 10 languages with low word-error rates. Audio · Released 6mo ago
- By Alibaba. Qwen3-TTS-VD-Flash is Alibaba Qwen's voice-design TTS model that creates fully custom voices from natural-language instructions, letting users control timbre, rhythm, emotion, and persona for expressive, multilingual speech via the Qwen API. Audio · Released 6mo ago
- By OpenBMB. VoxCPM1.5 is OpenBMB’s tokenizer-free TTS model that generates expressive, context-aware speech and realistic zero-shot voice clones in Chinese and English, with real-time streaming and open-source weights that support full and LoRA fine-tuning. Audio · Released 6mo ago
- By Microsoft. VibeVoice is Microsoft’s open-source frontier TTS framework that turns long text into expressive multi-speaker conversational audio, generating podcast-style speech with natural turn-taking in English and Mandarin. Audio · Released 6mo ago
- Kling Video 2.6 is Kling AI's latest video model that natively generates video plus dialogue, music, and sound effects in one step, turning text or images into 5-10 second 1080p clips with tightly synced audio-visual storytelling for creators and advertisers. Video · Released 6mo ago
Devices 4
- The Beanie · Smart Phone · Sabi · Apr, 2026 · Announced · N/A · A non-invasive knit beanie that decodes internal speech into text using a dense array of ~70,000–100,000 fabric-embedded dry biosensors p...
- MemoMind One · Smart Glasses · MemoMind · May 28, 2026 · Announced · $599.00 · Camera-free AI smart glasses with dual-eye waveguide microLED display (green monochrome), integrated Harman Kardon-tuned stereo speakers,...
- AIY Voice Kit · Smart Speaker · Google · Apr 16, 2018 · Discontinued · $49.99 · The AIY Voice Kit from Google is a do-it-yourself intelligent speaker that lets you build your own natural language processor and connect...
- OpenHome DevKit · Smart Speaker · OpenHome Technologies · Mar 11, 2026 · Available · $200.00 · The OpenHome DevKit is an open-source voice AI development platform that lets developers build custom AI-powered smart speakers and voice...
Repositories 592
#
Repository
Company
Stars
Forks
License
Size
Updated
