Edge TTS API

# Overview

A text-to-speech API I built for automated video narration. You send text, it gives you back audio + word-by-word subtitle files. I made it for those Minecraft storytelling videos that went viral from late 2023 into 2024 and were popular on TikTok and Facebook at the time. It was a cool concept and easy to automate, so I started building it one part at a time - this is the audio + subtitle engine part.

It used Microsoft Edge's neural TTS engine. 400+ voices, 100+ languages, completely free. Hosted on Vercel as serverless functions. Worked great for about 3-4 months. Then Microsoft changed their API. I tried fixing it. Didn't work. Ultimately I gave up, given the other work I had at the time.

# Timeline

July 27, 2024 - First commit. Built the whole thing in a day.
April 16, 2025 - Last attempt at fixing it. Gave up after that.

# Why I Built This

I wanted to make those simple videos myself. You've probably seen them - the creepypasta stories TikTok used to spit out back in the day. A Reddit horror story over a Minecraft parkour background, with big word-by-word text popping up in the middle of the screen in sync with the audio. They're not as viral now as they were back then, but they're still around.

Microsoft Edge has this "Read Aloud" feature with voices that actually sound good. Someone reverse-engineered it into a Python library called edge-tts. I wrapped that in an API, added subtitle generation, and deployed it to Vercel. It can be deployed on a VPS as well, but Vercel had a free tier, so I hosted it there. Worked perfectly for months.

# How It Worked

The flow was pretty simple:


You send text + voice choice. The API calls Microsoft Edge's TTS service. Gets the audio back. Also grabs timing data for each word (like "hello" starts at 0.00s, ends at 0.30s). Converts that into subtitle format. Zips everything together. Done.
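The timing-to-subtitle step described above can be sketched roughly like this. It is a minimal illustration, not the project's actual code: `WordTiming`, `format_srt_time`, and `words_to_srt` are hypothetical names, and the sketch assumes the word timings have already been converted to seconds.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    """One spoken word with start/end times in seconds (hypothetical shape)."""
    text: str
    start: float
    end: float


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[WordTiming]) -> str:
    """Build an SRT document with one numbered cue per word."""
    cues = []
    for index, word in enumerate(words, start=1):
        cues.append(
            f"{index}\n"
            f"{format_srt_time(word.start)} --> {format_srt_time(word.end)}\n"
            f"{word.text}\n"
        )
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)


if __name__ == "__main__":
    timings = [WordTiming("hello", 0.0, 0.3), WordTiming("world", 0.3, 0.75)]
    print(words_to_srt(timings))
```

One cue per word is what produces the "one big word at a time" effect in the final video. If the timing data arrives in some other unit than seconds, it would need converting before being passed in here.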

# Technical Stack

Core API

FastAPI - Web framework
edge-tts - Python wrapper for Microsoft Edge TTS
Uvicorn - Server
Pydantic - Data validation

Deployment

Vercel - Serverless hosting (Edge-tts-api-on-vercel repo)
Python 3.9 - Runtime

# Voices & Languages

400+ neural voices. Some I used:

English: en-US-BrianMultilingualNeural, en-GB-SoniaNeural
Sinhala: si-LK-SameeraNeural, si-LK-ThiliniNeural (yes, Sinhala included)

The quality was impressive. Not robotic like other TTS engines at the time. It actually sounded like a real person talking. The kind of voiceover you'd normally pay for. And it was completely free.

# Vercel Deployment

This was my first exploration into serverless environments. I set up the Vercel config files and everything else - all of it was new to me, but I figured it out with help from AI and deployed there. Serverless functions have an execution time limit. I don't remember what the limit was at the time, but it was pretty generous. I could run this on a few minutes' worth of text without a problem, and it would spit out the SRT file and the audio file, ready for download. Pretty cool overall.

# Why It Died

Microsoft updated Edge's TTS service. I think they also changed the API endpoints, updated authentication, and so on. The community adapted and updated the library, which meant I needed to change my scripts too to match the new style. But as I said, these videos aren't viral anymore, so why bother fixing a tool nobody wants? Still, I tried once to fix it.

What I tried:

Updated the library to the latest version
Adjusted the script a bit, since the subtitle part didn't work correctly
Poked at the internal API calls

But I ended up calling it a day. At that point it really felt like dead weight to support something no one was using - not even me.

# What I Learned

Even though it doesn't work anymore, I learned a lot:

FastAPI basics - Request models, routers, auto-generated docs
Async/await - TTS is slow, learned to use async properly
World of Serverless - Understanding serverless environments
External API risks - Building on undocumented APIs = breakage eventually
ZIP files - Creating file packages in memory
Subtitle formats - SRT and VTT
Vercel deployment - Serverless functions, cold starts, timeouts
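
The "file packages in memory" item from the list above can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the project's actual code: `build_zip` and the file names are hypothetical. Building the archive in a `BytesIO` buffer avoids writing temporary files to disk, which is convenient on serverless hosts where the filesystem is limited.

```python
import io
import zipfile


def build_zip(files: dict[str, bytes]) -> bytes:
    """Pack a name -> content mapping into a ZIP archive held entirely in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    # The archive is finalized when the with-block exits; return its raw bytes.
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder bytes stand in for real audio and subtitle data.
    package = build_zip({
        "narration.mp3": b"fake-audio-bytes",
        "narration.srt": b"1\n00:00:00,000 --> 00:00:00,300\nhello\n",
    })
    print(f"ZIP size: {len(package)} bytes")
```

The returned bytes can then be sent back as the body of an HTTP response, so the audio and subtitles arrive as one download.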

# Would I Do It Again?

Yeah, but differently next time.

I need tools that, once built, sustain themselves. Occasional troubleshooting is fine - but depending on undocumented APIs that break randomly isn't sustainable. If I built this again, I'd use official services or run it on a proper VPS where I have full control.

This was the first project I ever hosted myself, so yeah, I learned a lot. I'd definitely learn even more building it again with what I know now.

It was fun while it lasted. Serverless is cool - I'm a fan now. No server management, but less flexibility. I need to learn where to use it. That's a lesson. And I like lessons. The best lessons are the ones where you break stuff, then fix it yourself, multiple times.

Sometimes a project doesn't need to last forever. It just needs to teach you something while it works.

Lakshan De Silva
