AI Text-To-Speech – Using Play.ht API
ChatGPT and AI are terms that are everywhere nowadays. Whether you read or watch the news, AI is around. Even though AI as a concept is much older than ChatGPT, I clearly see an improvement, not only in text generation with the latest ChatGPT 4 but also in image generation with Midjourney. Voice generation is no exception. But how does it work today when it comes to integrating this new technology into our apps and services?
Today, we can find several products that offer an AI-based Text-To-Speech API, and in this article I will create a proof-of-concept Java application that uses the API offered by the startup Play.ht. Engage!
During the last year of my master’s degree at the University of Oviedo, I had the opportunity to include Google’s Speech-To-Text (not Text-To-Speech) in one of the Android apps I had to develop. Even then, the usage was quite simple, and I must confess I was pretty impressed by how easy and effective that technology was to use from an application.
Today, using a third-party API from one of our services is (usually) not difficult. Let’s see how to create a small Java application that uses the Play.ht API to obtain audio from the text we send.
All the analysis about Play.ht can be found in The Optimist Engineer Newsletter.
Create the app for the POC
First, we can start by creating a Java project skeleton. As you know, I’m a big fan of the Micronaut Framework, so let’s go with it:
mn create-cli-app --build=gradle --jdk=11 --lang=java --test=junit --features=http-client com.marcosflobo.demoplayht.demo-play_ht
Magic! We have a CLI application ready to implement. For this POC I plan to go fast so I just add a service that will use a declarative HTTP Client.
package com.marcosflobo.demoplayht;

import io.micronaut.http.annotation.Body;
import io.micronaut.http.annotation.Header;
import io.micronaut.http.annotation.Post;
import io.micronaut.http.client.annotation.Client;

// Declarative client: Micronaut generates the implementation at compile time.
@Client("https://play.ht")
@Header(name = "X-User-Id", value = "foo")            // replace "foo" with your User ID
@Header(name = "AUTHORIZATION", value = "Bearer bar") // replace "bar" with your Secret Key
@Header(name = "accept", value = "text/event-stream")
@Header(name = "content-type", value = "application/json")
public interface PathHtApiClient {

    // Sends the JSON payload and returns the whole event stream as a single String.
    @Post("/api/v2/tts")
    String get(@Body String request);
}
Please note that, in the headers X-User-Id and AUTHORIZATION, you will have to set the values provided by the Play.ht developer portal. Let’s jump there now.
Get API Access
Once you create an account and access the Play.ht developer portal, you can reach the API Access menu. There, you will see that the User ID has already been generated for you. You just have to click the button to generate the Secret Key. The mapping between our HTTP client headers and the credentials from the developer portal is as follows:
HTTP Header | Credential parameters |
---|---|
X-User-Id | User ID |
AUTHORIZATION | Secret Key (sent as "Bearer <Secret Key>") |
The API documentation is quite simple, which is a good thing in my opinion. Setting up the User ID and the Secret Key was straightforward. Moreover, the documentation is very clear about how to include those parameters in our requests.
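To avoid hardcoding the credentials, one option is to read them from the environment. The following is only a minimal sketch that uses the JDK's own `java.net.http` classes rather than the Micronaut client from this POC; the class name `PlayHtRequestFactory` and the variable names `PLAYHT_USER_ID` and `PLAYHT_SECRET_KEY` are my own assumptions, not something defined by Play.ht. It only builds the request, it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class PlayHtRequestFactory {

    // Builds (but does not send) the POST request with the same headers as the declarative client.
    static HttpRequest build(String userId, String secretKey, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://play.ht/api/v2/tts"))
                .header("X-User-Id", userId)
                .header("AUTHORIZATION", "Bearer " + secretKey)
                .header("accept", "text/event-stream")
                .header("content-type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical environment variable names; the fallbacks are the placeholders used in the article.
        String userId = System.getenv().getOrDefault("PLAYHT_USER_ID", "foo");
        String secretKey = System.getenv().getOrDefault("PLAYHT_SECRET_KEY", "bar");
        HttpRequest request = build(userId, secretKey, "{\"text\":\"Hello\",\"voice\":\"larry\"}");
        // Print only the method, URI and header names, never the secret itself.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map().keySet());
    }
}
```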
Configure and run the app
Now, let’s create a small service, called from the main method, that provides the payload (the text) for the API call.
package com.marcosflobo.demoplayht;

import jakarta.inject.Singleton;

@Singleton
public class PathHtService {

    private final PathHtApiClient pathHtApiClient;

    public PathHtService(PathHtApiClient pathHtApiClient) {
        this.pathHtApiClient = pathHtApiClient;
    }

    public String get() {
        String request = "{\n"
            + " \"text\": \"10 years ago, I started to be more active in the Tech Events thing. Some of the biggest events I had the opportunity to attend were OpenStack Summit 2015, DockerCon Europe 2018 (2.200 attendees), and KubeCon Europe 2019 (7.700 attendees).\",\n"
            + " \"voice\": \"larry\"\n"
            + "}";
        return pathHtApiClient.get(request);
    }
}
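The payload above is concatenated by hand, which works as long as the text contains no quotes, backslashes, or line breaks. If the text came from user input, it would need escaping first. Here is a small sketch of how that could look without any JSON library; `TtsPayload` and its methods are hypothetical names of mine, and in a real project a JSON mapper would probably be the better choice.

```java
final class TtsPayload {

    // Escapes the characters that JSON does not allow unescaped inside a string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \ u00XX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the same two-field payload the service sends: "text" and "voice".
    static String build(String text, String voice) {
        return "{\"text\":\"" + escape(text) + "\",\"voice\":\"" + escape(voice) + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(build("She said \"hi\"", "larry"));
    }
}
```

With this helper, the service could call `pathHtApiClient.get(TtsPayload.build(text, "larry"))` instead of building the string inline.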
Ready. Let’s run the application:
./gradlew run
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0,"stage":"queued"}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.01,"stage":"active"}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.01,"stage":"preload","stage_progress":0}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.11,"stage":"preload","stage_progress":0.5}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.16,"stage":"preload","stage_progress":0.75}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.21,"stage":"preload","stage_progress":1}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.21,"stage":"generate","stage_progress":0}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.32,"stage":"generate","stage_progress":0.2}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.53,"stage":"generate","stage_progress":0.6}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.55,"stage":"generate","stage_progress":0.64}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.57,"stage":"generate","stage_progress":0.68}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.74,"stage":"generate","stage_progress":1}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.74,"stage":"postprocessing","stage_progress":0}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.82,"stage":"postprocessing","stage_progress":0.33}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.91,"stage":"postprocessing","stage_progress":0.67}
event: generating
data: {"id":"Viu8v3Wvzgm40e08IV","progress":0.99,"stage":"postprocessing","stage_progress":1}
event: completed
data: {"id":"Viu8v3Wvzgm40e08IV","progress":1,"stage":"complete","url":"https://peregrine-results.s3.amazonaws.com/pigeon/adasdsadasdas_0.mp3","duration":18.7093,"size":375885}
In the last event, we can find the link to the MP3 file to be downloaded. The endpoint POST /api/v2/tts supports more parameters, for example, choosing the format of the file (mp3, wav, ogg, flac), the quality of the audio, the voice, and even its speed, among others. Play.ht uploads the resulting MP3 audio to an AWS S3 bucket for us to use.
The input I’ve provided is 237 characters and it takes ~7 seconds on average. Considering the time to analyze the text, generate the audio, and upload the 375 KB MP3 to an AWS S3 bucket, I would say it’s a fair response time. Also, even though the whole response appears to arrive in one shot (because I used String as the response type), the response is an event stream, so it is something we could handle incrementally.
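Since the client returns the whole stream as one String, the download link has to be pulled out of the last event. A minimal sketch of that step is below; `CompletedEventParser` is a hypothetical name, and the regular expression assumes the flat JSON shape shown in the output above (a real implementation would likely use a JSON parser).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompletedEventParser {

    // Matches the "url" field of a data line; enough for the flat payload shown above.
    private static final Pattern URL_FIELD = Pattern.compile("\"url\"\\s*:\\s*\"([^\"]+)\"");

    // Returns the audio URL from the data line that follows "event: completed", if any.
    static Optional<String> audioUrl(String body) {
        boolean completed = false;
        for (String line : body.split("\\R")) {
            if (line.startsWith("event:")) {
                completed = line.substring("event:".length()).trim().equals("completed");
            } else if (completed && line.startsWith("data:")) {
                Matcher m = URL_FIELD.matcher(line);
                if (m.find()) {
                    return Optional.of(m.group(1));
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String body = "event: generating\n"
                + "data: {\"id\":\"Viu8v3Wvzgm40e08IV\",\"progress\":0.99,\"stage\":\"postprocessing\",\"stage_progress\":1}\n"
                + "event: completed\n"
                + "data: {\"id\":\"Viu8v3Wvzgm40e08IV\",\"progress\":1,\"stage\":\"complete\","
                + "\"url\":\"https://peregrine-results.s3.amazonaws.com/pigeon/adasdsadasdas_0.mp3\"}\n";
        System.out.println(audioUrl(body).orElse("no completed event found"));
    }
}
```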
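To give an idea of what handling the stream line by line could look like, here is a small sketch that keeps track of the latest progress value as lines arrive. `ProgressTracker` is a hypothetical class of mine; the lines could come from any source that delivers them one at a time, for example the JDK's `HttpResponse.BodyHandlers.ofLines()`, which I have not wired in here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

final class ProgressTracker {

    // Matches the top-level "progress" field; "stage_progress" does not match because of the opening quote.
    private static final Pattern PROGRESS = Pattern.compile("\"progress\"\\s*:\\s*([0-9.]+)");

    private double lastProgress = 0.0;

    // Call once per received line; returns true if the line carried a progress update.
    boolean onLine(String line) {
        if (!line.startsWith("data:")) {
            return false;
        }
        Matcher m = PROGRESS.matcher(line);
        if (!m.find()) {
            return false;
        }
        lastProgress = Double.parseDouble(m.group(1));
        return true;
    }

    double lastProgress() {
        return lastProgress;
    }

    public static void main(String[] args) {
        // Stand-in for a live stream of lines from the API.
        Stream<String> lines = Stream.of(
                "event: generating",
                "data: {\"id\":\"Viu8v3Wvzgm40e08IV\",\"progress\":0.11,\"stage\":\"preload\",\"stage_progress\":0.5}",
                "event: generating",
                "data: {\"id\":\"Viu8v3Wvzgm40e08IV\",\"progress\":0.53,\"stage\":\"generate\",\"stage_progress\":0.6}");
        ProgressTracker tracker = new ProgressTracker();
        lines.forEach(line -> {
            if (tracker.onLine(line)) {
                System.out.printf("progress: %.0f%%%n", tracker.lastProgress() * 100);
            }
        });
    }
}
```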
In my example app, I could only use the “Larry” voice. When I tried another voice, I could not figure out which ID from the voice list on the developer portal I had to set in the request, so I kept getting a BAD REQUEST response.
Also, I want to mention that in some runs, more often than I expected, I got a “Read Timeout” from the HTTP client. I choose to believe that this happens with free accounts like mine and that it does not happen with a paid account.
19:36:17.984 [main] INFO i.m.c.DefaultApplicationContext$RuntimeConfiguredEnvironment - Established active environments: [cli]
Running!
19:36:29.940 [main] ERROR i.m.r.intercept.RecoveryInterceptor - Type [com.marcosflobo.demoplayht.PathHtApiClient$Intercepted] executed with error: Read Timeout
io.micronaut.http.client.exceptions.ReadTimeoutException: Read Timeout
at io.micronaut.http.client.exceptions.ReadTimeoutException.<clinit>(ReadTimeoutException.java:26)
at io.micronaut.http.client.netty.DefaultHttpClient.lambda$exchangeImpl$33(DefaultHttpClient.java:1097)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:94)
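One way to cope with occasional timeouts is to retry the call a few times with an increasing pause. The sketch below is a generic helper, not part of the POC; `Retry` and `withBackoff` are hypothetical names. As far as I know, the Micronaut HTTP client read timeout defaults to 10 seconds and can be raised with the `micronaut.http.client.read-timeout` property, which may be the simpler fix given that a generation takes ~7 seconds on average.

```java
import java.util.function.Supplier;

final class Retry {

    // Calls the supplier up to maxAttempts times, doubling the pause after each failure.
    // Note: this retries on any RuntimeException; a real version should retry only on timeouts.
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
                delay *= 2;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated call that times out twice before succeeding.
        String result = withBackoff(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("Read Timeout");
            }
            return "event: completed";
        }, 5, 100L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the service, the call could then be wrapped as `Retry.withBackoff(() -> pathHtApiClient.get(request), 3, 1000L)`. Keep in mind that every retry is a new POST, so each attempt may start a new generation on the Play.ht side.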
Summary
To summarize: a very good developer experience. I could set up a free account very quickly and start using it. Also, the documentation is short and concise, which (once again) helps developers speed up the implementation.
My perception is that the free accounts clearly must have less priority than the paid accounts. I can imagine also that, due to the heavy load they might have because of all the hype around AI tools and products, their backend could be a bit overloaded, so the free accounts could get some Read Timeouts as I got.
It’s the first time I have used an AI-based Text-To-Speech technology, and I’m impressed with the speed of generating the audio file from the text provided. I also liked that the audio files are uploaded to AWS S3. I’m wondering if, in their current roadmap, they have a feature request to let customers configure their own AWS S3 bucket as the upload target for the audio files.
By the way, the whole Java Micronaut project can be found on GitHub.
I hope you enjoyed this POC. Are you using any Text-To-Speech AI in your services? What was your experience? Reach out to me on Twitter or Mastodon!