this post was submitted on 01 Dec 2023
12 points (100.0% liked)
Free Open-Source Artificial Intelligence
3631 readers
1 users here now
Welcome to Free Open-Source Artificial Intelligence!
We are a community dedicated to forwarding the availability and access to:
Free Open Source Artificial Intelligence (F.O.S.A.I.)
More AI Communities
LLM Leaderboards
Developer Resources
GitHub Projects
FOSAI Time Capsule
- The Internet is Healing
- General Resources
- FOSAI Welcome Message
- FOSAI Crash Course
- FOSAI Nexus Resource Hub
- FOSAI LLM Guide
founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
This is so exciting!
I can't wait to see how well the Expressive model does on anime and foreign films. I wouldn't be surprised if this was the end of terrible dubs.
This is gonna be great for language learning as well. Finally being able to pick any media and watch it in any language. It might even be possible to rig it up to an LLM to tune the vocab to your exact level...
Yeah, I was over-enthusiastic based on their cherry-picked examples. SeamlessExpressive still leaves a lot to be desired.
It has a limited range of emotions and can't change emotion in the middle of the clip. It can't produce the pitch shifts of someone talking excitedly, making the output sound monotonous. Background noise in the input causes a raspy, distorted output voice. Sighs, inter-sentence breaths, etc. aren't reproduced. Sometimes the sentence pacing is just completely unnatural, with missing pauses or pauses in bad places (e.g. before the sentence-final verb in German).
IMO their manual dataset creation is holding them back. If I was in this field, I would try to follow the LLM route: Start with a next-token predictor trained indiscriminately on large-scale speech+text data (e.g. TV shows, movies, news radio, all with subtitles even if the subs need to be AI generated), fine-tune it for specific tasks (mainly learning to predict and generate based on "style tokens" (speaker, emotion, accent, pacing)), then generate a massive "textbook" synthetic dataset. The translation aspect could be almost completely outsourced to LLMs or multilingual subtitles.