TextToSpeech improvements for Qt 6.6
July 17, 2023 by Volker Hilsheimer
When we announced the Qt 6 port of Qt Speech for Qt 6.4, one of the comments pointed out that the module would be more valuable if applications could access the generated speech audio data. Qt 6.6 introduces exactly that, plus a few more features and API improvements.
Synthesizing audio data from text
The QTextToSpeech C++ class has learned how to generate the speech audio as PCM data. In addition to using QTextToSpeech::say(QString), which simply plays the generated audio, applications can call one of the new QTextToSpeech::synthesize() overloads. These overloads take the input text as well as a slot, i.e. a functor, lambda, free function, or member function pointer (with a context object if needed). That slot will then get called whenever a chunk of PCM data is available from the backend, with a QAudioBuffer from Qt Multimedia (or, slightly more efficiently, a QAudioFormat and QByteArray) describing the format and containing the actual data. Applications can then post-process the PCM data, write it to a file, or cache it for repeated playback using Qt Multimedia.
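For illustration, here is a minimal sketch that streams the generated PCM chunks into a raw file; the function name and the file name are assumptions for this example, not part of the API:

#include <QTextToSpeech>
#include <QAudioFormat>
#include <QFile>
#include <memory>

// Sketch: write each PCM chunk the backend delivers to a raw file.
void saveSpeech(QTextToSpeech *tts, const QString &text)
{
    auto pcmFile = std::make_shared<QFile>(QStringLiteral("speech.pcm"));
    if (!pcmFile->open(QIODevice::WriteOnly))
        return;
    tts->synthesize(text, tts, [pcmFile](const QAudioFormat &format, const QByteArray &bytes) {
        Q_UNUSED(format);      // sample rate, channel count, and sample format of the chunk
        pcmFile->write(bytes); // append the raw PCM data as it arrives
    });
}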
Better process control
With Qt 6.6, applications will have better control over the flow of the speech generation. The new QTextToSpeech::enqueue function adds an utterance to an ongoing text-to-speech process, and the new aboutToSynthesize signal is emitted before each of the enqueued utterances gets passed to the backend. This allows applications to make modifications to speech attributes, such as voice or pitch, for each utterance in the queue. And while speech audio is being played, QTextToSpeech can now emit the sayingWord signal for each word as it gets spoken, allowing applications to follow the progress and perhaps give visual cues to the user.
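A rough sketch of what this enables, assuming a QTextToSpeech instance named tts; the lambda parameter lists follow my reading of the Qt 6.6 signal documentation, and the utterance ids are assumed to be queue indices:

// Adjust speech attributes per utterance; aboutToSynthesize fires
// right before each queued utterance reaches the backend.
QObject::connect(tts, &QTextToSpeech::aboutToSynthesize, tts, [tts](qsizetype id) {
    tts->setPitch(id == 1 ? 0.5 : 0.0); // raise the pitch for the second utterance only
});

// Follow the spoken words, e.g. to highlight them in the UI.
QObject::connect(tts, &QTextToSpeech::sayingWord, tts,
                 [](const QString &word, qsizetype, qsizetype, qsizetype) {
    qDebug() << "Speaking:" << word;
});

tts->say(QStringLiteral("First sentence."));              // utterance 0
tts->enqueue(QStringLiteral("Second, spoken higher."));   // utterance 1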
Selecting voices made easy
We made it easier for applications to select a voice for the text-to-speech synthesis. This has been difficult until now, as applications had to first set the correct locale on the QTextToSpeech object, and then pick one of the voices from the list of availableVoices. With Qt 6.6, it becomes easy to find a suitable voice matching a combination of criteria:
const auto frenchWomen = textToSpeech->findVoices(QLocale::French,
                                                  QVoice::Female,
                                                  QVoice::Adult);
const auto norwegians = textToSpeech->findVoices(QLocale::Norway);
Note how the criteria can include an attribute of a locale (e.g. just "French" as the language, or "Norway" as the country; a QLocale object always has both defined). This way, your application doesn't have to worry about the optimal territory or dialect. To be fair, one shouldn't ask a Nynorsk voice to pronounce a Bokmål text; but if your system only happens to support one of the Norwegian official languages, then using it will still be an improvement over, say, your navigation system's English voice trying to pronounce my old street address in "Banksjef Frølichs Gate".
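findVoices() returns a list of matching QVoice objects, so applying the best match is a one-liner; a sketch continuing the snippet above:

if (!norwegians.isEmpty())
    textToSpeech->setVoice(norwegians.first());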
With the exception of QTextToSpeech::synthesize (where the code that processes the raw PCM bytes should be written in C++ anyway), all new capabilities are available from QML as well. E.g. the selection of a voice is achieved through an attached VoiceSelector property:
TextToSpeech {
    id: femaleEnglishVoice
    VoiceSelector.gender: Voice.Female
    VoiceSelector.language: Qt.locale("en")
}
This will implicitly select the first matching voice, or otherwise leave the voice unchanged.
What's left?
The last significant feature on my Qt TextToSpeech backlog is support for Speech Synthesis Markup Language, or SSML for short. A work-in-progress implementation is available on gerrit code review, and what I learned from that experiment is that each backend supports a different subset of SSML. Also, the data we get from backends for the new sayingWord signal are indices into the actual text being spoken, not into the XML string. This might be ok, but the feature needs some more thinking; we don't want an XML string that works well on one platform to break the output completely on a different platform (but should we then remove XML elements that we know to be currently unsupported?).
Not all new features are available with all backends. In particular, synthesising PCM data and reporting word-by-word progress require support from the backend. The new QTextToSpeech::engineCapabilities API reports which features are implemented by the backend, and we have updated the backend documentation with the relevant details. Applications can now check at runtime which features they can use, but it would of course be best if everything just worked everywhere. Most importantly, it would be great if we could synthesise speech PCM data also with the speech-dispatcher engine. Contributions welcome, although last time I checked, this required some work on speech-dispatcher itself (at least on the documentation).
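A runtime check could look like this sketch, again assuming a QTextToSpeech instance named tts; the flag names follow the Qt 6.6 Capability enum:

const QTextToSpeech::Capabilities caps = tts->engineCapabilities();
if (caps.testFlag(QTextToSpeech::Capability::Synthesize)) {
    // The backend can deliver raw PCM data through synthesize().
}
if (caps.testFlag(QTextToSpeech::Capability::WordByWordProgress)) {
    // The backend emits sayingWord while speaking.
}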
As for speech support as a whole: the qtspeech repository now covers the direction from text to speech; some research has been done and proof-of-concept implementations for speech recognition are available on gerrit code review. We'd be very interested to learn more about your use-cases for such a module.
And apropos contributions: around the Qt 6.6 feature freeze we had a public API review of Qt TextToSpeech, and I'd like to thank Marc, Fabian, and Philippe for taking the time to go through the changes, provide their feedback, and generally help with improving this module!