CN104867491B - Rhythm model training method and device for phonetic synthesis - Google Patents


Info

Publication number
CN104867491B
CN104867491B (application CN201510337430.7A)
Authority
CN
China
Prior art keywords
text
rhythm model
rhythm
participle
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510337430.7A
Other languages
Chinese (zh)
Other versions
CN104867491A (en
Inventor
徐扬凯
李秀林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510337430.7A priority Critical patent/CN104867491B/en
Publication of CN104867491A publication Critical patent/CN104867491A/en
Application granted granted Critical
Publication of CN104867491B publication Critical patent/CN104867491B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a rhythm model training method and device for speech synthesis. The training method includes: S1, extracting the text features and label features corresponding to the word segments in a training corpus text; S2, generalizing the word segments in the training corpus text based on a synonym thesaurus; and S3, training the rhythm model according to the text features, the label features, and the generalized segments. By extracting the text features and label features of the word segments from the training corpus text, generalizing the segments with a synonym thesaurus, and then training on the text features, label features, and generalized segments, the method and device make the rhythm model more complete and thereby improve the accuracy of prosody prediction.

Description

Rhythm model training method and device for speech synthesis
Technical field
The present invention relates to the technical field of text-to-speech, and more particularly to a rhythm model training method and device for speech synthesis.
Background art
Speech synthesis, also known as text-to-speech, is a technology that converts text information into speech and reads it aloud. With continuing technological progress, applications of speech synthesis have become increasingly widespread, such as the broadcasting of news and the reading of audiobooks. In daily life, messages such as SMS and e-mail can also be read out through speech synthesis, offering users one more way to obtain information.
In a speech synthesis system, prosody prediction is the foundation of the whole system; an error in predicting prosodic pauses directly degrades the synthesis result. For example, for the text "if a passerby hands it an empty bottle", the correct prosody should be "if #1 a passerby #1 hands it #2 an #1 empty bottle", while the actual prosody prediction result is "if #1 a passerby #1 hands it #1 an #2 empty bottle", where #1 denotes a small pause and #2 a large pause. The pause prediction errors make the synthesized sentence sound insufficiently natural and fluent, giving the user a poor experience.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art. A first object of the present invention is therefore to propose a rhythm model training method for speech synthesis that improves the rhythm model and thereby raises the accuracy of prosody prediction.
A second object of the present invention is to propose a speech synthesis method.
A third object of the present invention is to propose a rhythm model training device for speech synthesis.
A fourth object of the present invention is to propose a speech synthesis device.
To achieve these objects, an embodiment of the first aspect of the present invention proposes a rhythm model training method for speech synthesis, including: S1, extracting the text features and label features corresponding to the word segments from a training corpus text; S2, generalizing the word segments in the training corpus text based on a synonym thesaurus; and S3, training the rhythm model according to the text features, the label features, and the generalized segments.
In the rhythm model training method for speech synthesis of this embodiment, the text features and label features corresponding to the word segments are extracted from the training corpus text, the word segments in the corpus are generalized based on a synonym thesaurus, and the rhythm model is then trained according to the text features, the label features, and the generalized segments. The resulting rhythm model is more complete, which in turn improves the accuracy of prosody prediction.
An embodiment of the second aspect of the present invention proposes a speech synthesis method, including: S4, extracting text features from a text to be predicted, and inputting the text features into the rhythm model; S5, performing prosody prediction on the text to be predicted according to the rhythm model; S6, further performing acoustic prediction on the text to be predicted to generate an acoustic parameter sequence; and S7, splicing the acoustic parameter sequence to generate a speech synthesis result.
In the speech synthesis method of this embodiment, text features are extracted from the text to be predicted and input into the rhythm model, prosody prediction is performed on the text according to the rhythm model, acoustic prediction then generates an acoustic parameter sequence, and the sequence is spliced into the speech synthesis result. Because the rhythm model is built with a synonym thesaurus, the accuracy of prosody prediction improves, the prosodic pauses sound more natural and fluent, and the user experience is better.
An embodiment of the third aspect of the present invention proposes a rhythm model training device for speech synthesis, including: an extraction module for extracting the text features and label features corresponding to the word segments from a training corpus text; a generalization module for generalizing the word segments in the training corpus text based on a synonym thesaurus; and a training module for training the rhythm model according to the text features, the label features, and the generalized segments.
In the rhythm model training device for speech synthesis of this embodiment, the text features and label features corresponding to the word segments are extracted from the training corpus text, the word segments in the corpus are generalized based on a synonym thesaurus, and the rhythm model is then trained according to the text features, the label features, and the generalized segments. The resulting rhythm model is more complete, which in turn improves the accuracy of prosody prediction.
An embodiment of the fourth aspect of the present invention proposes a speech synthesis device, including: an analysis module for extracting text features from a text to be predicted and inputting the text features into the rhythm model; a prosody prediction module for performing prosody prediction on the text to be predicted according to the rhythm model; an acoustic prediction module for further performing acoustic prediction on the text to be predicted to generate an acoustic parameter sequence; and a generation module for splicing the acoustic parameter sequence to generate a speech synthesis result.
In the speech synthesis device of this embodiment, text features are extracted from the text to be predicted and input into the rhythm model, prosody prediction is performed on the text according to the rhythm model, acoustic prediction then generates an acoustic parameter sequence, and the sequence is spliced into the speech synthesis result. Because the rhythm model is built with a synonym thesaurus, the accuracy of prosody prediction improves, the prosodic pauses sound more natural and fluent, and the user experience is better.
Brief description of the drawings
Fig. 1 is a flow chart of a rhythm model training method for speech synthesis according to an embodiment of the present invention.
Fig. 2 is a flow chart of a speech synthesis method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a rhythm model training device for speech synthesis according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a speech synthesis device according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, with examples shown in the accompanying drawings, where identical or similar reference numbers denote, throughout, identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and are not to be construed as limiting it.
The rhythm model training method and device for speech synthesis, and the speech synthesis method and device, of embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a rhythm model training method for speech synthesis according to an embodiment of the present invention.
As shown in Fig. 1, the rhythm model training method for speech synthesis may include the following steps.
S1, extracting the text features and label features corresponding to the word segments from a training corpus text.
The training corpus can be split into multiple word segments, each of which has corresponding text features and a label feature. Text features may include part of speech, word length, and the like. The label feature is the prosodic pause level corresponding to the segment's prosodic category: for example, a rhythm word corresponds to pause level #1, a prosodic phrase to pause level #2, and an intonation phrase to pause level #3.
For example, take the sentence "The EU #2 decides #1 to establish #2 a joint force #3 to crack down on #2 Mediterranean #1 people-smuggling #1 activities #3". The word sequence x is: EU / decides / establish / joint force / crack down on / Mediterranean / people-smuggling / activities, and the label sequence y is: #2 #1 #2 #3 #2 #1 #1 #3. The label sequence y is made up of the label features of the individual segments.
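The segment, text-feature, and label-feature representation described above can be sketched in code. This is a minimal sketch only; the helper names and the exact feature set are illustrative, not taken from the patent.

```python
# Sketch of one training sentence as word segments with text features and
# label features. All helper names are illustrative, not from the patent.

def make_example(segments):
    """Turn (word, part-of-speech, label) triples into feature/label lists."""
    features, labels = [], []
    for word, pos, label in segments:
        # Text features: part of speech and word length, as in the patent.
        features.append({"word": word, "pos": pos, "length": len(word)})
        # Label feature: the prosodic pause level for this segment
        # (#1 rhythm word, #2 prosodic phrase, #3 intonation phrase).
        labels.append(label)
    return features, labels

# First four segments of the example sentence "The EU #2 decides #1
# to establish #2 a joint force #3 ...".
sentence = [
    ("EU", "n", "#2"),
    ("decides", "v", "#1"),
    ("establish", "v", "#2"),
    ("joint force", "n", "#3"),
]

x, y = make_example(sentence)
```

The parallel lists x (observations with their text features) and y (label sequence) mirror the word sequence x and label sequence y of the example above.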
S2, generalizing the word segments in the training corpus text based on a synonym thesaurus.
Specifically, synonyms that are identical in grammatical function, meaning, part of speech, and so on can be added to a shared feature, generalizing and extending the feature set.
For example, the synonyms of "establish" may include "set up", "form", and the like.
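Generalization can be pictured as mapping each segment to a synonym-class identifier so that synonymous segments share features. A minimal sketch follows; the class name and thesaurus entries are assumptions made for illustration, not the patent's actual thesaurus.

```python
# Illustrative synonym classes; the thesaurus entries actually used by the
# patent are not reproduced here, so these groupings are assumptions.
SYNONYM_CLASSES = {
    "establish": "C_ESTABLISH",
    "set up": "C_ESTABLISH",
    "form": "C_ESTABLISH",
}

def generalize(word):
    """Map a segment to its synonym-class id; unknown words pass through."""
    return SYNONYM_CLASSES.get(word, word)

generalized = [generalize(w) for w in ["EU", "decides", "establish"]]
```

After this mapping, a feature that fires for "establish" also fires for "set up" and "form", which is what lets the trained weights transfer across synonyms.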
S3, training the rhythm model according to the text features, the label features, and the generalized segments.
Specifically, the rhythm model can be trained with a conditional random field of the following form:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,k} \mu_k\, s_k(y_i, x, i) \Big)$$
where x is the word sequence; y is the label sequence; P(y|x) is the probability of label sequence y given word sequence x; Z(x) is the normalization factor, $Z(x) = \sum_{y'} \exp\big( \sum_{i,k} \lambda_k\, t_k(y'_{i-1}, y'_i, x, i) + \sum_{i,k} \mu_k\, s_k(y'_i, x, i) \big)$, summed over all label sequences y'; t_k(y_{i-1}, y_i, x, i), a feature of the whole observation sequence and the labels at positions i-1 and i, is a transition function; s_k(y_i, x, i), a feature of the whole observation sequence and the label at position i, is a state function; λ_k is the weight parameter of transition function t_k to be estimated by training; and μ_k is the weight parameter of state function s_k to be estimated by training.
For example, in the training corpus "The EU #2 decides #1 to establish #2 a joint force #3 to crack down on #2 Mediterranean #1 people-smuggling #1 activities #3", the segment "establish" can be generalized with its synonyms "set up" and "form", forming the following real-valued feature:
$$b(x, i) = \begin{cases} 1 & \text{if } x_i \in \{\text{establish, set up, form}\} \\ 0 & \text{otherwise} \end{cases}$$
Its characteristic function is the state function
$$s_k(y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_i = \#2 \\ 0 & \text{otherwise} \end{cases}$$
The weight parameters λ_k and μ_k can then be trained.
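To make the formula concrete, the toy sketch below evaluates P(y|x) for a three-segment sequence, computing the normalization factor Z(x) by brute force over all label sequences. The weights are illustrative stand-ins, not trained values, and the feature key "C_ESTABLISH" is a hypothetical synonym-class id.

```python
from itertools import product
from math import exp

LABELS = ["#1", "#2", "#3"]

# Toy weights (illustrative, not trained): mu scores a (feature, label)
# pair via a state function; lam scores a label bigram via a transition
# function, matching the mu_k and lambda_k of the formula above.
mu = {("C_ESTABLISH", "#2"): 1.5}
lam = {("#1", "#2"): 0.5}

def score(x, y):
    """Unnormalized log-score: weighted state plus transition features."""
    s = sum(mu.get((xi, yi), 0.0) for xi, yi in zip(x, y))
    s += sum(lam.get((y[i - 1], y[i]), 0.0) for i in range(1, len(y)))
    return s

def probability(x, y):
    """P(y | x) with Z(x) summed by brute force over all label sequences."""
    z = sum(exp(score(x, yp)) for yp in product(LABELS, repeat=len(x)))
    return exp(score(x, y)) / z

x = ["EU", "decides", "C_ESTABLISH"]  # segments after generalization
good = probability(x, ["#2", "#1", "#2"])
bad = probability(x, ["#2", "#1", "#1"])
```

Because the state feature rewards labeling the generalized segment #2 and the transition feature rewards the bigram #1 followed by #2, the sequence ending in #1 #2 gets a higher probability than the one ending in #1 #1.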
In the rhythm model training method for speech synthesis of this embodiment, the text features and label features corresponding to the word segments are extracted from the training corpus text, the word segments in the corpus are generalized based on a synonym thesaurus, and the rhythm model is then trained according to the text features, the label features, and the generalized segments. The resulting rhythm model is more complete, which in turn improves the accuracy of prosody prediction.
Fig. 2 is a flow chart of a speech synthesis method according to an embodiment of the present invention.
As shown in Fig. 2, the speech synthesis method may include the following steps.
S4, extracting text features from a text to be predicted, and inputting the text features into the rhythm model.
In an embodiment of the present invention, the text to be predicted can be split into multiple word segments; the part of speech, word length, and other features of each segment are then obtained, and these text features are input into the rhythm model generated in the preceding embodiment.
S5, performing prosody prediction on the text to be predicted according to the rhythm model.
Specifically, prosody prediction is performed on the text to be predicted using the trained weight parameters λ_k and μ_k of the characteristic functions.
The observation features used for prosody prediction on the text to be predicted are of the form b(x, i), where x is the word sequence, i is the position in the sequence, b(x, i) is a feature of word sequence x at position i, and x_i is the state of x at position i.
The state function is
$$s_k(y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_i \text{ equals the label associated with feature } k \\ 0 & \text{otherwise} \end{cases}$$
and the transition function is
$$t_k(y_{i-1}, y_i, x, i) = \begin{cases} b(x, i) & \text{if } (y_{i-1}, y_i) \text{ equals the label pair associated with feature } k \\ 0 & \text{otherwise} \end{cases}$$
where y is the label sequence, i is the position in the sequence, b(x, i) is the feature of word sequence x at position i, and y_i is the state of y at position i.
For example, after the segments have been generalized with the synonym thesaurus, the rhythm model contains, for x_i = "establish", the real-valued feature b(x, i) together with the trained weight parameters λ_k and μ_k of the corresponding characteristic functions; in the word sequence for "decides to establish a joint force", the prosody prediction for x_i = "establish" is therefore y_i = #2. Before synonym generalization, this real-valued feature did not exist, the weight parameters of the corresponding characteristic functions could not be obtained, and the relevant probability could not be given accurately. Adding the synonym thesaurus therefore improves the accuracy of prosody prediction.
Prosody prediction is performed on the whole segment sequence in this way, obtaining the pause level of each segment and thus completing the prosody prediction.
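The patent does not name the algorithm used to choose the pause levels of the whole segment sequence jointly; for linear-chain models of this kind the standard choice is Viterbi decoding. The sketch below applies it with the same illustrative toy weights as before (all names and weights are assumptions, not trained values).

```python
LABELS = ["#1", "#2", "#3"]
mu = {("C_ESTABLISH", "#2"): 1.5}   # toy state weights (illustrative)
lam = {("#1", "#2"): 0.5}           # toy transition weights (illustrative)

def viterbi(x):
    """Jointly decode the best pause-level sequence for segment list x."""
    # delta[l]: best score of any label prefix ending in label l.
    delta = {l: mu.get((x[0], l), 0.0) for l in LABELS}
    backptrs = []
    for xi in x[1:]:
        new_delta, ptr = {}, {}
        for l in LABELS:
            prev, best = max(
                ((p, delta[p] + lam.get((p, l), 0.0)) for p in LABELS),
                key=lambda t: t[1],
            )
            new_delta[l] = best + mu.get((xi, l), 0.0)
            ptr[l] = prev
        delta = new_delta
        backptrs.append(ptr)
    # Trace the best path back from the best final label.
    last = max(delta, key=delta.get)
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

path = viterbi(["EU", "decides", "C_ESTABLISH"])
```

Under these toy weights the generalized segment is labeled #2, as in the example above, and the preceding segment gets #1 because of the rewarded #1 to #2 transition.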
S6, further performing acoustic prediction on the text to be predicted to generate an acoustic parameter sequence.
The pause levels are input into an acoustic prediction model, which performs acoustic prediction on the text to be predicted and can generate the corresponding acoustic parameter sequences, such as spectrum and fundamental frequency.
S7, splicing the acoustic parameter sequence to generate the speech synthesis result.
Finally, a vocoder performs waveform concatenation on the acoustic parameter sequence to generate the final speech synthesis result.
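Steps S4 through S7 chain into a pipeline. The sketch below wires them together with stand-in stubs for the prosody, acoustic, and vocoder stages, since the patent does not specify those components; every function body here is an assumption made purely for illustration.

```python
# End-to-end sketch of steps S4-S7. The models are stand-in stubs
# (the patent does not specify them); all bodies are illustrative.

def extract_features(text):                 # S4: segment and featurize
    return [{"word": w, "length": len(w)} for w in text.split()]

def predict_prosody(features):              # S5: one pause level per segment
    # Stub standing in for CRF decoding: longer segments pause harder.
    return ["#2" if f["length"] > 4 else "#1" for f in features]

def predict_acoustics(features, pauses):    # S6: spectrum/F0-like params
    # Stub: one (duration_s, f0_hz) pair per segment; #2 lengthens it.
    return [(0.5 if p == "#2" else 0.3, 200.0) for p in pauses]

def splice(params):                         # S7: vocoder stand-in
    return sum(duration for duration, _f0 in params)  # total seconds

feats = extract_features("EU decides to establish a force")
pauses = predict_prosody(feats)
seconds = splice(predict_acoustics(feats, pauses))
```

Each stage consumes exactly what the previous one produces, which is the essential structure of S4 to S7 regardless of how the individual models are implemented.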
In the speech synthesis method of this embodiment, text features are extracted from the text to be predicted and input into the rhythm model, prosody prediction is performed on the text according to the rhythm model, acoustic prediction then generates an acoustic parameter sequence, and the sequence is spliced into the speech synthesis result. Because the rhythm model is built with a synonym thesaurus, the accuracy of prosody prediction improves, the prosodic pauses sound more natural and fluent, and the user experience is better.
To achieve the above objects, the present invention further proposes a rhythm model training device for speech synthesis.
Fig. 3 is a schematic structural diagram of a rhythm model training device for speech synthesis according to an embodiment of the present invention.
As shown in Fig. 3, the rhythm model training device for speech synthesis may include an extraction module 110, a generalization module 120, and a training module 130.
The extraction module 110 extracts the text features and label features corresponding to the word segments from a training corpus text.
The training corpus can be split into multiple word segments, each of which has corresponding text features and a label feature. Text features may include part of speech, word length, and the like. The label feature is the prosodic pause level corresponding to the segment's prosodic category: for example, a rhythm word corresponds to pause level #1, a prosodic phrase to pause level #2, and an intonation phrase to pause level #3.
For example, take the sentence "The EU #2 decides #1 to establish #2 a joint force #3 to crack down on #2 Mediterranean #1 people-smuggling #1 activities #3". The word sequence x is: EU / decides / establish / joint force / crack down on / Mediterranean / people-smuggling / activities, and the label sequence y is: #2 #1 #2 #3 #2 #1 #1 #3. The label sequence y is made up of the label features of the individual segments.
The generalization module 120 generalizes the word segments in the training corpus text based on a synonym thesaurus.
Specifically, the generalization module 120 can add synonyms that are identical in grammatical function, meaning, part of speech, and so on to a shared feature, generalizing and extending the feature set.
For example, the synonyms of "establish" may include "set up", "form", and the like.
The training module 130 trains the rhythm model according to the text features, the label features, and the generalized segments.
Specifically, the training module 130 can train the rhythm model with a conditional random field of the following form:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,k} \mu_k\, s_k(y_i, x, i) \Big)$$
where x is the word sequence; y is the label sequence; P(y|x) is the probability of label sequence y given word sequence x; Z(x) is the normalization factor, $Z(x) = \sum_{y'} \exp\big( \sum_{i,k} \lambda_k\, t_k(y'_{i-1}, y'_i, x, i) + \sum_{i,k} \mu_k\, s_k(y'_i, x, i) \big)$, summed over all label sequences y'; t_k(y_{i-1}, y_i, x, i), a feature of the whole observation sequence and the labels at positions i-1 and i, is a transition function; s_k(y_i, x, i), a feature of the whole observation sequence and the label at position i, is a state function; λ_k is the weight parameter of transition function t_k to be estimated by training; and μ_k is the weight parameter of state function s_k to be estimated by training.
For example, in the training corpus "The EU #2 decides #1 to establish #2 a joint force #3 to crack down on #2 Mediterranean #1 people-smuggling #1 activities #3", the segment "establish" can be generalized with its synonyms "set up" and "form", forming the following real-valued feature:
$$b(x, i) = \begin{cases} 1 & \text{if } x_i \in \{\text{establish, set up, form}\} \\ 0 & \text{otherwise} \end{cases}$$
Its characteristic function is the state function
$$s_k(y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_i = \#2 \\ 0 & \text{otherwise} \end{cases}$$
The weight parameters λ_k and μ_k can then be trained.
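The patent trains the weight parameters through an objective function without detailing the optimizer; one standard approach for CRF weights is gradient ascent on the log-likelihood log P(y|x). The sketch below does this with a numeric gradient on a one-segment toy problem, keeping only state features for brevity (all names and values are illustrative assumptions).

```python
from itertools import product
from math import exp, log

LABELS = ["#1", "#2"]

def log_likelihood(x, y, w):
    """log P(y|x) for a state-feature-only toy CRF with weights w."""
    def score(ys):
        return sum(w.get((xi, yi), 0.0) for xi, yi in zip(x, ys))
    z = sum(exp(score(ys)) for ys in product(LABELS, repeat=len(x)))
    return score(y) - log(z)

# One training pair: the generalized segment should be labeled #2.
x, y = ["C_ESTABLISH"], ["#2"]
w = {("C_ESTABLISH", "#1"): 0.0, ("C_ESTABLISH", "#2"): 0.0}

for _ in range(200):                       # gradient ascent on log P(y|x)
    for key in list(w):
        eps = 1e-4                         # forward-difference gradient
        w_up = dict(w)
        w_up[key] += eps
        grad = (log_likelihood(x, y, w_up) - log_likelihood(x, y, w)) / eps
        w[key] += 0.5 * grad

trained_gap = w[("C_ESTABLISH", "#2")] - w[("C_ESTABLISH", "#1")]
```

Ascent pushes the weight of the observed (feature, label) pair up and the competing label's weight down, so after training the #2 weight dominates, which is exactly the effect described in the example above.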
In the rhythm model training device for speech synthesis of this embodiment, the text features and label features corresponding to the word segments are extracted from the training corpus text, the word segments in the corpus are generalized based on a synonym thesaurus, and the rhythm model is then trained according to the text features, the label features, and the generalized segments. The resulting rhythm model is more complete, which in turn improves the accuracy of prosody prediction.
Fig. 4 is a schematic structural diagram of a speech synthesis device according to an embodiment of the present invention.
As shown in Fig. 4, the speech synthesis device may include an analysis module 140, a prosody prediction module 150, an acoustic prediction module 160, and a generation module 170.
The analysis module 140 extracts text features from a text to be predicted and inputs the text features into the rhythm model.
In an embodiment of the present invention, the analysis module 140 can split the text to be predicted into multiple word segments, obtain the part of speech, word length, and other features of each segment, and then input these text features into the rhythm model generated in the preceding embodiment.
The prosody prediction module 150 performs prosody prediction on the text to be predicted according to the rhythm model.
Specifically, the prosody prediction module 150 can perform prosody prediction on the text to be predicted using the trained weight parameters λ_k and μ_k of the characteristic functions.
The observation features used are of the form b(x, i), where x is the word sequence, i is the position in the sequence, b(x, i) is a feature of word sequence x at position i, and x_i is the state of x at position i.
For example, after the segments have been generalized with the synonym thesaurus, the rhythm model contains, for x_i = "establish", the real-valued feature b(x, i) together with the trained weight parameters λ_k and μ_k of the corresponding characteristic functions; in the word sequence for "decides to establish a joint force", the prosody prediction for x_i = "establish" is therefore y_i = #2. Before synonym generalization, this real-valued feature did not exist, the weight parameters of the corresponding characteristic functions could not be obtained, and the relevant probability could not be given accurately. Adding the synonym thesaurus therefore improves the accuracy of prosody prediction.
Prosody prediction is performed on the whole segment sequence in this way, obtaining the pause level of each segment and thus completing the prosody prediction.
The acoustic prediction module 160 further performs acoustic prediction on the text to be predicted to generate an acoustic parameter sequence.
Specifically, the acoustic prediction module 160 can input the pause levels into an acoustic prediction model, which performs acoustic prediction on the text to be predicted and can generate the corresponding acoustic parameter sequences, such as spectrum and fundamental frequency.
The generation module 170 splices the acoustic parameter sequence to generate the speech synthesis result.
Specifically, the generation module 170 can use a vocoder to perform waveform concatenation on the acoustic parameter sequence, generating the final speech synthesis result.
In the speech synthesis device of this embodiment, text features are extracted from the text to be predicted and input into the rhythm model, prosody prediction is performed on the text according to the rhythm model, acoustic prediction then generates an acoustic parameter sequence, and the sequence is spliced into the speech synthesis result. Because the rhythm model is built with a synonym thesaurus, the accuracy of prosody prediction improves, the prosodic pauses sound more natural and fluent, and the user experience is better.
In the description of the present invention, it should be understood that terms indicating orientation or positional relationships, such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", and "circumferential", are based on the orientations or positional relationships shown in the drawings. They are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they are therefore not to be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features concerned. A feature defined by "first" or "second" may thus explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless specifically defined otherwise.
In the present invention, unless expressly specified and limited otherwise, terms such as "mounted", "connected", "coupled", and "fixed" are to be understood broadly: for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediary; internal to two elements, or an interaction between two elements, unless expressly limited otherwise. For those of ordinary skill in the art, the specific meanings of these terms in the present invention can be understood according to the particular situation.
In the present invention, unless expressly specified and limited otherwise, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediary. Moreover, a first feature being "on", "above", or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or merely that the first feature is at a higher level than the second; a first feature being "under", "below", or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or merely that the first feature is at a lower level than the second.
In the description of this specification, references to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples", and the like mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided there is no contradiction, those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (8)

1. A rhythm model training method for speech synthesis, characterized by comprising the following steps:
S1, extracting the text features and label features corresponding to the word segments from a training corpus text;
S2, generalizing the word segments in the training corpus text based on a synonym thesaurus; and
S3, training the rhythm model according to the text features, the label features, and the generalized segments.
2. The method according to claim 1, characterized in that training the rhythm model according to the text features, the label features, and the generalized segments specifically comprises:
training the rhythm model through an objective function to obtain the weight parameters of the transition functions and the weight parameters of the state functions.
3. A method for performing speech synthesis using the rhythm model according to claim 1 or 2, characterized by comprising the following steps:
S4, extracting text features from a text to be predicted, and inputting the text features into the rhythm model;
S5, performing prosody prediction on the text to be predicted according to the rhythm model;
S6, further performing acoustic prediction on the text to be predicted to generate an acoustic parameter sequence; and
S7, splicing the acoustic parameter sequence to generate a speech synthesis result.
4. The method according to claim 3, characterized in that performing prosody prediction on the text to be predicted according to the rhythm model specifically comprises:
judging, according to the transition functions and state functions, whether the text features have corresponding transition-function and state-function weight parameters, and if so, obtaining the prosodic pause levels corresponding to the text to be predicted.
5. A rhythm model training device for speech synthesis, comprising an extraction module for extracting the text features and label features corresponding to the word segments from a training corpus text, characterized by further comprising:
a generalization module for generalizing the word segments in the training corpus text based on a synonym thesaurus; and
a training module for training the rhythm model according to the text features, the label features, and the generalized segments.
6. The device according to claim 5, characterized in that the training module is specifically configured to:
train the rhythm model through an objective function to obtain the weight parameters of the transition functions and the weight parameters of the state functions.
7. A device for performing speech synthesis using the rhythm model according to claim 5 or 6, characterized by comprising:
an analysis module for extracting text features from a text to be predicted and inputting the text features into the rhythm model;
a prosody prediction module for performing prosody prediction on the text to be predicted according to the rhythm model;
an acoustic prediction module for further performing acoustic prediction on the text to be predicted to generate an acoustic parameter sequence; and
a generation module for splicing the acoustic parameter sequence to generate a speech synthesis result.
8. The device according to claim 7, characterized in that the prosody prediction module is specifically configured to:
judge, according to the transition functions and state functions, whether the text features have corresponding transition-function and state-function weight parameters, and if so, obtain the prosodic pause levels corresponding to the text to be predicted.
CN201510337430.7A 2015-06-17 2015-06-17 Rhythm model training method and device for phonetic synthesis Active CN104867491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510337430.7A CN104867491B (en) 2015-06-17 2015-06-17 Rhythm model training method and device for phonetic synthesis


Publications (2)

Publication Number Publication Date
CN104867491A CN104867491A (en) 2015-08-26
CN104867491B true CN104867491B (en) 2017-08-18

Family

ID=53913283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510337430.7A Active CN104867491B (en) 2015-06-17 2015-06-17 Rhythm model training method and device for phonetic synthesis

Country Status (1)

Country Link
CN (1) CN104867491B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105551481B (en) * 2015-12-21 2019-05-31 百度在线网络技术(北京)有限公司 The prosodic labeling method and device of voice data
CN106601228B (en) * 2016-12-09 2020-02-04 百度在线网络技术(北京)有限公司 Sample labeling method and device based on artificial intelligence rhythm prediction
CN109739968A (en) * 2018-12-29 2019-05-10 北京猎户星空科技有限公司 A kind of data processing method and device
CN110164413B (en) * 2019-05-13 2021-06-04 北京百度网讯科技有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN112084766B (en) * 2019-06-12 2024-01-23 阿里巴巴集团控股有限公司 Text processing method and device, storage medium and processor
CN110516110B (en) * 2019-07-22 2023-06-23 平安科技(深圳)有限公司 Song generation method, song generation device, computer equipment and storage medium
CN111164674B (en) * 2019-12-31 2024-05-03 深圳市优必选科技股份有限公司 Speech synthesis method, device, terminal and storage medium
CN111226275A (en) * 2019-12-31 2020-06-02 深圳市优必选科技股份有限公司 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction
CN111210803B (en) * 2020-04-21 2021-08-03 南京硅基智能科技有限公司 System and method for training clone timbre and rhythm based on bottleneck features
CN111754978B (en) * 2020-06-15 2023-04-18 北京百度网讯科技有限公司 Prosodic hierarchy labeling method, device, equipment and storage medium
CN112786023B (en) * 2020-12-23 2024-07-02 竹间智能科技(上海)有限公司 Mark model construction method and voice broadcasting system
CN114707503B (en) * 2022-02-14 2023-04-07 慧言科技(天津)有限公司 Front-end text analysis method based on multi-task learning
CN118214907A (en) * 2024-03-06 2024-06-18 深圳市超时代软件有限公司 Text-to-video conversion system based on artificial intelligence and control method thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101572083A (en) * 2008-04-30 2009-11-04 富士通株式会社 Method and device for making up words by using prosodic words
CN102063898A (en) * 2010-09-27 2011-05-18 北京捷通华声语音技术有限公司 Method for predicting prosodic phrases
CN102651217A (en) * 2011-02-25 2012-08-29 株式会社东芝 Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1872361A4 (en) * 2005-03-28 2009-07-22 Lessac Technologies Inc Hybrid speech synthesizer, method and use


Also Published As

Publication number Publication date
CN104867491A (en) 2015-08-26

Similar Documents

Publication Publication Date Title
CN104867491B (en) Rhythm model training method and device for phonetic synthesis
CN102354495B (en) Test method and system for semi-open oral test questions
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
CN105185372B (en) Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
US8990089B2 (en) Text to speech synthesis for texts with foreign language inclusions
CN101064104B (en) Emotion voice creating method based on voice conversion
CN102034475B (en) Method for interactively scoring open short conversation by using computer
CN102360543A (en) HMM-based bilingual (mandarin-english) TTS techniques
CN110147451A (en) A kind of session command understanding method of knowledge based map
CN106782603A (en) Intelligent sound evaluating method and system
Raza et al. Design and development of phonetically rich Urdu speech corpus
KR100669241B1 (en) Interactive Speech Synthesis System and Method Using Speech Act Information
CN105895076B (en) A kind of phoneme synthesizing method and system
Lane A Latin grammar for schools and colleges
Kyriakopoulos et al. Automatic characterisation of the pronunciation of non-native English speakers using phone distance features
Raptis et al. Expressive speech synthesis for storytelling: the innoetics’ entry to the blizzard challenge 2016
KR20130067854A (en) Apparatus and method for language model discrimination training based on corpus
Nguyen Hmm-based vietnamese text-to-speech: Prosodic phrasing modeling, corpus design system design, and evaluation
KR101669408B1 (en) Apparatus and method for reading foreign language
Kim et al. Designing a large recording script for open-domain English speech synthesis
Narupiyakul et al. A stochastic knowledge-based Thai text-to-speech system
ELothmany Arabic text-to-speech including prosody (ATTSIP): for mobile devices
Schmiedel et al. Development of Speech Syntheses for Lower Sorbian and Upper Sorbian using MaryTTS
Hansakunbuntheung et al. Mongolian speech corpus for text-to-speech development
Kato et al. Perceptual study on the effects of language transfer on the naturalness of Japanese prosody for isolated words

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant