CN102651217A - Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis - Google Patents


Info

Publication number
CN102651217A
Authority
CN
China
Prior art keywords
fuzzy
contextual feature
data
mark
polyphone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100465804A
Other languages
Chinese (zh)
Inventor
汪曦
楼晓雁
李健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Priority to CN2011100465804A priority Critical patent/CN102651217A/en
Priority to US13/402,602 priority patent/US9058811B2/en
Publication of CN102651217A publication Critical patent/CN102651217A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and equipment for speech synthesis and to a method for training an acoustic model used in speech synthesis. The method for speech synthesis includes the following steps: determining that data generated by text analysis are fuzzy polyphone data; performing fuzzy polyphone prediction on the fuzzy polyphone data, so as to output a plurality of candidate pronunciations and their probabilities; generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; determining model parameters for the fuzzy contextual feature label based on an acoustic model provided with a fuzzy decision tree; generating speech parameters from the model parameters; and synthesizing speech from the speech parameters. With the method and equipment provided by the embodiments of the invention, polyphonic characters in Chinese text whose pronunciation is difficult to predict can be given fuzzy processing, so as to improve the synthesis quality of Chinese polyphones.

Description

Method and equipment for speech synthesis, and method for training an acoustic model used in speech synthesis
Technical field
The present invention relates to speech synthesis and, more specifically, to the synthesis of Chinese polyphones (characters that have more than one pronunciation).
Background art
Producing speech artificially by means of machinery and equipment is called speech synthesis. Speech synthesis is an important component of man-machine spoken communication. Speech synthesis technology lets a machine speak like a person: information that is otherwise represented or stored can be converted into speech, so that people can obtain it conveniently through hearing.
The systems most widely researched and applied at present are text-to-speech (TTS) systems. In such a system, the text to be synthesized is usually input, and a text analyzer included in the system processes it and outputs pronunciation description information, which comprises phonetic symbols at the segment level and prosodic symbols at the suprasegmental level. The text analyzer first decomposes the text to be synthesized, according to a pronunciation dictionary, into words carrying attribute labels and their pronunciation symbols; then, according to semantic rules and phonetic rules, it determines the sentence structure and intonation for each word and each syllable, as well as the linguistic and prosodic features of the target speech, such as pauses, part of speech, and distance. The pronunciation description information is then input to a synthesizer included in the system, which synthesizes and outputs the speech.
In the prior art, acoustic models based on hidden Markov models (HMMs) are widely used in speech synthesis technology, since the synthesized voice can easily be modified and converted. Speech synthesis is usually divided into a model training part and a synthesis part. In the model training stage, a statistical model is trained on the acoustic parameters of each speech unit in the speech corpus together with labels of attributes such as the corresponding segment and prosody. These labels derive from linguistic and phonetic knowledge, and the contextual features they form describe the corresponding speech attributes (for example tone, part of speech, and so on). In the training stage of the HMM acoustic model, the model parameters are estimated by statistical computation over these speech unit parameters.
In the prior art, given the very large number of context combinations with their many variations, decision-tree clustering is generally adopted. A decision tree can gather candidate units with similar contextual and acoustic features into one class, which effectively avoids data sparseness and effectively reduces the number of models. A question set is the set of questions available for constructing the decision tree; the question chosen when a node is split is bound to that node and determines which units enter the same leaf node. The clustering process refers to a predefined question set: every node of the decision tree is bound to a yes/no question, every candidate unit allowed to enter the root node answers the question bound at each node, and the answer selects the left or right branch. Thus syllables or phonemes with identical or similar contextual features end up in the same leaf node of the decision tree; the model corresponding to a node is usually an HMM model or state, described by its parameters. At the same time, clustering is also the process by which new cases encountered during synthesis are handled, so that an optimal match can be achieved. Training and clustering on the training data yield the hidden Markov (HMM) models and the corresponding decision trees. (A minimal sketch of such a tree lookup follows.)
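As a concrete illustration of this clustering scheme, the following minimal Python sketch (not taken from the patent; the node layout and model names are hypothetical) walks a binary decision tree whose internal nodes are bound to yes/no questions about contextual features and whose leaves hold the clustered model identifiers:

# Minimal sketch of looking up a clustered model in a context decision
# tree. Illustrative only; node layout and model names are hypothetical.
def make_node(question, yes, no):
    # question: callable taking a context dict and returning True/False
    return {"question": question, "yes": yes, "no": no}

def find_leaf(node, context):
    while isinstance(node, dict):                # internal node
        node = node["yes"] if node["question"](context) else node["no"]
    return node                                  # leaf: clustered model id

# Example: split first on "is the tone class 3?", then on "is it 1 or 2?".
tree = make_node(lambda c: c["tone"] == 3,
                 "model_A",
                 make_node(lambda c: c["tone"] in (1, 2), "model_B", "model_C"))

print(find_leaf(tree, {"tone": 2}))              # -> model_B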
At the synthesis stage, the contextual feature label of a polyphone is obtained through the text analyzer and the context label generator. For this contextual feature label, the corresponding acoustic model parameters (for example the state sequence of the HMM acoustic model) are found on the trained decision tree. The model parameters are then turned into the relevant speech parameters by a parameter generation algorithm, and the speech is synthesized by a synthesizer (vocoder).
The goal of a speech synthesis system is to synthesize sound as intelligible and natural as human speech. For a Chinese speech synthesis system, however, the pronunciation prediction accuracy for polyphones is difficult to guarantee, because the pronunciation of a polyphone is often determined by its semantics, and semantic understanding is a challenging problem. This dependency makes it difficult for polyphone prediction to reach a satisfactorily high accuracy. In the prior art, even when the pronunciation prediction is not held with sufficient confidence, a speech synthesis system generally still outputs a single definite pronunciation for the polyphone.
In Chinese, different pronunciations express different meanings. If a speech synthesis system outputs a wrong pronunciation, it causes ambiguity in the listener's understanding and leaves a very bad impression. For speech synthesis systems used in daily life, work and scientific research (for example vehicle navigation, automatic voice information services, broadcasting, robot simulation, and so on), obviously wrong polyphone pronunciations lead to a bad user experience and even inconvenience in use. Therefore, in the field of speech synthesis, there is a need for improved polyphone speech synthesis methods and systems.
Summary of the invention
To this end, embodiments of the invention provide a method and system for speech synthesis and a method for training an acoustic model used in speech synthesis. Implementing embodiments of the invention can offer the following advantage: when the system is not confident enough to give the correct pronunciation, the pronunciation of the polyphone is fuzzified without affecting the quality of the other, normal sounds of the overall system. The method thereby avoids manifest errors and improves the overall subjective listening quality of the synthesis system.
According to one aspect of the invention, a method for speech synthesis is provided, which may comprise: determining that data generated by text analysis are fuzzy polyphone data; performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities; generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; determining model parameters for the fuzzy contextual feature label based on an acoustic model determined to have a fuzzy decision tree; generating speech parameters from the model parameters; and synthesizing the speech parameters into speech.
Preferably, the step of generating the fuzzy contextual feature label may further comprise: determining, based on the probabilities, the degree to which the context label of each candidate pronunciation of the fuzzy polyphone data falls into its class; and generating the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
According to another aspect of the invention, an apparatus for synthesizing speech is provided, which may comprise: a polyphone prediction unit for predicting the pronunciation of fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their prediction probabilities; a fuzzy contextual feature label generation unit for generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; a determination unit for determining model parameters for the fuzzy contextual feature label based on an acoustic model determined to have a fuzzy decision tree; a parameter generator for generating speech parameters from the model parameters; and a synthesizer for synthesizing the speech parameters into speech.
Preferably, the fuzzy contextual feature label generation unit may further be configured to: determine, based on the probabilities, the degree to which the context label of each candidate pronunciation of the fuzzy polyphone data falls into its class; and generate the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
According to another aspect of the invention, a system for synthesizing speech is provided, which may comprise: means for determining that data generated by text analysis are fuzzy polyphone data; means for performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities; means for generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; means for determining model parameters for the fuzzy contextual feature label based on an acoustic model with a fuzzy decision tree; means for generating speech parameters from the model parameters; and means for synthesizing the speech parameters into speech.
According to another aspect of the invention, a method for training an acoustic model is provided, which may comprise: training each speech unit in a training speech corpus to generate an acoustic model, the speech units comprising acoustic parameters and context labels; performing decision tree clustering on the context combinations to generate an acoustic model with a decision tree; determining the fuzzy data in the speech corpus based on the acoustic model with the decision tree; generating fuzzy contextual feature labels for the fuzzy data; and performing clustering training on the speech corpus based on the fuzzy contextual feature labels, to generate an acoustic model with a fuzzy decision tree.
Preferably, the step of determining the fuzzy data may further comprise: evaluating a speech unit; determining the degree to which each candidate context label of the speech unit falls into its class; and, if the degree satisfies a predetermined threshold, determining that the speech unit is fuzzy data.
Preferably, the step of evaluating the speech unit may further comprise: evaluating the score of the contextual feature label of each candidate pronunciation of the speech unit by means of the model posterior probability, or of the distance between the model-generated parameters and the speech unit parameters.
Preferably, the step of generating the fuzzy contextual feature labels may further comprise: determining the score of each candidate contextual feature label of the speech unit's pronunciation by evaluating the speech unit; determining, based on the scores, the degree to which each candidate context label of the speech unit falls into its class; and generating the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
Preferably, the step of performing clustering training based on the fuzzy contextual feature labels may further comprise one of the following: training the training set comprising the fuzzy data, based on the fuzzy contextual feature labels and a preset fuzzy question set, to generate the acoustic model with the fuzzy decision tree; and training each speech unit in the speech corpus again based on a question set and contextual feature labels, wherein the question set additionally comprises the preset fuzzy question set, and the contextual feature labels of the fuzzy data in the speech corpus are the fuzzy contextual feature labels.
Description of the drawings
The objects, features and advantages of the invention will become apparent from the following detailed description of embodiments of the invention, taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a flowchart of the method for training an acoustic model with a fuzzy decision tree according to an embodiment of the invention.
Fig. 2 shows a flowchart of the processing for determining fuzzy data in the method according to an embodiment of the invention.
Fig. 3 shows the operation of evaluating training data by the model posterior probability in the method according to an embodiment of the invention.
Fig. 4 shows the operation of evaluating training data by the distance between model-generated parameters and actual parameters in the method according to an embodiment of the invention.
Fig. 5 illustrates the quantizing conversion operation performed on fuzzy data to generate a fuzzy context according to an embodiment of the invention.
Fig. 6 illustrates the method of synthesizing speech according to an embodiment of the invention.
Fig. 7 is a block diagram of the apparatus for synthesizing speech according to an embodiment of the invention.
Detailed description of the embodiments
Hereinafter, embodiments of the invention are described in detail with reference to the accompanying drawings.
In general, the embodiments of the invention relate to methods and systems for synthesizing speech in electronic equipment (for example telephone systems, mobile terminals, vehicles, automatic voice information service systems, broadcast systems, robots, and/or the like), and to methods of training the acoustic models they use.
In general, the basic idea of the invention is as follows: when synthesizing a Chinese polyphone, instead of selecting a single definite candidate pronunciation, fuzzy processing is applied to the speech of a fuzzy polyphone, which avoids committing in advance to an arbitrary or even wrong decision. In the embodiments of the invention, a fuzzy polyphone is a polyphone whose pronunciation a prior-art polyphone prediction unit finds difficult to predict. Fuzzy data are speech data in the training speech corpus produced under the influence of coarticulation in the speaker's continuous speech and of occasional pronunciation errors; they satisfy a fuzziness condition (usually a fuzziness threshold can be defined on a membership function) and are used for model training. Correspondingly, speech for which it is difficult to determine the candidate pronunciation is called fuzzy speech. A fuzzy decision tree can preferably be introduced in the training and synthesis stages to realize this process; fuzzy decision trees are commonly used to handle uncertainty and can help derive more intelligent decisions over complicated and fuzzy boundaries, making the optimal choice under ambiguity. The fuzzified pronunciation is intended to include the characteristics of each candidate pronunciation, particularly of the candidates with larger probabilities; this avoids misjudging among the candidate pronunciations and thus reduces the probability of synthesizing jarring or wrong speech.
In the embodiments of the invention, a fuzzy decision tree can be introduced in the model training stage, and the speech corpus containing fuzzy data is trained further, yielding an acoustic model (for example an HMM acoustic model) and the fuzzy decision tree corresponding to that model (for example an HMM acoustic model with a fuzzy decision tree). At the synthesis stage, when the polyphone prediction unit cannot make a suitable choice, the pronunciation of the word is given fuzzy processing, so that the synthesizer synthesizes the speech corresponding to the candidates with higher predicted likelihood and the synthesized sound comes closer to them. The processing at the synthesis stage can operate as follows: the probabilities of a plurality of candidate pronunciations are obtained through the polyphone prediction unit; fuzzy contextual feature processing yields a fuzzy context label with multi-candidate fuzzy characteristics; based on the acoustic model with the fuzzy decision tree generated by training, the model parameters corresponding to this fuzzy context label are obtained; the model parameters are turned into the relevant speech parameters by a parameter generation algorithm; and the synthesizer synthesizes the speech parameters into speech.
Fig. 1 shows a flowchart of the method for training an acoustic model with a fuzzy decision tree according to an embodiment of the invention. As shown in Fig. 1, at step S110, each speech unit in the training speech corpus is trained to generate an acoustic model. In the embodiments of the invention, the speech corpus is generally reference speech that has been pre-recorded and entered through a speech input port. Each speech unit comprises acoustic parameters and context labels describing the corresponding segmental and prosodic attributes.
Taking the HMM acoustic model as an example, in the training stage of this model the model parameters are estimated by statistical computation over the speech unit parameters. This is a mature technique widely used in the art and is not repeated here.
At step S120, for the context combinations with their large numbers of variations, the acoustic model is usually processed with decision-tree clustering to generate an acoustic model with a decision tree, for example a CART (Classification and Regression Tree). Clustering effectively avoids data sparseness and reduces the number of models. At the same time, clustering is also the process by which new cases encountered during synthesis are handled, so that an optimal match can be achieved. The clustering process refers to a predefined question set. A question set is the set of questions available for constructing the decision tree; the question chosen when a node is split is bound to that node and determines which units enter the same leaf node. The question set can differ according to the concrete application environment. For example, Chinese has 5 tone classes {1, 2, 3, 4, 5}, and each class can serve as a question of the decision tree. For determining the tone of a polyphone, the question set can be set up as shown in Table 1:
Table 1: Questions and values used in the question set (table image in the original; its content corresponds to the QS entries below).
The corresponding code is as follows:
QS "phntone==1" {"*|phntone=1|*"}    Is the tone of class 1?
QS "phntone==2" {"*|phntone=2|*"}    Is the tone of class 2?
QS "phntone==3" {"*|phntone=3|*"}    Is the tone of class 3?
QS "phntone==4" {"*|phntone=4|*"}    Is the tone of class 4?
QS "phntone==5" {"*|phntone=5|*"}    Is the tone of class 5?
To those skilled in the art, the use of decision trees is a common technique in this field; various decision trees can be adopted for various application environments, various question sets can be set up, and the decision tree is built by splitting on these questions. This is not repeated here.
In the embodiments of the invention, training and clustering the training data yields the hidden Markov (HMM) models and the corresponding decision tree. However, those skilled in the art should appreciate that other types of acoustic model can also be applied in the fuzzy processing of the embodiments of the invention.
In the embodiments of the invention, the speech unit can be a phoneme, a syllable, an initial/final, or another unit; for simplicity, only the initial/final is illustrated here as the speech unit. However, those skilled in the art should appreciate that the embodiments of the invention are not limited to this.
In the embodiments of the invention, the acoustic model is also trained again based on fuzzy data. For example, at step S140, the fuzzy data in the speech corpus are determined with respect to the above acoustic model (hidden Markov model) with a decision tree. In the embodiments of the invention, all possible labels of the context related to a polyphone can be adopted, the ability of each label to characterize the real data is evaluated on the real data, and whether the speech data belong to the fuzzy data is determined from this evaluation result. Afterwards, at step S160, fuzzy contextual feature labels are generated for the qualifying fuzzy data. Then, at step S180, a fuzzy decision tree is trained on the speech corpus containing the fuzzy data, based on these fuzzy contextual feature labels, to generate an acoustic model with a fuzzy decision tree.
Fig. 2 shows a flowchart of the processing for determining fuzzy data in the method according to an embodiment of the invention. As shown in Fig. 2, at step S210, all possible contextual feature labels of the speech data in the training corpus are generated. "All possible context labels" means that, for an attribute that is to be given fuzzy polyphone processing, such as tone, all possibilities are generated. In the embodiments of the invention, no attention is paid to whether a possibility conforms to linguistic norms; all possibilities are generated. For example, for a polyphone pronounced "wei", whose pronunciation is in theory wei4 or wei2, labels are nevertheless generated for all tones: wei1, wei2, wei3, wei4 and wei5. The contextual feature label characterizes the linguistic and phonetic attributes of the speech segment, for example the actual initial/final of the speech unit, the tone, the syllable, the position within the syllable, word, phrase and sentence, information about the preceding and following units, the type of sentence, and so on. Tone is a key feature of polyphones. Taking tone as the example, Mandarin has 5 tones, so this training data item can have 5 parallel contextual feature labels. Those skilled in the art should appreciate that possible contextual feature labels can also be generated for the different pronunciations of a polyphone, handled similarly to tone. (A sketch of this exhaustive generation follows.)
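As a toy illustration of this exhaustive generation (the label syntax "base|tone=k|" is an assumed placeholder, not the patent's actual label format):

# Toy sketch: generate all parallel contextual feature labels for one
# polyphone, enumerating every tone regardless of linguistic validity.
TONES = [1, 2, 3, 4, 5]   # the five Mandarin tone classes

def parallel_labels(base_pinyin):
    return ["%s|tone=%d|" % (base_pinyin, t) for t in TONES]

print(parallel_labels("wei"))
# -> ['wei|tone=1|', 'wei|tone=2|', 'wei|tone=3|', 'wei|tone=4|', 'wei|tone=5|']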
At step S220, the training data are evaluated based on the acoustic model trained at step S120 (for example the HMM model with a decision tree). For example, for a speech unit that has N parallel contextual feature labels, the N corresponding scores s[1], ..., s[k], ..., s[N] can be computed in turn; these scores reflect the ability of each label to characterize the actual parameters. In the embodiments of the invention, any method that can quantify the evaluation can be adopted, for example computing the posterior probability under the model, or the distance between model-generated parameters and the actual parameters; these are described in detail below.
At step S230, whether the speech unit is fuzzy data is judged based on the evaluation result, for example the computed scores reflecting the characterization power. In the embodiments of the invention, data with low evaluation scores can be determined to be fuzzy data and used for further training. Here, a low evaluation score means that among the parallel contextual feature labels no score has a sufficient advantage to prove that its label is truly the unique optimum for this unit.
In the embodiments of the invention, the degree to which the score corresponding to a contextual feature label of the speech unit falls into its class can also be computed by a membership function. The membership function $m_k$ can represent these parallel scores as follows:

$$m_k = \frac{s[k]}{\sum_{k=1}^{N} s[k]} \qquad (1)$$

where $s[k]$ is the score corresponding to the $k$-th contextual feature label, and $N$ is the number of contextual feature labels.
In the embodiments of the invention, data satisfying the fuzziness condition (usually a fuzziness threshold defined on the membership function) are fuzzy data. The fuzzy threshold can be fixed; for example, if no candidate among all candidates holds more than 50% of the total score, the data can be considered fuzzy data. Alternatively, the fuzzy threshold can be dynamic; for example, a certain bottom fraction (say 10%) of the score ranking within the class the current unit belongs to in the current database can be chosen. (A sketch of both rules follows.)
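A minimal sketch of the membership computation of equation (1) and of the two threshold rules just described; the 50% and 10% figures follow the text above, and everything else is an illustrative assumption:

# Sketch of equation (1) and the fixed and dynamic fuzziness tests.
def membership(scores):
    total = float(sum(scores))
    return [s / total for s in scores]

def is_fuzzy_fixed(scores, threshold=0.5):
    # Fixed rule: fuzzy if no candidate holds more than `threshold`
    # of the total score mass.
    return max(membership(scores)) <= threshold

def is_fuzzy_dynamic(unit_score, class_scores, fraction=0.1):
    # Dynamic rule: fuzzy if the unit's score ranks in the bottom
    # `fraction` of all units of the same class in the corpus.
    cutoff = sorted(class_scores)[int(len(class_scores) * fraction)]
    return unit_score <= cutoff

print(is_fuzzy_fixed([0.9, 2.1, 0.5, 1.0, 1.0]))   # True: best holds ~38%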
In the embodiments of the invention, selecting the fuzzy data from the training database and converting the whole training is advantageous: the process not only generates the data used for fuzzy decision tree training, but also contributes to improving the training accuracy for the normal data, without significantly increasing the training burden.
Fig. 3 shows the operation of evaluating training data by the model posterior probability in the method according to an embodiment of the invention. In the embodiments of the invention, for simplicity, the training data are exemplified by a certain speech unit. As shown in Fig. 3, for the N possible contextual feature labels of this speech unit (16a-1 label 1, ..., 16a-k label k, ..., 16a-N label N), the corresponding acoustic models (21a-1 model 1, ..., 21a-k model k, ..., 21a-N model N) can be found on the model trained at step S120 (for example the HMM model with a decision tree). In the embodiments of the invention, the following evaluation of training data is explained with the HMM acoustic model as an example. However, it should be appreciated that the embodiments of the invention are not limited to this.
For a given speech unit, its speech parameter vector sequence is represented as:

$$O = [o_1^T, o_2^T, \ldots, o_T^T]^T \qquad (2)$$
The posterior probability of the speech parameter vector sequence of this speech unit given the HMM $\lambda$ is expressed as:

$$P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) \qquad (3)$$

where $Q$ is an HMM state sequence $\{q_1, q_2, \ldots, q_T\}$.
Each frame of the speech unit is aligned with the model states to obtain the state indices. The following probability can then be computed:

$$P(o_t, q_i \mid \lambda) = \sum_{j=1}^{N} b_j(o_t) \qquad (4)$$

where $b_j(o_t)$ is the output probability of the observation $o_t$ at time $t$ in the $j$-th state of the current model; its Gaussian distribution probability and the like depend on the type of HMM, for example a continuous mixture-density HMM.
$$b_j(o_t) = \sum_{m=1}^{M} \omega_{jm}\, b_{jm}(o_t) = \sum_{m=1}^{M} \frac{\omega_{jm}}{(2\pi)^{p/2} |\Sigma_{jm}|^{1/2}} \exp\left\{ -\frac{1}{2} (o_t - \mu_{jm})\, \Sigma_{jm}^{-1}\, (o_t - \mu_{jm})^{T} \right\} \qquad (5)$$

where $\omega_{jm}$ is the weight of the $m$-th mixture component of state $j$, $\mu_{jm}$ and $\Sigma_{jm}$ are its mean and covariance, and $p$ is the dimensionality of the observation vector.
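Equation (5) is the standard Gaussian-mixture state output density. The following numpy sketch computes it under the simplifying assumption of diagonal covariances (an assumption of this illustration, not of the patent):

import numpy as np

def gmm_output_prob(o_t, weights, means, variances):
    # b_j(o_t) = sum_m w_jm * N(o_t; mu_jm, Sigma_jm), equation (5),
    # with diagonal covariance matrices for simplicity.
    o_t = np.asarray(o_t, dtype=float)
    p = o_t.size
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        norm = 1.0 / ((2 * np.pi) ** (p / 2.0) * np.sqrt(np.prod(var)))
        total += w * norm * np.exp(-0.5 * np.sum((o_t - mu) ** 2 / var))
    return total

# Two-component mixture over 2-dimensional observations:
print(gmm_output_prob([0.1, -0.2], weights=[0.6, 0.4],
                      means=[[0.0, 0.0], [1.0, 1.0]],
                      variances=[[1.0, 1.0], [0.5, 0.5]]))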
Alternatively, in the embodiments of the invention, the training data can also be evaluated through the distance between the model-generated parameters and the actual parameters. Fig. 4 shows this operation in the method according to an embodiment of the invention. As shown in Fig. 4, again taking a certain speech unit as the example, and similarly to the above embodiment, there are all the possible contextual feature labels (16b-1 label 1, ..., 16b-k label k, ..., 16b-N label N), and their corresponding models (21a-1 model 1, ..., 21a-k model k, ..., 21a-N model N) are determined. At the same time, speech parameters (25b-1 parameter 1, ..., 25b-k parameter k, ..., 25b-N parameter N; these are the test parameters) are recovered from each set of model parameters. The scores of these possible contextual feature labels are evaluated by computing the distance between the speech parameters of this unit (the reference parameters) and the recovered parameters.
As stated above, for a given speech unit, its speech parameter vector sequence $O$ is expressed as:

$$O = [o_1^T, o_2^T, \ldots, o_T^T]^T$$

The recovered speech parameters can be expressed as:

$$O' = [{o'_1}^T, {o'_2}^T, \ldots, {o'_{T'}}^T]^T \qquad (6)$$
The actual parameter sequence of the given speech unit has length $T$, while the recovered speech parameter sequence has length $T'$. A linear mapping is first performed between $T$ and $T'$; usually the recovered sequence of length $T'$ is stretched or compressed to length $T$. The Euclidean distance between the two is then computed as follows:

$$D(O, O') = \sqrt{\sum_{t=1}^{T} \sum_{m=1}^{M} (o_{mt} - o'_{mt})^2} \qquad (7)$$
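A minimal numpy sketch of equations (6) and (7), with the recovered sequence linearly resampled to length T as described above (the interpolation scheme is an assumption of this illustration):

import numpy as np

def aligned_distance(O_ref, O_rec):
    # O_ref: reference parameters, shape (T, M); O_rec: recovered
    # parameters, shape (T', M). Resample O_rec to length T, then
    # apply the Euclidean distance of equation (7).
    O_ref, O_rec = np.asarray(O_ref, float), np.asarray(O_rec, float)
    T, M = O_ref.shape
    src = np.linspace(0.0, O_rec.shape[0] - 1.0, T)
    O_map = np.stack([np.interp(src, np.arange(O_rec.shape[0]), O_rec[:, m])
                      for m in range(M)], axis=1)
    return np.sqrt(np.sum((O_ref - O_map) ** 2))

ref = np.random.randn(10, 3)   # T = 10 frames, M = 3 dimensions
rec = np.random.randn(8, 3)    # T' = 8 frames
print(aligned_distance(ref, rec))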
In the embodiments of the invention, the fuzzy context label can be generated through a quantizing-mapping conversion. The fuzzy context label characterizes the linguistic and acoustic features of the current speech unit, and gives a graded fuzzy definition of the relevant attributes of the polyphone that is to be given fuzzy processing. The quantized score of each label of the speech unit can be converted into a corresponding context degree (for example high, low, and so on), and these are jointly represented to generate the fuzzy context label. Note that in the embodiments of the invention the fuzzy context label is generated by objective computation and is not restricted by linguistics; for example, wei3, or a combination of tones 1 and 5 of "wei", can be obtained by computation. The fuzzy context label generated by the operation on a certain speech unit with 5 tones is illustrated below.
As shown in Fig. 5, suppose the candidate tone of this unit is tone 2, expressed here as tone=2. For each possible contextual feature label (corresponding to tone = (1, 2, 3, 4, 5)), the membership function described above computes the value of the degree to which it falls into that class. Each membership function value is then normalized and quantized to a value between 0 and 1, such as (0.05, 0.45, 0.1, 0.2, 0.2), and its context degree is determined, for example high, middle or low. Each contextual feature label is then jointly represented as the fuzzy contextual feature label.
In the embodiments of the invention, a threshold can be set, for example threshold=0.2; then only the pronunciation candidates satisfying this baseline requirement, for example tones 2, 4 and 5, are considered when the fuzzy contextual feature label is generated. The fuzzy context label is generated according to the corresponding degree distribution of the tones, for example tone=High2_Low4_Low5.
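The conversion from normalized membership values to the joint fuzzy label can be sketched as follows; the degree bands and the exact label syntax are assumptions extrapolated from the tone=High2_Low4_Low5 example above:

# Sketch: quantize normalized membership values into degree bands and
# join the surviving candidates into one fuzzy context label.
def degree(m):
    # Assumed band boundaries for High / Mid / Low.
    return "High" if m >= 0.4 else ("Mid" if m >= 0.3 else "Low")

def fuzzy_tone_label(memberships, threshold=0.2):
    # memberships: dict mapping tone -> normalized membership in [0, 1]
    kept = [(t, m) for t, m in sorted(memberships.items()) if m >= threshold]
    return "tone=" + "_".join("%s%d" % (degree(m), t) for t, m in kept)

print(fuzzy_tone_label({1: 0.05, 2: 0.45, 3: 0.1, 4: 0.2, 5: 0.2}))
# -> tone=High2_Low4_Low5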
Those skilled in the art should appreciate that the fuzzy contextual feature label can be generated in many ways; for example, the scores of similar segments across the whole training corpus can be collected statistically, and the quantized fuzzy context then obtained from the histogram of the distribution proportions. It should be noted that the embodiments of the invention are only illustrative, and the way of generating the fuzzy contextual feature label in the embodiments of the invention is not limited to the above.
In the embodiments of the invention, generating fuzzy contextual feature labels preserves the diversity characteristic of fuzzification, which avoids making rigid classifications among the uncertain attribute classes caused by bad data.
In the embodiments of the invention, after the fuzzy contextual feature labels are generated for the fuzzy data, fuzzy decision tree training can be carried out, and the model parameters of the acoustic model are updated during this decision tree training. Here, tone determination is still taken as the example, yet those skilled in the art will understand that the method is equally applicable to determining the candidate pronunciation of a polyphone with different pronunciations. The above instance is still used for brief explanation. As shown in Table 2, the corresponding fuzzy question set can be set as:
Table 2: Questions and values used in the fuzzy question set (table image in the original).
The questions illustrated above can cover the various cases of combined tone classes, and each case can be queried. The combinations of these cases can come from linguistic knowledge, or from the practical combinations that occur during training, and so on. (Illustrative examples of such fuzzy questions follow.)
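By analogy with the tone questions shown for Table 1, the fuzzy questions of Table 2 might be written in the same QS format. The following two entries are purely illustrative assumptions, not the patent's actual question set:

QS "phntone==High2_Low4_Low5" {"*|phntone=High2_Low4_Low5|*"}    Is the fuzzy tone class High2_Low4_Low5?
QS "phntone==High2" {"*|phntone=High2_*|*"}    Does tone 2 have degree High among the candidates?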
In the embodiments of the invention, multiple clustering schemes can be adopted, for example clustering the whole training corpus again, or clustering a secondary training corpus consisting only of the fuzzy data. When the whole training corpus is clustered again, if a training data item in the corpus is fuzzy data, its label is replaced by the fuzzy contextual feature label generated as above, and the corresponding fuzzy question set is added to the question set.
In the embodiments of the invention, when the secondary training corpus is clustered, only the fuzzy context labels and the fuzzy question set are used for training, based on the acoustic model and decision tree already trained.
Clustering as described above yields the acoustic model with the fuzzy decision tree.
In the embodiments of the invention, the acoustic model with a fuzzy decision tree obtained by training on real speech improves the quality of speech synthesis, making the fuzzy processing more reasonable, flexible and intelligent, while also letting the conventional speech be trained more accurately.
Fig. 6 illustrates the method of synthesizing speech according to an embodiment of the invention. This method for speech synthesis can comprise: determining that data generated by text analysis are fuzzy polyphone data; performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities; generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; determining model parameters for the fuzzy contextual feature label based on an acoustic model determined to have a fuzzy decision tree; generating speech parameters from the model parameters; and synthesizing the speech parameters into speech.
As shown in Fig. 6, at step S610, the data generated by text analysis are determined to be fuzzy polyphone data. In the embodiments of the invention, the text analyzer performs word segmentation on the text to be synthesized, decomposing it into words carrying attribute labels and their pronunciation symbols, and then, according to semantic rules and phonetic rules, determines the sentence structure and intonation for each word and each syllable, as well as prosodic features of the target speech such as pauses. Multi-character words and single characters can be obtained from the segmentation result. Multi-character words can generally have their pronunciation determined from the dictionary even when they contain polyphonic characters, so such polyphones are not the fuzzy polyphone data of the invention. The polyphones in the embodiments of the invention generally refer to single characters that still have several pronunciations after segmentation. When speech prediction is carried out on such a polyphone, a prediction result is produced for each candidate pronunciation, describing the probability each pronunciation of the polyphone has in the concrete utterance. There are many ways to judge whether the polyphone is fuzzy polyphone data; for example, a threshold can be set, and a polyphone satisfying it is fuzzy polyphone data. For instance, if no candidate among all candidates has a probability above 70%, the polyphone can be considered fuzzy polyphone data, as in the sketch below. The principle for determining fuzzy polyphone data is similar to the principle for determining fuzzy data in the training stage and is not repeated here.
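The synthesis-time test can be sketched like this; the 70% figure follows the example above, and the rest is an illustrative assumption:

# Sketch of the synthesis-stage decision: a polyphone with no dominant
# predicted pronunciation is treated as fuzzy polyphone data.
def classify_polyphone(candidates, threshold=0.7):
    # candidates: list of (pronunciation, probability) from the predictor
    best = max(candidates, key=lambda c: c[1])
    if best[1] > threshold:
        return ("definite", best[0])
    return ("fuzzy", candidates)

print(classify_polyphone([("wei4", 0.55), ("wei2", 0.45)]))
# -> ('fuzzy', [('wei4', 0.55), ('wei2', 0.45)])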
Afterwards, at step S620, fuzzy polyphone prediction is performed on the fuzzy polyphone data, to output the plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities. In the embodiments of the invention, for non-fuzzy polyphone data the pronunciation can be determined with high confidence, so fuzzy processing is unnecessary; conventional polyphone prediction processing is carried out to output the single determined candidate pronunciation. If the polyphone is fuzzy polyphone data, fuzzy processing is carried out, and a plurality of candidate pronunciations and their corresponding probabilities are output.
Next, at step S630, the fuzzy contextual feature label is generated based on the candidate pronunciations and their probabilities. In the embodiments of the invention, the execution of this step is similar to step S160 of the training process for generating the fuzzy contextual feature label; it can be realized through the quantizing-mapping conversion or in other ways, and is not repeated here.
At step S640, the corresponding model parameters for the fuzzy contextual feature label are determined based on the acoustic model with the fuzzy decision tree. In the embodiments of the invention, for the HMM acoustic model, the corresponding model parameters are the distributions of the components under the states of the HMM model.
At step S650, speech parameters are generated from the model parameters. Parameter generation algorithms commonly used in the art can be adopted, for example a parameter generation algorithm under the maximum-likelihood criterion; this is not repeated here.
Finally, at step S660, the speech parameters are synthesized into speech.
In the embodiments of the invention, speech is synthesized by applying fuzzy processing to the pronunciation of fuzzy polyphone data, so that in different contexts the pronunciation can vary in different ways, improving the quality of speech synthesis.
Under the same inventive concept, Fig. 7 is a block diagram of the equipment for synthesizing speech according to an embodiment of the invention. The present embodiment is described below with reference to this figure. For those parts identical to the preceding embodiments, their explanation is omitted as appropriate.
The equipment 700 for synthesizing speech can comprise: a polyphone prediction unit 703 for performing fuzzy prediction on fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their prediction probabilities; a fuzzy contextual feature label generation unit 704 for generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; a determination unit 705 for determining model parameters for the fuzzy contextual feature label based on an acoustic model determined to have a fuzzy decision tree; a parameter generator 706 for generating speech parameters from the model parameters; and a synthesizer 707 for synthesizing the speech parameters into speech.
The equipment 700 for synthesizing speech of the invention can realize the above-described method for synthesizing speech; for its concrete operation, please refer to the content above, which is not repeated here.
In the embodiments of the invention, the equipment 700 can also comprise a text analyzer 702 for decomposing the text to be synthesized into words carrying attribute labels and their pronunciation symbols. Optionally, the equipment 700 can also comprise an input/output unit 701 for inputting the text to be synthesized and outputting the synthesized speech. Optionally, in the embodiments of the invention, a symbol stream on which text analysis has already been performed can also be input directly from outside. Therefore, as shown in Fig. 7, the text analyzer 702 and the input/output unit 701 are shown with broken lines.
In the embodiments of the invention, the equipment 700 for synthesizing speech and its parts can, in operation, realize the method for synthesizing speech of the embodiments described above, or its steps.
The equipment 700 for synthesizing speech in the present embodiment and each of its components can be built from special-purpose circuits or chips, or can be realized by a computer (processor) executing corresponding programs.
Those of ordinary skill in the art will appreciate that the above method and equipment can be realized using computer-executable instructions and/or processor control code, provided for example on a carrier medium such as a disk, CD or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The method and equipment of the present embodiment can also be realized by hardware circuits such as very-large-scale integrated circuits or gate arrays, by semiconductors such as logic chips and transistors, by programmable hardware devices such as field-programmable gate arrays and programmable logic devices, or by a combination of the above hardware circuits and software such as firmware.
Although the method for training an acoustic model and the method and equipment for synthesizing speech of the invention have been described above in detail with reference to specific embodiments, the invention is not limited to them. Those of ordinary skill in the art will understand that various transformations, substitutions and modifications can be made to the invention without departing from its spirit and scope; the scope of protection of the invention is defined by the appended claims.

Claims (10)

1. A method for speech synthesis, comprising:
determining that data generated by text analysis are fuzzy polyphone data;
performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities;
generating a fuzzy contextual feature label based on the plurality of candidate pronunciations and their probabilities;
determining model parameters for the fuzzy contextual feature label based on an acoustic model with a fuzzy decision tree;
generating speech parameters from the model parameters; and
synthesizing the speech parameters into speech.
2. The method of claim 1, wherein generating the fuzzy contextual feature label further comprises:
determining, based on the probabilities, the degree to which the context label of each candidate pronunciation of the fuzzy polyphone data falls into its class; and
generating the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
3. Equipment for synthesizing speech, comprising:
a polyphone prediction unit for fuzzily predicting the pronunciation of fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their prediction probabilities;
a fuzzy contextual feature label generation unit for generating a fuzzy contextual feature label based on the plurality of candidate pronunciations and their probabilities;
a determination unit for determining model parameters for the fuzzy contextual feature label based on an acoustic model with a fuzzy decision tree;
a parameter generator for generating speech parameters from the model parameters; and
a synthesizer for synthesizing the speech parameters into speech.
4. The equipment of claim 3, wherein the fuzzy contextual feature label generation unit is further configured to:
determine, based on the probabilities, the degree to which the context label of each candidate pronunciation of the fuzzy polyphone data falls into its class; and
generate the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
5. A system for synthesizing speech, comprising:
means for determining that data generated by text analysis are fuzzy polyphone data;
means for performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities;
means for generating a fuzzy contextual feature label based on the plurality of candidate pronunciations and their probabilities;
means for determining model parameters for the fuzzy contextual feature label based on an acoustic model with a fuzzy decision tree;
means for generating speech parameters from the model parameters; and
means for synthesizing the speech parameters into speech.
6. A method for training an acoustic model, comprising:
training each speech unit in a training speech corpus to generate an acoustic model, the speech units comprising acoustic parameters and context labels;
performing decision tree clustering on the context combinations, to generate an acoustic model with a decision tree;
determining the fuzzy data in the speech corpus based on the acoustic model with the decision tree;
generating fuzzy contextual feature labels for the fuzzy data; and
performing clustering training on the speech corpus based on the fuzzy contextual feature labels, to generate an acoustic model with a fuzzy decision tree.
7. The method of claim 6, wherein determining the fuzzy data further comprises:
evaluating a speech unit;
determining the degree to which each candidate context label of the speech unit falls into its class; and
if the degree satisfies a predetermined threshold, determining that the speech unit is fuzzy data.
8. The method of claim 7, wherein evaluating the speech unit further comprises:
evaluating the score of the contextual feature label of each candidate pronunciation of the speech unit by means of the model posterior probability, or of the distance between the model-generated parameters and the speech unit parameters.
9. The method of claim 6, wherein generating the fuzzy contextual feature labels further comprises:
determining the score of the contextual feature label of each candidate pronunciation of the speech unit by evaluating the speech unit;
determining, based on the scores, the degree to which each candidate context label of the speech unit falls into its class; and
generating the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
10. The method of claim 6, wherein performing the clustering training based on the fuzzy contextual feature labels further comprises one of the following:
training the training set comprising the fuzzy data, based on the fuzzy contextual feature labels and a preset fuzzy question set, to generate the acoustic model with the fuzzy decision tree; and
training each speech unit in the speech corpus again based on a question set and contextual feature labels, wherein the question set additionally comprises the preset fuzzy question set, and the contextual feature labels of the fuzzy data in the speech corpus are the fuzzy contextual feature labels.
CN2011100465804A 2011-02-25 2011-02-25 Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis Pending CN102651217A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2011100465804A CN102651217A (en) 2011-02-25 2011-02-25 Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis
US13/402,602 US9058811B2 (en) 2011-02-25 2012-02-22 Speech synthesis with fuzzy heteronym prediction using decision trees


Publications (1)

Publication Number Publication Date
CN102651217A true CN102651217A (en) 2012-08-29

Family

ID=46693212


Country Status (2)

Country Link
US (1) US9058811B2 (en)
CN (1) CN102651217A (en)

Cited By (95)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578467A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Acoustic model building method, speech recognition method and electronic device thereof
CN103854643A (en) * 2012-11-29 2014-06-11 株式会社东芝 Method and apparatus for speech synthesis
CN103902600A (en) * 2012-12-27 2014-07-02 富士通株式会社 Keywords list forming device and method and electronic equipment
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN104200803A (en) * 2014-09-16 2014-12-10 北京开元智信通软件有限公司 Voice broadcasting method, device and system
CN104464731A (en) * 2013-09-20 2015-03-25 株式会社东芝 Data collection device, method, voice talking device and method
CN104599670A (en) * 2015-01-30 2015-05-06 成都星炫科技有限公司 Voice recognition method of touch and talk pen
CN104867491A (en) * 2015-06-17 2015-08-26 百度在线网络技术(北京)有限公司 Training method and device for prosody model used for speech synthesis
CN105225657A (en) * 2015-10-22 2016-01-06 百度在线网络技术(北京)有限公司 Polyphone mark template generation method and device
CN105304081A (en) * 2015-11-09 2016-02-03 上海语知义信息技术有限公司 Smart household voice broadcasting system and voice broadcasting method
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
CN105340004A (en) * 2013-06-28 2016-02-17 谷歌公司 Computer-implemented method, computer-readable medium and system for pronunciation learning
CN105702248A (en) * 2014-12-09 2016-06-22 苹果公司 Disambiguating heteronyms in speech synthesis
CN105931635A (en) * 2016-03-31 2016-09-07 北京奇艺世纪科技有限公司 Audio segmentation method and device
CN108305612A (en) * 2017-11-21 2018-07-20 腾讯科技(深圳)有限公司 Text-processing, model training method, device, storage medium and computer equipment
CN108346423A (en) * 2017-01-23 2018-07-31 北京搜狗科技发展有限公司 The treating method and apparatus of phonetic synthesis model
CN108364639A (en) * 2013-08-23 2018-08-03 株式会社东芝 Speech processing system and method
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Method, system, device and storage medium for optimizing speech recognition acoustic model
CN109996149A (en) * 2017-12-29 2019-07-09 深圳市赛菲姆科技有限公司 A kind of parking lot Intelligent voice broadcasting system
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
CN111681641A (en) * 2020-05-26 2020-09-18 微软技术许可有限责任公司 Phrase-based end-to-end text-to-speech (TTS) synthesis
CN111968676A (en) * 2020-08-18 2020-11-20 北京字节跳动网络技术有限公司 Pronunciation correction method and device, electronic equipment and storage medium
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
CN114360494A (en) * 2021-12-29 2022-04-15 广州酷狗计算机科技有限公司 Prosody labeling method and device, computer equipment and storage medium
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
CN115440205A (en) * 2021-06-04 2022-12-06 中国移动通信集团浙江有限公司 Speech processing method, device, terminal and program product
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US12010262B2 (en) 2013-08-06 2024-06-11 Apple Inc. Auto-activating smart responses based on activities from remote devices
US12014118B2 (en) 2017-05-15 2024-06-18 Apple Inc. Multi-modal interfaces having selection disambiguation and text modification capability
US12051413B2 (en) 2015-09-30 2024-07-30 Apple Inc. Intelligent device identification
US12067985B2 (en) 2018-06-01 2024-08-20 Apple Inc. Virtual assistant operations in multi-device environments
US12073147B2 (en) 2013-06-09 2024-08-27 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US12087308B2 (en) 2010-01-18 2024-09-10 Apple Inc. Intelligent automated assistant
US12223282B2 (en) 2016-06-09 2025-02-11 Apple Inc. Intelligent automated assistant in a home environment
US12254887B2 (en) 2017-05-16 2025-03-18 Apple Inc. Far-field extension of digital assistant services for providing a notification of an event to a user
US12277954B2 (en) 2024-04-16 2025-04-15 Apple Inc. Voice trigger for a digital assistant

Families Citing this family (133)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
WO2010067118A1 (en) 2008-12-11 2010-06-17 Novauris Technologies Limited Speech recognition involving a mobile device
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US8706472B2 (en) * 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US8994660B2 (en) 2011-08-29 2015-03-31 Apple Inc. Text correction processing
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
CN102982019B (en) * 2012-11-26 2019-01-15 百度国际科技(深圳)有限公司 Method for phonetically annotating an input-method corpus, and method and electronic device for generating an evaluation and test corpus
US9396723B2 (en) 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
CN103971677B (en) * 2013-02-01 2015-08-12 腾讯科技(深圳)有限公司 Acoustic language model training method and device
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
WO2014144579A1 (en) 2013-03-15 2014-09-18 Apple Inc. System and method for updating an adaptive speech recognition model
WO2014144949A2 (en) 2013-03-15 2014-09-18 Apple Inc. Training an at least partial voice command system
US20140351196A1 (en) * 2013-05-21 2014-11-27 Sas Institute Inc. Methods and systems for using clustering for splitting tree nodes in classification decision trees
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
WO2014197336A1 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
WO2014197335A1 (en) 2013-06-08 2014-12-11 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
KR101809808B1 (en) 2013-06-13 2017-12-15 애플 인크. System and method for emergency calls initiated by voice command
CN105531757B (en) * 2013-09-20 2019-08-06 株式会社东芝 Voice selection assistance device and voice selection method
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
CA2934298C (en) * 2014-01-14 2023-03-07 Interactive Intelligence Group, Inc. System and method for synthesis of speech from provided text
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
KR20160058470A (en) * 2014-11-17 2016-05-25 삼성전자주식회사 Speech synthesis apparatus and control method thereof
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US10152299B2 (en) 2015-03-06 2018-12-11 Apple Inc. Reducing response latency of intelligent automated assistants
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179588B1 (en) 2016-06-09 2019-02-22 Apple Inc. Intelligent automated assistant in a home environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US12197817B2 (en) 2016-06-11 2025-01-14 Apple Inc. Intelligent device arbitration and control
DK179049B1 (en) 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
DK179343B1 (en) 2016-06-11 2018-05-14 Apple Inc Intelligent task discovery
KR102392094B1 (en) 2016-09-06 2022-04-28 딥마인드 테크놀로지스 리미티드 Sequence processing using convolutional neural networks
US11080591B2 (en) 2016-09-06 2021-08-03 Deepmind Technologies Limited Processing sequences using convolutional neural networks
KR102353284B1 (en) 2016-09-06 2022-01-19 딥마인드 테크놀로지스 리미티드 Generate audio using neural networks
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
JP6756916B2 (en) 2016-10-26 2020-09-16 ディープマインド テクノロジーズ リミテッド Processing text sequences using neural networks
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
CN107122179A (en) 2017-03-31 2017-09-01 阿里巴巴集团控股有限公司 Voice function control method and device
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10431203B2 (en) * 2017-09-05 2019-10-01 International Business Machines Corporation Machine training for native language and fluency identification
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
CN110047463B (en) * 2019-01-31 2021-03-02 北京捷通华声科技股份有限公司 Voice synthesis method and device and electronic equipment
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 Speech synthesis method and system
CN115116427B (en) * 2022-06-22 2023-11-14 马上消费金融股份有限公司 Labeling method, voice synthesis method, training method and training device
CN115512696B (en) * 2022-09-20 2024-09-13 中国第一汽车股份有限公司 Simulation training method and vehicle

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6366883B1 (en) * 1996-05-15 2002-04-02 Atr Interpreting Telecommunications Concatenation of speech segments by use of a speech synthesizer
US6081781A (en) * 1996-09-11 2000-06-27 Nippon Telegragh And Telephone Corporation Method and apparatus for speech synthesis and program recorded medium
JP3587048B2 (en) * 1998-03-02 2004-11-10 株式会社日立製作所 Prosody control method and speech synthesizer
ATE298453T1 (en) * 1998-11-13 2005-07-15 Lernout & Hauspie Speechprod SPEECH SYNTHESIS BY CONTACTING SPEECH WAVEFORMS
EP1159733B1 (en) * 1999-03-08 2003-08-13 Siemens Aktiengesellschaft Method and array for determining a representative phoneme
US7657102B2 (en) * 2003-08-27 2010-02-02 Microsoft Corp. System and method for fast on-line learning of transformed hidden Markov models
US7881934B2 (en) * 2003-09-12 2011-02-01 Toyota Infotechnology Center Co., Ltd. Method and system for adjusting the voice prompt of an interactive system based upon the user's state
FR2861491B1 (en) * 2003-10-24 2006-01-06 Thales Sa METHOD FOR SELECTING SYNTHESIS UNITS
US7983910B2 (en) * 2006-03-03 2011-07-19 International Business Machines Corporation Communicating across voice and text channels with emotion preservation
US20080120093A1 (en) * 2006-11-16 2008-05-22 Seiko Epson Corporation System for creating dictionary for speech synthesis, semiconductor integrated circuit device, and method for manufacturing semiconductor integrated circuit device
US20090299731A1 (en) * 2007-03-12 2009-12-03 Mongoose Ventures Limited Aural similarity measuring system for text
GB0704772D0 (en) * 2007-03-12 2007-04-18 Mongoose Ventures Ltd Aural similarity measuring system for text
BRPI0809759A2 (en) * 2007-04-26 2014-10-07 Ford Global Tech Llc Emotive information system, emotive information systems, emotive information driving methods, emotive information systems for a passenger vehicle and computer-implemented method
US8321222B2 (en) * 2007-08-14 2012-11-27 Nuance Communications, Inc. Synthesis by generation and concatenation of multi-form segments
CN101452699A (en) * 2007-12-04 2009-06-10 株式会社东芝 Prosody-adaptive speech synthesis method and apparatus
JP5422754B2 (en) * 2010-01-04 2014-02-19 株式会社東芝 Speech synthesis apparatus and method
WO2012001457A1 (en) * 2010-06-28 2012-01-05 Kabushiki Kaisha Toshiba Method and apparatus for fusing voiced phoneme units in text-to-speech
US9009050B2 (en) * 2010-11-30 2015-04-14 At&T Intellectual Property I, L.P. System and method for cloud-based text-to-speech web services
US8706472B2 (en) * 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6098042A (en) * 1998-01-30 2000-08-01 International Business Machines Corporation Homograph filter for speech synthesis system
CN1836226A (en) * 2003-08-21 2006-09-20 熊锦棠 Method and apparatus for converting characters of non-alphabetic languages
US20060277045A1 (en) * 2005-06-06 2006-12-07 International Business Machines Corporation System and method for word-sense disambiguation by recursive partitioning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
K. TOKUDA ET AL: "AN HMM-BASED SPEECH SYNTHESIS SYSTEM APPLIED TO ENGLISH", 《PROC. OF 2002 IEEE SSW》, 30 September 2002 (2002-09-30) *
LU HENG ET AL: "HETERONYM VERIFICATION FOR MANDARIN SPEECH SYNTHESIS", 《INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING》, 19 December 2008 (2008-12-19) *
ZHANG ZIRONG, CHU MIN: "A STATISTICAL LEARNING METHOD FOR GRAPHEME-TO-PHONEME CONVERSION OF CHINESE POLYPHONIC CHARACTERS", 《JOURNAL OF CHINESE INFORMATION PROCESSING》, vol. 16, no. 3, 31 December 2002 (2002-12-31) *

Cited By (139)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11900936B2 (en) 2008-10-02 2024-02-13 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US12087308B2 (en) 2010-01-18 2024-09-10 Apple Inc. Intelligent automated assistant
US12165635B2 (en) 2010-01-18 2024-12-10 Apple Inc. Intelligent automated assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
CN103854643A (en) * 2012-11-29 2014-06-11 株式会社东芝 Method and apparatus for speech synthesis
CN103854643B (en) * 2012-11-29 2017-03-01 株式会社东芝 Method and apparatus for synthesizing voice
CN103902600A (en) * 2012-12-27 2014-07-02 富士通株式会社 Keyword list forming device and method, and electronic equipment
CN103902600B (en) * 2012-12-27 2017-12-01 富士通株式会社 Keyword list forming apparatus and method, and electronic equipment
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US11557310B2 (en) 2013-02-07 2023-01-17 Apple Inc. Voice trigger for a digital assistant
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US12073147B2 (en) 2013-06-09 2024-08-27 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
CN105340004B (en) * 2013-06-28 2019-09-10 谷歌有限责任公司 Computer-implemented method, computer-readable medium and system for word pronunciation learning
CN105340004A (en) * 2013-06-28 2016-02-17 谷歌公司 Computer-implemented method, computer-readable medium and system for pronunciation learning
US12010262B2 (en) 2013-08-06 2024-06-11 Apple Inc. Auto-activating smart responses based on activities from remote devices
CN108364639A (en) * 2013-08-23 2018-08-03 株式会社东芝 Speech processing system and method
CN104464731A (en) * 2013-09-20 2015-03-25 株式会社东芝 Data collection device and method, and voice interaction device and method
CN103578467A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Acoustic model building method, speech recognition method and electronic device thereof
US10114809B2 (en) 2014-05-07 2018-10-30 Tencent Technology (Shenzhen) Company Limited Method and apparatus for phonetically annotating text
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN104142909B (en) * 2014-05-07 2016-04-27 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
CN104200803A (en) * 2014-09-16 2014-12-10 北京开元智信通软件有限公司 Voice broadcasting method, device and system
CN105702248B (en) * 2014-12-09 2019-11-19 苹果公司 Electronic device, method and storage medium for operating an intelligent automated assistant
CN105702248A (en) * 2014-12-09 2016-06-22 苹果公司 Disambiguating heteronyms in speech synthesis
CN104599670A (en) * 2015-01-30 2015-05-06 成都星炫科技有限公司 Voice recognition method for a touch-and-talk pen
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
US12001933B2 (en) 2015-05-15 2024-06-04 Apple Inc. Virtual assistant in a communication session
US12154016B2 (en) 2015-05-15 2024-11-26 Apple Inc. Virtual assistant in a communication session
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
CN104867491B (en) * 2015-06-17 2017-08-18 百度在线网络技术(北京)有限公司 Prosody model training method and device for speech synthesis
CN104867491A (en) * 2015-06-17 2015-08-26 百度在线网络技术(北京)有限公司 Training method and device for prosody model used for speech synthesis
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11954405B2 (en) 2015-09-08 2024-04-09 Apple Inc. Zero latency digital assistant
US12204932B2 (en) 2015-09-08 2025-01-21 Apple Inc. Distributed personal assistant
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US12051413B2 (en) 2015-09-30 2024-07-30 Apple Inc. Intelligent device identification
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
CN105225657A (en) * 2015-10-22 2016-01-06 百度在线网络技术(北京)有限公司 Polyphone annotation template generation method and device
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
CN105304081A (en) * 2015-11-09 2016-02-03 上海语知义信息技术有限公司 Smart home voice broadcasting system and voice broadcasting method
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
CN105931635A (en) * 2016-03-31 2016-09-07 北京奇艺世纪科技有限公司 Audio segmentation method and device
US12223282B2 (en) 2016-06-09 2025-02-11 Apple Inc. Intelligent automated assistant in a home environment
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
CN108346423A (en) * 2017-01-23 2018-07-31 北京搜狗科技发展有限公司 Processing method and apparatus for a speech synthesis model
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11538469B2 (en) 2017-05-12 2022-12-27 Apple Inc. Low-latency intelligent automated assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US12014118B2 (en) 2017-05-15 2024-06-18 Apple Inc. Multi-modal interfaces having selection disambiguation and text modification capability
US12254887B2 (en) 2017-05-16 2025-03-18 Apple Inc. Far-field extension of digital assistant services for providing a notification of an event to a user
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
CN108305612B (en) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN108305612A (en) * 2017-11-21 2018-07-20 腾讯科技(深圳)有限公司 Text processing and model training methods and devices, storage medium and computer equipment
CN109996149A (en) * 2017-12-29 2019-07-09 深圳市赛菲姆科技有限公司 Intelligent voice broadcasting system for parking lots
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Method, system, device and storage medium for optimizing speech recognition acoustic model
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US12067985B2 (en) 2018-06-01 2024-08-20 Apple Inc. Virtual assistant operations in multi-device environments
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US12080287B2 (en) 2018-06-01 2024-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
CN111681641B (en) * 2020-05-26 2024-02-06 微软技术许可有限责任公司 Phrase-based end-to-end text-to-speech (TTS) synthesis
CN111681641A (en) * 2020-05-26 2020-09-18 微软技术许可有限责任公司 Phrase-based end-to-end text-to-speech (TTS) synthesis
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
CN111968676A (en) * 2020-08-18 2020-11-20 北京字节跳动网络技术有限公司 Pronunciation correction method and device, electronic equipment and storage medium
CN115440205A (en) * 2021-06-04 2022-12-06 中国移动通信集团浙江有限公司 Speech processing method, device, terminal and program product
CN114360494A (en) * 2021-12-29 2022-04-15 广州酷狗计算机科技有限公司 Prosody labeling method and device, computer equipment and storage medium
US12277954B2 (en) 2024-04-16 2025-04-15 Apple Inc. Voice trigger for a digital assistant

Also Published As

Publication number Publication date
US20120221339A1 (en) 2012-08-30
US9058811B2 (en) 2015-06-16

Similar Documents

Publication Publication Date Title
CN102651217A (en) Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis
Qian et al. Contentvec: An improved self-supervised speech representation by disentangling speakers
Kharitonov et al. Text-free prosody-aware generative spoken language modeling
US11837216B2 (en) Speech recognition using unspoken text and speech synthesis
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
Stoller et al. End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model
US10332508B1 (en) Confidence checking for speech processing and query answering
US10388274B1 (en) Confidence checking for speech processing and query answering
Franco et al. Automatic pronunciation scoring for language instruction
Morgan Deep and wide: Multiple layers in automatic speech recognition
EP2815398B1 (en) Audio human interactive proof based on text-to-speech and semantics
CN101828218B (en) Synthesis by generation and concatenation of multi-form segments
CN106297800B (en) A method and device for adaptive speech recognition
WO2022148176A1 (en) Method, device, and computer program product for english pronunciation assessment
CN101551947A (en) Computer system for assisting spoken language learning
CN110459202B (en) Prosody labeling method, device, equipment and medium
Abdou et al. Computer aided pronunciation learning system using speech recognition techniques.
CN110415725A (en) Method and system for assessing second-language pronunciation quality using first-language data
US20020040296A1 (en) Phoneme assigning method
JP6810580B2 (en) Language model learning device and its program
CN102651218A (en) Method and equipment for creating voice tag
Chang et al. Speechprompt: Prompting speech language models for speech processing tasks
Barbany et al. FastVC: Fast Voice Conversion with non-parallel data
Li et al. Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models
Janyoi et al. An Isarn dialect HMM-based text-to-speech system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 2016-11-30

C20 Patent right or utility model deemed to be abandoned or is abandoned