CN102651217A - Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis - Google Patents


Info

Publication number
CN102651217A
Authority
CN
China
Prior art keywords
fuzzy
contextual feature
data
mark
polyphone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100465804A
Other languages
Chinese (zh)
Inventor
汪曦
楼晓雁
李健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Priority to CN2011100465804A priority Critical patent/CN102651217A/en
Priority to US13/402,602 priority patent/US9058811B2/en
Publication of CN102651217A publication Critical patent/CN102651217A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and equipment for speech synthesis and to a method for training an acoustic model used in speech synthesis. The method for speech synthesis includes the following steps: determining that data generated by text analysis are fuzzy polyphone data; performing fuzzy polyphone prediction on the fuzzy polyphone data, so as to output a plurality of candidate pronunciations and their probabilities; generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; determining model parameters for the fuzzy contextual feature label based on an acoustic model provided with a fuzzy decision tree; generating speech parameters from the model parameters; and synthesizing speech from the speech parameters. With the method and equipment provided by the embodiments of the invention, polyphonic characters in Chinese text whose pronunciation is difficult to predict can be given fuzzy processing, so as to improve the synthesis quality of Chinese polyphones.

Description

Method and equipment for speech synthesis, and method for training an acoustic model used in speech synthesis
Technical field
The present invention relates to speech synthesis and, more specifically, to the synthesis of Chinese polyphones (characters that have more than one pronunciation).
Background art
Producing speech artificially by means of machinery and equipment is called speech synthesis. Speech synthesis is an important component of man-machine spoken communication. Speech synthesis technology lets a machine speak like a person: information that is otherwise represented or stored can be converted into speech, so that people can obtain it conveniently through hearing.
The systems most widely researched and applied at present are text-to-speech (TTS) systems. In such a system, the text to be synthesized is usually input, and a text analyzer included in the system processes it and outputs pronunciation description information, which comprises phonetic symbols at the segment level and prosodic symbols at the suprasegmental level. The text analyzer first decomposes the text to be synthesized, according to a pronunciation dictionary, into words carrying attribute labels and their pronunciation symbols; then, according to semantic rules and phonetic rules, it determines the sentence structure and intonation for each word and each syllable, as well as the linguistic and prosodic features of the target speech, such as pauses, part of speech, and distance. The pronunciation description information is then input to a synthesizer included in the system, which synthesizes and outputs the speech.
In the prior art, acoustic models based on hidden Markov models (HMMs) are widely used in speech synthesis technology, since the synthesized voice can easily be modified and converted. Speech synthesis is usually divided into a model training part and a synthesis part. In the model training stage, a statistical model is trained on the acoustic parameters of each speech unit in the speech corpus together with labels of attributes such as the corresponding segment and prosody. These labels derive from linguistic and phonetic knowledge, and the contextual features they form describe the corresponding speech attributes (for example tone, part of speech, and so on). In the training stage of the HMM acoustic model, the model parameters are estimated by statistical computation over these speech unit parameters.
In the prior art, given the very large number of context combinations with their many variations, decision-tree clustering is generally adopted. A decision tree can gather candidate units with similar contextual and acoustic features into one class, which effectively avoids data sparseness and effectively reduces the number of models. A question set is the set of questions available for constructing the decision tree; the question chosen when a node is split is bound to that node and determines which units enter the same leaf node. The clustering process refers to a predefined question set: every node of the decision tree is bound to a yes/no question, every candidate unit allowed to enter the root node answers the question bound at each node, and the answer selects the left or right branch. Thus syllables or phonemes with identical or similar contextual features end up in the same leaf node of the decision tree; the model corresponding to a node is usually an HMM model or state, described by its parameters. At the same time, clustering is also the process by which new cases encountered during synthesis are handled, so that an optimal match can be achieved. Training and clustering on the training data yield the hidden Markov (HMM) models and the corresponding decision trees. (A minimal sketch of such a tree lookup follows.)
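As a concrete illustration of this clustering scheme, the following minimal Python sketch (not taken from the patent; the node layout and model names are hypothetical) walks a binary decision tree whose internal nodes are bound to yes/no questions about contextual features and whose leaves hold the clustered model identifiers:

# Minimal sketch of looking up a clustered model in a context decision
# tree. Illustrative only; node layout and model names are hypothetical.
def make_node(question, yes, no):
    # question: callable taking a context dict and returning True/False
    return {"question": question, "yes": yes, "no": no}

def find_leaf(node, context):
    while isinstance(node, dict):                # internal node
        node = node["yes"] if node["question"](context) else node["no"]
    return node                                  # leaf: clustered model id

# Example: split first on "is the tone class 3?", then on "is it 1 or 2?".
tree = make_node(lambda c: c["tone"] == 3,
                 "model_A",
                 make_node(lambda c: c["tone"] in (1, 2), "model_B", "model_C"))

print(find_leaf(tree, {"tone": 2}))              # -> model_B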
At the synthesis stage, the contextual feature label of a polyphone is obtained through the text analyzer and the context label generator. For this contextual feature label, the corresponding acoustic model parameters (for example the state sequence of the HMM acoustic model) are found on the trained decision tree. The model parameters are then turned into the relevant speech parameters by a parameter generation algorithm, and the speech is synthesized by a synthesizer (vocoder).
The goal of a speech synthesis system is to synthesize sound as intelligible and natural as human speech. For a Chinese speech synthesis system, however, the pronunciation prediction accuracy for polyphones is difficult to guarantee, because the pronunciation of a polyphone is often determined by its semantics, and semantic understanding is a challenging problem. This dependency makes it difficult for polyphone prediction to reach a satisfactorily high accuracy. In the prior art, even when the pronunciation prediction is not held with sufficient confidence, a speech synthesis system generally still outputs a single definite pronunciation for the polyphone.
In Chinese, different pronunciations express different meanings. If a speech synthesis system outputs a wrong pronunciation, it causes ambiguity in the listener's understanding and leaves a very bad impression. For speech synthesis systems used in daily life, work and scientific research (for example vehicle navigation, automatic voice information services, broadcasting, robot simulation, and so on), obviously wrong polyphone pronunciations lead to a bad user experience and even inconvenience in use. Therefore, in the field of speech synthesis, there is a need for improved polyphone speech synthesis methods and systems.
Summary of the invention
To this end, embodiments of the invention provide a method and system for speech synthesis and a method for training an acoustic model used in speech synthesis. Implementing embodiments of the invention can offer the following advantage: when the system is not confident enough to give the correct pronunciation, the pronunciation of the polyphone is fuzzified without affecting the quality of the other, normal sounds of the overall system. The method thereby avoids manifest errors and improves the overall subjective listening quality of the synthesis system.
According to one aspect of the invention, a method for speech synthesis is provided, which may comprise: determining that data generated by text analysis are fuzzy polyphone data; performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities; generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; determining model parameters for the fuzzy contextual feature label based on an acoustic model determined to have a fuzzy decision tree; generating speech parameters from the model parameters; and synthesizing the speech parameters into speech.
Preferably, the step of generating the fuzzy contextual feature label may further comprise: determining, based on the probabilities, the degree to which the context label of each candidate pronunciation of the fuzzy polyphone data falls into its class; and generating the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
According to another aspect of the invention, an apparatus for synthesizing speech is provided, which may comprise: a polyphone prediction unit for predicting the pronunciation of fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their prediction probabilities; a fuzzy contextual feature label generation unit for generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; a determination unit for determining model parameters for the fuzzy contextual feature label based on an acoustic model determined to have a fuzzy decision tree; a parameter generator for generating speech parameters from the model parameters; and a synthesizer for synthesizing the speech parameters into speech.
Preferably, the fuzzy contextual feature label generation unit may further be configured to: determine, based on the probabilities, the degree to which the context label of each candidate pronunciation of the fuzzy polyphone data falls into its class; and generate the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
According to another aspect of the invention, a system for synthesizing speech is provided, which may comprise: means for determining that data generated by text analysis are fuzzy polyphone data; means for performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities; means for generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; means for determining model parameters for the fuzzy contextual feature label based on an acoustic model with a fuzzy decision tree; means for generating speech parameters from the model parameters; and means for synthesizing the speech parameters into speech.
According to another aspect of the invention, a method for training an acoustic model is provided, which may comprise: training each speech unit in a training speech corpus to generate an acoustic model, the speech units comprising acoustic parameters and context labels; performing decision tree clustering on the context combinations to generate an acoustic model with a decision tree; determining the fuzzy data in the speech corpus based on the acoustic model with the decision tree; generating fuzzy contextual feature labels for the fuzzy data; and performing clustering training on the speech corpus based on the fuzzy contextual feature labels, to generate an acoustic model with a fuzzy decision tree.
Preferably, the step of determining the fuzzy data may further comprise: evaluating a speech unit; determining the degree to which each candidate context label of the speech unit falls into its class; and, if the degree satisfies a predetermined threshold, determining that the speech unit is fuzzy data.
Preferably, the step of evaluating the speech unit may further comprise: evaluating the score of the contextual feature label of each candidate pronunciation of the speech unit by means of the model posterior probability, or of the distance between the model-generated parameters and the speech unit parameters.
Preferably, the step of generating the fuzzy contextual feature labels may further comprise: determining the score of each candidate contextual feature label of the speech unit's pronunciation by evaluating the speech unit; determining, based on the scores, the degree to which each candidate context label of the speech unit falls into its class; and generating the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
Preferably, the step of performing clustering training based on the fuzzy contextual feature labels may further comprise one of the following: training the training set comprising the fuzzy data, based on the fuzzy contextual feature labels and a preset fuzzy question set, to generate the acoustic model with the fuzzy decision tree; and training each speech unit in the speech corpus again based on a question set and contextual feature labels, wherein the question set additionally comprises the preset fuzzy question set, and the contextual feature labels of the fuzzy data in the speech corpus are the fuzzy contextual feature labels.
Description of the drawings
The objects, features and advantages of the invention will become apparent from the following detailed description of embodiments of the invention, taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a flowchart of the method for training an acoustic model with a fuzzy decision tree according to an embodiment of the invention.
Fig. 2 shows a flowchart of the processing for determining fuzzy data in the method according to an embodiment of the invention.
Fig. 3 shows the operation of evaluating training data by the model posterior probability in the method according to an embodiment of the invention.
Fig. 4 shows the operation of evaluating training data by the distance between model-generated parameters and actual parameters in the method according to an embodiment of the invention.
Fig. 5 illustrates the quantizing conversion operation performed on fuzzy data to generate a fuzzy context according to an embodiment of the invention.
Fig. 6 illustrates the method of synthesizing speech according to an embodiment of the invention.
Fig. 7 is a block diagram of the apparatus for synthesizing speech according to an embodiment of the invention.
Detailed description of the embodiments
Hereinafter, embodiments of the invention are described in detail with reference to the accompanying drawings.
In general, the embodiments of the invention relate to methods and systems for synthesizing speech in electronic equipment (for example telephone systems, mobile terminals, vehicles, automatic voice information service systems, broadcast systems, robots, and/or the like), and to methods of training the acoustic models they use.
In general, the basic idea of the invention is as follows: when synthesizing a Chinese polyphone, instead of selecting a single definite candidate pronunciation, fuzzy processing is applied to the speech of a fuzzy polyphone, which avoids committing in advance to an arbitrary or even wrong decision. In the embodiments of the invention, a fuzzy polyphone is a polyphone whose pronunciation a prior-art polyphone prediction unit finds difficult to predict. Fuzzy data are speech data in the training speech corpus produced under the influence of coarticulation in the speaker's continuous speech and of occasional pronunciation errors; they satisfy a fuzziness condition (usually a fuzziness threshold can be defined on a membership function) and are used for model training. Correspondingly, speech for which it is difficult to determine the candidate pronunciation is called fuzzy speech. A fuzzy decision tree can preferably be introduced in the training and synthesis stages to realize this process; fuzzy decision trees are commonly used to handle uncertainty and can help derive more intelligent decisions over complicated and fuzzy boundaries, making the optimal choice under ambiguity. The fuzzified pronunciation is intended to include the characteristics of each candidate pronunciation, particularly of the candidates with larger probabilities; this avoids misjudging among the candidate pronunciations and thus reduces the probability of synthesizing jarring or wrong speech.
In the embodiments of the invention, a fuzzy decision tree can be introduced in the model training stage, and the speech corpus containing fuzzy data is trained further, yielding an acoustic model (for example an HMM acoustic model) and the fuzzy decision tree corresponding to that model (for example an HMM acoustic model with a fuzzy decision tree). At the synthesis stage, when the polyphone prediction unit cannot make a suitable choice, the pronunciation of the word is given fuzzy processing, so that the synthesizer synthesizes the speech corresponding to the candidates with higher predicted likelihood and the synthesized sound comes closer to them. The processing at the synthesis stage can operate as follows: the probabilities of a plurality of candidate pronunciations are obtained through the polyphone prediction unit; fuzzy contextual feature processing yields a fuzzy context label with multi-candidate fuzzy characteristics; based on the acoustic model with the fuzzy decision tree generated by training, the model parameters corresponding to this fuzzy context label are obtained; the model parameters are turned into the relevant speech parameters by a parameter generation algorithm; and the synthesizer synthesizes the speech parameters into speech.
Fig. 1 shows a flowchart of the method for training an acoustic model with a fuzzy decision tree according to an embodiment of the invention. As shown in Fig. 1, at step S110, each speech unit in the training speech corpus is trained to generate an acoustic model. In the embodiments of the invention, the speech corpus is generally reference speech that has been pre-recorded and entered through a speech input port. Each speech unit comprises acoustic parameters and context labels describing the corresponding segmental and prosodic attributes.
Taking the HMM acoustic model as an example, in the training stage of this model the model parameters are estimated by statistical computation over the speech unit parameters. This is a mature technique widely used in the art and is not repeated here.
At step S120, for the context combinations with their large numbers of variations, the acoustic model is usually processed with decision-tree clustering to generate an acoustic model with a decision tree, for example a CART (Classification and Regression Tree). Clustering effectively avoids data sparseness and reduces the number of models. At the same time, clustering is also the process by which new cases encountered during synthesis are handled, so that an optimal match can be achieved. The clustering process refers to a predefined question set. A question set is the set of questions available for constructing the decision tree; the question chosen when a node is split is bound to that node and determines which units enter the same leaf node. The question set can differ according to the concrete application environment. For example, Chinese has 5 tone classes {1, 2, 3, 4, 5}, and each class can serve as a question of the decision tree. For determining the tone of a polyphone, the question set can be set up as shown in Table 1:
Table 1: Questions and values used in the question set (table image in the original; its content corresponds to the QS entries below).
The corresponding code is as follows:
QS "phntone==1" {"*|phntone=1|*"}    Is the tone of class 1?
QS "phntone==2" {"*|phntone=2|*"}    Is the tone of class 2?
QS "phntone==3" {"*|phntone=3|*"}    Is the tone of class 3?
QS "phntone==4" {"*|phntone=4|*"}    Is the tone of class 4?
QS "phntone==5" {"*|phntone=5|*"}    Is the tone of class 5?
To those skilled in the art, the use of decision trees is a common technique in this field; various decision trees can be adopted for various application environments, various question sets can be set up, and the decision tree is built by splitting on these questions. This is not repeated here.
In the embodiments of the invention, training and clustering the training data yields the hidden Markov (HMM) models and the corresponding decision tree. However, those skilled in the art should appreciate that other types of acoustic model can also be applied in the fuzzy processing of the embodiments of the invention.
In the embodiments of the invention, the speech unit can be a phoneme, a syllable, an initial/final, or another unit; for simplicity, only the initial/final is illustrated here as the speech unit. However, those skilled in the art should appreciate that the embodiments of the invention are not limited to this.
In the embodiments of the invention, the acoustic model is also trained again based on fuzzy data. For example, at step S140, the fuzzy data in the speech corpus are determined with respect to the above acoustic model (hidden Markov model) with a decision tree. In the embodiments of the invention, all possible labels of the context related to a polyphone can be adopted, the ability of each label to characterize the real data is evaluated on the real data, and whether the speech data belong to the fuzzy data is determined from this evaluation result. Afterwards, at step S160, fuzzy contextual feature labels are generated for the qualifying fuzzy data. Then, at step S180, a fuzzy decision tree is trained on the speech corpus containing the fuzzy data, based on these fuzzy contextual feature labels, to generate an acoustic model with a fuzzy decision tree.
Fig. 2 shows a flowchart of the processing for determining fuzzy data in the method according to an embodiment of the invention. As shown in Fig. 2, at step S210, all possible contextual feature labels of the speech data in the training corpus are generated. "All possible context labels" means that, for an attribute that is to be given fuzzy polyphone processing, such as tone, all possibilities are generated. In the embodiments of the invention, no attention is paid to whether a possibility conforms to linguistic norms; all possibilities are generated. For example, for a polyphone pronounced "wei", whose pronunciation is in theory wei4 or wei2, labels are nevertheless generated for all tones: wei1, wei2, wei3, wei4 and wei5. The contextual feature label characterizes the linguistic and phonetic attributes of the speech segment, for example the actual initial/final of the speech unit, the tone, the syllable, the position within the syllable, word, phrase and sentence, information about the preceding and following units, the type of sentence, and so on. Tone is a key feature of polyphones. Taking tone as the example, Mandarin has 5 tones, so this training data item can have 5 parallel contextual feature labels. Those skilled in the art should appreciate that possible contextual feature labels can also be generated for the different pronunciations of a polyphone, handled similarly to tone. (A sketch of this exhaustive generation follows.)
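As a toy illustration of this exhaustive generation (the label syntax "base|tone=k|" is an assumed placeholder, not the patent's actual label format):

# Toy sketch: generate all parallel contextual feature labels for one
# polyphone, enumerating every tone regardless of linguistic validity.
TONES = [1, 2, 3, 4, 5]   # the five Mandarin tone classes

def parallel_labels(base_pinyin):
    return ["%s|tone=%d|" % (base_pinyin, t) for t in TONES]

print(parallel_labels("wei"))
# -> ['wei|tone=1|', 'wei|tone=2|', 'wei|tone=3|', 'wei|tone=4|', 'wei|tone=5|']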
At step S220, the training data are evaluated based on the acoustic model trained at step S120 (for example the HMM model with a decision tree). For example, for a speech unit that has N parallel contextual feature labels, the N corresponding scores s[1], ..., s[k], ..., s[N] can be computed in turn; these scores reflect the ability of each label to characterize the actual parameters. In the embodiments of the invention, any method that can quantify the evaluation can be adopted, for example computing the posterior probability under the model, or the distance between model-generated parameters and the actual parameters; these are described in detail below.
At step S230, whether the speech unit is fuzzy data is judged based on the evaluation result, for example the computed scores reflecting the characterization power. In the embodiments of the invention, data with low evaluation scores can be determined to be fuzzy data and used for further training. Here, a low evaluation score means that among the parallel contextual feature labels no score has a sufficient advantage to prove that its label is truly the unique optimum for this unit.
In the embodiments of the invention, the degree to which the score corresponding to a contextual feature label of the speech unit falls into its class can also be computed by a membership function. The membership function $m_k$ can represent these parallel scores as follows:

$$m_k = \frac{s[k]}{\sum_{k=1}^{N} s[k]} \qquad (1)$$

where $s[k]$ is the score corresponding to the $k$-th contextual feature label, and $N$ is the number of contextual feature labels.
In the embodiments of the invention, data satisfying the fuzziness condition (usually a fuzziness threshold defined on the membership function) are fuzzy data. The fuzzy threshold can be fixed; for example, if no candidate among all candidates holds more than 50% of the total score, the data can be considered fuzzy data. Alternatively, the fuzzy threshold can be dynamic; for example, a certain bottom fraction (say 10%) of the score ranking within the class the current unit belongs to in the current database can be chosen. (A sketch of both rules follows.)
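A minimal sketch of the membership computation of equation (1) and of the two threshold rules just described; the 50% and 10% figures follow the text above, and everything else is an illustrative assumption:

# Sketch of equation (1) and the fixed and dynamic fuzziness tests.
def membership(scores):
    total = float(sum(scores))
    return [s / total for s in scores]

def is_fuzzy_fixed(scores, threshold=0.5):
    # Fixed rule: fuzzy if no candidate holds more than `threshold`
    # of the total score mass.
    return max(membership(scores)) <= threshold

def is_fuzzy_dynamic(unit_score, class_scores, fraction=0.1):
    # Dynamic rule: fuzzy if the unit's score ranks in the bottom
    # `fraction` of all units of the same class in the corpus.
    cutoff = sorted(class_scores)[int(len(class_scores) * fraction)]
    return unit_score <= cutoff

print(is_fuzzy_fixed([0.9, 2.1, 0.5, 1.0, 1.0]))   # True: best holds ~38%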
In the embodiments of the invention, selecting the fuzzy data from the training database and converting the whole training is advantageous: the process not only generates the data used for fuzzy decision tree training, but also contributes to improving the training accuracy for the normal data, without significantly increasing the training burden.
Fig. 3 shows the operation of evaluating training data by the model posterior probability in the method according to an embodiment of the invention. In the embodiments of the invention, for simplicity, the training data are exemplified by a certain speech unit. As shown in Fig. 3, for the N possible contextual feature labels of this speech unit (16a-1 label 1, ..., 16a-k label k, ..., 16a-N label N), the corresponding acoustic models (21a-1 model 1, ..., 21a-k model k, ..., 21a-N model N) can be found on the model trained at step S120 (for example the HMM model with a decision tree). In the embodiments of the invention, the following evaluation of training data is explained with the HMM acoustic model as an example. However, it should be appreciated that the embodiments of the invention are not limited to this.
For a given speech unit, its speech parameter vector sequence is represented as:

$$O = [o_1^T, o_2^T, \ldots, o_T^T]^T \qquad (2)$$
The posterior probability of the speech parameter vector sequence of this speech unit given the HMM $\lambda$ is expressed as:

$$P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) \qquad (3)$$

where $Q$ is an HMM state sequence $\{q_1, q_2, \ldots, q_T\}$.
Each frame of the speech unit is aligned with the model states to obtain the state indices. The following probability can then be computed:

$$P(o_t, q_i \mid \lambda) = \sum_{j=1}^{N} b_j(o_t) \qquad (4)$$

where $b_j(o_t)$ is the output probability of the observation $o_t$ at time $t$ in the $j$-th state of the current model; its Gaussian distribution probability and the like depend on the type of HMM, for example a continuous mixture-density HMM.
$$b_j(o_t) = \sum_{m=1}^{M} \omega_{jm}\, b_{jm}(o_t) = \sum_{m=1}^{M} \frac{\omega_{jm}}{(2\pi)^{p/2} |\Sigma_{jm}|^{1/2}} \exp\left\{ -\frac{1}{2} (o_t - \mu_{jm})\, \Sigma_{jm}^{-1}\, (o_t - \mu_{jm})^{T} \right\} \qquad (5)$$

where $\omega_{jm}$ is the weight of the $m$-th mixture component of state $j$, $\mu_{jm}$ and $\Sigma_{jm}$ are its mean and covariance, and $p$ is the dimensionality of the observation vector.
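Equation (5) is the standard Gaussian-mixture state output density. The following numpy sketch computes it under the simplifying assumption of diagonal covariances (an assumption of this illustration, not of the patent):

import numpy as np

def gmm_output_prob(o_t, weights, means, variances):
    # b_j(o_t) = sum_m w_jm * N(o_t; mu_jm, Sigma_jm), equation (5),
    # with diagonal covariance matrices for simplicity.
    o_t = np.asarray(o_t, dtype=float)
    p = o_t.size
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        norm = 1.0 / ((2 * np.pi) ** (p / 2.0) * np.sqrt(np.prod(var)))
        total += w * norm * np.exp(-0.5 * np.sum((o_t - mu) ** 2 / var))
    return total

# Two-component mixture over 2-dimensional observations:
print(gmm_output_prob([0.1, -0.2], weights=[0.6, 0.4],
                      means=[[0.0, 0.0], [1.0, 1.0]],
                      variances=[[1.0, 1.0], [0.5, 0.5]]))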
Alternatively, in the embodiments of the invention, the training data can also be evaluated through the distance between the model-generated parameters and the actual parameters. Fig. 4 shows this operation in the method according to an embodiment of the invention. As shown in Fig. 4, again taking a certain speech unit as the example, and similarly to the above embodiment, there are all the possible contextual feature labels (16b-1 label 1, ..., 16b-k label k, ..., 16b-N label N), and their corresponding models (21a-1 model 1, ..., 21a-k model k, ..., 21a-N model N) are determined. At the same time, speech parameters (25b-1 parameter 1, ..., 25b-k parameter k, ..., 25b-N parameter N; these are the test parameters) are recovered from each set of model parameters. The scores of these possible contextual feature labels are evaluated by computing the distance between the speech parameters of this unit (the reference parameters) and the recovered parameters.
As stated above, for a given speech unit, its speech parameter vector sequence $O$ is expressed as:

$$O = [o_1^T, o_2^T, \ldots, o_T^T]^T$$

The recovered speech parameters can be expressed as:

$$O' = [{o'_1}^T, {o'_2}^T, \ldots, {o'_{T'}}^T]^T \qquad (6)$$
The actual parameter sequence of the given speech unit has length $T$, while the recovered speech parameter sequence has length $T'$. A linear mapping is first performed between $T$ and $T'$; usually the recovered sequence of length $T'$ is stretched or compressed to length $T$. The Euclidean distance between the two is then computed as follows:

$$D(O, O') = \sqrt{\sum_{t=1}^{T} \sum_{m=1}^{M} (o_{mt} - o'_{mt})^2} \qquad (7)$$
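A minimal numpy sketch of equations (6) and (7), with the recovered sequence linearly resampled to length T as described above (the interpolation scheme is an assumption of this illustration):

import numpy as np

def aligned_distance(O_ref, O_rec):
    # O_ref: reference parameters, shape (T, M); O_rec: recovered
    # parameters, shape (T', M). Resample O_rec to length T, then
    # apply the Euclidean distance of equation (7).
    O_ref, O_rec = np.asarray(O_ref, float), np.asarray(O_rec, float)
    T, M = O_ref.shape
    src = np.linspace(0.0, O_rec.shape[0] - 1.0, T)
    O_map = np.stack([np.interp(src, np.arange(O_rec.shape[0]), O_rec[:, m])
                      for m in range(M)], axis=1)
    return np.sqrt(np.sum((O_ref - O_map) ** 2))

ref = np.random.randn(10, 3)   # T = 10 frames, M = 3 dimensions
rec = np.random.randn(8, 3)    # T' = 8 frames
print(aligned_distance(ref, rec))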
In the embodiments of the invention, the fuzzy context label can be generated through a quantizing-mapping conversion. The fuzzy context label characterizes the linguistic and acoustic features of the current speech unit, and gives a graded fuzzy definition of the relevant attributes of the polyphone that is to be given fuzzy processing. The quantized score of each label of the speech unit can be converted into a corresponding context degree (for example high, low, and so on), and these are jointly represented to generate the fuzzy context label. Note that in the embodiments of the invention the fuzzy context label is generated by objective computation and is not restricted by linguistics; for example, wei3, or a combination of tones 1 and 5 of "wei", can be obtained by computation. The fuzzy context label generated by the operation on a certain speech unit with 5 tones is illustrated below.
As shown in Fig. 5, suppose the candidate tone of this unit is tone 2, expressed here as tone=2. For each possible contextual feature label (corresponding to tone = (1, 2, 3, 4, 5)), the membership function described above computes the value of the degree to which it falls into that class. Each membership function value is then normalized and quantized to a value between 0 and 1, such as (0.05, 0.45, 0.1, 0.2, 0.2), and its context degree is determined, for example high, middle or low. Each contextual feature label is then jointly represented as the fuzzy contextual feature label.
In the embodiments of the invention, a threshold can be set, for example threshold=0.2; then only the pronunciation candidates satisfying this baseline requirement, for example tones 2, 4 and 5, are considered when the fuzzy contextual feature label is generated. The fuzzy context label is generated according to the corresponding degree distribution of the tones, for example tone=High2_Low4_Low5.
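The conversion from normalized membership values to the joint fuzzy label can be sketched as follows; the degree bands and the exact label syntax are assumptions extrapolated from the tone=High2_Low4_Low5 example above:

# Sketch: quantize normalized membership values into degree bands and
# join the surviving candidates into one fuzzy context label.
def degree(m):
    # Assumed band boundaries for High / Mid / Low.
    return "High" if m >= 0.4 else ("Mid" if m >= 0.3 else "Low")

def fuzzy_tone_label(memberships, threshold=0.2):
    # memberships: dict mapping tone -> normalized membership in [0, 1]
    kept = [(t, m) for t, m in sorted(memberships.items()) if m >= threshold]
    return "tone=" + "_".join("%s%d" % (degree(m), t) for t, m in kept)

print(fuzzy_tone_label({1: 0.05, 2: 0.45, 3: 0.1, 4: 0.2, 5: 0.2}))
# -> tone=High2_Low4_Low5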
Those skilled in the art should appreciate that the fuzzy contextual feature label can be generated in many ways; for example, the scores of similar segments across the whole training corpus can be collected statistically, and the quantized fuzzy context then obtained from the histogram of the distribution proportions. It should be noted that the embodiments of the invention are only illustrative, and the way of generating the fuzzy contextual feature label in the embodiments of the invention is not limited to the above.
In the embodiments of the invention, generating fuzzy contextual feature labels preserves the diversity characteristic of fuzzification, which avoids making rigid classifications among the uncertain attribute classes caused by bad data.
In the embodiments of the invention, after the fuzzy contextual feature labels are generated for the fuzzy data, fuzzy decision tree training can be carried out, and the model parameters of the acoustic model are updated during this decision tree training. Here, tone determination is still taken as the example, yet those skilled in the art will understand that the method is equally applicable to determining the candidate pronunciation of a polyphone with different pronunciations. The above instance is still used for brief explanation. As shown in Table 2, the corresponding fuzzy question set can be set as:
Table 2: Questions and values used in the fuzzy question set (table image in the original).
The questions illustrated above can cover the various cases of combined tone classes, and each case can be queried. The combinations of these cases can come from linguistic knowledge, or from the practical combinations that occur during training, and so on. (Illustrative examples of such fuzzy questions follow.)
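By analogy with the tone questions shown for Table 1, the fuzzy questions of Table 2 might be written in the same QS format. The following two entries are purely illustrative assumptions, not the patent's actual question set:

QS "phntone==High2_Low4_Low5" {"*|phntone=High2_Low4_Low5|*"}    Is the fuzzy tone class High2_Low4_Low5?
QS "phntone==High2" {"*|phntone=High2_*|*"}    Does tone 2 have degree High among the candidates?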
In the embodiments of the invention, multiple clustering schemes can be adopted, for example clustering the whole training corpus again, or clustering a secondary training corpus consisting only of the fuzzy data. When the whole training corpus is clustered again, if a training data item in the corpus is fuzzy data, its label is replaced by the fuzzy contextual feature label generated as above, and the corresponding fuzzy question set is added to the question set.
In the embodiments of the invention, when the secondary training corpus is clustered, only the fuzzy context labels and the fuzzy question set are used for training, based on the acoustic model and decision tree already trained.
Clustering as described above yields the acoustic model with the fuzzy decision tree.
In the embodiments of the invention, the acoustic model with a fuzzy decision tree obtained by training on real speech improves the quality of speech synthesis, making the fuzzy processing more reasonable, flexible and intelligent, while also letting the conventional speech be trained more accurately.
Fig. 6 illustrates the method of synthesizing speech according to an embodiment of the invention. This method for speech synthesis can comprise: determining that data generated by text analysis are fuzzy polyphone data; performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities; generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; determining model parameters for the fuzzy contextual feature label based on an acoustic model determined to have a fuzzy decision tree; generating speech parameters from the model parameters; and synthesizing the speech parameters into speech.
As shown in Fig. 6, at step S610, the data generated by text analysis are determined to be fuzzy polyphone data. In the embodiments of the invention, the text analyzer performs word segmentation on the text to be synthesized, decomposing it into words carrying attribute labels and their pronunciation symbols, and then, according to semantic rules and phonetic rules, determines the sentence structure and intonation for each word and each syllable, as well as prosodic features of the target speech such as pauses. Multi-character words and single characters can be obtained from the segmentation result. Multi-character words can generally have their pronunciation determined from the dictionary even when they contain polyphonic characters, so such polyphones are not the fuzzy polyphone data of the invention. The polyphones in the embodiments of the invention generally refer to single characters that still have several pronunciations after segmentation. When speech prediction is carried out on such a polyphone, a prediction result is produced for each candidate pronunciation, describing the probability each pronunciation of the polyphone has in the concrete utterance. There are many ways to judge whether the polyphone is fuzzy polyphone data; for example, a threshold can be set, and a polyphone satisfying it is fuzzy polyphone data. For instance, if no candidate among all candidates has a probability above 70%, the polyphone can be considered fuzzy polyphone data, as in the sketch below. The principle for determining fuzzy polyphone data is similar to the principle for determining fuzzy data in the training stage and is not repeated here.
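The synthesis-time test can be sketched like this; the 70% figure follows the example above, and the rest is an illustrative assumption:

# Sketch of the synthesis-stage decision: a polyphone with no dominant
# predicted pronunciation is treated as fuzzy polyphone data.
def classify_polyphone(candidates, threshold=0.7):
    # candidates: list of (pronunciation, probability) from the predictor
    best = max(candidates, key=lambda c: c[1])
    if best[1] > threshold:
        return ("definite", best[0])
    return ("fuzzy", candidates)

print(classify_polyphone([("wei4", 0.55), ("wei2", 0.45)]))
# -> ('fuzzy', [('wei4', 0.55), ('wei2', 0.45)])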
Afterwards, at step S620, fuzzy polyphone prediction is performed on the fuzzy polyphone data, to output the plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities. In the embodiments of the invention, for non-fuzzy polyphone data the pronunciation can be determined with high confidence, so fuzzy processing is unnecessary; conventional polyphone prediction processing is carried out to output the single determined candidate pronunciation. If the polyphone is fuzzy polyphone data, fuzzy processing is carried out, and a plurality of candidate pronunciations and their corresponding probabilities are output.
Next, at step S630, the fuzzy contextual feature label is generated based on the candidate pronunciations and their probabilities. In the embodiments of the invention, the execution of this step is similar to step S160 of the training process for generating the fuzzy contextual feature label; it can be realized through the quantizing-mapping conversion or in other ways, and is not repeated here.
At step S640, the corresponding model parameters for the fuzzy contextual feature label are determined based on the acoustic model with the fuzzy decision tree. In the embodiments of the invention, for the HMM acoustic model, the corresponding model parameters are the distributions of the components under the states of the HMM model.
At step S650, speech parameters are generated from the model parameters. Parameter generation algorithms commonly used in the art can be adopted, for example a parameter generation algorithm under the maximum-likelihood criterion; this is not repeated here.
Finally, at step S660, the speech parameters are synthesized into speech.
In the embodiments of the invention, speech is synthesized by applying fuzzy processing to the pronunciation of fuzzy polyphone data, so that in different contexts the pronunciation can vary in different ways, improving the quality of speech synthesis.
Under the same inventive concept, Fig. 7 is a block diagram of the equipment for synthesizing speech according to an embodiment of the invention. The present embodiment is described below with reference to this figure. For those parts identical to the preceding embodiments, their explanation is omitted as appropriate.
The equipment 700 for synthesizing speech can comprise: a polyphone prediction unit 703 for performing fuzzy prediction on fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their prediction probabilities; a fuzzy contextual feature label generation unit 704 for generating a fuzzy contextual feature label based on the candidate pronunciations and their probabilities; a determination unit 705 for determining model parameters for the fuzzy contextual feature label based on an acoustic model determined to have a fuzzy decision tree; a parameter generator 706 for generating speech parameters from the model parameters; and a synthesizer 707 for synthesizing the speech parameters into speech.
The equipment 700 for synthesizing speech of the invention can realize the above-described method for synthesizing speech; for its concrete operation, please refer to the content above, which is not repeated here.
In the embodiments of the invention, the equipment 700 can also comprise a text analyzer 702 for decomposing the text to be synthesized into words carrying attribute labels and their pronunciation symbols. Optionally, the equipment 700 can also comprise an input/output unit 701 for inputting the text to be synthesized and outputting the synthesized speech. Optionally, in the embodiments of the invention, a symbol stream on which text analysis has already been performed can also be input directly from outside. Therefore, as shown in Fig. 7, the text analyzer 702 and the input/output unit 701 are shown with broken lines.
In the embodiments of the invention, the equipment 700 for synthesizing speech and its parts can, in operation, realize the method for synthesizing speech of the embodiments described above, or its steps.
The equipment 700 for synthesizing speech in the present embodiment and each of its components can be built from special-purpose circuits or chips, or can be realized by a computer (processor) executing corresponding programs.
Those of ordinary skill in the art will appreciate that the above method and equipment can be realized using computer-executable instructions and/or processor control code, provided for example on a carrier medium such as a disk, CD or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The method and equipment of the present embodiment can also be realized by hardware circuits such as very-large-scale integrated circuits or gate arrays, by semiconductors such as logic chips and transistors, by programmable hardware devices such as field-programmable gate arrays and programmable logic devices, or by a combination of the above hardware circuits and software such as firmware.
Although the method for training an acoustic model and the method and equipment for synthesizing speech of the invention have been described above in detail with reference to specific embodiments, the invention is not limited to them. Those of ordinary skill in the art will understand that various transformations, substitutions and modifications can be made to the invention without departing from its spirit and scope; the scope of protection of the invention is defined by the appended claims.

Claims (10)

1. A method for speech synthesis, comprising:
determining that data generated by text analysis are fuzzy polyphone data;
performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities;
generating a fuzzy contextual feature label based on the plurality of candidate pronunciations and their probabilities;
determining model parameters for the fuzzy contextual feature label based on an acoustic model with a fuzzy decision tree;
generating speech parameters from the model parameters; and
synthesizing the speech parameters into speech.
2. The method of claim 1, wherein generating the fuzzy contextual feature label further comprises:
determining, based on the probabilities, the degree to which the context label of each candidate pronunciation of the fuzzy polyphone data falls into its class; and
generating the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
3. Equipment for synthesizing speech, comprising:
a polyphone prediction unit for fuzzily predicting the pronunciation of fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their prediction probabilities;
a fuzzy contextual feature label generation unit for generating a fuzzy contextual feature label based on the plurality of candidate pronunciations and their probabilities;
a determination unit for determining model parameters for the fuzzy contextual feature label based on an acoustic model with a fuzzy decision tree;
a parameter generator for generating speech parameters from the model parameters; and
a synthesizer for synthesizing the speech parameters into speech.
4. The equipment of claim 3, wherein the fuzzy contextual feature label generation unit is further configured to:
determine, based on the probabilities, the degree to which the context label of each candidate pronunciation of the fuzzy polyphone data falls into its class; and
generate the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
5. A system for synthesizing speech, comprising:
means for determining that data generated by text analysis are fuzzy polyphone data;
means for performing fuzzy polyphone prediction on the fuzzy polyphone data, to output a plurality of candidate pronunciations of the fuzzy polyphone data and their probabilities;
means for generating a fuzzy contextual feature label based on the plurality of candidate pronunciations and their probabilities;
means for determining model parameters for the fuzzy contextual feature label based on an acoustic model with a fuzzy decision tree;
means for generating speech parameters from the model parameters; and
means for synthesizing the speech parameters into speech.
6. A method for training an acoustic model, comprising:
training each speech unit in a training speech corpus to generate an acoustic model, the speech units comprising acoustic parameters and context labels;
performing decision tree clustering on the context combinations, to generate an acoustic model with a decision tree;
determining the fuzzy data in the speech corpus based on the acoustic model with the decision tree;
generating fuzzy contextual feature labels for the fuzzy data; and
performing clustering training on the speech corpus based on the fuzzy contextual feature labels, to generate an acoustic model with a fuzzy decision tree.
7. The method of claim 6, wherein determining the fuzzy data further comprises:
evaluating a speech unit;
determining the degree to which each candidate context label of the speech unit falls into its class; and
if the degree satisfies a predetermined threshold, determining that the speech unit is fuzzy data.
8. The method of claim 7, wherein evaluating the speech unit further comprises:
evaluating the score of the contextual feature label of each candidate pronunciation of the speech unit by means of the model posterior probability, or of the distance between the model-generated parameters and the speech unit parameters.
9. The method of claim 6, wherein generating the fuzzy contextual feature labels further comprises:
determining the score of the contextual feature label of each candidate pronunciation of the speech unit by evaluating the speech unit;
determining, based on the scores, the degree to which each candidate context label of the speech unit falls into its class; and
generating the fuzzy contextual feature label by quantizing and converting the degrees, wherein the fuzzy contextual feature label is a joint representation of the context labels of the candidate pronunciations.
10. The method of claim 6, wherein performing the clustering training based on the fuzzy contextual feature labels further comprises one of the following:
training the training set comprising the fuzzy data, based on the fuzzy contextual feature labels and a preset fuzzy question set, to generate the acoustic model with the fuzzy decision tree; and
training each speech unit in the speech corpus again based on a question set and contextual feature labels, wherein the question set additionally comprises the preset fuzzy question set, and the contextual feature labels of the fuzzy data in the speech corpus are the fuzzy contextual feature labels.
CN2011100465804A 2011-02-25 2011-02-25 Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis Pending CN102651217A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2011100465804A CN102651217A (en) 2011-02-25 2011-02-25 Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis
US13/402,602 US9058811B2 (en) 2011-02-25 2012-02-22 Speech synthesis with fuzzy heteronym prediction using decision trees


Publications (1)

Publication Number Publication Date
CN102651217A true CN102651217A (en) 2012-08-29

Family

ID=46693212


Country Status (2)

Country Link
US (1) US9058811B2 (en)
CN (1) CN102651217A (en)

Cited By (95)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578467A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Acoustic model building method, speech recognition method and electronic device thereof
CN103854643A (en) * 2012-11-29 2014-06-11 株式会社东芝 Method and apparatus for speech synthesis
CN103902600A (en) * 2012-12-27 2014-07-02 富士通株式会社 Keywords list forming device and method and electronic equipment
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN104200803A (en) * 2014-09-16 2014-12-10 北京开元智信通软件有限公司 Voice broadcasting method, device and system
CN104464731A (en) * 2013-09-20 2015-03-25 株式会社东芝 Data collection device, method, voice talking device and method
CN104599670A (en) * 2015-01-30 2015-05-06 成都星炫科技有限公司 Voice recognition method of touch and talk pen
CN104867491A (en) * 2015-06-17 2015-08-26 百度在线网络技术(北京)有限公司 Training method and device for prosody model used for speech synthesis
CN105225657A (en) * 2015-10-22 2016-01-06 百度在线网络技术(北京)有限公司 Polyphone mark template generation method and device
CN105304081A (en) * 2015-11-09 2016-02-03 上海语知义信息技术有限公司 Smart household voice broadcasting system and voice broadcasting method
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
CN105340004A (en) * 2013-06-28 2016-02-17 谷歌公司 Computer-implemented method, computer-readable medium and system for pronunciation learning
CN105702248A (en) * 2014-12-09 2016-06-22 苹果公司 Disambiguating heteronyms in speech synthesis
CN105931635A (en) * 2016-03-31 2016-09-07 北京奇艺世纪科技有限公司 Audio segmentation method and device
CN108305612A (en) * 2017-11-21 2018-07-20 腾讯科技(深圳)有限公司 Text-processing, model training method, device, storage medium and computer equipment
CN108346423A (en) * 2017-01-23 2018-07-31 北京搜狗科技发展有限公司 The treating method and apparatus of phonetic synthesis model
CN108364639A (en) * 2013-08-23 2018-08-03 株式会社东芝 Speech processing system and method
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Method, system, device and storage medium for optimizing speech recognition acoustic model
CN109996149A (en) * 2017-12-29 2019-07-09 深圳市赛菲姆科技有限公司 A kind of parking lot Intelligent voice broadcasting system
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
CN111681641A (en) * 2020-05-26 2020-09-18 微软技术许可有限责任公司 Phrase-based end-to-end text-to-speech (TTS) synthesis
CN111968676A (en) * 2020-08-18 2020-11-20 北京字节跳动网络技术有限公司 Pronunciation correction method and device, electronic equipment and storage medium
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
CN114360494A (en) * 2021-12-29 2022-04-15 广州酷狗计算机科技有限公司 Prosody labeling method and device, computer equipment and storage medium
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
CN115440205A (en) * 2021-06-04 2022-12-06 中国移动通信集团浙江有限公司 Speech processing method, device, terminal and program product
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US12010262B2 (en) 2013-08-06 2024-06-11 Apple Inc. Auto-activating smart responses based on activities from remote devices
US12014118B2 (en) 2017-05-15 2024-06-18 Apple Inc. Multi-modal interfaces having selection disambiguation and text modification capability
US12051413B2 (en) 2015-09-30 2024-07-30 Apple Inc. Intelligent device identification
US12067985B2 (en) 2018-06-01 2024-08-20 Apple Inc. Virtual assistant operations in multi-device environments
US12073147B2 (en) 2013-06-09 2024-08-27 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US12087308B2 (en) 2010-01-18 2024-09-10 Apple Inc. Intelligent automated assistant
US12223282B2 (en) 2016-06-09 2025-02-11 Apple Inc. Intelligent automated assistant in a home environment
US12254887B2 (en) 2017-05-16 2025-03-18 Apple Inc. Far-field extension of digital assistant services for providing a notification of an event to a user
US12277954B2 (en) 2024-04-16 2025-04-15 Apple Inc. Voice trigger for a digital assistant

Families Citing this family (133)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
WO2010067118A1 (en) 2008-12-11 2010-06-17 Novauris Technologies Limited Speech recognition involving a mobile device
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US8706472B2 (en) * 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US8994660B2 (en) 2011-08-29 2015-03-31 Apple Inc. Text correction processing
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
CN102982019B (en) * 2012-11-26 2019-01-15 百度国际科技(深圳)有限公司 Method for phonetically annotating an input-method corpus, and method and electronic device for generating an evaluation and test corpus
US9396723B2 (en) 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
CN103971677B (en) * 2013-02-01 2015-08-12 腾讯科技(深圳)有限公司 Acoustic language model training method and device
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
WO2014144579A1 (en) 2013-03-15 2014-09-18 Apple Inc. System and method for updating an adaptive speech recognition model
WO2014144949A2 (en) 2013-03-15 2014-09-18 Apple Inc. Training an at least partial voice command system
US20140351196A1 (en) * 2013-05-21 2014-11-27 Sas Institute Inc. Methods and systems for using clustering for splitting tree nodes in classification decision trees
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
WO2014197336A1 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
WO2014197335A1 (en) 2013-06-08 2014-12-11 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
KR101809808B1 (en) 2013-06-13 2017-12-15 애플 인크. System and method for emergency calls initiated by voice command
CN105531757B (en) * 2013-09-20 2019-08-06 株式会社东芝 Voice selection assistance device and voice selection method
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
CA2934298C (en) * 2014-01-14 2023-03-07 Interactive Intelligence Group, Inc. System and method for synthesis of speech from provided text
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
KR20160058470A (en) * 2014-11-17 2016-05-25 삼성전자주식회사 Speech synthesis apparatus and control method thereof
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US10152299B2 (en) 2015-03-06 2018-12-11 Apple Inc. Reducing response latency of intelligent automated assistants
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179588B1 (en) 2016-06-09 2019-02-22 Apple Inc. Intelligent automated assistant in a home environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US12197817B2 (en) 2016-06-11 2025-01-14 Apple Inc. Intelligent device arbitration and control
DK179049B1 (en) 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
DK179343B1 (en) 2016-06-11 2018-05-14 Apple Inc Intelligent task discovery
KR102392094B1 (en) 2016-09-06 2022-04-28 딥마인드 테크놀로지스 리미티드 Sequence processing using convolutional neural networks
US11080591B2 (en) 2016-09-06 2021-08-03 Deepmind Technologies Limited Processing sequences using convolutional neural networks
KR102353284B1 (en) 2016-09-06 2022-01-19 딥마인드 테크놀로지스 리미티드 Generate audio using neural networks
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
JP6756916B2 (en) 2016-10-26 2020-09-16 ディープマインド テクノロジーズ リミテッド Processing text sequences using neural networks
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
CN107122179A (en) 2017-03-31 2017-09-01 阿里巴巴集团控股有限公司 Voice function control method and device
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10431203B2 (en) * 2017-09-05 2019-10-01 International Business Machines Corporation Machine training for native language and fluency identification
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
CN110047463B (en) * 2019-01-31 2021-03-02 北京捷通华声科技股份有限公司 Voice synthesis method and device and electronic equipment
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 Speech synthesis method and system
CN115116427B (en) * 2022-06-22 2023-11-14 马上消费金融股份有限公司 Labeling method, voice synthesis method, training method and training device
CN115512696B (en) * 2022-09-20 2024-09-13 中国第一汽车股份有限公司 Simulation training method and vehicle

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6366883B1 (en) * 1996-05-15 2002-04-02 Atr Interpreting Telecommunications Concatenation of speech segments by use of a speech synthesizer
US6081781A (en) * 1996-09-11 2000-06-27 Nippon Telegragh And Telephone Corporation Method and apparatus for speech synthesis and program recorded medium
JP3587048B2 (en) * 1998-03-02 2004-11-10 株式会社日立製作所 Prosody control method and speech synthesizer
ATE298453T1 (en) * 1998-11-13 2005-07-15 Lernout & Hauspie Speechprod SPEECH SYNTHESIS BY CONTACTING SPEECH WAVEFORMS
EP1159733B1 (en) * 1999-03-08 2003-08-13 Siemens Aktiengesellschaft Method and array for determining a representative phoneme
US7657102B2 (en) * 2003-08-27 2010-02-02 Microsoft Corp. System and method for fast on-line learning of transformed hidden Markov models
US7881934B2 (en) * 2003-09-12 2011-02-01 Toyota Infotechnology Center Co., Ltd. Method and system for adjusting the voice prompt of an interactive system based upon the user's state
FR2861491B1 (en) * 2003-10-24 2006-01-06 Thales Sa METHOD FOR SELECTING SYNTHESIS UNITS
US7983910B2 (en) * 2006-03-03 2011-07-19 International Business Machines Corporation Communicating across voice and text channels with emotion preservation
US20080120093A1 (en) * 2006-11-16 2008-05-22 Seiko Epson Corporation System for creating dictionary for speech synthesis, semiconductor integrated circuit device, and method for manufacturing semiconductor integrated circuit device
US20090299731A1 (en) * 2007-03-12 2009-12-03 Mongoose Ventures Limited Aural similarity measuring system for text
GB0704772D0 (en) * 2007-03-12 2007-04-18 Mongoose Ventures Ltd Aural similarity measuring system for text
BRPI0809759A2 (en) * 2007-04-26 2014-10-07 Ford Global Tech Llc Emotive information system, emotive information systems, emotive information driving methods, emotive information systems for a passenger vehicle and computer-implemented method
US8321222B2 (en) * 2007-08-14 2012-11-27 Nuance Communications, Inc. Synthesis by generation and concatenation of multi-form segments
CN101452699A (en) * 2007-12-04 2009-06-10 株式会社东芝 Prosody-adaptive speech synthesis method and apparatus
JP5422754B2 (en) * 2010-01-04 2014-02-19 株式会社東芝 Speech synthesis apparatus and method
WO2012001457A1 (en) * 2010-06-28 2012-01-05 Kabushiki Kaisha Toshiba Method and apparatus for fusing voiced phoneme units in text-to-speech
US9009050B2 (en) * 2010-11-30 2015-04-14 At&T Intellectual Property I, L.P. System and method for cloud-based text-to-speech web services
US8706472B2 (en) * 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6098042A (en) * 1998-01-30 2000-08-01 International Business Machines Corporation Homograph filter for speech synthesis system
CN1836226A (en) * 2003-08-21 2006-09-20 熊锦棠 Method and apparatus for converting characters of non-alphabetic languages
US20060277045A1 (en) * 2005-06-06 2006-12-07 International Business Machines Corporation System and method for word-sense disambiguation by recursive partitioning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
K. TOKUDA ET AL: "AN HMM-BASED SPEECH SYNTHESIS SYSTEM APPLIED TO ENGLISH", 《PROC. OF 2002 IEEE SSW》, 30 September 2002 (2002-09-30) *
LU HENG ET AL: "HETERONYM VERIFICATION FOR MANDARIN SPEECH SYNTHESIS", 《INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING》, 19 December 2008 (2008-12-19) *
ZHANG ZIRONG, CHU MIN: "A STATISTICAL LEARNING METHOD FOR GRAPHEME-TO-PHONEME CONVERSION OF CHINESE POLYPHONIC CHARACTERS", 《JOURNAL OF CHINESE INFORMATION PROCESSING》, vol. 16, no. 3, 31 December 2002 (2002-12-31) *

Cited By (139)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11900936B2 (en) 2008-10-02 2024-02-13 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US12087308B2 (en) 2010-01-18 2024-09-10 Apple Inc. Intelligent automated assistant
US12165635B2 (en) 2010-01-18 2024-12-10 Apple Inc. Intelligent automated assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
CN103854643A (en) * 2012-11-29 2014-06-11 株式会社东芝 Method and apparatus for speech synthesis
CN103854643B (en) * 2012-11-29 2017-03-01 株式会社东芝 Method and apparatus for synthesizing voice
CN103902600A (en) * 2012-12-27 2014-07-02 富士通株式会社 Keyword list forming device and method, and electronic equipment
CN103902600B (en) * 2012-12-27 2017-12-01 富士通株式会社 Keyword list forming apparatus and method, and electronic equipment
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US11557310B2 (en) 2013-02-07 2023-01-17 Apple Inc. Voice trigger for a digital assistant
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US12073147B2 (en) 2013-06-09 2024-08-27 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
CN105340004B (en) * 2013-06-28 2019-09-10 谷歌有限责任公司 Computer-implemented method, computer-readable medium and system for word pronunciation learning
CN105340004A (en) * 2013-06-28 2016-02-17 谷歌公司 Computer-implemented method, computer-readable medium and system for pronunciation learning
US12010262B2 (en) 2013-08-06 2024-06-11 Apple Inc. Auto-activating smart responses based on activities from remote devices
CN108364639A (en) * 2013-08-23 2018-08-03 株式会社东芝 Speech processing system and method
CN104464731A (en) * 2013-09-20 2015-03-25 株式会社东芝 Data collection device and method, and voice interaction device and method
CN103578467A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Acoustic model building method, speech recognition method and electronic device thereof
US10114809B2 (en) 2014-05-07 2018-10-30 Tencent Technology (Shenzhen) Company Limited Method and apparatus for phonetically annotating text
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN104142909B (en) * 2014-05-07 2016-04-27 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
CN104200803A (en) * 2014-09-16 2014-12-10 北京开元智信通软件有限公司 Voice broadcasting method, device and system
CN105702248B (en) * 2014-12-09 2019-11-19 苹果公司 Electronic device, method and storage medium for operating an intelligent automated assistant
CN105702248A (en) * 2014-12-09 2016-06-22 苹果公司 Disambiguating heteronyms in speech synthesis
CN104599670A (en) * 2015-01-30 2015-05-06 成都星炫科技有限公司 Voice recognition method for a touch-and-talk pen
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
US12001933B2 (en) 2015-05-15 2024-06-04 Apple Inc. Virtual assistant in a communication session
US12154016B2 (en) 2015-05-15 2024-11-26 Apple Inc. Virtual assistant in a communication session
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
CN104867491B (en) * 2015-06-17 2017-08-18 百度在线网络技术(北京)有限公司 Prosody model training method and device for speech synthesis
CN104867491A (en) * 2015-06-17 2015-08-26 百度在线网络技术(北京)有限公司 Training method and device for prosody model used for speech synthesis
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11954405B2 (en) 2015-09-08 2024-04-09 Apple Inc. Zero latency digital assistant
US12204932B2 (en) 2015-09-08 2025-01-21 Apple Inc. Distributed personal assistant
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US12051413B2 (en) 2015-09-30 2024-07-30 Apple Inc. Intelligent device identification
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
CN105225657A (en) * 2015-10-22 2016-01-06 百度在线网络技术(北京)有限公司 Polyphone annotation template generation method and device
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
CN105304081A (en) * 2015-11-09 2016-02-03 上海语知义信息技术有限公司 Smart home voice broadcasting system and voice broadcasting method
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
CN105931635A (en) * 2016-03-31 2016-09-07 北京奇艺世纪科技有限公司 Audio segmentation method and device
US12223282B2 (en) 2016-06-09 2025-02-11 Apple Inc. Intelligent automated assistant in a home environment
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
CN108346423A (en) * 2017-01-23 2018-07-31 北京搜狗科技发展有限公司 Processing method and apparatus for a speech synthesis model
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11538469B2 (en) 2017-05-12 2022-12-27 Apple Inc. Low-latency intelligent automated assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US12014118B2 (en) 2017-05-15 2024-06-18 Apple Inc. Multi-modal interfaces having selection disambiguation and text modification capability
US12254887B2 (en) 2017-05-16 2025-03-18 Apple Inc. Far-field extension of digital assistant services for providing a notification of an event to a user
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
CN108305612B (en) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN108305612A (en) * 2017-11-21 2018-07-20 腾讯科技(深圳)有限公司 Text processing and model training methods and devices, storage medium and computer equipment
CN109996149A (en) * 2017-12-29 2019-07-09 深圳市赛菲姆科技有限公司 Intelligent voice broadcasting system for parking lots
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Method, system, device and storage medium for optimizing speech recognition acoustic model
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US12067985B2 (en) 2018-06-01 2024-08-20 Apple Inc. Virtual assistant operations in multi-device environments
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US12080287B2 (en) 2018-06-01 2024-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
CN111681641B (en) * 2020-05-26 2024-02-06 微软技术许可有限责任公司 Phrase-based end-to-end text-to-speech (TTS) synthesis
CN111681641A (en) * 2020-05-26 2020-09-18 微软技术许可有限责任公司 Phrase-based end-to-end text-to-speech (TTS) synthesis
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
CN111968676A (en) * 2020-08-18 2020-11-20 北京字节跳动网络技术有限公司 Pronunciation correction method and device, electronic equipment and storage medium
CN115440205A (en) * 2021-06-04 2022-12-06 中国移动通信集团浙江有限公司 Speech processing method, device, terminal and program product
CN114360494A (en) * 2021-12-29 2022-04-15 广州酷狗计算机科技有限公司 Prosody labeling method and device, computer equipment and storage medium
US12277954B2 (en) 2024-04-16 2025-04-15 Apple Inc. Voice trigger for a digital assistant

Also Published As

Publication number Publication date
US20120221339A1 (en) 2012-08-30
US9058811B2 (en) 2015-06-16

Similar Documents

Publication Publication Date Title
CN102651217A (en) Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis
Qian et al. Contentvec: An improved self-supervised speech representation by disentangling speakers
Kharitonov et al. Text-free prosody-aware generative spoken language modeling
US11837216B2 (en) Speech recognition using unspoken text and speech synthesis
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
Stoller et al. End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model
US10332508B1 (en) Confidence checking for speech processing and query answering
US10388274B1 (en) Confidence checking for speech processing and query answering
Franco et al. Automatic pronunciation scoring for language instruction
Morgan Deep and wide: Multiple layers in automatic speech recognition
EP2815398B1 (en) Audio human interactive proof based on text-to-speech and semantics
CN101828218B (en) Synthesis by generation and concatenation of multi-form segments
CN106297800B (en) A method and device for adaptive speech recognition
WO2022148176A1 (en) Method, device, and computer program product for english pronunciation assessment
CN101551947A (en) Computer system for assisting spoken language learning
CN110459202B (en) Prosody labeling method, device, equipment and medium
Abdou et al. Computer aided pronunciation learning system using speech recognition techniques.
CN110415725A (en) Method and system for assessing second-language pronunciation quality using first-language data
US20020040296A1 (en) Phoneme assigning method
JP6810580B2 (en) Language model learning device and its program
CN102651218A (en) Method and equipment for creating voice tag
Chang et al. Speechprompt: Prompting speech language models for speech processing tasks
Barbany et al. FastVC: Fast Voice Conversion with non-parallel data
Li et al. Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models
Janyoi et al. An Isarn dialect HMM-based text-to-speech system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 2016-11-30

C20 Patent right or utility model deemed to be abandoned or is abandoned