US20100241418A1 - Voice recognition device and voice recognition method, language model generating device and language model generating method, and computer program - Google Patents
- Publication number
- US20100241418A1 (application US 12/661,164)
- Authority
- US
- United States
- Prior art keywords
- intention
- language model
- language
- indicating
- vocabulary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
Definitions
- the present invention relates to a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program for recognizing the content of an utterance of a speaker, and particularly, a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program for estimating an intention of a speaker and grasping a task that a system is made to perform by a speech input.
- the present invention relates to a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program for accurately estimating an intention in the content of an utterance by using a statistical language model, and particularly, a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program for accurately estimating an intention for a focused task based on the content of an utterance.
- a language that human beings use in daily communication, such as Japanese or English, is called a “natural language”.
- Many natural languages originated from spontaneous generation, and have advanced with the histories of civilization, ethnic groups, and societies.
- human beings can communicate with each other through gestures of their bodies and hands, but achieve the most natural and advanced communication with natural language.
- as applications of processing natural language by machine, speech understanding and speech conversation can be exemplified.
- speech understanding or speech recognition is a vital technique for realizing input from a human being to a calculator.
- speech recognition aims at converting the content of an utterance to characters as they are.
- speech understanding aims at more precisely estimating the intention of a speaker and grasping the task that the system is made to perform by speech input without accurately understanding each syllable or each word in the speech.
- speech recognition and speech understanding together are called “speech recognition” for the sake of convenience.
- An input speech from a speaker is taken as an electronic signal through, for example, a microphone, subjected to AD conversion, and is turned into speech data constituted by a digital signal.
- a string X of temporal feature vectors is generated by applying acoustic analysis to the speech data for each short-time frame.
- a string of word models is obtained as a recognition result while referring to an acoustic model database, a lexicon, and a language model database.
- An acoustic model recorded in an acoustic model database is, for example, a hidden Markov model (HMM) for a phoneme of the Japanese language.
- with the acoustic model, a probability p(X|W) that input speech data X corresponds to a word W registered in a lexicon can be obtained as an acoustic score.
- a word sequence probability (N-gram) that describes how likely N words are to form a sequence is recorded.
- an appearance probability p(W) of the word W registered in the lexicon can be obtained as a language score.
- a recognition result can be obtained based on the acoustic score and the language score.
- the descriptive grammar model is a language model that describes the structure of phrases in a sentence according to grammar rules, and is described using context-free grammar in the Backus-Naur Form (BNF), as shown in FIG. 10 , for example.
- the statistical language model is a language model whose probabilities are estimated from learning data (a corpus) with a statistical technique. For example, an N-gram model gives the probability that a word Wi appears after the preceding N−1 words, p(Wi | Wi−N+1, . . . , Wi−1).
- the descriptive grammar model is basically created manually, and recognition accuracy is high if the input speech data conform to the grammar, but recognition fails if the data deviate from the grammar even slightly.
- the statistical language model represented in the N-gram model can be automatically created by subjecting the learning data to a statistical processing, and furthermore, can recognize the input speech data even if the arrangement of words in the input speech data runs slightly counter to the grammar rules.
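As a concrete illustration of how such a statistical language model is estimated from learning data, the following is a minimal sketch of bigram (N=2) probability estimation. The toy corpus, sentences, and function names are invented for illustration and do not appear in the patent:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Estimate maximum-likelihood bigram probabilities p(w_i | w_{i-1})
    by counting adjacent word pairs in the learning data."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]  # sentence boundary markers
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        probs[prev] = {w: c / total for w, c in nexts.items()}
    return probs

# toy learning data (corpus)
corpus = [
    "please show the drama",
    "please show the news",
    "please record the drama",
]
model = train_bigram(corpus)
# "please" is followed by "show" in 2 of 3 sentences, so p(show | please) = 2/3
```

Because the probabilities come from counts rather than hand-written rules, an input whose word order deviates slightly from any single training sentence can still receive a non-zero score, which is the robustness property the text attributes to statistical models.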
- to construct the statistical language model, a large amount of learning data (corpus) is necessary.
- as methods of collecting the corpus, there are general methods such as collecting the corpus from media including books, newspapers, magazines, or the like, and collecting the corpus from texts disclosed on web sites.
- a speech processing device was suggested in which a language model is prepared for each intention (information on wishes) and an intention corresponding to the highest total score is selected as information indicating a wish of uttering based on an acoustic score and a language score (for example, please refer to Japanese Unexamined Patent Application Publication No. 2006-53203).
- the speech processing device uses each statistical language model as a language model for intentions, and recognizes the intentions even when the arrangement of words in input speech data runs slightly counter to grammar rules. However, even when the content of an utterance does not correspond to any intention of a focused task, the device fits any intention to the content by force. For example, when the speech processing device is configured to provide the service of a task relating to a television operation and provided with a plurality of statistical language models in which each intention relating to the television operation is inherent, an intention corresponding to a statistical language model showing a high value of a calculated language score is output as a recognition result even for the content of an utterance that does not intend a television operation. Accordingly, it ends up with the result of extracting an intention different from the intended content of the utterance.
- the inventors of the present invention consider that it is necessary to solve the following two points in order to realize a speech recognition device that accurately estimates an intention relating to a focused task in the content of an utterance.
- a corpus having content that a speaker is likely to utter is simply and appropriately collected for each intention.
- a speech recognition device includes one or more intention extracting language models in which each intention of a focused specific task is inherent, an absorbing language model in which no intention of the task is inherent, a language score calculating section that calculates a language score indicating a linguistic similarity between each of the intention extracting language models and the absorbing language model, and the content of an utterance, and a decoder that estimates an intention in the content of an utterance based on the language score of each of the language models calculated by the language score calculating section.
- the intention extracting language model is a statistical language model obtained by subjecting learning data, which are composed of a plurality of sentences indicating the intention of the task, to a statistical processing.
- a speech recognition device in which the absorbing language model is a statistical language model obtained by subjecting to statistical processing an enormous amount of learning data, which are irrelevant to indicating the intention of the task or are composed of spontaneous utterances.
- a speech recognition device in which the learning data for obtaining the intention extracting language model are composed of sentences which are generated based on a descriptive grammar model indicating a corresponding intention and consistent with the intention.
- a speech recognition method including the steps of firstly calculating a language score indicating a linguistic similarity between one or more intention extracting language models in which each intention of a focused specific task is inherent and the content of an utterance, secondly calculating a language score indicating a linguistic similarity between an absorbing language model in which no intention of the task is inherent and the content of an utterance, and estimating the intention in the content of an utterance based on the language score of each of the language models calculated in the first and second language score calculations.
- a language model generation device including a word meaning database in which a combination of an abstracted vocabulary of a first part-of-speech string and an abstracted vocabulary of a second part-of-speech string and one or more words indicating the same meaning or a similar intention of the abstracted vocabularies are registered, by making abstract the vocabulary candidate of the first part-of-speech string and the vocabulary candidate of the second part-of-speech string that may appear in an utterance indicating an intention, with respect to each intention of a focused specific task, a descriptive grammar model creating unit which creates a descriptive grammar model indicating an intention based on the combination of the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string indicating the intention of the task and one or more words indicating a same meaning or a similar intention for abstract vocabularies registered in the word meaning database, a collecting unit which collect
- first part-of-speech is a noun
- second part-of-speech is a verb
- the language model generation device in which the word meaning database has the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string arranged on a matrix for each string and has a mark indicating the existence of the intention given in a column corresponding to the combination of the vocabulary of the first part-of-speech and the vocabulary of the second part-of-speech having intentions.
- a language model generation method including the steps of creating a grammar model by abstracting the phrases necessary for transmitting each intention included in a focused task, collecting a corpus having content that a speaker is likely to utter for each intention by automatically generating sentences consistent with each intention by using the grammar model, and constructing a plurality of statistical language models corresponding to each intention by performing probabilistic estimation from each corpus with a statistical technique.
- a computer program described in a computer readable format so as to execute a processing for speech recognition on a computer, the program causing the computer to function as one or more intention extracting language models in which each intention of a focused specific task is inherent, an absorbing language model in which no intention of the task is inherent, a language score calculating section that calculates a language score indicating a linguistic similarity between each of the intention extracting language models and the absorbing language model, and the content of an utterance, and a decoder that estimates an intention in the content of an utterance based on the language score of each of the language models calculated by the language score calculating section.
- the computer program according to the above embodiment of the present invention is defined as a computer program that is described in a computer readable format so as to realize a predetermined processing on the computer.
- a cooperative action can be exerted on the computer and the same action and effect as in a speech recognition device according to the first embodiment of the present invention can be obtained.
- a computer program described in a computer readable format so as to execute processing for the generation of a language model on a computer the program causing the computer to function as a word meaning database in which a combination of an abstracted vocabulary of a first part-of-speech string and an abstracted vocabulary of a second part-of-speech string and one or more words indicating the same meaning or a similar intention of the abstracted vocabularies are registered, by making abstract the vocabulary candidate of the first part-of-speech string and the vocabulary candidate of the second part-of-speech string that may appear in an utterance indicating an intention, with respect to each intention of a focused specific task, a descriptive grammar model creating unit which creates a descriptive grammar model indicating an intention based on the combination of the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string indicating the intention of the task and one
- the computer program according to the above embodiment of the present invention is defined as a computer program that is described in a computer readable format so as to realize a predetermined processing on the computer.
- a cooperative action can be exerted on the computer and the same action and effect as in the language model generation device according to the sixth embodiment of the present invention can be obtained.
- according to an embodiment of the present invention, it is possible to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program which are excellent in estimating an intention of a speaker and accurately grasping a task that a system is made to perform by a speech input.
- according to an embodiment of the present invention, it is possible to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program which are excellent in accurately estimating an intention in the content of an utterance by using a statistical language model.
- according to an embodiment of the present invention, it is possible to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program which are excellent in accurately estimating an intention relating to a task focused on in the content of an utterance.
- according to the present invention, it is possible to realize robust intention extraction for the task by providing, in addition to statistical language models in which the intentions included in a focused task are inherent, a statistical language model corresponding to the content of an utterance that is inconsistent with the focused task (such as a spontaneous utterance language model), processing these models in parallel, and ignoring the estimation of an intention for the content of an utterance that is inconsistent with the task.
- a corpus having a content that a speaker is likely to utter can be simply and appropriately collected for an intention by determining the intention included in a focused task in advance and automatically generating sentences consistent with the intention from a descriptive grammar model indicating the intention.
- the content that is likely to be uttered can be grasped without omission by arranging the vocabulary candidates of the noun string and the vocabulary candidates of the verb string that may appear in an utterance on a matrix for each string.
- since one or more words having the same or a similar meaning are registered for the symbols of the vocabulary candidates of each string, it is possible to cover combinations corresponding to various expressions of an utterance having the same meaning and to generate a large number of sentences having the same intention as the learning data.
- the corpus consistent with one focused task can be divided for each intention and can be simply and efficiently collected.
- a group of language models in which one intention of the same task is inherent can be obtained.
- part-of-speech and conjugation information are given to each morpheme to be used during the creation of the statistical language model.
- the collecting unit collects a corpus having a content that a speaker is likely to utter for each intention by automatically generating sentences consistent with each intention from the descriptive grammar model for the intention
- the language model creating unit creates the statistical language model in which an intention is inherent by subjecting the corpus collected for each intention to a statistical processing.
- FIG. 1 is a block diagram schematically illustrating a functional structure of a speech recognition device according to an embodiment of the present invention
- FIG. 2 is a diagram schematically illustrating the minimum necessary structure of phrases for transmitting an intention;
- FIG. 3A is a diagram illustrating a word meaning database in which abstracted noun vocabularies and verb vocabularies are arranged in a matrix form;
- FIG. 3B is a diagram illustrating a state in which words indicating a same meaning or a similar intention are registered for abstracted vocabularies
- FIG. 4 is a diagram for describing a method of creating a descriptive grammar model based on a combination of a noun vocabulary and a verb vocabulary marked in the matrix shown in FIG. 3A ;
- FIG. 5 is a diagram for describing a method of collecting a corpus having a content that a speaker is likely to utter by automatically generating sentences consistent with an intention from the descriptive grammar model for each intention;
- FIG. 6 is a diagram illustrating a flow of data in a technique of constructing a statistical language model from a grammar model
- FIG. 7 is a diagram schematically illustrating a structural example of a language model database constituted with N number of statistical language models 1 to N learned for an intention of a focused task and one absorbing statistical language model;
- FIG. 8 is a diagram illustrating an operative example when a speech recognition device performs meaning estimation for the task “Operate the television”;
- FIG. 9 is a diagram illustrating a structural example of a personal computer provided in an embodiment of the present invention.
- FIG. 10 is a diagram illustrating an example of a descriptive grammar model described with the context-free grammar.
- the present invention relates to a speech recognition technology and has a main characteristic of accurately estimating an intention in content that a speaker utters focusing on a specific task, and thereby resolving the following two points.
- a corpus having content that a speaker is likely to utter is simply and appropriately collected for each intention.
- FIG. 1 schematically illustrates a functional structure of a speech recognition device according to an embodiment of the present invention.
- the speech recognition device 10 in the drawing is provided with a signal processing section 11 , an acoustic score calculating section 12 , a language score calculating section 13 , a lexicon 14 , and a decoder 15 .
- the speech recognition device 10 is configured to accurately estimate an intention of a speaker, rather than to accurately understand all of syllable by syllable and word by word in speech.
- Input speech from a speaker is brought into the signal processing section 11 as electric signals through, for example, a microphone.
- Such analog electric signals undergo AD conversion through sampling and quantization processing to turn into speech data constituted with digital signals.
- the signal processing section 11 generates a series X of temporal feature vectors by applying acoustic analysis to the speech data for each short-time frame.
- the acoustic analysis includes a process of frequency analysis such as the Discrete Fourier Transform (DFT) or the like, and the series X of feature vectors, which has characteristics such as the energy for each frequency band (the so-called power spectrum) obtained from the frequency analysis, is generated.
- a string of word models is obtained as a recognition result while referring to an acoustic model database 16 , the lexicon 14 , and a language model database 17 .
- the acoustic score calculating section 12 calculates an acoustic score indicating an acoustic similarity between an acoustic model including a string of words formed based on the lexicon 14 and input speech signals.
- the acoustic model recorded in the acoustic model database 16 is, for example, a Hidden Markov Model (HMM) for a phoneme of the Japanese language.
- the acoustic score calculating section 12 can obtain a probability p(X|W) that the input speech data X corresponds to a word W registered in the lexicon 14 as an acoustic score, with reference to the acoustic model database 16 .
- the language score calculating section 13 calculates a language score indicating a linguistic similarity between a language model including a string of words formed based on the lexicon 14 and input speech signals.
- in the language model database 17 , the word sequence probability (N-gram) that describes how likely N words are to form a sequence is recorded.
- the language score calculating section 13 can obtain an appearance probability p(W) of the word W registered in the lexicon 14 as a language score with reference to the language model database 17 .
- the decoder 15 obtains a recognition result based on the acoustic score and the language score. Specifically, as shown in Equation (1) below, a posterior probability p(W|X) that the input speech data X corresponds to a word sequence W is expressed by the product of the acoustic score p(X|W) and the language score p(W).
- the decoder 15 can estimate an optimal result with Equation (2) shown below.
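The decoding step of Equations (1) and (2) can be sketched as follows: the decoder picks the hypothesis W maximizing p(X|W)p(W), computed as the sum of log acoustic and log language scores. The hypotheses and score values below are invented for illustration:

```python
import math

def decode(acoustic_scores, language_scores):
    """Return the hypothesis W maximizing p(X|W) * p(W),
    computed as a sum of log scores for numerical stability."""
    best, best_score = None, -math.inf
    for w in acoustic_scores:
        score = math.log(acoustic_scores[w]) + math.log(language_scores[w])
        if score > best_score:
            best, best_score = w, score
    return best

# two competing hypotheses for the same input speech X
acoustic = {"show the drama": 0.20, "show the llama": 0.25}   # p(X|W)
language = {"show the drama": 0.10, "show the llama": 0.001}  # p(W)
result = decode(acoustic, language)
# the language score overturns the slightly better acoustic match
```

This illustrates why the language model matters: an acoustically plausible but linguistically unlikely word sequence loses to a hypothesis that the language model considers probable.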
- a language model that the language score calculating section 13 uses is the statistical language model.
- the statistical language model represented by the N-gram model can be automatically created from learning data and can recognize speech even when the arrangement of words in the input speech data deviates slightly from the grammar rules.
- the speech recognition device 10 according to the present embodiment is assumed to estimate an intention relating to a task focused in the content of an utterance, and for that reason, the language model database 17 is installed with a plurality of statistical language models corresponding to each intention included in a focused task.
- the language model database 17 is installed with a statistical language model corresponding to the content of an utterance inconsistent with a focused task in order to ignore an intention estimation for the content of an utterance inconsistent with the task, which will be described in detail later.
- the present embodiment makes it possible to simply and appropriately collect a corpus having content that a speaker is likely to utter for each intention and to construct statistical language models for each intention, by using a technique of constructing the statistical language models from a grammar model.
- the grammar model is efficiently created by abstracting (or symbolizing) the phrases necessary for transmitting the intention.
- sentences consistent with each intention are automatically generated.
- the plurality of statistical language models corresponding to each intention can be constructed by performing a probability estimation from each corpus with a statistical technique.
- a descriptive grammar model is created for obtaining the corpus.
- the inventors think that a structure of a simple and short sentence that a speaker is likely to utter (or a minimum phrase necessary for transmitting an intention) is composed of a combination of a noun vocabulary and a verb vocabulary, as “PERFORM SOMETHING” (as shown in FIG. 2 ). Therefore, words for each of the noun vocabulary and the verb vocabulary are made to be abstract (or symbolized) in order to efficiently construct the grammar model.
- noun vocabularies indicating a title of a television program such as “Taiga Drama” (a historical drama) or “Waratte ii tomo” (a comedy program) are made abstract as a vocabulary “_Title”.
- verb vocabularies for machines used in watching programs such as a television, or the like, such as “please replay”, “please show”, or “I want to watch” are made to be abstract as the vocabulary “_Play”.
- the utterance having an intention of “please show the program” can be expressed by a combination of symbols for _Title & _Play.
- “_Play the _Title”, or the like are created as the descriptive grammar model for obtaining corpuses. Corpuses such as “Please show the Taiga Drama” (historical drama) or the like can be created from the descriptive grammar model “_Play the _Title”.
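The expansion of an abstracted grammar such as “_Play the _Title” into corpus sentences can be sketched as follows; the `expand` helper and the registered words are hypothetical illustrations, not the patent's implementation:

```python
from itertools import product

def expand(template, vocab):
    """Substitute every registered word for each abstracted symbol
    (tokens starting with "_") in a grammar template."""
    symbols = [tok for tok in template.split() if tok.startswith("_")]
    sentences = []
    for combo in product(*(vocab[s] for s in symbols)):
        sentence = template
        for symbol, word in zip(symbols, combo):
            sentence = sentence.replace(symbol, word, 1)
        sentences.append(sentence)
    return sentences

# hypothetical words registered for the abstracted vocabularies
vocab = {
    "_Play": ["Please show", "Please replay"],
    "_Title": ["Taiga Drama", "weather forecast"],
}
corpus = expand("_Play the _Title", vocab)
# 2 x 2 = 4 sentences, all carrying the same _Title & _Play intention
```

Every generated sentence expresses the same intention, so the whole list can serve directly as learning data for that intention's statistical language model.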
- the descriptive grammar models can be composed of the combination of each of the abstracted noun vocabularies and the verb vocabularies.
- each combination of an abstracted noun vocabulary and an abstracted verb vocabulary may express one intention. Therefore, as shown in FIG. 3A , a matrix is formed by arranging the abstracted noun vocabularies in rows and the abstracted verb vocabularies in columns, and a word meaning database is constructed by putting a mark indicating the existence of an intention in the cell on the matrix corresponding to each combination of an abstracted noun vocabulary and an abstracted verb vocabulary that has an intention.
- a noun vocabulary and a verb vocabulary combined with a mark indicates a descriptive grammar model in which any one intention is included.
- words indicating the same meaning or a similar intention are registered in the word meaning database for the abstracted noun vocabularies divided with the rows in the matrix.
- words indicating a same meaning or a similar intention are registered in the word meaning database for the abstracted verb vocabularies divided with the columns in the matrix.
- the word meaning database can be expanded into a three-dimensional arrangement, not a two-dimensional arrangement as the matrix shown in FIG. 3A .
- each of the combinations of the noun vocabularies and the verb vocabularies given with marks corresponds to a descriptive grammar model indicating an intention.
- the descriptive grammar model described in the form of BNF can be efficiently created, as shown in FIG. 4 .
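Generating one grammar rule per marked cell of the word meaning matrix can be sketched as follows; the vocabularies, marks, and the `<S> ::= ...` rule shape are invented stand-ins for the BNF files of FIG. 4:

```python
# abstracted noun vocabularies (matrix rows) and verb vocabularies (columns)
nouns = ["_Title", "_Channel", "_Volume"]
verbs = ["_Play", "_Select", "_Raise"]

# marks: the noun/verb combinations that express an intention
marks = {("_Title", "_Play"), ("_Channel", "_Select"), ("_Volume", "_Raise")}

def grammars_from_matrix(nouns, verbs, marks):
    """Emit one BNF-style rule per marked combination; each rule plays
    the role of the descriptive grammar model for one intention."""
    rules = []
    for noun in nouns:
        for verb in verbs:
            if (noun, verb) in marks:
                rules.append(f"<S> ::= {verb} the {noun}")
    return rules

rules = grammars_from_matrix(nouns, verbs, marks)
# three marked cells -> three intention grammars
```

Unmarked cells (e.g. “_Raise the _Title”) generate nothing, which is how the matrix keeps meaningless noun/verb combinations out of the grammar models.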
- a group of language models specified to the task can be obtained by registering noun vocabularies and verb vocabularies that may appear when a speaker makes an utterance.
- each of the language models has one intention (or operation) inherent therein.
- corpuses having content that a speaker is likely to utter can be collected for each intention by automatically generating sentences consistent with the intention as shown in FIG. 5 .
- a plurality of statistical language models corresponding to each intention can be constructed by performing a probability estimation from each corpus with a statistical technique.
- a method of constructing the statistical language models from each corpus is not limited to any specific method, and since a known technique can be applied thereto, detailed description thereof will not be mentioned here.
- the “Speech Recognition System” written by Kiyohiro Shikano and Katsunobu Ito mentioned above may be referred, if necessary.
- FIG. 6 illustrates a flow of data in a method of constructing a statistical language model from a grammar model, which has been described hitherto.
- the structure of the word meaning database is as shown in FIG. 3A .
- noun vocabularies relating to a focused task (for example, operation of a television, or the like) are made into each group indicating a same meaning or a similar intention, and the noun vocabularies that are made into each abstracted group are arranged in each row of the matrix.
- verb vocabularies relating to a focused task are made into each group indicating a same meaning or a similar intention, and the verb vocabularies that are made into each abstracted group are arranged in each column of the matrix.
- a plurality of words indicating same meanings or similar intentions is registered for each of the abstracted noun vocabularies and a plurality of words indicating same meanings or similar intentions is registered for each of the abstracted verb vocabularies.
- a mark indicating the existence of an intention is given in a column corresponding to a combination of a noun vocabulary and a verb vocabulary having the intention.
- each of the combinations of noun vocabularies and verb vocabularies matched with marks corresponds to a descriptive grammar model indicating an intention.
- a descriptive grammar model creating unit 61 picks up a combination of an abstracted noun vocabulary and an abstracted verb vocabulary indicating an intention, using a mark on the matrix as a clue, then fits each registered word indicating the same or a similar meaning to each of the abstracted noun vocabularies and abstracted verb vocabularies, and creates a descriptive grammar model in the form of BNF to store the model as a file of the context-free grammar.
- Basic files of the BNF form are automatically created, and then the model will be modified in the form of a BNF file according to the expression of an utterance. In the example shown in FIG.
- the N number of descriptive grammar models from 1 to N are constructed by the descriptive grammar model creating unit 61 based on the word meaning database, and stored as files of the context-free grammar.
- the BNF form is used in defining the context-free grammar, but the spirit of the present invention is not necessarily limited thereto.
- a sentence indicating a specific intention can be obtained by creating a sentence from a created BNF file.
- transcription of a grammar model in the BNF form is a sentence creation rule from a non-terminal symbol (Start) to a terminal symbol (End). Therefore, the collecting unit 62 can automatically generate a plurality of sentences indicating the same intention as shown in FIG. 5 , and can collect corpuses having content that a speaker is likely to utter for each intention, by searching routes from the non-terminal symbol (Start) to the terminal symbol (End) of a descriptive grammar model indicating an intention.
- the group of sentences automatically generated from each of the descriptive grammar models is used as learning data indicating the same intention. In other words, learning data 1 to N collected for each intention by the collecting unit 62 become corpuses for constructing statistical language models.
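The route search from the non-terminal symbol (Start) to the terminal symbol (End) amounts to enumerating every expansion of the grammar. A minimal sketch with illustrative rules (not the patent's own grammar):

```python
from itertools import product

# Illustrative BNF-style rules for a single intention ("switch the channel").
RULES = {
    "<S>": [["<VERB>", "<NOUN>"], ["<NOUN>"]],
    "<VERB>": [["please switch to"], ["change to"]],
    "<NOUN>": [["NHK"], ["channel one"]],
}

def expand(symbol):
    """Enumerate every terminal sentence reachable from a symbol."""
    if symbol not in RULES:  # terminal word or phrase
        return [symbol]
    sentences = []
    for rhs in RULES[symbol]:
        # Combine the expansions of all right-hand-side symbols.
        for combo in product(*(expand(s) for s in rhs)):
            sentences.append(" ".join(combo))
    return sentences

corpus = expand("<S>")  # learning data for this one intention
print(len(corpus))      # 6 sentences, e.g. "please switch to NHK", "NHK", ...
```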
- the language model creating unit 63 can construct a plurality of statistical language models corresponding to each intention by performing a probability estimation for corpuses of each intention with a statistical technique.
- the sentence generated from the descriptive grammar model in the BNF form indicates a specific intention in a task, and therefore, a statistical language model created using a corpus including the sentence can be said to be a language model robust to the content of an utterance for that intention.
- the method of constructing a statistical language model from a corpus is not limited to any specific method, and since a known technique can be applied, a detailed description is omitted here.
- the "Speech Recognition System" written by Kiyohiro Shikano and Katsunobu Ito mentioned above may be referred to, if necessary.
- by using the technique of constructing a statistical language model from a grammar model, a corpus having content that a speaker is likely to utter is simply and appropriately collected for each intention, and a statistical language model for each intention can be constructed.
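As a concrete (hypothetical) form of that probability estimation, a maximum-likelihood bigram model could be trained on each intention's corpus; the patent leaves the actual statistical technique open, so this is only one possible sketch:

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model p(w_i | w_{i-1}) from a corpus."""
    pair_counts, prev_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {pair: count / prev_counts[pair[0]]
            for pair, count in pair_counts.items()}

# Corpus collected for one intention (illustrative sentences).
corpus = ["please switch to NHK", "switch to NHK", "to NHK"]
model = train_bigram(corpus)
print(model[("to", "NHK")])  # 1.0: every "to" in this corpus precedes "NHK"
```

A production system would add smoothing for unseen word pairs; the maximum-likelihood estimate is used here only to keep the sketch short.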
- the language score calculating section 13 calculates a language score from a group of language models created for each intention
- the acoustic score calculating section 12 calculates an acoustic score with an acoustic model
- the decoder 15 employs the most likely language model as a result of speech recognition processing. Accordingly, it is possible to extract or estimate the intention of an utterance from information for identifying the language model selected for the utterance.
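Leaving the acoustic score aside, the selection of the most likely language model can be sketched with toy bigram models; all model names and probabilities below are invented for illustration:

```python
import math

# Toy bigram models for two intentions; probabilities are invented.
MODELS = {
    "switch_channel": {("<s>", "switch"): 0.6, ("switch", "to"): 0.9,
                       ("to", "NHK"): 0.8, ("NHK", "</s>"): 0.9},
    "raise_volume": {("<s>", "raise"): 0.7, ("raise", "the"): 0.8,
                     ("the", "volume"): 0.9, ("volume", "</s>"): 0.9},
}

def language_log_score(model, sentence, floor=1e-6):
    """Log language score; unseen bigrams fall back to a small floor."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(model.get((prev, cur), floor))
               for prev, cur in zip(words, words[1:]))

def estimate_intention(utterance):
    """Keep the model, and hence the intention, with the highest score."""
    return max(MODELS, key=lambda m: language_log_score(MODELS[m], utterance))

print(estimate_intention("switch to NHK"))
```

In the actual device, the decoder combines this language score with the acoustic score before choosing the most likely model.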
- When the group of language models that the language score calculating section 13 uses is composed only of language models created for the intentions in a focused specific task, an utterance irrelevant to the task may be forcibly fitted to one of the language models and output as a recognition result. Accordingly, it ends up extracting an intention different from the content of the utterance.
- an absorbing statistical language model corresponding to the content of an utterance inconsistent with a task is provided in the language model database 17 in addition to statistical language models for each intention in a focused task, and the group of statistical language models in the task is processed in tandem with the absorbing statistical language model, in order to absorb the content of an utterance not indicating any intention in the focused task (in other words, irrelevant to the task).
- FIG. 7 schematically illustrates a structural example of the language model database 17, including the N statistical language models 1 to N learned corresponding to each intention in a focused task and one absorbing statistical language model.
- the statistical language models corresponding to each intention in the task are constructed by performing a probability estimation for texts for learning generated from the descriptive grammar models indicating each intention in the task with the statistical technique, as described above.
- the absorbing statistical language model is constructed by performing a probability estimation, with the statistical technique, for general corpuses collected from web sites or the like.
- the statistical language model is, for example, an N-gram model in which the probability p(Wi|W1, . . . , Wi−1) of a word Wi appearing after the words W1, . . . , Wi−1 is approximated by the sequence ratio p(Wi|Wi−N+1, . . . , Wi−1) of the nearest N words; for the k-th language model, this probability is written p(k)(Wi|Wi−N+1, . . . , Wi−1).
- the absorbing statistical language model is created by using general corpuses including an enormous amount of sentences collected from, for example, web sites, and is a spontaneous utterance language model (spoken language model) with a larger vocabulary than the statistical language models having each intention in the task.
- the absorbing statistical language model contains vocabularies indicating an intention in a task, but when a language score is calculated for the content of an utterance having an intention in the task, the statistical language model having that intention yields a higher language score than the spontaneous utterance language model does. That is because the absorbing statistical language model is a spontaneous utterance language model with a larger vocabulary than each of the statistical language models in which the intentions are specified, and therefore, the appearance probability of a vocabulary having a specific intention is necessarily low in it.
- for the content of an utterance having an intention in the task, the probability that a sentence similar to the content of the utterance exists in a text for learning that specifies the intention is relatively high.
- on the other hand, for the content of an utterance irrelevant to the task, the language score obtained from the absorbing statistical language model, obtained by learning a general corpus, is relatively higher than the language score obtained from any statistical language model obtained by learning a text for learning that specifies an intention.
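The score relations described above can be checked with toy numbers. The simplified unigram view and all probabilities below are purely illustrative assumptions, not values from the patent:

```python
import math

def log_score(model, words, floor=1e-9):
    """Unigram log score with a floor for out-of-vocabulary words."""
    return sum(math.log(model.get(w, floor)) for w in words)

# The intention model concentrates probability on a few task words; the
# absorbing model spreads a small probability over a large vocabulary.
intention_model = {"switch": 0.3, "to": 0.3, "NHK": 0.3}
absorbing_model = dict.fromkeys(
    ["switch", "to", "NHK", "what", "a", "nice", "day"], 0.01)

in_task = "switch to NHK".split()
off_task = "what a nice day".split()

# In-task utterance: the intention-specific model scores higher.
print(log_score(intention_model, in_task) > log_score(absorbing_model, in_task))
# Off-task utterance: the absorbing model scores higher.
print(log_score(absorbing_model, off_task) > log_score(intention_model, off_task))
```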
- FIG. 8 illustrates an operative example in which a speech recognition device according to the present embodiment performs a meaning estimation for the task "operate the television".
- the corresponding intention in the task can be searched for in the decoder 15 based on an acoustic score calculated by the acoustic score calculating section 12 and a language score calculated by the language score calculating section 13.
- by adding the absorbing statistical language model, composed of the spontaneous utterance language model or the like, to the language model database 17 in addition to the statistical language models corresponding to each intention in a task, the speech recognition device employs the absorbing statistical language model rather than any statistical language model in the task when the content of an utterance irrelevant to the task is recognized, and therefore the risk of erroneously extracting an intention can be reduced.
- a series of the processes described above can be executed with hardware, and also with software.
- a speech recognition device can be realized in a personal computer executing a predetermined program.
- FIG. 9 illustrates a structural example of the personal computer provided in an embodiment of the present invention.
- a central processing unit (CPU) 121 executes various kinds of processes following a program recorded in a read only memory (ROM) 122 , or a recording unit 128 .
- Processing executed following the program includes a speech recognition process, a process of creating a statistical language model used in speech recognition processing, and a process of creating learning data used in creating the statistical language model. Details of each process are as described above.
- a random access memory (RAM) 123 properly stores the program that the CPU 121 executes and data.
- the CPU 121 , ROM 122 , and RAM 123 are connected to one another via a bus 124 .
- the CPU 121 is connected to an input/output interface 125 via the bus 124 .
- the input/output interface 125 is connected to an input unit 126 including a microphone, a keyboard, a mouse, a switch, and the like, and an output unit 127 including a display, a speaker, a lamp, and the like.
- the CPU 121 executes various kinds of processing according to a command input from the input unit 126 .
- the recording unit 128 connected to the input/output interface 125 is, for example, a hard disk drive (HDD), and records a program to be executed by the CPU 121 or various kinds of computer files such as processing data.
- a communicating unit 129 communicates with an external device via a communication network such as the Internet or other networks (none of which are shown).
- the personal computer may acquire program files or download data files via the communicating unit 129 in order to record them in the recording unit 128 .
- a drive 130 connected to the input/output interface 125 drives a magnetic disk 151 , an optical disk 152 , a magneto-optical disk 153 , a semiconductor memory 154 , or the like when they are installed therein, and acquires a program or data recorded in such storage regions.
- the acquired program or data is transferred to the recording unit 128 to be recorded if necessary.
- a program constituting the software is installed, from a recording medium, in a computer incorporated into dedicated hardware, or in a general personal computer that can execute various functions by installing various programs.
- the recording medium includes package media distributed to provide users with programs: a magnetic disk 151 on which a program is recorded (including a flexible disk), an optical disk 152 (including a compact disc-read only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk 153 (including Mini-Disc (MD), a trademark), a semiconductor memory 154, or the like. It also includes the ROM 122 in which a program is recorded, and a hard disk included in the recording unit 128 or the like, which are provided to users in a state of being incorporated into a computer in advance.
- a program for executing the series of processes described above may be installed in a computer, if necessary, via a wired or wireless communication medium such as a local area network (LAN), the Internet, or digital satellite broadcasting, through an interface such as a router or a modem.
Abstract
A speech recognition device includes one or more intention extracting language models in which an intention of a focused specific task is inherent, an absorbing language model in which no intention of the task is inherent, a language score calculating section that calculates a language score indicating a linguistic similarity between each of the intention extracting language models and the absorbing language model, and the content of an utterance, and a decoder that estimates an intention in the content of the utterance based on the language score of each of the language models calculated by the language score calculating section.
Description
- 1. Field of the Invention
- The present invention relates to a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program for recognizing the content of an utterance of a speaker, and particularly, a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program for estimating an intention of a speaker and grasping a task that a system is made to perform by a speech input.
- To put more precisely, the present invention relates to a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program for accurately estimating an intention in the content of an utterance by using a statistical language model, and particularly, a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program for accurately estimating an intention for a focused task based on the content of an utterance.
- 2. Description of the Related Art
- A language that human beings use in daily communication, such as Japanese or English language, is called a “natural language”. Many natural languages originated from spontaneous generation, and have advanced with the histories of mankind, ethnic groups, and societies. Of course, human beings can communicate with each other through gestures of their bodies and hands, but achieve the most natural and advanced communication with natural language.
- On the other hand, accompanying the development of information technologies, computers have settled into human societies and have deeply penetrated various industries and our daily lives. Natural language inherently has the characteristics of being highly abstract and ambiguous, but sentences can be subjected to computer processing by dealing with them mathematically, and as a result, various kinds of applications and services relating to natural language are realized.
- As an application system of natural language processing, speech understanding or speech conversation can be exemplified. For example, when a speech-based computer interface is constructed, speech understanding or speech recognition is a vital technique for realizing input from a human being to the computer.
- Here, speech recognition aims at converting the content of an utterance to characters as it is. By contrast, speech understanding aims at more precisely estimating the intention of a speaker and grasping the task that the system is made to perform by speech input, without accurately understanding each syllable or each word in the speech. However, in the present specification, speech recognition and speech understanding together are called "speech recognition" for the sake of convenience.
- Hereinafter, procedures of speech recognition processing will be briefly described.
- An input speech from a speaker is taken as an electronic signal through, for example, a microphone, subjected to AD conversion, and turned into speech data constituted by a digital signal. In addition, in a signal processing section, a string X of temporal feature vectors is generated by applying acoustic analysis to the speech data for each short-time frame.
- Next, a string of word models is obtained as a recognition result while referring to an acoustic model database, a lexicon, and a language model database.
- An acoustic model recorded in an acoustic model database is, for example, a hidden Markov model (HMM) for a phoneme of the Japanese language. With reference to the acoustic model database, the probability p(X|W) that the input speech data X corresponds to a word W registered in the lexicon can be obtained as an acoustic score. Furthermore, in a language model database, for example, a word sequence ratio (N-gram) that describes how N words form a sequence is recorded. With reference to the language model database, the appearance probability p(W) of the word W registered in the lexicon can be obtained as a language score. Moreover, a recognition result can be obtained based on the acoustic score and the language score.
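As a toy illustration of combining the two scores, the recognizer picks the word sequence W that maximizes log p(X|W) + log p(W), i.e. the sum of the log acoustic score and the log language score. The candidate hypotheses and score values below are invented for illustration:

```python
# Hypothetical candidate word sequences with log acoustic and language scores.
hypotheses = {
    "switch to NHK": {"acoustic": -12.0, "language": -3.0},
    "switch tune HK": {"acoustic": -11.5, "language": -9.0},
}

def total_score(h):
    # Recognition maximizes log p(X|W) + log p(W).
    return h["acoustic"] + h["language"]

best = max(hypotheses, key=lambda w: total_score(hypotheses[w]))
print(best)  # the language score outweighs the small acoustic gap
```

Practical decoders often weight the language score (log p(X|W) + λ·log p(W)); the unweighted sum here keeps the sketch minimal.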
- Here, as language models used in the computation of the language score, a descriptive grammar model and a statistical language model can be exemplified. The descriptive grammar model is a language model that describes the structure of a phrase in a sentence according to grammar rules, and is described by using a context-free grammar in the Backus-Naur Form (BNF), as shown in FIG. 10, for example. In addition, the statistical language model is a language model that is subjected to probability estimation from learning data (a corpus) with a statistical technique. For example, an N-gram model approximates the probability p(Wi|W1, . . . , Wi−1), in which a word Wi appears in the i-th position after the words W1, . . . , Wi−1, by the sequence ratio p(Wi|Wi−N+1, . . . , Wi−1) of the nearest N words (please refer to, for example, "Speech Recognition System" ("Statistical Language Model" in Chapter 4) written by Kiyohiro Shikano and Katsunobu Ito, pp. 53 to 69, published by Ohmsha, Ltd., May 15, 2001, first edition, ISBN 4-274-13228-5).
- The descriptive grammar model is basically created manually, and recognition accuracy is high if the input speech data conform to the grammar, but recognition fails if the data deviate from the grammar even slightly. On the other hand, the statistical language model represented by the N-gram model can be created automatically by subjecting the learning data to statistical processing, and furthermore, can recognize the input speech data even if the arrangement of words in the input speech data runs slightly counter to the grammar rules.
- Furthermore, in creating the statistical language model, a large amount of learning data (corpus) is necessary. As methods of collecting the corpus, there are general methods such as collecting the corpus from media including books, newspapers, magazines, or the like and collecting the corpus from texts disclosed on web sites.
- In speech recognition processing, expressions uttered by a speaker are recognized word by word and phrase by phrase. However, in many application systems, it is more important to accurately estimate the intention of the speaker than to accurately understand all syllables and words in the speech. Moreover, when the content of an utterance is not relevant to the task focused on in speech recognition, it is not necessary to forcibly fit any intention of the task to the recognition. If an erroneously estimated intention is output, there is even a concern that it may cause a wasteful operation in which the system provides the user with irrelevant tasks.
- There are various ways of uttering even for one intention. For example, in the task of “operate the television”, there is a plurality of intentions such as “switch the channel”, “watch a program”, and “turn up the volume”, but there is a plurality of ways of uttering for each of the intentions. For example, in the intention to switch the channel (to NHK), there are two or more ways of uttering such as “please switch to NHK” and “to NHK”, in the intention to watch a program (Taiga Drama: a historical drama), there are two or more ways of uttering, such as “I want to watch Taiga Drama” and “Turn on the Taiga Drama”, and in the intention to turn up the volume, there are two or more ways of uttering, such as “raise the volume” and “volume up”.
- For example, a speech processing device was suggested in which a language model is prepared for each intention (information on wishes) and an intention corresponding to the highest total score is selected as information indicating a wish of uttering based on an acoustic score and a language score (for example, please refer to Japanese Unexamined Patent Application Publication No. 2006-53203).
- The speech processing device uses each statistical language model as a language model for intentions, and recognizes the intentions even when the arrangement of words in input speech data runs slightly counter to grammar rules. However, even when the content of an utterance does not correspond to any intention of a focused task, the device fits any intention to the content by force. For example, when the speech processing device is configured to provide the service of a task relating to a television operation and provided with a plurality of statistical language models in which each intention relating to the television operation is inherent, an intention corresponding to a statistical language model showing a high value of a calculated language score is output as a recognition result even for the content of an utterance that does not intend a television operation. Accordingly, it ends up with the result of extracting an intention different from the intended content of the utterance.
- Furthermore, in configuring the speech processing device in which individual language models are provided for intentions as described above, it is necessary to prepare a sufficient number of language models for extracting the intentions of a task in consideration of the content of an utterance according to a focused specific task. In addition, it is necessary to collect learning data (corpus) according to intentions for creating robust language models for the intentions in a task.
- There is a general method of collecting the corpus from media such as books, newspapers, and magazines, and texts on web sites. For example, a method of generating a language model was suggested which generates a symbol sequence ratio with high accuracy by putting heavier importance on a text nearer to a recognition task (the content of an utterance) in an enormous text database, and improves the recognition capability by using the ratio in the recognition (for example, please refer to Japanese Unexamined Patent Application Publication No. 2002-82690).
- However, even if an enormous amount of learning data can be collected from the media such as books, newspapers, and magazines, and texts on web sites, selecting a phrase that a speaker is likely to utter takes effort and having a huge number of corpuses completely consistent with the intention is difficult. In addition, it is difficult to specify an intention of each text or to classify a text by intention. In other words, a corpus completely consistent with the intention of a speaker may not be collected.
- The inventors of the present invention consider that it is necessary to solve the following two points in order to realize a speech recognition device that accurately estimates an intention relating to a focused task in the content of an utterance.
- (1) A corpus having content that a speaker is likely to utter is simply and appropriately collected for each intention.
- (2) No intention is forcibly fitted to the content of an utterance that is inconsistent with the task; such content is instead ignored.
- It is desirable to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program which are excellent in estimating the intention of a speaker, and accurately grasping a task that the system is made to perform by a speech input.
- It is more desirable to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program which are excellent in accurately estimating an intention of the content of an utterance by using a statistical language model.
- It is still more desirable to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program, which are excellent in accurately estimating the intention relating to a task focused in the content of an utterance.
- The present invention takes the above matters into consideration, and according to a first embodiment of the present invention, a speech recognition device includes one or more intention extracting language models in which each intention of a focused specific task is inherent, an absorbing language model in which no intention of the task is inherent, a language score calculating section that calculates a language score indicating a linguistic similarity between each of the intention extracting language models and the absorbing language model, and the content of an utterance, and a decoder that estimates an intention in the content of the utterance based on the language score of each of the language models calculated by the language score calculating section.
- According to a second embodiment of the present invention, there is provided a speech recognition device in which the intention extracting language model is a statistical language model obtained by subjecting learning data, which are composed of a plurality of sentences indicating the intention of the task, to a statistical processing.
- Furthermore, according to a third embodiment of the present invention, there is provided a speech recognition device in which the absorbing language model is a statistical language model obtained by subjecting to statistical processing an enormous amount of learning data, which are irrelevant to indicating the intention of the task or are composed of spontaneous utterances.
- Furthermore, according to a fourth embodiment of the present invention, there is provided a speech recognition device in which the learning data for obtaining the intention extracting language model are composed of sentences which are generated based on a descriptive grammar model indicating a corresponding intention and consistent with the intention.
- Furthermore, according to a fifth embodiment of the present invention, there is provided a speech recognition method including the steps of: firstly calculating a language score indicating a linguistic similarity between each of one or more intention extracting language models, in which each intention of a focused specific task is inherent, and the content of an utterance; secondly calculating a language score indicating a linguistic similarity between an absorbing language model, in which no intention of the task is inherent, and the content of the utterance; and estimating the intention in the content of the utterance based on the language scores of the language models calculated in the first and second language score calculating steps.
- Furthermore, according to a sixth embodiment of the present invention, there is provided a language model generation device including: a word meaning database in which a combination of an abstracted vocabulary of a first part-of-speech string and an abstracted vocabulary of a second part-of-speech string, together with one or more words indicating the same meaning or a similar intention as the abstracted vocabularies, is registered by abstracting the vocabulary candidates of the first part-of-speech string and the second part-of-speech string that may appear in an utterance indicating an intention, with respect to each intention of a focused specific task; a descriptive grammar model creating unit which creates a descriptive grammar model indicating an intention based on the combination of the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string indicating the intention of the task and the one or more words indicating the same meaning or a similar intention registered for the abstracted vocabularies in the word meaning database; a collecting unit which collects a corpus having content that a speaker is likely to utter for an intention by automatically generating sentences consistent with each intention from the descriptive grammar model for the intention; and a language model creating unit that creates a statistical language model in which each intention is inherent by subjecting the corpus collected for the intention to statistical processing.
- However, the specific example of the first part-of-speech mentioned here is a noun, and the specific example of the second part-of-speech is a verb. To put it simply, it should be understood that a combination of important vocabularies indicating an intention is referred to as the first part-of-speech and the second part-of-speech.
- According to a seventh embodiment of the present invention, there is provided the language model generation device in which the word meaning database has the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string arranged on a matrix for each string and has a mark indicating the existence of the intention given in a column corresponding to the combination of the vocabulary of the first part-of-speech and the vocabulary of the second part-of-speech having intentions.
- Furthermore, according to an eighth embodiment of the present invention, there is provided a language model generation method including the steps of creating a grammar model by making abstract a necessary phrase for transmitting each intention included in a focused task, collecting a corpus having content that a speaker is likely to utter for an intention by automatically generating sentences consistent with each intention by using the grammar model, and constructing a plurality of statistical language models corresponding to each intention by performing probabilistic estimation from each corpus with a statistical technique.
- Furthermore, according to a ninth embodiment of the present invention, there is provided a computer program described in a computer readable format so as to execute processing for speech recognition on a computer, the program causing the computer to function as one or more intention extracting language models in which each intention of a focused specific task is inherent, an absorbing language model in which no intention of the task is inherent, a language score calculating section that calculates a language score indicating a linguistic similarity between each of the intention extracting language models and the absorbing language model, and the content of an utterance, and a decoder that estimates an intention in the content of the utterance based on the language score of each of the language models calculated by the language score calculating section.
- The computer program according to the above embodiment of the present invention is defined as a computer program that is described in a computer readable format so as to realize a predetermined processing on the computer. In other words, by installing the computer program according to the embodiment of the present invention on a computer, a cooperative action can be exerted on the computer and the same action and effect as in a speech recognition device according to the first embodiment of the present invention can be obtained.
- Furthermore, according to a tenth embodiment of the present invention, there is provided a computer program described in a computer readable format so as to execute processing for the generation of a language model on a computer, the program causing the computer to function as: a word meaning database in which a combination of an abstracted vocabulary of a first part-of-speech string and an abstracted vocabulary of a second part-of-speech string, together with one or more words indicating the same meaning or a similar intention as the abstracted vocabularies, is registered by abstracting the vocabulary candidates of the first part-of-speech string and the second part-of-speech string that may appear in an utterance indicating an intention, with respect to each intention of a focused specific task; a descriptive grammar model creating unit which creates a descriptive grammar model indicating an intention based on the combination of the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string indicating the intention of the task and the one or more words indicating the same meaning or a similar intention registered for the abstracted vocabularies in the word meaning database; a collecting unit which collects a corpus having content that a speaker is likely to utter for an intention by automatically generating sentences consistent with each intention from the descriptive grammar model for the intention; and a language model creating unit that creates a statistical language model in which each intention is inherent by subjecting the corpus collected for the intention to statistical processing.
- The computer program according to the above embodiment of the present invention is defined as a computer program that is described in a computer readable format so as to realize a predetermined processing on the computer. In other words, by installing the computer program according to the embodiment of the present invention on a computer, a cooperative action can be exerted on the computer and the same action and effect as in the language model generation device according to the sixth embodiment of the present invention can be obtained.
- According to the present invention, it is possible to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program which are excellent in estimating an intention of a speaker, and accurately grasping a task that a system is made to perform by a speech input.
- Furthermore, according to the present invention, it is possible to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program which are excellent in accurately estimating an intention of the content of an utterance by using a statistical language model.
- Furthermore, according to the present invention, it is possible to provide a speech recognition device and a speech recognition method, a language model generation device and a language model generation method, and a computer program which are excellent in accurately estimating an intention relating to a task focused in the content of an utterance.
- According to the first to fifth and ninth embodiments of the present invention, it is possible to realize robust intention extraction for the task by providing a statistical language model corresponding to the content of an utterance that is inconsistent with a focused task, such as a spontaneous utterance language model, in addition to the statistical language models in which an intention included in the focused task is inherent, by performing processing in parallel, and by ignoring the estimation of an intention in the content of an utterance that is inconsistent with the task.
- According to the sixth to eighth, and tenth embodiments of the present invention, a corpus having a content that a speaker is likely to utter (in other words, a corpus necessary to create a statistical language model in which an intention is inherent) can be simply and appropriately collected for an intention by determining the intention included in a focused task in advance and automatically generating sentences consistent with the intention from a descriptive grammar model indicating the intention.
- According to the seventh embodiment of the present invention, the content that is likely to be uttered can be grasped without omission by arranging the vocabulary candidates of the noun string and the vocabulary candidates of the verb string that may appear in an utterance on a matrix. In addition, since one or more words having the same meaning or a similar meaning are registered for the symbols of the vocabulary candidates of each string, it is possible to form combinations corresponding to the various expressions of an utterance having the same meaning and to generate a large amount of sentences having the same intention as learning data.
- If the collecting method for the learning data according to the sixth to eighth and tenth embodiments of the present invention is employed, the corpus consistent with one focused task can be divided by intention and simply and efficiently collected. Moreover, by creating a statistical language model from each set of the created learning data, a group of language models in each of which one intention of the same task is inherent can be obtained. In addition, by using morpheme interpreting software, part-of-speech and conjugation information is given to each morpheme to be used during the creation of the statistical language model.
- According to the sixth and tenth embodiments of the present invention, the statistical language models are created by a procedure in which the collecting unit collects a corpus having content that a speaker is likely to utter for each intention by automatically generating sentences consistent with each intention from the descriptive grammar model for the intention, and the language model creating unit creates the statistical language model in which an intention is inherent by subjecting the corpus collected for each intention to statistical processing. This procedure has the two advantages shown below.
- (1) Uniformity of morphemes (division of words) is promoted. In a grammar model that is created manually, there is a high possibility that uniformity of morphemes is not achieved. However, even if the morphemes are not uniform, uniform morphemes can be obtained by using the morpheme interpreting software when the statistical language model is created.
- (2) By using the morpheme interpreting software, information on parts of speech or conjugations can be obtained, and the information can be reflected during the creation of the statistical language model.
- Other aims, characteristics, and advantages of the present invention will be clarified by the detailed description based on the embodiments of the present invention described below and the accompanying drawings.
-
FIG. 1 is a block diagram schematically illustrating a functional structure of a speech recognition device according to an embodiment of the present invention; -
FIG. 2 is a diagram schematically illustrating the minimum necessary structure of phrases for transmitting an intention; -
FIG. 3A is a diagram illustrating a word meaning database in which abstracted noun vocabularies and verb vocabularies are arranged in a matrix form; -
FIG. 3B is a diagram illustrating a state in which words indicating the same meaning or a similar intention are registered for abstracted vocabularies; -
FIG. 4 is a diagram for describing a method of creating a descriptive grammar model based on a combination of a noun vocabulary and a verb vocabulary marked in the matrix shown in FIG. 3A ; -
FIG. 5 is a diagram for describing a method of collecting a corpus having a content that a speaker is likely to utter by automatically generating sentences consistent with an intention from the descriptive grammar model for each intention; -
FIG. 6 is a diagram illustrating a flow of data in a technique of constructing a statistical language model from a grammar model; -
FIG. 7 is a diagram schematically illustrating a structural example of a language model database constituted with N statistical language models 1 to N learned for each intention of a focused task and one absorbing statistical language model; -
FIG. 8 is a diagram illustrating an operative example when a speech recognition device performs meaning estimation for the task “Operate the television”; -
FIG. 9 is a diagram illustrating a structural example of a personal computer provided in an embodiment of the present invention; and -
FIG. 10 is a diagram illustrating an example of a descriptive grammar model described with the context-free grammar. - The present invention relates to a speech recognition technology, and its main characteristic is to accurately estimate the intention in the content that a speaker utters while focusing on a specific task, thereby resolving the following two points.
- (1) A corpus having content that a speaker is likely to utter is simply and appropriately collected for each intention.
- (2) The content of an utterance that is inconsistent with the task is not forced to fit any intention, but is instead ignored.
- Hereinbelow, an embodiment resolving these two points will be described in detail with reference to the accompanying drawings.
-
FIG. 1 schematically illustrates a functional structure of a speech recognition device according to an embodiment of the present invention. The speech recognition device 10 in the drawing is provided with a signal processing section 11, an acoustic score calculating section 12, a language score calculating section 13, a lexicon 14, and a decoder 15. The speech recognition device 10 is configured to accurately estimate the intention of a speaker, rather than to accurately understand every syllable and every word in the speech. - Input speech from a speaker is brought into the
signal processing section 11 as electric signals through, for example, a microphone. These analog electric signals undergo AD conversion through sampling and quantization processing to become speech data constituted of digital signals. In addition, the signal processing section 11 generates a time series X of feature vectors by applying acoustic analysis to the speech data for each short-time frame. By applying frequency analysis such as the Discrete Fourier Transform (DFT) as the acoustic analysis, for example, the series X of feature vectors, which characterizes properties such as the energy of each frequency band (the so-called power spectrum), is generated. - Next, a string of word models is obtained as a recognition result while referring to an
acoustic model database 16, the lexicon 14, and a language model database 17. - The acoustic
score calculating section 12 calculates an acoustic score indicating the acoustic similarity between an acoustic model including a string of words formed based on the lexicon 14 and the input speech signals. The acoustic model recorded in the acoustic model database 16 is, for example, a Hidden Markov Model (HMM) for each phoneme of the Japanese language. The acoustic score calculating section 12 can obtain the probability p(X|W) that the input speech data X corresponds to a word W registered in the lexicon 14 as an acoustic score while referring to the acoustic model database. - Furthermore, the language score calculating section calculates a language score indicating the linguistic similarity between a language model including a string of words formed based on the
lexicon 14 and the input speech signals. In the language model database 17, the word sequence probability (N-gram), which describes how a sequence of N words is formed, is recorded. The language score calculating section 13 can obtain the appearance probability p(W) of the word W registered in the lexicon 14 as a language score with reference to the language model database 17. - The
decoder 15 obtains a recognition result based on the acoustic score and the language score. Specifically, as shown in Equation (1) below, the probability p(W|X) that the input speech data X corresponds to the word W registered in the lexicon 14 is calculated, and the candidate words are searched and output in descending order of probability. -
p(W|X)∝p(W)·p(X|W) (1) - In addition, the
decoder 15 can estimate an optimal result with Equation (2) shown below. -
W=arg max p(W|X) (2) - A language model that the language
score calculating section 13 uses is a statistical language model. A statistical language model, represented by the N-gram model, can be automatically created from learning data and can recognize speech even when the arrangement of words in the input speech data deviates slightly from grammar rules. The speech recognition device 10 according to the present embodiment is assumed to estimate an intention relating to a task focused on in the content of an utterance, and for that reason, the language model database 17 is installed with a plurality of statistical language models corresponding to each intention included in the focused task. In addition, the language model database 17 is installed with a statistical language model corresponding to the content of an utterance inconsistent with the focused task in order to ignore intention estimation for such an utterance, which will be described in detail later.
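As an illustrative sketch (not part of the claimed device), the decoding rule of Equations (1) and (2) above can be expressed as follows; the candidate word sequences and their scores are hypothetical, and log probabilities are summed rather than multiplying raw probabilities:

```python
import math

# Hypothetical candidates with log language scores log p(W) and log
# acoustic scores log p(X|W); in the device these would come from the
# language score calculating section 13 and the acoustic score
# calculating section 12, respectively.
candidates = {
    "change the channel": {"lang": math.log(0.020), "acoustic": math.log(0.150)},
    "chain the channel":  {"lang": math.log(0.001), "acoustic": math.log(0.140)},
    "watch the program":  {"lang": math.log(0.015), "acoustic": math.log(0.020)},
}

def decode(candidates):
    # Equation (1): p(W|X) is proportional to p(W) * p(X|W), so in log
    # space the two scores are added; Equation (2) then takes the argmax.
    return max(candidates, key=lambda w: candidates[w]["lang"] + candidates[w]["acoustic"])

print(decode(candidates))  # the candidate maximizing p(W) * p(X|W)
```

The candidate "chain the channel" is acoustically close to the best hypothesis, but its low language score removes it from contention, which is exactly why the product of the two scores is searched rather than the acoustic score alone.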
- Therefore, the present embodiment makes it possible to simply and appropriately collect a corpus having content that a speaker is likely to utter for each intention and to construct statistical language models for each intention, by using a technique of constructing the statistical language models from a grammar model.
- First, if an intention included in a focused task is determined in advance, the grammar model is efficiently created by making phrases necessary for transmitting the intention abstract (or symbolized). Next, by using the created grammar model, sentences consistent with each intention are automatically generated. As such, after collecting the corpus having the content that the speaker is likely to utter for each intention, the plurality of statistical language models corresponding to each intention can be constructed by performing a probability estimation from each corpus with a statistical technique.
- Furthermore, for example, “Bootstrapping Language Models for Dialogue Systems” written by Karl Weilhammer, Matthew N. Stuttle, and Steve Young (Interspeech, 2006) describes the technique of constructing statistical language models from the grammar model, but made no mention of an efficient construction method. On the contrary, in the present embodiment, the statistical language models can be efficiently constructed from the grammar model as described below.
- There will be described about a method of creating a corpus for each intention using the grammar model.
- When a corpus for learning a language model in which any one intention is included is created, a descriptive grammar model is created for obtaining the corpus. The inventors think that a structure of a simple and short sentence that a speaker is likely to utter (or a minimum phrase necessary for transmitting an intention) is composed of a combination of a noun vocabulary and a verb vocabulary, as “PERFORM SOMETHING” (as shown in
FIG. 2 ). Therefore, words for each of the noun vocabulary and the verb vocabulary are made to be abstract (or symbolized) in order to efficiently construct the grammar model. - For example, noun vocabularies indicating a title of a television program such as “Taiga Drama” (a historical drama) or “Waratte ii tomo” (a comedy program) are made abstract as a vocabulary “_Title”. In addition, verb vocabularies for machines used in watching programs such as a television, or the like, such as “please replay”, “please show”, or “I want to watch” are made to be abstract as the vocabulary “_Play”. As a result, the utterance having an intention of “please show the program” can be expressed by a combination of symbols for _Title & _Play.
- Furthermore, words indicating a same meaning or a similar intention are registered, for example, as below for each of the abstracted vocabularies. The registering work may be done manually.
- _Title=Taiga Drama, Waratte ii tomo, . . .
- _Play=please replay, replay, show, please show, I want to watch, do it, turn on, play, . . .
- In addition, “_Play the _Title”, or the like are created as the descriptive grammar model for obtaining corpuses. Corpuses such as “Please show the Taiga Drama” (historical drama) or the like can be created from the descriptive grammar model “_Play the _Title”.
- As such, the descriptive grammar models can be composed of the combination of each of the abstracted noun vocabularies and the verb vocabularies. In addition, the combination of each of the abstracted noun vocabularies and the verb vocabularies may express one intention. Therefore, as shown in
FIG. 3A , a matrix is formed by arranging the abstracted noun vocabularies in the rows and the abstracted verb vocabularies in the columns, and a word meaning database is constructed by putting a mark indicating the existence of an intention in the corresponding cell of the matrix for each combination of an abstracted noun vocabulary and an abstracted verb vocabulary having that intention. - In the matrix shown in
FIG. 3A , a combination of a noun vocabulary and a verb vocabulary given a mark indicates a descriptive grammar model in which one intention is included. In addition, words indicating the same meaning or a similar intention are registered in the word meaning database for the abstracted noun vocabularies assigned to the rows of the matrix. Moreover, as shown in FIG. 3B , words indicating the same meaning or a similar intention are registered in the word meaning database for the abstracted verb vocabularies assigned to the columns of the matrix. Furthermore, the word meaning database can be expanded into a three-dimensional arrangement rather than a two-dimensional arrangement such as the matrix shown in FIG. 3A .
- (1) It is easy to confirm whether the contents of an utterance by a speaker are comprehensively included.
- (2) It is easy to confirm whether functions of a system can be matched without omissions.
- (3) It is possible to efficiently construct a grammar model.
- In the matrix shown in
FIG. 3A , each of the combinations of a noun vocabulary and a verb vocabulary given a mark corresponds to a descriptive grammar model indicating an intention. In addition, by substituting the registered words indicating the same meaning or a similar intention for each of the abstracted noun vocabularies and the abstracted verb vocabularies, the descriptive grammar model described in the BNF form can be efficiently created, as shown in FIG. 4 .
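As an illustrative sketch of the word meaning database (the symbols other than _Title and _Play are hypothetical examples for a television-operation task, not taken from the embodiment), the matrix can be represented as a set of marked cells, each marked cell yielding one descriptive grammar model indicating an intention:

```python
# Rows are abstracted noun vocabularies and columns are abstracted verb
# vocabularies; a mark in a cell means the combination expresses one
# intention of the focused task. _Channel, _Volume, _Change, and _Raise
# are hypothetical symbols added for illustration.
noun_rows = ["_Title", "_Channel", "_Volume"]
verb_cols = ["_Play", "_Change", "_Raise"]
marks = {("_Title", "_Play"), ("_Channel", "_Change"),
         ("_Volume", "_Change"), ("_Volume", "_Raise")}

def descriptive_grammar_models(noun_rows, verb_cols, marks):
    """Scan the matrix row by row; every marked cell yields one
    descriptive grammar model of the form '_Verb the _Noun'."""
    return [f"{verb} the {noun}"
            for noun in noun_rows for verb in verb_cols
            if (noun, verb) in marks]

for model in descriptive_grammar_models(noun_rows, verb_cols, marks):
    print(model)  # "_Play the _Title" is printed first
```

Scanning the matrix in this manner also makes the coverage checks described above mechanical: an unmarked row or column is immediately visible as an utterance or system function that is not yet handled.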
- In other words, from the descriptive grammar models for each intention that are obtained from the word meaning database in the form of matrix shown in
FIG. 3A , corpuses having content that a speaker is likely to utter can be collected for each intention by automatically generating sentences consistent with the intention, as shown in FIG. 5 .
-
FIG. 6 illustrates a flow of data in a method of constructing a statistical language model from a grammar model, which has been described hitherto. - The structure of the word meaning database is as shown in
FIG. 3A . In other words, noun vocabularies relating to a focused task (for example, operation of a television) are grouped by the same meaning or a similar intention, and each abstracted group of noun vocabularies is arranged in a row of the matrix. In the same way, verb vocabularies relating to the focused task are grouped by the same meaning or a similar intention, and each abstracted group of verb vocabularies is arranged in a column of the matrix. In addition, as shown in FIG. 3B , a plurality of words indicating the same meaning or a similar intention is registered for each of the abstracted noun vocabularies, and a plurality of words indicating the same meaning or a similar intention is registered for each of the abstracted verb vocabularies. - On the matrix shown in
FIG. 3A , a mark indicating the existence of an intention is given in the cell corresponding to a combination of a noun vocabulary and a verb vocabulary having the intention. In other words, each of the combinations of noun vocabularies and verb vocabularies matched with marks corresponds to a descriptive grammar model indicating an intention. A descriptive grammar model creating unit 61 picks up a combination of an abstracted noun vocabulary and an abstracted verb vocabulary indicating an intention, using the mark on the matrix as a clue, then substitutes the registered words indicating the same meaning or a similar intention for each of the abstracted noun vocabularies and abstracted verb vocabularies, and creates a descriptive grammar model in the BNF form to store the model as a file of the context-free grammar. Basic files in the BNF form are automatically created, and then the model is modified in the form of a BNF file according to the expression of an utterance. In the example shown in FIG. 6 , N descriptive grammar models 1 to N are constructed by the descriptive grammar model creating unit 61 based on the word meaning database and stored as files of the context-free grammar. In the present embodiment, the BNF form is used in defining the context-free grammar, but the spirit of the present invention is not necessarily limited thereto. - A sentence indicating a specific intention can be obtained by creating a sentence from a created BNF file. As shown in
FIG. 4 , the transcription of a grammar model in the BNF form is a sentence creation rule from a non-terminal symbol (Start) to a terminal symbol (End). Therefore, the collecting unit 62 can automatically generate a plurality of sentences indicating the same intention, as shown in FIG. 5 , and can collect corpuses having content that a speaker is likely to utter for each intention by searching the routes from the non-terminal symbol (Start) to the terminal symbol (End) for a descriptive grammar model indicating an intention. In the example shown in FIG. 6 , the group of sentences automatically generated from each of the descriptive grammar models is used as learning data indicating the same intention. In other words, learning data 1 to N collected for each intention by the collecting unit 62 become the corpuses for constructing the statistical language models.
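The route search from the non-terminal start symbol to terminal strings can be sketched as follows; the tiny grammar below is a hypothetical stand-in for a BNF file, not the actual grammar of FIG. 4:

```python
# A toy context-free grammar: non-terminals map to lists of alternative
# rules, and each rule is a sequence of symbols. Any symbol with no
# entry in the dictionary is a terminal.
grammar = {
    "<start>": [["<play>", "the", "<title>"]],
    "<title>": [["Taiga Drama"], ["Waratte ii tomo"]],
    "<play>": [["please show"], ["please replay"]],
}

def generate(symbol, grammar):
    """Exhaustively enumerate every terminal string reachable from
    the given symbol, i.e. every route from Start to End."""
    if symbol not in grammar:          # terminal symbol: emit as-is
        return [symbol]
    sentences = []
    for rule in grammar[symbol]:
        # Expand the rule left to right, combining every alternative
        # expansion of each constituent symbol.
        partial = [""]
        for sym in rule:
            partial = [(p + " " + s).strip()
                       for p in partial for s in generate(sym, grammar)]
        sentences.extend(partial)
    return sentences

corpus = generate("<start>", grammar)
print(corpus)  # four sentences, all carrying the same intention
```

Because every route through one grammar file expresses the same intention, the enumerated sentences can be written out directly as one set of learning data.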
- Moreover, the language
model creating unit 63 can construct a plurality of statistical language models corresponding to each intention by performing a probability estimation for the corpuses of each intention with a statistical technique. A sentence generated from the descriptive grammar model in the BNF form indicates a specific intention in the task, and therefore, a statistical language model created using a corpus including such sentences can be said to be a language model robust to the content of an utterance having that intention.
- In the descriptions hitherto, it can be understood that a corpus having a content that a speaker is likely to utter is simply and appropriately collected for each intention and a statistical language model for each intention can be constructed by using a technique of constructing the statistical language model from a grammar model.
- Consecutively, there will be provided a description of a method in which any intention is not forced to fit to the content of an utterance inconsistent with a task, but can be ignored in the speech recognition device.
- When a speech recognition processing is performed, the language
score calculating section 13 calculates a language score from the group of language models created for each intention, the acoustic score calculating section 12 calculates an acoustic score with an acoustic model, and the decoder 15 employs the most likely language model as the result of the speech recognition processing. Accordingly, it is possible to extract or estimate the intention of an utterance from information identifying the language model selected for the utterance. - When the group of language models that the language
score calculating section 13 uses is composed only of language models created for the intentions in a focused specific task, an utterance irrelevant to the task may be forced to fit some language model and that model may be output as a recognition result. This results in extracting an intention different from the content of the utterance. - Therefore, in a speech recognition device according to the present embodiment, an absorbing statistical language model corresponding to the content of an utterance inconsistent with the task is provided in the
language model database 17 in addition to the statistical language models for each intention in the focused task, and the group of statistical language models in the task is processed in parallel with the absorbing statistical language model, in order to absorb the content of an utterance not indicating any intention in the focused task (in other words, irrelevant to the task). -
FIG. 7 schematically illustrates a structural example of the language model database 17 constituted with N statistical language models 1 to N learned corresponding to each intention in a focused task and one absorbing statistical language model.
- Here, the statistical language model is, for example, an N-gram model which causes a probability p (Wi|W1, . . . , Wi−1) in which a word Wi appears in the order of i-th after an (i−1)-th word appears in the order of W1, . . . , and Wi−1 to approximate to the sequence ratio p of the nearest N number of words (Wi|Wi−N+1, . . . , Wi−1) (as described before). When the content of an utterance by a speaker indicates an intention in a focused task, a probability p(k)(Wi|Wi−N+1, . . . , Wi−1) obtained from a statistical language model k obtained by learning a text for learning that has the intention has a high value, and
intentions 1 to N in the focused task can be accurately grasped (where, k is an integer from 1 to N). - On the other hand, the absorbing statistical language model is created by using general corpuses including an enormous amount of sentences collected from, for example, web sites, and is a spontaneous utterance language model (spoken language model) composed of a larger amount of vocabularies than the statistical language models having each intention in the task.
- The absorbing statistical language model contains vocabularies indicating an intention in a task, but when a language score is calculated for the content of an utterance having an intention in a task, the statistical language model having an intention in a task has a higher language score than the spontaneous utterance language model does. That is because the absorbing statistical language model is a spontaneous utterance language model and has a larger amount of vocabularies than each of the statistical language models in which the intentions are specified, and therefore, the appearance probability of a vocabulary having a specific intention is necessarily low.
- On the contrary, when the content of an utterance by a speaker is not relevant to a focused task, a probability in which a sentence similar to the content of the utterance exists in a text for learning that specifies an intention. For this reason, a probability in which a sentence similar to the content of the utterance exists in a general corpus is relatively high. In other words, a language score obtained from an absorbing statistical language model obtained by learning a general corpus is relatively higher than a language score obtained from any statistical language model obtained by learning a text for learning that specifies an intention. In addition, it is possible to prevent instances where any intention is forced to fit to the content of an utterance inconsistent with a task by outputting “others” as a corresponding intention from the
decoder 15. -
FIG. 8 illustrates an operative example in which a speech recognition device according to the present embodiment performs a meaning estimation for the task "operate the television". - When the input content of an utterance indicates an intention in the task "operate the television", such as "change the channel" or "watch the program", the corresponding intention in the task can be searched in the
decoder 15 based on an acoustic score calculated by the acoustic score calculating section 12 and a language score calculated by the language score calculating section 13. - On the contrary, when the input content of an utterance does not indicate an intention in the task "operate the television", such as "it's time to go to the market", the probability value obtained with reference to the absorbing statistical language model is expected to be the highest, and the
decoder 15 obtains the intention "others" as the search result. - By applying the absorbing statistical language model, composed of the spontaneous utterance language model or the like, to the language model database 17 in addition to the statistical language models corresponding to each intention in the task, the speech recognition device according to the present embodiment employs the absorbing statistical language model rather than any in-task statistical language model even when the content of an utterance irrelevant to the task is recognized, and therefore the risk of erroneously extracting an intention can be reduced.
-
FIG. 9 illustrates a structural example of the personal computer provided in an embodiment of the present invention. A central processing unit (CPU) 121 executes various kinds of processes following a program recorded in a read only memory (ROM) 122 or a recording unit 128. Processing executed following the program includes a speech recognition process, a process of creating a statistical language model used in speech recognition processing, and a process of creating learning data used in creating the statistical language model. Details of each process are as described above. - A random access memory (RAM) 123 properly stores the program that the
CPU 121 executes and data. The CPU 121, ROM 122, and RAM 123 are connected to one another via a bus 124. - The
CPU 121 is connected to an input/output interface 125 via the bus 124. The input/output interface 125 is connected to an input unit 126 including a microphone, a keyboard, a mouse, a switch, and the like, and an output unit 127 including a display, a speaker, a lamp, and the like. In addition, the CPU 121 executes various kinds of processing according to a command input from the input unit 126. - The
recording unit 128 connected to the input/output interface 125 is, for example, a hard disk drive (HDD), and records a program to be executed by the CPU 121 and various kinds of computer files such as processing data. A communicating unit 129 communicates with an external device via a communication network such as the Internet or other networks (none of which are shown). In addition, the personal computer may acquire program files or download data files via the communicating unit 129 and record them in the recording unit 128. - A
drive 130 connected to the input/output interface 125 drives a magnetic disk 151, an optical disk 152, a magneto-optical disk 153, a semiconductor memory 154, or the like when one is installed therein, and acquires a program or data recorded in such storage media. The acquired program or data is transferred to the recording unit 128 and recorded if necessary. - When the series of processes is to be executed with software, a program constituting the software is installed, from a recording medium, in a computer incorporated into dedicated hardware or in a general-purpose personal computer that can execute various functions when installed with various programs.
- As shown in
FIG. 9 , the recording medium includes package media distributed to provide users with programs, such as a magnetic disk 151 on which a program is recorded (including a flexible disk), an optical disk 152 (including compact disc-read only memory (CD-ROM) and digital versatile disc (DVD)), a magneto-optical disk 153 (including Mini-Disc (MD), a trademark), and a semiconductor memory 154, in addition to the ROM 122 in which a program is recorded and a hard disk included in the recording unit 128, which are provided to users in a state of being incorporated into a computer in advance. - Furthermore, the program for executing the series of processes described above may be installed in a computer, if necessary, via a wired or wireless communication medium such as a local area network (LAN), the Internet, or digital satellite broadcasting, through an interface such as a router or a modem.
- The present application contains subject matter related to that disclosed in Japanese Priority Patent Application JP 2009-070992 filed in the Japan Patent Office on Mar. 23, 2009, the entire content of which is hereby incorporated by reference.
- It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and alterations may occur depending on design requirements and other factors insofar as they are within the scope of the appended claims or the equivalents thereof.
Claims (11)
1. A speech recognition device, comprising:
one or more intention extracting language models, in each of which an intention of a focused specific task is inherent;
an absorbing language model in which no intention of the task is inherent;
a language score calculating section that calculates a language score indicating a linguistic similarity between the content of an utterance and each of the intention extracting language models and the absorbing language model; and
a decoder that estimates an intention in the content of an utterance based on a language score of each of the language models calculated by the language score calculating section.
2. The speech recognition device according to claim 1 , wherein the intention extracting language model is a statistical language model obtained by subjecting learning data, which are composed of a plurality of sentences indicating the intention of the task, to statistical processing.
3. The speech recognition device according to claim 1 , wherein the absorbing language model is a statistical language model obtained by subjecting an enormous amount of learning data, which are irrelevant to the intention of the task or are composed of spontaneous utterances, to statistical processing.
4. The speech recognition device according to claim 2 , wherein the learning data for obtaining the intention extracting language model are composed of sentences which are generated based on a descriptive grammar model indicating a corresponding intention and consistent with the intention.
5. A speech recognition method, comprising the steps of:
firstly calculating a language score indicating a linguistic similarity between the content of an utterance and one or more intention extracting language models in each of which an intention of a focused specific task is inherent;
secondly calculating a language score indicating a linguistic similarity between the content of the utterance and an absorbing language model in which no intention of the task is inherent; and
estimating an intention in the content of an utterance based on a language score of each of the language models calculated in the first and second language score calculations.
6. A language model generation device, comprising:
a word meaning database in which, for each intention of a focused specific task, a combination of an abstracted vocabulary of a first part-of-speech string and an abstracted vocabulary of a second part-of-speech string, and one or more words indicating the same meaning or a similar intention as the abstracted vocabularies, are registered by abstracting the vocabulary candidates of the first part-of-speech string and the second part-of-speech string that may appear in an utterance indicating the intention;
descriptive grammar model creating means for creating a descriptive grammar model indicating an intention based on the combination of the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string indicating the intention of the task and one or more words indicating the same meaning or a similar intention for the abstracted vocabularies registered in the word meaning database;
collecting means for collecting a corpus having a content that a speaker is likely to utter for an intention by automatically generating sentences consistent with each intention from the descriptive grammar model for the intention; and
language model creating means for creating a statistical language model in which each intention is inherent by subjecting the corpus collected for the intention to statistical processing.
7. The language model generation device according to claim 6 , wherein the word meaning database has the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string arranged on a matrix for each string and has a mark indicating the existence of the intention given in a column corresponding to the combination of the vocabulary of the first part-of-speech and the vocabulary of the second part-of-speech having intentions.
8. A language model generation method, comprising the steps of:
creating a grammar model by abstracting a phrase necessary for conveying each intention included in a focused task;
collecting a corpus having a content that a speaker is likely to utter for an intention by automatically generating sentences consistent with each intention by using the grammar model; and
constructing a plurality of statistical language models corresponding to each intention by performing probabilistic estimation from each corpus with a statistical technique.
9. A computer program described in a computer readable format so as to execute a process for speech recognition on a computer, the program causing the computer to function as:
one or more intention extracting language models, in each of which an intention of a focused specific task is inherent;
an absorbing language model in which no intention of the task is inherent;
a language score calculating section that calculates a language score indicating a linguistic similarity between the content of an utterance and each of the intention extracting language models and the absorbing language model; and
a decoder that estimates an intention in the content of an utterance based on a language score of each of the language models calculated by the language score calculating section.
10. A computer program described in a computer readable format so as to execute a process for the generation of a language model on a computer, the program causing the computer to function as:
a word meaning database in which, for each intention of a focused specific task, a combination of an abstracted vocabulary of a first part-of-speech string and an abstracted vocabulary of a second part-of-speech string, and one or more words indicating the same meaning or a similar intention as the abstracted vocabularies, are registered by abstracting the vocabulary candidates of the first part-of-speech string and the second part-of-speech string that may appear in an utterance indicating the intention;
descriptive grammar model creating means for creating a descriptive grammar model indicating an intention based on the combination of the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string indicating the intention of the task and one or more words indicating the same meaning or a similar intention for the abstracted vocabularies registered in the word meaning database;
collecting means for collecting a corpus having a content that a speaker is likely to utter for an intention by automatically generating sentences consistent with each intention from the descriptive grammar model for the intention; and
language model creating means for creating a statistical language model in which each intention is inherent by subjecting the corpus collected for the intention to statistical processing.
11. A language model generation device, comprising:
a word meaning database in which, for each intention of a focused specific task, a combination of an abstracted vocabulary of a first part-of-speech string and an abstracted vocabulary of a second part-of-speech string, and one or more words indicating the same meaning or a similar intention as the abstracted vocabularies, are registered by abstracting the vocabulary candidates of the first part-of-speech string and the second part-of-speech string that may appear in an utterance indicating the intention;
a descriptive grammar model creating unit which creates a descriptive grammar model indicating an intention based on the combination of the abstracted vocabulary of the first part-of-speech string and the abstracted vocabulary of the second part-of-speech string indicating the intention of the task and one or more words indicating the same meaning or a similar intention for the abstracted vocabularies registered in the word meaning database;
a collecting unit which collects a corpus having a content that a speaker is likely to utter for an intention by automatically generating sentences consistent with each intention from the descriptive grammar model for the intention; and
a language model creating unit that creates a statistical language model in which each intention is inherent by subjecting the corpus collected for the intention to statistical processing.
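The language model generation flow recited above (abstracting phrases into a descriptive grammar, expanding the grammar into a corpus of sentences consistent with an intention, and estimating a statistical model from that corpus) can be sketched as follows. The grammar, vocabulary, and function names are illustrative assumptions, not the embodiment's actual data; a bigram count table stands in for the full statistical language model:

```python
import itertools
from collections import Counter

# Toy descriptive grammar for one intention ("turn on the television"):
# each slot holds interchangeable words with the same meaning, as would
# be registered in a word meaning database. Vocabulary is illustrative.
GRAMMAR = {
    "turn_on_tv": [["turn on", "switch on"], ["the"], ["television", "tv"]],
}

def generate_corpus(grammar, intention):
    """Expand the grammar into every sentence consistent with the intention."""
    return [" ".join(parts) for parts in itertools.product(*grammar[intention])]

def train_bigram_counts(corpus):
    """Collect bigram counts as the basis of a statistical language model."""
    counts = Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts

corpus = generate_corpus(GRAMMAR, "turn_on_tv")
print(len(corpus))  # -> 4 automatically generated sentences
model = train_bigram_counts(corpus)
```

One such table would be built per intention, yielding the plurality of intention extracting statistical language models; in practice the raw counts would be smoothed and normalized into probabilities.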
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009070992A JP2010224194A (en) | 2009-03-23 | 2009-03-23 | Speech recognition device and speech recognition method, language model generating device and language model generating method, and computer program |
JPP2009-070992 | 2009-03-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
US20100241418A1 true US20100241418A1 (en) | 2010-09-23 |
Family
ID=42738393
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/661,164 Abandoned US20100241418A1 (en) | 2009-03-23 | 2010-03-11 | Voice recognition device and voice recognition method, language model generating device and language model generating method, and computer program |
Country Status (3)
Country | Link |
---|---|
US (1) | US20100241418A1 (en) |
JP (1) | JP2010224194A (en) |
CN (1) | CN101847405B (en) |
Cited By (166)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100299138A1 (en) * | 2009-05-22 | 2010-11-25 | Kim Yeo Jin | Apparatus and method for language expression using context and intent awareness |
US20110218812A1 (en) * | 2010-03-02 | 2011-09-08 | Nilang Patel | Increasing the relevancy of media content |
US20120259620A1 (en) * | 2009-12-23 | 2012-10-11 | Upstream Mobile Marketing Limited | Message optimization |
US20130080162A1 (en) * | 2011-09-23 | 2013-03-28 | Microsoft Corporation | User Query History Expansion for Improving Language Model Adaptation |
US20130325535A1 (en) * | 2012-05-30 | 2013-12-05 | Majid Iqbal | Service design system and method of using same |
US20140019131A1 (en) * | 2012-07-13 | 2014-01-16 | Korea University Research And Business Foundation | Method of recognizing speech and electronic device thereof |
US20140365218A1 (en) * | 2013-06-07 | 2014-12-11 | Microsoft Corporation | Language model adaptation using result selection |
US9292488B2 (en) | 2014-02-01 | 2016-03-22 | Soundhound, Inc. | Method for embedding voice mail in a spoken utterance using a natural language processing computer system |
US9348809B1 (en) * | 2015-02-02 | 2016-05-24 | Linkedin Corporation | Modifying a tokenizer based on pseudo data for natural language processing |
US9390167B2 (en) | 2010-07-29 | 2016-07-12 | Soundhound, Inc. | System and methods for continuous audio matching |
US9449598B1 (en) * | 2013-09-26 | 2016-09-20 | Amazon Technologies, Inc. | Speech recognition with combined grammar and statistical language models |
US9507849B2 (en) | 2013-11-28 | 2016-11-29 | Soundhound, Inc. | Method for combining a query and a communication command in a natural language computer system |
US9548050B2 (en) | 2010-01-18 | 2017-01-17 | Apple Inc. | Intelligent automated assistant |
US9564123B1 (en) | 2014-05-12 | 2017-02-07 | Soundhound, Inc. | Method and system for building an integrated user profile |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
US9620104B2 (en) | 2013-06-07 | 2017-04-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9626955B2 (en) | 2008-04-05 | 2017-04-18 | Apple Inc. | Intelligent text-to-speech conversion |
US9633660B2 (en) | 2010-02-25 | 2017-04-25 | Apple Inc. | User profiling for voice input processing |
US9646614B2 (en) | 2000-03-16 | 2017-05-09 | Apple Inc. | Fast, language-independent method for user authentication by voice |
US9668024B2 (en) | 2014-06-30 | 2017-05-30 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US20180075842A1 (en) * | 2016-09-14 | 2018-03-15 | GM Global Technology Operations LLC | Remote speech recognition at a vehicle |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9953088B2 (en) | 2012-05-14 | 2018-04-24 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US9966068B2 (en) | 2013-06-08 | 2018-05-08 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US9971774B2 (en) | 2012-09-19 | 2018-05-15 | Apple Inc. | Voice-based media searching |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US9986419B2 (en) | 2014-09-30 | 2018-05-29 | Apple Inc. | Social reminders |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10079014B2 (en) | 2012-06-08 | 2018-09-18 | Apple Inc. | Name recognition system |
US10083690B2 (en) | 2014-05-30 | 2018-09-25 | Apple Inc. | Better resolution when referencing to concepts |
US10089072B2 (en) | 2016-06-11 | 2018-10-02 | Apple Inc. | Intelligent device arbitration and control |
US10102359B2 (en) | 2011-03-21 | 2018-10-16 | Apple Inc. | Device access using voice authentication |
US10108612B2 (en) | 2008-07-31 | 2018-10-23 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US10121165B1 (en) | 2011-05-10 | 2018-11-06 | Soundhound, Inc. | System and method for targeting content based on identified audio and multimedia |
CN108885618A (en) * | 2016-03-30 | 2018-11-23 | 三菱电机株式会社 | Intent estimation device and intention estimation method |
US10169329B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Exemplar-based natural language processing |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10185542B2 (en) | 2013-06-09 | 2019-01-22 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US20190114317A1 (en) * | 2017-10-13 | 2019-04-18 | Via Technologies, Inc. | Natural language recognizing apparatus and natural language recognizing method |
US10269345B2 (en) | 2016-06-11 | 2019-04-23 | Apple Inc. | Intelligent task discovery |
US10283110B2 (en) | 2009-07-02 | 2019-05-07 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US10297253B2 (en) | 2016-06-11 | 2019-05-21 | Apple Inc. | Application integration with a digital assistant |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10311871B2 (en) | 2015-03-08 | 2019-06-04 | Apple Inc. | Competing devices responding to voice triggers |
US10318871B2 (en) | 2005-09-08 | 2019-06-11 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10354011B2 (en) | 2016-06-09 | 2019-07-16 | Apple Inc. | Intelligent automated assistant in a home environment |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US10381016B2 (en) | 2008-01-03 | 2019-08-13 | Apple Inc. | Methods and apparatus for altering audio output signals |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10395270B2 (en) | 2012-05-17 | 2019-08-27 | Persado Intellectual Property Limited | System and method for recommending a grammar for a message campaign used by a message optimization system |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10431204B2 (en) | 2014-09-11 | 2019-10-01 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10438595B2 (en) | 2014-09-30 | 2019-10-08 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10453443B2 (en) | 2014-09-30 | 2019-10-22 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10460034B2 (en) | 2015-01-28 | 2019-10-29 | Mitsubishi Electric Corporation | Intention inference system and intention inference method |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
EP3564948A4 (en) * | 2017-11-02 | 2019-11-13 | Sony Corporation | Information processing device and information processing method |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10497365B2 (en) | 2014-05-30 | 2019-12-03 | Apple Inc. | Multi-command single utterance input method |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10504137B1 (en) | 2015-10-08 | 2019-12-10 | Persado Intellectual Property Limited | System, method, and computer program product for monitoring and responding to the performance of an ad |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10521466B2 (en) | 2016-06-11 | 2019-12-31 | Apple Inc. | Data driven natural language event detection and classification |
US10529332B2 (en) | 2015-03-08 | 2020-01-07 | Apple Inc. | Virtual assistant activation |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10699717B2 (en) | 2014-05-30 | 2020-06-30 | Apple Inc. | Intelligent assistant for home automation |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10714117B2 (en) | 2013-02-07 | 2020-07-14 | Apple Inc. | Voice trigger for a digital assistant |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10741185B2 (en) | 2010-01-18 | 2020-08-11 | Apple Inc. | Intelligent automated assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10748546B2 (en) | 2017-05-16 | 2020-08-18 | Apple Inc. | Digital assistant services based on device capabilities |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US10795541B2 (en) | 2009-06-05 | 2020-10-06 | Apple Inc. | Intelligent organization of tasks items |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10832283B1 (en) | 2015-12-09 | 2020-11-10 | Persado Intellectual Property Limited | System, method, and computer program for providing an instance of a promotional message to a user based on a predicted emotional response corresponding to user characteristics |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10930280B2 (en) | 2017-11-20 | 2021-02-23 | Lg Electronics Inc. | Device for providing toolkit for agent developer |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US10957310B1 (en) * | 2012-07-23 | 2021-03-23 | Soundhound, Inc. | Integrated programming framework for speech and text understanding with meaning parsing |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US11010127B2 (en) | 2015-06-29 | 2021-05-18 | Apple Inc. | Virtual assistant for media playback |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US11069336B2 (en) | 2012-03-02 | 2021-07-20 | Apple Inc. | Systems and methods for name pronunciation |
US11080012B2 (en) | 2009-06-05 | 2021-08-03 | Apple Inc. | Interface for a virtual digital assistant |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11127397B2 (en) | 2015-05-27 | 2021-09-21 | Apple Inc. | Device voice control |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US20210343292A1 (en) * | 2020-05-04 | 2021-11-04 | Lingua Robotica, Inc. | Techniques for converting natural speech to programming code |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11295730B1 (en) | 2014-02-27 | 2022-04-05 | Soundhound, Inc. | Using phonetic variants in a local context to improve natural language understanding |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US11350253B2 (en) | 2011-06-03 | 2022-05-31 | Apple Inc. | Active transport based notifications |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US20220366911A1 (en) * | 2021-05-17 | 2022-11-17 | Google Llc | Arranging and/or clearing speech-to-text content without a user providing express instructions |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11978436B2 (en) | 2022-06-03 | 2024-05-07 | Apple Inc. | Application vocabulary integration with a digital assistant |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101828273B1 (en) * | 2011-01-04 | 2018-02-14 | Samsung Electronics Co., Ltd. | Apparatus and method for voice command recognition based on combination of dialog models
KR101565658B1 (en) | 2012-11-28 | 2015-11-04 | POSTECH Academy-Industry Foundation | Method for dialog management using memory capacity and apparatus therefor
CN103474065A (en) * | 2013-09-24 | 2013-12-25 | 贵阳世纪恒通科技有限公司 | Method for determining and recognizing voice intentions based on automatic classification technology |
CN103458056B (en) * | 2013-09-24 | 2017-04-26 | 世纪恒通科技股份有限公司 | Speech intention judging system based on automatic classification technology for automatic outbound system |
CN103578464B (en) * | 2013-10-18 | 2017-01-11 | VIA Technologies, Inc. | Language model building method, speech recognition method and electronic device
CN103578465B (en) * | 2013-10-18 | 2016-08-17 | VIA Technologies, Inc. | Speech recognition method and electronic device
CN103677729B (en) * | 2013-12-18 | 2017-02-08 | Beijing Sogou Technology Development Co., Ltd. | Voice input method and system
DE112014007123T5 (en) * | 2014-10-30 | 2017-07-20 | Mitsubishi Electric Corporation | Dialogue control system and dialogue control method
JP6514503B2 (en) * | 2014-12-25 | 2019-05-15 | Clarion Co., Ltd. | Intention estimation device and intention estimation system
US9607616B2 (en) * | 2015-08-17 | 2017-03-28 | Mitsubishi Electric Research Laboratories, Inc. | Method for using a multi-scale recurrent neural network with pretraining for spoken language understanding tasks |
CN106486114A (en) * | 2015-08-28 | 2017-03-08 | Toshiba Corporation | Method and apparatus for improving a language model, and speech recognition method and device
CN106095791B (en) * | 2016-01-31 | 2019-08-09 | 长源动力(北京)科技有限公司 | A kind of abstract sample information searching system based on context |
US10229687B2 (en) * | 2016-03-10 | 2019-03-12 | Microsoft Technology Licensing, Llc | Scalable endpoint-dependent natural language understanding |
JP6636379B2 (en) * | 2016-04-11 | 2020-01-29 | Nippon Telegraph and Telephone Corporation | Identifier construction apparatus, method and program
CN106384594A (en) * | 2016-11-04 | 2017-02-08 | 湖南海翼电子商务股份有限公司 | On-vehicle terminal for voice recognition and method thereof |
KR20180052347A (en) | 2016-11-10 | 2018-05-18 | Samsung Electronics Co., Ltd. | Voice recognition apparatus and method
CN106710586B (en) * | 2016-12-27 | 2020-06-30 | 北京儒博科技有限公司 | Method and device for automatic switching of speech recognition engine |
JP6857581B2 (en) * | 2017-09-13 | 2021-04-14 | Hitachi, Ltd. | Growth interactive device
CN107908743B (en) * | 2017-11-16 | 2021-12-03 | Baidu Online Network Technology (Beijing) Co., Ltd. | Artificial intelligence application construction method and device
KR102209336B1 (en) * | 2017-11-20 | 2021-01-29 | LG Electronics Inc. | Toolkit providing device for agent developer
JP7058574B2 (en) * | 2018-09-10 | 2022-04-22 | Yahoo Japan Corporation | Information processing equipment, information processing methods, and programs
KR102017229B1 (en) * | 2019-04-15 | 2019-09-02 | MediaZen, Inc. | A text sentence automatic generating system based deep learning for improving infinity of speech pattern
CN112382279B (en) * | 2020-11-24 | 2021-09-14 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Voice recognition method and device, electronic equipment and storage medium
JP6954549B1 (en) * | 2021-06-15 | 2021-10-27 | Sopra Co., Ltd. | Automatic generators and programs for entities, intents and corpora
CN114120993A (en) * | 2021-07-15 | 2022-03-01 | 意欧斯物流科技(上海)有限公司 | A Repository-Based Speech-Semantic Interaction System |
WO2025004190A1 (en) * | 2023-06-27 | 2025-01-02 | Nippon Telegraph and Telephone Corporation | Intent extraction device, intent extraction method, and program
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6526380B1 (en) * | 1999-03-26 | 2003-02-25 | Koninklijke Philips Electronics N.V. | Speech recognition system having parallel large vocabulary recognition engines |
KR100812109B1 (en) * | 1999-10-19 | 2008-03-12 | Sony Electronics Inc. | Natural Language Interface Control System
JP3628245B2 (en) * | 2000-09-05 | 2005-03-09 | Nippon Telegraph and Telephone Corporation | Language model generation method, speech recognition method, and program recording medium thereof
US7395205B2 (en) * | 2001-02-13 | 2008-07-01 | International Business Machines Corporation | Dynamic language model mixtures with history-based buckets |
US6999931B2 (en) * | 2002-02-01 | 2006-02-14 | Intel Corporation | Spoken dialog system using a best-fit language model and best-fit grammar |
JP4581549B2 (en) * | 2004-08-10 | 2010-11-17 | Sony Corporation | Audio processing apparatus and method, recording medium, and program
US7634406B2 (en) * | 2004-12-10 | 2009-12-15 | Microsoft Corporation | System and method for identifying semantic intent from acoustic information |
JP4733436B2 (en) * | 2005-06-07 | 2011-07-27 | Nippon Telegraph and Telephone Corporation | Word / semantic expression group database creation method, speech understanding method, word / semantic expression group database creation device, speech understanding device, program, and storage medium
CN101034390A (en) * | 2006-03-10 | 2007-09-12 | NEC (China) Co., Ltd. | Apparatus and method for language model switching and adaptation
JPWO2007138875A1 (en) * | 2006-05-31 | 2009-10-01 | NEC Corporation | Word dictionary / language model creation system, method, program, and speech recognition system for speech recognition
JP2008064885A (en) * | 2006-09-05 | 2008-03-21 | Honda Motor Co Ltd | Voice recognition device, voice recognition method and voice recognition program |
JP5148532B2 (en) * | 2009-02-25 | 2013-02-20 | NTT DOCOMO, INC. | Topic determination device and topic determination method
- 2009-03-23 JP JP2009070992A patent/JP2010224194A/en not_active Ceased
- 2010-03-11 US US12/661,164 patent/US20100241418A1/en not_active Abandoned
- 2010-03-16 CN CN2010101358523A patent/CN101847405B/en not_active Expired - Fee Related
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5737734A (en) * | 1995-09-15 | 1998-04-07 | Infonautics Corporation | Query word relevance adjustment in a search of an information retrieval system |
US6381465B1 (en) * | 1999-08-27 | 2002-04-30 | Leap Wireless International, Inc. | System and method for attaching an advertisement to an SMS message for wireless transmission |
US20030154476A1 (en) * | 1999-12-15 | 2003-08-14 | Abbott Kenneth H. | Storing and recalling information to augment human memories |
US20020087525A1 (en) * | 2000-04-02 | 2002-07-04 | Abbott Kenneth H. | Soliciting information based on a computer user's context |
US7228275B1 (en) * | 2002-10-21 | 2007-06-05 | Toyota Infotechnology Center Co., Ltd. | Speech recognition system having multiple speech recognizers |
US20050182628A1 (en) * | 2004-02-18 | 2005-08-18 | Samsung Electronics Co., Ltd. | Domain-based dialog speech recognition method and apparatus |
US20060286527A1 (en) * | 2005-06-16 | 2006-12-21 | Charles Morel | Interactive teaching web application |
US20090048821A1 (en) * | 2005-07-27 | 2009-02-19 | Yahoo! Inc. | Mobile language interpreter with text to speech |
US20070099602A1 (en) * | 2005-10-28 | 2007-05-03 | Microsoft Corporation | Multi-modal device capable of automated actions |
US7778632B2 (en) * | 2005-10-28 | 2010-08-17 | Microsoft Corporation | Multi-modal device capable of automated actions |
US20100153321A1 (en) * | 2006-04-06 | 2010-06-17 | Yale University | Framework of hierarchical sensory grammars for inferring behaviors using distributed sensors |
US20080005053A1 (en) * | 2006-06-30 | 2008-01-03 | Microsoft Corporation | Communication-prompted user assistance |
US20080243501A1 (en) * | 2007-04-02 | 2008-10-02 | Google Inc. | Location-Based Responses to Telephone Requests |
US20090243998A1 (en) * | 2008-03-28 | 2009-10-01 | Nokia Corporation | Apparatus, method and computer program product for providing an input gesture indicator |
US20100222102A1 (en) * | 2009-02-05 | 2010-09-02 | Rodriguez Tony F | Second Screens and Widgets |
Cited By (243)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9646614B2 (en) | 2000-03-16 | 2017-05-09 | Apple Inc. | Fast, language-independent method for user authentication by voice |
US11928604B2 (en) | 2005-09-08 | 2024-03-12 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US10318871B2 (en) | 2005-09-08 | 2019-06-11 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US10381016B2 (en) | 2008-01-03 | 2019-08-13 | Apple Inc. | Methods and apparatus for altering audio output signals |
US9865248B2 (en) | 2008-04-05 | 2018-01-09 | Apple Inc. | Intelligent text-to-speech conversion |
US9626955B2 (en) | 2008-04-05 | 2017-04-18 | Apple Inc. | Intelligent text-to-speech conversion |
US10108612B2 (en) | 2008-07-31 | 2018-10-23 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US20100299138A1 (en) * | 2009-05-22 | 2010-11-25 | Kim Yeo Jin | Apparatus and method for language expression using context and intent awareness |
US8560301B2 (en) * | 2009-05-22 | 2013-10-15 | Samsung Electronics Co., Ltd. | Apparatus and method for language expression using context and intent awareness |
US10795541B2 (en) | 2009-06-05 | 2020-10-06 | Apple Inc. | Intelligent organization of tasks items |
US11080012B2 (en) | 2009-06-05 | 2021-08-03 | Apple Inc. | Interface for a virtual digital assistant |
US10283110B2 (en) | 2009-07-02 | 2019-05-07 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US10269028B2 (en) | 2009-12-23 | 2019-04-23 | Persado Intellectual Property Limited | Message optimization |
US9741043B2 (en) * | 2009-12-23 | 2017-08-22 | Persado Intellectual Property Limited | Message optimization |
US20120259620A1 (en) * | 2009-12-23 | 2012-10-11 | Upstream Mobile Marketing Limited | Message optimization |
US10706841B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Task flow identification based on user intent |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US12087308B2 (en) | 2010-01-18 | 2024-09-10 | Apple Inc. | Intelligent automated assistant |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10741185B2 (en) | 2010-01-18 | 2020-08-11 | Apple Inc. | Intelligent automated assistant |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US9548050B2 (en) | 2010-01-18 | 2017-01-17 | Apple Inc. | Intelligent automated assistant |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US10692504B2 (en) | 2010-02-25 | 2020-06-23 | Apple Inc. | User profiling for voice input processing |
US10049675B2 (en) | 2010-02-25 | 2018-08-14 | Apple Inc. | User profiling for voice input processing |
US9633660B2 (en) | 2010-02-25 | 2017-04-25 | Apple Inc. | User profiling for voice input processing |
US20110218812A1 (en) * | 2010-03-02 | 2011-09-08 | Nilang Patel | Increasing the relevancy of media content |
US8635058B2 (en) * | 2010-03-02 | 2014-01-21 | Nilang Patel | Increasing the relevancy of media content |
US9390167B2 (en) | 2010-07-29 | 2016-07-12 | Soundhound, Inc. | System and methods for continuous audio matching |
US10657174B2 (en) | 2010-07-29 | 2020-05-19 | Soundhound, Inc. | Systems and methods for providing identification information in response to an audio segment |
US10055490B2 (en) | 2010-07-29 | 2018-08-21 | Soundhound, Inc. | System and methods for continuous audio matching |
US10417405B2 (en) | 2011-03-21 | 2019-09-17 | Apple Inc. | Device access using voice authentication |
US10102359B2 (en) | 2011-03-21 | 2018-10-16 | Apple Inc. | Device access using voice authentication |
US12100023B2 (en) | 2011-05-10 | 2024-09-24 | Soundhound Ai Ip, Llc | Query-specific targeted ad delivery |
US10121165B1 (en) | 2011-05-10 | 2018-11-06 | Soundhound, Inc. | System and method for targeting content based on identified audio and multimedia |
US10832287B2 (en) | 2011-05-10 | 2020-11-10 | Soundhound, Inc. | Promotional content targeting based on recognized audio |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11350253B2 (en) | 2011-06-03 | 2022-05-31 | Apple Inc. | Active transport based notifications |
US20130080162A1 (en) * | 2011-09-23 | 2013-03-28 | Microsoft Corporation | User Query History Expansion for Improving Language Model Adaptation |
US9299342B2 (en) * | 2011-09-23 | 2016-03-29 | Microsoft Technology Licensing, Llc | User query history expansion for improving language model adaptation |
US20150325237A1 (en) * | 2011-09-23 | 2015-11-12 | Microsoft Technology Licensing, Llc | User query history expansion for improving language model adaptation |
US9129606B2 (en) * | 2011-09-23 | 2015-09-08 | Microsoft Technology Licensing, Llc | User query history expansion for improving language model adaptation |
US11069336B2 (en) | 2012-03-02 | 2021-07-20 | Apple Inc. | Systems and methods for name pronunciation |
US9953088B2 (en) | 2012-05-14 | 2018-04-24 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US10395270B2 (en) | 2012-05-17 | 2019-08-27 | Persado Intellectual Property Limited | System and method for recommending a grammar for a message campaign used by a message optimization system |
US20130325535A1 (en) * | 2012-05-30 | 2013-12-05 | Majid Iqbal | Service design system and method of using same |
US10079014B2 (en) | 2012-06-08 | 2018-09-18 | Apple Inc. | Name recognition system |
US20140019131A1 (en) * | 2012-07-13 | 2014-01-16 | Korea University Research And Business Foundation | Method of recognizing speech and electronic device thereof |
US11776533B2 (en) | 2012-07-23 | 2023-10-03 | Soundhound, Inc. | Building a natural language understanding application using a received electronic record containing programming code including an interpret-block, an interpret-statement, a pattern expression and an action statement |
US10996931B1 (en) | 2012-07-23 | 2021-05-04 | Soundhound, Inc. | Integrated programming framework for speech and text understanding with block and statement structure |
US10957310B1 (en) * | 2012-07-23 | 2021-03-23 | Soundhound, Inc. | Integrated programming framework for speech and text understanding with meaning parsing |
US9971774B2 (en) | 2012-09-19 | 2018-05-15 | Apple Inc. | Voice-based media searching |
US10714117B2 (en) | 2013-02-07 | 2020-07-14 | Apple Inc. | Voice trigger for a digital assistant |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US20140365218A1 (en) * | 2013-06-07 | 2014-12-11 | Microsoft Corporation | Language model adaptation using result selection |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
US9966060B2 (en) | 2013-06-07 | 2018-05-08 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9620104B2 (en) | 2013-06-07 | 2017-04-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US10657961B2 (en) | 2013-06-08 | 2020-05-19 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US9966068B2 (en) | 2013-06-08 | 2018-05-08 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US11048473B2 (en) | 2013-06-09 | 2021-06-29 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10185542B2 (en) | 2013-06-09 | 2019-01-22 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US11727219B2 (en) | 2013-06-09 | 2023-08-15 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10769385B2 (en) | 2013-06-09 | 2020-09-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US9449598B1 (en) * | 2013-09-26 | 2016-09-20 | Amazon Technologies, Inc. | Speech recognition with combined grammar and statistical language models |
US9507849B2 (en) | 2013-11-28 | 2016-11-29 | Soundhound, Inc. | Method for combining a query and a communication command in a natural language computer system |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US9601114B2 (en) | 2014-02-01 | 2017-03-21 | Soundhound, Inc. | Method for embedding voice mail in a spoken utterance using a natural language processing computer system |
US9292488B2 (en) | 2014-02-01 | 2016-03-22 | Soundhound, Inc. | Method for embedding voice mail in a spoken utterance using a natural language processing computer system |
US11295730B1 (en) | 2014-02-27 | 2022-04-05 | Soundhound, Inc. | Using phonetic variants in a local context to improve natural language understanding |
US12175964B2 (en) | 2014-05-12 | 2024-12-24 | Soundhound, Inc. | Deriving acoustic features and linguistic features from received speech audio |
US9564123B1 (en) | 2014-05-12 | 2017-02-07 | Soundhound, Inc. | Method and system for building an integrated user profile |
US10311858B1 (en) | 2014-05-12 | 2019-06-04 | Soundhound, Inc. | Method and system for building an integrated user profile |
US11030993B2 (en) | 2014-05-12 | 2021-06-08 | Soundhound, Inc. | Advertisement selection by linguistic classification |
US10417344B2 (en) | 2014-05-30 | 2019-09-17 | Apple Inc. | Exemplar-based natural language processing |
US10497365B2 (en) | 2014-05-30 | 2019-12-03 | Apple Inc. | Multi-command single utterance input method |
US10169329B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Exemplar-based natural language processing |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10083690B2 (en) | 2014-05-30 | 2018-09-25 | Apple Inc. | Better resolution when referencing to concepts |
US10714095B2 (en) | 2014-05-30 | 2020-07-14 | Apple Inc. | Intelligent assistant for home automation |
US10878809B2 (en) | 2014-05-30 | 2020-12-29 | Apple Inc. | Multi-command single utterance input method |
US10699717B2 (en) | 2014-05-30 | 2020-06-30 | Apple Inc. | Intelligent assistant for home automation |
US10657966B2 (en) | 2014-05-30 | 2020-05-19 | Apple Inc. | Better resolution when referencing to concepts |
US10904611B2 (en) | 2014-06-30 | 2021-01-26 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11838579B2 (en) | 2014-06-30 | 2023-12-05 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9668024B2 (en) | 2014-06-30 | 2017-05-30 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10431204B2 (en) | 2014-09-11 | 2019-10-01 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US9986419B2 (en) | 2014-09-30 | 2018-05-29 | Apple Inc. | Social reminders |
US10438595B2 (en) | 2014-09-30 | 2019-10-08 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10453443B2 (en) | 2014-09-30 | 2019-10-22 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10390213B2 (en) | 2014-09-30 | 2019-08-20 | Apple Inc. | Social reminders |
US10460034B2 (en) | 2015-01-28 | 2019-10-29 | Mitsubishi Electric Corporation | Intention inference system and intention inference method |
US9348809B1 (en) * | 2015-02-02 | 2016-05-24 | Linkedin Corporation | Modifying a tokenizer based on pseudo data for natural language processing |
CN108124477A (en) * | 2015-02-02 | 2018-06-05 | Microsoft Technology Licensing, LLC | Improving a tokenizer based on pseudo data for natural language processing
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US10529332B2 (en) | 2015-03-08 | 2020-01-07 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US10311871B2 (en) | 2015-03-08 | 2019-06-04 | Apple Inc. | Competing devices responding to voice triggers |
US10930282B2 (en) | 2015-03-08 | 2021-02-23 | Apple Inc. | Competing devices responding to voice triggers |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US11127397B2 (en) | 2015-05-27 | 2021-09-21 | Apple Inc. | Device voice control |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10681212B2 (en) | 2015-06-05 | 2020-06-09 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US11010127B2 (en) | 2015-06-29 | 2021-05-18 | Apple Inc. | Virtual assistant for media playback |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US10504137B1 (en) | 2015-10-08 | 2019-12-10 | Persado Intellectual Property Limited | System, method, and computer program product for monitoring and responding to the performance of an ad |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10354652B2 (en) | 2015-12-02 | 2019-07-16 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10832283B1 (en) | 2015-12-09 | 2020-11-10 | Persado Intellectual Property Limited | System, method, and computer program for providing an instance of a promotional message to a user based on a predicted emotional response corresponding to user characteristics |
US10942703B2 (en) | 2015-12-23 | 2021-03-09 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
CN108885618A (en) * | 2016-03-30 | 2018-11-23 | Mitsubishi Electric Corporation | Intention estimation device and intention estimation method
US20190005950A1 (en) * | 2016-03-30 | 2019-01-03 | Mitsubishi Electric Corporation | Intention estimation device and intention estimation method |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US11069347B2 (en) | 2016-06-08 | 2021-07-20 | Apple Inc. | Intelligent automated assistant for media exploration |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
US10354011B2 (en) | 2016-06-09 | 2019-07-16 | Apple Inc. | Intelligent automated assistant in a home environment |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10269345B2 (en) | 2016-06-11 | 2019-04-23 | Apple Inc. | Intelligent task discovery |
US10089072B2 (en) | 2016-06-11 | 2018-10-02 | Apple Inc. | Intelligent device arbitration and control |
US10521466B2 (en) | 2016-06-11 | 2019-12-31 | Apple Inc. | Data driven natural language event detection and classification |
US10580409B2 (en) | 2016-06-11 | 2020-03-03 | Apple Inc. | Application integration with a digital assistant |
US10942702B2 (en) | 2016-06-11 | 2021-03-09 | Apple Inc. | Intelligent device arbitration and control |
US10297253B2 (en) | 2016-06-11 | 2019-05-21 | Apple Inc. | Application integration with a digital assistant |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US20180075842A1 (en) * | 2016-09-14 | 2018-03-15 | GM Global Technology Operations LLC | Remote speech recognition at a vehicle |
US10553215B2 (en) | 2016-09-23 | 2020-02-04 | Apple Inc. | Intelligent automated assistant |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11656884B2 (en) | 2017-01-09 | 2023-05-23 | Apple Inc. | Application integration with a digital assistant |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10741181B2 (en) | 2017-05-09 | 2020-08-11 | Apple Inc. | User interface for correcting recognition errors |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10847142B2 (en) | 2017-05-11 | 2020-11-24 | Apple Inc. | Maintaining privacy of personal information |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US11599331B2 (en) | 2017-05-11 | 2023-03-07 | Apple Inc. | Maintaining privacy of personal information |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US11380310B2 (en) | 2017-05-12 | 2022-07-05 | Apple Inc. | Low-latency intelligent automated assistant |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10909171B2 (en) | 2017-05-16 | 2021-02-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US11217255B2 (en) | 2017-05-16 | 2022-01-04 | Apple Inc. | Far-field extension for digital assistant services |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10748546B2 (en) | 2017-05-16 | 2020-08-18 | Apple Inc. | Digital assistant services based on device capabilities |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US20190114317A1 (en) * | 2017-10-13 | 2019-04-18 | Via Technologies, Inc. | Natural language recognizing apparatus and natural language recognizing method |
US10635859B2 (en) * | 2017-10-13 | 2020-04-28 | Via Technologies, Inc. | Natural language recognizing apparatus and natural language recognizing method |
EP3564948A4 (en) * | 2017-11-02 | 2019-11-13 | Sony Corporation | Information processing device and information processing method |
US10930280B2 (en) | 2017-11-20 | 2021-02-23 | Lg Electronics Inc. | Device for providing toolkit for agent developer |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US11710482B2 (en) | 2018-03-26 | 2023-07-25 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11854539B2 (en) | 2018-05-07 | 2023-12-26 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11169616B2 (en) | 2018-05-07 | 2021-11-09 | Apple Inc. | Raise to speak |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US10984798B2 (en) | 2018-06-01 | 2021-04-20 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11431642B2 (en) | 2018-06-01 | 2022-08-30 | Apple Inc. | Variable latency device coordination |
US10720160B2 (en) | 2018-06-01 | 2020-07-21 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10504518B1 (en) | 2018-06-03 | 2019-12-10 | Apple Inc. | Accelerated task performance |
US10944859B2 (en) | 2018-06-03 | 2021-03-09 | Apple Inc. | Accelerated task performance |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11360739B2 (en) | 2019-05-31 | 2022-06-14 | Apple Inc. | User activity shortcut suggestions |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US20210343292A1 (en) * | 2020-05-04 | 2021-11-04 | Lingua Robotica, Inc. | Techniques for converting natural speech to programming code |
US11532309B2 (en) * | 2020-05-04 | 2022-12-20 | Austin Cox | Techniques for converting natural speech to programming code |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US20220366911A1 (en) * | 2021-05-17 | 2022-11-17 | Google Llc | Arranging and/or clearing speech-to-text content without a user providing express instructions |
US12033637B2 (en) * | 2021-05-17 | 2024-07-09 | Google Llc | Arranging and/or clearing speech-to-text content without a user providing express instructions |
US11978436B2 (en) | 2022-06-03 | 2024-05-07 | Apple Inc. | Application vocabulary integration with a digital assistant |
Also Published As
Publication number | Publication date |
---|---|
JP2010224194A (en) | 2010-10-07 |
CN101847405B (en) | 2012-10-24 |
CN101847405A (en) | 2010-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20100241418A1 (en) | Voice recognition device and voice recognition method, language model generating device and language model generating method, and computer program | |
US8566076B2 (en) | System and method for applying bridging models for robust and efficient speech to speech translation | |
US11227579B2 (en) | Data augmentation by frame insertion for speech data | |
Abushariah et al. | Phonetically rich and balanced text and speech corpora for Arabic language | |
Arslan et al. | A detailed survey of Turkish automatic speech recognition | |
Cucu et al. | Recent improvements of the SpeeD Romanian LVCSR system | |
Sasmal et al. | Isolated words recognition of Adi, a low-resource indigenous language of Arunachal Pradesh | |
Kayte et al. | Implementation of Marathi Language Speech Databases for Large Dictionary | |
JP4581549B2 (en) | Audio processing apparatus and method, recording medium, and program | |
AbuZeina et al. | Toward enhanced Arabic speech recognition using part of speech tagging | |
Patel et al. | An Automatic Speech Transcription System for Manipuri Language. | |
Nga et al. | A Survey of Vietnamese Automatic Speech Recognition | |
Mittal et al. | Speaker-independent automatic speech recognition system for mobile phone applications in Punjabi | |
Hsieh et al. | Acoustic and Textual Data Augmentation for Code-Switching Speech Recognition in Under-Resourced Language | |
Staš et al. | Recent advances in the statistical modeling of the Slovak language | |
Mon et al. | Building HMM-SGMM continuous automatic speech recognition on Myanmar web news | |
Antoniadis et al. | A mechanism for personalized Automatic Speech Recognition for less frequently spoken languages: the Greek case | |
Rista et al. | CASR: A Corpus for Albanian Speech Recognition | |
Sharif et al. | From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language | |
Ruiz Domingo et al. | FILENG: an automatic English subtitle generator from Filipino video clips using hidden Markov model | |
Chen et al. | Speech retrieval of Mandarin broadcast news via mobile devices. | |
Deng et al. | Recent progress of mandrain spontaneous speech recognition on mandrain conversation dialogue corpus | |
Arısoy et al. | Turkish speech recognition | |
Sung et al. | Deploying Google Search by Voice in Cantonese. | |
Kadyan et al. | Hindi dialect (Bangro) spoken language recognition (HD-SLR) system using Sphinx3 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: SONY CORPORATION, JAPAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MAEDA, YOSHINORI;HONDA, HITOSHI;MINAMINO, KATSUKI;REEL/FRAME:024121/0298 Effective date: 20100224 |
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |