JP6754184B2

JP6754184B2 - Voice recognition device and voice recognition method

Info

Publication number: JP6754184B2
Application number: JP2015239951A
Authority: JP
Inventors: 剛樹西川
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2014-12-26
Filing date: 2015-12-09
Publication date: 2020-09-09
Anticipated expiration: 2035-12-09
Also published as: US20160189715A1; JP2016126330A; US9966077B2; CN105741836B; CN105741836A

Description

The present disclosure relates to a voice recognition device and a voice recognition method that remove noise contained in voice information and perform voice recognition on the voice information from which the noise has been removed.

Functions for controlling a terminal by voice and for searching for keywords by voice have been studied. To realize these functions, a microphone has conventionally been mounted on the remote controller used to operate the terminal, and sound is picked up by that microphone. In recent years, to further improve convenience, techniques have been studied in which the terminal itself has a built-in microphone so that it can be operated even when the user speaks from a position away from the terminal. However, when the user speaks from a position away from the terminal, the difference in volume between the user's speech and unwanted sound (noise) disappears, and it becomes difficult for the terminal to recognize the utterance correctly. Conventionally, the voice recognition function of the terminal has been used to respond quickly to utterances concerning terminal operation, and the voice recognition function of the server has been used to respond to utterances concerning information searches, which require a dictionary with a very large vocabulary.

For example, in Patent Document 1, the server includes voice recognition means that prioritizes dictionary size, and the client includes voice recognition means that prioritizes speed.

Japanese Unexamined Patent Publication No. 2013-64777

However, Patent Document 1 has a problem in that, when the user speaks from a position away from the microphone, the voice contains noise and voice recognition is not performed accurately. Furthermore, Patent Document 1 does not distribute noise removal processing between the terminal and the server, and does not examine configurations or conditions for executing both noise removal processing and voice recognition processing.

The present disclosure has been made to solve the above problems, and its object is to provide a voice recognition device and a voice recognition method that can improve the accuracy of voice recognition in a high-noise environment and speed up voice recognition in a low-noise environment.

A voice recognition device according to one aspect of the present disclosure includes: a voice acquisition unit that acquires first voice information; a noise removal processing unit that removes noise contained in the first voice information acquired by the voice acquisition unit using a first removal method, and outputs the voice information from which the noise has been removed as second voice information; a voice recognition unit that performs voice recognition on the second voice information output by the noise removal processing unit, and outputs the voice recognition result as first voice recognition result information; a communication unit that transmits the first voice information acquired by the voice acquisition unit to a server, and receives from the server, as second voice recognition result information, a voice recognition result obtained at the server by removing the noise contained in the first voice information using a second removal method, which removes a larger amount of noise from the first voice information than the first removal method does, and by performing voice recognition on the resulting third voice information; and an arbitration unit that selects which of the first voice recognition result information output by the voice recognition unit and the second voice recognition result information received by the communication unit is to be output.

According to the present disclosure, in a high-noise environment, the accuracy of voice recognition can be improved by removing noise from the first voice information at a server that can remove a larger amount of noise from the first voice information than the voice recognition device can. In a quiet environment, voice recognition can be sped up by removing noise from the first voice information at the voice recognition device.

FIG. 1 is a diagram showing the overall configuration of a voice recognition system according to Embodiment 1 of the present disclosure.
FIG. 2 is a diagram showing the functional configuration of the voice recognition system according to Embodiment 1 of the present disclosure.
FIG. 3 is a flowchart showing an example of the operation of the voice recognition system according to Embodiment 1 of the present disclosure.
FIG. 4 is a diagram showing an example of a display screen that displays information indicating that voice recognition is not possible.
FIG. 5 is a diagram showing an example of a display screen that displays server transmission information.
FIG. 6 is a diagram showing an example of a display screen that displays transmission confirmation information.
FIG. 7 is a diagram showing the functional configuration of a voice recognition system according to a modification of Embodiment 1 of the present disclosure.
FIG. 8 is a diagram showing the functional configuration of a voice recognition system according to Embodiment 2 of the present disclosure.
FIG. 9 is a first flowchart showing an example of the operation of the voice recognition system according to Embodiment 2 of the present disclosure.
FIG. 10 is a second flowchart showing an example of the operation of the voice recognition system according to Embodiment 2 of the present disclosure.
FIG. 11 is a third flowchart showing an example of the operation of the voice recognition system according to Embodiment 2 of the present disclosure.
FIG. 12 is a first flowchart showing an example of the operation of a voice recognition system according to a modification of Embodiment 2 of the present disclosure.
FIG. 13 is a second flowchart showing an example of the operation of the voice recognition system according to the modification of Embodiment 2 of the present disclosure.
FIG. 14 is a diagram showing the functional configuration of a voice recognition system according to Embodiment 3 of the present disclosure.
FIG. 15 is a diagram showing the functional configuration of a voice recognition system according to Embodiment 4 of the present disclosure.
FIG. 16 is a diagram showing the functional configuration of a voice recognition system according to Embodiment 5 of the present disclosure.
FIG. 17 is a diagram showing the functional configuration of a voice recognition system according to a modification of Embodiment 5 of the present disclosure.

(Findings on which the present invention is based)
Patent Document 1 has a problem in that, when the user speaks from a position away from the microphone, the voice contains noise and voice recognition does not operate correctly. Furthermore, Patent Document 1 does not distribute noise removal processing between the terminal and the server, and does not examine configurations or conditions in which noise removal processing and voice recognition processing are used together.

The present disclosure has been made to solve the above problems, and provides a voice recognition device and a voice recognition method that can improve the accuracy of voice recognition in a high-noise environment and speed up voice recognition in a low-noise environment.

According to this configuration, the first voice information is acquired. Noise contained in the acquired first voice information is removed using the first removal method, and the voice information from which the noise has been removed is output as the second voice information. Voice recognition is performed on the output second voice information, and the voice recognition result is output as the first voice recognition result information. In addition, the acquired first voice information is transmitted to the server. At the server, the noise contained in the first voice information is removed using the second removal method, which removes a larger amount of noise from the first voice information than the first removal method does, and voice recognition is performed on the resulting third voice information. The voice recognition result is received from the server as the second voice recognition result information. Which of the output first voice recognition result information and the received second voice recognition result information is to be output is then selected.

Therefore, in a high-noise environment, the accuracy of voice recognition can be improved by removing noise at a server that can remove a larger amount of noise from the first voice information than the voice recognition device can. In a low-noise environment, voice recognition can be sped up by removing noise from the first voice information at the voice recognition device.
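The data flow described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `run_both_paths` and the callables passed to it are hypothetical stand-ins for the noise removal processing unit, the voice recognition unit, and the server round trip, and none of them come from the disclosure.

```python
from typing import Callable, Tuple


def run_both_paths(
    first_voice: bytes,
    device_denoise: Callable[[bytes], bytes],
    device_recognize: Callable[[bytes], str],
    server_denoise_and_recognize: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (first_result, second_result) for one piece of first voice information.

    All callables are hypothetical placeholders for the units in the text.
    """
    # Device path: first removal method, then on-device recognition.
    second_voice = device_denoise(first_voice)
    first_result = device_recognize(second_voice)

    # Server path: the unprocessed first voice information is transmitted.
    # The server applies its stronger second removal method and recognizes
    # the resulting third voice information.
    second_result = server_denoise_and_recognize(first_voice)

    return first_result, second_result
```

Both results would then be handed to the arbitration unit, which selects the one to output.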

Further, in the above voice recognition device, the voice recognition unit may calculate a first likelihood indicating the plausibility of the first voice recognition result information and output the calculated first likelihood to the arbitration unit; the communication unit may receive a second likelihood, calculated by the server, indicating the plausibility of the second voice recognition result information and output the received second likelihood to the arbitration unit; and the arbitration unit may select which of the first voice recognition result information and the second voice recognition result information is to be output based on at least one of the first likelihood and the second likelihood.

According to this configuration, the first likelihood indicating the plausibility of the first voice recognition result information is calculated, and the calculated first likelihood is output. In addition, the second likelihood, calculated by the server, indicating the plausibility of the second voice recognition result information is received, and the received second likelihood is output. Which of the first voice recognition result information and the second voice recognition result information is to be output is then selected based on at least one of the first likelihood and the second likelihood.

Therefore, because the voice recognition result to be output is selected based on likelihood, a more accurate voice recognition result can be output.

Further, in the above voice recognition device, the arbitration unit may output the first voice recognition result information when the first likelihood is greater than a predetermined first threshold; may output the second voice recognition result information when the first likelihood is less than or equal to the first threshold and the second likelihood is greater than a predetermined second threshold; and may output neither the first voice recognition result information nor the second voice recognition result information when the first likelihood is less than or equal to the first threshold and the second likelihood is less than or equal to the second threshold.

According to this configuration, the first voice recognition result information is output when the first likelihood is greater than the predetermined first threshold. The second voice recognition result information is output when the first likelihood is less than or equal to the first threshold and the second likelihood is greater than the predetermined second threshold. Neither the first voice recognition result information nor the second voice recognition result information is output when the first likelihood is less than or equal to the first threshold and the second likelihood is less than or equal to the second threshold.

Therefore, because the voice recognition result is selected by comparing the likelihoods with the thresholds, the voice recognition result to be output can be selected with a simpler configuration.
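The threshold comparison described above can be sketched as follows. The function and parameter names are hypothetical, and the threshold values in the usage note are arbitrary examples. The disclosure only says the thresholds are predetermined.

```python
from typing import Optional


def arbitrate(
    first_result: str,
    first_likelihood: float,
    second_result: str,
    second_likelihood: float,
    first_threshold: float,
    second_threshold: float,
) -> Optional[str]:
    """Pick the result to output, or None when neither likelihood is high enough."""
    # Prefer the device-side result when its likelihood exceeds the first threshold.
    if first_likelihood > first_threshold:
        return first_result
    # Otherwise fall back to the server-side result if it exceeds the second threshold.
    if second_likelihood > second_threshold:
        return second_result
    # Both likelihoods are at or below their thresholds: output nothing.
    return None
```

For example, with thresholds of 0.7 and 0.6, a device likelihood of 0.5 and a server likelihood of 0.8 would select the server-side result.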

Further, the above voice recognition device may further include an utterance section detection unit that detects an utterance section, in which the user has spoken, in the first voice information acquired by the voice acquisition unit. When no utterance section is detected by the utterance section detection unit, the noise removal processing unit may neither remove the noise contained in the first voice information nor output the second voice information, and the communication unit may refrain from transmitting the first voice information to the server.

According to this configuration, the utterance section in which the user has spoken is detected in the acquired first voice information. When no utterance section is detected, the noise contained in the first voice information is not removed, the second voice information is not output, and the first voice information is not transmitted to the server.

Therefore, outside an utterance section spoken by the user, the noise contained in the first voice information is not removed, the second voice information is not output, and the first voice information is not transmitted to the server. This prevents unnecessary computation and prevents unnecessary information from being transmitted.
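The gating behaviour described above can be sketched as follows. The names `handle_frame`, `is_utterance`, `denoise`, and `send_to_server` are hypothetical. How the utterance section is actually detected is not specified in this passage, so the detector is passed in as a callable.

```python
from typing import Callable, List, Optional, Sequence


def handle_frame(
    first_voice: Sequence[float],
    is_utterance: Callable[[Sequence[float]], bool],
    denoise: Callable[[Sequence[float]], List[float]],
    send_to_server: Callable[[Sequence[float]], None],
) -> Optional[List[float]]:
    """Return the second voice information, or None when no utterance is detected."""
    if not is_utterance(first_voice):
        # No utterance section: skip noise removal and skip transmission.
        return None
    # Utterance detected: remove noise on the device and send the
    # original first voice information to the server.
    second_voice = denoise(first_voice)
    send_to_server(first_voice)
    return second_voice
```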

Further, the above voice recognition device may further include an utterance duration measurement unit that, when the utterance section is detected by the utterance section detection unit, measures the utterance duration, which is the duration of the detected utterance section. When the utterance section is detected by the utterance section detection unit, the noise removal processing unit may remove the noise contained in the first voice information, the communication unit may transmit the first voice information within the utterance section to the server, and the arbitration unit may select which of the first voice recognition result information output by the voice recognition unit and the second voice recognition result information received by the communication unit is to be output, using at least information on the length of the utterance duration.

According to this configuration, when the utterance section is detected, the utterance duration, which is the duration of the detected utterance section, is measured. When the utterance section is detected, the noise contained in the first voice information is removed, and the first voice information within the utterance section is transmitted to the server. Which of the output first voice recognition result information and the received second voice recognition result information is to be output is then selected using at least information on the length of the utterance duration.

Therefore, because the voice recognition result is selected using at least information on the length of the utterance duration, the voice recognition result to be output can be selected with a simpler configuration.

Further, in the above voice recognition device, when the utterance duration is longer than a predetermined length, the arbitration unit may make the weight by which the second likelihood, indicating the plausibility of the second voice recognition result information, is multiplied larger than the weight by which the first likelihood, indicating the plausibility of the first voice recognition result information, is multiplied.

According to this configuration, when the utterance duration is longer than the predetermined length, the weight by which the second likelihood, indicating the plausibility of the second voice recognition result information, is multiplied is made larger than the weight by which the first likelihood, indicating the plausibility of the first voice recognition result information, is multiplied. When the utterance duration is long, the user is likely to be giving an advanced voice instruction containing many words. Therefore, when the utterance duration is long, misrecognition can be prevented by adopting the voice recognition result output from the server.
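A minimal sketch of this duration-dependent weighting follows. The 3-second cutoff, the weight of 1.5, and the rule of picking the larger weighted likelihood (with ties going to the device-side result) are all assumptions made for illustration. The disclosure states only that the second likelihood's weight is raised above the first's when the utterance is longer than a predetermined length.

```python
def select_by_duration(
    first_result: str,
    first_likelihood: float,
    second_result: str,
    second_likelihood: float,
    utterance_duration_s: float,
    long_utterance_s: float = 3.0,  # assumed "predetermined length"
    server_weight: float = 1.5,     # assumed boosted weight for the server result
) -> str:
    """Choose between device and server results using weighted likelihoods."""
    first_weight = 1.0
    # Raise the server-side weight only for long utterances.
    second_weight = server_weight if utterance_duration_s > long_utterance_s else 1.0

    first_score = first_likelihood * first_weight
    second_score = second_likelihood * second_weight
    # Assumption: ties go to the device-side result.
    return first_result if first_score >= second_score else second_result
```

With likelihoods of 0.8 (device) and 0.7 (server), a 1-second utterance keeps the device result, while a 5-second utterance switches to the server result because 0.7 × 1.5 exceeds 0.8.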

Further, in the above voice recognition device, the communication unit may receive the third voice information from the server and output the received third voice information to the voice recognition unit; the voice recognition unit may perform voice recognition on the third voice information received by the communication unit and output the voice recognition result as fourth voice recognition result information; the communication unit may transmit the second voice information output by the noise removal processing unit to the server, receive from the server, as third voice recognition result information, the voice recognition result obtained by performing voice recognition on the second voice information, and output the received third voice recognition result information to the arbitration unit; and the arbitration unit may select which of the first voice recognition result information output by the voice recognition unit, the second voice recognition result information received by the communication unit, the third voice recognition result information received by the communication unit, and the fourth voice recognition result information output by the voice recognition unit is to be output.

According to this configuration, the third voice information is received from the server, and the received third voice information is output to the voice recognition unit. Voice recognition is performed on the received third voice information, and the voice recognition result is output as the fourth voice recognition result information. In addition, the output second voice information is transmitted to the server, the voice recognition result obtained by performing voice recognition on the second voice information is received from the server as the third voice recognition result information, and the received third voice recognition result information is output to the arbitration unit. Which of the output first voice recognition result information, the received second voice recognition result information, the received third voice recognition result information, and the output fourth voice recognition result information is to be output is then selected.

Therefore, one of the following four results is output: the first voice recognition result, for which both noise removal processing and voice recognition processing are performed by the voice recognition device; the second voice recognition result, for which both noise removal processing and voice recognition processing are performed by the server; the third voice recognition result, for which noise removal processing is performed by the voice recognition device and voice recognition processing is performed by the server; and the fourth voice recognition result, for which noise removal processing is performed by the server and voice recognition processing is performed by the voice recognition device. The optimal voice recognition result can thus be obtained according to the state of the ambient sound and the voice recognition performance.

Further, in the above voice recognition device, the voice recognition unit may calculate a first likelihood indicating the plausibility of the first voice recognition result information and output the calculated first likelihood to the arbitration unit; the communication unit may receive a second likelihood, calculated by the server, indicating the plausibility of the second voice recognition result information and output the received second likelihood to the arbitration unit; the communication unit may receive a third likelihood, calculated by the server, indicating the plausibility of the third voice recognition result information and output the received third likelihood to the arbitration unit; the voice recognition unit may calculate a fourth likelihood indicating the plausibility of the fourth voice recognition result information and output the calculated fourth likelihood to the arbitration unit; and the arbitration unit may select which of the first voice recognition result information, the second voice recognition result information, the third voice recognition result information, and the fourth voice recognition result information is to be output based on at least one of the first likelihood, the second likelihood, the third likelihood, and the fourth likelihood.

According to this configuration, the first likelihood indicating the plausibility of the first voice recognition result information is calculated, and the calculated first likelihood is output. The second likelihood, calculated by the server, indicating the plausibility of the second voice recognition result information is received, and the received second likelihood is output. The third likelihood, calculated by the server, indicating the plausibility of the third voice recognition result information is received, and the received third likelihood is output. The fourth likelihood indicating the plausibility of the fourth voice recognition result information is calculated, and the calculated fourth likelihood is output. Which of the first, second, third, and fourth voice recognition result information is to be output is then selected based on at least one of the first likelihood, the second likelihood, the third likelihood, and the fourth likelihood.
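One simple way to select among the four results is to take the candidate with the highest likelihood, as sketched below. The text only requires that the selection be based on at least one of the four likelihoods, so the highest-likelihood rule, the function name `select_best`, and the labels used as dictionary keys are assumptions made for illustration.

```python
from typing import Dict, Tuple


def select_best(candidates: Dict[str, Tuple[str, float]]) -> Tuple[str, str]:
    """Return (label, recognized_text) of the candidate with the highest likelihood.

    `candidates` maps a label such as "first" to (recognized_text, likelihood).
    """
    best_label = max(candidates, key=lambda label: candidates[label][1])
    return best_label, candidates[best_label][0]


# Which side performed each stage, following the four combinations in the text.
candidates = {
    "first": ("turn on tv", 0.62),   # device noise removal, device recognition
    "second": ("turn on TV", 0.91),  # server noise removal, server recognition
    "third": ("turn on TV", 0.85),   # device noise removal, server recognition
    "fourth": ("turn on tea", 0.40), # server noise removal, device recognition
}
```

With these example values, `select_best(candidates)` picks the `"second"` entry.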

According to this configuration, the utterance section in which the user has spoken is detected in the acquired voice information. When no utterance section is detected, the noise contained in the first voice information is not removed and the first voice information is not transmitted to the server.

Further, the above voice recognition device may further include an utterance duration measurement unit that, when the utterance section is detected by the utterance section detection unit, measures the utterance duration, which is the duration of the detected utterance section. When the utterance section is detected by the utterance section detection unit, the noise removal processing unit may remove the noise contained in the first voice information, the communication unit may transmit the first voice information within the utterance section to the server, and the arbitration unit may select which of the first voice recognition result information, the second voice recognition result information, the third voice recognition result information, and the fourth voice recognition result information is to be output, using at least information on the length of the utterance duration.

According to this configuration, when the utterance section is detected, the utterance duration, which is the duration of the detected utterance section, is measured. When the utterance section is detected, the noise contained in the first voice information is removed, and the first voice information within the utterance section is transmitted to the server. Which of the first, second, third, and fourth voice recognition result information is to be output is then selected using at least information on the length of the utterance duration.

また、上記の音声認識装置において、前記調停部は、前記発話継続時間が所定の長さより長い場合に、前記第２の音声認識結果情報の尤もらしさを示す第２の尤度及び前記第３の音声認識結果情報の尤もらしさを示す第３の尤度に乗算する重み付けを、前記第１の音声認識結果情報の尤もらしさを示す第１の尤度及び前記第４の音声認識結果情報の尤もらしさを示す第４の尤度に乗算する重み付けよりも上げてもよい。 Further, in the above voice recognition device, when the utterance duration is longer than a predetermined length, the arbitration unit may make the weights by which the second likelihood (indicating the likelihood of the second voice recognition result information) and the third likelihood (indicating the likelihood of the third voice recognition result information) are multiplied higher than the weights by which the first likelihood (indicating the likelihood of the first voice recognition result information) and the fourth likelihood (indicating the likelihood of the fourth voice recognition result information) are multiplied.

この構成によれば、発話継続時間が所定の長さより長い場合に、第２の音声認識結果情報の尤もらしさを示す第２の尤度及び第３の音声認識結果情報の尤もらしさを示す第３の尤度に乗算する重み付けが、第１の音声認識結果情報の尤もらしさを示す第１の尤度及び第４の音声認識結果情報の尤もらしさを示す第４の尤度に乗算する重み付けよりも上げられる。発話継続時間が長い場合、単語数が多い高度な音声指示を行っている可能性が高い。そのため、発話継続時間が長い場合、サーバによって算出される音声認識結果を採用することにより、誤認識を防止することができる。 According to this configuration, when the utterance duration is longer than a predetermined length, the weights by which the second likelihood (indicating the likelihood of the second voice recognition result information) and the third likelihood (indicating the likelihood of the third voice recognition result information) are multiplied are made higher than the weights by which the first likelihood (indicating the likelihood of the first voice recognition result information) and the fourth likelihood (indicating the likelihood of the fourth voice recognition result information) are multiplied. When the utterance duration is long, the user is likely to be giving an advanced voice instruction containing many words. Therefore, when the utterance duration is long, erroneous recognition can be prevented by adopting a voice recognition result calculated by the server.

また、上記の音声認識装置において、前記調停部は、前記発話継続時間が所定の長さより長い場合に、前記第２の尤度に乗算する重み付けを、前記第３の尤度に乗算する重み付けよりも上げてもよい。 Further, in the above voice recognition device, when the utterance duration is longer than a predetermined length, the arbitration unit may make the weight by which the second likelihood is multiplied higher than the weight by which the third likelihood is multiplied.

この構成によれば、発話継続時間が所定の長さより長い場合に、第２の尤度に乗算する重み付けが、第３の尤度に乗算する重み付けよりも上げられる。 According to this configuration, when the utterance duration is longer than a predetermined length, the weight to be multiplied by the second likelihood is higher than the weight to be multiplied by the third likelihood.

したがって、サーバによりノイズ除去処理及び音声認識処理が行われた第２の音声認識結果情報が、音声認識装置によりノイズ除去処理が行われてサーバにより音声認識処理が行われた第３の音声認識結果情報よりもより高い優先順位が与えられるので、より誤認識を防止することができる。 Therefore, the second voice recognition result information, for which the server performed both the noise removal processing and the voice recognition processing, is given a higher priority than the third voice recognition result information, for which the voice recognition device performed the noise removal processing and the server performed the voice recognition processing, so erroneous recognition can be prevented more reliably.
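The duration-dependent weighting described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the disclosure: the `Result` type, the function name `arbitrate`, the 1.5-second threshold, and the concrete weight values are all hypothetical, and the only properties taken from the text are that, for a long utterance, the weights for the second and third likelihoods exceed those for the first and fourth, and the weight for the second exceeds that for the third.

```python
from dataclasses import dataclass


@dataclass
class Result:
    text: str
    likelihood: float  # recognizer confidence; the [0, 1] range is an assumption


def arbitrate(results, duration_s, long_threshold_s=1.5,
              weights_long=(1.0, 1.4, 1.2, 1.0)):
    """Pick one of the four recognition results.

    results maps the result number (1..4) to a Result.
    long_threshold_s and weights_long are hypothetical values chosen only
    so that w2 > w3 > w1 == w4 holds for long utterances.
    """
    keys = (1, 2, 3, 4)
    if duration_s > long_threshold_s:
        weights = dict(zip(keys, weights_long))
    else:
        # The text does not fix the weights for short utterances;
        # equal weights are assumed here.
        weights = {k: 1.0 for k in keys}
    best = max(results, key=lambda k: results[k].likelihood * weights[k])
    return best, results[best]
```

With equal weights a short utterance simply takes the result with the highest raw likelihood, while a long utterance shifts the choice toward the server-recognized results.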

本開示の他の局面に係る音声認識装置は、第１の音声情報を取得する音声取得部と、前記音声取得部によって取得された前記第１の音声情報に含まれるノイズを第１の除去方式を用いて除去し、前記ノイズを除去した音声情報を第２の音声情報として出力するノイズ除去処理部と、前記音声取得部によって取得された前記第１の音声情報をサーバへ送信し、前記サーバにおいて前記第１の除去方式により前記第１の音声情報から除去されるノイズの量よりもより多くの量のノイズを前記第１の音声情報から除去する第２の除去方式を用いて前記第１の音声情報に含まれるノイズが除去された第３の音声情報を前記サーバから受信する通信部と、前記ノイズ除去処理部によって出力された前記第２の音声情報に対して音声認識を行い、音声認識結果を第１の音声認識結果情報として出力するとともに、前記通信部によって受信された前記第３の音声情報に対して音声認識を行い、音声認識結果を第２の音声認識結果情報として出力する音声認識部と、前記音声認識部によって出力された前記第１の音声認識結果情報と前記第２の音声認識結果情報とのうちのいずれを出力するかを選択する調停部と、を備える。 A voice recognition device according to another aspect of the present disclosure includes: a voice acquisition unit that acquires first voice information; a noise removal processing unit that removes noise contained in the first voice information acquired by the voice acquisition unit by using a first removal method and outputs the noise-removed voice information as second voice information; a communication unit that transmits the first voice information acquired by the voice acquisition unit to a server and receives, from the server, third voice information obtained in the server by removing the noise contained in the first voice information with a second removal method that removes a larger amount of noise from the first voice information than the first removal method does; a voice recognition unit that performs voice recognition on the second voice information output by the noise removal processing unit and outputs the result as first voice recognition result information, and that performs voice recognition on the third voice information received by the communication unit and outputs the result as second voice recognition result information; and an arbitration unit that selects which of the first voice recognition result information and the second voice recognition result information output by the voice recognition unit to output.

この構成によれば、第１の音声情報が取得される。取得された第１の音声情報に含まれるノイズが第１の除去方式を用いて除去され、ノイズが除去された音声情報が第２の音声情報として出力される。取得された第１の音声情報がサーバへ送信され、サーバにおいて第１の除去方式により前記第１の音声情報から除去されるノイズの量よりもより多くの量のノイズを前記第１の音声情報から除去する第２の除去方式を用いて第１の音声情報に含まれるノイズが除去された第３の音声情報がサーバから受信される。出力された第２の音声情報に対して音声認識が行われ、音声認識結果が第１の音声認識結果情報として出力されるとともに、受信された第３の音声情報に対して音声認識が行われ、音声認識結果が第２の音声認識結果情報として出力される。出力された第１の音声認識結果情報と第２の音声認識結果情報とのうちのいずれを出力するかが選択される。 According to this configuration, the first voice information is acquired. The noise contained in the acquired first voice information is removed by using the first removal method, and the noise-removed voice information is output as the second voice information. The acquired first voice information is transmitted to the server, and third voice information is received from the server. The third voice information is obtained in the server by removing the noise contained in the first voice information with the second removal method, which removes a larger amount of noise from the first voice information than the first removal method does. Voice recognition is performed on the output second voice information, and the voice recognition result is output as the first voice recognition result information. Voice recognition is also performed on the received third voice information, and the voice recognition result is output as the second voice recognition result information. Which of the first voice recognition result information and the second voice recognition result information is to be output is then selected.

したがって、高騒音環境では、音声認識装置よりも多くの量のノイズを除去することが可能なサーバにおいてノイズを除去することで、音声認識の正確性を向上させることができ、静音環境では音声認識装置においてノイズを除去することで、音声認識の高速化を実現することができる。 Therefore, in a high noise environment, the accuracy of voice recognition can be improved by removing noise in a server that can remove a larger amount of noise than a voice recognition device, and voice recognition in a quiet environment. By removing noise in the device, it is possible to realize high-speed voice recognition.

本開示の他の局面に係る音声認識方法は、通信部、ノイズ除去処理部、音声認識部及び調停部を備え、端末によって取得された音声情報に対して音声認識を行うサーバにおける音声認識方法であって、前記通信部が、前記端末によって取得された第１の音声情報を受信し、前記ノイズ除去処理部が、受信した前記第１の音声情報に含まれるノイズを第１の除去方式を用いて除去し、前記ノイズを除去した音声情報を第２の音声情報として出力し、前記音声認識部が、前記第２の音声情報に対して音声認識を行い、音声認識結果を第１の音声認識結果情報として出力し、前記通信部が、前記端末において、前記第１の除去方式により前記第１の音声情報から除去されるノイズの量よりも少ない量のノイズを除去する第２の除去方式を用いて前記第１の音声情報に含まれるノイズが除去され、前記ノイズが除去された第３の音声情報に対して音声認識が行われた結果である、音声認識結果を第２の音声認識結果情報として前記端末から受信し、前記調停部が、前記第１の音声認識結果情報と前記第２の音声認識結果情報とのうちのいずれを出力するかを選択する。 A voice recognition method according to another aspect of the present disclosure is a voice recognition method in a server that includes a communication unit, a noise removal processing unit, a voice recognition unit, and an arbitration unit and that performs voice recognition on voice information acquired by a terminal. The communication unit receives first voice information acquired by the terminal. The noise removal processing unit removes noise contained in the received first voice information by using a first removal method and outputs the noise-removed voice information as second voice information. The voice recognition unit performs voice recognition on the second voice information and outputs the voice recognition result as first voice recognition result information. The communication unit receives, from the terminal, second voice recognition result information, which is the result of voice recognition performed on third voice information obtained in the terminal by removing the noise contained in the first voice information with a second removal method that removes a smaller amount of noise than the first removal method removes from the first voice information. The arbitration unit selects which of the first voice recognition result information and the second voice recognition result information to output.

この構成によれば、第１の音声情報が受信される。受信された第１の音声情報に含まれるノイズが第１の除去方式を用いて除去され、ノイズが除去された音声情報が第２の音声情報として出力される。出力された第２の音声情報に対して音声認識が行われ、音声認識結果が第１の音声認識結果情報として出力される。また、端末において第１の除去方式により前記第１の音声情報から除去されるノイズの量よりも少ない量のノイズを除去する第２の除去方式を用いて第１の音声情報に含まれるノイズが除去され、ノイズが除去された第３の音声情報に対して音声認識が行われた結果である、音声認識結果が第２の音声認識結果情報として端末から受信される。出力された第１の音声認識結果情報と、受信された第２の音声認識結果情報とのうちのいずれを出力するかが選択される。 According to this configuration, the first voice information is received. The noise contained in the received first voice information is removed by using the first removal method, and the noise-removed voice information is output as the second voice information. Voice recognition is performed on the output second voice information, and the voice recognition result is output as the first voice recognition result information. In addition, second voice recognition result information is received from the terminal. It is the result of voice recognition performed on third voice information obtained in the terminal by removing the noise contained in the first voice information with the second removal method, which removes a smaller amount of noise than the first removal method removes from the first voice information. Which of the output first voice recognition result information and the received second voice recognition result information is to be output is then selected.

したがって、高騒音環境では、音声認識装置よりも多くの量のノイズを第１の音声情報から除去することが可能なサーバにおいてノイズを除去することで、音声認識の正確性を向上させることができ、静音環境では音声認識装置においてノイズを第１の音声情報から除去することで、音声認識の高速化を実現することができる。 Therefore, in a noisy environment, the accuracy of speech recognition can be improved by removing the noise in a server that can remove more noise from the first speech information than the speech recognition device. In a silent environment, the voice recognition device can increase the speed of voice recognition by removing noise from the first voice information.

以下添付図面を参照しながら、本開示の実施の形態について説明する。なお、以下の実施の形態は、本開示を具体化した一例であって、本開示の技術的範囲を限定するものではない。 Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. The following embodiments are examples that embody the present disclosure, and do not limit the technical scope of the present disclosure.

（実施の形態１） (Embodiment 1)
図１は、本開示の実施の形態１に係る音声認識システムの全体構成を示す図である。 FIG. 1 is a diagram showing an overall configuration of a voice recognition system according to the first embodiment of the present disclosure.

図１に示す音声認識システムは、端末１００及びサーバ２００を備える。端末１００は、ネットワーク（例えば、インターネット）３００を介してサーバ２００と互いに通信可能に接続されている。 The voice recognition system shown in FIG. 1 includes a terminal 100 and a server 200. The terminal 100 is communicably connected to the server 200 via a network (for example, the Internet) 300.

端末１００は、例えば、家庭内に配置されたテレビ又はエアコンなどを制御する機器である。端末１００は、所定の言語で発話された音声の認識を行う。また、端末１００は、音声を認識し、音声認識の結果に基づいて家庭内に配置されたテレビ又はエアコンを制御する。 The terminal 100 is, for example, a device for controlling a television or an air conditioner arranged in a home. The terminal 100 recognizes the voice spoken in a predetermined language. Further, the terminal 100 recognizes the voice and controls the television or the air conditioner arranged in the home based on the result of the voice recognition.

端末１００は、例えば、制御対象の機器（例えば、家庭内に配置されたテレビ又はエアコン）と別体であってもよいし、制御対象の機器に含まれていてもよい。 The terminal 100 may be a separate body from, for example, a device to be controlled (for example, a television or an air conditioner arranged in a home), or may be included in the device to be controlled.

端末１００は、通信部１０１、マイク１０２、スピーカ１０３、制御部１０４、メモリ１０５及び表示部１０６を備える。なお、マイク１０２、スピーカ１０３及び表示部１０６は、端末１００に内蔵されていなくてもよい。 The terminal 100 includes a communication unit 101, a microphone 102, a speaker 103, a control unit 104, a memory 105, and a display unit 106. The microphone 102, the speaker 103, and the display unit 106 do not have to be built in the terminal 100.

通信部１０１は、ネットワーク３００を介してサーバ２００に情報を送信するとともに、ネットワーク３００を介してサーバ２００から情報を受信する。通信部１０１のネットワーク３００への接続方法に関しては問わない。マイク１０２は、周囲の音を収集し、音声情報を取得する。スピーカ１０３は、音声を出力する。 The communication unit 101 transmits information to the server 200 via the network 300, and receives information from the server 200 via the network 300. The method of connecting the communication unit 101 to the network 300 does not matter. The microphone 102 collects ambient sounds and acquires voice information. The speaker 103 outputs sound.

制御部１０４は、例えば、ＣＰＵ（中央演算処理装置）を有し、後述するメモリ１０５に格納された制御用のプログラムをＣＰＵが実行することにより、制御部１０４として機能する。制御部１０４は、例えば、通信部１０１によって受信された様々なデータ（情報）を処理し、端末１００内の各構成の動作を制御する。 The control unit 104 has, for example, a CPU (central processing unit), and functions as the control unit 104 when the CPU executes a control program stored in a memory 105, which will be described later. The control unit 104 processes, for example, various data (information) received by the communication unit 101, and controls the operation of each configuration in the terminal 100.

メモリ１０５は、例えば、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）又はＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）などであり、通信部１０１によって受信されたデータ（情報）、制御部１０４によって演算されたデータ（情報）、又は制御用のプログラム等を格納する。表示部１０６は、例えば液晶表示装置であり、種々の情報を表示する。 The memory 105 is, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), or the like, and includes data (information) received by the communication unit 101 and data calculated by the control unit 104. (Information) or a control program, etc. is stored. The display unit 106 is, for example, a liquid crystal display device, and displays various information.

また、端末１００は、音声認識の結果である言葉又は文章を別の言語の言葉又は文章に翻訳する翻訳部（図示せず）を有してもよい。翻訳部が翻訳した結果は、例えば表示部１０６に表示してもよい。また、翻訳部が翻訳した結果は、例えば、端末１００の制御対象である家庭内に配置されたテレビなどの機器の表示画面に表示をしてもよい。 Further, the terminal 100 may have a translation unit (not shown) that translates a word or sentence resulting from voice recognition into a word or sentence in another language. The result translated by the translation unit may be displayed on the display unit 106, for example. Further, the result translated by the translation unit may be displayed on the display screen of a device such as a television arranged in the home, which is the control target of the terminal 100, for example.

サーバ２００は、通信部２０１、制御部２０２及びメモリ２０３を備える。 The server 200 includes a communication unit 201, a control unit 202, and a memory 203.

通信部２０１は、ネットワーク３００を介して端末１００に情報を送信するとともに、ネットワーク３００を介して端末１００から情報を受信する。 The communication unit 201 transmits information to the terminal 100 via the network 300, and receives information from the terminal 100 via the network 300.

制御部２０２は、例えば、ＣＰＵを有し、後述するメモリ２０３に格納された制御用のプログラムをＣＰＵが実行することにより、制御部２０２として機能する。制御部２０２は、例えば、通信部２０１によって受信された様々なデータ（情報）を処理し、サーバ２００内の各構成の動作を制御する。 The control unit 202 has, for example, a CPU, and functions as the control unit 202 when the CPU executes a control program stored in the memory 203 described later. The control unit 202 processes, for example, various data (information) received by the communication unit 201, and controls the operation of each configuration in the server 200.

メモリ２０３は、例えば、ＲＯＭ、ＲＡＭ又はＨＤＤなどであり、通信部２０１によって受信されたデータ（情報）、制御部２０２によって処理されたデータ（情報）、又は制御用のプログラム等を格納する。 The memory 203 is, for example, a ROM, a RAM, an HDD, or the like, and stores data (information) received by the communication unit 201, data (information) processed by the control unit 202, a control program, or the like.

図２は、本開示の実施の形態１における音声認識システムの機能構成を示す図である。図２に示すように、音声認識システムは、端末１００及びサーバ２００を備える。端末１００は、音声取得部１１、第１の収音処理部１２、第１の音声認識部１３及び調停部１４を備える。サーバ２００は、第２の収音処理部２１及び第２の音声認識部２２を備える。 FIG. 2 is a diagram showing a functional configuration of the voice recognition system according to the first embodiment of the present disclosure. As shown in FIG. 2, the voice recognition system includes a terminal 100 and a server 200. The terminal 100 includes a voice acquisition unit 11, a first sound collection processing unit 12, a first voice recognition unit 13, and an arbitration unit 14. The server 200 includes a second sound collection processing unit 21 and a second voice recognition unit 22.

なお、音声取得部１１は、マイク１０２によって実現され、第１の収音処理部１２、第１の音声認識部１３及び調停部１４は、制御部１０４によって実現される。また、第２の収音処理部２１及び第２の音声認識部２２は、制御部２０２によって実現される。 The voice acquisition unit 11 is realized by the microphone 102, and the first sound collection processing unit 12, the first voice recognition unit 13, and the arbitration unit 14 are realized by the control unit 104. Further, the second sound collection processing unit 21 and the second voice recognition unit 22 are realized by the control unit 202.

音声取得部１１は、第１の音声情報を取得する。ここで、音声情報とは、例えば音声の信号波形であるが、信号波形を周波数分析した音声の特徴量であっても構わない。不図示の通信部１０１は、音声取得部１１によって取得された第１の音声情報をサーバ２００へ送信する。サーバ２００の通信部２０１は、端末１００によって送信された第１の音声情報を受信する。 The voice acquisition unit 11 acquires the first voice information. Here, the voice information is, for example, a voice signal waveform, but it may be a feature amount of voice obtained by frequency-analyzing the signal waveform. The communication unit 101 (not shown) transmits the first voice information acquired by the voice acquisition unit 11 to the server 200. The communication unit 201 of the server 200 receives the first voice information transmitted by the terminal 100.

第１の収音処理部１２は、音声取得部１１によって取得された第１の音声情報に含まれるノイズを第１の除去方式を用いて除去し、ノイズを除去した音声情報を第２の音声情報として出力する。 The first sound collection processing unit 12 removes the noise contained in the first voice information acquired by the voice acquisition unit 11 by using the first removal method, and outputs the noise-removed voice information as the second voice information.

第１の音声認識部１３は、第１の収音処理部１２によって出力された第２の音声情報に対して音声認識を行い、音声認識結果を第１の音声認識結果情報として出力する。第１の音声認識部１３は、音声認識を行った際の第１の音声認識結果情報の尤もらしさを示す第１の尤度を算出し、算出した第１の尤度を第１の音声認識結果情報とともに調停部１４に出力する。 The first voice recognition unit 13 performs voice recognition on the second voice information output by the first sound collection processing unit 12, and outputs the voice recognition result as the first voice recognition result information. The first voice recognition unit 13 calculates a first likelihood indicating the likelihood of the first voice recognition result information when voice recognition is performed, and the calculated first likelihood is used as the first voice recognition. It is output to the arbitration unit 14 together with the result information.

第１の音声認識部１３は、第１の収音処理部１２によってノイズが除去された第２の音声情報に対する音声認識を行う。第１の音声認識部１３は、予め記憶された音響モデル及び言語モデルと、端末用辞書とを参照して、第２の音声情報に対する音声認識を行う。音声認識結果は、第２の音声情報を音声認識した結果の文字列である、複数の単語から構成される文字列データを含む。第１の尤度は、第２の音声情報の音声認識結果（つまり、第１の音声認識結果情報）の尤もらしさを示す。具体的には第１の尤度は、例えば、文字列データ全体の尤もらしさ、または文字列データを構成する各単語の尤もらしさを示す。 The first voice recognition unit 13 performs voice recognition for the second voice information from which noise has been removed by the first sound collection processing unit 12. The first voice recognition unit 13 refers to the acoustic model and the language model stored in advance and the terminal dictionary, and performs voice recognition for the second voice information. The voice recognition result includes character string data composed of a plurality of words, which is a character string of the result of voice recognition of the second voice information. The first likelihood indicates the likelihood of the speech recognition result (that is, the first speech recognition result information) of the second speech information. Specifically, the first likelihood indicates, for example, the likelihood of the entire character string data or the likelihood of each word constituting the character string data.

第１の音声認識部１３は、第２の音声情報から得られる発話内容と、端末用辞書に含まれる複数の語彙のそれぞれとの一致する度合い（尤度）を計算する。第１の音声認識部１３は、発話した内容と、最も一致する度合いの高い語彙を端末用辞書に含まれる語彙の中から選択し、選択した語彙を音声認識結果に含める。 The first voice recognition unit 13 calculates the degree of agreement (likelihood) between the utterance content obtained from the second voice information and each of the plurality of vocabularies included in the terminal dictionary. The first voice recognition unit 13 selects a vocabulary having the highest degree of matching with the spoken content from the vocabularies included in the terminal dictionary, and includes the selected vocabulary in the voice recognition result.

第１の音声認識部１３は、発話内容に複数の単語が含まれる場合、各単語に対して一致する度合いの最も高い語彙を選択し、選択した語彙を音声認識結果に含める。 When a plurality of words are included in the utterance content, the first speech recognition unit 13 selects the vocabulary having the highest degree of matching for each word, and includes the selected vocabulary in the speech recognition result.

第１の音声認識部１３は、選択した語彙に対応する尤度を第１の尤度とする。 The first speech recognition unit 13 sets the likelihood corresponding to the selected vocabulary as the first likelihood.

または、第１の音声認識部１３は、音声認識結果に複数の語彙が含まれる場合、各語彙に対応する尤度に基づいて、複数の語彙全体に対する尤度を算出し、算出した尤度を第１の尤度としてもよい。 Alternatively, when the voice recognition result includes a plurality of vocabularies, the first voice recognition unit 13 calculates the likelihood for the entire plurality of vocabularies based on the likelihood corresponding to each vocabulary, and calculates the calculated likelihood. It may be the first likelihood.

第１の尤度の値は、第１の音声認識部１３が選択する語彙と、発話内容との一致する度合いが高い程、高くなる。 The value of the first likelihood becomes higher as the degree of matching between the vocabulary selected by the first voice recognition unit 13 and the utterance content is higher.
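The vocabulary selection and likelihood calculation described above can be illustrated with the toy sketch below. It rests on loud assumptions: string similarity from Python's `difflib.SequenceMatcher` stands in for the degree of match that a real recognizer would compute from acoustic and language models, and the geometric mean is just one possible way to combine per-word likelihoods into a likelihood for the whole string, since the text does not specify the combination rule. The function names and the example dictionary are hypothetical.

```python
import math
from difflib import SequenceMatcher


def best_match(word, dictionary):
    """Return the dictionary entry that best matches `word` and its score.

    SequenceMatcher.ratio() is a stand-in for the recognizer's degree of
    match (likelihood); it is not the method of the disclosure.
    """
    scored = [(SequenceMatcher(None, word, entry).ratio(), entry)
              for entry in dictionary]
    score, entry = max(scored)
    return entry, score


def recognize(words, dictionary):
    """Select the best entry per word and derive an overall likelihood."""
    if not words:
        raise ValueError("utterance contains no words")
    picks = [best_match(w, dictionary) for w in words]
    text = " ".join(entry for entry, _ in picks)
    scores = [score for _, score in picks]
    # Assumed combination rule: geometric mean of the per-word scores.
    overall = math.exp(sum(math.log(max(s, 1e-9)) for s in scores) / len(scores))
    return text, scores, overall
```

For example, `recognize(["volum", "up"], ["volume", "up", "down", "power"])` selects "volume" and "up"; either the per-word scores or the combined value could then serve as the first likelihood.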

そして、第１の音声認識部１３は、音声認識結果を第１の音声認識結果情報として調停部１４へ出力する。また、第１の音声認識部１３は、第１の尤度を調停部１４へ出力する。 Then, the first voice recognition unit 13 outputs the voice recognition result to the arbitration unit 14 as the first voice recognition result information. Further, the first voice recognition unit 13 outputs the first likelihood to the arbitration unit 14.

端末用辞書は、認識対象の単語が登録されてリスト化されたものであり、端末１００に記憶されている。端末用辞書には、例えば、端末１００の動作を制御するための単語が主に含まれる。端末用辞書には、例えば、端末１００が家庭内に配置された機器を制御するための単語が含まれていてもよい。 The terminal dictionary is a list in which words to be recognized are registered and are stored in the terminal 100. The terminal dictionary mainly contains, for example, words for controlling the operation of the terminal 100. The terminal dictionary may contain, for example, words for the terminal 100 to control a device arranged in the home.

第２の収音処理部２１は、通信部２０１によって受信された第１の音声情報に含まれるノイズを、第１の除去方式よりも高いレベル（または、より多い量）のノイズを除去する第２の除去方式を用いて除去する。 The second sound collection processing unit 21 removes the noise contained in the first voice information received by the communication unit 201 by using a second removal method, which removes a higher level (or a larger amount) of noise than the first removal method.

逆に言えば、第１の収音処理部１２は、第１の音声情報に含まれるノイズを、第２の除去方式よりも低いレベル（または、より少ない量）のノイズを除去する第１の除去方式を用いて除去する。 Conversely, the first sound collection processing unit 12 removes the noise contained in the first voice information by using the first removal method, which removes a lower level (or a smaller amount) of noise than the second removal method.

第２の収音処理部２１は、第２の除去方式を用いて第１の音声情報からノイズを除去した音声情報を第３の音声情報として出力する。第２の収音処理部２１は、第１の収音処理部１２によるノイズ除去量よりも多い量のノイズを第１の音声情報から除去する。 The second sound collection processing unit 21 outputs the voice information obtained by removing noise from the first voice information as the third voice information by using the second removal method. The second sound collecting processing unit 21 removes a larger amount of noise from the first voice information than the amount of noise removed by the first sound collecting processing unit 12.

第２の音声認識部２２は、第２の収音処理部２１によって出力された第３の音声情報に対して音声認識を行い、音声認識結果を第２の音声認識結果情報として出力する。第２の音声認識部２２は、音声認識を行った際の第２の音声認識結果情報の尤もらしさを示す第２の尤度を算出し、算出した第２の尤度を第２の音声認識結果情報とともに通信部２０１に出力する。通信部２０１は、第２の音声認識部２２によって出力された第２の音声認識結果情報および第２の尤度を端末１００へ送信する。端末１００の通信部１０１は、サーバ２００によって送信された第２の音声認識結果情報を受信する。通信部１０１は、サーバ２００によって音声認識を行った際に算出された第２の音声認識結果情報の尤もらしさを示す第２の尤度を受信し、受信した第２の尤度を調停部１４に出力する。 The second voice recognition unit 22 performs voice recognition on the third voice information output by the second sound collection processing unit 21, and outputs the voice recognition result as the second voice recognition result information. The second voice recognition unit 22 calculates a second likelihood indicating the likelihood of the second voice recognition result information obtained by the voice recognition, and outputs the calculated second likelihood to the communication unit 201 together with the second voice recognition result information. The communication unit 201 transmits the second voice recognition result information and the second likelihood output by the second voice recognition unit 22 to the terminal 100. The communication unit 101 of the terminal 100 receives the second voice recognition result information transmitted by the server 200. The communication unit 101 also receives the second likelihood, which indicates the likelihood of the second voice recognition result information and was calculated when the server 200 performed voice recognition, and outputs the received second likelihood to the arbitration unit 14.

第２の音声認識部２２は、第２の収音処理部２１によってノイズが除去された第３の音声情報に対する音声認識を行う。第２の音声認識部２２は、予め記憶された音響モデル及び言語モデルと、サーバ用辞書とを参照して、第３の音声情報に対する音声認識を行う。音声認識結果は、第３の音声情報を音声認識した結果の文字列である、複数の単語から構成される文字列データを含む。第２の尤度は、第３の音声情報の音声認識結果（つまり第２の音声認識結果情報）の尤もらしさを示す。具体的には第２の尤度は、例えば、文字列データ全体の尤もらしさ、または文字列データを構成する各単語の尤もらしさを示す。 The second voice recognition unit 22 performs voice recognition for the third voice information from which noise has been removed by the second sound collection processing unit 21. The second voice recognition unit 22 refers to the sound model and the language model stored in advance and the server dictionary, and performs voice recognition for the third voice information. The voice recognition result includes character string data composed of a plurality of words, which is a character string of the result of voice recognition of the third voice information. The second likelihood indicates the likelihood of the speech recognition result (that is, the second speech recognition result information) of the third speech information. Specifically, the second likelihood indicates, for example, the likelihood of the entire character string data or the likelihood of each word constituting the character string data.

第２の音声認識部２２は、第３の音声情報から得られる発話内容と、サーバ用辞書に含まれる複数の語彙のそれぞれとの一致する度合い（尤度）を計算する。第２の音声認識部２２は、発話した内容と、最も一致する度合いの高い語彙をサーバ用辞書に含まれる語彙の中から選択し、選択した語彙を音声認識結果に含める。 The second voice recognition unit 22 calculates the degree of agreement (likelihood) between the utterance content obtained from the third voice information and each of the plurality of vocabularies included in the server dictionary. The second voice recognition unit 22 selects the vocabulary having the highest degree of matching with the spoken content from the vocabularies included in the server dictionary, and includes the selected vocabulary in the voice recognition result.

第２の音声認識部２２は、発話内容に複数の単語が含まれる場合、各単語に対して一致する度合いの最も高い語彙を選択し、選択した語彙を音声認識結果に含める。 When the utterance content includes a plurality of words, the second voice recognition unit 22 selects the vocabulary having the highest degree of matching for each word, and includes the selected vocabulary in the voice recognition result.

第２の音声認識部２２は、選択した語彙に対応する尤度を第２の尤度とする。 The second speech recognition unit 22 sets the likelihood corresponding to the selected vocabulary as the second likelihood.

または、第２の音声認識部２２は、音声認識結果に複数の語彙が含まれる場合、各語彙に対応する尤度に基づいて、複数の語彙全体に対する尤度を算出し、算出した尤度を第２の尤度としてもよい。 Alternatively, when the voice recognition result includes a plurality of vocabularies, the second voice recognition unit 22 calculates the likelihood for the entire plurality of vocabularies based on the likelihood corresponding to each vocabulary, and calculates the calculated likelihood. It may be the second likelihood.

第２の尤度の値は、第２の音声認識部２２が選択する語彙と、発話内容との一致する度合いが高い程、高くなる。 The value of the second likelihood becomes higher as the degree of matching between the vocabulary selected by the second voice recognition unit 22 and the utterance content is higher.

そして、第２の音声認識部２２は、音声認識結果を第２の音声認識結果情報として端末１００へ通信部２０１を介して送信する。第２の音声認識部２２は、第２の尤度を端末１００へ通信部２０１を介して送信する。 Then, the second voice recognition unit 22 transmits the voice recognition result as the second voice recognition result information to the terminal 100 via the communication unit 201. The second voice recognition unit 22 transmits the second likelihood to the terminal 100 via the communication unit 201.

また、通信部１０１は、受信した第２の音声認識結果情報および第２の尤度を調停部１４に出力する。 Further, the communication unit 101 outputs the received second voice recognition result information and the second likelihood to the arbitration unit 14.

サーバ用辞書は、認識対象の単語が登録されてリスト化されたものであり、サーバ２００に記憶されている。サーバ用辞書には、端末１００の動作を制御するための単語だけでなく、種々の検索キーワードなどが含まれる。サーバ用辞書には、端末１００が機器を制御するための情報が含まれていてもよい。サーバ用辞書の語彙数は、端末用辞書の語彙数よりも多い。サーバ用辞書の語彙数が例えば十万〜数十万語であるのに対し、端末用辞書の語彙数は例えば数十〜数百語である。 The server dictionary is a list of words to be recognized registered and stored in the server 200. The server dictionary includes not only words for controlling the operation of the terminal 100 but also various search keywords and the like. The server dictionary may include information for the terminal 100 to control the device. The number of vocabularies in the server dictionary is larger than the number of vocabularies in the terminal dictionary. The number of vocabularies in the server dictionary is, for example, 100,000 to hundreds of thousands of words, while the number of vocabularies in the terminal dictionary is, for example, tens to hundreds of words.

ここで、第１の収音処理部１２と第２の収音処理部２１との差異について説明する。第１の収音処理部１２は、複数の音声信号のパワー又は相関を用いた信号処理によりノイズを除去する。一方、第２の収音処理部２１は、上記の信号処理に加えて、音声を示す信号又はノイズを示す信号を統計的にモデル化し、分離する信号の確率的な尤もらしさを用いてノイズを除去する。 Here, the difference between the first sound collection processing unit 12 and the second sound collection processing unit 21 will be described. The first sound collection processing unit 12 removes noise by signal processing that uses the power or correlation of a plurality of voice signals. In contrast, the second sound collection processing unit 21, in addition to the above signal processing, statistically models a signal indicating voice or a signal indicating noise and removes noise by using the probabilistic likelihood of the signals to be separated.
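As a rough illustration of what "signal processing using the correlation of a plurality of voice signals" could look like on the terminal side, the sketch below aligns two microphone channels by cross-correlation and averages them (a simple delay-and-sum scheme). The text does not name a specific algorithm, so delay-and-sum is an assumption chosen for illustration, and the function names and the `max_lag` parameter are hypothetical. The statistical, model-based separation performed on the server side is not covered here.

```python
def estimate_delay(ref, other, max_lag):
    """Lag (in samples) by which `other` trails `ref`.

    The lag is chosen to maximize the cross-correlation of the two channels.
    """
    def xcorr(lag):
        return sum(ref[i] * other[i + lag]
                   for i in range(len(ref))
                   if 0 <= i + lag < len(other))
    return max(range(-max_lag, max_lag + 1), key=xcorr)


def delay_and_sum(ref, other, max_lag=8):
    """Align `other` to `ref` and average the two channels.

    Speech arriving from the target direction adds coherently, while
    uncorrelated noise is attenuated by the averaging.
    """
    lag = estimate_delay(ref, other, max_lag)
    out = []
    for i, x in enumerate(ref):
        j = i + lag
        y = other[j] if 0 <= j < len(other) else 0.0
        out.append(0.5 * (x + y))
    return lag, out
```

This kind of processing needs no trained model, which is consistent with the lighter-weight role the text assigns to the terminal.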

モデル化においては、第２の収音処理部２１が受信する第１の音声情報（音声信号）を発話者の音声に対応する音声信号、ノイズに対応する信号に分離するためにパラメータを事前に決定する必要がある。 In modeling, parameters are set in advance in order to separate the first voice information (voice signal) received by the second sound collection processing unit 21 into a voice signal corresponding to the speaker's voice and a signal corresponding to noise. You need to decide.

例えば、以下の処理を事前に行う。まず、予めモデル化に必要なパラメータを決めておく。そして、騒音が大きい環境下において、発話者が発話することにより得られる第１の音声情報に上述のモデルを適用し、ノイズに対応する信号を除去する処理を行い、この処理により得られる音声信号の評価を行う。 For example, the following processing is performed in advance. First, the parameters required for modeling are determined. Then, the above model is applied to the first voice information obtained when the speaker speaks in a noisy environment, a process of removing the signal corresponding to noise is performed, and the voice signal obtained by this process is evaluated.

または、騒音が大きい環境下において、発話者が発話することにより得られる第１の音声情報に対し、第１の収音処理部１２と同じ処理を行い、第１の音声情報からノイズを除去した音声信号に対し、上述のモデルを適用し、ノイズに対応する信号を除去する処理を行い、この処理により得られる音声信号の評価を行うのでもよい。 Alternatively, the first voice information obtained when a speaker speaks in a noisy environment may first be subjected to the same processing as that of the first sound collection processing unit 12. The above model may then be applied to the resulting noise-reduced voice signal to remove the signal corresponding to noise, and the voice signal obtained by this processing may be evaluated.

この処理により得られる音声信号に対する評価値が、予め定めた評価値よりも低ければ、上述のパラメータを修正し、再度、騒音が大きい環境下において、発話者が発話することにより得られる第１の音声情報からノイズに対応する信号を除去する処理、音声信号の評価を行う。 If the evaluation value of the voice signal obtained by this processing is lower than a predetermined evaluation value, the above parameters are modified, and the processing of removing the signal corresponding to noise from the first voice information obtained when the speaker speaks in a noisy environment, and the evaluation of the resulting voice signal, are performed again.

この処理により得られる音声信号に対する評価値が、予め定めた評価値よりも高ければ、上述の音声信号を得るために用いたパラメータを第２の収音処理部２１の処理に用いるパラメータ（事前学習されたパラメータ）として保持する。 If the evaluation value of the voice signal obtained by this processing is higher than the predetermined evaluation value, the parameters used to obtain that voice signal are retained as the parameters to be used in the processing of the second sound collection processing unit 21 (the pre-learned parameters).

そして、事前学習されたパラメータを用いて、音声を示す信号又はノイズを示す信号を統計的にモデル化する。第２の収音処理部２１は、事前学習されたパラメータを保持している。 Then, using the pre-learned parameters, a signal indicating voice or a signal indicating noise is statistically modeled. The second sound collecting processing unit 21 holds the pre-learned parameters.
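
The pre-learning procedure described above can be sketched as a simple evaluate-and-adjust loop. The following Python sketch is only an illustration under stated assumptions: the names `pretrain_parameters`, `denoise`, `evaluate`, and `adjust` are hypothetical, the parameter is reduced to a single number, and the description does not specify how the parameters are modified or how the voice signal is scored.

```python
from typing import Callable, List


def pretrain_parameters(
    noisy_speech: List[float],
    initial_params: float,
    denoise: Callable[[List[float], float], List[float]],
    evaluate: Callable[[List[float]], float],
    adjust: Callable[[float], float],
    target_score: float,
    max_rounds: int = 100,
) -> float:
    """Repeat noise removal and evaluation until the score exceeds the target.

    All callables are placeholders (assumptions): the description only says
    that the parameters are modified while the evaluation value is too low.
    """
    params = initial_params
    for _ in range(max_rounds):
        cleaned = denoise(noisy_speech, params)   # apply the model, remove noise
        if evaluate(cleaned) > target_score:      # evaluation value high enough
            return params                         # keep as pre-learned parameters
        params = adjust(params)                   # otherwise modify and retry
    raise RuntimeError("no parameters reached the target evaluation value")
```

The `max_rounds` guard is an addition for safety in the sketch; the description itself does not mention a limit on the number of repetitions.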

事前学習されたパラメータは、騒音が大きい環境下において、発話者が発話したとき、端末１００により取得される第１の音声情報に含まれる音声を示す信号又はノイズを示す信号を統計的にモデル化し、分離する信号の確率的な尤もらしさを用いてノイズを除去するのに適したパラメータとなっている。 The pre-learned parameters are thus suited to statistically modeling the signal representing voice or the signal representing noise contained in the first voice information that the terminal 100 acquires when a speaker speaks in a noisy environment, and to removing noise using the probabilistic likelihood of the signals to be separated.

実際に音声認識が必要となった場合、第２の収音処理部２１は、事前学習されたパラメータを用いて、音声を示す信号又はノイズを示す信号を統計的にモデル化し、分離する信号の確率的な尤もらしさを用いてノイズを第１の音声情報から除去する。 When voice recognition is actually required, the second sound collection processing unit 21 statistically models a signal indicating voice or a signal indicating noise using pre-learned parameters, and separates the signals. Noise is removed from the first speech information using probabilistic plausibility.

このとき、第２の収音処理部２１は、端末１００から取得される第１の音声情報を用いて、事前学習されたパラメータを必要に応じて更新してもよい。 At this time, the second sound collecting processing unit 21 may update the pre-learned parameters as necessary by using the first voice information acquired from the terminal 100.

このようにすることで、事前学習されたパラメータが発話者が発話している環境により適合したパラメータに更新される。 By doing so, the pre-learned parameters are updated to the parameters more suitable for the environment in which the speaker is speaking.
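
As a rough illustration of removing noise "using the probabilistic likelihood of the signals to be separated", the sketch below models the log energy of speech frames and of noise frames with one Gaussian each and mutes every frame that is more likely to be noise. This is an assumption made for illustration only: the description does not name a particular statistical model, and `Gaussian`, `frame_log_energy`, and `suppress_noise` are hypothetical names. The means and variances stand in for the pre-learned parameters.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    """One-dimensional Gaussian; mean and var play the role of learned parameters."""
    mean: float
    var: float

    def log_likelihood(self, x: float) -> float:
        return -0.5 * (math.log(2 * math.pi * self.var) + (x - self.mean) ** 2 / self.var)


def frame_log_energy(frame: List[float]) -> float:
    """Log of the mean squared amplitude (small offset avoids log(0))."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-12)


def suppress_noise(
    frames: List[List[float]], speech_model: Gaussian, noise_model: Gaussian
) -> List[List[float]]:
    """Keep frames that are more likely speech; zero out frames more likely noise."""
    output = []
    for frame in frames:
        x = frame_log_energy(frame)
        if speech_model.log_likelihood(x) > noise_model.log_likelihood(x):
            output.append(list(frame))
        else:
            output.append([0.0] * len(frame))
    return output
```

A practical system would use far richer features and models than frame energy; the point of the sketch is only the comparison of two likelihoods computed from parameters that were fixed in advance.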

一般的な知見として、音声信号のパワー又は相関を用いた信号処理により第１の音声情報からノイズを除去する方式よりも、音声を示す信号又はノイズを示す信号を統計的にモデル化し、分離する信号の確率的な尤もらしさを用いて第１の音声情報からノイズを除去する方式の方が除去できるノイズの量は多い。 As a general finding, a method that statistically models the signal representing voice or the signal representing noise and removes noise from the first voice information using the probabilistic likelihood of the signals to be separated can remove more noise than a method that removes noise from the first voice information by signal processing using the power or correlation of the voice signals.

当然のことながら、音声信号のパワー又は相関を用いた信号処理により第１の音声情報からノイズを除去した後、この方式によりノイズが除去された第１の音声情報に含まれる音声を示す信号又はノイズを示す信号を統計的にモデル化し、分離する信号の確率的な尤もらしさを用いてノイズを除去することにより、音声信号のパワー又は相関を用いた信号処理により第１の音声情報からノイズを除去する方式のみを行う場合に比べ第１の音声情報からより多くのノイズを除去できる。 As a matter of course, after the noise is removed from the first voice information by signal processing using the power or correlation of the voice signal, the signal indicating the voice included in the first voice information from which the noise is removed by this method or By statistically modeling the signal indicating noise and removing the noise by using the probabilistic plausibility of the separated signal, the noise is removed from the first audio information by signal processing using the power or correlation of the audio signal. More noise can be removed from the first audio information as compared with the case where only the removal method is used.

つまり、第２の収音処理部２１が除去するノイズ量は、第１の収音処理部１２が除去するノイズ量よりも大きい。そのため、第２の収音処理部２１は、騒音が大きい環境においても、十分にノイズ（騒音）を除去し、ユーザの発話のみを抽出することができる。第２の収音処理部２１は、例えば、第１の収音処理部１２よりもより多くの事前学習されたパラメータを保持して、より多くの演算処理を行う。そのため、第２の収音処理部２１がノイズ除去に要する時間は、第１の収音処理部１２がノイズ除去に要する時間よりも長くなり、例えば数十ｍｓから数百ｍｓ程度長くなる。また、第２の収音処理部２１では、ノイズ除去処理のアルゴリズムをリアルタイムに更新することが可能であるのに対し、第１の収音処理部１２では、ノイズ除去処理のアルゴリズムを更新するためにプログラムのアップデートが必要となる。 That is, the amount of noise removed by the second sound collection processing unit 21 is larger than the amount of noise removed by the first sound collection processing unit 12. The second sound collection processing unit 21 can therefore remove noise sufficiently even in a noisy environment and extract only the user's utterance. The second sound collection processing unit 21 holds, for example, more pre-learned parameters than the first sound collection processing unit 12 and performs more arithmetic processing. As a result, the second sound collection processing unit 21 takes longer to remove noise than the first sound collection processing unit 12, for example by several tens to several hundreds of milliseconds. Further, the noise removal algorithm of the second sound collection processing unit 21 can be updated in real time, whereas updating the noise removal algorithm of the first sound collection processing unit 12 requires a program update.

上記のように第１の収音処理部１２は音声信号のパワー又は相関を用いてノイズを除去し、第２の収音処理部２１は音声を示す信号又はノイズを示す信号を統計的にモデル化し、分離する信号の確率的な尤もらしさを用いてノイズを除去する。しかしながら、これらの収音処理部は、別の方法でノイズを除去してもよい。 As described above, the first sound collection processing unit 12 removes noise by using the power or correlation of the audio signal, and the second sound collection processing unit 21 statistically models the signal indicating sound or the signal indicating noise. Noise is removed using the probabilistic plausibility of the separated signals. However, these sound collecting processing units may remove noise by another method.

すなわち、第２の収音処理部２１が第１の収音処理部１２よりも多くの量のノイズを第１の音声情報から除去するのであれば、第１の収音処理部１２、および第２の収音処理部２１のノイズを除去する具体的な処理はどのようなものであってもよい。 That is, as long as the second sound collection processing unit 21 removes a larger amount of noise from the first voice information than the first sound collection processing unit 12 does, the specific noise removal processing performed by the first sound collection processing unit 12 and by the second sound collection processing unit 21 may be of any kind.
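
For the lighter-weight processing attributed to the first sound collection processing unit 12 (signal processing using the correlation of a plurality of voice signals), one well-known technique of that kind is delay-and-sum combining of two microphone signals. The sketch below is offered only as an example of that family; the description does not state which algorithm the unit uses, and `estimate_lag` and `delay_and_sum` are hypothetical names.

```python
from typing import List


def estimate_lag(ref: List[float], other: List[float], max_lag: int) -> int:
    """Return the lag (in samples) of `other` relative to `ref` that
    maximizes their cross-correlation, searched within +/- max_lag."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(ref)):
            m = n + lag
            if 0 <= m < len(other):
                score += ref[n] * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(ref: List[float], other: List[float], max_lag: int) -> List[float]:
    """Align `other` to `ref` using the estimated lag and average the two.

    Speech that is coherent across microphones adds up, while noise that is
    uncorrelated between them is attenuated by the averaging.
    """
    lag = estimate_lag(ref, other, max_lag)
    output = []
    for n in range(len(ref)):
        m = n + lag
        aligned = other[m] if 0 <= m < len(other) else 0.0
        output.append(0.5 * (ref[n] + aligned))
    return output
```

This kind of processing needs no learned parameters and only a small, fixed amount of computation per sample, which is consistent with the shorter processing time described for the terminal side.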

続いて、第１の音声認識部１３と第２の音声認識部２２との差異について説明する。上記のように、第１の音声認識部１３と第２の音声認識部２２とでは、音声認識に使用される辞書の語彙数が異なっており、サーバ用辞書の語彙数は、端末用辞書の語彙数よりも多い。そのため、第２の音声認識部２２の認識可能な単語数は、第１の音声認識部１３の認識可能な単語数よりも多い。なお、第１の音声認識部１３は、辞書を用いずに音声を単に文字化してもよい。第２の音声認識部２２が音声認識に要する時間は、第１の音声認識部１３が音声認識に要する時間よりも長くなり、例えば数十ｍｓから数百ｍｓ程度長くなる。また、第２の音声認識部２２では、音声認識処理のアルゴリズムをリアルタイムに更新することが可能であるのに対し、第１の音声認識部１３では、音声認識処理のアルゴリズムを更新するためにプログラムのアップデートが必要となる。 Next, the difference between the first voice recognition unit 13 and the second voice recognition unit 22 will be described. As described above, the first voice recognition unit 13 and the second voice recognition unit 22 use dictionaries with different vocabulary sizes, and the vocabulary of the server dictionary is larger than that of the terminal dictionary. The second voice recognition unit 22 can therefore recognize more words than the first voice recognition unit 13. The first voice recognition unit 13 may simply convert the voice into text without using a dictionary. The second voice recognition unit 22 takes longer to perform voice recognition than the first voice recognition unit 13, for example by several tens to several hundreds of milliseconds. Further, the voice recognition algorithm of the second voice recognition unit 22 can be updated in real time, whereas updating the voice recognition algorithm of the first voice recognition unit 13 requires a program update.

調停部１４は、第１の音声認識部１３によって出力された第１の音声認識結果情報と、通信部１０１によって受信された第２の音声認識結果情報とのうちのいずれを出力するかを選択する。調停部１４は、この選択を、第１の尤度及び第２の尤度の少なくとも１つに基づいて行う。すなわち、調停部１４は、第１の尤度が所定の第１の閾値より大きい場合には第１の音声認識結果情報を出力する。また、調停部１４は、第１の尤度が第１の閾値以下であり、第２の尤度が所定の第２の閾値より大きい場合には第２の音声認識結果情報を出力する。さらに、調停部１４は、第１の尤度が第１の閾値以下であり、第２の尤度が第２の閾値以下である場合には第１の音声認識結果情報及び第２の音声認識結果情報のいずれも出力しない。第１の閾値および第２の閾値は、例えば、端末１００のメモリ１０５に記憶されている。 The arbitration unit 14 selects which of the first voice recognition result information output by the first voice recognition unit 13 and the second voice recognition result information received by the communication unit 101 is to be output. The arbitration unit 14 makes this selection based on at least one of the first likelihood and the second likelihood. That is, the arbitration unit 14 outputs the first voice recognition result information when the first likelihood is larger than a predetermined first threshold value. The arbitration unit 14 outputs the second voice recognition result information when the first likelihood is equal to or less than the first threshold value and the second likelihood is larger than a predetermined second threshold value. When the first likelihood is equal to or less than the first threshold value and the second likelihood is equal to or less than the second threshold value, the arbitration unit 14 outputs neither the first voice recognition result information nor the second voice recognition result information. The first threshold value and the second threshold value are stored in, for example, the memory 105 of the terminal 100.

なお、調停部１４は、第１の音声認識結果情報及び第２の音声認識結果情報のうち、先に入力された情報の尤度と閾値とを比較する。例えば、第１の音声認識結果情報が第２の音声認識結果情報よりも先に調停部１４に入力された場合、調停部１４は、第１の音声認識結果情報に対応する第１の尤度と第１の閾値とを比較し、第１の尤度が第１の閾値より大きい場合には第１の音声認識結果情報を出力する。一方、第１の尤度が第１の閾値以下である場合、調停部１４は、第２の音声認識結果情報が入力されるのを待ち、その後、第２の音声認識結果情報が入力された場合、第２の音声認識結果情報に対応する第２の尤度と第２の閾値とを比較し、第２の尤度が第２の閾値より大きい場合には第２の音声認識結果情報を出力する。このとき、第２の尤度が第２の閾値以下である場合、調停部１４は、満足のいく音声認識結果が得られなかったと判断し、第１の音声認識結果情報及び第２の音声認識結果情報のいずれも出力しない。以上の処理は、第２の音声認識結果情報が第１の音声認識結果情報よりも先に入力された場合にも、同様に行われる。 The arbitration unit 14 compares the likelihood of whichever of the first voice recognition result information and the second voice recognition result information is input first with the corresponding threshold value. For example, when the first voice recognition result information is input to the arbitration unit 14 before the second voice recognition result information, the arbitration unit 14 compares the first likelihood corresponding to the first voice recognition result information with the first threshold value, and outputs the first voice recognition result information if the first likelihood is larger than the first threshold value. If the first likelihood is equal to or less than the first threshold value, the arbitration unit 14 waits for the second voice recognition result information to be input. Once it is input, the arbitration unit 14 compares the second likelihood corresponding to the second voice recognition result information with the second threshold value, and outputs the second voice recognition result information if the second likelihood is larger than the second threshold value. If the second likelihood is equal to or less than the second threshold value, the arbitration unit 14 determines that no satisfactory voice recognition result was obtained and outputs neither the first voice recognition result information nor the second voice recognition result information. The same processing is performed when the second voice recognition result information is input before the first voice recognition result information.
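
The arrival-order selection performed by the arbitration unit can be sketched as follows. This is a minimal Python illustration, not the patented implementation: `Candidate` and `arbitrate` are hypothetical names, and the recognition result is represented as a plain string.

```python
from typing import Iterable, NamedTuple, Optional


class Candidate(NamedTuple):
    """A recognition result with its likelihood and the threshold that applies to it."""
    text: str
    likelihood: float
    threshold: float


def arbitrate(candidates_in_arrival_order: Iterable[Candidate]) -> Optional[str]:
    """Return the first result whose likelihood is larger than its threshold.

    Candidates are examined in the order they arrive. Because the iterable is
    consumed lazily, a later result is only waited for when every earlier
    result failed its threshold. None means that neither result is output.
    """
    for candidate in candidates_in_arrival_order:
        if candidate.likelihood > candidate.threshold:
            return candidate.text
    return None
```

For instance, `arbitrate([Candidate("volume up", 0.4, 0.7), Candidate("volume up please", 0.9, 0.6)])` skips the first candidate and returns the second, while two failing candidates yield `None`, which corresponds to outputting no recognition result.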

図３は、本開示の実施の形態１における音声認識システムの動作の一例を示すフローチャートである。 FIG. 3 is a flowchart showing an example of the operation of the voice recognition system according to the first embodiment of the present disclosure.

まず、ステップＳ１において、端末１００の音声取得部１１は、第１の音声情報を取得する。 First, in step S1, the voice acquisition unit 11 of the terminal 100 acquires the first voice information.

次に、ステップＳ２において、通信部１０１は、音声取得部１１によって取得された第１の音声情報をサーバ２００へ送信する。 Next, in step S2, the communication unit 101 transmits the first voice information acquired by the voice acquisition unit 11 to the server 200.

次に、ステップＳ３において、第１の収音処理部１２は、音声取得部１１によって取得された第１の音声情報に含まれるノイズを除去し、ノイズを除去した第２の音声情報を出力する。 Next, in step S3, the first sound collection processing unit 12 removes the noise contained in the first voice information acquired by the voice acquisition unit 11, and outputs the resulting noise-removed second voice information.

次に、ステップＳ４において、第１の音声認識部１３は、第１の収音処理部１２によって出力された第２の音声情報に対して音声認識を行い、音声認識結果を第１の音声認識結果情報として調停部１４に出力する。また、第１の音声認識部１３は、第１の音声認識結果情報の尤もらしさを示す第１の尤度を調停部１４に出力する。 Next, in step S4, the first voice recognition unit 13 performs voice recognition on the second voice information output by the first sound collection processing unit 12, and the voice recognition result is the first voice recognition. It is output to the arbitration unit 14 as result information. Further, the first voice recognition unit 13 outputs the first likelihood indicating the likelihood of the first voice recognition result information to the arbitration unit 14.

次に、ステップＳ５において、調停部１４は、第１の音声認識結果情報の尤もらしさを示す第１の尤度が第１の閾値より大きいか否かを判断する。なお、第１の閾値は、第１の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第１の尤度が第１の閾値より大きいと判断された場合（ステップＳ５でＹＥＳ）、ステップＳ６において、調停部１４は、第１の音声認識結果情報を出力する。 Next, in step S5, the arbitration unit 14 determines whether or not the first likelihood indicating the likelihood of the first voice recognition result information is larger than the first threshold value. The first threshold value is a threshold value at which it can be determined that the first voice recognition result information is a correct recognition result. Here, when it is determined that the first likelihood is larger than the first threshold value (YES in step S5), the arbitration unit 14 outputs the first voice recognition result information in step S6.

一方、第１の尤度が第１の閾値以下であると判断された場合（ステップＳ５でＮＯ）、ステップＳ７の処理へ移行する。 On the other hand, when it is determined that the first likelihood is equal to or less than the first threshold value (NO in step S5), the process proceeds to step S7.

ここで、端末１００のステップＳ３〜ステップＳ５の処理に並行して、サーバ２００のステップＳ３１〜ステップＳ３４の処理が行われる。 Here, the processes of steps S31 to S34 of the server 200 are performed in parallel with the processes of steps S3 to S5 of the terminal 100.

ステップＳ３１において、サーバ２００の通信部２０１は、端末１００によって送信された第１の音声情報を受信する。 In step S31, the communication unit 201 of the server 200 receives the first voice information transmitted by the terminal 100.

次に、ステップＳ３２において、第２の収音処理部２１は、通信部２０１によって受信された第１の音声情報に含まれるノイズを除去し、ノイズを除去した第３の音声情報を出力する。 Next, in step S32, the second sound collecting processing unit 21 removes the noise included in the first voice information received by the communication unit 201, and outputs the third voice information from which the noise has been removed.

次に、ステップＳ３３において、第２の音声認識部２２は、第２の収音処理部２１によって出力された第３の音声情報に対して音声認識を行い、音声認識結果を第２の音声認識結果情報として通信部２０１に出力する。また、第２の音声認識部２２は、第２の音声認識結果情報の尤もらしさを示す第２の尤度を通信部２０１に出力する。 Next, in step S33, the second voice recognition unit 22 performs voice recognition on the third voice information output by the second sound collection processing unit 21, and recognizes the voice recognition result as the second voice recognition. It is output to the communication unit 201 as the result information. Further, the second voice recognition unit 22 outputs a second likelihood indicating the likelihood of the second voice recognition result information to the communication unit 201.

次に、ステップＳ３４において、通信部２０１は、第２の音声認識部２２から出力された第２の音声認識結果情報及び第２の尤度を端末１００へ送信する。 Next, in step S34, the communication unit 201 transmits the second voice recognition result information and the second likelihood output from the second voice recognition unit 22 to the terminal 100.

次に、ステップＳ７において、端末１００の通信部１０１は、サーバ２００によって送信された第２の音声認識結果情報及び第２の尤度を受信する。また、通信部１０１は、第２の音声認識結果情報および第２の尤度を調停部１４に出力する。 Next, in step S7, the communication unit 101 of the terminal 100 receives the second voice recognition result information and the second likelihood transmitted by the server 200. Further, the communication unit 101 outputs the second voice recognition result information and the second likelihood to the arbitration unit 14.

次に、ステップＳ８において、調停部１４は、第２の音声認識結果情報の尤もらしさを示す第２の尤度が第２の閾値より大きいか否かを判断する。なお、第２の閾値は、第２の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第２の尤度が第２の閾値より大きいと判断された場合（ステップＳ８でＹＥＳ）、ステップＳ９において、調停部１４は、第２の音声認識結果情報を出力する。 Next, in step S8, the arbitration unit 14 determines whether or not the second likelihood indicating the likelihood of the second voice recognition result information is larger than the second threshold value. The second threshold value is a threshold value at which it can be determined that the second voice recognition result information is the correct recognition result. Here, when it is determined that the second likelihood is larger than the second threshold value (YES in step S8), in step S9, the arbitration unit 14 outputs the second voice recognition result information.

一方、第２の尤度が第２の閾値以下であると判断された場合（ステップＳ８でＮＯ）、ステップＳ１０において、表示部１０６は、音声認識ができなかったことを示す音声認識不可情報をユーザに通知する。 On the other hand, when it is determined that the second likelihood is equal to or less than the second threshold value (NO in step S8), in step S10 the display unit 106 notifies the user of voice recognition failure information indicating that voice recognition could not be performed.
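
The flow of FIG. 3 (steps S1 to S10, with the server-side steps S31 to S34 running in parallel) can be sketched with a thread pool. This is a hedged illustration: `recognize`, `local`, and `remote` are hypothetical names, each recognizer is assumed to return a `(text, likelihood)` pair, and network transport is replaced by an ordinary function call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Tuple

# A recognizer takes audio and returns (recognition result, likelihood).
Recognizer = Callable[[bytes], Tuple[str, float]]


def recognize(
    audio: bytes,
    local: Recognizer,
    remote: Recognizer,
    first_threshold: float,
    second_threshold: float,
) -> Optional[str]:
    """Run terminal-side and server-side recognition in parallel, then arbitrate."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(remote, audio)      # S2, S31-S34: server side
        text, likelihood = local(audio)           # S3-S4: terminal side
        if likelihood > first_threshold:          # S5
            # S6. Note: leaving the with-block still waits for the remote
            # call to finish; a real terminal would likely ignore or cancel it.
            return text
        text, likelihood = pending.result()       # S7: wait for the server
        if likelihood > second_threshold:         # S8
            return text                           # S9
    return None                                   # S10: notify the user of failure
```

Returning `None` stands in for step S10; the caller would then show the failure notification on the display unit or announce it through the speaker.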

図４は、音声認識不可情報を表示する表示画面の一例を示す図である。 FIG. 4 is a diagram showing an example of a display screen that displays the voice recognition failure information.

図４に示すように、表示部１０６は、調停部１４によって第１の尤度が第１の閾値以下であり、且つ第２の尤度が第２の閾値以下であると判断された場合、表示画面上に音声認識不可情報１０６１を表示する。図４では、音声認識不可情報１０６１は、“音声認識できませんでした”という文字情報で構成される。 As shown in FIG. 4, when the arbitration unit 14 determines that the first likelihood is equal to or less than the first threshold value and the second likelihood is equal to or less than the second threshold value, the display unit 106 displays voice recognition failure information 1061 on the display screen. In FIG. 4, the voice recognition failure information 1061 consists of the text "Voice recognition could not be performed".

なお、本実施の形態では、端末１００は、音声認識不可情報を、表示部１０６に表示することによりユーザに通知するが、他の方法でユーザに通知してもよい。本開示は特にこれに限定されず、スピーカ１０３から音声出力することによりユーザに通知してもよい。 In the present embodiment, the terminal 100 notifies the user of the voice recognition failure information by displaying it on the display unit 106, but the user may be notified by another method. The present disclosure is not particularly limited to display, and the user may instead be notified by audio output from the speaker 103.

また、本実施の形態において、端末１００は、第１の音声情報がサーバ２００に送信されていることを示すサーバ送信情報をユーザに通知してもよい。 Further, in the present embodiment, the terminal 100 may notify the user of the server transmission information indicating that the first voice information is transmitted to the server 200.

図５は、サーバ送信情報を表示する表示画面の一例を示す図である。 FIG. 5 is a diagram showing an example of a display screen for displaying server transmission information.

図５に示すように、表示部１０６は、第１の音声情報がサーバ２００に送信された場合、表示画面上にサーバ送信情報１０６２を表示する。図５では、サーバ送信情報１０６２は、音声情報がネットワークを介して送信されていることを示すアイコンで構成される。サーバ送信情報１０６２の表示は、第１の音声情報の送信開始タイミングで開始され、送信終了タイミングで終了してもよい。また、サーバ送信情報１０６２の表示は、第１の音声情報の送信開始タイミングで開始され、第２の音声認識結果情報の受信タイミングで終了してもよい。 As shown in FIG. 5, when the first voice information is transmitted to the server 200, the display unit 106 displays the server transmission information 1062 on the display screen. In FIG. 5, the server transmission information 1062 is composed of an icon indicating that the voice information is transmitted via the network. The display of the server transmission information 1062 may be started at the transmission start timing of the first voice information and may end at the transmission end timing. Further, the display of the server transmission information 1062 may be started at the transmission start timing of the first voice information and may end at the reception timing of the second voice recognition result information.

なお、ユーザによっては、サーバ２００に音声情報を送信することを望まない可能性がある。そのため、事前に、ユーザに対して、サーバ２００に音声情報を送信するか否かを確認する送信確認情報を提示することが好ましい。 Note that some users may not want to transmit voice information to the server 200. Therefore, it is preferable to present to the user in advance transmission confirmation information for confirming whether or not to transmit voice information to the server 200.

図６は、送信確認情報を表示する表示画面の一例を示す図である。 FIG. 6 is a diagram showing an example of a display screen for displaying transmission confirmation information.

図６に示すように、表示部１０６は、初期設定時において、表示画面上に送信確認情報１０６３を表示する。図６では、送信確認情報１０６３は、“音声をクラウドにアップしてもよいですか？”という文字情報で構成される。送信確認情報１０６３は、端末１００の初期設定時に表示されてもよいし、第１の音声情報を最初に送信する際に表示されてもよい。 As shown in FIG. 6, the display unit 106 displays the transmission confirmation information 1063 on the display screen at the time of initial setting. In FIG. 6, the transmission confirmation information 1063 is composed of the character information "Can the voice be uploaded to the cloud?". The transmission confirmation information 1063 may be displayed at the time of initial setting of the terminal 100, or may be displayed at the time of first transmitting the first voice information.

本実施の形態では、第１の音声認識部１３は、第１の音声認識結果情報および第１の尤度をそれぞれ調停部１４に出力するものを例に説明をしたが、これに限定をされない。 The present embodiment has been described using an example in which the first voice recognition unit 13 outputs the first voice recognition result information and the first likelihood separately to the arbitration unit 14, but the present disclosure is not limited to this.

第１の音声認識部１３は、第１の尤度を、第２の音声情報に対する音声認識を行うときに算出する。例えば、第１の音声認識部１３は、第２の音声情報の音声認識結果および第１の尤度を含む第１の音声認識結果情報を調停部１４に出力してもよい。 The first voice recognition unit 13 calculates the first likelihood when performing voice recognition for the second voice information. For example, the first voice recognition unit 13 may output the voice recognition result of the second voice information and the first voice recognition result information including the first likelihood to the arbitration unit 14.

この場合、調停部１４は、第１の音声認識結果情報の中から必要に応じて、第２の音声情報の音声認識結果および第１の尤度を取り出して処理を行えばよい。 In this case, the arbitration unit 14 may take out the voice recognition result and the first likelihood of the second voice information from the first voice recognition result information as necessary and perform processing.

本実施の形態では、第２の音声認識部２２は、第２の音声認識結果情報および第２の尤度をそれぞれ通信部２０１に出力するものを例に説明をしたが、これに限定をされない。 The present embodiment has been described using an example in which the second voice recognition unit 22 outputs the second voice recognition result information and the second likelihood separately to the communication unit 201, but the present disclosure is not limited to this.

第２の音声認識部２２は、第２の尤度を、第３の音声情報に対する音声認識を行うときに算出する。例えば、第２の音声認識部２２は、第３の音声情報の音声認識結果および第２の尤度を含む第２の音声認識結果情報を通信部２０１に出力してもよい。 The second voice recognition unit 22 calculates the second likelihood when performing voice recognition for the third voice information. For example, the second voice recognition unit 22 may output the voice recognition result of the third voice information and the second voice recognition result information including the second likelihood to the communication unit 201.

この場合、通信部２０１、通信部１０１および調停部１４のいずれかは、第２の音声認識結果情報の中から必要に応じて、第３の音声情報の音声認識結果または第２の尤度を取り出して処理を行ってもよい。 In this case, any one of the communication unit 201, the communication unit 101, and the arbitration unit 14 obtains the voice recognition result or the second likelihood of the third voice information from the second voice recognition result information, if necessary. It may be taken out and processed.

また、本実施の形態では、第１の音声認識部１３が、第１の音声認識結果情報および第１の尤度を調停部１４へ出力し、通信部１０１がサーバ２００により送信された第２の音声認識結果情報、および第２の尤度を調停部１４へ出力するものを例に説明をしたが、これに限定されない。 Further, the present embodiment has been described using an example in which the first voice recognition unit 13 outputs the first voice recognition result information and the first likelihood to the arbitration unit 14, and the communication unit 101 outputs the second voice recognition result information and the second likelihood transmitted by the server 200 to the arbitration unit 14, but the present disclosure is not limited to this.

調停部１４は、第１の尤度が入力されれば、予め保持する第１の閾値との比較結果に応じて、第１の音声認識結果情報を出力すべきかどうかを判断できる。 If the first likelihood is input, the arbitration unit 14 can determine whether or not to output the first voice recognition result information according to the comparison result with the first threshold value held in advance.

また、調停部１４は、第２の尤度が入力されれば、予め保持する第２の閾値との比較結果に応じて、第２の音声認識結果情報を出力すべきかどうかを判断できる。 Further, if the second likelihood is input, the arbitration unit 14 can determine whether or not to output the second voice recognition result information according to the comparison result with the second threshold value held in advance.

例えば、第１の音声認識部１３は、第１の音声認識結果情報を調停部１４に出力するのではなく、端末１００のメモリ１０５に記憶してもよい。この場合、第１の音声認識部１３は、第１の尤度を調停部１４に出力する。 For example, the first voice recognition unit 13 may store the first voice recognition result information in the memory 105 of the terminal 100 instead of outputting it to the arbitration unit 14. In this case, the first voice recognition unit 13 outputs the first likelihood to the arbitration unit 14.

また、例えば、通信部１０１は、第２の音声認識結果情報を調停部１４に出力するのではなく、端末１００のメモリ１０５に記憶してもよい。この場合、通信部１０１は、第２の尤度を調停部１４に出力する。 Further, for example, the communication unit 101 may store the second voice recognition result information in the memory 105 of the terminal 100 instead of outputting it to the arbitration unit 14. In this case, the communication unit 101 outputs the second likelihood to the arbitration unit 14.

また、調停部１４は、出力すべきと判断した第１の音声認識結果情報または第２の音声認識結果情報をメモリ１０５から取り出して出力してもよい。また、調停部１４は、第１の音声認識結果情報および第２の音声認識結果情報を出力しないと判断した場合、メモリ１０５から、第１の音声認識結果情報および第２の音声認識結果情報を削除してもよい。 Further, the arbitration unit 14 may retrieve from the memory 105 whichever of the first voice recognition result information and the second voice recognition result information it has determined should be output, and output it. When the arbitration unit 14 determines that neither the first voice recognition result information nor the second voice recognition result information is to be output, it may delete both from the memory 105.
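
The memory-based variant described above, in which only the likelihoods are passed to the arbitration unit and the result information itself stays in the memory 105, might look like the following sketch. The dictionary `memory`, its keys `"first"` and `"second"`, and the function name `arbitrate_from_memory` are all assumptions made for illustration.

```python
from typing import Dict, Optional


def arbitrate_from_memory(
    memory: Dict[str, str],
    first_likelihood: float,
    second_likelihood: float,
    first_threshold: float,
    second_threshold: float,
) -> Optional[str]:
    """Decide from the likelihoods alone, then fetch the chosen result.

    `memory` stands in for the memory 105 and is assumed to hold the stored
    results under the hypothetical keys "first" and "second".
    """
    if first_likelihood > first_threshold:
        return memory["first"]
    if second_likelihood > second_threshold:
        return memory["second"]
    # Neither result is output, so both stored results may be deleted.
    memory.pop("first", None)
    memory.pop("second", None)
    return None
```

Passing only the likelihoods keeps the data handed to the arbitration step small; the result text is read from storage only once a decision has been made.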

また、本実施の形態では、端末１００が調停部１４を備えているが、本開示は特にこれに限定されず、サーバ２００が調停部を備えてもよい。 Further, in the present embodiment, the terminal 100 includes the arbitration unit 14, but the present disclosure is not particularly limited to this, and the server 200 may include the arbitration unit.

図７は、本開示の実施の形態１の変形例における音声認識システムの機能構成を示す図である。図７に示すように、端末１００は、音声取得部１１、第１の収音処理部１２及び第１の音声認識部１３を備える。サーバ２００は、第２の収音処理部２１、第２の音声認識部２２及び調停部２３を備える。 FIG. 7 is a diagram showing a functional configuration of a voice recognition system in a modified example of the first embodiment of the present disclosure. As shown in FIG. 7, the terminal 100 includes a voice acquisition unit 11, a first sound collection processing unit 12, and a first voice recognition unit 13. The server 200 includes a second sound collection processing unit 21, a second voice recognition unit 22, and an arbitration unit 23.

端末１００の通信部１０１は、第１の音声認識部１３から出力された第１の音声認識結果情報および第１の尤度をサーバ２００へ送信する。サーバ２００の通信部２０１は、端末１００によって送信された第１の音声認識結果情報および第１の尤度を受信し、調停部２３へ出力する。 The communication unit 101 of the terminal 100 transmits the first voice recognition result information and the first likelihood output from the first voice recognition unit 13 to the server 200. The communication unit 201 of the server 200 receives the first voice recognition result information and the first likelihood transmitted by the terminal 100 and outputs them to the arbitration unit 23.

第２の音声認識部２２は、第２の収音処理部２１によって出力された第３の音声情報に対して音声認識を行い、音声認識結果を第２の音声認識結果情報として調停部２３へ出力する。また、第２の音声認識部２２は、第２の尤度を調停部２３へ出力する。 The second voice recognition unit 22 performs voice recognition on the third voice information output by the second sound collection processing unit 21, and sends the voice recognition result to the arbitration unit 23 as the second voice recognition result information. Output. Further, the second voice recognition unit 22 outputs the second likelihood to the arbitration unit 23.

調停部２３は、通信部２０１によって受信された第１の音声認識結果情報と、第２の音声認識部２２によって出力された第２の音声認識結果情報とのうちのいずれを出力するかを選択する。なお、調停部２３の処理は、調停部１４の処理と同じであるので、説明を省略する。 The arbitration unit 23 selects which of the first voice recognition result information received by the communication unit 201 and the second voice recognition result information output by the second voice recognition unit 22 is to be output. Since the processing of the arbitration unit 23 is the same as that of the arbitration unit 14, its description is omitted.

サーバ２００の通信部２０１は、調停部２３から出力された選択結果を端末１００へ送信する。なお、選択結果は、第１の音声認識結果情報及び第２の音声認識結果情報のいずれか一方、又は、音声認識ができなかったことを示す情報を含む。端末１００の通信部１０１は、サーバ２００によって送信された選択結果を受信する。 The communication unit 201 of the server 200 transmits the selection result output from the arbitration unit 23 to the terminal 100. The selection result includes either one of the first voice recognition result information and the second voice recognition result information, or information indicating that the voice recognition could not be performed. The communication unit 101 of the terminal 100 receives the selection result transmitted by the server 200.

このように、調停部は、端末１００とサーバ２００とのいずれが備えてもよい。調停部がサーバ２００にある場合、端末１００の演算量を削減することができる。また、調停部が端末１００にある場合、ネットワークを介して選択結果を受信する必要がないので、処理時間を短縮することができる。 As described above, the arbitration unit may be provided by either the terminal 100 or the server 200. When the arbitration unit is located on the server 200, the calculation amount of the terminal 100 can be reduced. Further, when the arbitration unit is located in the terminal 100, it is not necessary to receive the selection result via the network, so that the processing time can be shortened.

なお、第１の収音処理部１２において第１の音声情報に含まれるノイズを除去する方式を第１の除去方式、第２の収音処理部２１において第１の音声情報に含まれるノイズを除去する方式を第２の除去方式として説明をした。しかしながら第１の除去方式、第２の除去方式というのは、第１の収音処理部１２、第２の収音処理部２１において行う第１の音声情報に含まれるノイズを除去する方式の名称である。 In the above description, the method by which the first sound collection processing unit 12 removes the noise contained in the first voice information is called the first removal method, and the method by which the second sound collection processing unit 21 removes the noise contained in the first voice information is called the second removal method. However, "first removal method" and "second removal method" are simply names for the methods by which the first sound collection processing unit 12 and the second sound collection processing unit 21 remove the noise contained in the first voice information.

したがって、第１の収音処理部１２において第１の音声情報に含まれるノイズを除去する方式を第２の除去方式、第２の収音処理部２１において第１の音声情報に含まれるノイズを除去する方式を第１の除去方式と呼んでもよい。 Therefore, the first sound collecting processing unit 12 removes the noise contained in the first voice information by the second removing method, and the second sound collecting processing unit 21 removes the noise contained in the first voice information. The method of removing may be called the first removing method.

（実施の形態２） (Embodiment 2)
続いて、実施の形態２に係る音声認識システムについて説明する。実施の形態２における音声認識システムの全体構成は、図１と同じであるので説明を省略する。 Subsequently, the voice recognition system according to the second embodiment will be described. Since the overall configuration of the voice recognition system according to the second embodiment is the same as that in FIG. 1, the description thereof is omitted.

図８は、本開示の実施の形態２における音声認識システムの機能構成を示す図である。図８に示すように、音声認識システムは、端末１００及びサーバ２００を備える。端末１００は、音声取得部１１、第１の収音処理部１２、第１の音声認識部１３及び調停部１４を備える。サーバ２００は、第２の収音処理部２１及び第２の音声認識部２２を備える。 FIG. 8 is a diagram showing a functional configuration of the voice recognition system according to the second embodiment of the present disclosure. As shown in FIG. 8, the voice recognition system includes a terminal 100 and a server 200. The terminal 100 includes a voice acquisition unit 11, a first sound collection processing unit 12, a first voice recognition unit 13, and an arbitration unit 14. The server 200 includes a second sound collection processing unit 21 and a second voice recognition unit 22.

サーバ２００の通信部２０１は、第２の収音処理部２１から出力された第３の音声情報を端末１００へ送信する。端末１００の通信部１０１は、第３の音声情報をサーバ２００から受信し、受信した第３の音声情報を第１の音声認識部１３へ出力する。第１の音声認識部１３は、通信部１０１によって受信された第３の音声情報に対して音声認識を行い、音声認識結果を第４の音声認識結果情報として調停部１４に出力する。 The communication unit 201 of the server 200 transmits the third voice information output from the second sound collection processing unit 21 to the terminal 100. The communication unit 101 of the terminal 100 receives the third voice information from the server 200, and outputs the received third voice information to the first voice recognition unit 13. The first voice recognition unit 13 performs voice recognition on the third voice information received by the communication unit 101, and outputs the voice recognition result to the arbitration unit 14 as the fourth voice recognition result information.

この場合、音声認識結果は、第３の音声情報の音声認識結果を含む。また、第１の音声認識部１３は、この認識結果の尤もらしさを示す第４の尤度を算出し、算出した第４の尤度を調停部１４に出力する。 In this case, the voice recognition result includes the voice recognition result of the third voice information. Further, the first voice recognition unit 13 calculates a fourth likelihood indicating the plausibility of the recognition result, and outputs the calculated fourth likelihood to the arbitration unit 14.

実施の形態１において、説明をした第１の音声認識部１３における音声認識、尤度の算出において、第２の音声情報の代わりに第３の音声情報を用いて処理をすればよいので、第１の音声認識部１３における第３の音声情報に対する音声認識、第４の尤度の算出に関する詳細な説明は省略する。 The voice recognition and likelihood calculation in the first voice recognition unit 13 described in the first embodiment may be performed using the third voice information instead of the second voice information. A detailed description of the voice recognition of the third voice information and the calculation of the fourth likelihood in the first voice recognition unit 13 is therefore omitted.

また、通信部１０１は、第１の収音処理部１２によって出力された第２の音声情報をサーバ２００へ送信する。サーバ２００の通信部２０１は、端末１００によって送信された第２の音声情報を受信し、第２の音声認識部２２へ出力する。第２の音声認識部２２は、通信部２０１によって受信された第２の音声情報に対して音声認識を行い、音声認識結果を第３の音声認識結果情報として通信部２０１に出力する。 Further, the communication unit 101 transmits the second voice information output by the first sound collection processing unit 12 to the server 200. The communication unit 201 of the server 200 receives the second voice information transmitted by the terminal 100 and outputs it to the second voice recognition unit 22. The second voice recognition unit 22 performs voice recognition on the second voice information received by the communication unit 201, and outputs the voice recognition result to the communication unit 201 as the third voice recognition result information.

この場合、音声認識結果は、第２の音声情報の音声認識結果を含む。また、第２の音声認識部２２は、この音声認識結果の尤もらしさを示す第３の尤度を算出し、算出した第３の尤度を通信部２０１に出力する。 In this case, the voice recognition result includes the voice recognition result of the second voice information. Further, the second voice recognition unit 22 calculates a third likelihood indicating the plausibility of the voice recognition result, and outputs the calculated third likelihood to the communication unit 201.

実施の形態１において、説明をした第２の音声認識部２２における音声認識、尤度の算出において、第３の音声情報の代わりに第２の音声情報を用いて処理をすればよいので、第２の音声認識部２２における第２の音声情報に対する音声認識、第３の尤度の算出に関する詳細な説明は省略する。 The voice recognition and likelihood calculation in the second voice recognition unit 22 described in the first embodiment may be performed using the second voice information instead of the third voice information. A detailed description of the voice recognition of the second voice information and the calculation of the third likelihood in the second voice recognition unit 22 is therefore omitted.

通信部２０１は、第２の音声認識部２２から出力された第３の音声認識結果情報および第３の尤度を端末１００へ送信する。通信部１０１は、第２の音声情報の音声認識結果である第３の音声認識結果情報をサーバ２００から受信し、受信した第３の音声認識結果情報を調停部１４へ出力する。 The communication unit 201 transmits the third voice recognition result information and the third likelihood output from the second voice recognition unit 22 to the terminal 100. The communication unit 101 receives the third voice recognition result information, which is the voice recognition result of the second voice information, from the server 200, and outputs the received third voice recognition result information to the arbitration unit 14.

調停部１４は、第１の音声認識部１３によって出力された第１の音声認識結果情報と、通信部１０１によって受信された第２の音声認識結果情報と、通信部１０１によって受信された第３の音声認識結果情報と、第１の音声認識部１３によって出力された第４の音声認識結果情報とのうちのいずれを出力するかを選択する。 The arbitration unit 14 selects which of the following to output: the first voice recognition result information output by the first voice recognition unit 13, the second voice recognition result information received by the communication unit 101, the third voice recognition result information received by the communication unit 101, or the fourth voice recognition result information output by the first voice recognition unit 13.

第１の音声認識部１３は、第１の音声認識結果情報の尤もらしさを示す第１の尤度を算出し、算出した第１の尤度を調停部１４に出力する。 The first voice recognition unit 13 calculates the first likelihood indicating the plausibility of the first voice recognition result information, and outputs the calculated first likelihood to the arbitration unit 14.

通信部１０１は、サーバ２００から送信された第２の音声認識結果情報の尤もらしさを示す第２の尤度を受信し、受信した第２の尤度を調停部１４に出力する。また、通信部１０１は、サーバ２００から送信された第３の音声認識結果情報の尤もらしさを示す第３の尤度を受信し、受信した第３の尤度を調停部１４に出力する。 The communication unit 101 receives the second likelihood indicating the likelihood of the second voice recognition result information transmitted from the server 200, and outputs the received second likelihood to the arbitration unit 14. Further, the communication unit 101 receives the third likelihood indicating the likelihood of the third voice recognition result information transmitted from the server 200, and outputs the received third likelihood to the arbitration unit 14.

さらに、第１の音声認識部１３は、第４の音声認識結果情報の尤もらしさを示す第４の尤度を算出し、算出した第４の尤度を調停部１４に出力する。 Further, the first voice recognition unit 13 calculates a fourth likelihood indicating the plausibility of the fourth voice recognition result information, and outputs the calculated fourth likelihood to the arbitration unit 14.

調停部１４は、第１の音声認識結果情報と、第２の音声認識結果情報と、第３の音声認識結果情報と、第４の音声認識結果情報とのうちのいずれを出力するかを、第１の尤度、第２の尤度、第３の尤度及び第４の尤度のうちの少なくとも１つに基づいて選択する。 The arbitration unit 14 selects which of the first voice recognition result information, the second voice recognition result information, the third voice recognition result information, and the fourth voice recognition result information to output, based on at least one of the first likelihood, the second likelihood, the third likelihood, and the fourth likelihood.
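
The four candidate results differ only in where noise removal and recognition take place. The following Python sketch is an illustration and not part of the disclosure. It tabulates the combinations stated in this embodiment; the names `Pipeline`, `PIPELINES`, and `describe` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    noise_removal: str  # device whose sound collection processing unit removed the noise
    recognizer: str     # device whose voice recognition unit produced the result


# Result number -> combination, as described for Embodiment 2.
PIPELINES = {
    1: Pipeline("terminal", "terminal"),  # second voice information, unit 13
    2: Pipeline("server", "server"),      # third voice information, unit 22
    3: Pipeline("terminal", "server"),    # second voice information, unit 22
    4: Pipeline("server", "terminal"),    # third voice information, unit 13
}


def describe(n: int) -> str:
    p = PIPELINES[n]
    return f"result {n}: noise removal on {p.noise_removal}, recognition on {p.recognizer}"


for n in sorted(PIPELINES):
    print(describe(n))
```
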

図９は、本開示の実施の形態２における音声認識システムの動作の一例を示す第１のフローチャートであり、図１０は、本開示の実施の形態２における音声認識システムの動作の一例を示す第２のフローチャートであり、図１１は、本開示の実施の形態２における音声認識システムの動作の一例を示す第３のフローチャートである。 FIG. 9 is a first flowchart showing an example of the operation of the voice recognition system according to the second embodiment of the present disclosure, FIG. 10 is a second flowchart showing an example of that operation, and FIG. 11 is a third flowchart showing an example of that operation.

まず、ステップＳ４１において、端末１００の音声取得部１１は、第１の音声情報を取得する。 First, in step S41, the voice acquisition unit 11 of the terminal 100 acquires the first voice information.

次に、ステップＳ４２において、通信部１０１は、音声取得部１１によって取得された第１の音声情報をサーバ２００へ送信する。 Next, in step S42, the communication unit 101 transmits the first voice information acquired by the voice acquisition unit 11 to the server 200.

次に、ステップＳ４３において、第１の収音処理部１２は、音声取得部１１によって取得された第１の音声情報に含まれるノイズを除去し、ノイズを除去した第２の音声情報を出力する。 Next, in step S43, the first sound collection processing unit 12 removes noise included in the first voice information acquired by the voice acquisition unit 11, and outputs the second voice information from which the noise has been removed.

次に、ステップＳ４４において、通信部１０１は、第１の収音処理部１２によってノイズが除去された第２の音声情報をサーバ２００へ送信する。 Next, in step S44, the communication unit 101 transmits the second voice information from which noise has been removed by the first sound collection processing unit 12 to the server 200.

次に、ステップＳ４５において、第１の音声認識部１３は、第１の収音処理部１２によって出力された第２の音声情報に対して音声認識を行い、音声認識結果を第１の音声認識結果情報として調停部１４に出力する。また、第１の音声認識部１３は、第１の音声認識結果情報の尤もらしさを示す第１の尤度を算出し、算出した第１の尤度を調停部１４へ出力する。 Next, in step S45, the first voice recognition unit 13 performs voice recognition on the second voice information output by the first sound collection processing unit 12, and outputs the voice recognition result to the arbitration unit 14 as the first voice recognition result information. Further, the first voice recognition unit 13 calculates the first likelihood indicating the plausibility of the first voice recognition result information, and outputs the calculated first likelihood to the arbitration unit 14.

次に、ステップＳ４６において、調停部１４は、第１の音声認識結果情報の尤もらしさを示す第１の尤度が第１の閾値より大きいか否かを判断する。なお、第１の閾値は、第１の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第１の尤度が第１の閾値より大きいと判断された場合（ステップＳ４６でＹＥＳ）、ステップＳ４７において、調停部１４は、第１の音声認識結果情報を出力する。 Next, in step S46, the arbitration unit 14 determines whether or not the first likelihood indicating the likelihood of the first voice recognition result information is larger than the first threshold value. The first threshold value is a threshold value at which it can be determined that the first voice recognition result information is a correct recognition result. Here, when it is determined that the first likelihood is larger than the first threshold value (YES in step S46), in step S47, the arbitration unit 14 outputs the first voice recognition result information.

一方、第１の尤度が第１の閾値以下であると判断された場合（ステップＳ４６でＮＯ）、ステップＳ４８の処理へ移行する。 On the other hand, when it is determined that the first likelihood is equal to or less than the first threshold value (NO in step S46), the process proceeds to step S48.

ここで、端末１００のステップＳ４４〜ステップＳ４６の処理に並行して、サーバ２００のステップＳ６１〜ステップＳ６３の処理が行われる。 Here, the processes of steps S61 to S63 of the server 200 are performed in parallel with the processes of steps S44 to S46 of the terminal 100.

ステップＳ６１において、サーバ２００の通信部２０１は、端末１００によって送信された第２の音声情報を受信する。 In step S61, the communication unit 201 of the server 200 receives the second voice information transmitted by the terminal 100.

次に、ステップＳ６２において、第２の音声認識部２２は、通信部２０１によって受信された第２の音声情報に対して音声認識を行い、音声認識結果を第３の音声認識結果情報として通信部２０１に出力する。また、第２の音声認識部２２は、第３の音声認識結果情報の尤もらしさを示す第３の尤度を算出し、算出した第３の尤度を通信部２０１に出力する。 Next, in step S62, the second voice recognition unit 22 performs voice recognition on the second voice information received by the communication unit 201, and outputs the voice recognition result to the communication unit 201 as the third voice recognition result information. Further, the second voice recognition unit 22 calculates a third likelihood indicating the plausibility of the third voice recognition result information, and outputs the calculated third likelihood to the communication unit 201.

次に、ステップＳ６３において、通信部２０１は、第２の音声認識部２２から出力された第３の音声認識結果情報および第３の尤度を端末１００へ送信する。 Next, in step S63, the communication unit 201 transmits the third voice recognition result information and the third likelihood output from the second voice recognition unit 22 to the terminal 100.

次に、ステップＳ４８において、端末１００の通信部１０１は、サーバ２００によって送信された第３の音声認識結果情報および第３の尤度を受信する。また、通信部１０１は、第３の音声認識結果情報および第３の尤度を調停部１４に出力する。 Next, in step S48, the communication unit 101 of the terminal 100 receives the third voice recognition result information and the third likelihood transmitted by the server 200. Further, the communication unit 101 outputs the third voice recognition result information and the third likelihood to the arbitration unit 14.

次に、ステップＳ４９において、調停部１４は、第３の音声認識結果情報の尤もらしさを示す第３の尤度が第３の閾値より大きいか否かを判断する。なお、第３の閾値は、第３の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第３の尤度が第３の閾値より大きいと判断された場合（ステップＳ４９でＹＥＳ）、ステップＳ５０において、調停部１４は、第３の音声認識結果情報を出力する。 Next, in step S49, the arbitration unit 14 determines whether or not the third likelihood indicating the likelihood of the third voice recognition result information is greater than the third threshold value. The third threshold value is a threshold value at which it can be determined that the third voice recognition result information is the correct recognition result. Here, when it is determined that the third likelihood is larger than the third threshold value (YES in step S49), in step S50, the arbitration unit 14 outputs the third voice recognition result information.

一方、第３の尤度が第３の閾値以下であると判断された場合（ステップＳ４９でＮＯ）、ステップＳ５１の処理へ移行する。 On the other hand, when it is determined that the third likelihood is equal to or less than the third threshold value (NO in step S49), the process proceeds to step S51.
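
The overlap between the terminal-side steps S44 to S46 and the server-side steps S61 to S63 can be pictured with a small concurrency sketch. The code below is a minimal illustration under stated assumptions: `recognize_on_terminal` and `recognize_on_server` are hypothetical stand-ins that return fixed (text, likelihood) pairs, the threshold values are invented, and a thread pool stands in for the network round trip.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Tuple


def recognize_on_terminal(audio: bytes) -> Tuple[str, float]:
    # Hypothetical stand-in for the first voice recognition unit 13 (step S45).
    return ("turn on", 0.4)


def recognize_on_server(audio: bytes) -> Tuple[str, float]:
    # Hypothetical stand-in for steps S61 to S63 on the server.
    return ("turn on the light", 0.9)


def run(second_voice_info: bytes, t1: float = 0.6, t3: float = 0.6) -> Optional[str]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The server request is started first so that it runs in parallel.
        server_future = pool.submit(recognize_on_server, second_voice_info)
        text1, likelihood1 = recognize_on_terminal(second_voice_info)  # S45
        if likelihood1 > t1:                                           # S46
            return text1                                               # S47
        text3, likelihood3 = server_future.result()                    # S48
        if likelihood3 > t3:                                           # S49
            return text3                                               # S50
    return None  # the flow would continue to step S51


print(run(b"\x00\x01"))
```

Because the server request is already in flight while the terminal recognizes locally, the wait at step S48 only covers whatever server time remains.
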

ここで、端末１００のステップＳ４２〜ステップＳ４９の処理に並行して、サーバ２００のステップＳ７１〜ステップＳ７３の処理が行われる。 Here, the processes of steps S71 to S73 of the server 200 are performed in parallel with the processes of steps S42 to S49 of the terminal 100.

ステップＳ７１において、サーバ２００の通信部２０１は、端末１００によって送信された第１の音声情報を受信する。 In step S71, the communication unit 201 of the server 200 receives the first voice information transmitted by the terminal 100.

次に、ステップＳ７２において、第２の収音処理部２１は、通信部２０１によって受信された第１の音声情報に含まれるノイズを除去し、ノイズを除去した第３の音声情報を出力する。 Next, in step S72, the second sound collecting processing unit 21 removes the noise included in the first voice information received by the communication unit 201, and outputs the third voice information from which the noise has been removed.

次に、ステップＳ７３において、通信部２０１は、第２の収音処理部２１から出力された第３の音声情報を端末１００へ送信する。 Next, in step S73, the communication unit 201 transmits the third voice information output from the second sound collection processing unit 21 to the terminal 100.

次に、ステップＳ５１において、端末１００の通信部１０１は、サーバ２００によって送信された第３の音声情報を受信する。 Next, in step S51, the communication unit 101 of the terminal 100 receives the third voice information transmitted by the server 200.

次に、ステップＳ５２において、第１の音声認識部１３は、通信部１０１によって受信された第３の音声情報に対して音声認識を行い、音声認識結果を第４の音声認識結果情報として調停部１４に出力する。また、第１の音声認識部１３は、第４の音声認識結果情報の尤もらしさを示す第４の尤度を算出し、算出した第４の尤度を調停部１４に出力する。 Next, in step S52, the first voice recognition unit 13 performs voice recognition on the third voice information received by the communication unit 101, and outputs the voice recognition result to the arbitration unit 14 as the fourth voice recognition result information. Further, the first voice recognition unit 13 calculates a fourth likelihood indicating the plausibility of the fourth voice recognition result information, and outputs the calculated fourth likelihood to the arbitration unit 14.

次に、ステップＳ５３において、調停部１４は、第４の音声認識結果情報の尤もらしさを示す第４の尤度が第４の閾値より大きいか否かを判断する。なお、第４の閾値は、第４の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第４の尤度が第４の閾値より大きいと判断された場合（ステップＳ５３でＹＥＳ）、ステップＳ５４において、調停部１４は、第４の音声認識結果情報を出力する。 Next, in step S53, the arbitration unit 14 determines whether or not the fourth likelihood indicating the likelihood of the fourth voice recognition result information is greater than the fourth threshold value. The fourth threshold value is a threshold value at which it can be determined that the fourth voice recognition result information is a correct recognition result. Here, when it is determined that the fourth likelihood is larger than the fourth threshold value (YES in step S53), in step S54, the arbitration unit 14 outputs the fourth voice recognition result information.

一方、第４の尤度が第４の閾値以下であると判断された場合（ステップＳ５３でＮＯ）、ステップＳ５５の処理へ移行する。 On the other hand, when it is determined that the fourth likelihood is equal to or less than the fourth threshold value (NO in step S53), the process proceeds to step S55.

ここで、端末１００のステップＳ５２〜ステップＳ５３の処理に並行して、サーバ２００のステップＳ７４〜ステップＳ７５の処理が行われる。 Here, the processes of steps S74 to S75 of the server 200 are performed in parallel with the processes of steps S52 to S53 of the terminal 100.

ステップＳ７４において、第２の音声認識部２２は、第２の収音処理部２１から出力された第３の音声情報に対して音声認識を行い、音声認識結果を第２の音声認識結果情報として通信部２０１に出力する。また、第２の音声認識部２２は、第２の音声認識結果情報の尤もらしさを示す第２の尤度を算出し、算出した第２の尤度を通信部２０１に出力する。 In step S74, the second voice recognition unit 22 performs voice recognition on the third voice information output from the second sound collection processing unit 21, and outputs the voice recognition result to the communication unit 201 as the second voice recognition result information. Further, the second voice recognition unit 22 calculates the second likelihood indicating the plausibility of the second voice recognition result information, and outputs the calculated second likelihood to the communication unit 201.

次に、ステップＳ７５において、通信部２０１は、第２の音声認識部２２から出力された第２の音声認識結果情報および第２の尤度を端末１００へ送信する。 Next, in step S75, the communication unit 201 transmits the second voice recognition result information and the second likelihood output from the second voice recognition unit 22 to the terminal 100.

次に、ステップＳ５５において、端末１００の通信部１０１は、サーバ２００によって送信された第２の音声認識結果情報および第２の尤度を受信する。また、通信部１０１は、第２の音声認識結果情報および第２の尤度を調停部１４へ出力する。 Next, in step S55, the communication unit 101 of the terminal 100 receives the second voice recognition result information and the second likelihood transmitted by the server 200. Further, the communication unit 101 outputs the second voice recognition result information and the second likelihood to the arbitration unit 14.

次に、ステップＳ５６において、調停部１４は、第２の音声認識結果情報の尤もらしさを示す第２の尤度が第２の閾値より大きいか否かを判断する。なお、第２の閾値は、第２の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第２の尤度が第２の閾値より大きいと判断された場合（ステップＳ５６でＹＥＳ）、ステップＳ５７において、調停部１４は、第２の音声認識結果情報を出力する。 Next, in step S56, the arbitration unit 14 determines whether or not the second likelihood indicating the likelihood of the second voice recognition result information is larger than the second threshold value. The second threshold value is a threshold value at which it can be determined that the second voice recognition result information is the correct recognition result. Here, when it is determined that the second likelihood is larger than the second threshold value (YES in step S56), in step S57, the arbitration unit 14 outputs the second voice recognition result information.

一方、第２の尤度が第２の閾値以下であると判断された場合（ステップＳ５６でＮＯ）、ステップＳ５８において、表示部１０６は、音声認識ができなかったことを示す音声認識不可情報をユーザに通知する。 On the other hand, when it is determined that the second likelihood is equal to or less than the second threshold value (NO in step S56), in step S58, the display unit 106 notifies the user of voice recognition failure information indicating that voice recognition could not be performed.

なお、第１の閾値、第２の閾値、第３の閾値および第４の閾値は、例えば、端末１００のメモリ１０５に予め記憶されている。 The first threshold value, the second threshold value, the third threshold value, and the fourth threshold value are stored in advance in, for example, the memory 105 of the terminal 100.
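
The decision sequence of steps S46, S49, S53, and S56 amounts to checking the results in the order first, third, fourth, second and outputting the first one whose likelihood exceeds its threshold. The following Python sketch shows that cascade under a simplifying assumption: all four results are already available, whereas in the flowcharts they arrive one after another. The function name `arbitrate` and all numeric values are illustrative only.

```python
from typing import Dict, Optional

# Order in which FIGS. 9 to 11 examine the results (S46, S49, S53, S56).
CHECK_ORDER = (1, 3, 4, 2)


def arbitrate(
    results: Dict[int, str],
    likelihoods: Dict[int, float],
    thresholds: Dict[int, float],
) -> Optional[str]:
    """Return the first result whose likelihood is strictly greater than its threshold."""
    for n in CHECK_ORDER:
        if likelihoods[n] > thresholds[n]:
            return results[n]
    return None  # corresponds to step S58: notify that recognition failed


# Hypothetical values; the disclosure only states that the thresholds are stored in memory 105.
thresholds = {1: 0.6, 2: 0.6, 3: 0.6, 4: 0.6}
results = {1: "r1", 2: "r2", 3: "r3", 4: "r4"}
likelihoods = {1: 0.3, 2: 0.7, 3: 0.5, 4: 0.8}

print(arbitrate(results, likelihoods, thresholds))
```

A likelihood exactly equal to its threshold is rejected, matching the "equal to or less than" branches of the flowcharts.
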

本実施の形態では、第１の音声認識部１３は、第１の音声認識結果情報、第１の尤度、第４の音声認識結果情報および第４の尤度をそれぞれ調停部１４に出力するものを例に説明をしたが、これに限定をされない。 In the present embodiment, an example has been described in which the first voice recognition unit 13 outputs the first voice recognition result information, the first likelihood, the fourth voice recognition result information, and the fourth likelihood individually to the arbitration unit 14, but the present disclosure is not limited to this.

第１の音声認識部１３は、第４の尤度を、第３の音声情報に対する音声認識を行うときに算出する。例えば、第１の音声認識部１３は、第３の音声情報の音声認識結果および第４の尤度を含む第４の音声認識結果情報を調停部１４に出力してもよい。 The first voice recognition unit 13 calculates the fourth likelihood when performing voice recognition for the third voice information. For example, the first voice recognition unit 13 may output the voice recognition result of the third voice information and the fourth voice recognition result information including the fourth likelihood to the arbitration unit 14.

この場合、調停部１４は、第４の音声認識結果情報の中から必要に応じて、第３の音声情報の音声認識結果および第４の尤度を取り出して処理を行えばよい。 In this case, the arbitration unit 14 may extract the voice recognition result of the third voice information and the fourth likelihood from the fourth voice recognition result information as necessary and perform processing.

本実施の形態では、第２の音声認識部２２は、第２の音声認識結果情報、第２の尤度、第３の音声認識結果情報および第３の尤度をそれぞれ通信部２０１に出力するものを例に説明をしたが、これに限定をされない。 In the present embodiment, an example has been described in which the second voice recognition unit 22 outputs the second voice recognition result information, the second likelihood, the third voice recognition result information, and the third likelihood individually to the communication unit 201, but the present disclosure is not limited to this.

第２の音声認識部２２は、第３の尤度を、第２の音声情報に対する音声認識を行うときに算出する。例えば、第２の音声認識部２２は、第２の音声情報の音声認識結果および第３の尤度を含む第３の音声認識結果情報を通信部２０１に出力してもよい。 The second voice recognition unit 22 calculates the third likelihood when performing voice recognition for the second voice information. For example, the second voice recognition unit 22 may output the voice recognition result of the second voice information and the third voice recognition result information including the third likelihood to the communication unit 201.

この場合、通信部２０１、通信部１０１および調停部１４のいずれかは、第３の音声認識結果情報の中から必要に応じて、第２の音声情報の音声認識結果または第３の尤度を取り出して処理を行ってもよい。 In this case, any of the communication unit 201, the communication unit 101, and the arbitration unit 14 may extract the voice recognition result of the second voice information or the third likelihood from the third voice recognition result information as necessary and perform processing.
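
Packaging the likelihood inside the recognition result information can be sketched as a single record that each unit unpacks as needed. This is a minimal illustration only: the record layout and the use of JSON as the transfer encoding between server 200 and terminal 100 are assumptions, as the disclosure does not specify a format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RecognitionResultInfo:
    text: str          # voice recognition result
    likelihood: float  # plausibility of that result


def encode(info: RecognitionResultInfo) -> str:
    """Serialize the combined record, e.g. before the server sends it."""
    return json.dumps(asdict(info))


def decode(payload: str) -> RecognitionResultInfo:
    """Rebuild the record on the receiving side."""
    return RecognitionResultInfo(**json.loads(payload))


third_info = RecognitionResultInfo(text="weather tomorrow", likelihood=0.82)
received = decode(encode(third_info))

# The receiver extracts only the field it needs.
print(received.likelihood)
```
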

また、本実施の形態では、第１の音声認識部１３が、第１の音声認識結果情報、第１の尤度、第４の音声認識結果情報、および第４の尤度を調停部１４へ出力し、通信部１０１がサーバ２００により送信された第２の音声認識結果情報、第２の尤度、第３の音声認識結果情報、第３の尤度を調停部１４へ出力するものを例に説明をした。しかしながら、これに限定をされるものではない。 Further, in the present embodiment, an example has been described in which the first voice recognition unit 13 outputs the first voice recognition result information, the first likelihood, the fourth voice recognition result information, and the fourth likelihood to the arbitration unit 14, and the communication unit 101 outputs the second voice recognition result information, the second likelihood, the third voice recognition result information, and the third likelihood transmitted by the server 200 to the arbitration unit 14. However, the present disclosure is not limited to this.

調停部１４は、第１の尤度を受け取れば、予め保持する第１の閾値との比較結果に応じて、第１の音声認識結果情報を出力すべきかどうかを判断できる。 Upon receiving the first likelihood, the arbitration unit 14 can determine whether or not to output the first voice recognition result information according to the comparison result with the first threshold value held in advance.

調停部１４は、第２の尤度を受け取れば、予め保持する第２の閾値との比較結果に応じて、第２の音声認識結果情報を出力すべきかどうかを判断できる。 Upon receiving the second likelihood, the arbitration unit 14 can determine whether or not to output the second voice recognition result information according to the comparison result with the second threshold value held in advance.

調停部１４は、第３の尤度を受け取れば、予め保持する第３の閾値との比較結果に応じて、第３の音声認識結果情報を出力すべきかどうかを判断できる。 Upon receiving the third likelihood, the arbitration unit 14 can determine whether or not to output the third voice recognition result information according to the comparison result with the third threshold value held in advance.

また、調停部１４は、第４の尤度を受け取れば、予め保持する第４の閾値との比較結果に応じて、第４の音声認識結果情報を出力すべきかどうかを判断できる。 Further, when the arbitration unit 14 receives the fourth likelihood, it can determine whether or not to output the fourth voice recognition result information according to the comparison result with the fourth threshold value held in advance.

例えば、第１の音声認識部１３は、第１の音声認識結果情報および第４の音声認識結果情報を調停部１４に出力するのではなく、端末１００のメモリ１０５に記憶してもよい。この場合、通信部１０１は、第１の尤度および第４の尤度を調停部１４に出力する。 For example, the first voice recognition unit 13 may store the first voice recognition result information and the fourth voice recognition result information in the memory 105 of the terminal 100 instead of outputting them to the arbitration unit 14. In this case, the communication unit 101 outputs the first likelihood and the fourth likelihood to the arbitration unit 14.

また、例えば、通信部１０１は、第２の音声認識結果情報および第３の音声認識結果情報を調停部１４に出力するのではなく、端末１００のメモリ１０５に記憶してもよい。この場合、通信部１０１は、第２の尤度および第３の尤度を調停部１４に出力する。 Further, for example, the communication unit 101 may store the second voice recognition result information and the third voice recognition result information in the memory 105 of the terminal 100 instead of outputting them to the arbitration unit 14. In this case, the communication unit 101 outputs the second likelihood and the third likelihood to the arbitration unit 14.

また、例えば、調停部１４は、出力すべきと判断した第１の音声認識結果情報〜第４の音声認識結果情報のいずれかをメモリ１０５から取り出して出力してもよい。 Further, for example, the arbitration unit 14 may take out any one of the first voice recognition result information to the fourth voice recognition result information determined to be output from the memory 105 and output the information.

また、調停部１４は、第１の音声認識結果情報〜第４の音声認識結果情報のいずれも出力すべきではないと判断した場合、メモリ１０５から、第１の音声認識結果情報〜第４の音声認識結果情報を削除してもよい。 Further, when the arbitration unit 14 determines that none of the first to fourth voice recognition result information should be output, it may delete the first to fourth voice recognition result information from the memory 105.
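
In the memory-based variation, the arbitration unit sees only likelihoods, fetches the selected result from the memory 105, and clears the stored results when none qualifies. The sketch below illustrates this, with a plain dictionary standing in for the memory 105. The function name, the check order parameter, and the numeric values are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def arbitrate_from_memory(
    memory: Dict[int, str],
    likelihoods: Dict[int, float],
    thresholds: Dict[int, float],
    order: Tuple[int, ...] = (1, 3, 4, 2),
) -> Optional[str]:
    """Decide using likelihoods only, then read the chosen result from memory."""
    for n in order:
        if likelihoods[n] > thresholds[n]:
            return memory[n]  # fetch only the result that will be output
    # No result is acceptable: delete all stored results.
    for n in order:
        memory.pop(n, None)
    return None


thresholds = {1: 0.6, 2: 0.6, 3: 0.6, 4: 0.6}

memory_ok = {1: "r1", 2: "r2", 3: "r3", 4: "r4"}
chosen = arbitrate_from_memory(memory_ok, {1: 0.2, 2: 0.1, 3: 0.7, 4: 0.9}, thresholds)

memory_fail = {1: "r1", 2: "r2", 3: "r3", 4: "r4"}
rejected = arbitrate_from_memory(memory_fail, {1: 0.2, 2: 0.1, 3: 0.3, 4: 0.4}, thresholds)

print(chosen, rejected, memory_fail)
```
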

なお、本実施の形態２では、ステップＳ６３の第３の音声認識結果情報を送信する処理は、ステップＳ７３の第３の音声情報を送信する処理よりも先に行われている。しかしながら、第３の音声情報を送信する処理が、第３の音声認識結果情報を送信する処理よりも先に行われる場合もある。 In the second embodiment, the process of transmitting the third voice recognition result information in step S63 is performed before the process of transmitting the third voice information in step S73. However, the process of transmitting the third voice information may be performed before the process of transmitting the third voice recognition result information.

そこで、第３の音声情報を送信する処理が、第３の音声認識結果情報を送信する処理よりも先に行われる実施の形態２の変形例について説明する。 Therefore, a modified example of the second embodiment in which the process of transmitting the third voice information is performed before the process of transmitting the third voice recognition result information will be described.

図１２は、本開示の実施の形態２の変形例における音声認識システムの動作の一例を示す第１のフローチャートであり、図１３は、本開示の実施の形態２の変形例における音声認識システムの動作の一例を示す第２のフローチャートである。なお、図１２のステップＳ４６以前の処理は、図９のステップＳ４１〜Ｓ４５の処理と同じであり、図１２のステップＳ１０１以前の処理は、図９のステップＳ７１〜Ｓ７２の処理と同じであり、図１３のステップＳ１１１以前の処理は、図９のステップＳ６１〜Ｓ６２の処理と同じである。 FIG. 12 is a first flowchart showing an example of the operation of the voice recognition system in the modified example of the second embodiment of the present disclosure, and FIG. 13 is a second flowchart showing an example of the operation of the voice recognition system in the modified example of the second embodiment of the present disclosure. The processing before step S46 in FIG. 12 is the same as the processing in steps S41 to S45 in FIG. 9, the processing before step S101 in FIG. 12 is the same as the processing in steps S71 to S72 in FIG. 9, and the processing before step S111 in FIG. 13 is the same as the processing in steps S61 to S62 in FIG. 9.

ステップＳ１０１において、通信部２０１は、第２の収音処理部２１から出力された第３の音声情報を端末１００へ送信する。 In step S101, the communication unit 201 transmits the third voice information output from the second sound collection processing unit 21 to the terminal 100.

次に、ステップＳ８１において、端末１００の通信部１０１は、サーバ２００によって送信された第３の音声情報を受信する。 Next, in step S81, the communication unit 101 of the terminal 100 receives the third voice information transmitted by the server 200.

次に、ステップＳ８２において、第１の音声認識部１３は、通信部１０１によって受信された第３の音声情報に対して音声認識を行い、音声認識結果を第４の音声認識結果情報として調停部１４に出力する。また、第１の音声認識部１３は、第４の音声認識結果情報の尤もらしさを示す第４の尤度を算出し、算出した第４の尤度を調停部１４に出力する。 Next, in step S82, the first voice recognition unit 13 performs voice recognition on the third voice information received by the communication unit 101, and outputs the voice recognition result to the arbitration unit 14 as the fourth voice recognition result information. Further, the first voice recognition unit 13 calculates a fourth likelihood indicating the plausibility of the fourth voice recognition result information, and outputs the calculated fourth likelihood to the arbitration unit 14.

次に、ステップＳ８３において、調停部１４は、第４の音声認識結果情報の尤もらしさを示す第４の尤度が第４の閾値より大きいか否かを判断する。なお、第４の閾値は、第４の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第４の尤度が第４の閾値より大きいと判断された場合（ステップＳ８３でＹＥＳ）、ステップＳ８４において、調停部１４は、第４の音声認識結果情報を出力する。 Next, in step S83, the arbitration unit 14 determines whether or not the fourth likelihood indicating the likelihood of the fourth voice recognition result information is greater than the fourth threshold value. The fourth threshold value is a threshold value at which it can be determined that the fourth voice recognition result information is a correct recognition result. Here, when it is determined that the fourth likelihood is larger than the fourth threshold value (YES in step S83), in step S84, the arbitration unit 14 outputs the fourth voice recognition result information.

一方、第４の尤度が第４の閾値以下であると判断された場合（ステップＳ８３でＮＯ）、ステップＳ８５の処理へ移行する。 On the other hand, when it is determined that the fourth likelihood is equal to or less than the fourth threshold value (NO in step S83), the process proceeds to step S85.

ここで、端末１００のステップＳ４４〜ステップＳ８３の処理に並行して、サーバ２００のステップＳ６１〜ステップＳ１１１の処理が行われる。 Here, in parallel with the processing of steps S44 to S83 of the terminal 100, the processing of steps S61 to S111 of the server 200 is performed.

次に、ステップＳ１１１において、通信部２０１は、第２の音声認識部２２から出力された第３の音声認識結果情報および第３の尤度を端末１００へ送信する。 Next, in step S111, the communication unit 201 transmits the third voice recognition result information and the third likelihood output from the second voice recognition unit 22 to the terminal 100.

次に、ステップＳ８５において、端末１００の通信部１０１は、サーバ２００によって送信された第３の音声認識結果情報および第３の尤度を受信する。通信部１０１は、第３の音声認識結果情報および第３の尤度を調停部１４へ出力する。 Next, in step S85, the communication unit 101 of the terminal 100 receives the third voice recognition result information and the third likelihood transmitted by the server 200. The communication unit 101 outputs the third voice recognition result information and the third likelihood to the arbitration unit 14.

次に、ステップＳ８６において、調停部１４は、第３の音声認識結果情報の尤もらしさを示す第３の尤度が第３の閾値より大きいか否かを判断する。なお、第３の閾値は、第３の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第３の尤度が第３の閾値より大きいと判断された場合（ステップＳ８６でＹＥＳ）、ステップＳ８７において、調停部１４は、第３の音声認識結果情報を出力する。 Next, in step S86, the arbitration unit 14 determines whether or not the third likelihood indicating the likelihood of the third voice recognition result information is larger than the third threshold value. The third threshold value is a threshold value at which it can be determined that the third voice recognition result information is the correct recognition result. Here, when it is determined that the third likelihood is larger than the third threshold value (YES in step S86), in step S87, the arbitration unit 14 outputs the third voice recognition result information.

一方、第３の尤度が第３の閾値以下であると判断された場合（ステップＳ８６でＮＯ）、ステップＳ８８の処理へ移行する。 On the other hand, when it is determined that the third likelihood is equal to or less than the third threshold value (NO in step S86), the process proceeds to step S88.

ここで、端末１００のステップＳ８２〜ステップＳ８６の処理に並行して、サーバ２００のステップＳ１０２〜ステップＳ１０３の処理が行われる。 Here, in parallel with the processing of steps S82 to S86 of the terminal 100, the processing of steps S102 to S103 of the server 200 is performed.

ステップＳ１０２において、第２の音声認識部２２は、第２の収音処理部２１から出力された第３の音声情報に対して音声認識を行い、音声認識結果を第２の音声認識結果情報として通信部２０１に出力する。また、第２の音声認識部２２は、第２の音声認識結果情報の尤もらしさを示す第２の尤度を算出し、算出した第２の尤度を通信部２０１に出力する。 In step S102, the second voice recognition unit 22 performs voice recognition on the third voice information output from the second sound collection processing unit 21, and outputs the voice recognition result to the communication unit 201 as the second voice recognition result information. Further, the second voice recognition unit 22 calculates the second likelihood indicating the plausibility of the second voice recognition result information, and outputs the calculated second likelihood to the communication unit 201.

次に、ステップＳ１０３において、通信部２０１は、第２の音声認識部２２から出力された第２の音声認識結果情報および第２の尤度を端末１００へ送信する。 Next, in step S103, the communication unit 201 transmits the second voice recognition result information and the second likelihood output from the second voice recognition unit 22 to the terminal 100.

次に、ステップＳ８８において、端末１００の通信部１０１は、サーバ２００によって送信された第２の音声認識結果情報および第２の尤度を受信する。通信部１０１は、第２の音声認識結果情報および第２の尤度を調停部１４に出力する。 Next, in step S88, the communication unit 101 of the terminal 100 receives the second voice recognition result information and the second likelihood transmitted by the server 200. The communication unit 101 outputs the second voice recognition result information and the second likelihood to the arbitration unit 14.

次に、ステップＳ８９において、調停部１４は、第２の音声認識結果情報の尤もらしさを示す第２の尤度が第２の閾値より大きいか否かを判断する。なお、第２の閾値は、第２の音声認識結果情報が正しい認識結果であると判断可能な閾値である。ここで、第２の尤度が第２の閾値より大きいと判断された場合（ステップＳ８９でＹＥＳ）、ステップＳ９０において、調停部１４は、第２の音声認識結果情報を出力する。 Next, in step S89, the arbitration unit 14 determines whether or not the second likelihood indicating the likelihood of the second voice recognition result information is larger than the second threshold value. The second threshold value is a threshold value at which it can be determined that the second voice recognition result information is the correct recognition result. Here, when it is determined that the second likelihood is larger than the second threshold value (YES in step S89), in step S90, the arbitration unit 14 outputs the second voice recognition result information.

On the other hand, when it is determined that the second likelihood is equal to or less than the second threshold value (NO in step S89), in step S91 the display unit 106 notifies the user of recognition-failure information indicating that voice recognition could not be performed.
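The threshold cascade of steps S86 to S91 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `arbitrate_s86_to_s91`, the callable `receive_second` (standing in for the reception in step S88), and the use of `None` to signal the step S91 notification are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple


def arbitrate_s86_to_s91(
    third_result: str,
    third_likelihood: float,
    third_threshold: float,
    receive_second: Callable[[], Tuple[str, float]],  # hypothetical stand-in for step S88
    second_threshold: float,
) -> Optional[str]:
    """Return the recognition result to output, or None if recognition failed."""
    # Steps S86/S87: accept the third result if its likelihood exceeds the third threshold.
    if third_likelihood > third_threshold:
        return third_result
    # Step S88: obtain the second result and its likelihood from the server.
    second_result, second_likelihood = receive_second()
    # Steps S89/S90: accept the second result if its likelihood exceeds the second threshold.
    if second_likelihood > second_threshold:
        return second_result
    # Step S91: nothing is reliable enough; the caller should notify the user.
    return None
```

A caller would treat a `None` return as the cue to display the recognition-failure notification.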

(Embodiment 3)
Next, the voice recognition system according to the third embodiment will be described. The overall configuration of the voice recognition system in the third embodiment is the same as in FIG. 1, so its description is omitted.

図１４は、本開示の実施の形態３における音声認識システムの機能構成を示す図である。図１４に示すように、音声認識システムは、端末１００及びサーバ２００を備える。端末１００は、音声取得部１１、第１の収音処理部１２、第１の音声認識部１３及び調停部１４を備える。サーバ２００は、第２の収音処理部２１を備える。 FIG. 14 is a diagram showing a functional configuration of the voice recognition system according to the third embodiment of the present disclosure. As shown in FIG. 14, the voice recognition system includes a terminal 100 and a server 200. The terminal 100 includes a voice acquisition unit 11, a first sound collection processing unit 12, a first voice recognition unit 13, and an arbitration unit 14. The server 200 includes a second sound collecting processing unit 21.

実施の形態１における音声認識システムと、実施の形態３における音声認識システムとの差異は、サーバ２００が第２の音声認識部２２を備えているか否かである。 The difference between the voice recognition system according to the first embodiment and the voice recognition system according to the third embodiment is whether or not the server 200 includes the second voice recognition unit 22.

第２の収音処理部２１は、通信部２０１によって受信された第１の音声情報に含まれるノイズを除去し、ノイズを除去した第３の音声情報を出力する。 The second sound collection processing unit 21 removes noise included in the first voice information received by the communication unit 201, and outputs the third voice information from which the noise has been removed.

サーバ２００の通信部２０１は、第２の収音処理部２１から出力された第３の音声情報を端末１００へ送信する。 The communication unit 201 of the server 200 transmits the third voice information output from the second sound collection processing unit 21 to the terminal 100.

The first voice recognition unit 13 performs voice recognition on the second voice information output by the first sound collection processing unit 12 and outputs the voice recognition result to the arbitration unit 14 as the first voice recognition result information. The first voice recognition unit 13 also calculates a first likelihood indicating the likelihood of the first voice recognition result information and outputs the calculated first likelihood to the arbitration unit 14.

また、第１の音声認識部１３は、通信部１０１によって受信された第３の音声情報に対して音声認識を行い、音声認識結果を第４の音声認識結果情報として調停部１４に出力する。また、第１の音声認識部１３は、第４の音声認識結果情報の尤もらしさを示す第４の尤度を算出し、算出した第４の尤度を調停部１４に出力する。 Further, the first voice recognition unit 13 performs voice recognition on the third voice information received by the communication unit 101, and outputs the voice recognition result to the arbitration unit 14 as the fourth voice recognition result information. Further, the first voice recognition unit 13 calculates a fourth likelihood indicating the plausibility of the fourth voice recognition result information, and outputs the calculated fourth likelihood to the arbitration unit 14.

The arbitration unit 14 selects which of the first voice recognition result information and the fourth voice recognition result information, both output by the first voice recognition unit 13, is to be output. The processing of the arbitration unit 14 is the same as in the other embodiments, so its description is omitted.

(Embodiment 4)
Next, the voice recognition system according to the fourth embodiment will be described. The overall configuration of the voice recognition system in the fourth embodiment is the same as in FIG. 1, so its description is omitted.

図１５は、本開示の実施の形態４における音声認識システムの機能構成を示す図である。図１５に示すように、音声認識システムは、端末１００及びサーバ２００を備える。端末１００は、音声取得部１１及び第１の収音処理部１２を備える。サーバ２００は、第２の収音処理部２１、第２の音声認識部２２及び調停部２３を備える。 FIG. 15 is a diagram showing a functional configuration of the voice recognition system according to the fourth embodiment of the present disclosure. As shown in FIG. 15, the voice recognition system includes a terminal 100 and a server 200. The terminal 100 includes a voice acquisition unit 11 and a first sound collection processing unit 12. The server 200 includes a second sound collection processing unit 21, a second voice recognition unit 22, and an arbitration unit 23.

実施の形態１の変形例における音声認識システム（図７）と、実施の形態４における音声認識システムとの差異は、端末１００が第１の音声認識部１３を備えているか否かである。 The difference between the voice recognition system (FIG. 7) in the modified example of the first embodiment and the voice recognition system in the fourth embodiment is whether or not the terminal 100 includes the first voice recognition unit 13.

通信部１０１は、第１の収音処理部１２によって出力された第２の音声情報をサーバ２００へ送信する。サーバ２００の通信部２０１は、端末１００によって送信された第２の音声情報を受信し、第２の音声認識部２２へ出力する。第２の音声認識部２２は、通信部２０１によって受信された第２の音声情報に対して音声認識を行い、音声認識結果を第３の音声認識結果情報として調停部２３へ出力する。 The communication unit 101 transmits the second voice information output by the first sound collection processing unit 12 to the server 200. The communication unit 201 of the server 200 receives the second voice information transmitted by the terminal 100 and outputs it to the second voice recognition unit 22. The second voice recognition unit 22 performs voice recognition on the second voice information received by the communication unit 201, and outputs the voice recognition result to the arbitration unit 23 as the third voice recognition result information.

The second voice recognition unit 22 also performs voice recognition on the third voice information output by the second sound collection processing unit 21 and outputs the voice recognition result to the arbitration unit 23 as the second voice recognition result information.

The arbitration unit 23 selects which of the third voice recognition result information and the second voice recognition result information, both output from the second voice recognition unit 22, is to be output. The processing of the arbitration unit 23 is the same as in the other embodiments, so its description is omitted.

(Embodiment 5)
Next, the voice recognition system according to the fifth embodiment will be described. The overall configuration of the voice recognition system in the fifth embodiment is the same as in FIG. 1, so its description is omitted.

図１６は、本開示の実施の形態５における音声認識システムの機能構成を示す図である。図１６に示すように、音声認識システムは、端末１００及びサーバ２００を備える。端末１００は、音声取得部１１、第１の収音処理部１２、第１の音声認識部１３、調停部１４、発話区間検出部１５及び発話継続時間測定部１７を備える。サーバ２００は、第２の収音処理部２１及び第２の音声認識部２２を備える。 FIG. 16 is a diagram showing a functional configuration of the voice recognition system according to the fifth embodiment of the present disclosure. As shown in FIG. 16, the voice recognition system includes a terminal 100 and a server 200. The terminal 100 includes a voice acquisition unit 11, a first sound collection processing unit 12, a first voice recognition unit 13, an arbitration unit 14, an utterance section detection unit 15, and an utterance duration measurement unit 17. The server 200 includes a second sound collection processing unit 21 and a second voice recognition unit 22.

実施の形態１における音声認識システムと、実施の形態５における音声認識システムとの差異は、端末１００が発話区間検出部１５および発話継続時間測定部１７を備えているか否かである。 The difference between the voice recognition system according to the first embodiment and the voice recognition system according to the fifth embodiment is whether or not the terminal 100 includes the utterance section detection unit 15 and the utterance duration measurement unit 17.

発話区間検出部１５は、音声取得部１１によって取得された第１の音声情報におけるユーザが発話した発話区間を検出する。発話区間検出部１５は、一般的な発話区間検出（ＶＡＤ：ＶｏｉｃｅＡｃｔｉｖｉｔｙＤｅｔｅｃｔｉｏｎ）技術を用いて発話区間を検出する。例えば、発話区間検出部１５は、入力された音声信号の時系列で構成されたフレームにおいて、振幅と零交差数とに基づいて、そのフレームが音声区間か否かを検出する。また、例えば、発話区間検出部１５は、入力される音声情報の特徴量に基づき、ユーザが発話中である確率を音声モデルにより算出するとともに、ユーザの発話がない状態である確率を雑音モデルにより算出し、雑音モデルから得られた確率よりも音声モデルから得られた確率の方が高い区間を発話区間であると判定してもよい。 The utterance section detection unit 15 detects the utterance section spoken by the user in the first voice information acquired by the voice acquisition unit 11. The utterance section detection unit 15 detects the utterance section by using a general utterance section detection (VAD: Voice Activity Detection) technique. For example, the utterance section detection unit 15 detects whether or not the frame is a voice section based on the amplitude and the number of zero crossings in the frame composed of the time series of the input voice signal. Further, for example, the utterance section detection unit 15 calculates the probability that the user is speaking by the voice model based on the feature amount of the input voice information, and calculates the probability that the user is not speaking by the noise model. A section in which the probability obtained from the voice model is higher than the probability obtained from the noise model may be determined to be the utterance section.
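The amplitude and zero-crossing approach mentioned above can be sketched as a simple frame classifier. This is an illustrative assumption about how such a detector might look, not the method the utterance section detection unit 15 actually uses: the function names, the peak-amplitude criterion, and every threshold value (`amp_threshold`, `zc_min`, `zc_max`, `frame_len`) are hypothetical.

```python
from typing import List, Sequence


def zero_crossings(frame: Sequence[float]) -> int:
    """Count sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def is_speech_frame(
    frame: Sequence[float],
    amp_threshold: float = 0.1,  # hypothetical minimum peak amplitude
    zc_min: int = 1,             # hypothetical lower bound on zero crossings
    zc_max: int = 60,            # hypothetical upper bound (rejects hiss-like noise)
) -> bool:
    """Classify a frame as speech when it is loud enough and its
    zero-crossing count lies in a plausible range."""
    if not frame:
        return False
    peak = max(abs(s) for s in frame)
    return peak >= amp_threshold and zc_min <= zero_crossings(frame) <= zc_max


def detect_speech_frames(signal: Sequence[float], frame_len: int = 160) -> List[bool]:
    """Split the signal into non-overlapping frames and classify each one."""
    return [
        is_speech_frame(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]
```

The model-based alternative described in the same paragraph would replace `is_speech_frame` with a comparison of the probabilities produced by a voice model and a noise model.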

発話継続時間測定部１７は、発話区間検出部１５によって発話区間が検出された場合に、音声があると判断された区間（フレーム）の開始から終了までの時間を測定する。 The utterance duration measurement unit 17 measures the time from the start to the end of the section (frame) determined to have voice when the utterance section is detected by the utterance section detection unit 15.
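As a rough sketch of this measurement, the duration can be derived from per-frame speech flags. The function name `utterance_duration_sec` and the representation of the detector output as a list of booleans are assumptions for illustration only.

```python
from typing import Sequence


def utterance_duration_sec(flags: Sequence[bool], frame_sec: float) -> float:
    """Time from the start of the first speech frame to the end of the last one.

    `flags[i]` is True when frame i was judged to contain voice;
    `frame_sec` is the (assumed constant) length of one frame in seconds.
    """
    speech = [i for i, is_speech in enumerate(flags) if is_speech]
    if not speech:
        return 0.0  # no utterance section was detected
    return (speech[-1] - speech[0] + 1) * frame_sec
```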

第１の収音処理部１２は、発話区間検出部１５によって発話区間が検出されない場合には、第１の音声情報に含まれるノイズを除去せず、第２の音声情報を出力しない。また、通信部１０１は、発話区間検出部１５によって発話区間が検出されない場合には、第１の音声情報をサーバ２００へ送信しない。 When the utterance section detection unit 15 does not detect the utterance section, the first sound collection processing unit 12 does not remove the noise included in the first voice information and does not output the second voice information. Further, the communication unit 101 does not transmit the first voice information to the server 200 when the utterance section is not detected by the utterance section detection unit 15.

第１の収音処理部１２は、発話区間検出部１５によって発話区間が検出された場合には、第１の音声情報に含まれるノイズを除去する。また、通信部１０１は、発話区間検出部１５によって発話区間が検出された場合には、発話区間内における第１の音声情報をサーバ２００へ送信する。 When the utterance section is detected by the utterance section detection unit 15, the first sound collection processing unit 12 removes noise included in the first voice information. Further, when the utterance section is detected by the utterance section detection unit 15, the communication unit 101 transmits the first voice information in the utterance section to the server 200.

The arbitration unit 14 selects which of the first voice recognition result information output by the voice recognition unit and the second voice recognition result information received by the communication unit 101 is to be output, using at least information on the length of the utterance duration. Specifically, when the utterance duration measured by the utterance duration measurement unit 17 is longer than a predetermined length, the arbitration unit 14 makes the weight by which the second likelihood (the likelihood of the second voice recognition result information) is multiplied larger than the weight by which the first likelihood (the likelihood of the first voice recognition result information) is multiplied. When the utterance duration is longer than the predetermined length, the user is likely giving an advanced voice instruction containing many words. In that case, increasing the weight applied to the voice recognition result output from the server 200 can prevent erroneous recognition.
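The duration-dependent weighting can be illustrated with a short sketch. The text only states that the server-side weight is raised for long utterances; choosing the result with the larger weighted likelihood, as well as the names and numeric values below (`long_threshold_sec`, `server_weight_long`), are assumptions made for the example.

```python
def select_by_duration(
    first_result: str,
    first_likelihood: float,
    second_result: str,
    second_likelihood: float,
    duration_sec: float,
    long_threshold_sec: float = 2.0,  # hypothetical "predetermined length"
    server_weight_long: float = 1.5,  # hypothetical raised weight for the server result
) -> str:
    """Pick between the terminal (first) and server (second) results."""
    terminal_weight = 1.0
    # Raise the server-side weight only when the utterance is long.
    server_weight = server_weight_long if duration_sec > long_threshold_sec else 1.0
    # Assumed selection rule: the larger weighted likelihood wins.
    if second_likelihood * server_weight > first_likelihood * terminal_weight:
        return second_result
    return first_result
```

With these assumed values, a long utterance lets a server result with likelihood 0.7 override a terminal result with likelihood 0.8, while a short utterance does not.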

なお、図８に示す実施の形態２における音声認識システムにおいて、端末１００が発話区間検出部１５を備えてもよい。また、図８に示す実施の形態２における音声認識システムにおいて、端末１００が発話区間検出部１５及び発話継続時間測定部１７を備えてもよい。 In the voice recognition system according to the second embodiment shown in FIG. 8, the terminal 100 may include the utterance section detection unit 15. Further, in the voice recognition system according to the second embodiment shown in FIG. 8, the terminal 100 may include the utterance section detection unit 15 and the utterance duration measurement unit 17.

In this case, the arbitration unit 14 selects which of the first, second, third, and fourth voice recognition result information is to be output, using at least information on the length of the utterance duration.

In addition, when the utterance duration is longer than the predetermined length, the arbitration unit 14 makes the weights by which the second likelihood (the likelihood of the second voice recognition result information) and the third likelihood (the likelihood of the third voice recognition result information) are multiplied larger than the weights by which the first likelihood (the likelihood of the first voice recognition result information) and the fourth likelihood (the likelihood of the fourth voice recognition result information) are multiplied.

さらに、調停部１４は、発話継続時間が所定の長さより長い場合に、第２の尤度に乗算する重み付けを、第３の尤度に乗算する重み付けよりも上げる。 Further, the arbitration unit 14 raises the weight to be multiplied by the second likelihood more than the weight to be multiplied by the third likelihood when the utterance duration is longer than a predetermined length.
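The ordering of weights described for the four-result case (second above third, and both above first and fourth, for long utterances) can be sketched as below. The concrete weight values, the threshold, the dictionary keys, and the rule of outputting the candidate with the largest weighted likelihood are all illustrative assumptions.

```python
from typing import Dict, Tuple


def weights_for_duration(duration_sec: float, long_threshold_sec: float = 2.0) -> Dict[str, float]:
    """Return hypothetical per-result weights for the given utterance duration."""
    if duration_sec > long_threshold_sec:
        # Long utterance: second > third > first == fourth.
        return {"first": 1.0, "second": 1.5, "third": 1.25, "fourth": 1.0}
    return {"first": 1.0, "second": 1.0, "third": 1.0, "fourth": 1.0}


def select_among_four(candidates: Dict[str, Tuple[str, float]], duration_sec: float) -> str:
    """`candidates` maps 'first'..'fourth' to (result text, likelihood).

    Assumed selection rule: output the result with the largest weighted likelihood.
    """
    weights = weights_for_duration(duration_sec)
    best = max(candidates, key=lambda k: candidates[k][1] * weights[k])
    return candidates[best][0]
```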

また、本実施の形態５では、取得された音声情報に対して発話区間検出が行われるが、ノイズが除去された音声情報に対して発話区間検出を行ってもよい。そこで、ノイズが除去された音声情報に対して発話区間検出を行う実施の形態５の変形例について説明する。 Further, in the fifth embodiment, the utterance section detection is performed on the acquired voice information, but the utterance section detection may be performed on the voice information from which noise has been removed. Therefore, a modified example of the fifth embodiment in which the utterance section is detected for the voice information from which noise has been removed will be described.

FIG. 17 is a diagram showing the functional configuration of a voice recognition system in a modified example of the fifth embodiment of the present disclosure. As shown in FIG. 17, the terminal 100 includes a voice acquisition unit 11, a first sound collection processing unit 12, a first voice recognition unit 13, an arbitration unit 14, an utterance section detection unit 15, and a voice transmission determination unit 16.

発話区間検出部１５は、第１の収音処理部１２によってノイズが除去された第２の音声情報におけるユーザが発話した発話区間を検出する。発話区間検出部１５は、一般的な発話区間検出技術を用いて発話区間を検出する。 The utterance section detection unit 15 detects the utterance section spoken by the user in the second voice information from which noise has been removed by the first sound collection processing unit 12. The utterance section detection unit 15 detects the utterance section by using a general utterance section detection technique.

音声送信判断部１６は、発話区間検出部１５による発話区間の検出結果に基づいて、音声取得部１１によって取得された第１の音声情報を送信するか否かを判断する。音声送信判断部１６は、発話区間検出部１５によって発話区間が検出された場合、音声取得部１１によって取得された第１の音声情報を送信すると判断し、発話区間検出部１５によって発話区間が検出されない場合、音声取得部１１によって取得された第１の音声情報を送信しないと判断する。通信部１０１は、音声送信判断部１６による判断結果に基づいて、音声取得部１１によって取得された第１の音声情報を送信する。 The voice transmission determination unit 16 determines whether or not to transmit the first voice information acquired by the voice acquisition unit 11 based on the detection result of the utterance section by the utterance section detection unit 15. When the utterance section detection unit 15 detects the utterance section, the voice transmission determination unit 16 determines that the first voice information acquired by the voice acquisition unit 11 is transmitted, and the utterance section detection unit 15 detects the utterance section. If not, it is determined that the first voice information acquired by the voice acquisition unit 11 is not transmitted. The communication unit 101 transmits the first voice information acquired by the voice acquisition unit 11 based on the determination result by the voice transmission determination unit 16.
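The point of this modified example, that detection runs on the noise-removed second voice information while the original first voice information is what gets transmitted, can be sketched as follows. The function `handle_capture` and its three callables are hypothetical stand-ins for the first sound collection processing unit 12, the utterance section detection unit 15, and the communication unit 101.

```python
from typing import Callable, List


def handle_capture(
    first_voice: List[float],
    denoise: Callable[[List[float]], List[float]],  # stand-in for unit 12
    has_utterance: Callable[[List[float]], bool],   # stand-in for unit 15
    send_to_server: Callable[[List[float]], None],  # stand-in for unit 101
) -> bool:
    """Return True if the first voice information was transmitted."""
    second_voice = denoise(first_voice)
    # The transmission decision is based on the noise-removed signal...
    if has_utterance(second_voice):
        # ...but the original (not yet denoised) signal is what is sent.
        send_to_server(first_voice)
        return True
    return False
```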

このように、ノイズが除去された音声情報である第２の音声情報に対して発話区間の検出を行うことにより、より高い精度で発話区間を検出することができる。 In this way, by detecting the utterance section for the second voice information which is the voice information from which noise has been removed, the utterance section can be detected with higher accuracy.

The voice recognition device and voice recognition method according to the present disclosure can improve the accuracy of voice recognition in a noisy environment and speed up voice recognition in a quiet environment, and are useful as a voice recognition device and a voice recognition method that remove noise contained in voice information and perform voice recognition on the noise-removed voice information.

11 Voice acquisition unit
12 First sound collection processing unit
13 First voice recognition unit
14 Arbitration unit
15 Utterance section detection unit
16 Voice transmission determination unit
21 Second sound collection processing unit
22 Second voice recognition unit
23 Arbitration unit
100 Terminal
101 Communication unit
102 Microphone
103 Speaker
104 Control unit
105 Memory
106 Display unit
200 Server
201 Communication unit
202 Control unit
203 Memory
300 Network

Claims

A voice recognition device comprising:
a voice acquisition unit that acquires first voice information;
a noise removal processing unit that removes noise contained in the first voice information acquired by the voice acquisition unit by using a first removal method, and outputs the voice information from which the noise has been removed as second voice information;
a voice recognition unit that performs voice recognition on the second voice information output by the noise removal processing unit and outputs the voice recognition result as first voice recognition result information;
a communication unit that transmits the first voice information acquired by the voice acquisition unit to a server, and receives from the server, as second voice recognition result information, a voice recognition result obtained in the server by removing the noise contained in the first voice information using a second removal method, which removes a larger amount of noise from the first voice information than the first removal method does, and performing voice recognition on third voice information from which the noise has been removed;
an arbitration unit that selects which of the first voice recognition result information output by the voice recognition unit and the second voice recognition result information received by the communication unit is to be output; and
an utterance section detection unit that detects an utterance section spoken by a user in the first voice information acquired by the voice acquisition unit,
wherein, when the utterance section is not detected by the utterance section detection unit, the noise removal processing unit does not remove the noise contained in the first voice information and does not output the second voice information, and the communication unit does not transmit the first voice information to the server,
the voice recognition unit calculates a first likelihood indicating the likelihood of the first voice recognition result information and outputs the calculated first likelihood to the arbitration unit,
the communication unit receives a second likelihood, calculated by the server, indicating the likelihood of the second voice recognition result information and outputs the received second likelihood to the arbitration unit, and
the arbitration unit selects which of the first voice recognition result information and the second voice recognition result information is to be output based on at least one of the first likelihood and the second likelihood.

The voice recognition device according to claim 1, wherein the arbitration unit:
outputs the first voice recognition result information when the first likelihood is larger than a predetermined first threshold value;
outputs the second voice recognition result information when the first likelihood is equal to or less than the first threshold value and the second likelihood is larger than a predetermined second threshold value; and
outputs neither the first voice recognition result information nor the second voice recognition result information when the first likelihood is equal to or less than the first threshold value and the second likelihood is equal to or less than the second threshold value.

A voice recognition device comprising:
a voice acquisition unit that acquires first voice information;
a noise removal processing unit that removes noise contained in the first voice information acquired by the voice acquisition unit by using a first removal method, and outputs the voice information from which the noise has been removed as second voice information;
a voice recognition unit that performs voice recognition on the second voice information output by the noise removal processing unit and outputs the voice recognition result as first voice recognition result information;
a communication unit that transmits the first voice information acquired by the voice acquisition unit to a server, and receives from the server, as second voice recognition result information, a voice recognition result obtained in the server by removing the noise contained in the first voice information using a second removal method, which removes a larger amount of noise from the first voice information than the first removal method does, and performing voice recognition on third voice information from which the noise has been removed;
an arbitration unit that selects which of the first voice recognition result information output by the voice recognition unit and the second voice recognition result information received by the communication unit is to be output;
an utterance section detection unit that detects an utterance section spoken by a user in the first voice information acquired by the voice acquisition unit; and
an utterance duration measurement unit that, when the utterance section is detected by the utterance section detection unit, measures an utterance duration, which is the duration of the detected utterance section,
wherein, when the utterance section is not detected by the utterance section detection unit, the noise removal processing unit does not remove the noise contained in the first voice information and does not output the second voice information, and the communication unit does not transmit the first voice information to the server,
when the utterance section is detected by the utterance section detection unit, the noise removal processing unit removes the noise contained in the first voice information, and the communication unit transmits the first voice information in the utterance section to the server, and
the arbitration unit selects which of the first voice recognition result information output by the voice recognition unit and the second voice recognition result information received by the communication unit is to be output, using at least information on the length of the utterance duration.

The voice recognition device according to claim 3, wherein, when the utterance duration is longer than a predetermined length, the arbitration unit makes the weight by which a second likelihood indicating the likelihood of the second voice recognition result information is multiplied larger than the weight by which a first likelihood indicating the likelihood of the first voice recognition result information is multiplied.

A voice recognition device comprising:
a voice acquisition unit that acquires first voice information;
a noise removal processing unit that removes noise contained in the first voice information acquired by the voice acquisition unit by using a first removal method, and outputs the voice information from which the noise has been removed as second voice information;
a voice recognition unit that performs voice recognition on the second voice information output by the noise removal processing unit and outputs the voice recognition result as first voice recognition result information;
a communication unit that transmits the first voice information acquired by the voice acquisition unit to a server, and receives from the server, as second voice recognition result information, a voice recognition result obtained in the server by removing the noise contained in the first voice information using a second removal method, which removes a larger amount of noise from the first voice information than the first removal method does, and performing voice recognition on third voice information from which the noise has been removed; and
an arbitration unit that selects which of the first voice recognition result information output by the voice recognition unit and the second voice recognition result information received by the communication unit is to be output,
wherein the communication unit receives the third voice information from the server and outputs the received third voice information to the voice recognition unit,
the voice recognition unit performs voice recognition on the third voice information received by the communication unit and outputs the voice recognition result as fourth voice recognition result information,
the communication unit transmits the second voice information output by the noise removal processing unit to the server, receives from the server, as third voice recognition result information, a voice recognition result obtained by performing voice recognition on the second voice information, and outputs the received third voice recognition result information to the arbitration unit, and
the arbitration unit selects which of the first voice recognition result information output by the voice recognition unit, the second voice recognition result information received by the communication unit, the third voice recognition result information received by the communication unit, and the fourth voice recognition result information output by the voice recognition unit is to be output.

The voice recognition device according to claim 5, wherein:
the voice recognition unit calculates a first likelihood indicating the likelihood of the first voice recognition result information and outputs the calculated first likelihood to the arbitration unit;
the communication unit receives a second likelihood, calculated by the server, indicating the likelihood of the second voice recognition result information and outputs the received second likelihood to the arbitration unit;
the communication unit receives a third likelihood, calculated by the server, indicating the likelihood of the third voice recognition result information and outputs the received third likelihood to the arbitration unit;
the voice recognition unit calculates a fourth likelihood indicating the likelihood of the fourth voice recognition result information and outputs the calculated fourth likelihood to the arbitration unit; and
the arbitration unit selects which of the first, second, third, and fourth voice recognition result information is to be output based on at least one of the first likelihood, the second likelihood, the third likelihood, and the fourth likelihood.

The utterance section detection unit for detecting the utterance section spoken by the user in the first voice information acquired by the voice acquisition unit is further provided.
When the utterance section is not detected by the utterance section detection unit, the noise removal processing unit does not remove the noise included in the first voice information and does not output the second voice information.
The communication unit does not transmit the first voice information to the server.
The voice recognition device according to claim 5 or 6 .

An utterance duration measurement unit that, when the utterance section is detected by the utterance section detection unit, measures an utterance duration, which is the duration of the detected utterance section, is further provided.
When the utterance section is detected by the utterance section detection unit, the noise removal processing unit removes noise included in the first voice information, and
the communication unit transmits the first voice information in the utterance section to the server.
The arbitration unit selects which of the first voice recognition result information, the second voice recognition result information, the third voice recognition result information, and the fourth voice recognition result information is to be output by using at least information regarding the length of the utterance duration.
The voice recognition device according to claim 7.
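
A minimal sketch of the duration measurement and of restricting transmission to the utterance section is given below. It assumes the utterance section is available as a `(start, end)` pair of sample indices and that the sample rate is known; both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def utterance_duration_seconds(section: Tuple[int, int],
                               sample_rate_hz: int) -> float:
    """Duration of the utterance section in seconds."""
    start, end = section
    if end < start or sample_rate_hz <= 0:
        raise ValueError("invalid utterance section or sample rate")
    return (end - start) / sample_rate_hz


def utterance_samples(samples: Sequence[float],
                      section: Tuple[int, int]) -> List[float]:
    """Samples inside the utterance section, i.e. the part of the first voice
    information that would be transmitted to the server."""
    start, end = section
    return list(samples[start:end])
```

For example, a section running from sample 16000 to sample 40000 at a 16 kHz sample rate corresponds to an utterance duration of 1.5 seconds.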

When the utterance duration is longer than a predetermined length, the arbitration unit raises the weights by which the second likelihood, indicating the plausibility of the second voice recognition result information, and the third likelihood, indicating the plausibility of the third voice recognition result information, are multiplied above the weights by which the first likelihood, indicating the plausibility of the first voice recognition result information, and the fourth likelihood, indicating the plausibility of the fourth voice recognition result information, are multiplied.
The voice recognition device according to claim 8.

When the utterance duration is longer than the predetermined length, the arbitration unit raises the weight by which the second likelihood is multiplied above the weight by which the third likelihood is multiplied.
The voice recognition device according to claim 9.
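
The duration-dependent weighting of claims 9 and 10 can be sketched as below. The concrete weight values, the one-second threshold, the equal weights used for short utterances, and the function names are assumptions for illustration; the claims only fix the ordering of the weights for long utterances (second above third, and both above first and fourth).

```python
from typing import Dict


def duration_weights(duration_s: float,
                     long_threshold_s: float = 1.0) -> Dict[str, float]:
    """Weights to multiply each likelihood by, keyed by result label."""
    if duration_s > long_threshold_s:
        # Long utterance: favor the results derived from the audio cleaned
        # with the stronger removal method (second and third), with the
        # second weighted above the third. Values are illustrative only.
        return {"first": 1.0, "second": 1.5, "third": 1.25, "fourth": 1.0}
    # Short utterance: equal weights (an assumption, not stated in the claims).
    return {"first": 1.0, "second": 1.0, "third": 1.0, "fourth": 1.0}


def select_by_weighted_likelihood(likelihoods: Dict[str, float],
                                  duration_s: float,
                                  long_threshold_s: float = 1.0) -> str:
    """Return the label whose weighted likelihood is largest."""
    weights = duration_weights(duration_s, long_threshold_s)
    return max(likelihoods, key=lambda label: likelihoods[label] * weights[label])
```

With likelihoods of 0.70, 0.60, 0.58, and 0.65 for the first to fourth results, this sketch selects the first result for a 0.5-second utterance and the second result for a 2-second utterance.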

A voice acquisition unit that acquires first voice information;
a noise removal processing unit that removes noise contained in the first voice information acquired by the voice acquisition unit by using a first removal method, and outputs the voice information from which the noise has been removed as second voice information;
a communication unit that transmits the first voice information acquired by the voice acquisition unit to a server, and receives from the server third voice information obtained by removing, in the server, noise contained in the first voice information by using a second removal method that removes a larger amount of noise from the first voice information than the first removal method;
a voice recognition unit that performs voice recognition on the second voice information output by the noise removal processing unit and outputs the voice recognition result as first voice recognition result information, and that performs voice recognition on the third voice information received by the communication unit and outputs the voice recognition result as second voice recognition result information;
an arbitration unit that selects which of the first voice recognition result information and the second voice recognition result information output by the voice recognition unit is to be output; and
an utterance section detection unit that detects an utterance section spoken by a user in the first voice information acquired by the voice acquisition unit;
the device comprising the above units, wherein:
When the utterance section is not detected by the utterance section detection unit, the noise removal processing unit does not remove the noise included in the first voice information and does not output the second voice information.
The communication unit does not transmit the first voice information to the server,
The voice recognition unit calculates a first likelihood indicating the plausibility of the first voice recognition result information, and outputs the calculated first likelihood to the arbitration unit,
The voice recognition unit calculates a second likelihood indicating the plausibility of the second voice recognition result information, and outputs the calculated second likelihood to the arbitration unit,
The arbitration unit selects which of the first voice recognition result information and the second voice recognition result information is to be output based on at least one of the first likelihood and the second likelihood,
Voice recognition device.

A voice recognition method in a server that includes a communication unit, a noise removal processing unit, a voice recognition unit, an arbitration unit, and an utterance section detection unit, and that performs voice recognition on voice information acquired by a terminal, wherein:
The communication unit receives first voice information acquired by the terminal.
The noise removal processing unit removes the noise contained in the received first voice information by using the first removal method, and outputs the voice information from which the noise has been removed as the second voice information.
The voice recognition unit performs voice recognition on the second voice information and outputs the voice recognition result as the first voice recognition result information.
The communication unit receives from the terminal, as second voice recognition result information, a voice recognition result obtained in the terminal by removing noise contained in the first voice information by using a second removal method, which removes a smaller amount of noise from the first voice information than the first removal method, and by performing voice recognition on the third voice information from which the noise has been removed.
The arbitration unit selects which of the first voice recognition result information and the second voice recognition result information is output.
The utterance section detection unit detects the utterance section spoken by the user in the first voice information received by the communication unit.
When the utterance section is not detected by the utterance section detection unit, the noise removal processing unit does not remove the noise included in the first voice information and does not output the second voice information.
The voice recognition unit calculates a first likelihood indicating the plausibility of the first voice recognition result information, and outputs the calculated first likelihood to the arbitration unit.
The communication unit receives a second likelihood indicating the plausibility of the second voice recognition result information calculated by the terminal, and outputs the received second likelihood to the arbitration unit.
The arbitration unit selects which of the first voice recognition result information and the second voice recognition result information is to be output based on at least one of the first likelihood and the second likelihood.
Voice recognition method.

A voice acquisition unit that acquires first voice information;
a noise removal processing unit that removes noise contained in the first voice information acquired by the voice acquisition unit by using a first removal method, and outputs the voice information from which the noise has been removed as second voice information;
a communication unit that transmits the first voice information acquired by the voice acquisition unit to a server, and receives from the server third voice information obtained by removing, in the server, noise contained in the first voice information by using a second removal method that removes a larger amount of noise from the first voice information than the first removal method;
a voice recognition unit that performs voice recognition on the second voice information output by the noise removal processing unit and outputs the voice recognition result as first voice recognition result information, and that performs voice recognition on the third voice information received by the communication unit and outputs the voice recognition result as second voice recognition result information;
an arbitration unit that selects which of the first voice recognition result information and the second voice recognition result information output by the voice recognition unit is to be output;
an utterance section detection unit that detects an utterance section spoken by a user in the first voice information acquired by the voice acquisition unit; and
an utterance duration measurement unit that, when the utterance section is detected by the utterance section detection unit, measures an utterance duration, which is the duration of the detected utterance section;
the device comprising the above units, wherein:
When the utterance section is not detected by the utterance section detection unit, the noise removal processing unit does not remove the noise included in the first voice information and does not output the second voice information.
The communication unit does not transmit the first voice information to the server,
When the utterance section is detected by the utterance section detection unit, the noise removal processing unit removes noise included in the first voice information.
The communication unit transmits the first voice information in the utterance section to the server.
The arbitration unit selects which of the first voice recognition result information and the second voice recognition result information output by the voice recognition unit is to be output by using at least information regarding the length of the utterance duration,
Voice recognition device.

A voice recognition method in a server that includes a communication unit, a noise removal processing unit, a voice recognition unit, an arbitration unit, an utterance section detection unit, and an utterance duration measurement unit, and that performs voice recognition on voice information acquired by a terminal, wherein:
The communication unit receives first voice information acquired by the terminal.
The noise removal processing unit removes the noise contained in the received first voice information by using the first removal method, and outputs the voice information from which the noise has been removed as the second voice information.
The voice recognition unit performs voice recognition on the second voice information and outputs the voice recognition result as the first voice recognition result information.
The communication unit receives from the terminal, as second voice recognition result information, a voice recognition result obtained in the terminal by removing noise contained in the first voice information by using a second removal method, which removes a smaller amount of noise from the first voice information than the first removal method, and by performing voice recognition on the third voice information from which the noise has been removed.
The arbitration unit selects which of the first voice recognition result information and the second voice recognition result information is output.
The utterance section detection unit detects the utterance section spoken by the user in the first voice information received by the communication unit.
When the utterance section is detected by the utterance section detection unit, the utterance duration measurement unit measures the utterance duration, which is the duration of the utterance section detected by the utterance section detection unit.
When the utterance section is not detected by the utterance section detection unit, the noise removal processing unit does not remove the noise included in the first voice information and does not output the second voice information.
When the utterance section is detected by the utterance section detection unit, the noise removal processing unit removes the noise included in the first voice information.
The arbitration unit selects which of the first voice recognition result information and the second voice recognition result information is to be output by using at least information regarding the length of the utterance duration.
Voice recognition method.

A voice acquisition unit that acquires first voice information;
a noise removal processing unit that removes noise contained in the first voice information acquired by the voice acquisition unit by using a first removal method, and outputs the voice information from which the noise has been removed as second voice information;
a communication unit that transmits the first voice information acquired by the voice acquisition unit to a server, and receives from the server third voice information obtained by removing, in the server, noise contained in the first voice information by using a second removal method that removes a larger amount of noise from the first voice information than the first removal method;
a voice recognition unit that performs voice recognition on the second voice information output by the noise removal processing unit and outputs the voice recognition result as first voice recognition result information, and that performs voice recognition on the third voice information received by the communication unit and outputs the voice recognition result as second voice recognition result information; and
an arbitration unit that selects which of the first voice recognition result information and the second voice recognition result information output by the voice recognition unit is to be output;
the device comprising the above units, wherein:
The communication unit receives from the server, as third voice recognition result information, the result of voice recognition performed on the third voice information in the server, and outputs the received third voice recognition result information to the arbitration unit,
The communication unit transmits the second voice information output by the noise removal processing unit to the server, receives from the server, as fourth voice recognition result information, the result of voice recognition performed on the second voice information, and outputs the received fourth voice recognition result information to the arbitration unit,
The arbitration unit selects which of the first voice recognition result information output by the voice recognition unit, the second voice recognition result information output by the voice recognition unit, the third voice recognition result information received by the communication unit, and the fourth voice recognition result information received by the communication unit is to be output.
Voice recognition device.
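
The four recognition paths recited in this claim can be sketched from the device's point of view as follows. The four callables stand in for local processing and for server round trips and are hypothetical; in particular, the claim has the server return the third voice information and the third and fourth recognition results over the communication unit, which this sketch collapses into plain function calls.

```python
from typing import Callable, Dict, List

Audio = List[float]
Denoiser = Callable[[Audio], Audio]
Recognizer = Callable[[Audio], str]


def collect_candidates(first_voice: Audio,
                       local_denoise: Denoiser,
                       server_denoise: Denoiser,
                       local_recognize: Recognizer,
                       server_recognize: Recognizer) -> Dict[str, str]:
    """Produce the four recognition results that the arbitration unit chooses from."""
    # First removal method, run on the device.
    second_voice = local_denoise(first_voice)
    # Second removal method (removes more noise), run on the server.
    third_voice = server_denoise(first_voice)
    return {
        "first": local_recognize(second_voice),    # device noise removal, device recognition
        "second": local_recognize(third_voice),    # server noise removal, device recognition
        "third": server_recognize(third_voice),    # server noise removal, server recognition
        "fourth": server_recognize(second_voice),  # device noise removal, server recognition
    }
```

Keeping the two noise removal stages and the two recognizers as separate arguments makes the two-by-two combination explicit: every result is identified by where its audio was cleaned and where it was recognized.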

A voice recognition method in a server that includes a communication unit, a noise removal processing unit, a voice recognition unit, and an arbitration unit, and that performs voice recognition on voice information acquired by a terminal, wherein:
The communication unit receives first voice information acquired by the terminal.
The noise removal processing unit removes the noise contained in the received first voice information by using the first removal method, and outputs the voice information from which the noise has been removed as the second voice information.
The voice recognition unit performs voice recognition on the second voice information and outputs the voice recognition result as the first voice recognition result information.
The communication unit receives from the terminal, as second voice recognition result information, a voice recognition result obtained in the terminal by removing noise contained in the first voice information by using a second removal method, which removes a smaller amount of noise from the first voice information than the first removal method, and by performing voice recognition on the third voice information from which the noise has been removed.
The arbitration unit selects which of the first voice recognition result information and the second voice recognition result information is output.
The communication unit transmits the second voice information to the terminal,
The communication unit receives from the terminal, as third voice recognition result information, a voice recognition result obtained in the terminal by removing noise contained in the second voice information by using the second removal method and by performing voice recognition on the fourth voice information from which the noise has been removed.
The communication unit receives the third voice information from the terminal and outputs the received third voice information to the voice recognition unit.
The voice recognition unit performs voice recognition on the third voice information received by the communication unit, and outputs the voice recognition result as the fourth voice recognition result information.
The arbitration unit selects which of the first voice recognition result information output by the voice recognition unit, the second voice recognition result information received by the communication unit, the third voice recognition result information received by the communication unit, and the fourth voice recognition result information output by the voice recognition unit is to be output.
Voice recognition method.