ITTO980383A1

ITTO980383A1 - PROCEDURE AND VOICE RECOGNITION DEVICE WITH DOUBLE STEP OF NEURAL AND MARKOVIAN RECOGNITION.

Info

Publication number: ITTO980383A1
Application number: IT98TO000383A
Authority: IT
Inventors: Roberto Gemello; Luciano Fissore
Original assignee: Cselt Centro Studi Lab Telecom
Priority date: 1998-05-07
Filing date: 1998-05-07
Publication date: 1999-11-07
Also published as: EP0955628A3; DE69938374T2; JP3078279B2; JP2000029495A; US6185528B1; CA2270326C; DE69938374D1; EP0955628A2; EP0955628B1; CA2270326A1

Description

Description of the invention entitled:

"METHOD AND DEVICE FOR SPEECH RECOGNITION WITH A DOUBLE RECOGNITION STEP, NEURAL AND MARKOVIAN"

The present invention relates to automatic speech recognition systems. In particular, it concerns a method and a device for recognizing isolated words in large vocabularies, in which the words are represented by composing acoustic-phonetic units of the language, and in which recognition is carried out in two sequential steps that use neural network techniques and Markov model techniques respectively, the results of the two techniques being suitably combined to improve recognition accuracy.

Neural networks are a parallel processing structure that reproduces the organization of the cerebral cortex in a very simplified form. A neural network is made up of numerous processing units, called neurons, strongly interconnected through links of varying strength called synapses or interconnection weights. Neurons are generally arranged in a layered structure, with an input level, one or more intermediate levels and an output level. Starting from the input units, to which the signal to be processed is supplied, processing propagates to the successive levels of the network up to the output units, which provide the result. Various implementations of neural networks are described, for example, in D. Rumelhart's book "Parallel Distributed Processing", vol. 1 Foundations, MIT Press, Cambridge, Mass., 1986.

Neural network technology is applicable in many fields, and in particular in speech recognition, where the neural network is used to estimate the probability P(Q | X) of a phonetic unit Q given the parametric representation X of a portion of the input speech signal. The words to be recognized are represented as concatenations of phonetic units, and a dynamic programming algorithm is used to find the word that has the highest probability of being the one actually spoken.

Hidden Markov models are a classic technology for speech recognition. Such a model consists of a number of states connected by the possible transitions. Each transition is associated with a probability of going from the source state to the destination state. Furthermore, each state can emit symbols from a finite alphabet according to a given probability distribution. When used for speech recognition, each model represents an acoustic-phonetic unit by means of a left-to-right automaton in which one can either remain in each state through a cyclic transition or move on to the next state. Each state is also associated with a probability density defined on X, where X represents a vector of parameters extracted from the speech signal every 10 ms. The symbols emitted, according to the probability density associated with the state, are therefore the infinitely many possible parameter vectors X. This probability density is given by a mixture of Gaussians in the multidimensional space of the input vectors.
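As an illustration of the emission density described above, the following is a minimal sketch of how the log-density of a Gaussian mixture could be evaluated for one parameter vector X. The function names and the use of diagonal covariances are assumptions made for this example; the text only states that the density is a mixture of Gaussians.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density at x of a Gaussian with diagonal covariance (assumed here)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture_density(x, weights, means, variances):
    """Log of sum_k w_k * N(x; mean_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # factor out the largest term for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Working in the log domain avoids underflow when the per-frame densities are later accumulated over many 10 ms frames.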

With hidden Markov models, too, the words to be recognized are represented as concatenations of phonetic units, and a dynamic programming algorithm (the Viterbi algorithm) is used to find the word generated with the highest probability, given the input speech signal.
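The dynamic programming search mentioned above can be sketched as follows for a single left-to-right chain of states, each with a cyclic transition and a transition to the next state. This is a simplified illustration, not the recognizer of the invention: the function name, the argument layout and the requirement that the path start in the first state and end in the last one are assumptions of this example.

```python
NEG_INF = float("-inf")


def viterbi_left_right(log_emis, log_stay, log_next):
    """Best-path log score through a left-to-right chain of states.

    log_emis[t][s]: log emission score of frame t in state s
    log_stay[s]:    log probability of the cyclic transition on state s
    log_next[s]:    log probability of moving from state s to state s + 1
    The path is assumed to start in state 0 and end in the last state.
    """
    n_states = len(log_stay)
    score = [NEG_INF] * n_states
    score[0] = log_emis[0][0]
    for t in range(1, len(log_emis)):
        new = [NEG_INF] * n_states
        for s in range(n_states):
            stay = score[s] + log_stay[s]
            move = score[s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new[s] = best + log_emis[t][s]
        score = new
    return score[-1]
```

A word model would be obtained by concatenating the chains of its phonetic units; the recognized word is then the one whose chain yields the highest score.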

More details on this recognition technique can be found, for example, in L. Rabiner, B-H. Juang: "Fundamentals of Speech Recognition", Prentice Hall, Englewood Cliffs, New Jersey (USA).

The method of the present invention uses both neural network and Markov model techniques, by means of a double recognition step and a recombination of the results obtained with the two techniques.

A recognition system in which the scores of different recognizers are recombined to improve recognition accuracy is described in the paper "Speech recognition using segmental neural nets" by S. Austin, G. Zavaliagkos, J. Makhoul and R. Schwartz, presented at the ICASSP '92 conference, San Francisco, March 23-26, 1992.

This known system performs a first recognition using hidden Markov models, providing a list of the N best recognition hypotheses (e.g. 20), that is, the N sentences most likely to be the one actually spoken, each with a respective likelihood score. The Markov recognition stage also performs a phonetic segmentation of each hypothesis and passes the segmentation result to a second recognition stage based on a neural network. This second stage performs recognition starting from the phonetic segments supplied by the first, Markovian step and in turn provides a list of hypotheses, each associated with a likelihood score according to the neural recognition technique. The two scores are then combined linearly to form a single list, and the best hypothesis resulting from the combination is chosen as the recognized sentence.

A system of this type has some drawbacks. A first drawback stems from performing the second-stage recognition starting from the phonetic segments supplied by the first stage: if the segmentation contains timing errors, the second stage will in turn make recognition errors, which then propagate to the final list. Furthermore, the system is not well suited to recognizing isolated words within large vocabularies, because its first stage is the Markov recognizer, which under those particular conditions is somewhat less efficient than the neural one in terms of computational burden. Moreover, since the hypotheses provided by a Markov recognizer and by a neural network recognizer have markedly different score dynamics, a simple linear combination of the scores may give results that are not meaningful. Finally, the known system gives no indication of the reliability of the recognition performed.

Having this information available is, however, a very important feature when recognizing isolated words: as a general practice, such systems ask the user to confirm the spoken word, which lengthens the procedure. With reliability information available, the system can ask for confirmation only when the reliability of the recognition falls below a certain threshold, making the procedure faster, with benefits for both the user and the system operator.

The object of the invention is to provide a recognition method and device of the above type that are specifically designed for recognizing isolated words within large vocabularies and that make it possible to improve recognition accuracy and also to obtain an estimate of recognition reliability.

More specifically, the method according to the invention is characterized in that the two recognition steps operate in sequence on the same utterance to be recognized, in such a way that the neural step examines the entire active vocabulary and the Markov step examines only a partial vocabulary represented by the list of hypotheses provided as the result of the neural step. It is further characterized in that the reliability of the recognition is also evaluated for the best hypothesis of the reordered list, on the basis of the scores resulting from the combination and associated with that best hypothesis and with one or more hypotheses occupying subsequent positions in the reordered list, thereby generating a reliability index that can take at least two values, corresponding respectively to certain recognition and uncertain recognition.

A recognizer for carrying out the method is characterized in that the neural network recognition unit is placed upstream of the recognition unit based on hidden Markov models and is arranged to perform its recognition by operating on the entire active vocabulary, while the recognition unit based on hidden Markov models is arranged to perform its recognition independently of that performed by the neural network recognition unit, operating on a partial vocabulary consisting of the hypotheses contained in the list supplied by the latter. It is further characterized in that the processing unit comprises means for evaluating the reliability of the recognition for the hypothesis that has the best likelihood score in the reordered list of hypotheses, using the combined scores associated with the hypotheses contained in the reordered list, said evaluation means being arranged to supply a reliability index that can take at least two values, corresponding respectively to certain or uncertain recognition for that hypothesis.

For further clarification, reference is made to the attached drawings, in which:

- fig. 1 is a block diagram of a recognition system according to the invention,

- fig. 2 is a flow chart of the recognition method according to the invention,

- fig. 3 is a flow chart of the score combination operations, and

- fig. 4 is a flow chart of the operations for computing the recognition reliability.

The following description is given as a non-limiting example, assuming that the invention is used for the recognition of isolated words.

Fig. 1 shows that the recognition system according to the invention comprises two recognizers NE, MA, operating in two successive and independent recognition steps on the speech signal arriving on line 1. As is usual in the art, the signal present on line 1 will be a suitable parametric representation (e.g. a cepstral representation) of a word spoken by the speaker, obtained in processing devices not shown and organized in frames lasting, e.g., 10 - 15 ms.

The recognizer NE, which operates in the first step, is based on neural network technology and performs recognition using the entire active vocabulary. NE supplies on an output 2 a list of the M(nn) words that constitute the best recognition hypotheses according to the specific type of neural network, each associated with a respective acoustic likelihood score nn_i.

Output 2 of NE is also connected to the second recognizer MA, which also receives the signal present on connection 1 and performs recognition based on the hidden Markov model technique, while restricting the choice of possible recognition hypotheses to the vocabulary represented by the M(nn) words identified by recognizer NE. MA in turn supplies on an output 3 a list of M(hmm) words that constitute the best recognition hypotheses according to the Markov model, each associated with a respective acoustic likelihood score hmm_j.

In an entirely conventional manner, the two lists are output as ordered lists. Note that in the most general case they may have different lengths, although, given the way MA operates, the M(hmm) words supplied by MA will be a subset of the M(nn) words supplied by NE.
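The data flow between the two recognizers can be sketched as follows. This is only an illustration of the two-step arrangement: the names `two_pass_recognize`, `neural_score`, `hmm_score`, `m_nn` and `m_hmm` are hypothetical, and the two scoring callables stand in for the actual recognizers NE and MA, which would search the vocabulary with dynamic programming rather than score each word separately.

```python
def two_pass_recognize(frames, vocabulary, neural_score, hmm_score,
                       m_nn=10, m_hmm=None):
    """Return the two ordered hypothesis lists (best first).

    neural_score(frames, word) and hmm_score(frames, word) are placeholder
    callables returning an acoustic likelihood score (higher is better).
    """
    # Step 1 (NE): examine the entire active vocabulary, keep the M(nn) best.
    nn_list = sorted(
        ((word, neural_score(frames, word)) for word in vocabulary),
        key=lambda pair: pair[1],
        reverse=True,
    )[:m_nn]

    # Step 2 (MA): same input frames, vocabulary restricted to the NE list.
    shortlist = [word for word, _ in nn_list]
    hmm_list = sorted(
        ((word, hmm_score(frames, word)) for word in shortlist),
        key=lambda pair: pair[1],
        reverse=True,
    )[:m_hmm]  # m_hmm=None keeps every shortlisted word
    return nn_list, hmm_list
```

Because the second step only ever sees words from the first list, the words in the second list are by construction a subset of those in the first.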

Outputs 2, 3 of the two recognizers NE, MA are connected to a score processing device EL, which has to perform two kinds of operation:

1) process the scores of the words present in the two lists, by normalizing the scores of each word and combining the normalized scores, and, at the end of the processing, supply on a first output 4 of the device a new list reordered according to the combined scores;

2) if both recognizers NE, MA have identified the same word as the best recognition hypothesis, compute and emit on a second output 5 a reliability index for that word (which will obviously be the best hypothesis in the combined list), by checking that certain conditions on the likelihood scores within the combined list are satisfied.

In view of this dual function, three functional blocks UE1, CM, UE2 are shown in the figure inside the score processing device EL. UE1 is a processing unit whose task is to normalize the scores of the two lists supplied by NE and MA, combine the normalized scores and generate the list reordered according to the combined scores, which is emitted on a first output 4 of the recognizer. CM is a comparison unit whose task is to check whether the best recognized word is the same in the two lists and, if so, to enable unit UE2. UE2 in turn is a processing unit whose task is to check whether the desired conditions on the combined scores are satisfied and, accordingly, to emit the reliability index on a second output 5 of the recognizer. In the embodiment described here it will be assumed that this index can take two values, corresponding respectively to "certain recognition" and "uncertain recognition".

The way in which units UE1, UE2 carry out the operations indicated above will be described in greater detail below.

The solution adopted, with the neural recognition unit NE placed upstream of the Markov recognition unit MA, improves overall efficiency. Neural network technology allows higher recognition speeds on large vocabularies, whereas Markov technology performs better on more limited vocabularies. By using the Markov recognizer MA in the second step, where only the vocabulary corresponding to the M(nn) best hypotheses obtained with the neural recognizer NE is used, the overall recognition time can be reduced.

The speed advantages offered by neural networks are obtained in particular if the neural recognizer NE is of the type in which the results of the processing are propagated incrementally (i.e. NE comprises a multi-level network in which only the significant differences between the activation values of the neurons at successive instants are propagated from one level to the next higher one), as described, e.g., in European patent application EP-A 0 733 982 in the name of the same Applicant. There are no particular requirements for the Markov recognizer MA, which can be of any of the types known in the art.

Note that fig. 1 is a purely functional diagram, and therefore blocks UE1, CM, UE2 will generally correspond to different parts of a program stored in the processing device EL. Since the individual recognizers NE, MA are themselves implemented on suitably programmed processing devices, it is clear that one and the same processing device can perform the tasks of more than one of the blocks shown.

The entire recognition process carried out by the device of fig. 1 is also shown as a flow chart in fig. 2. Given the preceding description, no further explanation is needed.

Turning now to the processing of the scores of the hypotheses in the two lists supplied by NE and MA, the first step performed by UE1 is to compute the mean μ(nn), μ(hmm) and the variance σ(nn), σ(hmm) of the scores for each of the two lists according to the well-known formulas:

where M(hmm), M(nn), nn_i, hmm_j have the meanings already given.
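The formulas referred to above are not reproduced in this copy of the text. A plausible reconstruction, given here only as an assumption, uses the usual sample mean and the root of the mean squared deviation; the 1/M divisor and the square root are assumed, the latter because the subsequent normalization to unit variance requires dividing by a standard deviation even though the text calls σ the variance:

```latex
\mu(nn) = \frac{1}{M(nn)} \sum_{i=1}^{M(nn)} nn_i ,
\qquad
\sigma(nn) = \sqrt{\frac{1}{M(nn)} \sum_{i=1}^{M(nn)} \bigl( nn_i - \mu(nn) \bigr)^2 }

\mu(hmm) = \frac{1}{M(hmm)} \sum_{j=1}^{M(hmm)} hmm_j ,
\qquad
\sigma(hmm) = \sqrt{\frac{1}{M(hmm)} \sum_{j=1}^{M(hmm)} \bigl( hmm_j - \mu(hmm) \bigr)^2 }
```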

The scores are then normalized with respect to the mean and variance, so as to obtain two lists NN_i, HMM_j of scores with zero mean and unit variance. For this purpose UE1 performs the operations represented by the following relations:

$$NN_i = \frac{nn_i - \mu(nn)}{\sigma(nn)}, \qquad HMM_j = \frac{hmm_j - \mu(hmm)}{\sigma(hmm)}$$

UE1 computes the mean and variance of the scores (and performs the normalization) for a list only if the number of words in that list is not lower than a certain threshold M. In the preferred embodiment M = 3, i.e. the minimum value for which the mean and variance can be computed. If the number of words in a list is below the threshold M, UE1 uses preset score values instead of the scores supplied by the respective recognizer. This is in turn a kind of normalization. In the tests carried out, a score of 3.0 was assigned in the case of a single hypothesis, and scores of 2.0 and 1.0 in the case of only two hypotheses. The recognizer nevertheless proved to be rather insensitive to the values of these parameters, so any value that corresponds to a good likelihood can be used.
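The normalization performed by UE1, including the fallback to preset scores for short lists, could be sketched as follows. The function name, the list-of-pairs layout and the handling of identical scores are assumptions of this example; the threshold of 3 and the preset values 3.0, 2.0 and 1.0 are those quoted in the text.

```python
import statistics

MIN_WORDS = 3  # threshold M of the preferred embodiment
PRESET_SCORES = {1: [3.0], 2: [2.0, 1.0]}  # values used in the reported tests


def normalize_scores(hypotheses):
    """Normalize a best-first list of (word, score) pairs.

    Lists with at least MIN_WORDS entries are scaled to zero mean and unit
    variance; shorter lists receive the preset scores instead.
    """
    count = len(hypotheses)
    if count == 0:
        return []
    if count < MIN_WORDS:
        presets = PRESET_SCORES[count]
        return [(word, preset) for (word, _), preset in zip(hypotheses, presets)]
    raw = [score for _, score in hypotheses]
    mean = statistics.fmean(raw)
    std = statistics.pstdev(raw)  # assumed divisor: number of words in the list
    if std == 0.0:
        # Assumption of this sketch: identical scores all map to zero.
        return [(word, 0.0) for word, _ in hypotheses]
    return [(word, (score - mean) / std) for word, score in hypotheses]
```

Bringing both lists to the same scale is what makes the later linear combination meaningful despite the very different score dynamics of the two recognizers.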

Finally, the scores associated with the same word in the two lists are actually combined, to generate the final list of candidate words, which is then reordered according to the combined score. The combination is linear: in the new list each word IPx has a combined score Sx given by the weighted sum of its two normalized scores, where α and β are the weights attributed to the two recognizers.

Preferably, the two weights (stored inside unit UE1) satisfy the relation β = 1 - α, with α = 0.5 if the two recognizers have substantially similar performance. If their performance differs somewhat, a suitable range of values for α and β is 0.4 to 0.6.

Clearly, the scores are not combined for words present in only one list. These words (which generally belong to the list supplied by the neural network, for the reasons given above) can either be discarded or be assigned a minimum score, so that they are placed in the final list after the words whose scores were combined.
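
The combination and the two options for words found in only one list can be sketched as follows. This is an illustrative sketch only: the function name `combine_lists`, the representation of each list as a dictionary from word to normalized score, the assignment of α to the neural scores and β to the Markov scores, and the margin of 1.0 used to place unmatched words below every combined score are all assumptions. Higher scores are taken to mean better hypotheses.

```python
def combine_lists(nn, hmm, alpha=0.5, keep_unmatched=True):
    """Linearly combine two {word: normalized score} lists and rank the result.

    Returns (word, combined score) pairs sorted from best to worst.
    """
    beta = 1.0 - alpha  # the weights sum to one, as in the preferred case
    # Assumption: alpha weights the neural score and beta the Markov score.
    combined = {w: alpha * nn[w] + beta * hmm[w] for w in nn if w in hmm}
    if keep_unmatched:
        # Assumption: unmatched words get a floor 1.0 below the lowest
        # combined score, so they always follow the combined words.
        floor = min(combined.values(), default=0.0) - 1.0
        for w in list(nn) + list(hmm):
            combined.setdefault(w, floor)
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

With `keep_unmatched=False` the words present in a single list are discarded; with `keep_unmatched=True` they are appended after all combined words, in the order in which they appeared in the input lists.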

The normalization, which yields lists with zero mean and unit variance, removes the effects of the different score dynamics of the two recognizers and improves recognition accuracy.

The processing procedure is also shown in the flow chart of Fig. 3. Given the preceding description, this chart needs no further explanation.

Once UE1 has obtained the combined scores and prepared the reordered list, block UE2 can determine the reliability of the recognition of the first word in that list. As stated above, the operations of UE2 are enabled by comparator CM if it finds that the same word occupies the first position in the lists supplied by NE and MA, that is, IP1(NN) = IP1(HMM). To determine the reliability, UE2 evaluates the score associated with the best word and the score differences between that word and some of the subsequent words in the list. In particular, for the recognition to be considered "certain", the following conditions must be met in addition to the condition on the identity of the best word in the two lists:

1) the combined score S1 of the first word in the reordered list must be higher than a first threshold T1;

2) the differences between the combined score S1 associated with the first word in the reordered list and the scores S2, S5 associated with the second and fifth words must be higher than a second and a third threshold T2, T3, respectively.

The differences S1 - S2 and S1 - S5 are computed and compared with their respective thresholds only if a sufficient number of hypotheses is present; otherwise condition 2) is considered automatically satisfied.
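
The reliability test can be sketched as follows. This is a hedged illustration, not the logic of blocks CM and UE2 as built: the function name `is_certain` and its parameters are hypothetical, the thresholds have no defaults because they depend on the application, and "a sufficient number of hypotheses" is interpreted here as "the list is long enough to contain the word needed for that difference", which is an assumption.

```python
def is_certain(best_nn, best_hmm, ranked_scores, t1, t2, t3, far_rank=5):
    """Return True if the top hypothesis is recognized as "certain".

    `ranked_scores` holds the combined scores sorted best first;
    `best_nn` and `best_hmm` are the top words of the two original lists.
    """
    # Preliminary condition (comparator CM): both recognizers agree on the best word.
    if not ranked_scores or best_nn != best_hmm:
        return False
    s1 = ranked_scores[0]
    # Condition 1: S1 must exceed the first threshold T1.
    if s1 <= t1:
        return False
    # Condition 2, first part: S1 - S2 must exceed T2, if a second word exists.
    if len(ranked_scores) >= 2 and s1 - ranked_scores[1] <= t2:
        return False
    # Condition 2, second part: S1 - S5 must exceed T3, if a fifth word exists.
    if len(ranked_scores) >= far_rank and s1 - ranked_scores[far_rank - 1] <= t3:
        return False
    return True
```

For instance, with the purely illustrative thresholds `t1=2.0`, `t2=1.0`, `t3=2.0`, the call `is_certain("roma", "roma", [2.5, 1.0, 0.5, 0.2, -0.5], 2.0, 1.0, 2.0)` passes all checks, while the same call with two different best words fails at the first check.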

The threshold values are set according to the application in which the recognizer is used. For example, the following values were adopted in the experiments carried out:

It is easy to see how the conditions above (which, besides requiring that the two lists agree on the best recognition hypothesis, also require a sufficient score gap between the best hypothesis and the subsequent ones in the list) make it possible to actually evaluate the certainty of the recognition.

The operations for evaluating the reliability of the recognition are also shown as a flow chart in Fig. 4. Note that in this chart the agreement of the best word in the two lists is shown as a condition to be verified together with the other conditions, rather than as a preliminary condition for verifying them; this is clearly only an implementation detail of the same principle. Otherwise, this chart also needs no further explanation.

It is clear that what has been described is given only by way of non-limiting example and that variations and modifications are possible without departing from the scope of the invention. For example, for the evaluation of reliability, one could verify only that the score of the best word is sufficiently higher than that of the second word, omitting the comparison with a further word (which need not be the fifth, but could be another word sufficiently distant from the second). To verify the reliability of the recognition, the given conditions could be combined differently, or further conditions added, so as to introduce intermediate grades between "certain" and "uncertain": for example, an intermediate grade could correspond to the conditions being satisfied for thresholds T1 and T2 but not for T3. Finally, although the description refers to the recognition of isolated words, the recognizer could also be used for continuous speech.
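
The variant with an intermediate grade can be sketched as follows. This is only one possible reading of the variant: the function name `reliability_grade`, the grade labels returned as strings and the rule that a missing second or fifth word counts as a satisfied condition are assumptions, and the thresholds must be supplied by the caller.

```python
def reliability_grade(best_nn, best_hmm, ranked_scores, t1, t2, t3, far_rank=5):
    """Grade the top hypothesis as "certain", "intermediate" or "uncertain".

    `ranked_scores` holds the combined scores sorted best first.
    """
    if not ranked_scores or best_nn != best_hmm:
        return "uncertain"
    s1 = ranked_scores[0]
    ok1 = s1 > t1
    # Assumption: a difference that cannot be computed counts as satisfied.
    ok2 = len(ranked_scores) < 2 or s1 - ranked_scores[1] > t2
    ok3 = len(ranked_scores) < far_rank or s1 - ranked_scores[far_rank - 1] > t3
    if ok1 and ok2 and ok3:
        return "certain"
    if ok1 and ok2:
        # T1 and T2 are satisfied, T3 is not.
        return "intermediate"
    return "uncertain"
```

An application could, for example, accept a "certain" result directly, ask the speaker for confirmation on "intermediate", and request a repetition on "uncertain"; this usage is a suggestion and is not stated in the description.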

Claims

1. Procedure for speech recognition, in which: two recognition steps (NE, MA) are carried out, one based on the use of neural networks and the other on the use of hidden Markov models, providing respective lists of recognition hypotheses in which each hypothesis is associated with a respective acoustic likelihood score; the likelihood scores of each list are processed; and a single reordered list is provided on the basis of the processed scores, characterized by the fact that the two recognition steps (NE, MA) operate in sequence on the same utterance to be recognized, in such a way that the neural step (NE) examines the entire active vocabulary and the Markovian step (MA) examines only a partial vocabulary represented by the list of hypotheses provided as a result of the neural step (NE), and by the fact that the reliability of the recognition is also evaluated for the best hypothesis of the reordered list, on the basis of the scores resulting from the combination and associated with this best hypothesis and with one or more hypotheses that occupy subsequent positions in the reordered list, generating a reliability index that can assume at least two values corresponding respectively to certain recognition and uncertain recognition.

2. Procedure according to claim 1, characterized by the fact that the processing of the likelihood scores includes the following operations: - calculation of the mean and variance of the scores associated with the hypotheses in each of the lists, - normalization of the scores associated with the hypotheses in each of the lists with respect to the mean and variance, so as to transform these lists into lists in which the scores have zero mean and unit variance, - linear combination of the normalized scores associated with recognition hypotheses present in both lists.

3. Procedure according to claim 2, characterized by the fact that the calculation of the mean and variance and the normalization of the scores of a list are carried out only if the list includes a number of hypotheses not less than a minimum.

4. Procedure according to claim 3, characterized by the fact that for a list comprising a number of hypotheses lower than said minimum, predetermined values are assigned to the scores of the hypotheses contained therein.

5. Procedure according to any one of claims 1 to 4, characterized in that for said linear combination the scores of the hypotheses present in the two lists are weighted with weights whose sum is one.

6. Procedure according to any one of claims 1 to 5, characterized in that for the creation of said single list the hypotheses present in only one list are discarded.

7. Procedure according to any one of claims 1 to 5, characterized in that for the creation of said single list the hypotheses present in only one list are assigned a minimum score, lower than the lowest combined score of a hypothesis present in both lists.

8. Procedure according to any one of the preceding claims, characterized in that said evaluation of the reliability of the recognition for the best recognition hypothesis in the reordered list is carried out if this hypothesis was the best in both lists, and includes the operations of: - comparing the combined score associated with said best hypothesis with a first threshold, - calculating a first score difference, given by the difference between the combined score associated with said best hypothesis and that associated with the hypothesis with the immediately lower score, and - comparing said first difference with a second threshold; and by the fact that the reliability index is assigned the value corresponding to certain recognition if said combined score and said first difference are both higher than the respective thresholds.

9. Procedure according to claim 8, characterized by the fact that said evaluation of the reliability of the recognition also includes the operations of: - calculating a second score difference, given by the difference between the combined score associated with said best hypothesis and that associated with a further hypothesis that occupies a position spaced by a predetermined number of positions in the reordered list, and - comparing said second difference with a third threshold, and by the fact that the value corresponding to certain recognition is assigned to the reliability index if said second difference is also higher than the respective threshold.

10. Procedure according to claim 8 or 9, characterized by the fact that the calculation of said differences is carried out only in the presence of lists including a number of hypotheses not less than a minimum.

11. Procedure according to claim 10, characterized by the fact that in the presence of lists comprising a number of hypotheses lower than said minimum, the condition of exceeding the second and third thresholds is considered satisfied.

12. Speech recognizer, comprising: - a pair of recognition units (NN, MA) connected in cascade, which respectively use a recognition technique based on neural networks and a recognition technique based on hidden Markov models, providing respective lists of recognition hypotheses in which each hypothesis is associated with a respective acoustic likelihood score; and - a processing unit (EL), including means (UE1) to perform a combination of the likelihood scores determined by the two recognition units (NN, MA) and provide a reordered list based on the combined scores, characterized by the fact that the neural network recognition unit (NN) is arranged upstream of the recognition unit (MA) based on hidden Markov models and is able to carry out its recognition by operating on the entire active vocabulary, and the recognition unit (MA) based on hidden Markov models is able to carry out its recognition independently of that carried out by the neural network recognition unit (NN), by operating on a partial vocabulary consisting of the hypotheses contained in the list provided by the latter; and by the fact that the processing unit (EL) includes means (CM, UE2) to evaluate the reliability of the recognition for the hypothesis that has the best likelihood score in the reordered list of hypotheses, using the combined scores associated with the hypotheses contained in the reordered list, said evaluation means (CM, UE2) being able to provide a reliability index which can assume at least two values corresponding respectively to certain or uncertain recognition for this hypothesis.

13. Recognizer according to claim 12, characterized by the fact that said combination means (UE1) in said processing unit (EL) are able to linearly combine likelihood scores associated with recognition hypotheses present in both lists, after having subjected them to a pre-processing including the operations of: - calculation of the mean and variance of the scores associated with said hypotheses in the respective list, - normalization of the scores associated with said hypotheses with respect to the mean and variance of the respective list, so as to transform said lists into lists of scores with zero mean and unit variance.

14. Recognizer according to claim 13, characterized by the fact that said combination means (UE1) are enabled to perform the calculation of the mean and variance and the normalization of the scores of the lists provided by each recognition unit (NN, MA) only if these lists include a number of hypotheses not less than a minimum.

15. Recognizer according to any one of claims 12 to 14, characterized in that said recognition reliability evaluation means (CM, UE2) comprise first comparison means (CM) for comparing the best recognition hypothesis identified by the neural network recognition unit (NN) with that provided by the recognition unit (MA) based on hidden Markov models and for emitting an enabling signal if these best hypotheses coincide, and second comparison means (UE2), enabled by said enabling signal and able to compare with respective thresholds the score of the best hypothesis of the reordered list and the difference between the score associated with the best hypothesis of the reordered list and that associated with the hypothesis with the immediately lower score, and to issue said reliability index with a value corresponding to certain recognition when said score and said difference exceed the respective thresholds.

16. Recognizer according to claim 15, characterized by the fact that said second comparison means (UE2) are suitable for comparing with a further threshold the difference between the score associated with the best hypothesis of the reordered list and that associated with a hypothesis that occupies a subsequent position, spaced by a predetermined number of positions in the reordered list, and for issuing said reliability index with a value corresponding to certain recognition when this difference also exceeds said further threshold.