US12020697B2 - Systems and methods for fast filtering of audio keyword search - Google Patents
- Publication number
- US12020697B2 (application US16/929,383)
- Authority
- US
- United States
- Prior art keywords
- voice segment
- keyword
- keywords
- decoder
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/083—Recognition networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
Definitions
- This application relates generally to automatic speech recognition and, more particularly, to audio keyword search techniques.
- Automatic speech recognition (ASR) systems are often used as voice user interfaces to input data or commands into a computer system via a user's voice.
- As ASR technology has evolved and the cost of data storage has decreased, the amount of audio data accessible to computer systems has substantially increased, resulting in the development of audio keyword search systems that enable users to more readily access relevant voice or speech information.
- the current state of the art in audio keyword search requires transcribing all words in a recording with an ASR and then identifying the locations of key terms. Because an ASR typically implements a computationally intensive multi-stage process, devices with relatively limited processing power, such as mobile computing devices, cannot practically implement a keyword search system without substantially hindering their performance. In a current keyword search process or pipeline, the ASR can dominate processing demand, consuming up to 93% of the total.
- the application, in various implementations, addresses deficiencies associated with the efficiency of audio keyword search techniques.
- This application describes an exemplary audio keyword search (KWS) system configured to reduce the computation cost of finding spoken words in audio recordings.
- the inventive systems, methods, and devices described herein learn a function (i.e., a classifier) on posteriorgrams of a voice segment, which may be improved by user feedback. Users can mark short (e.g., 1 to 10 second) segments of audio that contain occurrences of a fixed set of keywords.
- the systems, methods, and devices described herein process the audio and build a classifier function using standard machine learning (ML) techniques.
- the systems, methods, and devices insert the classifier into any audio KWS pipeline and reject audio segments that do not contain at least one keyword of the set of keywords.
- the present approaches build a binary (yes/no) model to detect any of a fixed set of keywords.
- Existing approaches attempt to match single keywords by example, resulting either in many models or a simpler comparison function.
- the inventive systems, methods, and devices herein learn a single function for a set of keywords.
- the present implementation can be applied to a standard output of existing ASR acoustic models.
- the present keyword search classifier and/or filter can be retrofitted into existing ASR systems without the need to modify any system internals.
- the inventive systems, methods, and devices can be applied to speed up third party ASR systems.
- an ASR implements a computationally intensive multi-stage process
- the approaches described herein speed up the execution time by identifying audio recordings for which the stages of the ASR process may be omitted.
- the systems, methods, and devices described herein include two top-level components: a learning module and a runtime KWS module.
- the ASR engine may be provided by a third party.
- the learning module accepts short (e.g., 1-10 second duration) audio segments, a list of keywords of interest, and labels for each segment indicating if the audio segment contains any of a target set of keywords.
- the learning module extracts posteriorgram features for each segment using an ASR engine and then applies one of two types or categories of ML techniques to learn the parameters for the keyword classifier.
- a runtime keyword search module uses the keyword classifier learned by the learning module to speed up the overall KWS module pipeline. Certain inventive approaches described herein build a binary (yes/no) model to detect a fixed set of keywords.
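- By way of a non-limiting illustration, the learning step can be sketched in a few lines of Python. This is a minimal sketch, not the disclosed implementation: logistic regression stands in for whichever machine learning technique is selected, and the `asr_engine` callable is an assumed placeholder for the posteriorgram-producing ASR engine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_keyword_classifier(segments, labels, asr_engine):
    """Learn one binary (yes/no) classifier for the whole keyword set.

    segments:   short audio arrays (roughly 1-10 s each)
    labels:     1 if a segment contains any target keyword, else 0
    asr_engine: assumed callable returning a posteriorgram
                (frames x phones matrix) for a segment
    """
    # Pool each posteriorgram over time into a fixed-length feature vector.
    features = np.stack([asr_engine(seg).mean(axis=0) for seg in segments])
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(features, labels)
    return classifier
```

In this sketch a single function is learned for the entire keyword set, matching the binary (yes/no) formulation above.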
- an audio keyword search system includes a voice activity detector arranged to identify a voice segment of a received audio signal.
- the system also includes an automatic speech recognizer having a first automatic speech recognition engine arranged to identify one or more phonemes included in the voice segment and output the one or more phonemes to a keyword filter and a decoder arranged to, if the one or more phonemes are outputted by the keyword filter, receive the one or more phonemes included in the voice segment and generate a word lattice associated with the voice segment.
- the system further includes a keyword search module having a learning module arranged to: i) receive a plurality of training speech segments, ii) receive segment labels associated with the plurality of training speech segments, iii) receive a first keyword list including one or more first keywords, and iv) execute a second automatic speech recognition engine to extract one or more posteriorgram features associated with the plurality of training speech segments.
- the keyword search module also includes a keyword classifier arranged to execute a machine learning technique to determine a filter function based on the first keyword list and the one or more posteriorgram features.
- the keyword search module further includes the keyword filter which may be arranged to execute the filter function to detect whether the voice segment includes any of the one or more first keywords of the first keyword list and, if detected, output the one or more phonemes included in the voice segment to the decoder but, if not detected, not output the one or more phonemes included in the voice segment to the decoder.
- the system also includes a word lattice search engine arranged to: i) receive the word lattice associated with the voice segment if generated by the decoder, ii) search the word lattice for one or more second keywords in a second keyword list, and iii) determine whether the voice segment includes the one or more second keywords.
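- A runtime sketch of the filter's gating role, under the same assumptions as the training sketch above (the `acoustic_model` and `decoder` callables are illustrative placeholders, and the 0.5 threshold is an assumed value): the expensive decoder runs only when the filter function detects a probable keyword.

```python
def keyword_filter(posteriorgram, classifier, threshold=0.5):
    """Return True only if the segment likely contains a target keyword."""
    feature = posteriorgram.mean(axis=0).reshape(1, -1)
    return classifier.predict_proba(feature)[0, 1] >= threshold

def process_voice_segment(voice_segment, acoustic_model, classifier, decoder):
    posteriorgram = acoustic_model(voice_segment)   # phoneme posteriors per frame
    if keyword_filter(posteriorgram, classifier):
        return decoder(posteriorgram)               # expensive word-lattice step
    return None                                     # rejected; decoder is skipped
```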
- the machine learning technique may include a bag of phone N-grams technique and/or a direct classification technique.
- the bag of phone N-grams technique may include a cosine similarity classification technique.
- the direct classification technique may include a Naïve Bayes classification technique.
- the filter function may include one or more filter parameters learned from the machine learning technique.
- a portion of the one or more first keywords of the first keyword list may include the one or more second keywords of the second keyword list.
- the first automatic speech recognition engine may be the same as the second automatic speech recognition engine.
- the first automatic speech recognition engine may include a deep neural network acoustic model.
- the decoder may include a finite state transducer (FST) decoder.
- the decoder may implement a Hidden Markov Model (HMM) and/or recursive DNN.
- HMM Hidden Markov Model
- a method for performing an audio keyword search includes: identifying a voice segment of a received audio signal; identifying, by a first automatic speech recognition engine, one or more phonemes included in the voice segment; outputting, from the first automatic speech recognition engine, the one or more phonemes to a keyword filter; receiving a plurality of training speech segments, segment labels associated with the plurality of training speech segments, and a first keyword list including one or more first keywords; extracting, by a second automatic speech recognition engine, one or more posteriorgram features associated with the plurality of training speech segments; determining, by a machine learning technique, a filter function based on the first keyword list and the one or more posteriorgram features; and executing the filter function to detect whether the voice segment includes any of the one or more first keywords of the first keyword list and, if detected, outputting the one or more phonemes included in the voice segment to a decoder but, if not detected, not outputting the one or more phonemes included in the voice segment to the decoder.
- a mobile computing device in a further aspect, includes a receiver arranged to receive a communications signal and extract from the communications signal a received audio signal.
- the device includes a voice activity detector arranged to identify a voice segment of the received audio signal.
- the device also includes an automatic speech recognizer having a first automatic speech recognition engine arranged to identify one or more phonemes included in the voice segment and output the one or more phonemes to a keyword filter and a decoder arranged to, if the one or more phonemes are outputted by the keyword filter, receive the one or more phonemes included in the voice segment and generate a word lattice associated with the voice segment.
- the device also includes a keyword search module having a learning module arranged to: i) receive a plurality of training speech segments, ii) receive segment labels associated with the plurality of training speech segments, iii) receive a first keyword list including one or more first keywords, and iv) execute a second automatic speech recognition engine to extract one or more posteriorgram features associated with the plurality of training speech segments.
- the keyword search module also includes a keyword classifier arranged to execute a machine learning technique to determine a filter function based on the first keyword list and the one or more posteriorgram features.
- the keyword search module also includes the keyword filter which may be arranged to execute the filter function to detect whether the voice segment includes any of the one or more first keywords of the first keyword list and, if detected, output the one or more phonemes included in the voice segment to the decoder but, if not detected, not output the one or more phonemes included in the voice segment to the decoder.
- the device further includes a word lattice search engine arranged to: i) receive the word lattice associated with the voice segment if outputted by the decoder, ii) search the word lattice for one or more second keywords in a second keyword list, and iii) determine whether the voice segment includes the one or more second keywords.
- FIG. 1 is an illustration of a public safety communication network
- FIG. 2 is a block diagram of a computer system arranged to perform processing associated with a keyword search system
- FIG. 3 is a block diagram of an audio keyword search system using fast filtering
- FIG. 4 is a block diagram of a keyword search module and filter
- FIG. 5 shows two categories of keyword classifiers
- FIG. 6 shows a keyword classifier technique based on estimating N-gram counts from posteriorgrams
- FIG. 7 shows various keyword classifier techniques based on deep learning
- FIG. 8 A shows graphs of miss rate versus the number of segments using bag of N-grams classifiers for PSC and VAST data
- FIG. 8 B shows graphs of miss rate versus the number of segments using deep learning classifiers for PSC and VAST data
- FIG. 9 shows an exemplary process for performing a fast keyword search.
- the application in various aspects, addresses deficiencies associated with conventional audio keyword search systems.
- the application includes exemplary systems, methods, and devices for keyword searching configured to speed up execution time by identifying audio recordings for which the computationally expensive stages of the ASR process may be omitted. In this way, substantially less processing power is used in the KWS pipeline to perform keyword searching. This, in turn, enables a KWS system and/or application to run more efficiently on platforms having relatively lower processing capability such as, for example, a mobile computing device.
- FIG. 1 is an illustration of a public safety communication (PSC) system 100 including various end user devices such as mobile computing device 102 , vehicle 104 , computer 106 , and telephone 108 .
- Mobile computing device 102 may include a wireless communications device such as, without limitation, a cellular telephone, computing tablet, portable computer, and so on.
- Devices 102 , 104 , and 106 may include a KWS application and/or module 124 , 126 , and 128 respectively.
- PSC system 100 may also include a KWS/ASR server 112 , PSC server 120 , and PSC server 122 .
- PSC system 100 may communicate via network 110 , which may include the Internet, a local area network, a Wi-Fi or 802.11 network, a private network, a wireless network, and/or a satellite network.
- Wireless communications devices 102 and 104 may communicate with other devices, servers, and/or entities connected to network 110 via public land mobile network (PLMN) 116 and base station 118 .
- PLMN 116 may include a CDMA, 802.11, TDMA, 3G, 4G, 5G, GSM, and/or a long term evolution (LTE) network, and the like.
- while FIG. 1 illustrates an exemplary PSC system 100 , it should be understood by one of ordinary skill that a KWS application and/or module 124 , 126 , 128 and/or KWS/ASR server 112 may operate within any communications network and/or environment.
- KWS modules 124 , 126 , and 128 enable a user of device 102 , 104 and 106 respectively to input voice information and/or commands via a user interface of the devices 102 , 104 , and 106 that can then be processed into recognized text and/or commands by modules 124 , 126 , and 128 .
- modules 124 , 126 , and 128 provide keyword search capabilities that enable rapid and efficient identification of, for example, highly relevant and high priority information and/or commands.
- module 124 may detect when a firefighter says “chemical spill” and automatically send a request for a hazmat team to PSC servers 122 and 120 , which may be servers run by the state or local public safety authorities.
- a remote KWS/ASR server 112 may provide keyword search features to a user of telephone 108 during a call.
- FIG. 2 is a block diagram of a computer system 200 arranged to perform processing associated with a keyword search system such as, for example, modules 124 , 126 , and 128 , server 112 , and systems 300 and 400 , which are discussed in detail later herein.
- the exemplary computer system 200 includes a central processing unit (CPU) 202 , a memory 204 , and an interconnect bus 206 .
- the CPU 202 may include a single microprocessor or a plurality of microprocessors or special purpose processors for configuring computer system 200 as a multi-processor system.
- the memory 204 illustratively includes a main memory and a read only memory.
- the computer 200 also includes the mass storage device 208 having, for example, various disk drives, solid state drives, tape drives, etc.
- the memory 204 also includes dynamic random access memory (DRAM) and high-speed cache memory.
- memory 204 stores at least portions of instructions and data for execution by the CPU 202 .
- the memory 204 may also contain compute elements, such as Deep In-Memory Architectures (DIMA), wherein data is sent to memory and a function of the data (e.g., matrix vector multiplication) is read out by the CPU 202 .
- the mass storage 208 may include one or more magnetic disk, optical disk drives, and/or solid state memories, for storing data and instructions for use by the CPU 202 .
- At least one component of the mass storage system 208 , preferably in the form of a non-volatile disk drive, solid state drive, or tape drive, stores the database used for processing audio data and/or running artificial intelligence (AI) engines and neural networks of an ASR and/or KWS system.
- the mass storage system 208 may also include one or more drives for various portable media, such as a floppy disk, flash drive, a compact disc read only memory (CD-ROM, DVD, CD-RW, and variants), memory stick, or an integrated circuit non-volatile memory adapter (i.e., PCMCIA adapter) to input and output data and code to and from the computer system 200 .
- the computer system 200 may also include one or more input/output interfaces for communications, shown by way of example, as interface 210 and/or a transceiver for data communications via the network 212 .
- the data interface 210 may be a modem, an Ethernet card or any other suitable data communications device.
- the data interface 210 may provide a relatively high-speed link to a network 212 and/or network 110 , such as an intranet, internet, or the Internet, either directly or through another external interface.
- the communication link to the network 212 may be, for example, optical, wired, or wireless (e.g., via satellite or cellular network).
- the computer system 200 may also connect via the data interface 210 and network 212 to at least one other computer system to perform remote or distributed multi-sensor processing.
- computer system 200 may include a mainframe or other type of host computer system capable of Web-based communications via network 212 .
- the computer system 200 may include software for operating a network application such as a web server and/or web client.
- the computer system 200 may also include suitable input/output ports that may interface with a portable data storage device, or use the interconnect bus 206 for interconnection with a local display 216 and keyboard 214 or the like serving as a local user interface for programming and/or data retrieval purposes.
- the display 216 may include a touch screen capability to enable users to interface with the system 200 by touching portions of the surface of the display 216 .
- Computer system 200 may include one or more microphones and/or speakers to facilitate voice and/or audio communications with a user. Server operations personnel may interact with the system 200 for controlling and/or programming the system from remote terminal devices via the network 212 .
- the computer system 200 may run a variety of application programs and store associated data in a database of mass storage system 208 .
- One or more such applications may include a KWS system and/or an ASR such as described with respect to FIGS. 3 and 4 .
- the components contained in the computer system 200 may enable the computer system to be used as a server, workstation, personal computer, network terminal, mobile computing device, mobile telephone, System on a Chip (SoC), and the like.
- the computer system 200 may include one or more applications such as machine learning (ML), deep learning, and artificial intelligence using neural networks.
- the system 200 may include software and/or hardware that implements a web server application.
- the web server application may include software such as HTML, XML, WML, SGML, PHP (Hypertext Preprocessor), CGI, and like languages.
- the foregoing features of the disclosure may be realized as a software component operating in the system 200 where the system 200 includes a Unix workstation, a Windows workstation, a LINUX workstation, or other type of workstation. Other operating systems may be employed such as, without limitation, Windows, MAC OS, and LINUX.
- the software can optionally be implemented as a C language computer program, or a computer program written in any high level language including, without limitation, Javascript, Java, CSS, Python, Keras, TensorFlow, PHP, Ruby, C++, C, Shell, C#, Objective-C, Go, R, TeX, VimL, Perl, Scala, CoffeeScript, Emacs Lisp, Swift, Fortran, or Visual BASIC. Certain script-based programs may be employed such as XML, WML, PHP, and so on.
- the system 200 may use a digital signal processor (DSP).
- the mass storage 208 may include a database.
- the database may be any suitable database system, including the commercially available Microsoft Access database, and can be a local or distributed database system.
- a database system may implement Sybase and/or a SQL Server.
- the database may be supported by any suitable persistent data memory, such as a hard disk drive, RAID system, tape drive system, floppy diskette, or any other suitable system.
- the system 200 may include a database that is integrated with the system 300 and/or 400 ; however, it will be understood that, in other implementations, the database and mass storage 208 can be an external element.
- the system 200 may include an Internet browser program and/or be configured to operate as a web server.
- the client and/or web server may be configured to recognize and interpret various network protocols that may be used by a client or server program. Commonly used protocols include Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), Telnet, Secure Sockets Layer (SSL), and Transport Layer Security (TLS), for example.
- new protocols and revisions of existing protocols may be frequently introduced.
- a new revision of the server and/or client application may be continuously developed and released.
- the system 300 and/or 400 includes a network-based, e.g., Internet-based, application that may be configured and run on the system 200 and/or any combination of the other components of the system 300 and/or 400 .
- the server 112 and/or computer system 200 may include a web server running a Web 2.0 application or the like.
- Web applications running on systems 300 and/or 400 may use server-side dynamic content generation mechanisms such as, without limitation, Java servlets, CGI, PHP, or ASP.
- mashed content may be generated by a web browser running, for example, client-side scripting including, without limitation, JavaScript and/or applets on a wireless device.
- system 200 , 300 , and/or 400 may include applications that employ asynchronous JavaScript+XML (Ajax) and like technologies that use asynchronous loading and content presentation techniques. These techniques may include, without limitation, XHTML and CSS for style presentation, document object model (DOM) API exposed by a web browser, asynchronous data exchange of XML data, and web browser side scripting, e.g., JavaScript.
- Certain web-based applications and services may utilize web protocols including, without limitation, the Simple Object Access Protocol (SOAP) and representational state transfer (REST). REST may utilize HTTP with XML.
- the computer 200 , server 122 , devices 102 , 104 , and 106 , system 300 , system 400 , or other components of systems 300 and 400 may also provide enhanced security and data encryption.
- Enhanced security may include access control, biometric authentication, cryptographic authentication, message integrity checking, encryption, digital rights management services, and/or other like security services.
- the security may include protocols such as IPSEC and IKE.
- the encryption may include, without limitation, DES, 3DES, AES, RSA, ECC, and any like public key or private key based schemes.
- FIG. 3 is a block diagram of an audio keyword search system and/or pipeline 300 using fast filtering.
- System 300 includes a KWS module 302 including a KWS filter 304 .
- the KWS module 302 including KWS filter 304 may be integrated within ASR 306 .
- Pipeline and/or system 300 includes a receiver 308 , a speech activity detector (SAD) and/or voice activity detector (VAD) 310 , a word posterior indexing module 312 , a search module 314 , and a decision threshold function 316 that may output final scores 318 associated with the detection of keywords 320 .
- receiver 308 receives a modulated communication signal 322 , which may include an electronic signal transmitted via a wireless and/or wireline medium. Receiver 308 demodulates and/or extracts an audio signal from signal 322 , which includes at least one audio segment 324 that is output to SAD 310 .
- SAD 310 uses a deep neural network (DNN) model 326 to detect whether audio segment 324 includes speech. Those segments 324 determined to include speech are passed on and output to ASR 306 .
- ASR 306 uses a Hidden Markov Model deep neural network (HMM-DNN) 328 to recognize the speech within a speech segment such as segment 324 .
- Each segment 324 may have a duration in the range of 1 second to 10 seconds. In some configurations, each segment may be less than or equal to about 1 second, 5 seconds, 10 seconds, 15 seconds, 20 seconds, or 30 seconds.
- KWS module 302 uses a HMM-DNN model 328 or another neural network model to define a classification function that is used by KWS filter 304 to filter out and/or omit segments such as segment 324 that do not contain a keyword in keyword list 330 .
- KWS module 302 may receive training audio segments 332 , keyword list 330 , and segment labels 334 associated with each training audio segment 332 . Training segments 332 may include a mix of segments having a keyword in keyword list 330 and segments not having a keyword in keyword list 330 . Segment labels 334 may include labels indicating whether each audio segment of audio segments 332 includes a keyword in keyword list 330 or does not include a keyword of keyword list 330 .
- ASR 306 includes a phoneme and/or word decoder such as, for example, a finite state transducer (FST) decoder configured to form a word and/or phoneme lattice associated with each of the speech segments 324 .
- the formulation and processing associated with performing speech recognition by the decoder can account for a substantial percentage of the processing cost expended by system 300 to perform keyword searching.
- the decoder usually must process every segment such as segment 324 even though a substantial portion of the processed audio segments may not include a keyword of keywords 320 .
- ASR 306 is able to perform automatic speech recognition substantially more efficiently, more rapidly, and by utilizing substantially less processing power of a device such as devices 102 , 104 , 106 and server 112 .
- ASR 306 may then output one or more word lattices from its decoder to word posterior indexing function 312 which may formulate one or more inverted indices to enable searching by search module 314 for keywords 320 .
- decision threshold function 316 may assign a score 318 related to a probability of the presence of any keywords 320 . If the score 318 is determined to be greater than or equal to the threshold value, the pipeline and/or system 300 determines that the audio segment 324 includes a keyword of keywords 320 .
- System 300 may then store audio segment 324 and/or forward audio segment 324 to another system and/or server 120 or 122 for further review or processing.
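- The final thresholding step admits a simple sketch. The patent does not fix the scoring formula, so taking the maximum per-keyword posterior from the lattice search is an assumption made here for illustration.

```python
def final_decision(keyword_posteriors, threshold):
    """Combine per-keyword scores from the lattice search into a final score 318."""
    score = max(keyword_posteriors, default=0.0)
    return score, score >= threshold   # True: keep segment for storage/forwarding
```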
- FIG. 4 is a block diagram of a keyword search system 400 including a KWS learning module 402 integrated with an ASR 404 .
- KWS learning module 402 includes a keyword (KW) filter 406 , keyword classifier 408 and an ASR acoustic model 410 .
- FIG. 4 shows KW filter 406 positioned within ASR 404 between DNN acoustic model 412 and decoder 414 .
- KWS learning module 402 , or any portion thereof such as KW filter 406 , is integrated within ASR 404 .
- KWS learning module 402 including KW filter 406 is capable of being retrofitted into and/or integrated with an existing ASR 404 .
- System 400 also includes VAD 416 and word lattice search module 418 .
- KWS system 400 receives at least one audio segment 420 that is input into VAD 416 .
- VAD 416 may receive segment 420 from a receiver such as receiver 308 .
- VAD 416 uses a deep neural network (DNN) model such as model 326 to detect whether audio segment 420 includes speech. Those segments 420 determined to include speech are passed on and/or output to ASR 404 .
- ASR 404 uses DNN acoustic model 412 , which may include a Hidden Markov Model deep neural network (HMM-DNN) such as model 328 to recognize the speech within a speech segment such as segment 420 .
- DNN acoustic model 412 may include other acoustic models such as, without limitation, a recursive DNN.
- Each segment 420 may have a duration in the range of 1 second to 10 seconds. In some configurations, each segment may be less than or equal to about 1 second, 5 seconds, 10 seconds, 15 seconds, 20 seconds, or 30 seconds.
- KWS learning module 402 uses ASR acoustic model 410 which may include HMM-DNN model 328 or another neural network model to define a keyword classifier 408 .
- a first automatic speech recognition engine and/or DNN acoustic model 412 may be arranged to identify one or more phonemes included in voice segment 420 and output the one or more phonemes to KW filter 406 .
- Decoder 414 may be arranged to, if the one or more phonemes are outputted by the KW filter 406 , receive the one or more phonemes included in voice segment 420 and generate a word lattice associated with voice segment 420 .
- KWS learning module 402 may implement a second automatic speech recognition engine such as ASR acoustic model 410 to extract one or more posteriorgram features associated with training speech segments 422 .
- Keyword classifier 408 may execute a machine learning technique to determine a filter function based on a keyword list such as keyword list 330 and one or more posteriorgram features. Keyword classifier 408 may define a single filter function that is used by KW filter 406 to filter out and/or omit segments such as segment 420 that do not contain a keyword in keyword list 330 , which is an input into learning module 402 .
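- For reference, a posteriorgram as used here is a frames-by-phones matrix of per-frame phone probabilities. A minimal sketch of its extraction, assuming an acoustic model that emits per-frame phone scores (the `frame_features` front end is an illustrative placeholder):

```python
import numpy as np

def extract_posteriorgram(waveform, acoustic_model, frame_features):
    """Map audio to a (num_frames x num_phones) matrix of phone posteriors."""
    feats = frame_features(waveform)   # e.g., filterbank features per frame
    logits = acoustic_model(feats)     # per-frame phone scores from the DNN
    # Per-frame softmax: each row becomes a probability distribution over phones.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)
```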
- KWS learning module 402 may receive training audio segments 422 , a keyword list such as keyword list 330 of FIG. 3 , and segment labels 424 associated with each training audio segment 422 .
- Training segments 422 may include a mix of segments having a keyword in keyword list 330 and segments not having a keyword in keyword list 330 .
- Segment labels 424 may include labels indicating whether each audio segment of audio segments 422 includes a keyword in keyword list 330 or does not include a keyword of keyword list 330 .
- each segment label may identify a specific keyword or keywords included in a particular training audio segment 422 .
- ASR 404 includes a phoneme and/or word decoder 414 such as, for example, a finite state transducer (FST) decoder configured to form a word and/or phoneme lattice associated with each of the speech segments 420 .
- the formulation and processing associated with performing speech recognition by decoder 414 can account for a substantial percentage of the processing cost expended by system 400 to perform keyword searching.
- decoder 414 may otherwise process every segment such as segment 420 even though a substantial portion of the processed audio segments may not include a keyword of keywords 320 , which may be used by word lattice search module 418 .
- ASR 404 is able to perform automatic speech recognition substantially more efficiently, more rapidly, and by utilizing substantially less processing power of a device such as devices 102 , 104 , 106 and server 112 .
- ASR 404 may then output one or more word lattices from its decoder 414 and perform a word search using word lattice search module 418 to determine whether an audio segment such as segment 420 includes a target keyword such as in keyword lists 320 and 330 .
- Keyword list 330 may be the same as keyword list 320 .
- keyword list 320 may include a subset of keywords in keyword list 330 .
- Word lattice search module 418 may perform one or more of the operations of search module 314 and decision threshold function 316 to determine a score associated with each audio segment 420 such as score 318 .
- System 400 may store audio segment 420 and/or forward audio segment 420 to another system and/or server 120 or 122 for further review or processing.
- word lattice search module 418 includes a word lattice search engine arranged to: i) receive the word lattice associated with voice segment 420 if generated by decoder 414 , ii) search the word lattice for one or more keywords in a keyword list such as keywords 320 , and iii) determine whether voice segment 420 includes one or more of the keywords 320 .
- FIG. 5 shows diagram 500 including two categories of keyword filter classifiers 502 and 504 .
- Keyword filter classifier 502 includes a bag of phoneme or phone N-grams technique.
- the bag of phone N-grams classifier 502 may apply an approach of collapsing frame outputs associated with a voice segment such as segment 324 or 420 and then estimating phone N-gram counts as a sparse feature vector representation of a voice segment such as segment 324 or 420 . This enables the bag of phone N-grams classifier 502 to make a binary (yes/no) determination of whether a voice segment such as segment 324 or 420 contains a keyword in, for example, keyword list 330 , or does not contain a keyword in keyword list 330 .
- Keyword filter classifier 504 includes a direct classification technique.
- the direct classification classifier 504 may feed posterior probabilities of an acoustic model output, such as from ASR acoustic model 410 , directly to deep networks. This enables the direct classification classifier 504 to make a binary (yes/no) determination of whether a voice segment such as segment 324 or 420 contains a keyword in, for example, keyword list 330 , or does not contain a keyword in keyword list 330 .
- FIG. 6 shows a graph 600 of a posteriorgram based on the spoken term “individually” that illustrates how the bag of phone N-grams classifier 502 may make a yes/no determination based on estimating N-gram counts from a posteriorgram by detecting change points 602 in the distribution of phones using a sliding two-frame block.
- Classifier 502 may record a change point when the root mean square error (RMSE) between normalized phone distributions in adjacent blocks exceeds a set threshold. The counts are then normalized to a sum of 1 in each bin defined by the change points 602 .
- Classifier 502 then computes overlapping 2-gram soft counts from the bins.
- weighted soft counts are computed using inverse document frequency (IDF) from training scripts and/or training segments.
- Classifier 502 may then compute the cosine similarity between the keywords such as in keyword list 330 and the N-grams of a segment such as segment 324 or 420 to make a binary determination whether a keyword of list 330 is present in an audio segment such as segment 324 or 420 .
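- The bag of phone N-grams steps above can be sketched as follows. This is an illustrative reading of the technique, not the disclosed implementation: the RMSE threshold, the adjacent-frame comparison, and the cosine similarity threshold are assumed values.

```python
import numpy as np

def change_points(posteriorgram, rmse_threshold=0.1):
    """Detect phone-distribution change points 602 between adjacent frame blocks."""
    points = [0]
    for t in range(1, len(posteriorgram)):
        prev = posteriorgram[t - 1] / posteriorgram[t - 1].sum()
        cur = posteriorgram[t] / posteriorgram[t].sum()
        if np.sqrt(np.mean((prev - cur) ** 2)) > rmse_threshold:  # RMSE test
            points.append(t)
    points.append(len(posteriorgram))
    return points

def bigram_soft_counts(posteriorgram, points, idf_weights=None):
    """Estimate overlapping phone 2-gram soft counts from the binned posteriorgram."""
    # Normalize phone counts to sum to 1 within each bin between change points.
    bins = [posteriorgram[a:b].sum(axis=0) for a, b in zip(points[:-1], points[1:])]
    bins = [b / b.sum() for b in bins if b.sum() > 0]
    num_phones = posteriorgram.shape[1]
    counts = np.zeros((num_phones, num_phones))
    for left, right in zip(bins[:-1], bins[1:]):
        counts += np.outer(left, right)          # soft count for each phone pair
    counts = counts.ravel()
    return counts * idf_weights if idf_weights is not None else counts

def contains_keyword(segment_vec, keyword_vec, threshold=0.3):
    """Binary yes/no decision via cosine similarity of sparse N-gram vectors."""
    denom = np.linalg.norm(segment_vec) * np.linalg.norm(keyword_vec) + 1e-12
    return float(segment_vec @ keyword_vec) / denom >= threshold
```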
- An ASR engine such as ASR acoustic model 410 may implement other bag of N-gram classifiers including Naïve Bayes (NB) N-gram classifiers of various types such as Naïve Bayes with restricted vocabulary (NB-KW), where feature vectors are limited to keyword N-grams; Naïve Bayes with full vocabulary using all N-grams (NB-All) in training; and Naïve Bayes with minimum classification error (NB-MCE), having per-N-gram discriminatively trained weights using misclassification measure M(S).
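- A minimal multinomial Naïve Bayes sketch over such N-gram count vectors, in the spirit of NB-KW and NB-All (the discriminative MCE weight training of NB-MCE is not shown, and Laplace smoothing is an assumed choice):

```python
import numpy as np

class NaiveBayesNgram:
    """Binary Naive Bayes classifier over N-gram soft-count feature vectors."""

    def fit(self, counts, labels, alpha=1.0):
        counts, labels = np.asarray(counts), np.asarray(labels)
        self.log_prior = np.log(np.bincount(labels) / len(labels))
        self.log_lik = np.empty((2, counts.shape[1]))
        for c in (0, 1):
            totals = counts[labels == c].sum(axis=0) + alpha  # Laplace smoothing
            self.log_lik[c] = np.log(totals / totals.sum())
        return self

    def contains_keyword(self, x):
        scores = self.log_prior + self.log_lik @ x
        return bool(scores[1] > scores[0])   # True: some keyword likely present
```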
- FIG. 7 shows a diagram 700 of various direct classification techniques based on deep learning that may be implemented by, for example, learning module 402 using an ASR engine such as ASR acoustic model 410 .
- the techniques of FIG. 7 may include a wide convolutional neural network (Wide CNN) classifier 702 , a frame-level bidirectional long short-term memory (LSTM) classifier 704 , and a CNN with bidirectional LSTM (CNN+B-LSTM) classifier 706 .
- These classifiers enable a binary (yes/no) determination of whether a voice segment such as segment 324 or 420 contains a keyword in, for example, keyword list 330 , or does not contain a keyword in keyword list 330 .
- the Conv2D block of Wide CNN classifier 702 and/or CNN+B-LSTM classifier 706 is repeated 1, 2, or 3 times, or more.
- Wide CNN classifier 702 and/or CNN+B-LSTM classifier 706 uses a Conv2D block with MaxPool and Dropout with 20 ms input windows.
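- Since the disclosure elsewhere mentions Keras and TensorFlow, a CNN+B-LSTM classifier 706 in that style might be sketched as below. The layer widths, kernel sizes, and dropout rate are illustrative assumptions, and even input dimensions are assumed so the reshape divides cleanly.

```python
from tensorflow.keras import layers, models

def build_cnn_blstm(num_frames, num_phones):
    """Sketch of a CNN+B-LSTM direct classifier over a posteriorgram input."""
    model = models.Sequential([
        layers.Input(shape=(num_frames, num_phones, 1)),
        # Conv2D block with MaxPool and Dropout; may repeat 1-3 times or more.
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),
        # Flatten the phone axis per time step, then a bidirectional LSTM.
        layers.Reshape((num_frames // 2, (num_phones // 2) * 32)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(1, activation="sigmoid"),  # probability a keyword is present
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```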
- FIG. 8 A shows graphs 802 and 804 of miss rate 806 and 810 versus the number of segments 808 and 812 , respectively, using bag of N-grams classifiers such as Cosine, NB-KW, NB-All, and NB-MCE for PSC and video analytic speech technology (VAST) data.
- Graphs 802 and 804 illustrate that, at a 50% segment removal rate, the miss rate increases by only about 3-5% using lightweight classifiers, including bag of N-grams classifiers, relative to conventional KWS systems.
- FIG. 8 B shows graphs 852 and 854 of miss rate 856 and 860 versus the number of segments 858 and 862 , respectively, using deep learning classifiers such as Cosine, NB-KW, NB-All, NB-MCE, BLSTM, C-BLSTM, and CNNO3 for PSC and VAST data.
- Graphs 852 and 854 illustrate that direct classification methods can be as effective on PSC data as lightweight classifiers. Direct classifiers may also benefit from more PSC training data versus the VAST data.
- FIG. 9 shows an exemplary process 900 for performing a fast keyword search.
- Exemplary keyword search process 900 includes the following steps: identify a voice segment of a received audio signal (Step 902 ); identify, by a first automatic speech recognition engine such as DNN acoustic model 412 , one or more phonemes included in the voice segment such as voice segment 324 or 420 (Step 904 ); output, from the first automatic speech recognition engine, the one or more phonemes to a keyword filter such as filter 304 or 406 (Step 906 ); receive a plurality of training speech segments such as segments 332 or 422 , segment labels associated with the plurality of training speech segments such as labels 334 or 424 , and a first keyword list including one or more first keywords such as keyword list 330 (Step 908 ); extract, by a second automatic speech recognition engine such as ASR acoustic model 410 , one or more posteriorgram features associated with the plurality of training speech segments 332 or 422 (Step 910 ); determine, by a machine learning technique, a filter function based on the first keyword list and the one or more posteriorgram features (Step 912 ); and execute the filter function to detect whether the voice segment includes any of the one or more first keywords of the first keyword list and, if detected, output the one or more phonemes included in the voice segment to a decoder such as decoder 414 but, if not detected, do not output the one or more phonemes to the decoder (Step 914 ).
- a computer program product that includes a computer usable and/or readable medium.
- a computer usable medium may consist of a read only memory device, such as a CD ROM disk or conventional ROM devices, or a random access memory, such as a hard drive device or a computer diskette, or flash memory device having a computer readable program code stored thereon.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/929,383 US12020697B2 (en) | 2020-07-15 | 2020-07-15 | Systems and methods for fast filtering of audio keyword search |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/929,383 US12020697B2 (en) | 2020-07-15 | 2020-07-15 | Systems and methods for fast filtering of audio keyword search |
Publications (2)
Publication Number | Publication Date |
---|---|
US20220020361A1 US20220020361A1 (en) | 2022-01-20 |
US12020697B2 true US12020697B2 (en) | 2024-06-25 |
Family
ID=79293505
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/929,383 Active 2040-11-11 US12020697B2 (en) | 2020-07-15 | 2020-07-15 | Systems and methods for fast filtering of audio keyword search |
Country Status (1)
Country | Link |
---|---|
US (1) | US12020697B2 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112836039B (en) * | 2021-01-27 | 2023-04-21 | 成都网安科技发展有限公司 | Voice data processing method and device based on deep learning |
CN113724718B (en) * | 2021-09-01 | 2022-07-29 | 宿迁硅基智能科技有限公司 | Target audio output method, device and system |
CN115527523A (en) * | 2022-09-23 | 2022-12-27 | 北京世纪好未来教育科技有限公司 | Keyword speech recognition method, device, storage medium and electronic equipment |
Citations (64)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20010012998A1 (en) | 1999-12-17 | 2001-08-09 | Pierrick Jouet | Voice recognition process and device, associated remote control device |
US6542869B1 (en) | 2000-05-11 | 2003-04-01 | Fuji Xerox Co., Ltd. | Method for automatic analysis of audio including music and speech |
US20040083104A1 (en) | 2002-10-17 | 2004-04-29 | Daben Liu | Systems and methods for providing interactive speaker identification training |
US20060206310A1 (en) | 2004-06-29 | 2006-09-14 | Damaka, Inc. | System and method for natural language processing in a peer-to-peer hybrid communications network |
US20070112837A1 (en) | 2005-11-09 | 2007-05-17 | Bbnt Solutions Llc | Method and apparatus for timed tagging of media content |
US7337115B2 (en) | 2002-07-03 | 2008-02-26 | Verizon Corporate Services Group Inc. | Systems and methods for providing acoustic classification |
US7437284B1 (en) | 2004-07-01 | 2008-10-14 | Basis Technology Corporation | Methods and systems for language boundary detection |
US20100125448A1 (en) | 2008-11-20 | 2010-05-20 | Stratify, Inc. | Automated identification of documents as not belonging to any language |
US20100191530A1 (en) | 2009-01-23 | 2010-07-29 | Honda Motor Co., Ltd. | Speech understanding apparatus |
US20120017146A1 (en) | 2010-07-13 | 2012-01-19 | Enrique Travieso | Dynamic language translation of web site content |
US20120323573A1 (en) | 2011-03-25 | 2012-12-20 | Su-Youn Yoon | Non-Scorable Response Filters For Speech Scoring Systems |
US20130311190A1 (en) | 2012-05-21 | 2013-11-21 | Bruce Reiner | Method and apparatus of speech analysis for real-time measurement of stress, fatigue, and uncertainty |
CN104036012A (en) * | 2014-06-24 | 2014-09-10 | 中国科学院计算技术研究所 | Dictionary learning method, visual word bag characteristic extracting method and retrieval system |
US20150095026A1 (en) * | 2013-09-27 | 2015-04-02 | Amazon Technologies, Inc. | Speech recognizer with multi-directional decoding |
US20150194147A1 (en) | 2011-03-25 | 2015-07-09 | Educational Testing Service | Non-Scorable Response Filters for Speech Scoring Systems |
US20150228279A1 (en) | 2014-02-12 | 2015-08-13 | Google Inc. | Language models using non-linguistic context |
US20150371635A1 (en) * | 2013-06-25 | 2015-12-24 | Keith Kintzley | System and Method for Processing Speech to Identify Keywords or Other Information |
US20160042739A1 (en) | 2014-08-07 | 2016-02-11 | Nuance Communications, Inc. | Fast speaker recognition scoring using i-vector posteriors and probabilistic linear discriminant analysis |
US20160240188A1 (en) | 2013-11-20 | 2016-08-18 | Mitsubishi Electric Corporation | Speech recognition device and speech recognition method |
US20160267904A1 (en) | 2015-03-13 | 2016-09-15 | Google Inc. | Addressing Missing Features in Models |
US20160284347A1 (en) * | 2015-03-27 | 2016-09-29 | Google Inc. | Processing audio waveforms |
WO2016189307A1 (en) | 2015-05-26 | 2016-12-01 | Sonalytic Limited | Audio identification method |
US20170011735A1 (en) | 2015-07-10 | 2017-01-12 | Electronics And Telecommunications Research Institute | Speech recognition system and method |
2020-07-15: US application Ser. No. 16/929,383 filed; issued as US 12020697 B2 (status: Active)
Patent Citations (66)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20010012998A1 (en) | 1999-12-17 | 2001-08-09 | Pierrick Jouet | Voice recognition process and device, associated remote control device |
US6542869B1 (en) | 2000-05-11 | 2003-04-01 | Fuji Xerox Co., Ltd. | Method for automatic analysis of audio including music and speech |
US7337115B2 (en) | 2002-07-03 | 2008-02-26 | Verizon Corporate Services Group Inc. | Systems and methods for providing acoustic classification |
US20040083104A1 (en) | 2002-10-17 | 2004-04-29 | Daben Liu | Systems and methods for providing interactive speaker identification training |
US20180006865A1 (en) | 2003-12-27 | 2018-01-04 | Electronics And Telecommunications Research Institute | Preamble configuring method in the wireless lan system, and a method for a frame synchronization |
US20060206310A1 (en) | 2004-06-29 | 2006-09-14 | Damaka, Inc. | System and method for natural language processing in a peer-to-peer hybrid communications network |
US7437284B1 (en) | 2004-07-01 | 2008-10-14 | Basis Technology Corporation | Methods and systems for language boundary detection |
US20070112837A1 (en) | 2005-11-09 | 2007-05-17 | Bbnt Solutions Llc | Method and apparatus for timed tagging of media content |
US7801910B2 (en) | 2005-11-09 | 2010-09-21 | Ramp Holdings, Inc. | Method and apparatus for timed tagging of media content |
US20100125448A1 (en) | 2008-11-20 | 2010-05-20 | Stratify, Inc. | Automated identification of documents as not belonging to any language |
US20100191530A1 (en) | 2009-01-23 | 2010-07-29 | Honda Motor Co., Ltd. | Speech understanding apparatus |
US20120017146A1 (en) | 2010-07-13 | 2012-01-19 | Enrique Travieso | Dynamic language translation of web site content |
US20150194147A1 (en) | 2011-03-25 | 2015-07-09 | Educational Testing Service | Non-Scorable Response Filters for Speech Scoring Systems |
US20120323573A1 (en) | 2011-03-25 | 2012-12-20 | Su-Youn Yoon | Non-Scorable Response Filters For Speech Scoring Systems |
US20130311190A1 (en) | 2012-05-21 | 2013-11-21 | Bruce Reiner | Method and apparatus of speech analysis for real-time measurement of stress, fatigue, and uncertainty |
US20190138539A1 (en) | 2012-06-21 | 2019-05-09 | Google, Llc | Dynamic language model |
US20170061002A1 (en) | 2012-12-31 | 2017-03-02 | Google Inc. | Hold Back and Real Time Ranking of Results in a Streaming Matching System |
US20150371635A1 (en) * | 2013-06-25 | 2015-12-24 | Keith Kintzley | System and Method for Processing Speech to Identify Keywords or Other Information |
US20150095026A1 (en) * | 2013-09-27 | 2015-04-02 | Amazon Technologies, Inc. | Speech recognizer with multi-directional decoding |
US20160240188A1 (en) | 2013-11-20 | 2016-08-18 | Mitsubishi Electric Corporation | Speech recognition device and speech recognition method |
US20150228279A1 (en) | 2014-02-12 | 2015-08-13 | Google Inc. | Language models using non-linguistic context |
CN104036012A (en) * | 2014-06-24 | 2014-09-10 | Institute of Computing Technology, Chinese Academy of Sciences | Dictionary learning method, visual word bag characteristic extracting method and retrieval system
US20160042739A1 (en) | 2014-08-07 | 2016-02-11 | Nuance Communications, Inc. | Fast speaker recognition scoring using i-vector posteriors and probabilistic linear discriminant analysis |
US20170365251A1 (en) | 2015-01-16 | 2017-12-21 | Samsung Electronics Co., Ltd. | Method and device for performing voice recognition using grammar model |
US20160267904A1 (en) | 2015-03-13 | 2016-09-15 | Google Inc. | Addressing Missing Features in Models |
US20160284347A1 (en) * | 2015-03-27 | 2016-09-29 | Google Inc. | Processing audio waveforms |
US20200038021A1 (en) | 2015-04-22 | 2020-02-06 | Covidien Lp | Handheld electromechanical surgical system |
WO2016189307A1 (en) | 2015-05-26 | 2016-12-01 | Sonalytic Limited | Audio identification method |
US20170011735A1 (en) | 2015-07-10 | 2017-01-12 | Electronics And Telecommunications Research Institute | Speech recognition system and method |
US20170092266A1 (en) | 2015-09-24 | 2017-03-30 | Intel Corporation | Dynamic adaptation of language models and semantic tracking for automatic speech recognition |
US10402500B2 (en) | 2016-04-01 | 2019-09-03 | Samsung Electronics Co., Ltd. | Device and method for voice translation |
US20170294192A1 (en) | 2016-04-08 | 2017-10-12 | Knuedge Incorporated | Classifying Signals Using Mutual Information |
US20170308613A1 (en) * | 2016-04-26 | 2017-10-26 | Baidu Usa Llc | Method and system of determining categories associated with keywords using a trained model |
US20180012594A1 (en) | 2016-07-08 | 2018-01-11 | Google Inc. | Follow-up voice query prediction |
US10964329B2 (en) | 2016-07-11 | 2021-03-30 | FTR Labs Pty Ltd | Method and system for automatically diarising a sound recording |
US20190304470A1 (en) | 2016-07-11 | 2019-10-03 | FTR Labs Pty Ltd | Method and system for automatically diarising a sound recording |
US20180053502A1 (en) | 2016-08-19 | 2018-02-22 | Google Inc. | Language models using domain-specific model components |
US20180061412A1 (en) | 2016-08-31 | 2018-03-01 | Samsung Electronics Co., Ltd. | Speech recognition method and apparatus based on speaker recognition |
US20180174600A1 (en) | 2016-12-16 | 2018-06-21 | Google Inc. | Associating faces with voices for speaker diarization within videos |
US20190385589A1 (en) | 2017-03-17 | 2019-12-19 | Yamaha Corporation | Speech Processing Device, Teleconferencing Device, Speech Processing System, and Speech Processing Method |
US20200035739A1 (en) | 2017-04-19 | 2020-01-30 | Sony Semiconductor Solutions Corporation | Semiconductor device, method of manufacturing the same, and electronic apparatus |
US20180342239A1 (en) | 2017-05-26 | 2018-11-29 | International Business Machines Corporation | Closed captioning through language detection |
US20180357998A1 (en) | 2017-06-13 | 2018-12-13 | Intel IP Corporation | Wake-on-voice keyword detection with integrated language identification |
US20180374476A1 (en) | 2017-06-27 | 2018-12-27 | Samsung Electronics Co., Ltd. | System and device for selecting speech recognition model |
US20190108257A1 (en) | 2017-10-06 | 2019-04-11 | Soundhound, Inc. | Bidirectional probabilistic natural language rewriting and selection |
US20210055778A1 (en) * | 2017-12-29 | 2021-02-25 | Fluent.Ai Inc. | A low-power keyword spotting system |
US20200021949A1 (en) | 2018-01-21 | 2020-01-16 | Qualcomm Incorporated | Systems and methods for locating a user equipment using generic position methods for a 5g network |
US20190371318A1 (en) | 2018-02-15 | 2019-12-05 | DMAI, Inc. | System and method for adaptive detection of spoken language via multiple speech models |
US20210232776A1 (en) | 2018-04-27 | 2021-07-29 | Llsollu Co., Ltd. | Method for recording and outputting conversion between multiple parties using speech recognition technology, and device therefor |
US20200019492A1 (en) * | 2018-07-12 | 2020-01-16 | EMC IP Holding Company LLC | Generating executable test automation code automatically according to a test case |
US20200027444A1 (en) | 2018-07-20 | 2020-01-23 | Google Llc | Speech recognition with sequence-to-sequence models |
US20200074992A1 (en) | 2018-08-31 | 2020-03-05 | UBTECH Robotics Corp. | Method and apparatus for judging termination of sound reception and terminal device |
US20200111476A1 (en) | 2018-10-04 | 2020-04-09 | Fujitsu Limited | Recording medium, language identification method, and information processing device |
US10573312B1 (en) | 2018-12-04 | 2020-02-25 | Sorenson Ip Holdings, Llc | Transcription generation from multiple speech recognition systems |
US20200243094A1 (en) | 2018-12-04 | 2020-07-30 | Sorenson Ip Holdings, Llc | Switching between speech recognition systems |
US20200175961A1 (en) | 2018-12-04 | 2020-06-04 | Sorenson Ip Holdings, Llc | Training of speech recognition systems |
US10388272B1 (en) | 2018-12-04 | 2019-08-20 | Sorenson Ip Holdings, Llc | Training speech recognition systems using word sequences |
US20190237096A1 (en) * | 2018-12-28 | 2019-08-01 | Intel Corporation | Ultrasonic attack detection employing deep learning |
US20200243077A1 (en) * | 2019-01-28 | 2020-07-30 | Pindrop Security, Inc. | Unsupervised keyword spotting and word discovery for fraud analytics |
US20200293875A1 (en) | 2019-03-12 | 2020-09-17 | International Business Machines Corporation | Generative Adversarial Network Based Audio Restoration |
US11176934B1 (en) | 2019-03-22 | 2021-11-16 | Amazon Technologies, Inc. | Language switching on a speech interface device |
US20200357391A1 (en) | 2019-05-06 | 2020-11-12 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US20200380074A1 (en) * | 2019-05-29 | 2020-12-03 | Apple Inc. | Methods and systems for trending issue identification in text streams |
US20200387677A1 (en) | 2019-06-05 | 2020-12-10 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling the electronic device thereof |
US20210248421A1 (en) * | 2020-02-06 | 2021-08-12 | Shenzhen Malong Technologies Co., Ltd. | Channel interaction networks for image categorization |
US20210342785A1 (en) * | 2020-05-01 | 2021-11-04 | Monday.com Ltd. | Digital processing systems and methods for virtual file-based electronic white board in collaborative work systems |
Non-Patent Citations (14)
Title |
---|
Bisandu et al., "Clustering news articles using efficient similarity measure and N-grams" Int. J. Knowledge Engineering and Data Mining, vol. 5, No. 4, 2018, pp. 333-348 (Year: 2018). * |
Chen et al., "Query-by-Example Keyword Spotting Using Long Short-Term Memory Networks," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 19-24, 2015. |
David Snyder, "SRE16 Xvector Model," http://kaldi-asr.org/models/m3, 2017, Accessed: Oct. 10, 2018.
Dehak et al., "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, No. 4, pp. 788-798, 2011. |
He et al., "Streaming End-to-End Speech Recognition for Mobile Devices," ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 12-17, 2019. |
International Search Report and Written Opinion for International Application No. PCT/US2020/066298 dated Mar. 26, 2021. |
Kinnunen et al., "A speaker pruning algorithm for real-time speaker identification," in International Conference on Audio- and Video-Based Biometric Person Authentication, Springer, 639-646, 2003.
Kinnunen et al., "Real-time speaker identification and verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, No. 1, 277-288, 2006.
Miller et al., "Rapid and Accurate Spoken Term Detection," Proceedings of Interspeech, ISCA, 2007, pp. 314-317.
Sarkar et al., "Fast Approach to Speaker Identification for Large Population using MLLR and Sufficient Statistics," in 2010 National Conference on Communications (NCC) IEEE, 1-5, 2010. |
Schmidt et al., "Large-scale Speaker Identification," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP) IEEE, 1669-1673, 2014. |
Snyder et al., "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. IEEE ICASSP, 2018. |
Zhang et al., "Unsupervised Spoken Keyword Spotting Via Segmental DTW on Gaussian Posteriorgrams," Proceedings of the Automatic Speech Recognition & Understanding (ASRU) Workshop, IEEE, 2009, 398-403.
Zhu et al., "Self-attentive Speaker Embeddings for Text-Independent Speaker Verification," Interspeech, 2018. |
Also Published As
Publication number | Publication date |
---|---|
US20220020361A1 (en) | 2022-01-20 |
Similar Documents
Publication | Title |
---|---|
US11664020B2 (en) | Speech recognition method and apparatus |
US10192545B2 (en) | Language modeling based on spoken and unspeakable corpuses |
US11423089B2 (en) | System and method for determining application programming interface and object bindings on natural language processed inputs |
US20210304759A1 (en) | Automatic speech recognition with filler model processing |
CA3065765C (en) | Extracting domain-specific actions and entities in natural language commands |
US8631498B1 (en) | Techniques for identifying potential malware domain names |
US11545157B2 (en) | Speaker diarization using an end-to-end model |
CN109686383B (en) | Voice analysis method, device and storage medium |
US12020697B2 (en) | Systems and methods for fast filtering of audio keyword search |
AU2017424116B2 (en) | Extracting domain-specific actions and entities in natural language commands |
WO2021103712A1 (en) | Neural network-based voice keyword detection method and device, and system |
US20200219487A1 (en) | Information processing apparatus and information processing method |
US20190042560A1 (en) | Extracting domain-specific actions and entities in natural language commands |
KR20200014046A (en) | Device and Method for Machine Reading Comprehension Question and Answer |
US11769487B2 (en) | Systems and methods for voice topic spotting |
EP4364135A1 (en) | Canonical training for highly configurable multilingual speech recognition |
US20220050971A1 (en) | System and Method for Generating Responses for Conversational Agents |
CN113611284A (en) | Voice library construction method, recognition method, construction system and recognition system |
CN113506584B (en) | Data processing method and device |
US11798542B1 (en) | Systems and methods for integrating voice controls into applications |
CN119311525B (en) | Alarm information gathering method, system, equipment and storage medium |
US20250046333A1 (en) | Acoustic sound event detection system |
CN114219012A (en) | Method, apparatus, computer program product and storage medium for sample data processing |
CN120012776A (en) | Content security identification method based on integration of multiple large language models |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: RAYTHEON APPLIED SIGNAL TECHNOLOGY, INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: WINTRODE, JONATHAN C.; REEL/FRAME: 053661/0808; Effective date: 20200831 |
STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | FINAL REJECTION MAILED |
STPP | Information on status: patent application and granting procedure in general | RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | FINAL REJECTION MAILED |
STPP | Information on status: patent application and granting procedure in general | RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | ADVISORY ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | ALLOWED -- NOTICE OF ALLOWANCE NOT YET MAILED |
STPP | Information on status: patent application and granting procedure in general | NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
STPP | Information on status: patent application and granting procedure in general | PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
STCF | Information on status: patent grant | PATENTED CASE |