US6477524B1 - Method for statistical text analysis - Google Patents
Method for statistical text analysis
- Publication number
- US6477524B1 (application US09/468,508)
- Authority
- US
- United States
- Prior art keywords
- query
- term
- matrix
- story
- terms
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Images
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99932—Access augmentation or optimizing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99934—Query formulation, input preparation, or translation
- Y10S707/99935—Query augmenting and refining, e.g. inexact access
- Y10S707/99936—Pattern matching access
Abstract
A method for retrieving relevant stories from a collection of stories. The method comprises the steps of identifying at least one query term, applying a cooccurrence matrix to the query term to provide a list of query terms, determining if a story in the collection contains any terms on the list of query terms, and then increasing a relevance measure if the story does contain words on the list of query words. If the relevance measure is higher than a threshold, the story is added to a list of relevant stories.
Description
This application claims priority from U.S. Provisional Patent Application No. 60/149,778, filed Aug. 18, 1999.
1. Field of the Invention
This invention relates to methods of analyzing text for relevant content, and more particularly to statistical methods for performing text analysis.
2. Background of the Invention
One of the most prevalent uses of text analysis today is by search engines on the Internet. These search engines take key words and search for relevant articles, web sites and discussions that include those words. The articles, web sites and discussions will be referred to hereinafter as stories. The use of text analysis also occurs in other types of text searching, such as computerized reference collections, electronic libraries and document management systems.
One challenge unique to the Internet is that content is being added to the collection of stories almost continuously. The number of stories that might have relevance to an inquiry can overload the user and bog down the system used for searching. This is especially true in view of the current methods for text searching employed by search engines and other types of text searching tools.
A current technique for text searching is shown in prior art FIG. 1. The query term is obtained in some manner, such as from a user entry or a user profile. If the query term is a single word, the story is searched for occurrences of that term. If the query term is a phrase, the story is searched for occurrences of each word of the phrase.
For single word query terms, a match results in the story being added to the list of stories that are relevant to the query. If the query term is a phrase, stories with each word of the phrase are added to a preliminary list associated with that word. An intersection of the preliminary lists is taken and those stories that are at the intersection of the preliminary lists (i.e., stories that contain all the words in the query phrase) are added to the list of stories that are relevant. A major drawback to this approach is that relevant stories may not contain the exact term. This problem is exacerbated when the information retrieval is based upon very concise documents, such as user profiles, and when the information itself is in the form of brief summaries such as news summaries that can be found at an Internet portal site like yahoo.com.
One way to overcome this problem is to add new terms to the original query that are related to the original terms. This task can be performed manually, but requires considerable expertise, both in searching and in the area being queried, an expertise most users lack. Performance of this task automatically falls under the category of Automated Query Expansion (AQE). There are three main approaches to AQE in the current literature.
The first approach is to use an online (electronic) thesaurus or dictionary such as WordNet. WordNet is a large, manually built, general-purpose semantic network, which models the lexical knowledge of a native English speaker. It is organized around groupings of words called synsets. Each synset contains synonymous words and relationships among them. The relationships take the form of IS-A, A-KIND-OF, etc. For example, using the relationship “a snake is A-KIND-OF animal,” a query using the word snake may expand to include the word animal.
Discussion of these types of approaches can be found in “TREC-4 Experiments at Dublin City University: Thresholding Posting Lists, Query Expansion with WordNet and POS Tagging of Spanish,” by Smeaton, et al., published in Fourth Text Retrieval Conference (TREC-4), Gaithersburg, Md., Nov. 1-3, 1995, (Smeaton) and “Information Access and Retrieval with Semantic Background Knowledge,” by A. Chakravarthy, Ph.D. Thesis, MIT, Boston, Mass., 1995 (Chakravarthy).
A second category of AQE involves a pairwise association measure between words. The given corpus is analyzed to determine pairwise word associations and a query term is expanded to include terms having association values greater than a certain threshold. In one example, a pairwise mutual information value is determined from the context vectors of frequent words in the corpus. This example is discussed in “Corpus Analysis for TREC 5 Query Expansion,” by Gauch, et al., in Fifth Text Retrieval Conference (TREC-5), Gaithersburg, Md., 1996 (Gauch).
The third category of AQE uses blind relevance feedback in accordance with Rocchio's algorithm. Blind feedback refers to the fact that relevance is not judged by the user but is determined by the system automatically. An initial search for the original query is performed and the retrieved documents are sorted according to some measure. The top few documents are assumed to be relevant for the original query. The original query is then expanded by using the terms in these relevant documents.
Articles discussing this approach include Mitra, et al. “Improving Automatic Query Expansion,” ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 206-214, Melbourne, Australia, 1998 (Mitra); Buckley, et al. “Automatic Query Expansion Using SMART: TREC-3,” Fourth Text Retrieval Conference, Gaithersburg, Md., 1994 (Buckley); and Abberley, et al. “Retrieval of Broadcast News Documents with the THISL System,” ICASSP '98, Seattle, Wash., 1998 (Abberley).
However, all of these techniques still have serious drawbacks. A general-purpose database like WordNet has large coverage gaps when used for domains with their own specific vocabularies and sublanguages. Technology business news, for example, or technology advancements in highly jargon-filled technologies will have their own terms not recognized by general-purpose databases. It would be prohibitively time consuming to create a semantic network or to add terms to WordNet for each possible domain.
Another problem with the above approaches occurs with ambiguous terms. For example, the word “bank” may result in an expansion on rivers or an expansion on financial matters. Finally, methods using blind feedback are promising and are currently very popular. However, when the number of documents in the database is low, or the documents themselves are short, blind feedback runs into problems.
Pairwise association techniques such as the one discussed above usually use a symmetric cooccurrence matrix of words. In a symmetric matrix, the occurrence of one word of the pair triggers an expansion to include the other word. This can be problematic when one word is fairly common. For example, the use of the word cellular in a query would result in the expansion to include the word phone. This expansion is usually appropriate, because cellular commonly refers to phones. However, the use of a symmetric matrix also results in the addition of the word cellular whenever the word phone is used. Given that phone is a fairly common word, this could result in an unnecessary expansion and irrelevant results. An example of an approach using this type of matrix is shown in U.S. Pat. No. 5,675,819, issued Oct. 7, 1997.
Other types of query expansion techniques have also been patented. U.S. Pat. No. 5,926,811, issued Jul. 20, 1999, forms a statistical thesaurus. However, the techniques used are not as sophisticated or exacting as those using pairwise associations or matrices. Finally, a method for tagging words by identifying the part of speech of a given word is shown in U.S. Pat. No. 5,721,902, issued Feb. 24, 1998.
Therefore, a method for more accurate query expansion that takes into account such things as asymmetrical pairwise word associations and domain-specific words and phrases is needed.
One aspect of the invention is a method for retrieving relevant stories from a collection of stories. The method includes the steps of identifying at least one query term, and applying a cooccurrence matrix to the query term to provide a list of query terms. Then, it is determined if a story in the collection contains any terms on the list of query terms. If the story does contain words in the list of query words, a relevance measure is increased. If the relevance measure is higher than a threshold, the story is added to a list of relevant stories.
For a more complete understanding of the present invention and for further advantages thereof, reference is now made to the following Detailed Description taken in conjunction with the accompanying Drawings in which:
FIG. 1 shows a flowchart of a prior art approach for text retrieval.
FIG. 2 shows a block diagram of a system in which text retrieval is used, in accordance with the invention.
FIG. 3 shows an expanded block diagram of story analysis for a text retrieval system, in accordance with the invention.
FIG. 4 shows a block diagram of one embodiment of a method for text retrieval using automated query expansion and searching in accordance with the invention.
FIG. 5 shows a block diagram of one embodiment of a method for relevance scoring for stories retrieved, in accordance with the invention.
A system employing methods of the invention is shown in block diagram form in FIG. 2. In this example, user preferences 10 are used by the user profiler module 12 to develop, update and track a user profile. The user profiler may use direct user inputs, a smart agent that forms user preferences automatically on the basis of the user's usage patterns, both, or neither. The user profiler develops a list of web addresses that are of interest to the user.
The user profile is then provided by the user profiler to the web handling module 14. The web handling module 14 fetches the web pages, parses them, and processes them to segment out the stories 16. The web handler may interact with the user profiler module to provide feedback on the relevance of a particular web site. The web handler 14 passes the stories it locates 16 to the story analysis module 18.
FIG. 3 shows an expanded view of the story analysis module 18. The stories and user queries 20, either from the user profile or from direct user inputs, are used as initial inputs to the module. In the preprocessing module 22, stop-word removal is performed to remove generic and commonly used terms, such as the days of the week and the months of the year. It must be noted that if the user query includes those terms, such as an event that happened on the first Tuesday in May, the user query would override the stop-word removal process for those words.
The stop-word removal module 24 provides one way in which the system can adapt to particular domains. For example, a stop-word list for entertainment news would probably differ greatly from a stop-word list for technology news. Additionally, a generic stop-word list can be supplemented with words from relevant domains, making them more accurate and resulting in higher relevance stories.
The second part of preprocessing in this example includes stemming 26. Stemming refers to the process of reducing words to their core by removing, for example, suffixes due to tenses in a verb. For instance, the words “processed” and “processing” are stemmed to “process”. There are several stemming algorithms in the literature. Once the preprocessing is completed, relevant stories are determined at module 28. The method of determining relevant stories will be discussed in more detail with reference to FIGS. 4 and 5. After a story is determined to be relevant, it is added to the relevant stories list 30. These are then identified to the user.
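A minimal Python sketch of this preprocessing stage is given below; the stop-word list, the crude suffix-stripping stemmer, and the query-override behavior are illustrative assumptions rather than the patent's own implementation.

```python
# Illustrative preprocessing sketch: stop-word removal with a query override,
# followed by a crude suffix-stripping stemmer. The stop-word list and the
# stemming rules are assumptions for demonstration only.

STOP_WORDS = {"the", "a", "an", "and", "of", "on", "monday", "tuesday", "may", "june"}

def stem(word: str) -> str:
    """Very rough stemmer: strips a few common verb/plural suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens, query_terms=()):
    """Remove stop words (unless they appear in the user query) and stem."""
    keep = {t.lower() for t in query_terms}   # query overrides stop-word removal
    out = []
    for tok in tokens:
        t = tok.lower()
        if t in STOP_WORDS and t not in keep:
            continue
        out.append(stem(t))
    return out

print(preprocess("The cellular phones processed on Tuesday".split(),
                 query_terms=["tuesday"]))
# -> ['cellular', 'phone', 'process', 'tuesday']
```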
In general, the method of determining the relevant stories at 28 uses a pairwise association approach. One of the weaknesses of current approaches, as discussed previously, is that they only consider a predetermined number of highest frequency terms in a corpus. For a smaller corpus with diverse context, the approach is not feasible. Use of conditional probabilities will result in more accurate and reliable associations. The pairwise association values are formed in an asymmetrical matrix, referred to as a cooccurrence matrix.
Further, current approaches do not exploit higher order information contained in the matrix. For example, if the term A is associated with the term B, and B is associated with C, a second order association exists between A and C. These are typically not used in the current literature, as they require intensive computations to derive them directly from the corpus. Aspects of the invention include extracting and utilizing higher order information contained in the matrix. The matrix itself then contains higher order information about multiple word cooccurrences when term relationships in the matrix are interpreted as those of a directed graph. Various graph-theoretic methods may be applied to analyze the structure of the matrix.
Most text retrieval algorithms are trained using a large collection of relevant text. Unfortunately, for information filtering applications (e.g., filtering of news summaries) the domain content is not fixed and may change drastically over time. In addition, large collections may not be available. Updating the matrix periodically using the gathered stories addresses this problem.
In addition, updating the matrix will provide the user with help in specifying queries. A list of dominant words for a particular domain can be determined using techniques such as Singular Value Decomposition on the cooccurrence matrix, and the user can then choose from among these and related words to construct queries.
Having discussed in general terms the parameters within which the matrix should function, a specific example is given to further demonstrate the use of such a matrix in a text retrieval system. One embodiment of a method for query expansion and story identification is shown in FIG. 4 as an example to promote understanding of the invention.
As discussed previously, the process starts with identification of a query term at 40. Because some query terms may actually be phrases, single word query terms and phrase query terms have to be accounted for. At step 42, the determination is made whether the term is a single word or a phrase. If the query term is a single word, the process moves to step 44 and the term to which the cooccurrence matrix is applied is set equal to the single word. If the query term is a phrase, the term to which the cooccurrence matrix will be applied is set equal to each word in the query term at step 48. The words against which these terms are checked in the matrix will be referred to as storyterms.
At steps 46 and 50 the cooccurrence matrix is applied to determine if there are other words that should be used to expand the query. The difference between the two steps is that at step 46 the entry to be analyzed is [term, storyterm] and at step 50 the entry is [word,storyterm]. This terminology will be explained below.
In order to understand the cooccurrence matrix, the following definitions will be used. S is a collection of documents or stories containing N terms of which T are distinct after stop-word removal and stemming. The quantity n_i is the number of occurrences of the term i in the collection S, and n_ij is the number of occurrences of the terms i and j in the same document. The measure of association of j with i would then be defined as:

α[i, j] = n_ij / n_i = (n_ij / N) / (n_i / N) ≈ P(observing i and j) / P(observing i) = P(j | i)
This association value defined above is an estimate of the conditional probability of observing the term j given the term i.
The cooccurrence matrix is then defined as the T × T matrix whose entries are these association values:

C = [ α[i, j] ], for i, j = 1, …, T
Note that C is not symmetric, as would have been the case if cooccurrence counts were used as entries. The association relation used is not symmetric, i.e., α[i, j] ≠ α[j, i]. Using the example of the terms cellular and phone discussed previously, it is useful to explain this aspect of the cooccurrence matrix.
For example, the word cellular appears in a collection of stories 4 times. The term phone appears 22 times. They occur together 4 times, which means that every time the word cellular appears, the word phone also appears. Therefore the association α[cellular, phone] is 4/4, or 1. The association α[phone, cellular] is 4/22, or 0.182. Therefore, if the word cellular appears in a query, it is a good idea to add phone, but not the reverse. In this manner the query can be appropriately expanded.
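The asymmetric association can be computed directly from the counts defined above. The following sketch is an illustration, not the patent's code; it assumes document-level counts (n_i counts documents containing term i) and reproduces the cellular/phone asymmetry on a toy collection.

```python
from collections import Counter
from itertools import permutations

def cooccurrence(stories):
    """Build the asymmetric association matrix alpha[(i, j)] = n_ij / n_i,
    where n_i counts documents containing term i and n_ij counts documents
    containing both i and j (document-level counting is an assumption)."""
    n = Counter()          # n_i: documents containing term i
    n_pair = Counter()     # n_ij: documents containing both i and j
    for story in stories:
        terms = set(story)
        n.update(terms)
        n_pair.update(permutations(terms, 2))
    return {(i, j): n_pair[(i, j)] / n[i] for (i, j) in n_pair}

stories = [["cellular", "phone", "network"]] * 4 + [["phone", "call"]] * 18
alpha = cooccurrence(stories)
print(alpha[("cellular", "phone")])   # 4/4  = 1.0
print(alpha[("phone", "cellular")])   # 4/22 ≈ 0.182
```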
It is important for the entries in the cooccurrence matrix to be accurate, in the sense of reflecting true or asymptotic values for the associations. It may happen that α[phone,cellular] is low because there are few stories containing the term cellular phone. This problem is currently solved by creating a large collection of representative documents to train the algorithms. For some applications, however, such as news stories that vary in content so drastically, it is impossible to come up with one collection that captures most of the terms. It is believed that the preferred approach would be to update the cooccurrence matrix frequently, every day or every few days, using new stories.
The Weak Law of Large Numbers guarantees that the estimates will converge to the ‘true’ probabilities as N gets large. The size of the matrix is an important factor in the running of the processes. C will be a T×T matrix, where T is the number of distinct terms in the collection. As N gets large, the concern would be that T would increase to the point of being impractical. However, the rate of increase for T will be much slower than the rate of increase of N. An example of this relationship could be shown as:

T ≈ k·N^β
Typical values for k are between 10 and 20, and for β between 0.5 and 0.6. As the text size increases, the number of new words will rapidly become smaller, so the rate of increase of T will be much slower than the rate of increase of N.
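For example (illustrative numbers only), with k = 15 and β = 0.55, a collection of N = 10,000 terms would be expected to contain roughly T ≈ 15 × 10,000^0.55 ≈ 2,400 distinct terms, and increasing N tenfold to 100,000 raises T only to about 8,400.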
All of these considerations result in a cooccurrence matrix that can be used to perform query expansion. Returning now to FIG. 4, the value of the matrix for the term is determined and compared to a threshold at steps 46 and 50. If the value is higher than the threshold, each kind of term, whether single word or phrase, is handled to expand the query.
If the term is a single word and the cooccurrence value is above the threshold at step 46, that term is added to the query at step 56. If the term is a phrase, for each word in the phrase an association value is determined from the cooccurrence matrix with each storyterm. If the association value is above the threshold at step 50, that storyterm is added to the posting list associated with that word at step 52. This is repeated for each word in the phrase, resulting in a posting list for each word. Once all the words in the phrase have been analyzed, the resulting posting lists are intersected at step 54. Storyterms in the intersection set of the posting lists are added as terms to the query at step 56, thus expanding the query.
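Continuing the sketch above, a hedged Python rendering of this expansion logic follows; the threshold value, the vocabulary argument, and the helper names are assumptions, and alpha is the association dictionary built earlier.

```python
def expand_query(query_terms, alpha, vocabulary, threshold=0.5):
    """FIG. 4 sketch: single-word terms add every storyterm whose association
    exceeds the threshold; phrases intersect per-word posting lists first.
    The 0.5 threshold is an illustrative assumption."""
    expanded = list(query_terms)
    for term in query_terms:
        words = term.split()
        if len(words) == 1:                       # steps 44/46/56
            for storyterm in vocabulary:
                if alpha.get((words[0], storyterm), 0.0) > threshold:
                    expanded.append(storyterm)
        else:                                     # steps 48/50/52/54/56
            posting_lists = [{st for st in vocabulary
                              if alpha.get((word, st), 0.0) > threshold}
                             for word in words]
            expanded.extend(set.intersection(*posting_lists))
    return expanded

vocabulary = {"cellular", "phone", "network", "call"}
print(expand_query(["cellular"], alpha, vocabulary, threshold=0.5))
# adds 'phone' and 'network' to the query 'cellular'
```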
At step 58, the actual story searching begins, using the expanded queries. Again the process splits at step 60 based upon the nature of the query terms, which now include the additional expansion terms. Following the single term path to step 62 and then step 64, a story is searched to see if it contains that query term. If the story does not contain the term, the process returns to step 58 to test the next term. If the story does contain the term, the score for that particular story is increased at step 66.
If the query term is a phrase, the process moves to step 70 where the term is set equal to each word in the phrase. The stories are searched at step 72. If the story does not contain that word, the process returns to step 70 until all the terms in the phrase are used. If there are no more words in the phrase, the process returns to step 58 until there are no more query terms. If the story does contain that word, the story is added to the posting list of that word at step 74. When the query is complete, the posting lists are intersected at step 76 and the stories in the intersection set have their scores increased at step 66.
In this manner, the scores for various stories indicate the relevance of that particular story to the user's profile or preferences. In some instances, a term may only occur once in the collection. This will render its association value to be either 1 or 0, and the estimate of probability becomes useless. To avoid this problem, it may be desirable to modify the process such that terms that occur just once for the collection are not expanded during query expansion.
The scoring process should also have some parameters associated with it. For example, if a query contained the words Internet and Microsoft, desirable results would be stories that contain at least one occurrence of both of these words rather than stories with repeated occurrences of only one of them. The score would then need to be higher for a story containing both words than for a story with frequent repetitions of one word.
In the scoring formula, k is the number of terms from the query that the story contains and w is a weight. The second term, which is added to the first, controls the problem discussed above. After the story scores are computed, at either step 66 or step 78, the stories are sorted according to their scores. Stories above a threshold at 68 are added at step 78 and presented to the user. The process continues until no more terms are left to be searched.
An adaptation of the above example uses a threshold to determine if a story should be presented to the user. A flow chart of one example of such a process is shown in FIG. 5. The process starts with each story in the story list at step 80. For each storyterm in the story and each term in the query, the quantity α[term, storyterm] is determined from the cooccurrence matrix, and a sum of all α[term, storyterm] values is computed at step 82. At step 84, the score for that particular story is then set based on this sum.
Once the score is computed, the story is presented to the user at step 88 if the score exceeds a threshold at step 86.
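A hedged sketch of this FIG. 5 scoring pass, again continuing the earlier code: for each story, sum the associations between every query term and every storyterm, then present stories whose score clears a threshold. The threshold value is an assumption.

```python
def score_story(story_terms, query_terms, alpha):
    """Step 82: sum alpha[term, storyterm] over all query-term/storyterm pairs."""
    return sum(alpha.get((term, storyterm), 0.0)
               for term in query_terms
               for storyterm in set(story_terms))

def present_relevant(stories, query_terms, alpha, threshold=1.5):
    """Steps 84-88: keep stories whose summed association exceeds the threshold.
    The threshold of 1.5 is illustrative only."""
    return [story for story in stories
            if score_story(story, query_terms, alpha) > threshold]

for story in present_relevant(stories, ["cellular"], alpha):
    print(story)   # only the cellular/phone/network stories clear the threshold
```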
Having set out one example of a process utilizing the invention, it is now possible to discuss further adaptations and features of the invention. As mentioned previously, the use of higher order associative relationships in the matrix may increase the accuracy and efficiency of the retrieval system. Initially, a threshold value is established that sets a minimum association value.
If the associative value between two terms, such as i and j, exceeds the threshold value, a link is said to exist between these two terms such that i is linked to j. It must be noted that the matrix is not symmetric, so j is not necessarily linked to i. Having established this relationship, the cooccurrence matrix can be used as the adjacency matrix for a directed graph, with the links being the edges of the graph.
Of special interest are groups of terms that are connected in such a manner that it is possible to go from any word within the group to any other. In a directed graph, such a group of nodes is called a strongly connected component. Locating these strongly connected components will assist in query expansion: given a term, the query can expand to include all the terms within the same strongly connected component as that term, rather than using only individual entries of the cooccurrence matrix to decide whether to add a term.
The above technique will reduce the decrease in precision that occurs during query expansion. This decrease in precision occurs mostly because some words have high associative values simply because they are common words. Previously, the term cellular linked to phone had an associative value of 1, and phone linked to cellular had a value of 0.182. The second link can be avoided by thresholding the values at 0.2.
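A sketch of this graph-based expansion: threshold the matrix to obtain a directed graph, then group mutually reachable terms with strongly connected components. The networkx dependency and the 0.2 threshold follow the discussion above, but the code itself is illustrative and continues the earlier alpha example.

```python
import networkx as nx

def term_graph(alpha, threshold=0.2):
    """Directed graph with an edge i -> j whenever alpha[(i, j)] exceeds the threshold."""
    g = nx.DiGraph()
    for (i, j), value in alpha.items():
        if value > threshold:
            g.add_edge(i, j)
    return g

def component_expansion(term, alpha, threshold=0.2):
    """Expand a term to its strongly connected component (mutually reachable terms)."""
    g = term_graph(alpha, threshold)
    if term not in g:
        return {term}
    for component in nx.strongly_connected_components(g):
        if term in component:
            return component
    return {term}

print(component_expansion("cellular", alpha))
# -> {'cellular', 'network'}: phone is excluded because the weak
#    phone -> cellular link (0.182) falls below the 0.2 threshold
```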
Also discussed previously was the concept that the cooccurrence matrix can be used to assist the user in formulating effective queries. The implementation of this concept involves using the cooccurrence matrix to discover dominant terms in a given corpus. The cooccurrence matrix has been defined as having a dimension of T×T, where T is the number of distinct words in the collection. Usually T will be in the range of one to five thousand words. This makes C a large matrix, but it is sparse because most of the terms do not occur together.
Given this last property of the matrix, a dimensionality reduction technique may be used on C, such as Singular Value Decomposition. The SVD of C can be computed and a small number of singular values will be selected via thresholding. The components of the eigenvectors corresponding to these dominant singular values are the weights of a linear combination of all terms in the collection. Each singular value corresponds to a meta-term made up of a linear combination of the T terms. When most of these weights are relatively small, the meta-term can be said to be a subset of the original T terms. These groups for each dominant singular value are then combined to obtain a list of dominant words in the collection.
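A sketch of this dominant-term discovery with NumPy, continuing the earlier example; the thresholds for selecting singular values and for keeping weights are illustrative assumptions.

```python
import numpy as np

def dominant_terms(alpha, terms, sv_threshold=1.0, weight_threshold=0.3):
    """Build the dense T x T matrix C, take its SVD, keep the singular values
    above sv_threshold, and collect the terms with large weights in the
    corresponding singular vectors (meta-terms). Thresholds are illustrative."""
    index = {t: k for k, t in enumerate(terms)}
    C = np.zeros((len(terms), len(terms)))
    for (i, j), value in alpha.items():
        C[index[i], index[j]] = value
    U, s, Vt = np.linalg.svd(C)
    dominant = set()
    for k, sigma in enumerate(s):
        if sigma < sv_threshold:
            break                      # singular values are sorted descending
        weights = np.abs(Vt[k])        # weights of the k-th meta-term
        dominant.update(t for t, col in index.items()
                        if weights[col] > weight_threshold)
    return dominant

terms = sorted({t for pair in alpha for t in pair})
print(dominant_terms(alpha, terms))
```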
The techniques shown above are merely intended as examples of the invention. Thus, although there has been described to this point a particular embodiment for a method and structure for statistical text analysis, it is not intended that such specific references be considered as limitations upon the scope of this invention except in-so-far as set forth in the following claims.
Claims (9)
1. A method for retrieving relevant stories from a collection of stories, the method comprising:
identifying at least one query term;
applying an asymmetrical cooccurrence matrix to the at least one query term to provide a list of query terms;
determining if a story in the collection contains any terms on the list of query terms;
increasing a relevance measure if the story does contain words on the list of query words; and
adding a story to a list of relevant stories if the relevance measure is higher than a threshold.
2. The method of claim 1, wherein identifying at least one query term further comprises using a user profile for obtaining the at least one query term.
3. The method of claim 1, wherein applying a cooccurrence matrix further comprises comparing a measure from the cooccurrence matrix to a threshold.
4. The method of claim 1, wherein the at least one query term is a single word.
5. The method of claim 1, wherein the at least one query term is a phrase and each word in the phrase is used as a query term.
6. The method of claim 1, wherein increasing a relevance measure further comprises:
determining a frequency of occurrence of the query term in the story;
comparing the frequency to a threshold; and
increasing the relevance measure based upon the comparing step.
7. The method of claim 1, wherein a directed graph is used to derive higher order associative information for each term in the matrix.
8. The method of claim 1, wherein the matrix is used to assist the user in formulating a query.
9. The method of claim 1, wherein the matrix is updated dynamically on the basis of new stories.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/468,508 US6477524B1 (en) | 1999-08-18 | 1999-12-21 | Method for statistical text analysis |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14977899P | 1999-08-18 | 1999-08-18 | |
US09/468,508 US6477524B1 (en) | 1999-08-18 | 1999-12-21 | Method for statistical text analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
US6477524B1 true US6477524B1 (en) | 2002-11-05 |
Family
ID=26847015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/468,508 Expired - Lifetime US6477524B1 (en) | 1999-08-18 | 1999-12-21 | Method for statistical text analysis |
Country Status (1)
Country | Link |
---|---|
US (1) | US6477524B1 (en) |
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020091715A1 (en) * | 2001-01-11 | 2002-07-11 | Aric Coady | Process and system for sparse vector and matrix reperesentation of document indexing and retrieval |
US6668256B1 (en) * | 2000-01-19 | 2003-12-23 | Autonomy Corporation Ltd | Algorithm for automatic selection of discriminant term combinations for document categorization |
US6721728B2 (en) * | 2001-03-02 | 2004-04-13 | The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration | System, method and apparatus for discovering phrases in a database |
US20040158559A1 (en) * | 2002-10-17 | 2004-08-12 | Poltorak Alexander I. | Apparatus and method for identifying potential patent infringement |
US20040194140A1 (en) * | 2003-03-27 | 2004-09-30 | Sharp Laboratories Of America, Inc. | On-screen intelligent electronic program guide |
US20050165819A1 (en) * | 2004-01-14 | 2005-07-28 | Yoshimitsu Kudoh | Document tabulation method and apparatus and medium for storing computer program therefor |
US20050177805A1 (en) * | 2004-02-11 | 2005-08-11 | Lynch Michael R. | Methods and apparatuses to generate links from content in an active window |
US20050283491A1 (en) * | 2004-06-17 | 2005-12-22 | Mike Vandamme | Method for indexing and retrieving documents, computer program applied thereby and data carrier provided with the above mentioned computer program |
US20060112128A1 (en) * | 2004-11-23 | 2006-05-25 | Palo Alto Research Center Incorporated | Methods, apparatus, and program products for performing incremental probabilitstic latent semantic analysis |
US20060184511A1 (en) * | 2005-02-16 | 2006-08-17 | Koo Sing C | Semantic network document container for use in attestation of management reports |
US7194483B1 (en) | 2001-05-07 | 2007-03-20 | Intelligenxia, Inc. | Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information |
US20070067157A1 (en) * | 2005-09-22 | 2007-03-22 | International Business Machines Corporation | System and method for automatically extracting interesting phrases in a large dynamic corpus |
US20070067281A1 (en) * | 2005-09-16 | 2007-03-22 | Irina Matveeva | Generalized latent semantic analysis |
US7272594B1 (en) | 2001-05-31 | 2007-09-18 | Autonomy Corporation Ltd. | Method and apparatus to link to a related document |
US7275033B1 (en) * | 2000-09-30 | 2007-09-25 | Intel Corporation | Method and system for using rule-based knowledge to build a class-based domain specific statistical language model |
US20080109454A1 (en) * | 2006-11-03 | 2008-05-08 | Willse Alan R | Text analysis techniques |
US7409383B1 (en) * | 2004-03-31 | 2008-08-05 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US7536413B1 (en) | 2001-05-07 | 2009-05-19 | Ixreveal, Inc. | Concept-based categorization of unstructured objects |
US7627588B1 (en) | 2001-05-07 | 2009-12-01 | Ixreveal, Inc. | System and method for concept based analysis of unstructured data |
US7676485B2 (en) | 2006-01-20 | 2010-03-09 | Ixreveal, Inc. | Method and computer program product for converting ontologies into concept semantic networks |
US7788251B2 (en) | 2005-10-11 | 2010-08-31 | Ixreveal, Inc. | System, method and computer program product for concept-based searching and analysis |
WO2011045699A1 (en) * | 2009-10-14 | 2011-04-21 | Koninklijke Philips Electronics N.V. | Method and system for facilitating data entry for an information system |
1999-12-21: US patent application US09/468,508 filed; granted as US6477524B1 (current status: Expired - Lifetime, not active).
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5491557A (en) * | 1991-12-27 | 1996-02-13 | Minolta Camera Kabushiki Kaisha | Image forming apparatus having a memory and two operating modes, and method of using same |
US5675819A (en) * | 1994-06-16 | 1997-10-07 | Xerox Corporation | Document information retrieval using global word co-occurrence patterns |
US5675710A (en) * | 1995-06-07 | 1997-10-07 | Lucent Technologies, Inc. | Method and apparatus for training a text classifier |
US5721902A (en) * | 1995-09-15 | 1998-02-24 | Infonautics Corporation | Restricted expansion of query terms using part of speech tagging |
US5926811A (en) * | 1996-03-15 | 1999-07-20 | Lexis-Nexis | Statistical thesaurus, method of forming same, and use thereof in query expansion in automated text searching |
US6128613A (en) * | 1997-06-26 | 2000-10-03 | The Chinese University Of Hong Kong | Method and apparatus for establishing topic word classes based on an entropy cost function to retrieve documents represented by the topic words |
US6006225A (en) * | 1998-06-15 | 1999-12-21 | Amazon.Com | Refining search queries by the suggestion of correlated terms from prior searches |
US6169986B1 (en) * | 1998-06-15 | 2001-01-02 | Amazon.Com, Inc. | System and method for refining search queries |
US6144958A (en) * | 1998-07-15 | 2000-11-07 | Amazon.Com, Inc. | System and method for correcting spelling errors in search queries |
US6189002B1 (en) * | 1998-12-14 | 2001-02-13 | Dolphin Search | Process and system for retrieval of documents using context-relevant semantic profiles |
Non-Patent Citations (6)
Title |
---|
Abberley, et al. Retrieval of Broadcast News Documents with the THISL System, ICASSP '98, Seattle, WA. |
Buckley, et al. Automatic Query Expansion Using SMART: TREC-3, Fourth Text Retrieval Conference, Gaithersburg, MD, Nov. 1995. |
Chakravarthy, A. Information Access and Retrieval with Semantic Background Knowledge, Ph.D. Thesis, MIT, 1995.
Gauch, et al. Corpus Analysis for TREC 5 Query Expansion, Fifth Text Retrieval Conference (TREC-5), Gaithersburg, MD 1996. |
Mitra, et al. Improving Automatic Query Expansion, ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 206-214, Melbourne, Australia 1998. |
Smeaton, et al. TREC-4 Experiments at Dublin City University: Thresholding Posting Lists, Query Expansion with WordNet and POS Tagging of Spanish, Fourth Text Retrieval Conference (TREC-4), Gaithersburg, MD, Nov. 1995. |
Cited By (101)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10198438B2 (en) | 1999-09-17 | 2019-02-05 | Sdl Inc. | E-services translation utilizing machine translation and translation memory |
US10216731B2 (en) | 1999-09-17 | 2019-02-26 | Sdl Inc. | E-services translation utilizing machine translation and translation memory |
US6668256B1 (en) * | 2000-01-19 | 2003-12-23 | Autonomy Corporation Ltd | Algorithm for automatic selection of discriminant term combinations for document categorization |
US7275033B1 (en) * | 2000-09-30 | 2007-09-25 | Intel Corporation | Method and system for using rule-based knowledge to build a class-based domain specific statistical language model |
US6751628B2 (en) * | 2001-01-11 | 2004-06-15 | Dolphin Search | Process and system for sparse vector and matrix representation of document indexing and retrieval |
US20020091715A1 (en) * | 2001-01-11 | 2002-07-11 | Aric Coady | Process and system for sparse vector and matrix representation of document indexing and retrieval |
US7328204B2 (en) | 2001-01-11 | 2008-02-05 | Aric Coady | Process and system for sparse vector and matrix representation of document indexing and retrieval |
US20040205063A1 (en) * | 2001-01-11 | 2004-10-14 | Aric Coady | Process and system for sparse vector and matrix representation of document indexing and retrieval |
US9954794B2 (en) | 2001-01-18 | 2018-04-24 | Sdl Inc. | Globalization management system and method therefor |
US6721728B2 (en) * | 2001-03-02 | 2004-04-13 | The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration | System, method and apparatus for discovering phrases in a database |
US7194483B1 (en) | 2001-05-07 | 2007-03-20 | Intelligenxia, Inc. | Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information |
US7627588B1 (en) | 2001-05-07 | 2009-12-01 | Ixreveal, Inc. | System and method for concept based analysis of unstructured data |
USRE46973E1 (en) | 2001-05-07 | 2018-07-31 | Ureveal, Inc. | Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information |
US7831559B1 (en) | 2001-05-07 | 2010-11-09 | Ixreveal, Inc. | Concept-based trends and exceptions tracking |
US7890514B1 (en) | 2001-05-07 | 2011-02-15 | Ixreveal, Inc. | Concept-based searching of unstructured objects |
US7536413B1 (en) | 2001-05-07 | 2009-05-19 | Ixreveal, Inc. | Concept-based categorization of unstructured objects |
US7272594B1 (en) | 2001-05-31 | 2007-09-18 | Autonomy Corporation Ltd. | Method and apparatus to link to a related document |
US8005858B1 (en) | 2001-05-31 | 2011-08-23 | Autonomy Corporation PLC | Method and apparatus to link to a related document |
US8589413B1 (en) | 2002-03-01 | 2013-11-19 | Ixreveal, Inc. | Concept-based method and system for dynamically analyzing results from search engines |
US7792832B2 (en) | 2002-10-17 | 2010-09-07 | Poltorak Alexander I | Apparatus and method for identifying potential patent infringement |
US20040158559A1 (en) * | 2002-10-17 | 2004-08-12 | Poltorak Alexander I. | Apparatus and method for identifying potential patent infringement |
US20040194140A1 (en) * | 2003-03-27 | 2004-09-30 | Sharp Laboratories Of America, Inc. | On-screen intelligent electronic program guide |
US20050165819A1 (en) * | 2004-01-14 | 2005-07-28 | Yoshimitsu Kudoh | Document tabulation method and apparatus and medium for storing computer program therefor |
US20050177805A1 (en) * | 2004-02-11 | 2005-08-11 | Lynch Michael R. | Methods and apparatuses to generate links from content in an active window |
US7512900B2 (en) | 2004-02-11 | 2009-03-31 | Autonomy Corporation Ltd. | Methods and apparatuses to generate links from content in an active window |
US10248650B2 (en) | 2004-03-05 | 2019-04-02 | Sdl Inc. | In-context exact (ICE) matching |
US7945579B1 (en) * | 2004-03-31 | 2011-05-17 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US8473510B1 (en) * | 2004-03-31 | 2013-06-25 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US8965919B1 (en) * | 2004-03-31 | 2015-02-24 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US7409383B1 (en) * | 2004-03-31 | 2008-08-05 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US9817920B1 (en) * | 2004-03-31 | 2017-11-14 | Google Llc | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US10452718B1 (en) * | 2004-03-31 | 2019-10-22 | Google Llc | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US8214385B1 (en) * | 2004-03-31 | 2012-07-03 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US8626787B1 (en) * | 2004-03-31 | 2014-01-07 | Google Inc. | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems |
US20050283491A1 (en) * | 2004-06-17 | 2005-12-22 | Mike Vandamme | Method for indexing and retrieving documents, computer program applied thereby and data carrier provided with the above mentioned computer program |
US10534802B2 (en) | 2004-06-25 | 2020-01-14 | Google Llc | Nonstandard locality-based text entry |
US20150269252A1 (en) * | 2004-06-25 | 2015-09-24 | Google Inc. | Nonstandard locality-based text entry |
US20060112128A1 (en) * | 2004-11-23 | 2006-05-25 | Palo Alto Research Center Incorporated | Methods, apparatus, and program products for performing incremental probabilistic latent semantic analysis |
US7529765B2 (en) * | 2004-11-23 | 2009-05-05 | Palo Alto Research Center Incorporated | Methods, apparatus, and program products for performing incremental probabilistic latent semantic analysis |
US20060184511A1 (en) * | 2005-02-16 | 2006-08-17 | Koo Sing C | Semantic network document container for use in attestation of management reports |
US8312021B2 (en) * | 2005-09-16 | 2012-11-13 | Palo Alto Research Center Incorporated | Generalized latent semantic analysis |
US20070067281A1 (en) * | 2005-09-16 | 2007-03-22 | Irina Matveeva | Generalized latent semantic analysis |
US20070067157A1 (en) * | 2005-09-22 | 2007-03-22 | International Business Machines Corporation | System and method for automatically extracting interesting phrases in a large dynamic corpus |
US7788251B2 (en) | 2005-10-11 | 2010-08-31 | Ixreveal, Inc. | System, method and computer program product for concept-based searching and analysis |
US10319252B2 (en) | 2005-11-09 | 2019-06-11 | Sdl Inc. | Language capability assessment and training apparatus and techniques |
US7676485B2 (en) | 2006-01-20 | 2010-03-09 | Ixreveal, Inc. | Method and computer program product for converting ontologies into concept semantic networks |
US20080109454A1 (en) * | 2006-11-03 | 2008-05-08 | Willse Alan R | Text analysis techniques |
US8166051B1 (en) | 2009-02-03 | 2012-04-24 | Sandia Corporation | Computation of term dominance in text documents |
US9245243B2 (en) | 2009-04-14 | 2016-01-26 | Ureveal, Inc. | Concept-based analysis of structured and unstructured data using concept inheritance |
WO2011045699A1 (en) * | 2009-10-14 | 2011-04-21 | Koninklijke Philips Electronics N.V. | Method and system for facilitating data entry for an information system |
CN102549589B (en) * | 2009-10-14 | 2016-03-30 | Koninklijke Philips Electronics N.V. | Method and system for facilitating data entry for an information system |
CN102549589A (en) * | 2009-10-14 | 2012-07-04 | 皇家飞利浦电子股份有限公司 | Method and system for facilitating data entry for an information system |
US10417646B2 (en) | 2010-03-09 | 2019-09-17 | Sdl Inc. | Predicting the cost associated with translating textual content |
US10984429B2 (en) | 2010-03-09 | 2021-04-20 | Sdl Inc. | Systems and methods for translating textual content |
US10521492B2 (en) | 2011-01-29 | 2019-12-31 | Sdl Netherlands B.V. | Systems and methods that utilize contextual vocabularies and customer segmentation to deliver web content |
US10990644B2 (en) | 2011-01-29 | 2021-04-27 | Sdl Netherlands B.V. | Systems and methods for contextual vocabularies and customer segmentation |
US11044949B2 (en) | 2011-01-29 | 2021-06-29 | Sdl Netherlands B.V. | Systems and methods for dynamic delivery of web content |
US10657540B2 (en) | 2011-01-29 | 2020-05-19 | Sdl Netherlands B.V. | Systems, methods, and media for web content management |
US11301874B2 (en) | 2011-01-29 | 2022-04-12 | Sdl Netherlands B.V. | Systems and methods for managing web content and facilitating data exchange |
US11694215B2 (en) | 2011-01-29 | 2023-07-04 | Sdl Netherlands B.V. | Systems and methods for managing web content |
US10061749B2 (en) | 2011-01-29 | 2018-08-28 | Sdl Netherlands B.V. | Systems and methods for contextual vocabularies and customer segmentation |
US10580015B2 (en) | 2011-02-25 | 2020-03-03 | Sdl Netherlands B.V. | Systems, methods, and media for executing and optimizing online marketing initiatives |
US8751487B2 (en) | 2011-02-28 | 2014-06-10 | International Business Machines Corporation | Generating a semantic graph relating information assets using feedback re-enforced search and navigation |
US10162892B2 (en) * | 2011-02-28 | 2018-12-25 | International Business Machines Corporation | Identifying information assets within an enterprise using a semantic graph created using feedback re-enforced search and navigation |
US10140320B2 (en) | 2011-02-28 | 2018-11-27 | Sdl Inc. | Systems, methods, and media for generating analytical data |
US20120221558A1 (en) * | 2011-02-28 | 2012-08-30 | International Business Machines Corporation | Identifying information assets within an enterprise using a semantic graph created using feedback re-enforced search and navigation |
US8782039B2 (en) | 2011-02-28 | 2014-07-15 | International Business Machines Corporation | Generating a semantic graph relating information assets using feedback re-enforced search and navigation |
US9652559B2 (en) | 2011-02-28 | 2017-05-16 | International Business Machines Corporation | Managing information assets using feedback re-enforced search and navigation |
US9646110B2 (en) | 2011-02-28 | 2017-05-09 | International Business Machines Corporation | Managing information assets using feedback re-enforced search and navigation |
US11366792B2 (en) | 2011-02-28 | 2022-06-21 | Sdl Inc. | Systems, methods, and media for generating analytical data |
US9984054B2 (en) | 2011-08-24 | 2018-05-29 | Sdl Inc. | Web interface including the review and manipulation of a web document and utilizing permission based control |
US11263390B2 (en) | 2011-08-24 | 2022-03-01 | Sdl Inc. | Systems and methods for informational document review, display and validation |
US9508027B2 (en) | 2011-09-21 | 2016-11-29 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US9953013B2 (en) | 2011-09-21 | 2018-04-24 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US10311134B2 (en) | 2011-09-21 | 2019-06-04 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US9558402B2 (en) | 2011-09-21 | 2017-01-31 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US9430720B1 (en) | 2011-09-21 | 2016-08-30 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US11830266B2 (en) | 2011-09-21 | 2023-11-28 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US11232251B2 (en) | 2011-09-21 | 2022-01-25 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US10325011B2 (en) | 2011-09-21 | 2019-06-18 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US12223756B2 (en) | 2011-09-21 | 2025-02-11 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US9223769B2 (en) | 2011-09-21 | 2015-12-29 | Roman Tsibulevskiy | Data processing systems, devices, and methods for content analysis |
US10572928B2 (en) | 2012-05-11 | 2020-02-25 | Fredhopper B.V. | Method and system for recommending products based on a ranking cocktail |
US10402498B2 (en) | 2012-05-25 | 2019-09-03 | Sdl Inc. | Method and system for automatic management of reputation of translators |
US10261994B2 (en) | 2012-05-25 | 2019-04-16 | Sdl Inc. | Method and system for automatic management of reputation of translators |
US11308528B2 (en) | 2012-09-14 | 2022-04-19 | Sdl Netherlands B.V. | Blueprinting of multimedia assets |
US10452740B2 (en) | 2012-09-14 | 2019-10-22 | Sdl Netherlands B.V. | External content libraries |
US11386186B2 (en) | 2012-09-14 | 2022-07-12 | Sdl Netherlands B.V. | External content library connector systems and methods |
US9323767B2 (en) | 2012-10-01 | 2016-04-26 | Longsand Limited | Performance and scalability in an intelligent data operating layer system |
US9916306B2 (en) | 2012-10-19 | 2018-03-13 | Sdl Inc. | Statistical linguistic analysis of source content |
US11080493B2 (en) | 2015-10-30 | 2021-08-03 | Sdl Limited | Translation review workflow systems and methods |
US10614167B2 (en) | 2015-10-30 | 2020-04-07 | Sdl Plc | Translation review workflow systems and methods |
US20180232373A1 (en) * | 2017-02-13 | 2018-08-16 | International Business Machines Corporation | Weighting and Expanding Query Terms Based on Language Model Favoring Surprising Words |
US10713241B2 (en) * | 2017-02-13 | 2020-07-14 | International Business Machines Corporation | Weighting and expanding query terms based on language model favoring surprising words |
US10706048B2 (en) * | 2017-02-13 | 2020-07-07 | International Business Machines Corporation | Weighting and expanding query terms based on language model favoring surprising words |
US20180232374A1 (en) * | 2017-02-13 | 2018-08-16 | International Business Machines Corporation | Weighting and Expanding Query Terms Based on Language Model Favoring Surprising Words |
US10635863B2 (en) | 2017-10-30 | 2020-04-28 | Sdl Inc. | Fragment recall and adaptive automated translation |
US11321540B2 (en) | 2017-10-30 | 2022-05-03 | Sdl Inc. | Systems and methods of adaptive automated translation utilizing fine-grained alignment |
US11475227B2 (en) | 2017-12-27 | 2022-10-18 | Sdl Inc. | Intelligent routing services and systems |
US10817676B2 (en) | 2017-12-27 | 2020-10-27 | Sdl Inc. | Intelligent routing services and systems |
US11256867B2 (en) | 2018-10-09 | 2022-02-22 | Sdl Inc. | Systems and methods of machine learning for digital assets and message creation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6477524B1 (en) | Method for statistical text analysis | |
US7509313B2 (en) | System and method for processing a query | |
US8005858B1 (en) | Method and apparatus to link to a related document | |
US7260571B2 (en) | Disambiguation of term occurrences | |
US8719262B1 (en) | Identification of semantic units from within a search query | |
US9569505B2 (en) | Phrase-based searching in an information retrieval system | |
CA2813644C (en) | Phrase-based searching in an information retrieval system | |
EP1622055B1 (en) | Phrase-based indexing in an information retrieval system | |
EP1622052B1 (en) | Phrase-based generation of document description | |
US8204874B2 (en) | Abbreviation handling in web search | |
US7580929B2 (en) | Phrase-based personalization of searches in an information retrieval system | |
US7580921B2 (en) | Phrase identification in an information retrieval system | |
US20070136251A1 (en) | System and Method for Processing a Query | |
US20090234836A1 (en) | Multi-term search result with unsupervised query segmentation method and apparatus | |
Li et al. | Complex query recognition based on dynamic learning mechanism | |
Huang et al. | Learning to find comparable entities on the web | |
Xu et al. | Using multiple features and statistical model to calculate text units similarity | |
Reghu Raj et al. | A phrase grammar-based conceptual indexing paradigm | |
Lazarinis | Automatic extraction of knowledge from Greek Web documents | |
Lortz | Research Paper Clustering of Mailing List Discussions for Information Retrieval | |
Sathiyamurthy et al. | Event detection and summarization based on social networks and semantic query expansion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: SHARP LABORATORIES OF AMERICA, INCORPORATED, WASHINGTON; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: TASKIRAN, CUNEYT; QIAN, RICHARD; REEL/FRAME: 010475/0025; SIGNING DATES FROM 19991207 TO 19991217 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
FPAY | Fee payment | Year of fee payment: 4 |
FPAY | Fee payment | Year of fee payment: 8 |
FPAY | Fee payment | Year of fee payment: 12 |