US9177045B2 - Topical search engines and query context models - Google Patents
Topical search engines and query context models Download PDFInfo
- Publication number
- US9177045B2 US9177045B2 US12/792,288 US79228810A US9177045B2 US 9177045 B2 US9177045 B2 US 9177045B2 US 79228810 A US79228810 A US 79228810A US 9177045 B2 US9177045 B2 US 9177045B2
- Authority
- US
- United States
- Prior art keywords
- query
- queries
- keywords
- urls
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Links
Images
Classifications
-
- G06F17/30672—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3322—Query formulation using system suggestions
-
- G06F17/3064—
Definitions
- Search engines are utilized to maximize the likelihood of locating relevant information amongst an abundance of data. For instance, search engines are often employed over the World Wide Web (a.k.a. web) to facilitate locating and accessing websites of interest as a function of a search query comprising one or more keywords and operators. Upon receipt of a query, the search engine retrieves a list of websites that match the query, generates a snippet of text associated with the websites, and displays the websites and text, typically ranked based on relevance. The user can thereafter scroll through a plurality of returned websites in an attempt to identify information of interest. However, this can be an extremely time-consuming and frustrating process since search engines can return a substantial amount of content that often is irrelevant to a user's intent.
- World Wide Web a.k.a. web
- the subject disclosure generally concerns topical search as well as generation and utilization of query context models to facilitate production of topical search systems.
- Technology is provided for automatically identifying queries related to a topic and identifying suitable query context to be added to these queries as a function of lexical generality, among other things.
- a generated query context model can comprise both queries and query context related to a particular topic or domain and be employed to bias received queries toward the particular topic or domain.
- general contextual keywords from the query context model can be added to an ambiguous or underspecified query and provided to a search engine, wherein the contextual keywords influence returned results.
- FIG. 1 is a block diagram of topical search system.
- FIG. 3 is a block diagram of a representative context-generation component.
- FIG. 4 is a block diagram of a representative subweb component.
- FIG. 5 is a block diagram of a representative query-alteration component.
- FIG. 6 is a flow chart diagram of a method of topical search.
- FIG. 7 is a flow chart diagram of a topical search method.
- FIG. 8 is a flow chart diagram of a method of subweb construction.
- FIG. 9 is a flow chart diagram of a method of generating a query context model.
- FIG. 10 is a flow chart diagram of a method of lexical generality scoring.
- FIG. 11 is a flow chart diagram of a method of topical query contextualization.
- FIG. 12 is a schematic block diagram illustrating a suitable operating environment for aspects of the subject disclosure.
- Topical search engines focus search results on specific topics or domains (e.g., photography, automobiles, home improvement, golf, fishing . . . ).
- contextual keywords can be added to queries to bias a search towards a particular topic.
- constraints can be imposed that control the type (e.g., lexical generality, co-occurrence . . . ) and number of contextual keywords to be added as well as whether or not any contextual keywords are added at all.
- search engines or systems can employ query context models to provide contextual keywords for various topic related queries.
- a query context model can be generated automatically from one or more topic-relevant documents or URLs, amongst other data according to an aspect of the disclosure. After a number of queries are extracted from provided and discovered URLs, topic-related or contextual keywords pertaining to the queries can be identified as a function of lexical generality, among other things.
- the topical search system 100 includes a search engine 110 for receiving queries and returning results.
- the search engine 110 can correspond to a general or generic web search engine or a more specific or specialized search engine all of which are known in the art. Furthermore, although described herein with respect to the web, the search engine 110 is not limited thereto and can operate over any collection of data.
- the topical search system 100 also includes a topic component 120 configured to intercept queries destined for the search engine in order to alter the queries and subsequently provide the altered queries to the search engine 110 . More specifically, the topic component 120 can add suitable context information to a query to focus results on a particular topic or domain.
- a topical query typically performs poorly with respect to search engines and particularly to general search engines due in part to ambiguity.
- general search engines are optimized for result diversity, for any ambiguous query, the top results will likely span multiple topics. This reduces the number of results that are useful for a user, or in other words, that are directed to the user's search intent. For example, suppose a user submits a query “EOS” hoping to obtain results for “Canon EOS cameras.” General search engines will return one or two useful results in the top ten while the rest are directed to other uses of the word.
- queries can be identified and disambiguated. It is, however, not trivial to decide whether a query is ambiguous. For example, query performance prediction has been employed to address this problem, but there are several issues that make such methods impractical. However, the problem of identifying ambiguity can be obviated all together. Specifically, a check can be made as to whether a query is topic related or not, rather than if it is ambiguous. For any topic related query, the query can be focused on documents of a specific topic by adding appropriate contextual keywords so the results are constrained to the topic of interest as will be described further below. Further, care should be taken when adding context so as not to change the original query intent.
- top search results become photography related without altering the original query intent.
- some other word can be added.
- simply adding generic keywords like “camera” or “photography” to every query does not work as well. For instance, not every site related to photography includes these words, and adding them might actually decrease the relevance of the results returned.
- a query “Q” can be issued by an end user, for instance, desiring to acquire results related to a particular topic.
- Altered query “Q′” is subsequently provided to search engine 110 , which returns results. In this manner, an alteration can be achieved “Q ⁇ Q′” such that the results are highly relevant to “Q” in the context of the particular topic.
- Query “Q” can be either ambiguous and topic related or unambiguous and topic related. If “Q” is ambiguous and topic related, “Q” is disambiguated by context addition. If “Q” is unambiguous and topic related, then the addition of context does not change the intent of “Q.” Further, if there is not enough or suitable contextual information, the query “Q” need not be altered. This prevents degradation of query performance (e.g., “Do no harm” principle). Thus, queries can be augmented without having to classify them as ambiguous or unambiguous. In addition, owing to the “Do no Harm” principle, queries, which are not topic-related, can remain unaltered.
- the topic component 120 is separate from the search engine 110 .
- the topic component 120 can act as a low-overhead topical wrapper with respect to the search engine 110 in one embodiment.
- the search engine does not have to be changed, or more specifically search indexes, ranking strategies, and/or result processing can remain unchanged.
- the disclosed subject matter is not limited thereto, and the functionality of the topic component 120 can be incorporated within the search engine 110 in an alternate embodiment.
- the topic component 120 and/or search engine 110 can employ a query context model to facilitate topical search, as described below, the subject application and appended claims are not limited thereto.
- FIG. 2 depicts a representative topic component 120 in further detail.
- the topic component 120 includes a context generation component 210 , query context model 220 , and query alteration component 230 .
- the context generation component 210 receives, retrieves, or otherwise acquires one or more topic relevant documents or seed sites (e.g., URLs) from a topical search system designer such as a website owner or blogger, for example. From these seed sites, among other things, the context generation component 210 produces the query context model 220 specific to a particular topic or domain (e.g., photography, cars, home improvement, golf . . . ).
- topic relevant documents or seed sites e.g., URLs
- the context generation component 210 produces the query context model 220 specific to a particular topic or domain (e.g., photography, cars, home improvement, golf . . . ).
- the query context model 220 can be embodied as a context list that stores potential contextual keywords for a number of topic-relevant queries (also referred to generically herein as n-grams or n-gram queries meaning queries comprising “n” query keywords).
- generation of the query context model 220 can be performed as an offline operation once per topic, when the topical search engine is created, for instance. Further, generation can include building a subweb of topic related queries and URLs, identifying co-occurring keywords for each query in the subweb, and for each query selecting contextual, or in other words topic-related, keywords that satisfy certain constraints including those related to lexical generality, among other things.
- the query alteration component 230 generally performs online functionality described with respect to the topic component 120 in FIG. 1 . More specifically, upon receipt of a query, the query alteration component 230 can utilize the query context model 220 to identify and add context to the query to produce an altered query. In a simple example, if the query includes a single query keyword “Focus,” the query alteration component 230 can look up the query and locate “camera” as the context to be added to the query. Of course, more than one context keyword can be added to a query. For example, for the query “Evolt,” “Olympus,” “camera,” and “digital” can be injected as context.
- FIG. 3 illustrates a representative context-generation component 210 in further detail include a subweb component 310 , co-occurrence component 320 , lexical generality component 330 , and construction component 340 .
- the subweb component 310 produces a subweb or in other words a collection of topic or domain specific URLs and queries, wherein each unique URL and query can include a corresponding domain relevance weighting.
- the subweb component 310 includes an extraction component 410 that automatically extracts URLs and queries from topic relevant seed sites and a click graph.
- a topical search engine creator can provide a small initial set of preferably authoritative and highly topic relevant websites “S.” The more relevant the websites are the less chance there is of identifying off-topic queries and URLs.
- the extraction component 410 can subsequently extract query-URL pairs from the click graph (e.g., record of users' queries, the URLs they clicked on, and the number of times a user clicked on a specific URL when they issued a particular query) where the URL is one of the seed sites “S,” or where the URL is a sub-site of one of the sites in “S.” Additionally, URLs linked to queries identified and queries linked to URLs identified can be found.
- the click graph e.g., record of users' queries, the URLs they clicked on, and the number of times a user clicked on a specific URL when they issued a particular query
- Weight component 420 of the subweb component 310 assigns weights to queries and URLs to aid in controlling the size of a generated subweb and otherwise limiting the subweb to the best URLs and queries.
- each query can be assigned an initial weight equal to the number of sites in “S” where the query “q” occurs in the click graph.
- a weight threshold can be utilized to limit the URLs selected.
- weights for each newly identified URL can be updated, for example with the sum (over all queries) of all average weights of the queries, where average weight of a query is its weight divided by the number of selected URLs linked to it.
- queries for all identified URLs are located that have a weight greater than a threshold and newly identified queries weights are updated, for instance with the sum (over all URLs) of average weights of the URLs, where average weight of a URL is its weight divided by the number of selected queries linked to it.
- the subweb component 310 also includes a selection component 430 that selects or otherwise identifies URLs from subweb URLs with a weight greater than a threshold and their corresponding queries. This set of queries and URLs along with their weights can form the subweb.
- the URLs identified in the subweb are a large but not comprehensive list of web pages for a given topic.
- the subweb need not be complete, rather the subweb should include a sizeable proportion of topic related URLs and queries.
- Various parameters or thresholds can be utilized to control the precision and size of the subweb being constructed. For example, low values of these parameters are likely to bring in non-domain query/URL pairs while high values are likely to miss some of the domain query/URL pairs.
- the co-occurrence component 320 derives co-occurrence scores between query keywords in identified in the subweb. To facilitate clarity and understanding, a sample hypothetical subweb for photography with only two URLs is shown in Table 2 below:
- n-gram query e.g., query with “n” query keywords
- the frequency of an n-gram query can be considered equal to the sum of all the clicks it received.
- the frequency of the unigram “Olympus” in URL1 is 25 (10+5+10).
- URL and pseudo-document are used interchangeably.
- n-gram query “n1” and a related keyword “c1” are said to co-occur if they appear together in the same pseudo-document “d.”
- the total frequency over the subweb is calculated by summing the individual frequencies over all documents:
- pseudo-document co-occurrence can be used instead of query co-occurrence since it helps in finding relationships between keywords that do not appear with high enough frequency in the same query.
- “E300” and “Olympus” do not appear in the same query, they still have a non-zero co-occurrence score.
- n-gram frequencies vary a lot within the subweb
- the strength of a relationship between “n1” and “c1,” “CScore(c1,n1),” can be measured by normalizing the co-occurrence frequency with the n-gram frequency, that is:
- CScore ⁇ ( c ⁇ ⁇ 1 , n ⁇ ⁇ 1 ) Freq subweb ⁇ ( c ⁇ ⁇ 1 ⁇ n ⁇ ⁇ 1 ) Freq subweb ⁇ ( n ⁇ ⁇ 1 ) .
- CScore(c1,n1) can be viewed as the maximum likelihood estimate of the conditional probability “P(c1
- the lexical generality component 330 determines the lexical generality score of keywords. Given a set of topical queries, the lexical generality (LG) of a keyword is the number of other unique keywords that the keyword appears with in queries. The lexical generality of an n-gram is then considered equal to the lexical generality score of its most general keyword.
- LG lexical generality
- the lexical generality score for “olympus” is three
- the score for “model” is three
- the score for “camera” is two
- the score for “olympus model review” is equal to the lexical generality score of “olympus,” which is three.
- the lexical generality scores enforce a partial order on all the keywords based on their “importance” in the domain. This property plays a role in selecting contextual or topic-related keywords.
- subweb frequency “Freq subweb ”) of a keyword instead of the approach described above.
- Req subweb subweb frequency
- empirical evidence suggests that subweb frequency is more representative of popularity than generality of a keyword. For example, users submit a lot more queries for brand keywords like “Sony” or “Canon” as compared to the keyword “camera.” Thus, ordering based on subweb frequency would make brand names more general than “camera,” which is undesirable.
- the construction component 340 can utilize the subweb created by the subweb component 310 , co-occurrence scores produced by the co-occurrence component 320 , and lexical generality scores produced by the lexical generality component 330 to construct a query context model, which in one embodiment can be a context list including queries and contextual keywords.
- the construction component 340 can implement various constraints on contextual or topic-related keywords “C” to ensure the query context model aids but does not harm topical and other searches. For example such constraints can be but are not limited to the following:
- the first constrain prevents a query from being over-specified. For example, “Canon” and “powershot” are both highly correlated, and LG(“Canon”)>LG(“powershot”). Thus, “Canon” can be added as context to “powershot” without changing its intent, but “powershot” should not be added as context to “Canon.”
- the second constraint prevents adding too much context. Adding context beyond a maximum number of keywords can eventually negatively affect result relevance.
- the third constraint ensures the query is shifted towards a large number of topic-related documents.
- a high score between a query keyword “q” and a context keyword “c” means that “c” occurs in many topic-related documents in which “q” occurs.
- Constraint 3 a asserts that a single high scoring keyword is enough to shift the query sufficiently—there is no need to add any moderately scoring keywords.
- constraint 3 b says that if no high scoring keywords are available and “q” does not have a high lexical generality score, then either a maximum number of moderately related contextual keywords will be added or no contextual keywords will be added at all.
- the maximum number of context keywords can be four. Keywords with a high generality score are likely related to a large number of documents in the subweb; hence, if a highly correlated context is absent then additional context is not added to prevent harm.
- bounds or thresholds for high and low co-occurrence scores “HighC” and “LowC” and the bounds or thresholds for high generality score “HighG” can be set based on application requirements. By way of example and not limitation, these scores can be set to 0.75, 0.10, and 50 respectively regardless of topic. This means that for a keyword “c” to be added as context to any keyword “q,” “c” should co-occur in at least 75% of documents in which “q” occurs. Alternatively, if “q” has a generality score lower than 50 and there is no “c” with co-occurrence over 75%, then multiple context keywords with at least 10% co-occurrence will be added. These parameters essentially allow specification of context when there is sufficient information to confidently add contextual keywords without departing from search intent.
- a representative query-alteration component 230 is depicted including a context acquisition component 510 and a query writer component 520 .
- the context acquisition component 510 selects a context for a query using query context model or in one embodiment a context list, as previously described.
- the context acquisition component 510 can start by looking for an entire query in the context list and, if present, return the context specified. Otherwise, a highly related context keyword can be added for each of a query's keywords in turn (e.g., round robin fashion). The addition of each keyword in turn ensures that the query shifts towards topical documents related to all query keywords. Since the maximum content size is limited, only one or two context keywords can selected per query keyword. Therefore, moderately related contexts need not be used here.
- various portions of the disclosed systems above and methods below can include or consist of artificial intelligence, machine learning, or knowledge or rule-based components, sub-components, processes, means, methodologies, or mechanisms (e.g., support vector machines, neural networks, expert systems, Bayesian belief networks, fuzzy logic, data fusion engines, classifiers . . . ).
- Such components can automate certain mechanisms or processes performed thereby to make portions of the systems and methods more adaptive as well as efficient and intelligent.
- the context generation component 210 can employ such mechanisms to facilitate generation of a query context model, for instance based on imperfect or unavailable information.
- Such mechanisms can be utilized additionally or alternatively to determine or infer whether or not to add context to a query so as not to degrade performance with respect to user's search intent, for instance.
- a topical search method 600 is illustrated.
- a query can be received, retrieved, or otherwise obtained or acquired from a user, for example.
- one or more contextual or, in other words, topic-related keywords are identified as a function of lexical generality and optionally co-occurrence, wherein lexical generality refers to the generality of contextual keyword and co-occurrence concerns occurrence of query keywords together in a pseudo-document.
- identification of one or more contextual keywords can involve determining contextual keywords that are greater in lexical generality than keywords that form the acquired query.
- identified contextual keywords are added to the acquired query.
- FIG. 7 is a flow chart diagram of a method of topical search 700 is illustrated.
- a subweb can be built based on a one or more (preferably a few) topic relative and authoritative URLs provided by a creator of a topical search engine, for example, as well as other data such as a click graph.
- a built subweb can include a collection of topic or domain specific URLs and queries with each unique URL and query including an optional domain relevance weight.
- a query context model can be constructed utilizing the subweb as well as computed co-occurrence and lexical generality scores, for example.
- the context query model which in one embodiment can be a context list, can include a plurality of topic specific queries and related contextual keywords.
- contextual keywords can be added to an input query to produce a topic specific query.
- FIG. 8 depicts a method of subweb building 800 for use with respect to a topical search system.
- a number of topic relevant URLs are acquired from a creator of a topical search engine, for example, as seed sites. The more relevant as well as authoritative the URLs are the better the overall quality of the subweb.
- all query-URL pairs are extracted from a click graph (e.g., accessible via a service) as a function of the acquired seed URLs.
- an initial weight can be assigned to queries as a function of the number of sites where the query appears in a click graph.
- a cycle can begin where URLs linked to identified queries and queries linked to identified URLs are located.
- URL and query weights can be updated as new URLs and queries are located.
- URLs with weights greater than a predetermined threshold and corresponding queries are selected for inclusion in the subweb.
- FIG. 9 is a flow chart diagram of a method of query context-model generation 900 .
- co-occurrence scores between queries and keywords are determined.
- a generated subweb can include a number of URLs and a set of queries corresponding to each URL. URLs can be treated as pseudo-documents and corresponding query keywords as words in the pseudo-document.
- the co-occurrence score measures the probability of a contextual keyword “c1” co-occurring with a query n-gram “n1” in a pseudo-document, given that “n1” is observed in the pseudo-document.
- lexical generality scores are determined for query keywords. More specifically, given a set of topical queries, lexical generality of a keyword is the number of other unique keywords it appears with in queries.
- the lexical generality of a query e.g., query n-gram
- a query context model or the like is generated as a function of a co-occurrence and lexical generality scores, among other things.
- the query context model can identify topical queries and corresponding contextual keywords that can be added to the queries.
- the contextual model can be generated utilizing a number of constraints on contextual keywords to ensure that context can be added without harming query intent. For example, the lexical generality of a contextual keyword can be required to be greater than the lexical generality of a query.
- a maximum number of contextual keywords that can be added can also be defined as well as acceptable co-occurrence values and co-occurrence ranges.
- FIG. 10 is a flow chart diagram of a method of iterative lexical-generality scoring 1000 .
- a query is acquired from a previously constructed subweb.
- a lexical generality score is determined at 1020 .
- a determination is made at reference numeral 1030 , as to whether all queries in the subweb have been processed. If not all queries have been processed (“NO”), the method loops back to reference 1010 where the next query is acquired for processing. If, however, all queries have been processed (“YES”), all keywords and lexical generality scores are returned or otherwise persisted for later access.
- FIG. 11 illustrates a method of topical query contextualization 1100 .
- a check is made as to whether an input query is located in a query context model or in one embodiment a context list. If the query is located in the context list (“YES”), context is acquired from the context list and returned at 1120 . If the query is not located in the context model or list (“NO”), a query keyword is selected at numeral 1130 , and context for that keyword is looked up and returned at 1140 .
- a check is made as to whether the end of the query has been reached, or in other words, whether all the keywords comprising the query have been processed. If the end of the query has been reached (“YES”), the method terminates.
- the method continues at reference 1160 where a check is made pertaining to whether or not a maximum context or number of contextual keywords has been reached with respect to the query. If the maximum has been reached (“YES”), the method terminates. Otherwise (“NO”), the method returns to numeral 1130 where the next query keyword is select.
- a component may be, but is not limited to being, a process running on a processor, a processor, an object, an instance, an executable, a thread of execution, a program, and/or a computer.
- an application running on a computer and the computer can be a component.
- One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
- Persistent data or the like is intended to refer to data stored on a non-volatile medium that exists across application sessions. In other words, the persistent data survives application startup and termination.
- transient data often saved on a volatile medium such as memory, is created within or during an application session and is discarded at the end of the session.
- the term “persist,” or various forms thereof e.g., persists, persisting, persisted . . . ), is intended to refer to storing data in a persistent form or as persistent data.
- a “subweb” as used herein is a collection of topic or domain specific documents or URLs and query. Each unique URL and query can have a corresponding domain relevance weight.
- a “keyword,” and various forms thereof, generally refers to a word specified with respect to a search to enable location of particular information or data, among other things.
- a “query keyword” refers to a word specified with respect to a query such as an input search query.
- a “contextual keyword” or “topic-related keyword” refers to a word that can be added to a query to bias the results toward a particular topic or domain.
- a “contextual keyword” or “topic-related keyword” can be derived from prior queries and query keywords, as described herein.
- n-gram query is intended to refer to “n” sub-strings of a query or more particularly a query including “n” query keywords. Accordingly, a “unigram” is a query including one query keyword.
- the term “inference” or “infer” refers generally to the process of reasoning about or inferring states of the system, environment, and/or user from a set of observations as captured via events and/or data. Inference can be employed to identify a specific context or action, or can generate a probability distribution over states, for example. The inference can be probabilistic—that is, the computation of a probability distribution over states of interest based on a consideration of data and events. Inference can also refer to techniques employed for composing higher-level events from a set of events and/or data.
- Such inference results in the construction of new events or actions from a set of observed events and/or stored event data, whether or not the events are correlated in close temporal proximity, and whether the events and data come from one or several event and data sources.
- Various classification schemes and/or systems e.g., support vector machines, neural networks, expert systems, Bayesian belief networks, fuzzy logic, data fusion engines . . . ) can be employed in connection with performing automatic and/or inferred action in connection with the claimed subject matter.
- FIG. 12 As well as the following discussion are intended to provide a brief, general description of a suitable environment in which various aspects of the subject matter can be implemented.
- the suitable environment is only an example and is not intended to suggest any limitation as to scope of use or functionality.
- the computer 1210 includes one or more processing units or processors 1220 , system memory 1230 , system bus 1240 , mass storage 1250 , and one or more interface components 1270 .
- the system bus 1240 communicatively couples at least the above system components.
- the computer 1210 can include one or more processors 1220 coupled to system memory 1230 that execute various computer executable actions, instructions, and or components.
- the processing unit 1220 can be implemented with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
- a general-purpose processor may be a microprocessor, but in the alternative, the processor may be any processor, controller, microcontroller, or state machine.
- the processing unit 1220 may also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, multi-core processors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- the computer 1210 can include or otherwise interact with a variety of computer-readable media to facilitate control of the computer 1210 to implement one or more aspects of the claimed subject matter.
- the computer-readable media can be any available media that can be accessed by the computer 1210 and includes volatile and nonvolatile media and removable and non-removable media.
- computer-readable media may comprise computer storage media and communication media.
- Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data.
- Computer storage media includes, but is not limited to memory devices (e.g., random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM) . . . ), magnetic storage devices (e.g., hard disk, floppy disk, cassettes, tape . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . .
- RAM random access memory
- ROM read-only memory
- EEPROM electrically erasable programmable read-only memory
- magnetic storage devices e.g., hard disk, floppy disk, cassettes, tape . . .
- optical disks e.g., compact disk (CD), digital versatile disk (DVD) . . .
- Communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
- modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
- System memory 1230 and mass storage 1250 are examples of computer-readable storage media.
- system memory 1230 may be volatile (e.g., RAM), non-volatile (e.g., ROM, flash memory . . . ) or some combination of the two.
- the basic input/output system (BIOS) including basic routines to transfer information between elements within the computer 1210 , such as during start-up, can be stored in nonvolatile memory, while volatile memory can act as external cache memory to facilitate processing by the processing unit 1220 , among other things.
- Mass storage 1250 includes removable/non-removable, volatile/non-volatile computer storage media for storage of large amounts of data relative to the system memory 1230 .
- mass storage 1250 includes, but is not limited to, one or more devices such as a magnetic or optical disk drive, floppy disk drive, flash memory, solid-state drive, or memory stick.
- topic component 120 including context generation component 210 and query alteration component 230 can be an application 1262 or part of an application 1262 and include one or more modules 1264 and data 1266 stored in memory and/or mass storage 1250 whose functionality can be realized when executed by one or more processors or processing units 1220 , as shown.
- the computer 1210 also includes one or more interface components 1270 that are communicatively coupled to the system bus 1240 and facilitate interaction with the computer 1210 .
- the interface component 1270 can be a port (e.g., serial, parallel, PCMCIA, USB, FireWire . . . ) or an interface card (e.g., sound, video . . . ) or the like.
- the interface component 1270 can be embodied as a user input/output interface to enable a user to enter commands and information into the computer 1210 through one or more input devices (e.g., pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, camera, other computer . . . ).
- the interface component 1270 can be embodied as an output peripheral interface to supply output to displays (e.g., CRT, LCD, plasma . . . ), speakers, printers, and/or other computers, among other things.
- the interface component 1270 can be embodied as a network interface to enable communication with other computing devices (not shown), such as over a wired or wireless communications link.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
TABLE 1 | |||
QUERY | CONTEXT | ||
Evolt | Olympus: 0.9, camera: 0.7, digital 0.5 | ||
Alpha | Sony: 0.95 | ||
Focus | camera: 0.9 | ||
Powershot | Canon: 0.9 | ||
Nikkor | Nikon: 0.7 | ||
Sandisk | memory: 0.8, card: 0.8 | ||
Here, for example, “Olympus” is a potential context for “evolt” with confidence 0.9 and “camera” is a potential context with confidence 0.7. “Alpha” is disambiguated with “Sony,” and so forth. Notice that some terms that are reasonably unambiguous, like “Nikkor,” still have context defined, which does not change the intent of the query keyword.
TABLE 2 | |||
####### | |||
URL 1: xxxxxxxxxxxxxxxxxxxxxxxxxx | |||
Olympus 300 | 10 | ||
E300 | 10 | ||
Olympus Camera | 5 | ||
Olympus models review | 10 | ||
####### | |||
URL 2: yyyyyyyyyyyyyyyyyyyyyyyyyyy | |||
Olympus Reviews | 10 | ||
Olympus model reviews | 5 | ||
####### | |||
For each URL, a set of queries are specified and for each query-URL pair the number of clicks observed is denoted. To calculate the co-occurrence scores between keywords a modified bag of words approach for text documents can be used. More specifically, each URL can be treated as a pseudo-document and all corresponding query keywords as words in the pseudo-document. The frequency of an n-gram query (e.g., query with “n” query keywords) can be considered equal to the sum of all the clicks it received. For example, the frequency of the unigram “Olympus” in URL1 is 25 (10+5+10). For ease of use, the terms URL and pseudo-document are used interchangeably.
Freqd(n1∩c1)=min(Freqd(n1),Freqd(c1))
The total frequency over the subweb is calculated by summing the individual frequencies over all documents:
For instance in the sample subweb in Table 2, the co-occurrence frequency of two unigrams ‘E300’ & ‘Olympus’ is given by:
Freqsubweb(E300∩Olympus)=FreqURL1(E300∩Olympus)+FreqURL2(E300∩Olympus)=min(FreqURL1(E300),FreqURL1(Olympus))+min(FreqURL2(E300),FreqURL2(Olympus))=min(10,25)+min(0,15)=10
FreqURL1(E300∩Olympus Camera)=min(10,5)=5
-
- a. Every cεC has CScore(c,q)≧HighC
- b. LG(q)<HighG and every cεC has HighC≧CScore(c,q)≧LowC and |C|=MAXC
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/792,288 US9177045B2 (en) | 2010-06-02 | 2010-06-02 | Topical search engines and query context models |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/792,288 US9177045B2 (en) | 2010-06-02 | 2010-06-02 | Topical search engines and query context models |
Publications (2)
Publication Number | Publication Date |
---|---|
US20110302172A1 US20110302172A1 (en) | 2011-12-08 |
US9177045B2 true US9177045B2 (en) | 2015-11-03 |
Family
ID=45065292
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/792,288 Active 2031-03-06 US9177045B2 (en) | 2010-06-02 | 2010-06-02 | Topical search engines and query context models |
Country Status (1)
Country | Link |
---|---|
US (1) | US9177045B2 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140180692A1 (en) * | 2011-02-28 | 2014-06-26 | Nuance Communications, Inc. | Intent mining via analysis of utterances |
US20180088752A1 (en) * | 2016-09-28 | 2018-03-29 | Button Inc. | Mobile web browser providing contextual actions based on web page content |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100131513A1 (en) | 2008-10-23 | 2010-05-27 | Lundberg Steven W | Patent mapping |
CN101464897A (en) * | 2009-01-12 | 2009-06-24 | 阿里巴巴集团控股有限公司 | Word matching and information query method and device |
US8386495B1 (en) | 2010-04-23 | 2013-02-26 | Google Inc. | Augmented resource graph for scoring resources |
US9589053B1 (en) * | 2010-12-17 | 2017-03-07 | The Boeing Company | Method and apparatus for constructing a query based upon concepts associated with one or more search terms |
US9582609B2 (en) * | 2010-12-27 | 2017-02-28 | Infosys Limited | System and a method for generating challenges dynamically for assurance of human interaction |
CN102567408B (en) * | 2010-12-31 | 2014-06-04 | 阿里巴巴集团控股有限公司 | Method and device for recommending search keyword |
US20120188249A1 (en) * | 2011-01-26 | 2012-07-26 | Raytheon Company | Distributed graph system and method |
MX2013010281A (en) * | 2011-03-09 | 2014-07-11 | Sirius Xm Radio Inc | System and method for increasing transmission bandwidth efficiency. |
US9904726B2 (en) | 2011-05-04 | 2018-02-27 | Black Hills IP Holdings, LLC. | Apparatus and method for automated and assisted patent claim mapping and expense planning |
US20140222788A1 (en) * | 2011-08-24 | 2014-08-07 | The Regents Of The University Of California | Research recommendation system |
US20130085946A1 (en) | 2011-10-03 | 2013-04-04 | Steven W. Lundberg | Systems, methods and user interfaces in a patent management system |
US8700599B2 (en) * | 2011-11-21 | 2014-04-15 | Microsoft Corporation | Context dependent keyword suggestion for advertising |
US9304738B1 (en) * | 2012-06-14 | 2016-04-05 | Goolge Inc. | Systems and methods for selecting content using weighted terms |
US9697821B2 (en) * | 2013-01-29 | 2017-07-04 | Tencent Technology (Shenzhen) Company Limited | Method and system for building a topic specific language model for use in automatic speech recognition |
US9213730B2 (en) * | 2013-08-13 | 2015-12-15 | Xerox Corporation | Method and apparatus for extracting portions of text from long social media documents |
GB2545548A (en) | 2014-03-29 | 2017-06-21 | Camelot Uk Bidco Ltd | Improved method, system and software for searching, identifying, retrieving and presenting electronic documents |
US9645739B2 (en) | 2014-09-26 | 2017-05-09 | Intel Corporation | Host-managed non-volatile memory |
US11636120B2 (en) | 2014-11-21 | 2023-04-25 | Microsoft Technology Licensing, Llc | Offline evaluation of ranking functions |
JP2016126481A (en) * | 2014-12-26 | 2016-07-11 | ブラザー工業株式会社 | Device control program, device control method, and device control apparatus |
CN105930527B (en) * | 2016-06-01 | 2019-09-20 | 北京百度网讯科技有限公司 | Searching method and device |
US11816436B2 (en) | 2018-07-24 | 2023-11-14 | MachEye, Inc. | Automated summarization of extracted insight data |
US11341126B2 (en) | 2018-07-24 | 2022-05-24 | MachEye, Inc. | Modifying a scope of a canonical query |
US11651043B2 (en) | 2018-07-24 | 2023-05-16 | MachEye, Inc. | Leveraging analytics across disparate computing devices |
US11853107B2 (en) * | 2018-07-24 | 2023-12-26 | MachEye, Inc. | Dynamic phase generation and resource load reduction for a query |
US11841854B2 (en) | 2018-07-24 | 2023-12-12 | MachEye, Inc. | Differentiation of search results for accurate query output |
US12026185B2 (en) * | 2021-03-01 | 2024-07-02 | Chevron U.S.A. Inc. | Document search and analysis tool |
US12106192B2 (en) | 2021-03-01 | 2024-10-01 | Chevron U.S.A. Inc. | White space analysis |
US20240070188A1 (en) * | 2022-08-29 | 2024-02-29 | Unnanu, Inc. | System and method for searching media or data based on contextual weighted keywords |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5875446A (en) | 1997-02-24 | 1999-02-23 | International Business Machines Corporation | System and method for hierarchically grouping and ranking a set of objects in a query context based on one or more relationships |
US20020194161A1 (en) | 2001-04-12 | 2002-12-19 | Mcnamee J. Paul | Directed web crawler with machine learning |
US6691108B2 (en) | 1999-12-14 | 2004-02-10 | Nec Corporation | Focused search engine and method |
US6701310B1 (en) | 1999-11-22 | 2004-03-02 | Nec Corporation | Information search device and information search method using topic-centric query routing |
US7392278B2 (en) | 2004-01-23 | 2008-06-24 | Microsoft Corporation | Building and using subwebs for focused search |
US20090164895A1 (en) * | 2007-12-19 | 2009-06-25 | Yahoo! Inc. | Extracting semantic relations from query logs |
US20090265331A1 (en) | 2008-04-18 | 2009-10-22 | Microsoft Corporation | Creating business value by embedding domain tuned search on web-sites |
US20100057716A1 (en) | 2008-08-28 | 2010-03-04 | Stefik Mark J | System And Method For Providing A Topic-Directed Search |
US7680860B1 (en) | 2001-07-31 | 2010-03-16 | Logika Corporation | Method and system for creating vertical search engines |
US20100082752A1 (en) * | 2008-09-30 | 2010-04-01 | Yahoo! Inc. | Query log mining for detecting spam hosts |
US8626768B2 (en) * | 2010-01-06 | 2014-01-07 | Microsoft Corporation | Automated discovery aggregation and organization of subject area discussions |
-
2010
- 2010-06-02 US US12/792,288 patent/US9177045B2/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5875446A (en) | 1997-02-24 | 1999-02-23 | International Business Machines Corporation | System and method for hierarchically grouping and ranking a set of objects in a query context based on one or more relationships |
US6701310B1 (en) | 1999-11-22 | 2004-03-02 | Nec Corporation | Information search device and information search method using topic-centric query routing |
US6691108B2 (en) | 1999-12-14 | 2004-02-10 | Nec Corporation | Focused search engine and method |
US20020194161A1 (en) | 2001-04-12 | 2002-12-19 | Mcnamee J. Paul | Directed web crawler with machine learning |
US7680860B1 (en) | 2001-07-31 | 2010-03-16 | Logika Corporation | Method and system for creating vertical search engines |
US7392278B2 (en) | 2004-01-23 | 2008-06-24 | Microsoft Corporation | Building and using subwebs for focused search |
US20090164895A1 (en) * | 2007-12-19 | 2009-06-25 | Yahoo! Inc. | Extracting semantic relations from query logs |
US20090265331A1 (en) | 2008-04-18 | 2009-10-22 | Microsoft Corporation | Creating business value by embedding domain tuned search on web-sites |
US20100057716A1 (en) | 2008-08-28 | 2010-03-04 | Stefik Mark J | System And Method For Providing A Topic-Directed Search |
US20100082752A1 (en) * | 2008-09-30 | 2010-04-01 | Yahoo! Inc. | Query log mining for detecting spam hosts |
US8626768B2 (en) * | 2010-01-06 | 2014-01-07 | Microsoft Corporation | Automated discovery aggregation and organization of subject area discussions |
Non-Patent Citations (29)
Title |
---|
[1] K. Bharat, and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of SIGIR-98 (Melbourne, AU, 1998), ACM Press, 104-111. |
A. Kruger, C. Lee Giles, F. Coerzee, E.J. Glover, G.W. Flake, S. Lawrence, C.W. Omlin, DEADLINER: Building an New Niche Search Engine, In Proc. CIKM 2000, pp. 272-281. |
A. Turpin, and W. Hersh. 2004. Do clarity scores for queries correlate with user performance?. In Proceedings of the 15th Australasian Database Conference-vol. 27 (Dunedin, New Zealand). K. Schewe and H. Williams, Eds. ACM International Conference Proceeding Series, vol. 52, pp. 85-91. |
B. Smyth, E. Balfe, J. Freyne, P. Briggs, M. Coyle, and O. Boydell. 2005. Exploiting Query Repetition and Regularity in an Adaptive Community-Based Web Search Engine. User Modeling and User-Adapted Interaction 14, 5 (Jan. 2005), 383-423. |
C. Hauff, V. Murdock, and R. Baeza-Yates. 2008. Improved query difficulty prediction for the web. In Proceeding of CIKM '08 (Napa Valley, California, USA, Oct. 26-30, 2008). CIKM '08. ACM, New York, NY, 439-448. |
Carstens, "Effects of Using a Research Context Ontology for Query Expansion", 2009, ESWC 2009 Heraklion Proceedings of the 6th European Semantic Web Conference on the Semantic Web: Research and Applications. * |
D. Gibson, J.M. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In UK Conference on Hypertext, pp. 225-234, 1998. |
E. Glover, G. Flake, S. Lawrence, W. P. Birmingham, A. Kruger, C. L. Giles, and D. Pennock. Improving category specific web search by learning query modifications. In Symposium on Applications and the Internet, SAINT, (San Diego, CA, Jan. 2001) IEEE, 23-31. |
Google Custom Search Engine. http://www.google.com/cse/. |
H. Chang, D. Cohn, and A. K. McCallum. Learning to create customized authority lists. In Proc. 17th ICML (Stanford, CA, 2000), Morgan Kaufmann, San Francisco, CA, 127-134. |
He, and I. Ounis. 2006. Query performance prediction. Inf. Syst. 31, 7 (Nov. 2006), 585-594. |
Hoeber et al, "Conceptual Query Expansion", 2005, In Proceedings of the Third International Atlantic web Intelligence Conference. * |
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999. |
J. Teevan, S. T. Dumais and E. Horvitz (2005). Personalizing search via automated analysis of interests and activities. In Proceedings of SIGIR 2005. |
Klemke, Roland. Modelling context in information brokering processes. Diss. Bibliothek der RWTH Aachen, 2002. * |
P. Thomas, and D. Hawking. 2006. Evaluation by comparing result sets in context. In Proceedings of CIKM '06. ACM, New York, NY, 94-101. |
Poblete, Dr. Searcher and Mr. Broser, A unified hyperlink-click graph, 2008. * |
R. Chandrasekar, H. Chen, S. Corston-Oliver, and E. Brill. Subwebs for specialized search. In Proceedings of SIGIR-2004 (Sheffield, UK, Jul. 2004), ACM Press, 480-481. |
R. Kraft, C. C. Chang, F. Maghoul, and R. Kumar. 2006. Searching with context. In Proceedings of WWW '06 . (Edinburgh, Scotland, May 23-26, 2006). WWW '06. ACM, New York, NY, 477-486. |
R. Song, Z. Luo, J. Nie, Y. Yu, and H. Hon. 2009. Identification of ambiguous queries in web search. Inf. Process. Manage. 45, 2 (Mar. 2009), 216-229. |
Rollyo site. http://rollyo.com. |
S. Chakrabarti, M. van den Berg and B. Dom. Focused Crawling: a New Approach to Topic-Specific Web Resource Discovery, 1999. Proceedings of WWW 1999. |
S. Cronen-Townsend, Y. Zhou, and W. B. Croft. 2002. Predicting query performance. In Proceedings of SIGIR-2002 (Tampere, Finland, Aug. 11-15, 2002). ACM, New York, NY, 299-306. |
S. Lawrence, K. Bollacker and C. Lee Giles, Indexing and Retrieval of Scientific Literature, In Proc. CIKM 99, pp. 139-146. |
Srinivasan et al., "Target Seeking Crawlers and their Topical Performance", Aug. 2002, ACM. * |
Stopford, Benjamin. "The Topic Specific Search Engine," (2006). * |
T. H. Haveliwala. 2002. Topic-sensitive PageRank. In Proceedings of the WWW 2002. (Honolulu, Hawaii, USA, May 7-11, 2002). ACM, New York, NY, 517-526. |
The Power of Topic Specific Search Engines-Published Date: Apr. 12, 2010 http://www.workz.com/content/view-content.html?section-id=475&content-id=6801. |
Weinberger et al., "Resolving Tag Ambiguity", Oct. 2008, ACM. * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140180692A1 (en) * | 2011-02-28 | 2014-06-26 | Nuance Communications, Inc. | Intent mining via analysis of utterances |
US20180088752A1 (en) * | 2016-09-28 | 2018-03-29 | Button Inc. | Mobile web browser providing contextual actions based on web page content |
Also Published As
Publication number | Publication date |
---|---|
US20110302172A1 (en) | 2011-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9177045B2 (en) | Topical search engines and query context models | |
US8244750B2 (en) | Related search queries for a webpage and their applications | |
Sontag et al. | Probabilistic models for personalizing web search | |
US11762908B1 (en) | Node graph pruning and fresh content | |
JP5247475B2 (en) | Mining web search user behavior to improve web search relevance | |
US8244737B2 (en) | Ranking documents based on a series of document graphs | |
US8417692B2 (en) | Generalized edit distance for queries | |
US8099423B2 (en) | Hierarchical metadata generator for retrieval systems | |
US9396188B2 (en) | Assigning tags to digital content | |
US7668887B2 (en) | Method, system and software product for locating documents of interest | |
Amitay et al. | Social search and discovery using a unified approach | |
US20070214131A1 (en) | Re-ranking search results based on query log | |
Wang et al. | A web service discovery approach based on common topic groups extraction | |
US9110901B2 (en) | Identifying web pages of the world wide web having relevance to a first file by comparing responses from its multiple authors | |
EP2875465A1 (en) | Defense against search engine tracking | |
Makvana et al. | A novel approach to personalize web search through user profiling and query reformulation | |
US20170075899A1 (en) | Utilizing keystroke logging to determine items for presentation | |
Wang et al. | Kvasir: Scalable provision of semantically relevant web content on big data framework | |
Chen et al. | Personalized query suggestion diversification in information retrieval | |
US20200401638A1 (en) | Method of and system for generating search query completion suggestion on search engine | |
Bassani et al. | A multi-representation re-ranking model for Personalized Product Search | |
Hurtado Martín et al. | An exploratory study on content-based filtering of call for papers | |
Mahale et al. | Advanced web crawler for deep web interface using binary vector & page rank | |
Dinesh | Real world evaluation of approaches to research paper recommendation | |
Sondhi et al. | Using query context models to construct topical search engines |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: MICROSOFT CORPORATION, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHANDRASEKAR, RAMAN;SONDHI, PARIKSHIT;ROUNTHWAITE, ROBERT;REEL/FRAME:024656/0017 Effective date: 20100601 |
|
AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034544/0001 Effective date: 20141014 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |