US10055461B1 - Ranking documents based on large data sets - Google Patents
- Publication number
- US10055461B1 (application US14/815,736)
- Authority
- US
- United States
- Prior art keywords
- computing device
- user
- document
- features
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24575—Query processing with adaptation to user needs using context
-
- G06F17/30528—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24578—Query processing with adaptation to user needs using ranking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G06F17/3053—
-
- G06F17/30867—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G06N7/005—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G06N99/005—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99935—Query augmenting and refining, e.g. inexact access
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99943—Generating database or data structure, e.g. via user interface
Definitions
- the present invention relates generally to information retrieval and, more particularly, to systems and methods for creating a ranking function from large data sets and using the ranking function to rank documents.
- the World Wide Web (“web”) contains a vast amount of information. Locating a desired portion of the information, however, can be challenging. This problem is compounded because the amount of information on the web and the number of new users inexperienced at web searching are growing rapidly.
- Search engines attempt to return hyperlinks to web documents in which a user is interested.
- search engines base their determination of the user's interest on search terms (called a search query) entered by the user.
- the goal of the search engine is to provide links to high quality, relevant results to the user based on the search query.
- the search engine accomplishes this by matching the terms in the search query to a corpus of pre-stored web documents. Web documents that contain the user's search terms are “hits” and are returned to the user.
- the search engine oftentimes ranks the documents using a ranking function based on the documents' perceived relevance to the user's search terms. Determining a document's relevance can be a tricky problem.
- Systems and methods, consistent with the principles of the invention create a ranking function based, at least in part, on prior information retrieval data, such as query data, user information, and document information, and use the ranking function to rank (or score) documents.
- a method for ranking documents may include creating a ranking model that predicts a likelihood that a document will be selected and training the ranking model using a data set that includes tens of millions of instances.
- the method may also include identifying documents relating to a search query, scoring the documents based, at least in part, on the ranking model, and forming search results for the search query from the scored documents.
- a system for ranking documents may include a repository and a server.
- the repository may store information corresponding to multiple prior searches.
- the server may receive a search query from a user, identify documents corresponding to the search query, and rank the identified documents based, at least in part, on a ranking model that includes rules that maximize a likelihood of the repository.
- a method for generating a model may include selecting candidate conditions from training data, estimating weights for the candidate conditions, and forming new rules from the candidate conditions and corresponding ones of the weights.
- the method may also include comparing a likelihood of the training data between a model with the new rules and the model without the new rules and selectively adding the new rules to the model based, at least in part, on results of the comparing.
- a method for ranking documents may include receiving a search query, identifying documents relating to the search query, and determining prior probabilities of selecting each of the documents.
- the method may also include determining a score for each of the documents based, at least in part, on the prior probability of selecting the document and generating search results for the search query from the scored documents.
- a system for generating a model may include a repository and multiple devices.
- the repository may store training data that includes instances and features, where each of the features corresponds to one or more of the instances.
- At least one of the devices may create a feature-to-instance index that maps features to instances to which the features correspond, select a candidate condition, request information associated with the candidate condition from other ones of the devices, receive the requested information from the other devices, estimate a weight for the candidate condition based, at least in part, on the requested information, form a new rule from the candidate condition and the weight, and selectively add the new rule to the model.
- a system for generating a model includes a repository and multiple devices.
- the repository may store multiple instances.
- At least one of the devices may analyze a subset of the instances to identify matching candidate conditions, analyze the candidate conditions to collect statistics regarding predicted probabilities from the matching instances, gather statistics regarding one of the candidate conditions from other ones of the devices, determine a weight associated with the one candidate condition based on at least one of the collected statistics and the gathered statistics, form a rule from the one candidate condition and the weight, and selectively add the rule to the model.
- FIG. 1 is a diagram of an exemplary information retrieval network in which systems and methods consistent with the principles of the invention may be implemented.
- FIG. 2 is a diagram of an exemplary model generation system according to an implementation consistent with the principles of the invention.
- FIG. 3 is an exemplary diagram of a device according to an implementation consistent with the principles of the invention.
- FIG. 4 is a flowchart of exemplary processing for generating a ranking model according to an implementation consistent with the principles of the invention.
- FIG. 5 is a flowchart of exemplary processing for ranking documents according to an implementation consistent with the principles of the invention.
- Systems and methods consistent with the principles of the invention may generate a ranking model based, at least in part, on prior information retrieval data, such as data relating to users, queries previously provided by these users, documents retrieved based on these queries, and documents that were selected and not selected in relation to these queries.
- the systems and methods may use this ranking model as part of a ranking function to rank documents.
- FIG. 1 is an exemplary diagram of a network 100 in which systems and methods consistent with the principles of the invention may be implemented.
- Network 100 may include multiple clients 110 connected to multiple servers 120 - 140 via a network 150 .
- Network 150 may include a local area network (LAN), a wide area network (WAN), a telephone network, such as the Public Switched Telephone Network (PSTN), an intranet, the Internet, a memory device, another type of network, or a combination of networks.
- Clients 110 may include client entities.
- An entity may be defined as a device, such as a wireless telephone, a personal computer, a personal digital assistant (PDA), a laptop, or another type of computation or communication device, a thread or process running on one of these devices, and/or an object executable by one of these devices.
- Servers 120 - 140 may include server entities that gather, process, search, and/or maintain documents in a manner consistent with the principles of the invention.
- Clients 110 and servers 120 - 140 may connect to network 150 via wired, wireless, and/or optical connections.
- server 120 may optionally include a search engine 125 usable by clients 110 .
- Server 120 may crawl documents (e.g., web pages) and store information associated with these documents in a repository of crawled documents.
- Servers 130 and 140 may store or maintain documents that may be crawled by server 120 .
- While servers 120 - 140 are shown as separate entities, it may be possible for one or more of servers 120 - 140 to perform one or more of the functions of another one or more of servers 120 - 140 .
- For example, two or more of servers 120 - 140 may be implemented as a single server. It may also be possible that a single one of servers 120 - 140 is implemented as multiple, possibly distributed, devices.
- FIG. 2 is an exemplary diagram of a model generation system 200 consistent with the principles of the invention.
- System 200 may include one or more devices 210 and a repository 220 .
- Repository 220 may include one or more logical or physical memory devices that may store a large data set (e.g., tens of millions of instances and millions of features) that may be used, as described in more detail below, to create and train a ranking model.
- the data may include information retrieval data, such as query data, user information, and document information, that may be used to create a model that may be used to rank a particular document.
- the query data may include search terms previously provided by users to retrieve documents.
- the user information may include Internet Protocol (IP) addresses, cookie information, query languages, and/or geographical information associated with the users.
- the document information may include information relating to the documents presented to the users and the documents that were selected and not selected by the users. In other exemplary implementations, other types of data may alternatively or additionally be stored by repository 220 .
- Device(s) 210 may include any type of computing device capable of accessing repository 220 via any type of connection mechanism. According to one implementation consistent with the principles of the invention, system 200 may include multiple devices 210 . According to another implementation, system 200 may include a single device 210 . Device(s) 210 may correspond to one or more of servers 120 - 140 .
- FIG. 3 is an exemplary diagram of a device 300 according to an implementation consistent with the principles of the invention.
- Device 300 may correspond to one or more of clients 110 , servers 120 - 140 , and device(s) 210 .
- Device 300 may include a bus 310 , a processor 320 , a main memory 330 , a read only memory (ROM) 340 , a storage device 350 , one or more input devices 360 , one or more output devices 370 , and a communication interface 380 .
- Bus 310 may include one or more conductors that permit communication among the components of device 300 .
- Processor 320 may include any type of conventional processor or microprocessor that interprets and executes instructions.
- Main memory 330 may include a random access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 320 .
- ROM 340 may include a conventional ROM device or another type of static storage device that stores static information and instructions for use by processor 320 .
- Storage device 350 may include a magnetic and/or optical recording medium and its corresponding drive.
- Input device(s) 360 may include one or more conventional mechanisms that permit an operator to input information to device 300 , such as a keyboard, a mouse, a pen, voice recognition and/or biometric mechanisms, etc.
- Output device(s) 370 may include one or more conventional mechanisms that output information to the operator, including a display, a printer, a speaker, etc.
- Communication interface 380 may include any transceiver-like mechanism that enables device 300 to communicate with other devices and/or systems.
- device 300 may perform certain data-related operations. Device 300 may perform these operations in response to processor 320 executing software instructions contained in a computer-readable medium, such as memory 330 .
- a computer-readable medium may be defined as one or more physical or logical memory devices and/or carrier waves.
- the software instructions may be read into memory 330 from another computer-readable medium, such as data storage device 350 , or from another device via communication interface 380 .
- the software instructions contained in memory 330 cause processor 320 to perform processes that will be described later.
- hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the principles of the invention.
- implementations consistent with the principles of the invention are not limited to any specific combination of hardware circuitry and software.
- the set of data in repository 220 may include multiple elements, called instances. It may be possible for repository 220 to store more than 50 million instances. Each instance may include a triple of data: (u, q, d), where u refers to user information, q refers to query data provided by the user, and d refers to document information relating to documents retrieved as a result of the query data and which documents the user selected and did not select.
- features may be extracted for any given (u, q, d). These features might include one or more of the following: the country in which user u is located, the time of day that user u provided query q, the language of the country in which user u is located, each of the previous three queries that user u provided, the language of query q, the exact string of query q, the word(s) in query q, the number of words in query q, each of the words in document d, each of the words in the Uniform Resource Locator (URL) of document d, the top level domain in the URL of document d, each of the prefixes of the URL of document d, each of the words in the title of document d, each of the words in the links pointing to document d, each of the words in the title of the documents shown above and below document d for query q, the number of times a word in query q matches a word in document d, the number of times user u has previously accessed document d, and other information.
- a feature-to-instance index may be generated that links features to the instances in which they are included. For example, for a given feature f, the set of instances that contain that feature may be listed. The list of instances for a feature f is called the “hitlist for feature f.” Thereafter, given a set of features f 0 , . . . , f n , the set of instances that contain those features can be determined by intersecting the hitlist for each of the features f 0 , . . . , f n .
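The hitlist intersection described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the feature strings (such as "query:tree"), the function names, and the toy instances are all hypothetical, and each instance is reduced to the set of features it contains.

```python
from collections import defaultdict


def build_feature_index(instances):
    """Map each feature to its hitlist: the sorted ids of instances containing it."""
    index = defaultdict(list)
    for instance_id, features in enumerate(instances):
        for feature in set(features):
            index[feature].append(instance_id)
    return dict(index)


def instances_with_all(index, features):
    """Return the instances that contain every given feature by intersecting hitlists."""
    hitlists = [set(index.get(feature, ())) for feature in features]
    if not hitlists:
        return []
    return sorted(set.intersection(*hitlists))


# Hypothetical toy data: three instances, each a set of feature strings.
instances = [
    {"query:tree", "domain:trees.com"},
    {"query:tree", "domain:example.org"},
    {"query:football", "domain:trees.com"},
]
index = build_feature_index(instances)
matching = instances_with_all(index, ["query:tree", "domain:trees.com"])
```

Because instance ids are appended in increasing order, each hitlist stays sorted, which would also allow a merge-style intersection instead of the set intersection used here.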
- Other information may also be determined for a given (u, q, d). This information may include, for example, the position that document d was provided within search results presented to user u for query q, the number of documents above document d that were selected by user u for query q, and a score (“old score”) that was assigned to document d for query q. The old score may have been assigned by search engine 125 or by another search engine.
- a ranking model may be created from this data.
- the model uses the data in repository 220 as a way of evaluating how good the model is.
- the model may include rules that maximize the log likelihood of the data in repository 220 .
- the general idea of the model is that given a new (u, q, d), the model may predict whether user u will select a particular document d for query q. As will be described in more detail below, this information may be used to rank document d for query q and user u.
- FIG. 4 is a flowchart of exemplary processing for generating a ranking model according to an implementation consistent with the principles of the invention. This processing may be performed by a single device 210 or a combination of multiple devices 210 .
- a prior probability of selection may be determined. As described above, the following information may be determined for any given instance (u, q, d): the position that document d occupied within the search results presented to user u for query q; the old score that was assigned to document d for query q; and the number of documents above document d that were selected by user u for query q. Based on this information, a function may be created that maps from these three values to a probability of selection: P(select | position, old score, number of selections above).
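The text does not say how the mapping from position, old score, and number of selections above to a probability is built. One plausible sketch, assuming a simple smoothed frequency table, is shown below. The score bucketing, the add-one smoothing, and all names are assumptions made for illustration.

```python
from collections import defaultdict


def fit_prior(records, score_bucket=0.25):
    """Build a prior from (position, old_score, selections_above, selected) records.

    Assumption: old scores are grouped into buckets of width `score_bucket`
    so that similar scores share one estimate.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [times selected, times shown]
    for position, old_score, selections_above, selected in records:
        key = (position, int(old_score / score_bucket), selections_above)
        counts[key][0] += int(selected)
        counts[key][1] += 1

    def prior(position, old_score, selections_above):
        key = (position, int(old_score / score_bucket), selections_above)
        selected, shown = counts.get(key, (0, 0))
        # Add-one smoothing keeps the estimate strictly between 0 and 1.
        return (selected + 1) / (shown + 2)

    return prior


# Hypothetical toy records.
records = [
    (1, 0.9, 0, True),
    (1, 0.9, 0, True),
    (1, 0.9, 0, False),
    (2, 0.5, 1, False),
]
prior = fit_prior(records)
```

Keeping the estimate away from exactly 0 or 1 matters if the prior is later converted to log odds, which is undefined at those endpoints.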
- a set of instances based on the same or a different set of instances may be used as “training data” D.
- each instance in the training data D may be described by its features f 0 , f 1 , . . . , f n
- f 0 may be the feature corresponding to “the word ‘tree’ appears in the query.”
- the feature f 0 may include a boolean value, such that if “tree” appears in query q then the value of f 0 is one, otherwise the value of f 0 is zero.
- the features may include discrete values. It may be assumed that many of the features will have values of zero. Accordingly, a sparse representation for the features of each instance may be used. In this case, each instance may store only features that have non-zero values.
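A sparse representation of this kind might look like the following sketch, in which an instance stores only its non-zero features in a dictionary. The feature naming scheme and the input shapes (a user dictionary with a "country" key, a document dictionary with a "text" key) are hypothetical; only a handful of the features listed earlier are shown.

```python
def extract_features(user, query, document):
    """Return a sparse feature dict for one (u, q, d): absent keys mean zero."""
    words = query.lower().split()
    doc_words = document["text"].lower().split()

    features = {
        "user_country:" + user["country"]: 1,
        "query_exact:" + query.lower(): 1,
        "query_num_words": len(words),
        # Number of times a query word occurs in the document.
        "query_doc_word_matches": sum(doc_words.count(w) for w in set(words)),
    }
    for word in words:
        features["query_word:" + word] = 1  # boolean feature stored as 1

    # Drop zero-valued entries so that only non-zero features are stored.
    return {name: value for name, value in features.items() if value}


# Hypothetical toy instance.
features = extract_features(
    {"country": "US"},
    "oak tree",
    {"text": "The oak is a tree and the tree grows"},
)
```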
- a “condition” C is a conjunction of features and possibly their complements. For example, a condition that includes two features is: “tree” is in query q and the domain of document d is “trees.com,” and a condition that includes a feature and a complement of a feature is: “football” is in query q and the user did not provide the query from “www.google.co.uk.” For any instance (u, q, d), the value of its features may determine the set of conditions C that apply.
- a “rule” is a condition C and a weight w, represented as (C, w).
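Conditions and rules as defined above can be modeled with two small immutable types. The class and field names here are illustrative assumptions; features are represented as strings and an instance as the set of features that are present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """A conjunction of features and, possibly, complements of features."""
    required: frozenset                 # features that must be present
    forbidden: frozenset = frozenset()  # features that must be absent

    def matches(self, features):
        return self.required <= features and not (self.forbidden & features)


@dataclass(frozen=True)
class Rule:
    """A condition paired with its weight, i.e. (C, w)."""
    condition: Condition
    weight: float


# The two example conditions from the text, with hypothetical feature names.
tree_on_trees_com = Condition(frozenset({"query:tree", "domain:trees.com"}))
football_not_uk = Condition(
    required=frozenset({"query:football"}),
    forbidden=frozenset({"source:www.google.co.uk"}),
)
rule = Rule(tree_on_trees_com, 0.7)  # the weight value is arbitrary
```

Making both types frozen keeps them hashable, so conditions can serve as dictionary keys when statistics are collected per condition.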
- the ranking model M may include a set of rules and a prior probability of selection. To generate the model M, the set of conditions C 1 , . . . , C n and the values of the weights w 1 , . . . , w n need to be determined.
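- a function may be created that maps conditions to a probability of selection: P(select | C 1 , . . . , C n , position, old score, number of selections above), stated through a ratio of the form P(select = … | C 1 , . . . , C n , position, old score, number of selections above) / P(select = true | …).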
- processing may start with an empty model M that includes the prior probability of selection.
- a candidate condition C may be selected (act 410 ).
- candidate conditions may be selected from the training data D. For example, for each instance in the training data D, combinations of features that are present in that instance (or complements of these features) may be chosen as candidate conditions. In another implementation, random sets of conditions may be selected as candidate conditions. In yet another implementation, single feature conditions may be considered for candidate conditions. In a further implementation, existing conditions in the model M may be augmented by adding extra features and these augmented conditions may be considered as candidate conditions. In yet other implementations, candidate conditions may be selected in other ways.
- a weight w for condition C may then be estimated (act 420 ).
- the weight w may be estimated by attempting to maximize the log likelihood of the training data D given the model M augmented with rule (C, w); that is, find the weight that maximizes Log P(D | M, (C, w)).
- the likelihood of the training data D may be compared between the current model with the new rule (C, w) and the current model without the new rule (i.e., P(D | M, (C, w)) versus P(D | M)).
- if condition C includes many features, or if the features of condition C are quite rare (e.g., “does the word ‘mahogany’ appear in the query”), then the cost of condition C could be high.
- the rule (C, w) may then be added to the model M if: Log P(D | M, (C, w)) − Log P(D | M) > Cost(C).
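The weight estimation and acceptance test can be sketched as below. The sketch assumes a specific model form that the surviving text does not spell out: a rule's weight is added to the log odds of selection for every instance its condition matches. Under that assumption only matching instances change, so the gain in log likelihood can be computed from them alone. The bisection search, the bounds, and all names are illustrative choices.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(probs, labels, w=0.0):
    """Log likelihood of the matching instances after shifting their log odds by w."""
    total = 0.0
    for p, y in zip(probs, labels):
        q = sigmoid(logit(p) + w)
        total += math.log(q) if y else math.log(1.0 - q)
    return total


def estimate_weight(probs, labels, lo=-20.0, hi=20.0, iterations=60):
    """Find the maximizing w by bisection on the gradient, which decreases in w."""
    def gradient(w):
        return sum(y - sigmoid(logit(p) + w) for p, y in zip(probs, labels))

    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if gradient(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def evaluate_rule(probs, labels, cost):
    """Return (weight, gain, accepted) where accepted means gain > cost."""
    w = estimate_weight(probs, labels)
    gain = log_likelihood(probs, labels, w) - log_likelihood(probs, labels)
    return w, gain, gain > cost


# Hypothetical matching instances: the current model predicts 0.2 for each,
# yet three of the four were actually selected.
probs = [0.2, 0.2, 0.2, 0.2]
labels = [1, 1, 1, 0]
weight, gain, accepted = evaluate_rule(probs, labels, cost=1.0)
```

In this toy case the best weight moves the predicted probability from 0.2 to the observed rate of 0.75, and the resulting gain of roughly 2.8 exceeds the assumed cost of 1.0, so the rule would be added.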
- FIG. 5 is a flowchart of exemplary processing for ranking documents according to an implementation consistent with the principles of the invention. Processing may begin with a user providing one or more search terms as a search query for searching a document corpus.
- the document corpus is the Internet and the vehicle for searching this corpus is a search engine, such as search engine 125 ( FIG. 1 ).
- the user may provide the search query to search engine 125 via web browser software on a client, such as client 110 ( FIG. 1 ).
- Search engine 125 may receive the search query and act upon it to identify documents (e.g., web pages) related to the search query (acts 510 and 520 ).
- One such technique might include identifying documents that contain the one or more search terms as a phrase.
- Another technique might include identifying documents that contain the one or more search terms, but not necessarily together.
- Other techniques might include identifying documents that contain less than all of the one or more search terms, or synonyms of the one or more search terms. Yet other techniques are known to those skilled in the art.
- Search engine 125 may then score the documents based on the ranking model described above (act 530 ). With regard to each document, search engine 125 may identify a new instance (u, q, d) that corresponds to this user search, where u refers to the user, q refers to the search query provided by the user, and d refers to the document under consideration. Search engine 125 may extract the features from the new instance and determine which rules of the ranking model apply. Search engine 125 may then combine the weight of each rule with the prior probability of selection for (u, q, d) to determine the final posterior probability of the user u selecting this document d for query q. Search engine 125 may use the final posterior probability as the score for the document. Alternatively, search engine 125 might use the final posterior probability as one of multiple factors in determining the score of the document.
- Search engine 125 may sort the documents based on their scores (act 540 ). Search engine 125 may then formulate search results based on the sorted documents (act 550 ).
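Scoring and sorting might be sketched as follows. The text says only that the rule weights are combined with the prior probability; adding the weights to the prior's log odds is an assumption made here, as are the function names and the reduction of a rule to a (required features, weight) pair.

```python
import math


def posterior(prior, weights):
    """Combine a prior probability with rule weights by adding them in log-odds space."""
    z = math.log(prior / (1.0 - prior)) + sum(weights)
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates, rules):
    """Order documents by posterior probability of selection, highest first.

    candidates: list of (doc_id, features, prior)
    rules: list of (required_features, weight)
    """
    scored = []
    for doc_id, features, prior in candidates:
        applicable = [w for required, w in rules if required <= features]
        scored.append((posterior(prior, applicable), doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Hypothetical rules and candidate documents for one user and query.
rules = [
    ({"query:tree", "domain:trees.com"}, 1.5),
    ({"query:tree"}, -0.2),
]
candidates = [
    ("a", {"query:tree", "domain:example.org"}, 0.3),
    ("b", {"query:tree", "domain:trees.com"}, 0.2),
]
ordering = rank(candidates, rules)
```

Document "b" starts with the lower prior but matches the strongly positive rule, so it ends up ranked above document "a".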
- the search results may include references to the documents, such as links to the documents and possibly a textual description of the links.
- the search results may include the documents themselves. In yet other implementations, the search results may take other forms.
- Search engine 125 may provide the search results as a HyperText Markup Language (HTML) document, similar to search results provided by conventional search engines. Alternatively, search engine 125 may provide the search results according to a protocol agreed upon by search engine 125 and client 110 (e.g., Extensible Markup Language (XML)).
- Search engine 125 may further provide information concerning the user, the query provided by the user, and the documents provided to the user to help improve the ranking model.
- server 120 may store this information in repository 220 or provide it to one of devices 210 to be used as training data for training the model.
- multiple devices 210 may be configured as a distributed system.
- devices 210 may be capable of communicating with each other and with repository 220 , as illustrated in FIG. 2 .
- each device 210 may be responsible for a subset of the instances within repository 220 .
- Each device 210 may possibly store its subset of instances in local memory.
- Each device 210 may also build its own feature-to-instance index for its subset of instances. As described previously, the feature-to-instance index may facilitate fast identification of correspondence between features and instances by linking features to the instances in which they are included.
- One or more devices 210 may identify candidate conditions to be tested.
- a device 210 “DV” has a candidate condition C that it wants to test (e.g., for which it wants to determine a weight).
- Device DV may send out a “stats” request to other devices 210 asking for statistics for condition C.
- device DV may broadcast the stats request that includes (or identifies) the condition C for which device DV desires statistics.
- the other devices 210 may receive condition C and use their own feature-to-instance index to find their instances that satisfy condition C. For example, the receiving devices 210 may identify the features making up condition C. Assume, for example, that condition C includes a combination of features fi and fj. The receiving devices 210 may then determine the set of instances that contain those features by intersecting the hitlist for features fi and fj.
- the receiving device 210 may then compute statistics about predicted probabilities for those instances. Each receiving device 210 may then send the statistics for the set of matching instances back to device DV. Device DV may use the statistics to estimate a weight for condition C. Device DV may, as described above, determine whether to add the rule containing condition C to the model. If device DV decides to add the rule containing condition C to the model, it may send an “update” request to all of the other devices 210 informing them of the new rule (i.e., the new condition and its weight).
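The stats request can be illustrated with an in-process simulation in which each device is an object holding its own subset of instances and its own index. The particular statistics gathered (match count, observed selections, and summed predicted probability) are an assumption, since the text does not enumerate them; no real networking is shown.

```python
class Device:
    """One device holding a subset of instances and its own feature-to-instance index."""

    def __init__(self, instances):
        # instances: list of (features, predicted_probability, selected)
        self.instances = instances
        self.index = {}
        for instance_id, (features, _, _) in enumerate(instances):
            for feature in features:
                self.index.setdefault(feature, set()).add(instance_id)

    def stats(self, condition):
        """Answer a "stats" request for a condition given as a list of features."""
        hitlists = [self.index.get(feature, set()) for feature in condition]
        matches = set.intersection(*hitlists) if hitlists else set()
        return {
            "matches": len(matches),
            "selected": sum(self.instances[i][2] for i in matches),
            "predicted": sum(self.instances[i][1] for i in matches),
        }


def gather_stats(devices, condition):
    """What the requesting device does: sum the replies from every device."""
    total = {"matches": 0, "selected": 0, "predicted": 0.0}
    for device in devices:
        for key, value in device.stats(condition).items():
            total[key] += value
    return total


# Hypothetical data split across two devices.
device_1 = Device([({"a", "b"}, 0.2, 1), ({"a"}, 0.1, 0)])
device_2 = Device([({"a", "b"}, 0.4, 0), ({"b"}, 0.3, 1)])
totals = gather_stats([device_1, device_2], ["a", "b"])
```

Only small summaries cross device boundaries in this arrangement; the instances themselves stay where they are stored.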
- multiple devices 210 may be used to further increase throughput and capacity. That is, this implementation can handle more total instances and test more conditions per second. It does this by first generating a collection of conditions to test, and then optimizing them in a batch. It is desirable, however, to be able to update the model everywhere immediately when a new rule is added since correlated conditions cannot be considered in isolation without introducing unacceptable weight oscillations that prevent convergence. Therefore, implementations consistent with the principles of the invention may rapidly feed back the effects of accepted rules into the statistics used to decide the weight of future rules.
- the model generation process may be divided into iterations. Rules may be tested or have their weights optimized once per iteration. Each iteration may be broken into two phases: a candidate rule generation phase and a rule testing and optimization phase.
- the rule testing and optimization phase may determine the weights for conditions generated in the candidate rule generation phase, and accept rules into the model if their benefit (difference in log likelihood) exceeds their cost.
- candidate conditions might include all conditions with one feature, all conditions with two features that co-occur in some instance, and all extensions of existing rules by one feature (where the combination is in some instance). It may be beneficial to consider only those conditions that match some minimum number of instances. There are a couple of strategies for accomplishing this: identify conditions that appear multiple times in some fraction of the instances (divided among all of devices 210 and then summed), or identify conditions that appear multiple times on one device 210 and then gather the results together to remove duplicates. As a further optimization, extensions of only those rules added in the last iteration may be determined and added to the candidate rules generated in previous iterations.
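Candidate generation with a minimum match count can be sketched as below, covering single-feature conditions and pairs of features that co-occur in some instance. Extensions of existing rules are omitted for brevity, and the function name and threshold value are hypothetical.

```python
from collections import Counter
from itertools import combinations


def candidate_conditions(instances, min_count=2):
    """Return one- and two-feature conditions matching at least `min_count` instances."""
    counts = Counter()
    for features in instances:
        ordered = sorted(features)
        for feature in ordered:
            counts[(feature,)] += 1
        # Every pair of features that co-occurs in this instance.
        for pair in combinations(ordered, 2):
            counts[pair] += 1
    return sorted(c for c, n in counts.items() if n >= min_count)


# Hypothetical toy instances, each a set of features.
instances = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"d"}]
candidates = candidate_conditions(instances)
```

Sorting the features before pairing them gives each pair one canonical form, so ("a", "b") and ("b", "a") are counted as the same condition.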
- the conditions to be tested may be distributed to every device 210 .
- Each device 210 may analyze its share of the instances to identify candidate conditions that match each instance. For example, devices 210 may determine matching conditions for each instance by looking up the features in the instance in a tree data structure. Devices 210 may record information concerning the matching conditions and instances as (condition, instance number) pairs. Each device 210 may sort the (condition, instance number) pairs and use them to iterate through the conditions to collect statistics about predicted probabilities from the matching instances.
- devices 210 may send the statistics associated with the condition to a device 210 designated to handle that condition (e.g., a device 210 selected, for instance, based on a hash of the condition).
- a device 210 designated to handle that condition e.g., a device 210 selected, for instance, based on a hash of the condition.
- that device 210 may determine the optimal weight of the rule and determine whether the rule should be added to the model M, as described above.
- the result may be broadcast to all devices 210 (or to those devices that sent statistics).
- the output of this phase is new weights for all existing rules (possibly zero if the rule is discarded) and a list of new rules.
- Systems and methods consistent with the principles of the invention may rank search results based on a ranking model that may be generated based, at least in part, on prior information retrieval data, such as data relating to users, queries previously provided by these users, documents retrieved based on these queries, and which of these documents were selected and not selected in relation to these queries.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Computational Mathematics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Algebra (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A system ranks documents based, at least in part, on a ranking model. The ranking model may be generated to predict the likelihood that a document will be selected. The system may receive a search query and identify documents relating to the search query. The system may then rank the documents based, at least in part, on the ranking model and form search results for the search query from the ranked documents.
Description
This application is a continuation of, and claims priority to, pending U.S. patent application Ser. No. 12/777,939, filed on May 11, 2010, entitled “Ranking Documents Based on Large Data Sets,” which is a continuation of, and claims priority to, U.S. patent application Ser. No. 11/736,872, filed on Apr. 18, 2007, entitled “Ranking Documents Based on Large Data Sets,” which is a continuation of, and claims priority to, U.S. patent application Ser. No. 10/706,991, filed on Nov. 14, 2003, entitled “Ranking Documents Based on Large Data Sets.” The disclosure of the foregoing applications is incorporated herein by reference in its entirety.
The present invention relates generally to information retrieval and, more particularly, to systems and methods for creating a ranking function from large data sets and using the ranking function to rank documents.
The World Wide Web (“web”) contains a vast amount of information. Locating a desired portion of the information, however, can be challenging. This problem is compounded because the amount of information on the web and the number of new users inexperienced at web searching are growing rapidly.
Search engines attempt to return hyperlinks to web documents in which a user is interested. Generally, search engines base their determination of the user's interest on search terms (called a search query) entered by the user. The goal of the search engine is to provide links to high quality, relevant results to the user based on the search query. Typically, the search engine accomplishes this by matching the terms in the search query to a corpus of pre-stored web documents. Web documents that contain the user's search terms are “hits” and are returned to the user. The search engine oftentimes ranks the documents using a ranking function based on the documents' perceived relevance to the user's search terms. Determining a document's relevance can be a tricky problem.
Accordingly, there is a need for systems and methods that improve the determination of a document's relevance.
Systems and methods, consistent with the principles of the invention, create a ranking function based, at least in part, on prior information retrieval data, such as query data, user information, and document information, and use the ranking function to rank (or score) documents.
In accordance with one aspect consistent with the principles of the invention, a method for ranking documents is provided. The method may include creating a ranking model that predicts a likelihood that a document will be selected and training the ranking model using a data set that includes tens of millions of instances. The method may also include identifying documents relating to a search query, scoring the documents based, at least in part, on the ranking model, and forming search results for the search query from the scored documents.
According to another aspect, a system for ranking documents is provided. The system may include a repository and a server. The repository may store information corresponding to multiple prior searches. The server may receive a search query from a user, identify documents corresponding to the search query, and rank the identified documents based, at least in part, on a ranking model that includes rules that maximize a likelihood of the repository.
According to yet another aspect, a method for generating a model is provided. The method may include selecting candidate conditions from training data, estimating weights for the candidate conditions, and forming new rules from the candidate conditions and corresponding ones of the weights. The method may also include comparing a likelihood of the training data between a model with the new rules and the model without the new rules and selectively adding the new rules to the model based, at least in part, on results of the comparing.
According to a further aspect, a method for ranking documents is provided. The method may include receiving a search query, identifying documents relating to the search query, and determining prior probabilities of selecting each of the documents. The method may also include determining a score for each of the documents based, at least in part, on the prior probability of selecting the document and generating search results for the search query from the scored documents.
According to another aspect, a system for generating a model is provided. The system may include a repository and multiple devices. The repository may store training data that includes instances and features, where each of the features corresponds to one or more of the instances. At least one of the devices may create a feature-to-instance index that maps features to instances to which the features correspond, select a candidate condition, request information associated with the candidate condition from other ones of the devices, receive the requested information from the other devices, estimate a weight for the candidate condition based, at least in part, on the requested information, form a new rule from the candidate condition and the weight, and selectively add the new rule to the model.
According to yet another aspect, a system for generating a model is provided. The system includes a repository and multiple devices. The repository may store multiple instances. At least one of the devices may analyze a subset of the instances to identify matching candidate conditions, analyze the candidate conditions to collect statistics regarding predicted probabilities from the matching instances, gather statistics regarding one of the candidate conditions from other ones of the devices, determine a weight associated with the one candidate condition based on at least one of the collected statistics and the gathered statistics, form a rule from the one candidate condition and the weight, and selectively add the rule to the model.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, explain the invention. In the drawings,
The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention.
Systems and methods consistent with the principles of the invention may generate a ranking model based, at least in part, on prior information retrieval data, such as data relating to users, queries previously provided by these users, documents retrieved based on these queries, and documents that were selected and not selected in relation to these queries. The systems and methods may use this ranking model as part of a ranking function to rank documents.
In an implementation consistent with the principles of the invention, server 120 may optionally include a search engine 125 usable by clients 110. Server 120 may crawl documents (e.g., web pages) and store information associated with these documents in a repository of crawled documents. Servers 130 and 140 may store or maintain documents that may be crawled by server 120. While servers 120-140 are shown as separate entities, it may be possible for one or more of servers 120-140 to perform one or more of the functions of another one or more of servers 120-140. For example, it may be possible that two or more of servers 120-140 are implemented as a single server. It may also be possible that a single one of servers 120-140 is implemented as multiple, possibly distributed, devices.
Device(s) 210 may include any type of computing device capable of accessing repository 220 via any type of connection mechanism. According to one implementation consistent with the principles of the invention, system 200 may include multiple devices 210. According to another implementation, system 200 may include a single device 210. Device(s) 210 may correspond to one or more of servers 120-140.
Input device(s) 360 may include one or more conventional mechanisms that permit an operator to input information to device 300, such as a keyboard, a mouse, a pen, voice recognition and/or biometric mechanisms, etc. Output device(s) 370 may include one or more conventional mechanisms that output information to the operator, including a display, a printer, a speaker, etc. Communication interface 380 may include any transceiver-like mechanism that enables device 300 to communicate with other devices and/or systems.
As will be described in detail below, device 300, consistent with the principles of the invention, may perform certain data-related operations. Device 300 may perform these operations in response to processor 320 executing software instructions contained in a computer-readable medium, such as memory 330. A computer-readable medium may be defined as one or more physical or logical memory devices and/or carrier waves.
The software instructions may be read into memory 330 from another computer-readable medium, such as data storage device 350, or from another device via communication interface 380. The software instructions contained in memory 330 cause processor 320 to perform processes that will be described later. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the principles of the invention. Thus, implementations consistent with the principles of the invention are not limited to any specific combination of hardware circuitry and software.
For purposes of the discussion to follow, the set of data in repository 220 (FIG. 2) may include multiple elements, called instances. It may be possible for repository 220 to store more than 50 million instances. Each instance may include a triple of data: (u, q, d), where u refers to user information, q refers to query data provided by the user, and d refers to document information relating to documents retrieved as a result of the query data and which documents the user selected and did not select.
Several features may be extracted for any given (u, q, d). These features might include one or more of the following: the country in which user u is located, the time of day that user u provided query q, the language of the country in which user u is located, each of the previous three queries that user u provided, the language of query q, the exact string of query q, the word(s) in query q, the number of words in query q, each of the words in document d, each of the words in the Uniform Resource Locator (URL) of document d, the top level domain in the URL of document d, each of the prefixes of the URL of document d, each of the words in the title of document d, each of the words in the links pointing to document d, each of the words in the title of the documents shown above and below document d for query q, the number of times a word in query q matches a word in document d, the number of times user u has previously accessed document d, and other information. In one implementation, repository 220 may store more than 5 million distinct features.
To facilitate fast identification of correspondence between features and instances, a feature-to-instance index may be generated that links features to the instances in which they are included. For example, for a given feature f, the set of instances that contain that feature may be listed. The list of instances for a feature f is called the “hitlist for feature f.” Thereafter, given a set of features f0, . . . , fn, the set of instances that contain those features can be determined by intersecting the hitlist for each of the features f0, . . . , fn.
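By way of illustration only, the feature-to-instance index and the hitlist intersection described above might be sketched as follows. The function names, the feature strings, and the in-memory dictionary layout are assumptions made for this example and are not taken from the specification.

```python
from collections import defaultdict


def build_hitlists(instances):
    """Map each feature to the ascending list of instance numbers containing it."""
    hitlists = defaultdict(list)
    for instance_number, features in enumerate(instances):
        for feature in features:
            hitlists[feature].append(instance_number)
    return hitlists


def instances_matching(hitlists, features):
    """Intersect the hitlists of the given features, starting with the shortest."""
    lists = sorted((hitlists.get(f, []) for f in features), key=len)
    if not lists:
        return []
    result = set(lists[0])
    for hitlist in lists[1:]:
        result &= set(hitlist)
    return sorted(result)


# Hypothetical instances, each reduced to its set of non-zero features.
instances = [
    {"query_word:tree", "doc_domain:trees.com"},
    {"query_word:tree", "user_country:UK"},
    {"query_word:football", "user_country:UK"},
    {"query_word:tree", "doc_domain:trees.com", "user_country:UK"},
]
hitlists = build_hitlists(instances)
matches = instances_matching(hitlists, ["query_word:tree", "doc_domain:trees.com"])
```

Starting the intersection from the shortest hitlist keeps the intermediate result small, which may matter when one feature is common and another is rare.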
Other information may also be determined for a given (u, q, d). This information may include, for example, the position that document d was provided within search results presented to user u for query q, the number of documents above document d that were selected by user u for query q, and a score (“old score”) that was assigned to document d for query q. The old score may have been assigned by search engine 125 or by another search engine.
A ranking model may be created from this data. The model uses the data in repository 220 as a way of evaluating how good the model is. The model may include rules that maximize the log likelihood of the data in repository 220. The general idea of the model is that given a new (u, q, d), the model may predict whether user u will select a particular document d for query q. As will be described in more detail below, this information may be used to rank document d for query q and user u.
To facilitate generation of the ranking model, a prior probability of selection may be determined. As described above, the following information may be determined for any given instance (u, q, d): the position that document d occupied within the search results presented to user u for query q; the old score that was assigned to document d for query q; and the number of documents above document d that were selected by user u for query q. Based on this information, a function may be created that maps from these three values to a probability of selection:
P(select|position,old score,number of selections above).
This “prior” probability of selection may provide the initial probability of document selection without considering any of the features. It uses the position, the old score, and the number of selections of documents above this document.
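The specification does not state how the mapping from these three values to a probability is constructed. One possible sketch, assuming a smoothed frequency table over bucketed values, is shown below. The bucketing step, the smoothing constant, and all names are assumptions made for this example.

```python
from collections import defaultdict


def bucket(position, old_score, selections_above, score_step=0.25):
    """Hypothetical bucketing: group old scores into bins of width score_step."""
    return (position, int(old_score // score_step), selections_above)


def fit_prior(records, smoothing=1.0):
    """Estimate P(select | position, old score, selections above) from counts.

    records: iterable of (position, old_score, selections_above, selected).
    Additive smoothing is an assumption; it keeps unseen buckets at 0.5.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [selected, total]
    for position, old_score, selections_above, selected in records:
        entry = counts[bucket(position, old_score, selections_above)]
        entry[0] += int(selected)
        entry[1] += 1

    def prior(position, old_score, selections_above):
        key = bucket(position, old_score, selections_above)
        selected, total = counts.get(key, (0, 0))
        return (selected + smoothing) / (total + 2.0 * smoothing)

    return prior


records = [
    (1, 0.9, 0, True),
    (1, 0.9, 0, True),
    (1, 0.9, 0, False),
    (2, 0.5, 1, False),
]
prior = fit_prior(records)
```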
A set of instances based on the same or a different set of instances may be used as “training data” D. For each instance (u, q, d) in the training data D, its features (f0, f1, . . . , fn) may be extracted. For example, f0 may be the feature corresponding to “the word ‘tree’ appears in the query.” In this implementation, the feature f0 may include a boolean value, such that if “tree” appears in query q then the value of f0 is one, otherwise the value of f0 is zero. In other implementations, the features may include discrete values. It may be assumed that many of the features will have values of zero. Accordingly, a sparse representation for the features of each instance may be used. In this case, each instance may store only features that have non-zero values.
Therefore, for each instance (u, q, d), the following information is available: its set of features, whether document d was selected by user u for query q, and its prior probability of selection. A “condition” C is a conjunction of features and possibly their complements. For example, a condition that includes two features is: “tree” is in query q and the domain of document d is “trees.com,” and a condition that includes a feature and a complement of a feature is: “football” is in query q and the user did not provide the query from “www.google.co.uk.” For any instance (u, q, d), the value of its features may determine the set of conditions C that apply. A “rule” is a condition C and a weight w, represented as (C, w). The ranking model M may include a set of rules and a prior probability of selection. To generate the model M, the set of conditions C1, . . . , Cn and the values of the weights w1, . . . , wn need to be determined.
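For illustration, conditions and rules might be represented as small immutable records, with each instance stored sparsely as the set of its non-zero features. The class names, field names, and feature strings below are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """A conjunction of features (present) and complements of features (absent)."""
    present: frozenset = frozenset()
    absent: frozenset = frozenset()

    def matches(self, features):
        # Every required feature is in the instance and no excluded feature is.
        return self.present <= features and self.absent.isdisjoint(features)


@dataclass(frozen=True)
class Rule:
    """A rule (C, w): a condition together with its weight."""
    condition: Condition
    weight: float


# The two example conditions from the text, using hypothetical feature strings.
tree_condition = Condition(
    present=frozenset({"query_word:tree", "doc_domain:trees.com"})
)
football_condition = Condition(
    present=frozenset({"query_word:football"}),
    absent=frozenset({"query_source:www.google.co.uk"}),
)

# Sparse instance: only features with non-zero values are stored.
instance_features = {"query_word:football", "user_country:US"}
applicable = [
    c for c in (tree_condition, football_condition) if c.matches(instance_features)
]
```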
Based on this information, a function may be created that maps conditions to a probability of selection:
$$P(\text{select} \mid C_1, \ldots, C_n, \text{position}, \text{old score}, \text{number of selections above}).$$
The posterior probability of a selection given a set of conditions, P(select|C1, . . . , Cn, position, old score, number of selections above), may be determined using the function:
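$$\log\frac{P(\text{select}=\text{false} \mid C_1, \ldots, C_n, \text{position}, \text{old score}, \text{number of selections above})}{P(\text{select}=\text{true} \mid C_1, \ldots, C_n, \text{position}, \text{old score}, \text{number of selections above})} = \sum_i \{-w_i\, I(C_i)\} + \log\frac{P(\text{select}=\text{false} \mid \text{position}, \text{old score}, \text{number of selections above})}{P(\text{select}=\text{true} \mid \text{position}, \text{old score}, \text{number of selections above})},$$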
where I(Ci)=0 if Ci=false, and I(Ci)=1 if Ci=true.
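As a worked illustration of the function above, the posterior probability of selection may be computed from the prior probability, the weights, and the indicator values. The function name and argument layout are assumptions made for this example.

```python
import math


def posterior_selection_probability(prior, weights, indicators):
    """Return P(select=true | conditions, prior information).

    prior: P(select=true | position, old score, number of selections above).
    weights: w_i for each condition C_i.
    indicators: I(C_i), 1 if C_i is true for the instance and 0 otherwise.
    """
    # Left-hand side of the equation: log of P(false) / P(true).
    log_odds_false = math.log((1.0 - prior) / prior)
    log_odds_false -= sum(w * i for w, i in zip(weights, indicators))
    # Convert log(P(false) / P(true)) back into P(true).
    return 1.0 / (1.0 + math.exp(log_odds_false))


# With a prior of 0.2 and one matching condition of weight log(4),
# the two log terms cancel and the posterior is 0.5.
p = posterior_selection_probability(0.2, [math.log(4.0)], [1])
```

A positive weight on a matching condition raises the probability of selection, a negative weight lowers it, and a condition that does not match leaves the prior unchanged.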
To generate the model M, processing may start with an empty model M that includes the prior probability of selection. A candidate condition C may be selected (act 410). In one implementation, candidate conditions may be selected from the training data D. For example, for each instance in the training data D, combinations of features that are present in that instance (or complements of these features) may be chosen as candidate conditions. In another implementation, random sets of conditions may be selected as candidate conditions. In yet another implementation, single feature conditions may be considered for candidate conditions. In a further implementation, existing conditions in the model M may be augmented by adding extra features and these augmented conditions may be considered as candidate conditions. In yet other implementations, candidate conditions may be selected in other ways.
A weight w for condition C may then be estimated (act 420). The weight w may be estimated by attempting to maximize the log likelihood of the training data D given the model M augmented with rule (C, w)—that is, find the weight that maximizes Log P(D|M, (C, w)), where “M, (C, w)” denotes the model M with rule (C, w) added if condition C is not already part of the model M, and w is the weight for condition C.
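The specification does not name a particular optimization procedure for act 420. One possibility, shown here as a sketch only, is Newton's method on the single weight w, using only the instances that match condition C, since the likelihood of non-matching instances does not depend on w. The function names and the fixed iteration count are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def estimate_weight(base_log_odds, selected, iterations=25):
    """Estimate the weight w that maximizes the log likelihood of matching instances.

    base_log_odds: log(P(true) / P(false)) under the current model M, one value
        per instance matching condition C.
    selected: 1 if the document was selected in that instance, else 0.
    """
    w = 0.0
    for _ in range(iterations):
        probs = [sigmoid(b + w) for b in base_log_odds]
        # First derivative of the log likelihood with respect to w.
        gradient = sum(y - p for y, p in zip(selected, probs))
        # Negated second derivative; always non-negative.
        curvature = sum(p * (1.0 - p) for p in probs)
        if curvature < 1e-12:
            break
        w += gradient / curvature
    return w


# Four matching instances, each predicted at 0.5 by the current model,
# of which three were selected. The optimum satisfies sigmoid(w) = 0.75.
w = estimate_weight([0.0, 0.0, 0.0, 0.0], [1, 1, 1, 0])
```

If every matching instance was selected (or none was), the optimal weight is unbounded; a practical implementation would presumably cap or regularize the weight, which this sketch does not do.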
The likelihood of the training data D may be compared between the current model with the new rule (C, w) and the current model without the new rule (i.e., P(D|M, (C,w)) vs. P(D|M)) (acts 430 and 440). If P(D|M, (C,w)) is sufficiently greater than P(D|M), then the new rule (C, w) is added to the model M (act 450). A penalty or “Cost” for each condition C may be used to aid in the determination of whether P(D|M, (C,w)) is sufficiently greater than P(D|M). For example, if condition C includes many features, or if the features of condition C are quite rare (e.g., “does the word ‘mahogany’ appear in the query”), then the cost of condition C could be high. The rule (C,w) may then be added to the model M if:
Log {P(D|M,(C,w))}−Log {P(D|M)}>Cost(C).
If P(D|M, (C,w)) is not sufficiently greater than P(D|M), then the new rule (C, w) is discarded (i.e., not added to the model M), possibly by changing its weight to zero (act 460). In either event, processing may then return to act 410, where the next candidate condition is selected. Processing may continue for a predetermined number of iterations or until all candidate conditions have been considered.
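The acceptance test of acts 430-460 may be sketched as follows. For brevity, a condition is a frozenset of required features (complements are omitted), and the cost is passed in as a number, because the specification does not give a formula for Cost(C). All names are assumptions made for this example.

```python
import math


def predict(features, prior, rules):
    """P(select=true) for one instance under a list of (condition, weight) rules."""
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += sum(w for condition, w in rules if condition <= features)
    return 1.0 / (1.0 + math.exp(-log_odds))


def log_likelihood(data, rules):
    """Log P(D | M) for data given as (features, selected, prior) triples."""
    total = 0.0
    for features, selected, prior in data:
        p = predict(features, prior, rules)
        total += math.log(p if selected else 1.0 - p)
    return total


def consider_rule(data, rules, condition, weight, cost):
    """Add rule (C, w) only if Log P(D|M,(C,w)) - Log P(D|M) > Cost(C)."""
    candidate = rules + [(condition, weight)]
    gain = log_likelihood(data, candidate) - log_likelihood(data, rules)
    if gain > cost:
        return candidate, True
    return rules, False


tree = frozenset({"query_word:tree", "doc_domain:trees.com"})
data = [
    (set(tree), True, 0.2),
    (set(tree), True, 0.2),
    (set(tree), True, 0.2),
    (set(tree), False, 0.2),
    ({"query_word:football"}, False, 0.2),
]
# The same rule is accepted at a low cost and discarded at a high cost.
model_low_cost, accepted_low_cost = consider_rule(data, [], tree, 2.0, cost=1.0)
model_high_cost, accepted_high_cost = consider_rule(data, [], tree, 2.0, cost=5.0)
```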
When the data set within repository 220 becomes very large (e.g., substantially more than a few million instances), multiple devices 210 may be configured as a distributed system. For example, devices 210 may be capable of communicating with each other and with repository 220, as illustrated in FIG. 2.
According to one exemplary implementation of the distributed system, each device 210 may be responsible for a subset of the instances within repository 220. Each device 210 may possibly store its subset of instances in local memory. Each device 210 may also build its own feature-to-instance index for its subset of instances. As described previously, the feature-to-instance index may facilitate fast identification of correspondence between features and instances by linking features to the instances in which they are included.
One or more devices 210 (or possibly all) may identify candidate conditions to be tested. Suppose that a device 210 “DV” has a candidate condition C that it wants to test (e.g., for which it wants to determine a weight). Device DV may send out a “stats” request to other devices 210 asking for statistics for condition C. For example, device DV may broadcast the stats request that includes (or identifies) the condition C for which device DV desires statistics.
The other devices 210 may receive condition C and use their own feature-to-instance index to find their instances that satisfy condition C. For example, the receiving devices 210 may identify the features making up condition C. Assume, for example, that condition C includes a combination of features fi and fj. The receiving devices 210 may then determine the set of instances that contain those features by intersecting the hitlist for features fi and fj.
The receiving device 210 may then compute statistics about predicted probabilities for those instances. Each receiving device 210 may then send the statistics for the set of matching instances back to device DV. Device DV may use the statistics to estimate a weight for condition C. Device DV may, as described above, determine whether to add the rule containing condition C to the model. If device DV decides to add the rule containing condition C to the model, it may send an “update” request to all of the other devices 210 informing them of the new rule (i.e., the new condition and its weight).
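The stats request exchange may be illustrated with an in-process simulation in which each device holds a shard of instances and its own feature-to-instance index. The particular statistics returned (counts and sums over predicted probabilities) and the single Newton step used to estimate the weight are assumptions made for this example; the specification says only that statistics about predicted probabilities are computed.

```python
class Device:
    """Simulated device 210 holding a shard of (features, selected, predicted_p)."""

    def __init__(self, shard):
        self.shard = shard
        self.index = {}  # feature -> set of local instance numbers (hitlist)
        for number, (features, _selected, _p) in enumerate(shard):
            for feature in features:
                self.index.setdefault(feature, set()).add(number)

    def stats(self, condition):
        """Answer a stats request for a condition given as a tuple of features."""
        hit_sets = [self.index.get(feature, set()) for feature in condition]
        matching = set.intersection(*hit_sets) if hit_sets else set()
        result = {"n": 0, "selected": 0, "sum_p": 0.0, "sum_p1p": 0.0}
        for number in matching:
            _features, selected, p = self.shard[number]
            result["n"] += 1
            result["selected"] += int(selected)
            result["sum_p"] += p
            result["sum_p1p"] += p * (1.0 - p)
        return result


def request_stats(devices, condition):
    """Broadcast the condition and sum the statistics returned by each device."""
    total = {"n": 0, "selected": 0, "sum_p": 0.0, "sum_p1p": 0.0}
    for device in devices:
        for key, value in device.stats(condition).items():
            total[key] += value
    return total


def newton_step(total):
    """One Newton update of the weight from the aggregated statistics."""
    return (total["selected"] - total["sum_p"]) / total["sum_p1p"]


tt = {"query_word:tree", "doc_domain:trees.com"}
devices = [
    Device([(tt, True, 0.5), (tt, True, 0.5), ({"query_word:football"}, False, 0.5)]),
    Device([(tt, True, 0.5), (tt, False, 0.5), ({"query_word:tree"}, False, 0.5)]),
]
total = request_stats(devices, ("query_word:tree", "doc_domain:trees.com"))
first_weight_estimate = newton_step(total)
```

Only a few numbers per device cross the network for each request, regardless of how many instances match, which is presumably what makes this arrangement practical for very large data sets.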
According to another exemplary implementation of the distributed system, multiple devices 210 may be used to further increase throughput and capacity. That is, this implementation can handle more total instances and test more conditions per second. It does this by first generating a collection of conditions to test, and then optimizing them in a batch. It is desirable, however, to be able to update the model everywhere immediately when a new rule is added since correlated conditions cannot be considered in isolation without introducing unacceptable weight oscillations that prevent convergence. Therefore, implementations consistent with the principles of the invention may rapidly feed back the effects of accepted rules into the statistics used to decide the weight of future rules.
As before, the model generation process may be divided into iterations. Rules may be tested or have their weights optimized once per iteration. Each iteration may be broken into two phases: a candidate rule generation phase and a rule testing and optimization phase. The rule testing and optimization phase may determine the weights for conditions generated in the candidate rule generation phase, and accept rules into the model if their benefit (difference in log likelihood) exceeds their cost.
As described previously, there are several possible ways of generating candidate conditions in the candidate rule generation phase. For example, candidate conditions might include all conditions with one feature, all conditions with two features that co-occur in some instance, and all extensions of existing rules by one feature (where the combination is in some instance). It may be beneficial to consider only those conditions that match some minimum number of instances. There are a couple of strategies for accomplishing this: identify conditions that appear multiple times in some fraction of the instances (divided among all of devices 210 and then summed), or identify conditions that appear multiple times on one device 210 and then gather the results together to remove duplicates. As a further optimization, extensions of only those rules added in the last iteration may be determined and added to the candidate rules generated in previous iterations.
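The first strategy, in which per-device counts are summed and only conditions that match a minimum number of instances are kept, might be sketched as follows. Conditions are limited here to one or two co-occurring features, and the names and the threshold are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def local_condition_counts(shard, max_size=2):
    """Count, on one device, the instances matched by each small feature combination."""
    counts = Counter()
    for features in shard:
        ordered = sorted(features)
        for size in range(1, max_size + 1):
            for combo in combinations(ordered, size):
                counts[combo] += 1
    return counts


def candidate_conditions(shards, min_instances):
    """Sum the per-device counts and keep conditions seen often enough overall."""
    total = Counter()
    for shard in shards:
        total.update(local_condition_counts(shard))
    return sorted(c for c, n in total.items() if n >= min_instances)


# Two hypothetical shards of instances, each instance given as its feature set.
shards = [
    [{"a", "b"}, {"a"}],
    [{"a", "b"}, {"b", "c"}],
]
candidates = candidate_conditions(shards, min_instances=2)
```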
The conditions to be tested may be distributed to every device 210. Each device 210 may analyze its share of the instances to identify candidate conditions that match each instance. For example, devices 210 may determine matching conditions for each instance by looking up the features in the instance in a tree data structure. Devices 210 may record information concerning the matching conditions and instances as (condition, instance number) pairs. Each device 210 may sort the (condition, instance number) pairs and use them to iterate through the conditions to collect statistics about predicted probabilities from the matching instances.
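The recording and sorting of (condition, instance number) pairs may be illustrated as follows. For simplicity, this sketch finds matching conditions with a subset test rather than the tree data structure mentioned above, and the statistics collected are assumptions made for this example.

```python
from itertools import groupby


def collect_local_stats(shard, conditions):
    """Collect per-condition statistics over one device's share of the instances.

    shard: list of (features, selected, predicted_probability).
    conditions: list of tuples of feature names.
    """
    pairs = []
    for number, (features, _selected, _p) in enumerate(shard):
        for condition in conditions:
            if set(condition) <= features:
                pairs.append((condition, number))
    # Sorting places all pairs for the same condition next to each other.
    pairs.sort()
    stats = {}
    for condition, group in groupby(pairs, key=lambda pair: pair[0]):
        numbers = [number for _condition, number in group]
        stats[condition] = {
            "n": len(numbers),
            "selected": sum(1 for n in numbers if shard[n][1]),
            "sum_p": sum(shard[n][2] for n in numbers),
        }
    return stats


shard = [
    ({"tree", "trees.com"}, True, 0.5),
    ({"tree"}, False, 0.25),
    ({"football"}, False, 0.25),
]
conditions = [("tree",), ("tree", "trees.com"), ("mahogany",)]
local_stats = collect_local_stats(shard, conditions)
```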
For each condition, devices 210 may send the statistics associated with the condition to a device 210 designated to handle that condition (e.g., a device 210 selected, for instance, based on a hash of the condition). When all of the statistics for a given condition have been aggregated on a single device 210, that device 210 may determine the optimal weight of the rule and determine whether the rule should be added to the model M, as described above. The result may be broadcast to all devices 210 (or to those devices that sent statistics). The output of this phase is new weights for all existing rules (possibly zero if the rule is discarded) and a list of new rules.
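Selecting the designated device by a hash of the condition, and aggregating the statistics there, might look like the following sketch. The use of SHA-256, the separator character, and the function names are assumptions made for this example; any hash that every device computes identically would serve.

```python
import hashlib


def owner_of(condition, num_devices):
    """Pick the device designated to handle a condition from a stable hash."""
    key = "\x1f".join(condition).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_devices


def route_stats(per_device_stats, num_devices):
    """Deliver each device's statistics to the owner, which sums them."""
    inboxes = [dict() for _ in range(num_devices)]
    for stats in per_device_stats:
        for condition, values in stats.items():
            inbox = inboxes[owner_of(condition, num_devices)]
            merged = inbox.setdefault(condition, {"n": 0, "selected": 0})
            merged["n"] += values["n"]
            merged["selected"] += values["selected"]
    return inboxes


# Hypothetical statistics reported by three devices for their local instances.
per_device_stats = [
    {("tree",): {"n": 2, "selected": 1}},
    {("tree",): {"n": 3, "selected": 1}, ("football",): {"n": 1, "selected": 0}},
    {("football",): {"n": 4, "selected": 2}},
]
inboxes = route_stats(per_device_stats, num_devices=3)
tree_owner = owner_of(("tree",), 3)
```

Python's built-in `hash()` is deliberately avoided here because it is randomized per process for strings, so different devices would disagree on the owner.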
Systems and methods consistent with the principles of the invention may rank search results based on a ranking model that may be generated based, at least in part, on prior information retrieval data, such as data relating to users, queries previously provided by these users, documents retrieved based on these queries, and which of these documents were selected and not selected in relation to these queries.
The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. For example, while series of acts have been described with regard to FIGS. 4 and 5, the order of the acts may be modified in other implementations consistent with the principles of the invention. Also, non-dependent acts may be performed in parallel. Further, the acts may be modified in other ways. For example, in another exemplary implementation, acts 420-430 of FIG. 4 may be performed in a loop for a number of iterations to settle on a good weight. In the context of multiple devices 210, this may mean that device DV sends out multiple stats requests for a particular condition.
It will also be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the present invention is not limiting of the present invention. Thus, the operation and behavior of the aspects were described without reference to the specific software code—it being understood that one of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.
Claims (20)
1. A computer-implemented method comprising:
receiving, by a distributed search system, a collection of training data comprising a plurality of training instances that each identify a respective first document selected by a particular user when the first document was identified in search results provided by the search system to the particular user in response to a particular search query issued by the particular user;
partitioning the collection of training data over a plurality of computing devices of the distributed search system;
generating, by the distributed search system, a ranking model that produces a likelihood that a particular user will select a particular document when identified by one or more search results provided in response to a particular search query submitted by the particular user, including processing, by each computing device of the plurality of computing devices, training instances assigned to the computing device, including:
selecting, by the computing device, a candidate condition, wherein the candidate condition specifies values for one or more user features, one or more query features, and one or more document features,
sending, by the computing device, to each other computing device of the plurality of computing devices, a request to compute local statistics for the candidate condition,
receiving, by the computing device from each other computing device of one or more other computing devices, respective computed statistics for the candidate condition computed by the other computing device using values of local training instances assigned to the other computing device,
computing, by the computing device, a weight for the candidate condition according to the computed statistics received from the one or more other computing devices for the candidate condition;
determining, by the computing device, that a new rule comprising the candidate condition and the computed weight should be added to the ranking model, and
in response, adding the new rule to the ranking model and providing, by the computing device, to each other computing device of the plurality of computing devices, an indication that the new rule comprising the candidate condition and the computed weight should be added to the ranking model;
receiving a search query submitted by a first user;
obtaining a plurality of search results that satisfy the search query, wherein each search result identifies a respective document of a plurality of documents;
determining one or more features of the first user and one or more features of the search query submitted by the first user;
using the one or more features of the first user and the one or more features of the search query as input to the ranking model to compute, for each document identified by the search results, a respective likelihood that the first user will select the document when provided in response to the search query; and
ranking the plurality of search results based on a respective computed likelihood for each document, the computed likelihood for each document being a likelihood that the first user will select the document when provided in response to the search query.
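By way of illustration only, and not as a limitation of claim 1, the exchange of local statistics recited above can be sketched in Python as follows. The sketch simulates the computing devices in a single process. The names `Example`, `Device`, and `propose_rule`, the smoothed log-odds weight, the inclusion of the proposing device's own statistics, and the thresholds `min_matches` and `min_abs_weight` are assumptions made for the example and do not appear in the claims.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Example:
    """Hypothetical flattened training example for one (user, query, document) triple."""
    features: dict   # user, query, and document feature values
    selected: bool   # True if the user selected the document


@dataclass
class Device:
    """Stands in for one computing device holding a partition of the training data."""
    examples: list
    model: list = field(default_factory=list)  # rules stored as (condition, weight)

    def local_statistics(self, condition):
        """Return (matches, selections) over local examples satisfying the condition."""
        matching = [e for e in self.examples
                    if all(e.features.get(k) == v for k, v in condition.items())]
        return len(matching), sum(e.selected for e in matching)

    def add_rule(self, condition, weight):
        self.model.append((dict(condition), weight))


def propose_rule(proposer, others, condition, min_matches=2, min_abs_weight=0.1):
    """Collect statistics, compute a weight, and broadcast the rule if it is kept."""
    everyone = [proposer, *others]
    # Each device answers the request using only its own partition.
    stats = [device.local_statistics(condition) for device in everyone]
    matches = sum(m for m, _ in stats)
    selections = sum(s for _, s in stats)
    # Assumed weight: smoothed log-odds of selection among matching examples.
    weight = math.log((selections + 1) / (matches - selections + 1))
    # Assumed acceptance test: enough support and a non-negligible weight.
    if matches < min_matches or abs(weight) < min_abs_weight:
        return None
    # Indicate to every device that the new rule should be added to the model.
    for device in everyone:
        device.add_rule(condition, weight)
    return weight


devices = [
    Device([
        Example({"user_lang": "en", "query_term": "jaguar", "doc_host": "cars.example"}, True),
        Example({"user_lang": "en", "query_term": "jaguar", "doc_host": "zoo.example"}, False),
    ]),
    Device([
        Example({"user_lang": "en", "query_term": "jaguar", "doc_host": "cars.example"}, True),
    ]),
]
condition = {"user_lang": "en", "query_term": "jaguar", "doc_host": "cars.example"}
weight = propose_rule(devices[0], devices[1:], condition)
```

In this example the candidate condition matches two examples across the two devices, both of them selections, so the rule is accepted and every device ends up holding the same (condition, weight) pair. A real deployment would replace the in-process method calls with network requests.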
2. The method of claim 1, wherein the one or more features of the first user include a location of the first user, a language of the first user, one or more previous queries issued by the first user, or a number of times the first user has accessed a particular document.
3. The method of claim 1, wherein the one or more features of the search query include a language of the query and one or more terms of the query.
4. The method of claim 1, further comprising:
generating, by each computing device of the plurality of computing devices using local training instances assigned to the computing device, a feature-to-instance index that maps each value of a feature to one or more training instances having the value for the feature;
receiving, by a first computing device of the plurality of computing devices, a request to compute local statistics for the candidate condition;
obtaining, by the first computing device, training instances matching the candidate condition by using one or more features of the candidate condition as input to the feature-to-instance index;
computing local statistics for the candidate condition using matching training instances obtained using the feature-to-instance index; and
providing, by the first computing device, the computed local statistics in response to the request to compute local statistics for the candidate condition.
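By way of illustration only, the feature-to-instance index of claim 4 can be sketched as an inverted index from (feature, value) pairs to local instance identifiers. The function names, the representation of an instance as a `(features, selected)` pair, and the choice of (matches, selections) as the local statistics are assumptions made for the example.

```python
from collections import defaultdict


def build_index(instances):
    """Map each (feature, value) pair to the ids of local instances having that value."""
    index = defaultdict(set)
    for instance_id, (features, _selected) in enumerate(instances):
        for feature, value in features.items():
            index[(feature, value)].add(instance_id)
    return index


def matching_instances(index, condition):
    """Intersect the id sets for every feature value named in the condition.

    An empty condition is treated as matching nothing in this sketch.
    """
    postings = [index.get(item, set()) for item in condition.items()]
    return set.intersection(*postings) if postings else set()


def local_statistics(instances, index, condition):
    """Return (matches, selections) using only instances found through the index."""
    ids = matching_instances(index, condition)
    return len(ids), sum(instances[i][1] for i in ids)


# Hypothetical local partition: (feature values, whether the document was selected).
instances = [
    ({"user_lang": "en", "query_term": "jaguar"}, True),
    ({"user_lang": "en", "query_term": "puma"}, False),
    ({"user_lang": "fr", "query_term": "jaguar"}, True),
]
index = build_index(instances)
```

Looking up a condition through the index touches only the instances that share at least one of its feature values, rather than scanning the whole local partition for every request.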
5. The method of claim 4, wherein each training instance identifies one or more second documents that the particular user did not select when the one or more second documents were identified by the search results provided to the particular user in response to the particular search query.
6. The method of claim 4, wherein each training instance includes data representing a position of the selected first document in an order of the search results provided to the particular user in response to the particular query.
7. The method of claim 4, wherein each training instance includes data representing a previously computed score for the selected first document.
8. The method of claim 4, wherein each training instance comprises data representing a number of documents ranked above the selected first document in the search results provided to the particular user in response to the particular search query.
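By way of illustration only, the contents of a training instance described in claims 5 through 8 can be gathered into a single record. The class name, the field names, the 1-based position, and the helper `from_result_page` are assumptions made for the example; the claims do not prescribe any particular layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingInstance:
    """Hypothetical record for one logged selection."""
    user_features: dict
    query_features: dict
    selected_document: str                 # the "first document"
    unselected_documents: tuple = ()       # "second documents" shown but not selected
    position: Optional[int] = None         # 1-based position of the selected document
    previous_score: Optional[float] = None # score computed when the results were served
    ranked_above: Optional[int] = None     # number of documents ranked above it


def from_result_page(user_features, query_features, results, selected):
    """Build an instance from an ordered list of (document, score) pairs."""
    documents = [doc for doc, _ in results]
    rank = documents.index(selected)  # 0-based; raises ValueError if not shown
    return TrainingInstance(
        user_features=user_features,
        query_features=query_features,
        selected_document=selected,
        unselected_documents=tuple(d for d in documents if d != selected),
        position=rank + 1,
        previous_score=results[rank][1],
        ranked_above=rank,
    )


instance = from_result_page(
    {"user_lang": "en"},
    {"query_term": "jaguar"},
    [("a", 0.9), ("b", 0.7), ("c", 0.4)],
    selected="b",
)
```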
9. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, by a distributed search system, a collection of training data comprising a plurality of training instances that each identify a respective first document selected by a particular user when the first document was identified in search results provided by the search system to the particular user in response to a particular search query issued by the particular user;
partitioning the collection of training data over a plurality of computing devices of the distributed search system;
generating, by the distributed search system, a ranking model that produces a likelihood that a particular user will select a particular document when identified by one or more search results provided in response to a particular search query submitted by the particular user, including processing, by each computing device of the plurality of computing devices, training instances assigned to the computing device, including:
selecting, by the computing device, a candidate condition, wherein the candidate condition specifies values for one or more user features, one or more query features, and one or more document features,
sending, by the computing device, to each other computing device of the plurality of computing devices, a request to compute local statistics for the candidate condition,
receiving, by the computing device from each other computing device of one or more other computing devices, respective computed statistics for the candidate condition computed by the other computing device using values of local training instances assigned to the other computing device,
computing, by the computing device, a weight for the candidate condition according to the computed statistics received from the one or more other computing devices for the candidate condition;
determining, by the computing device, that a new rule comprising the candidate condition and the computed weight should be added to the ranking model, and
in response, adding the new rule to the ranking model and providing, by the computing device, to each other computing device of the plurality of computing devices, an indication that the new rule comprising the candidate condition and the computed weight should be added to the ranking model;
receiving a search query submitted by a first user;
obtaining a plurality of search results that satisfy the search query, wherein each search result identifies a respective document of a plurality of documents;
determining one or more features of the first user and one or more features of the search query submitted by the first user;
using the one or more features of the first user and the one or more features of the search query as input to the ranking model to compute, for each document identified by the search results, a respective likelihood that the first user will select the document when provided in response to the search query; and
ranking the plurality of search results based on a respective computed likelihood for each document, the computed likelihood for each document being a likelihood that the first user will select the document when provided in response to the search query.
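By way of illustration only, the query-time operations recited above (determining user and query features, computing a likelihood for each document, and ranking the search results) can be sketched as follows. The representation of the ranking model as a list of (condition, weight) rules follows the claim language, but the logistic combination of the weights of satisfied rules, and the names `likelihood` and `rank`, are assumptions made for the example.

```python
import math


def likelihood(rules, features):
    """Assumed model form: logistic function of the summed weights of satisfied rules."""
    score = sum(
        weight
        for condition, weight in rules
        if all(features.get(k) == v for k, v in condition.items())
    )
    return 1.0 / (1.0 + math.exp(-score))


def rank(rules, user_features, query_features, results):
    """Order documents by decreasing likelihood of selection.

    results is a list of (document_id, document_features) pairs.
    """
    scored = []
    for doc_id, doc_features in results:
        features = {**user_features, **query_features, **doc_features}
        scored.append((likelihood(rules, features), doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Hypothetical rules, each pairing a condition with a learned weight.
rules = [
    ({"user_lang": "en", "doc_lang": "en"}, 1.2),
    ({"query_term": "jaguar", "doc_topic": "cars"}, 0.8),
    ({"doc_topic": "spam"}, -2.0),
]
results = [
    ("d1", {"doc_lang": "fr", "doc_topic": "cars"}),
    ("d2", {"doc_lang": "en", "doc_topic": "cars"}),
    ("d3", {"doc_lang": "en", "doc_topic": "spam"}),
]
ordering = rank(rules, {"user_lang": "en"}, {"query_term": "jaguar"}, results)
```

Here document `d2` satisfies two positively weighted rules and is ranked first, `d1` satisfies one, and `d3` is pushed last by the negatively weighted rule.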
10. The system of claim 9, wherein the one or more features of the first user include a location of the first user, a language of the first user, one or more previous queries issued by the first user, or a number of times the first user has accessed a particular document.
11. The system of claim 9, wherein the one or more features of the search query include a language of the query and one or more terms of the query.
12. The system of claim 9, wherein the operations further comprise:
generating, by each computing device of the plurality of computing devices using local training instances assigned to the computing device, a feature-to-instance index that maps each value of a feature to one or more training instances having the value for the feature;
receiving, by a first computing device of the plurality of computing devices, a request to compute local statistics for the candidate condition;
obtaining, by the first computing device, training instances matching the candidate condition by using one or more features of the candidate condition as input to the feature-to-instance index;
computing local statistics for the candidate condition using matching training instances obtained using the feature-to-instance index; and
providing, by the first computing device, the computed local statistics in response to the request to compute local statistics for the candidate condition.
13. The system of claim 12, wherein each training instance identifies one or more second documents that the particular user did not select when the one or more second documents were identified by the search results provided to the particular user in response to the particular search query.
14. The system of claim 12, wherein each training instance includes data representing a position of the selected first document in an order of the search results provided to the particular user in response to the particular query.
15. The system of claim 12, wherein each training instance includes data representing a previously computed score for the selected first document.
16. The system of claim 12, wherein each training instance comprises data representing a number of documents ranked above the selected first document in the search results provided to the particular user in response to the particular search query.
17. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving, by a distributed search system, a collection of training data comprising a plurality of training instances that each identify a respective first document selected by a particular user when the first document was identified in search results provided by the search system to the particular user in response to a particular search query issued by the particular user;
partitioning the collection of training data over a plurality of computing devices of the distributed search system;
generating, by the distributed search system, a ranking model that produces a likelihood that a particular user will select a particular document when identified by one or more search results provided in response to a particular search query submitted by the particular user, including processing, by each computing device of the plurality of computing devices, training instances assigned to the computing device, including:
selecting, by the computing device, a candidate condition, wherein the candidate condition specifies values for one or more user features, one or more query features, and one or more document features,
sending, by the computing device, to each other computing device of the plurality of computing devices, a request to compute local statistics for the candidate condition,
receiving, by the computing device from each other computing device of one or more other computing devices, respective computed statistics for the candidate condition computed by the other computing device using values of local training instances assigned to the other computing device,
computing, by the computing device, a weight for the candidate condition according to the computed statistics received from the one or more other computing devices for the candidate condition;
determining, by the computing device, that a new rule comprising the candidate condition and the computed weight should be added to the ranking model, and
in response, adding the new rule to the ranking model and providing, by the computing device, to each other computing device of the plurality of computing devices, an indication that the new rule comprising the candidate condition and the computed weight should be added to the ranking model;
receiving a search query submitted by a first user;
obtaining a plurality of search results that satisfy the search query, wherein each search result identifies a respective document of a plurality of documents;
determining one or more features of the first user and one or more features of the search query submitted by the first user;
using the one or more features of the first user and the one or more features of the search query as input to the ranking model to compute, for each document identified by the search results, a respective likelihood that the first user will select the document when provided in response to the search query; and
ranking the plurality of search results based on a respective computed likelihood for each document, the computed likelihood for each document being a likelihood that the first user will select the document when provided in response to the search query.
18. The computer program product of claim 17, wherein the one or more features of the first user include a location of the first user, a language of the first user, one or more previous queries issued by the first user, or a number of times the first user has accessed a particular document.
19. The computer program product of claim 17, wherein the one or more features of the search query include a language of the query and one or more terms of the query.
20. The computer program product of claim 17, wherein the operations further comprise:
generating, by each computing device of the plurality of computing devices using local training instances assigned to the computing device, a feature-to-instance index that maps each value of a feature to one or more training instances having the value for the feature;
receiving, by a first computing device of the plurality of computing devices, a request to compute local statistics for the candidate condition;
obtaining, by the first computing device, training instances matching the candidate condition by using one or more features of the candidate condition as input to the feature-to-instance index;
computing local statistics for the candidate condition using matching training instances obtained using the feature-to-instance index; and
providing, by the first computing device, the computed local statistics in response to the request to compute local statistics for the candidate condition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/815,736 US10055461B1 (en) | 2003-11-14 | 2015-07-31 | Ranking documents based on large data sets |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/706,991 US7231399B1 (en) | 2003-11-14 | 2003-11-14 | Ranking documents based on large data sets |
US11/736,872 US7743050B1 (en) | 2003-11-14 | 2007-04-18 | Model generation for ranking documents based on large data sets |
US12/777,939 US9116976B1 (en) | 2003-11-14 | 2010-05-11 | Ranking documents based on large data sets |
US14/815,736 US10055461B1 (en) | 2003-11-14 | 2015-07-31 | Ranking documents based on large data sets |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/777,939 Continuation US9116976B1 (en) | 2003-11-14 | 2010-05-11 | Ranking documents based on large data sets |
Publications (1)
Publication Number | Publication Date |
---|---|
US10055461B1 (en) | 2018-08-21 |
Family
ID=38049642
Family Applications (9)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/706,991 Expired - Lifetime US7231399B1 (en) | 2003-11-14 | 2003-11-14 | Ranking documents based on large data sets |
US10/734,584 Expired - Lifetime US7222127B1 (en) | 2003-09-30 | 2003-12-15 | Large scale machine learning systems and methods |
US11/736,193 Expired - Lifetime US7769763B1 (en) | 2003-11-14 | 2007-04-17 | Large scale machine learning systems and methods |
US11/736,872 Expired - Fee Related US7743050B1 (en) | 2003-11-14 | 2007-04-18 | Model generation for ranking documents based on large data sets |
US12/777,939 Expired - Fee Related US9116976B1 (en) | 2003-11-14 | 2010-05-11 | Ranking documents based on large data sets |
US12/822,902 Expired - Fee Related US8195674B1 (en) | 2003-11-14 | 2010-06-24 | Large scale machine learning systems and methods |
US13/487,873 Expired - Fee Related US8364618B1 (en) | 2003-11-14 | 2012-06-04 | Large scale machine learning systems and methods |
US13/751,746 Expired - Lifetime US8688705B1 (en) | 2003-11-14 | 2013-01-28 | Large scale machine learning systems and methods |
US14/815,736 Expired - Fee Related US10055461B1 (en) | 2003-11-14 | 2015-07-31 | Ranking documents based on large data sets |
Family Applications Before (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/706,991 Expired - Lifetime US7231399B1 (en) | 2003-11-14 | 2003-11-14 | Ranking documents based on large data sets |
US10/734,584 Expired - Lifetime US7222127B1 (en) | 2003-09-30 | 2003-12-15 | Large scale machine learning systems and methods |
US11/736,193 Expired - Lifetime US7769763B1 (en) | 2003-11-14 | 2007-04-17 | Large scale machine learning systems and methods |
US11/736,872 Expired - Fee Related US7743050B1 (en) | 2003-11-14 | 2007-04-18 | Model generation for ranking documents based on large data sets |
US12/777,939 Expired - Fee Related US9116976B1 (en) | 2003-11-14 | 2010-05-11 | Ranking documents based on large data sets |
US12/822,902 Expired - Fee Related US8195674B1 (en) | 2003-11-14 | 2010-06-24 | Large scale machine learning systems and methods |
US13/487,873 Expired - Fee Related US8364618B1 (en) | 2003-11-14 | 2012-06-04 | Large scale machine learning systems and methods |
US13/751,746 Expired - Lifetime US8688705B1 (en) | 2003-11-14 | 2013-01-28 | Large scale machine learning systems and methods |
Country Status (1)
Country | Link |
---|---|
US (9) | US7231399B1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11366814B2 (en) * | 2019-06-12 | 2022-06-21 | Elsevier, Inc. | Systems and methods for federated search with dynamic selection and distributed relevance |
Families Citing this family (140)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003012576A2 (en) * | 2001-07-27 | 2003-02-13 | Quigo Technologies Inc. | System and method for automated tracking and analysis of document usage |
US7809710B2 (en) * | 2001-08-14 | 2010-10-05 | Quigo Technologies Llc | System and method for extracting content for submission to a search engine |
US6980820B2 (en) * | 2001-08-20 | 2005-12-27 | Qualcomm Inc. | Method and system for signaling in broadcast communication system |
US6731936B2 (en) * | 2001-08-20 | 2004-05-04 | Qualcomm Incorporated | Method and system for a handoff in a broadcast communication system |
US7730063B2 (en) * | 2002-12-10 | 2010-06-01 | Asset Trust, Inc. | Personalized medicine service |
EP1540514B1 (en) * | 2002-07-23 | 2010-12-22 | Quigo Technologies Inc. | System and method for automated mapping of keywords and key phrases to documents |
US7912485B2 (en) * | 2003-09-11 | 2011-03-22 | Qualcomm Incorporated | Method and system for signaling in broadcast communication system |
US7505964B2 (en) | 2003-09-12 | 2009-03-17 | Google Inc. | Methods and systems for improving a search ranking using related queries |
US7617205B2 (en) | 2005-03-30 | 2009-11-10 | Google Inc. | Estimating confidence for query revision models |
US7231399B1 (en) | 2003-11-14 | 2007-06-12 | Google Inc. | Ranking documents based on large data sets |
US20060184473A1 (en) * | 2003-11-19 | 2006-08-17 | Eder Jeff S | Entity centric computer system |
US20050267872A1 (en) * | 2004-06-01 | 2005-12-01 | Yaron Galai | System and method for automated mapping of items to documents |
US7716225B1 (en) | 2004-06-17 | 2010-05-11 | Google Inc. | Ranking documents based on user behavior and/or feature data |
US9223868B2 (en) | 2004-06-28 | 2015-12-29 | Google Inc. | Deriving and using interaction profiles |
US8570880B2 (en) * | 2004-08-05 | 2013-10-29 | Qualcomm Incorporated | Method and apparatus for receiving broadcast in a wireless multiple-access communications system |
US20060036598A1 (en) * | 2004-08-09 | 2006-02-16 | Jie Wu | Computerized method for ranking linked information items in distributed sources |
US7493320B2 (en) * | 2004-08-16 | 2009-02-17 | Telenor Asa | Method, system, and computer program product for ranking of documents using link analysis, with remedies for sinks |
US7606793B2 (en) | 2004-09-27 | 2009-10-20 | Microsoft Corporation | System and method for scoping searches using index keys |
US7761448B2 (en) | 2004-09-30 | 2010-07-20 | Microsoft Corporation | System and method for ranking search results using click distance |
US7739277B2 (en) * | 2004-09-30 | 2010-06-15 | Microsoft Corporation | System and method for incorporating anchor text into ranking search results |
US7827181B2 (en) | 2004-09-30 | 2010-11-02 | Microsoft Corporation | Click distance determination |
US7792833B2 (en) * | 2005-03-03 | 2010-09-07 | Microsoft Corporation | Ranking search results using language types |
US20060200460A1 (en) * | 2005-03-03 | 2006-09-07 | Microsoft Corporation | System and method for ranking search results using file types |
US7870147B2 (en) * | 2005-03-29 | 2011-01-11 | Google Inc. | Query revision using known highly-ranked queries |
US7962462B1 (en) * | 2005-05-31 | 2011-06-14 | Google Inc. | Deriving and using document and site quality signals from search query streams |
US7596556B2 (en) * | 2005-09-15 | 2009-09-29 | Microsoft Corporation | Determination of useful convergence of static rank |
US7730074B1 (en) | 2005-11-04 | 2010-06-01 | Google Inc. | Accelerated large scale optimization |
US7630964B2 (en) * | 2005-11-14 | 2009-12-08 | Microsoft Corporation | Determining relevance of documents to a query based on identifier distance |
US7769751B1 (en) * | 2006-01-17 | 2010-08-03 | Google Inc. | Method and apparatus for classifying documents based on user inputs |
US7647314B2 (en) * | 2006-04-28 | 2010-01-12 | Yahoo! Inc. | System and method for indexing web content using click-through features |
US7516131B2 (en) * | 2006-05-19 | 2009-04-07 | International Business Machines Corporation | Method and apparatus for ranking-based information processing |
US8032545B2 (en) * | 2006-06-14 | 2011-10-04 | General Electric Company | Systems and methods for refining identification of clinical study candidates |
WO2008027503A2 (en) * | 2006-08-31 | 2008-03-06 | The Regents Of The University Of California | Semantic search engine |
US9110975B1 (en) | 2006-11-02 | 2015-08-18 | Google Inc. | Search result inputs using variant generalized queries |
US8661029B1 (en) | 2006-11-02 | 2014-02-25 | Google Inc. | Modifying search result ranking based on implicit user feedback |
US20080177588A1 (en) * | 2007-01-23 | 2008-07-24 | Quigo Technologies, Inc. | Systems and methods for selecting aesthetic settings for use in displaying advertisements over a network |
US8938463B1 (en) * | 2007-03-12 | 2015-01-20 | Google Inc. | Modifying search result ranking based on implicit user feedback and a model of presentation bias |
US8694374B1 (en) | 2007-03-14 | 2014-04-08 | Google Inc. | Detecting click spam |
US20080243830A1 (en) * | 2007-03-30 | 2008-10-02 | Fatdoor, Inc. | User suggested ordering to influence search result ranking |
US9092510B1 (en) | 2007-04-30 | 2015-07-28 | Google Inc. | Modifying search result ranking based on a temporal element of user feedback |
US8359309B1 (en) | 2007-05-23 | 2013-01-22 | Google Inc. | Modifying search result ranking based on corpus search statistics |
US9349134B1 (en) | 2007-05-31 | 2016-05-24 | Google Inc. | Detecting illegitimate network traffic |
US20090037399A1 (en) * | 2007-07-31 | 2009-02-05 | Yahoo! Inc. | System and Method for Determining Semantically Related Terms |
US8694511B1 (en) | 2007-08-20 | 2014-04-08 | Google Inc. | Modifying search result ranking based on populations |
US8019700B2 (en) | 2007-10-05 | 2011-09-13 | Google Inc. | Detecting an intrusive landing page |
US8909655B1 (en) | 2007-10-11 | 2014-12-09 | Google Inc. | Time based ranking |
US20090100018A1 (en) * | 2007-10-12 | 2009-04-16 | Jonathan Roberts | System and method for capturing, integrating, discovering, and using geo-temporal data |
US7840569B2 (en) * | 2007-10-18 | 2010-11-23 | Microsoft Corporation | Enterprise relevancy ranking using a neural network |
US9348912B2 (en) | 2007-10-18 | 2016-05-24 | Microsoft Technology Licensing, Llc | Document length as a static relevance feature for ranking search results |
US20090171960A1 (en) * | 2008-01-02 | 2009-07-02 | Ziv Katzir | Method and system for context-aware data prioritization |
US8412571B2 (en) * | 2008-02-11 | 2013-04-02 | Advertising.Com Llc | Systems and methods for selling and displaying advertisements over a network |
US20090240539A1 (en) * | 2008-03-21 | 2009-09-24 | Microsoft Corporation | Machine learning system for a task brokerage system |
US20090240549A1 (en) * | 2008-03-21 | 2009-09-24 | Microsoft Corporation | Recommendation system for a task brokerage system |
US8726146B2 (en) | 2008-04-11 | 2014-05-13 | Advertising.Com Llc | Systems and methods for video content association |
US8812493B2 (en) | 2008-04-11 | 2014-08-19 | Microsoft Corporation | Search results ranking using editing distance and document information |
US20090276414A1 (en) * | 2008-04-30 | 2009-11-05 | Microsoft Corporation | Ranking model adaptation for searching |
US8364693B2 (en) * | 2008-06-13 | 2013-01-29 | News Distribution Network, Inc. | Searching, sorting, and displaying video clips and sound files by relevance |
US20100042610A1 (en) * | 2008-08-15 | 2010-02-18 | Microsoft Corporation | Rank documents based on popularity of key metadata |
US8195669B2 (en) * | 2008-09-22 | 2012-06-05 | Microsoft Corporation | Optimizing ranking of documents using continuous conditional random fields |
US20100082697A1 (en) * | 2008-10-01 | 2010-04-01 | Narain Gupta | Data model enrichment and classification using multi-model approach |
US8396865B1 (en) | 2008-12-10 | 2013-03-12 | Google Inc. | Sharing search engine relevance data between corpora |
US8255412B2 (en) * | 2008-12-17 | 2012-08-28 | Microsoft Corporation | Boosting algorithm for ranking model adaptation |
US20100262600A1 (en) * | 2009-04-08 | 2010-10-14 | Dumon Olivier G | Methods and systems for deriving demand metrics used in ordering item listings presented in a search results page |
US9009146B1 (en) | 2009-04-08 | 2015-04-14 | Google Inc. | Ranking search results based on similar queries |
US8447760B1 (en) | 2009-07-20 | 2013-05-21 | Google Inc. | Generating a related set of documents for an initial set of documents |
US8498974B1 (en) | 2009-08-31 | 2013-07-30 | Google Inc. | Refining search results |
US8972391B1 (en) | 2009-10-02 | 2015-03-03 | Google Inc. | Recent interest based relevance scoring |
US8874555B1 (en) | 2009-11-20 | 2014-10-28 | Google Inc. | Modifying scoring data based on historical changes |
US8296292B2 (en) * | 2009-11-25 | 2012-10-23 | Microsoft Corporation | Internal ranking model representation schema |
US8515975B1 (en) | 2009-12-07 | 2013-08-20 | Google Inc. | Search entity transition matrix and applications of the transition matrix |
US8311792B1 (en) * | 2009-12-23 | 2012-11-13 | Intuit Inc. | System and method for ranking a posting |
US8615514B1 (en) | 2010-02-03 | 2013-12-24 | Google Inc. | Evaluating website properties by partitioning user feedback |
US8924379B1 (en) | 2010-03-05 | 2014-12-30 | Google Inc. | Temporal-based score adjustments |
US8959093B1 (en) | 2010-03-15 | 2015-02-17 | Google Inc. | Ranking search results based on anchors |
US8838587B1 (en) | 2010-04-19 | 2014-09-16 | Google Inc. | Propagating query classifications |
US8738635B2 (en) | 2010-06-01 | 2014-05-27 | Microsoft Corporation | Detection of junk in search result ranking |
US8949249B2 (en) * | 2010-06-15 | 2015-02-03 | Sas Institute, Inc. | Techniques to find percentiles in a distributed computing environment |
US9623119B1 (en) | 2010-06-29 | 2017-04-18 | Google Inc. | Accentuating search results |
JP5584914B2 (en) * | 2010-07-15 | 2014-09-10 | 株式会社日立製作所 | Distributed computing system |
US8832083B1 (en) | 2010-07-23 | 2014-09-09 | Google Inc. | Combining user feedback |
US8392290B2 (en) * | 2010-08-13 | 2013-03-05 | Ebay Inc. | Seller conversion factor to ranking score for presented item listings |
US9002867B1 (en) | 2010-12-30 | 2015-04-07 | Google Inc. | Modifying ranking data based on document changes |
US8612368B2 (en) * | 2011-03-01 | 2013-12-17 | International Business Machines Corporation | Systems and methods for processing machine learning algorithms in a MapReduce environment |
US9483484B1 (en) * | 2011-05-05 | 2016-11-01 | Veritas Technologies Llc | Techniques for deduplicated data access statistics management |
US8706550B1 (en) | 2011-07-27 | 2014-04-22 | Google Inc. | External-signal influence on content item performance |
US9015174B2 (en) * | 2011-12-16 | 2015-04-21 | Microsoft Technology Licensing, Llc | Likefarm determination |
US9495462B2 (en) | 2012-01-27 | 2016-11-15 | Microsoft Technology Licensing, Llc | Re-ranking search results |
RU2557473C2 (en) * | 2013-03-28 | 2015-07-20 | Федеральное государственное бюджетное образовательное учреждение высшего профессионального образования "Российский государственный гуманитарный университет" (РГГУ) | Data preparation system for information analysis system |
US9183499B1 (en) | 2013-04-19 | 2015-11-10 | Google Inc. | Evaluating quality based on neighbor features |
US9275331B2 (en) * | 2013-05-22 | 2016-03-01 | International Business Machines Corporation | Document classification system with user-defined rules |
US10255319B2 (en) * | 2014-05-02 | 2019-04-09 | Google Llc | Searchable index |
US10606883B2 (en) | 2014-05-15 | 2020-03-31 | Evolv Technology Solutions, Inc. | Selection of initial document collection for visual interactive search |
US20150331908A1 (en) | 2014-05-15 | 2015-11-19 | Genetic Finance (Barbados) Limited | Visual interactive search |
US10102277B2 (en) * | 2014-05-15 | 2018-10-16 | Sentient Technologies (Barbados) Limited | Bayesian visual interactive search |
US11094015B2 (en) | 2014-07-11 | 2021-08-17 | BMLL Technologies, Ltd. | Data access and processing system |
CA2965122C (en) * | 2014-10-20 | 2021-11-09 | Ab Initio Technology Llc | Specifying and applying rules to data |
US10325212B1 (en) | 2015-03-24 | 2019-06-18 | InsideView Technologies, Inc. | Predictive intelligent softbots on the cloud |
US9443192B1 (en) | 2015-08-30 | 2016-09-13 | Jasmin Cosic | Universal artificial intelligence engine for autonomous computing devices and software applications |
JP6558188B2 (en) * | 2015-09-30 | 2019-08-14 | 富士通株式会社 | Distributed processing system, learning model creation method, data processing method, learning model creation program, and data processing program |
EP3188039A1 (en) * | 2015-12-31 | 2017-07-05 | Dassault Systèmes | Recommendations based on predictive model |
CN105912500B (en) | 2016-03-30 | 2017-11-14 | 百度在线网络技术(北京)有限公司 | Machine learning model generation method and device |
US20170308609A1 (en) * | 2016-04-21 | 2017-10-26 | Microsoft Technology Licensing, Llc | Multi-result ranking exploration |
WO2017212459A1 (en) | 2016-06-09 | 2017-12-14 | Sentient Technologies (Barbados) Limited | Content embedding using deep metric learning algorithms |
AU2016411801A1 (en) | 2016-06-14 | 2019-02-07 | 360 Knee Systems Pty Ltd | Graphical representation of a dynamic knee score for a knee surgery |
US11093834B2 (en) | 2016-07-06 | 2021-08-17 | Palo Alto Research Center Incorporated | Computer-implemented system and method for predicting activity outcome based on user attention |
US11477302B2 (en) * | 2016-07-06 | 2022-10-18 | Palo Alto Research Center Incorporated | Computer-implemented system and method for distributed activity detection |
US10885478B2 (en) | 2016-07-06 | 2021-01-05 | Palo Alto Research Center Incorporated | Computer-implemented system and method for providing contextually relevant task recommendations to qualified users |
US9864933B1 (en) | 2016-08-23 | 2018-01-09 | Jasmin Cosic | Artificially intelligent systems, devices, and methods for learning and/or using visual surrounding for autonomous object operation |
US10452974B1 (en) | 2016-11-02 | 2019-10-22 | Jasmin Cosic | Artificially intelligent systems, devices, and methods for learning and/or using a device's circumstances for autonomous device operation |
CN108090040B (en) * | 2016-11-23 | 2021-08-17 | 北京国双科技有限公司 | A text information classification method and system |
US10607134B1 (en) | 2016-12-19 | 2020-03-31 | Jasmin Cosic | Artificially intelligent systems, devices, and methods for learning and/or using an avatar's circumstances for autonomous avatar operation |
CN108241892B (en) * | 2016-12-23 | 2021-02-19 | 北京国双科技有限公司 | Data modeling method and device |
US10755142B2 (en) | 2017-09-05 | 2020-08-25 | Cognizant Technology Solutions U.S. Corporation | Automated and unsupervised generation of real-world training data |
US10755144B2 (en) | 2017-09-05 | 2020-08-25 | Cognizant Technology Solutions U.S. Corporation | Automated and unsupervised generation of real-world training data |
US10496153B2 (en) | 2017-10-27 | 2019-12-03 | EMC IP Holding Company LLC | Method and system for binding chassis and components |
US10102449B1 (en) | 2017-11-21 | 2018-10-16 | Jasmin Cosic | Devices, systems, and methods for use in automation |
US10474934B1 (en) | 2017-11-26 | 2019-11-12 | Jasmin Cosic | Machine learning for computing enabled systems and/or devices |
US10402731B1 (en) | 2017-12-15 | 2019-09-03 | Jasmin Cosic | Machine learning for computer generated objects and/or applications |
US11075925B2 (en) | 2018-01-31 | 2021-07-27 | EMC IP Holding Company LLC | System and method to enable component inventory and compliance in the platform |
US11574201B2 (en) | 2018-02-06 | 2023-02-07 | Cognizant Technology Solutions U.S. Corporation | Enhancing evolutionary optimization in uncertain environments by allocating evaluations via multi-armed bandit algorithms |
US10693722B2 (en) | 2018-03-28 | 2020-06-23 | Dell Products L.P. | Agentless method to bring solution and cluster awareness into infrastructure and support management portals |
US10754708B2 (en) | 2018-03-28 | 2020-08-25 | EMC IP Holding Company LLC | Orchestrator and console agnostic method to deploy infrastructure through self-describing deployment templates |
US10514907B2 (en) | 2018-03-28 | 2019-12-24 | EMC IP Holding Company LLC | System and method for out-of-the-box solution-level management via logical architecture awareness |
US10249170B1 (en) | 2018-04-24 | 2019-04-02 | Dell Products L.P. | Auto alert for server in rack due to abusive usage |
US11086738B2 (en) | 2018-04-24 | 2021-08-10 | EMC IP Holding Company LLC | System and method to automate solution level contextual support |
US10795756B2 (en) | 2018-04-24 | 2020-10-06 | EMC IP Holding Company LLC | System and method to predictively service and support the solution |
CN109242025B (en) * | 2018-09-14 | 2021-05-04 | 北京旷视科技有限公司 | Model iterative correction method, device and system |
US11029875B2 (en) | 2018-09-28 | 2021-06-08 | Dell Products L.P. | System and method for data storage in distributed system across multiple fault domains |
US10628170B1 (en) | 2018-10-03 | 2020-04-21 | Dell Products L.P. | System and method for device deployment |
US10623265B1 (en) | 2018-10-03 | 2020-04-14 | EMC IP Holding Company LLC | System and method for logical configuration of distributed systems |
US11599422B2 (en) | 2018-10-16 | 2023-03-07 | EMC IP Holding Company LLC | System and method for device independent backup in distributed system |
US10909009B2 (en) | 2018-11-01 | 2021-02-02 | Dell Products L.P. | System and method to create a highly available quorum for clustered solutions |
US10977113B2 (en) | 2019-01-29 | 2021-04-13 | Dell Products L.P. | System and method for fault identification, logging, and remediation |
US10862761B2 (en) | 2019-04-29 | 2020-12-08 | EMC IP Holding Company LLC | System and method for management of distributed systems |
US11797876B1 (en) * | 2019-06-26 | 2023-10-24 | Amazon Technologies, Inc | Unified optimization for convolutional neural network model inference on integrated graphics processing units |
US11301557B2 (en) | 2019-07-19 | 2022-04-12 | Dell Products L.P. | System and method for data processing device management |
US11481633B2 (en) | 2019-08-05 | 2022-10-25 | Bank Of America Corporation | Electronic system for management of image processing models |
US11429866B2 (en) | 2019-08-05 | 2022-08-30 | Bank Of America Corporation | Electronic query engine for an image processing model database |
US11151415B2 (en) | 2019-08-05 | 2021-10-19 | Bank Of America Corporation | Parameter archival electronic storage system for image processing models |
US20210357955A1 (en) * | 2020-05-12 | 2021-11-18 | Mercari, Inc. | User search category predictor |
Citations (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5488725A (en) | 1991-10-08 | 1996-01-30 | West Publishing Company | System of document representation retrieval by successive iterated probability sampling |
US5897627A (en) | 1997-05-20 | 1999-04-27 | Motorola, Inc. | Method of determining statistically meaningful rules |
US5950186A (en) | 1997-08-15 | 1999-09-07 | Microsoft Corporation | Database system index selection using cost evaluation of a workload for multiple candidate index configurations |
US5974412A (en) | 1997-09-24 | 1999-10-26 | Sapient Health Network | Intelligent query system for automatically indexing information in a database and automatically categorizing users |
US6006222A (en) | 1997-04-25 | 1999-12-21 | Culliss; Gary | Method for organizing information |
US6014665A (en) | 1997-08-01 | 2000-01-11 | Culliss; Gary | Method for organizing information |
US6078916A (en) | 1997-08-01 | 2000-06-20 | Culliss; Gary | Method for organizing information |
US6088692A (en) | 1994-12-06 | 2000-07-11 | University Of Central Florida | Natural language method and system for searching for and ranking relevant documents from a computer database |
US6144944A (en) | 1997-04-24 | 2000-11-07 | Imgis, Inc. | Computer system for efficiently selecting and providing information |
US6182068B1 (en) | 1997-08-01 | 2001-01-30 | Ask Jeeves, Inc. | Personalized search methods |
US6266668B1 (en) | 1998-08-04 | 2001-07-24 | Dryken Technologies, Inc. | System and method for dynamic data-mining and on-line communication of customized information |
US6285999B1 (en) | 1997-01-10 | 2001-09-04 | The Board Of Trustees Of The Leland Stanford Junior University | Method for node ranking in a linked database |
US6311175B1 (en) | 1998-03-06 | 2001-10-30 | Perot Systems Corp. | System and method for generating performance models of complex information technology systems |
US6397211B1 (en) | 2000-01-03 | 2002-05-28 | International Business Machines Corporation | System and method for identifying useless documents |
US6405188B1 (en) | 1998-07-31 | 2002-06-11 | Genuity Inc. | Information retrieval system |
US20020083067A1 (en) | 2000-09-28 | 2002-06-27 | Pablo Tamayo | Enterprise web mining system and method |
US6463430B1 (en) | 2000-07-10 | 2002-10-08 | Mohomine, Inc. | Devices and methods for generating and managing a database |
US20020161763A1 (en) | 2000-10-27 | 2002-10-31 | Nong Ye | Method for classifying data using clustering and classification algorithm supervised |
US20020184181A1 (en) | 2001-03-30 | 2002-12-05 | Ramesh Agarwal | Method for building classifier models for event classes via phased rule induction |
US20030033292A1 (en) | 1999-05-28 | 2003-02-13 | Ted Meisel | System and method for enabling multi-element bidding for influencing a position on a search result list generated by a computer network search engine |
US6523020B1 (en) | 2000-03-22 | 2003-02-18 | International Business Machines Corporation | Lightweight rule induction |
US6546388B1 (en) | 2000-01-14 | 2003-04-08 | International Business Machines Corporation | Metadata search results ranking system |
US6546389B1 (en) | 2000-01-19 | 2003-04-08 | International Business Machines Corporation | Method and system for building a decision-tree classifier from privacy-preserving data |
US20030135490A1 (en) | 2002-01-15 | 2003-07-17 | Barrett Michael E. | Enhanced popularity ranking |
US20030195877A1 (en) | 1999-12-08 | 2003-10-16 | Ford James L. | Search query processing to provide category-ranked presentation of search results |
US20030197837A1 (en) | 2002-04-23 | 2003-10-23 | Lg Electronics Inc. | Optical system and display device using the same |
US6651054B1 (en) | 1999-10-30 | 2003-11-18 | International Business Machines Corporation | Method, system, and program for merging query search results |
US20030217052A1 (en) | 2000-08-24 | 2003-11-20 | Celebros Ltd. | Search engine method and apparatus |
US6671683B2 (en) | 2000-06-28 | 2003-12-30 | Matsushita Electric Industrial Co., Ltd. | Apparatus for retrieving similar documents and apparatus for extracting relevant keywords |
US6714929B1 (en) | 2001-04-13 | 2004-03-30 | Auguri Corporation | Weighted preference data search system and method |
US20040088308A1 (en) | 2002-08-16 | 2004-05-06 | Canon Kabushiki Kaisha | Information analysing apparatus |
US6738764B2 (en) | 2001-05-08 | 2004-05-18 | Verity, Inc. | Apparatus and method for adaptively ranking search results |
US6751611B2 (en) | 2002-03-01 | 2004-06-15 | Paul Jeffrey Krupin | Method and system for creating improved search queries |
US6782390B2 (en) | 1998-12-09 | 2004-08-24 | Unica Technologies, Inc. | Execution of multiple models using data segmentation |
US6801909B2 (en) | 2000-07-21 | 2004-10-05 | Triplehop Technologies, Inc. | System and method for obtaining user preferences and providing user recommendations for unseen physical and information goods and services |
US6804659B1 (en) | 2000-01-14 | 2004-10-12 | Ricoh Company Ltd. | Content based web advertising |
US20050060281A1 (en) | 2003-07-31 | 2005-03-17 | Tim Bucher | Rule-based content management system |
US20050071741A1 (en) | 2003-09-30 | 2005-03-31 | Anurag Acharya | Information retrieval based on historical data |
US20050100209A1 (en) | 2003-07-02 | 2005-05-12 | Lockheed Martin Corporation | Self-optimizing classifier |
US20050198268A1 (en) | 2004-02-10 | 2005-09-08 | Rohit Chandra | Network traffic monitoring for search popularity analysis |
US6947930B2 (en) | 2003-03-21 | 2005-09-20 | Overture Services, Inc. | Systems and methods for interactive search query refinement |
US7007074B2 (en) | 2001-09-10 | 2006-02-28 | Yahoo! Inc. | Targeted advertisements using time-dependent key search terms |
US7065524B1 (en) | 2001-03-30 | 2006-06-20 | Pharsight Corporation | Identification and correction of confounders in a statistical analysis |
US7080063B2 (en) | 2002-05-10 | 2006-07-18 | Oracle International Corporation | Probabilistic model generation |
US7089194B1 (en) | 1999-06-17 | 2006-08-08 | International Business Machines Corporation | Method and apparatus for providing reduced cost online service and adaptive targeting of advertisements |
US7100111B2 (en) | 1999-04-02 | 2006-08-29 | Overture Services, Inc. | Method and system for optimum placement of advertisements on a webpage |
US7222127B1 (en) | 2003-11-14 | 2007-05-22 | Google Inc. | Large scale machine learning systems and methods |
US7269546B2 (en) | 2001-05-09 | 2007-09-11 | International Business Machines Corporation | System and method of finding documents related to other documents and of finding related words in response to a query to refine a search |
US7380205B2 (en) | 2003-10-28 | 2008-05-27 | Sap Ag | Maintenance of XML documents |
US7716225B1 (en) | 2004-06-17 | 2010-05-11 | Google Inc. | Ranking documents based on user behavior and/or feature data |
US7783632B2 (en) | 2005-11-03 | 2010-08-24 | Microsoft Corporation | Using popularity data for ranking |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7808753B2 (en) | 2007-02-27 | 2010-10-05 | Taiwan Semiconductor Manufacturing Co., Ltd. | System and method for monitoring negative bias in integrated circuits |
Filing date | Application number | Publication number | Status |
---|---|---|---|
2003-11-14 | US 10/706,991 | US7231399B1 (en) | Not active; Expired - Lifetime |
2003-12-15 | US 10/734,584 | US7222127B1 (en) | Not active; Expired - Lifetime |
2007-04-17 | US 11/736,193 | US7769763B1 (en) | Not active; Expired - Lifetime |
2007-04-18 | US 11/736,872 | US7743050B1 (en) | Not active; Expired - Fee Related |
2010-05-11 | US 12/777,939 | US9116976B1 (en) | Not active; Expired - Fee Related |
2010-06-24 | US 12/822,902 | US8195674B1 (en) | Not active; Expired - Fee Related |
2012-06-04 | US 13/487,873 | US8364618B1 (en) | Not active; Expired - Fee Related |
2013-01-28 | US 13/751,746 | US8688705B1 (en) | Not active; Expired - Lifetime |
2015-07-31 | US 14/815,736 | US10055461B1 (en) | Not active; Expired - Fee Related |
Patent Citations (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5488725A (en) | 1991-10-08 | 1996-01-30 | West Publishing Company | System of document representation retrieval by successive iterated probability sampling |
US6088692A (en) | 1994-12-06 | 2000-07-11 | University Of Central Florida | Natural language method and system for searching for and ranking relevant documents from a computer database |
US6799176B1 (en) | 1997-01-10 | 2004-09-28 | The Board Of Trustees Of The Leland Stanford Junior University | Method for scoring documents in a linked database |
US7058628B1 (en) | 1997-01-10 | 2006-06-06 | The Board Of Trustees Of The Leland Stanford Junior University | Method for node ranking in a linked database |
US6285999B1 (en) | 1997-01-10 | 2001-09-04 | The Board Of Trustees Of The Leland Stanford Junior University | Method for node ranking in a linked database |
US6144944A (en) | 1997-04-24 | 2000-11-07 | Imgis, Inc. | Computer system for efficiently selecting and providing information |
US6006222A (en) | 1997-04-25 | 1999-12-21 | Culliss; Gary | Method for organizing information |
US5897627A (en) | 1997-05-20 | 1999-04-27 | Motorola, Inc. | Method of determining statistically meaningful rules |
US6539377B1 (en) | 1997-08-01 | 2003-03-25 | Ask Jeeves, Inc. | Personalized search methods |
US6182068B1 (en) | 1997-08-01 | 2001-01-30 | Ask Jeeves, Inc. | Personalized search methods |
US6078916A (en) | 1997-08-01 | 2000-06-20 | Culliss; Gary | Method for organizing information |
US6014665A (en) | 1997-08-01 | 2000-01-11 | Culliss; Gary | Method for organizing information |
US5950186A (en) | 1997-08-15 | 1999-09-07 | Microsoft Corporation | Database system index selection using cost evaluation of a workload for multiple candidate index configurations |
US5974412A (en) | 1997-09-24 | 1999-10-26 | Sapient Health Network | Intelligent query system for automatically indexing information in a database and automatically categorizing users |
US6311175B1 (en) | 1998-03-06 | 2001-10-30 | Perot Systems Corp. | System and method for generating performance models of complex information technology systems |
US6405188B1 (en) | 1998-07-31 | 2002-06-11 | Genuity Inc. | Information retrieval system |
US6266668B1 (en) | 1998-08-04 | 2001-07-24 | Dryken Technologies, Inc. | System and method for dynamic data-mining and on-line communication of customized information |
US6782390B2 (en) | 1998-12-09 | 2004-08-24 | Unica Technologies, Inc. | Execution of multiple models using data segmentation |
US7100111B2 (en) | 1999-04-02 | 2006-08-29 | Overture Services, Inc. | Method and system for optimum placement of advertisements on a webpage |
US20030033292A1 (en) | 1999-05-28 | 2003-02-13 | Ted Meisel | System and method for enabling multi-element bidding for influencing a position on a search result list generated by a computer network search engine |
US7089194B1 (en) | 1999-06-17 | 2006-08-08 | International Business Machines Corporation | Method and apparatus for providing reduced cost online service and adaptive targeting of advertisements |
US6651054B1 (en) | 1999-10-30 | 2003-11-18 | International Business Machines Corporation | Method, system, and program for merging query search results |
US6963867B2 (en) | 1999-12-08 | 2005-11-08 | A9.Com, Inc. | Search query processing to provide category-ranked presentation of search results |
US20030195877A1 (en) | 1999-12-08 | 2003-10-16 | Ford James L. | Search query processing to provide category-ranked presentation of search results |
US6397211B1 (en) | 2000-01-03 | 2002-05-28 | International Business Machines Corporation | System and method for identifying useless documents |
US6546388B1 (en) | 2000-01-14 | 2003-04-08 | International Business Machines Corporation | Metadata search results ranking system |
US6804659B1 (en) | 2000-01-14 | 2004-10-12 | Ricoh Company Ltd. | Content based web advertising |
US6546389B1 (en) | 2000-01-19 | 2003-04-08 | International Business Machines Corporation | Method and system for building a decision-tree classifier from privacy-preserving data |
US6523020B1 (en) | 2000-03-22 | 2003-02-18 | International Business Machines Corporation | Lightweight rule induction |
US6671683B2 (en) | 2000-06-28 | 2003-12-30 | Matsushita Electric Industrial Co., Ltd. | Apparatus for retrieving similar documents and apparatus for extracting relevant keywords |
US6463430B1 (en) | 2000-07-10 | 2002-10-08 | Mohomine, Inc. | Devices and methods for generating and managing a database |
US6801909B2 (en) | 2000-07-21 | 2004-10-05 | Triplehop Technologies, Inc. | System and method for obtaining user preferences and providing user recommendations for unseen physical and information goods and services |
US20030217052A1 (en) | 2000-08-24 | 2003-11-20 | Celebros Ltd. | Search engine method and apparatus |
US20020083067A1 (en) | 2000-09-28 | 2002-06-27 | Pablo Tamayo | Enterprise web mining system and method |
US6836773B2 (en) | 2000-09-28 | 2004-12-28 | Oracle International Corporation | Enterprise web mining system and method |
US20020161763A1 (en) | 2000-10-27 | 2002-10-31 | Nong Ye | Method for classifying data using clustering and classification algorithm supervised |
US20020184181A1 (en) | 2001-03-30 | 2002-12-05 | Ramesh Agarwal | Method for building classifier models for event classes via phased rule induction |
US7065524B1 (en) | 2001-03-30 | 2006-06-20 | Pharsight Corporation | Identification and correction of confounders in a statistical analysis |
US6714929B1 (en) | 2001-04-13 | 2004-03-30 | Auguri Corporation | Weighted preference data search system and method |
US6738764B2 (en) | 2001-05-08 | 2004-05-18 | Verity, Inc. | Apparatus and method for adaptively ranking search results |
US7269546B2 (en) | 2001-05-09 | 2007-09-11 | International Business Machines Corporation | System and method of finding documents related to other documents and of finding related words in response to a query to refine a search |
US7007074B2 (en) | 2001-09-10 | 2006-02-28 | Yahoo! Inc. | Targeted advertisements using time-dependent key search terms |
US20030135490A1 (en) | 2002-01-15 | 2003-07-17 | Barrett Michael E. | Enhanced popularity ranking |
US6751611B2 (en) | 2002-03-01 | 2004-06-15 | Paul Jeffrey Krupin | Method and system for creating improved search queries |
US20030197837A1 (en) | 2002-04-23 | 2003-10-23 | Lg Electronics Inc. | Optical system and display device using the same |
US7080063B2 (en) | 2002-05-10 | 2006-07-18 | Oracle International Corporation | Probabilistic model generation |
US20040088308A1 (en) | 2002-08-16 | 2004-05-06 | Canon Kabushiki Kaisha | Information analysing apparatus |
US6947930B2 (en) | 2003-03-21 | 2005-09-20 | Overture Services, Inc. | Systems and methods for interactive search query refinement |
US20050100209A1 (en) | 2003-07-02 | 2005-05-12 | Lockheed Martin Corporation | Self-optimizing classifier |
US20050060281A1 (en) | 2003-07-31 | 2005-03-17 | Tim Bucher | Rule-based content management system |
US20050071741A1 (en) | 2003-09-30 | 2005-03-31 | Anurag Acharya | Information retrieval based on historical data |
US20070094255A1 (en) | 2003-09-30 | 2007-04-26 | Google Inc. | Document scoring based on link-based criteria |
US7380205B2 (en) | 2003-10-28 | 2008-05-27 | Sap Ag | Maintenance of XML documents |
US7222127B1 (en) | 2003-11-14 | 2007-05-22 | Google Inc. | Large scale machine learning systems and methods |
US7231399B1 (en) | 2003-11-14 | 2007-06-12 | Google Inc. | Ranking documents based on large data sets |
US20050198268A1 (en) | 2004-02-10 | 2005-09-08 | Rohit Chandra | Network traffic monitoring for search popularity analysis |
US7716225B1 (en) | 2004-06-17 | 2010-05-11 | Google Inc. | Ranking documents based on user behavior and/or feature data |
US7783632B2 (en) | 2005-11-03 | 2010-08-24 | Microsoft Corporation | Using popularity data for ranking |
Non-Patent Citations (16)
Title |
---|
"Click Popularity-DirectHit Technology Overview"; http://www.searchengines.com/directhit.html; Nov. 10, 2003 (print date); 2 pages. |
A.Y. Ng and M.I. Jordan; "On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes," in T. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14, Cambridge, MA: MIT Press, 2002. |
Co-pending U.S. Appl. No. 10/712,263; Jeremy Bern et al.; "Targeting Advertisements Based on Predicted Relevance of the Advertisements"; filed Nov. 14, 2003, 40 pages. |
Co-pending U.S. Appl. No. 10/734,584; Jeremy Bern et al.; "Large Scale Machine Learning Systems and Methods"; filed Dec. 15, 2003, 35 pages. |
F. Crestani, M. Lalmas, C. Van Rijsbergen and I. Campbell; "'Is This Document Relevant? ... Probably': A Survey of Probabilistic Models in Information Retrieval"; ACM Computing Surveys, vol. 30, No. 4, Dec. 1998. |
http://www.httprevealer.com; "Creative Use of Http Revealer-How does Google Toolbar Work?"; Apr. 19, 2004 (print date); pp. 1-6. |
J. Friedman et al.; Additive Logistic Regression: A Statistical View of Boosting; Technical Report; Stanford University Statistics Department; Jul. 1998; pp. 1-45. |
J.H. Friedman, T. Hastie, and R. Tibshirani; "Additive Logistic Regression: a Statistical View of Boosting"; Dept. of Statistics, Stanford University Technical Report; Aug. 20, 1998. |
Jeffrey A. Dean et al., "Ranking Documents Based on User Behavior and/or Feature Data"; U.S. Appl. No. 10/869,057, filed Jun. 17, 2004; 36 pages. |
Justin Boyan et al.; "A Machine Learning Architecture for Optimizing Web Search Engines"; Carnegie Mellon University; May 10, 1996; pp. 1-8. |
U.S. Appl. No. 10/706,991; Jeremy Bern et al.; "Ranking Documents Based on Large Data Sets"; filed Nov. 14, 2003; 38 pages. |
U.S. Appl. No. 11/736,193; Jeremy Bern et al.; "Large Scale Machine Learning Systems and Methods"; filed Apr. 17, 2007; 35 pages. |
U.S. Appl. No. 11/736,872; Jeremy Bern et al.; "Ranking documents based on large data sets"; filed Apr. 18, 2007; 38 pages. |
Weis et al.; Rule-based Machine Learning Methods for Functional Prediction; Journal of AI Research; vol. 3; Dec. 1995; pp. 383-403. |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11366814B2 (en) * | 2019-06-12 | 2022-06-21 | Elsevier, Inc. | Systems and methods for federated search with dynamic selection and distributed relevance |
Also Published As
Publication number | Publication date |
---|---|
US7743050B1 (en) | 2010-06-22 |
US8364618B1 (en) | 2013-01-29 |
US7222127B1 (en) | 2007-05-22 |
US8195674B1 (en) | 2012-06-05 |
US9116976B1 (en) | 2015-08-25 |
US7231399B1 (en) | 2007-06-12 |
US7769763B1 (en) | 2010-08-03 |
US8688705B1 (en) | 2014-04-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10055461B1 (en) | | Ranking documents based on large data sets |
US6795820B2 (en) | | Metasearch technique that ranks documents obtained from multiple collections |
US6560600B1 (en) | | Method and apparatus for ranking Web page search results |
US8645345B2 (en) | | Search engine and method with improved relevancy, scope, and timeliness |
US6701309B1 (en) | | Method and system for collecting related queries |
US9218397B1 (en) | | Systems and methods for improved searching |
US8862565B1 (en) | | Techniques for web site integration |
US7206778B2 (en) | | Text search ordered along one or more dimensions |
US8346757B1 (en) | | Determining query terms of little significance |
US6112203A (en) | | Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis |
US7653623B2 (en) | | Information searching apparatus and method with mechanism of refining search results |
US7792833B2 (en) | | Ranking search results using language types |
US6615209B1 (en) | | Detecting query-specific duplicate documents |
US7171409B2 (en) | | Computerized information search and indexing method, software and device |
US20020165860A1 (en) | | Selective retrieval metasearch engine |
US20100228715A1 (en) | | Personalization of Web Search Results Using Term, Category, and Link-Based User Profiles |
US20080027918A1 (en) | | Method of generating a distributed text index for parallel query processing |
US20070214128A1 (en) | | Discovering alternative spellings through co-occurrence |
US20070016574A1 (en) | | Merging of results in distributed information retrieval |
CA2505294A1 (en) | | Query to task mapping |
US7296016B1 (en) | | Systems and methods for performing point-of-view searching |
CA2713932A1 (en) | | Automated boolean expression generation for computerized search and indexing |
US7962468B2 (en) | | System and method for providing image labeling game using CBIR |
KR20020089677A (en) | | Method for classifying a document automatically and system for the performing the same |
US7730074B1 (en) | | Accelerated large scale optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| CC | Certificate of correction | |
| FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| FP | Lapsed due to failure to pay maintenance fee | Effective date: 20220821 |