US10747759B2 - System and method for conducting a textual data search - Google Patents
System and method for conducting a textual data search
- Publication number
- US10747759B2 (application US15/631,077)
- Authority
- US
- United States
- Prior art keywords
- articles
- database
- search
- conducting
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
Definitions
- the present invention relates to a system and method for conducting a textual data search, and particularly, although not exclusively, to a system and method for conducting a literature search and identifying citation data.
- Textual contents may be digitally contained in a document stored in an electronic database.
- when a user needs to retrieve the textual data in a document stored in a large database, the user will need to locate the specific document from among multiple documents within one or more databases.
- Locating or searching for specific documents or articles may involve matching a search query with the information stored within the documents. However, it may be difficult to locate some of the stored documents if the search query is not accurately formulated, which may make the searching process time-consuming and inefficient.
- Embodiments of the present invention improve the accuracy of citation recommendation systems and potentially other document retrieval systems by adding a “wisdom of crowds” feature based on citation network and content similarity. To evaluate whether the query document should cite a specific document or not, it is reasonable to gather the crowds' opinions on this matter, i.e., whether each of the remaining documents in the corpus cites it or not.
- Various embodiments of the present invention concern information-retrieval systems, which may be used to provide recommendations based on topic modelling of textual data.
- the present invention further relates generally to information science and more particularly to the fields of bibliometrics and scientometrics.
- a method of conducting a textual data search comprising the steps of: receiving a search query associated with a search topic; analyzing the search query to determine at least one attribute of the search topic; processing the at least one attribute and a plurality of articles in a database; and identifying one or more results being relevant to the search topic in the plurality of articles in the database.
- the at least one attribute includes a topical similarity between the search query and each of the plurality of articles in the database.
- the method further comprises the step of constructing the topical similarity based on text information of both the search query and each of the plurality of articles in the database.
- the step of processing the at least one attribute and the plurality of articles in the database further includes inferring at least one relevant topic and a plurality of topic distributions associated with the search query and the plurality of articles in the database over the at least one relevant topic.
- the processing of the at least one attribute and the plurality of articles in the database is based on Latent Dirichlet Allocation (LDA).
- the topical similarity between the search query and each of the plurality of articles in the database is represented as a cosine similarity of the plurality of topic distributions.
- the cosine similarity is represented as the dot product of the two topic distribution vectors divided by the product of their Euclidean norms.
- the at least one attribute includes an aggregate likelihood that assesses whether each of the plurality of articles is to be cited by other articles with a similar topic in the database.
- the at least one attribute further includes crowd-based information associated with a list of references in each of the articles in the database.
- the aggregate likelihood is associated with the topical similarity between each of the plurality of articles and the other articles in the database.
- the method further comprises the step of representing $c_{id}$ with a citation matrix containing a plurality of binary variables, each representing a citation relationship of the article $i$ to the article $d$.
- the method further comprises the step of determining a score for each of the plurality of articles in the database, wherein the score is related to a linear representation of the at least one attribute of the search topic.
- the linear representation includes a weighted sum of the at least one attribute.
- the feature weight is determined based on a linear classifier.
- the linear classifier includes at least one of a logistic regression method and a Support Vector Machine for optimizing Mean Average Precision (SVM-MAP) method.
- the method further comprises the step of obtaining the one or more results representing the one or more of the plurality of articles in the database in an order according to the determined score.
- a system for use in conducting a textual data search comprising: a search input module arranged to receive a search query associated with a search topic and to analyze the search query to determine at least one attribute of the search topic; and a database processing module arranged to process the at least one attribute and a plurality of articles in a database, and to identify one or more results being relevant to the search topic in the plurality of articles in the database.
- the at least one attribute includes a topical similarity between the search query and each of the plurality of articles in the database.
- the search input module is further arranged to construct the topical similarity based on text information of both the search query and each of the plurality of articles in the database.
- the at least one attribute includes an aggregate likelihood in which each of the plurality of articles is to be cited by other articles with a similar topic in the database.
- the at least one attribute further includes crowd-based information associated with a list of references in each of the articles in the database.
- the aggregate likelihood is associated with the topical similarity between each of the plurality of articles and the other articles in the database.
- the database processing module is further arranged to determine a score for each of the plurality of articles in the database, wherein the score is related to a linear representation of the at least one attribute of the search topic.
- the one or more results represent one or more of the plurality of articles in the database in an order according to the determined score.
- FIG. 1 is a schematic diagram of a computing server for operation as a system for use in conducting a textual data search in accordance with one embodiment of the present invention.
- FIG. 2 is a schematic diagram of an embodiment of the system for use in conducting a textual data search in accordance with one embodiment of the present invention.
- FIG. 3 is a flow diagram showing an example process of the method of conducting a textual data search in accordance with one embodiment of the present invention.
- FIG. 4 is a diagram showing an example citation network representing a citation relationship between a search query and a plurality of articles in a database.
- search engine takes a research project description such as an abstract as the query input and recommends a list of possible citations as the output.
- the abstract-based query not only contains richer information, but also relieves users from the burden of identifying the most appropriate query words.
- the longer query does not necessarily have to be a well-written abstract. Any related keywords can be simply added to the query input, since the citation recommendation method is based on the bag-of-words assumption, that is, the sequence of words in a sentence is neglected.
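- the bag-of-words assumption described above can be illustrated with a minimal sketch. The tokenizer (lowercased alphabetic runs) and the example strings are assumptions chosen for illustration, not the preprocessing used in the embodiments.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical query text; any abstract or keyword list could be used instead.
abstract = "topic models improve citation recommendation"
shuffled = "recommendation citation improve models topic"
extra_keywords = abstract + " citation network"

# Word order is ignored, so both orderings give the same representation.
print(bag_of_words(abstract) == bag_of_words(shuffled))

# Appending keywords only changes the counts of the added words.
print(bag_of_words(extra_keywords)["citation"])
```

- because only word counts are kept, appending related keywords to an abstract is equivalent to editing the counts directly, which is why the query need not be well-formed prose.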
- authors' citation choices may be influenced by various factors.
- the search method may involve a number of features such as content similarity, author characteristics, article venue characteristics, and authors' citation behavior. To implement these features, much information needs to be collected from different sources. The cost of such data collection efforts can be substantial in practice and thus may prevent the wide adoption of some features.
- a lightweight citation recommendation method based on readily available information including article abstracts and citation networks (which may be constructed from articles' reference lists).
- the citation count of an article is its in-degree in the citation network.
- Self-citation essentially involves a paper citing another paper, which has one or more authors in common. Articles on related topics may cite the same list of seminal works, which may imply that a new article on these topics should also cite the seminal works as inferred from the citation network.
- the inventors devised that one simple “wisdom of crowds” measure may potentially achieve the same purpose but at a much lower cost.
- the method in accordance with the embodiments of the present invention involves only two features; however, according to experiments on a standard dataset (i.e., the ACL Anthology Reference Corpus), the method may achieve a similar level of accuracy compared with other example methods which may use many additional features.
- This embodiment is arranged to provide a system for use in conducting a textual data search, comprising: a search input module arranged to receive a search query associated with a search topic and to analyze the search query to determine at least one attribute of the search topic; and a database processing module arranged to process the at least one attribute and a plurality of articles in a database, and to identify one or more results being relevant to the search topic in the plurality of articles in the database.
- the system may be used as an information retrieval system which may output one or more results of a textual content relevant to a search topic.
- the results may be provided as a list of articles in an order based on the attributes/features of the search query and the articles in the database and the scores of each of the identified articles in the list.
- the search input module and the database processing module are implemented by or for operation on a computer having an appropriate user interface.
- the computer may be implemented by any computing architecture, including stand-alone PC, client/server architecture, “dumb” terminal/mainframe architecture, or any other appropriate architecture.
- the computing device is appropriately programmed to implement the invention.
- Referring to FIG. 1, there is shown a schematic diagram of a computer or a computing server 100, which in this embodiment comprises a server 100 arranged to operate, at least in part if not entirely, the system for use in conducting a textual data search in accordance with one embodiment of the invention.
- the server 100 comprises suitable components necessary to receive, store and execute appropriate computer instructions.
- the components may include a processing unit 102, read-only memory (ROM) 104, random access memory (RAM) 106, input/output devices such as disk drives 108, input devices 110 such as an Ethernet port or a USB port, a display 112 such as a liquid crystal display, a light emitting display or any other suitable display, and communications links 114.
- the server 100 includes instructions that may be included in ROM 104 , RAM 106 or disk drives 108 and may be executed by the processing unit 102 .
- the server may include storage devices such as a disk drive 108 which may encompass solid state drives, hard disk drives, optical drives or magnetic tape drives.
- the server 100 may use a single disk drive or multiple disk drives.
- the server 100 may also have a suitable operating system 116 which resides on the disk drive or in the ROM of the server 100 .
- the system has a database 120 residing on a disk or other storage device, which is arranged to store at least one record 122 .
- the database 120 is in communication with the server 100 with an interface, which is implemented by computer software residing on the server 100 .
- the database 120 may also be implemented as a stand-alone database system in communication with the server 100 via an external computing network, or other types of communication links.
- the server 100 may be used as a citation recommendation system for identifying relevant citations for a research study.
- a topic models-based method may be used to predict the presence of a citation link between every pair of documents based on the text content of the documents. It may utilize the citation context (i.e., the words around citation placeholders, or citing snippets) in addition to the text content to recommend a list of reference papers given the input of a manuscript.
- a citation recommendation framework may use both content and citation graph-based information.
- Various measures are constructed from the citation graph including co-citation coupling, same author, Katz, and citation count. Katz, which refers to the number of unique paths between two articles exponentially damped by length, represents the connection closeness between two documents. The recency of publication is also included in the feature set. Then, the features are combined in a linear model to produce a document score, which may be used to rank the documents.
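- the Katz measure mentioned above can be sketched as a damped count of paths. The damping factor `beta`, the truncation at `max_length`, the use of a directed adjacency matrix, and the three-article network are all assumptions for illustration; in an acyclic citation network, counting walks with matrix powers is the same as counting paths.

```python
def katz_scores(adjacency, beta=0.5, max_length=4):
    """Truncated Katz measure: sum over l = 1..max_length of beta**l times
    the number of directed walks of length l between each pair of nodes."""
    n = len(adjacency)
    scores = [[0.0] * n for _ in range(n)]
    walks = [row[:] for row in adjacency]  # walks of length 1
    for length in range(1, max_length + 1):
        for i in range(n):
            for j in range(n):
                scores[i][j] += (beta ** length) * walks[i][j]
        # Multiply by the adjacency matrix to count walks one step longer.
        walks = [
            [sum(walks[i][k] * adjacency[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return scores


# Hypothetical network: article 0 cites 1 and 2, article 1 cites 2.
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
katz = katz_scores(adjacency)
print(katz[0][2])  # one path of length 1 and one of length 2: 0.5 + 0.25
```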
- a feature set may be enriched with topical similarity and author behavioral patterns including scientific article popularity, recency, citing snippets, topic-related citing pattern, and social habits.
- two different linear classifiers, namely a logistic regression model and SVM-MAP, may be utilized to learn the weights of all the features in the retrieval model.
- an iterative paradigm may be applied for learning model weights. This approach was found to outperform single-iteration training.
- the method may incorporate one or more attributes or features when constructing the relevance score.
- attributes relate to information such as article metadata, author metadata, publication venue (e.g., journal, conference, workshop) metadata, and authors' citing behavioural patterns or habits. For instance, citation count, year of publication, co-citation, and self-citation can be incorporated in the calculation of the relevance score.
- a linear or nonlinear model may be adopted.
- a simple linear model may score each document against the query as a weighted sum of various feature scores.
- the feature weights can be obtained by fitting a logistic regression or the Support Vector machine for optimizing Mean Average Precision (SVM-MAP) model.
- a relevance score system may be applied to score or to sort the documents in a large corpus or database of documents with respect to a query (e.g., short description, abstract, full article, search terms) and then rank documents to identify candidates to be recommended as potential citations.
- the relevance score system may involve the ranking of relevance scores based on content similarity.
- Content similarity may be calculated based on a TF-IDF (Term Frequency-Inverse Document Frequency) score.
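- a minimal sketch of a TF-IDF weighting is given below. TF-IDF has several variants; this one assumes raw term counts, whitespace tokenization, and `idf = log(N / document frequency)`, none of which is specified in the text, and the three documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, using raw term frequency
    multiplied by idf = log(number of documents / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return vectors


# Hypothetical mini-corpus.
docs = [
    "citation recommendation with topic models",
    "topic models for text",
    "parsing with grammars",
]
vectors = tf_idf_vectors(docs)
print(vectors[0])
```

- terms that occur in fewer documents (here "citation") receive a larger weight than terms shared across documents (here "topic"), so rare terms dominate the content similarity.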
- the scoring system may be alternatively based on topic modeling, which may determine the content similarity score, specifically, topical similarity, etc. between the query and a document in the corpus using a vector space model along with cosine similarity. Further discussions will be included in the later parts of the disclosure.
- the inventors devise that the aggregate likelihood of being cited by similar articles, which captures the wisdom of crowds in the reference lists of academic articles, may be used.
- the topical similarity feature which measures content similarity based on topic models may also be used in some example embodiments.
- the improved system and method provided in accordance with the following embodiments outperform existing citation recommendation systems and methods which may require many features in order to guarantee the accuracy of the recommendation systems.
- many of these features may be related to the citation network formed by documents in the large corpus.
- the citation count of a document may be its in-degree in the citation network.
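- the in-degree reading of citation count can be sketched as a column sum over a binary citation matrix. The convention that entry `[i][j]` is 1 when document `i` cites document `j`, and the three-document network, are assumptions for illustration.

```python
def citation_counts(citation_matrix):
    """In-degree of each document: the column sums of a 0/1 citation matrix
    whose entry [i][j] is 1 when document i cites document j."""
    n = len(citation_matrix)
    return [sum(citation_matrix[i][j] for i in range(n)) for j in range(n)]


# Hypothetical three-document network: 0 cites 1 and 2, 1 cites 2.
example = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
counts = citation_counts(example)
print(counts)  # [0, 1, 2]
```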
- the server 100 is used as part of a search engine 200 arranged to conduct a search of the articles stored in the database 210 .
- the search engine 200 may communicate with a database 210 which may be external to the search engine 200 including the search input module 202 and the database processing module 204 .
- the system 200 is arranged to receive a search query 206 and return one or more articles in a database which are relevant to the topic(s) as a list of results 208.
- the search input module 202 may receive and process the search query 206 to derive the necessary features or attributes relevant to a search topic. These attributes may be passed to the database processing module 204 for further processing.
- the database processing module 204 may access one or more databases 210 according to the search requirement and process the attributes to identify the relevance of each of the articles in the database 210 to the search topic or search query 206 .
- the database 210 may be locally included in the same system 200 for use in conducting a textual data search, or search engine 200 including the search input module 202 and the database processing module 204 may be selectively implemented in a database 210 for facilitating a search of the articles stored in the database 210 .
- the search query 206 may be in the form of a brief description or an abstract of the topic and need not be in the form of a keyword-limited search with Boolean operators, as appreciated by a person skilled in the art.
- a user may input a search query 206 including a search topic to the search system 200; the search input module 202 may then analyse the search query 206 and identify one or more features/attributes of the input search topic.
- these attributes may include at least a topical similarity between the search query and each of the plurality of articles in the database, as well as an aggregate likelihood in which the search articles may be cited by other articles with a similar topic in the database.
- the search input module 202 first identifies two features of great importance to authors' citation choices. Then the database processing module 204 may process the identified attributes and the articles in the database 210. The database processing module 204 may further score each document $d_i$ against the query (i.e., an abstract) $q$ as a weighted sum of feature scores. Preferably, scores may be assigned to each of the articles in the database 210 for a single search query $q$ as follows:
- $$\mathrm{score}(q, d) = \sum_i w_i \, f_i(q, d) \qquad (1)$$
- the score is related to a linear representation of the attributes of the search topic.
- topical similarity may be constructed based on text information of both the search query and each of the plurality of articles in the database, which is helpful in finding similar or topically-related articles.
- aggregate likelihood of being cited by similar articles may be constructed from authors' perspective to capture the wisdom of crowds in citation choices.
- this attribute may include crowd-based or crowd-sourced information associated with a list of references in each of the articles in the database. This feature helps to find not only related works, but also their citation choices.
- topical similarity may represent the topical relationships between two documents in a more explicit way than the content/text similarity, which may be considered as an important feature to identify the document relevance.
- the process includes inferring at least one relevant topic and a plurality of topic distributions associated with the search query and the plurality of articles in the database over the at least one relevant topic, which may preferably be based on Latent Dirichlet Allocation (LDA).
- The underlying assumption of topic modelling algorithms such as Latent Dirichlet Allocation (LDA) is that documents are generated by choosing a distribution over a set of latent topics, and that each topic is in turn characterized by a distribution over words. Topic models may also assume that each document may contain multiple topics. For example, each document may be characterized by a distribution over a set of topics (i.e., topical distribution), which may be represented by a vector.
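- the two-level structure described above (a document as a distribution over topics, a topic as a distribution over words) can be illustrated with a small sketch. It does not perform LDA inference; the topic names, word probabilities, and document mixture are hypothetical values chosen so that each distribution sums to one.

```python
# Hypothetical topics, each a probability distribution over a tiny vocabulary.
topics = {
    "retrieval": {"search": 0.6, "query": 0.3, "parse": 0.1},
    "syntax": {"search": 0.1, "query": 0.1, "parse": 0.8},
}

# Hypothetical topical distribution of one document over the two topics.
doc_topic_distribution = {"retrieval": 0.75, "syntax": 0.25}


def word_probability(word, doc_topics, topic_words):
    """Probability of a word in a document: the topic-word probabilities
    mixed according to the document's topical distribution."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


for word in ("search", "query", "parse"):
    print(word, word_probability(word, doc_topic_distribution, topics))
```

- the vector `[0.75, 0.25]` plays the role of the topical distribution that is later compared between the query and each article.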
- the topical similarity between the search query and each of the plurality of articles in the database is represented as a cosine similarity of the plurality of topic distributions.
- the cosine similarity may be represented as the dot product of the two topic distribution vectors divided by the product of their Euclidean norms.
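- a minimal sketch of the cosine similarity between two topic distributions follows. The three-topic vectors are hypothetical, and returning 0.0 for an all-zero vector is an assumption made to keep the function total.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty distribution is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical topic distributions over three topics.
query_topics = [0.7, 0.2, 0.1]
article_topics = [0.6, 0.3, 0.1]
unrelated_topics = [0.0, 0.1, 0.9]

print(cosine_similarity(query_topics, article_topics))
print(cosine_similarity(query_topics, unrelated_topics))
```

- since topic distributions are non-negative, the similarity lies between 0 and 1, and an article whose topic mix resembles the query's scores close to 1.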
- the attribute of aggregate likelihood of being cited by similar articles captures the wisdom of crowds in the reference lists of academic articles.
- the method aggregates the citation decisions of all other articles regarding one candidate article, and may place more weight on the citation choices of topically similar papers.
- it may also capture the general consensus in citation choices (e.g., scientific article popularity).
- the underlying rationale is that prior citation choices by similar articles provide value for authors when deciding which papers to cite in their studies.
- the aggregate likelihood is associated with the topical similarity between each of the plurality of articles and the other articles in the database.
- an article citation network which represents a citation relationship between the search query and the plurality of articles in the database may be used in the construction of the feature of aggregate likelihood.
- Referring to FIG. 4, there is shown an example citation network associated with a search query and four articles $d_1$ to $d_4$ in a literature database.
- $c_{id}$ may be represented as a citation matrix containing a plurality of binary variables, each representing the citation relationship of the article $i$ to the article $d$.
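- the description of $c_{id}$, together with the statement elsewhere in this description that the aggregate likelihood can be computed by a matrix multiplication, suggests the following form for the feature. This is a reconstruction under those assumptions rather than the formula as filed; in particular, letting the index $i$ range over the other articles in the database is an assumption.

```latex
\mathrm{aggregate\_likelihood\_being\_cited}_{qd}
  = \sum_{i} \mathrm{topical\_similarity}_{qi} \, c_{id},
\qquad
c_{id} =
\begin{cases}
1 & \text{if article } i \text{ cites article } d,\\
0 & \text{otherwise.}
\end{cases}
```

- under this reading, an article gains likelihood each time it is cited, and citations made by articles that are topically close to the query count for more.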
- the database processing module may further determine a score for each of the plurality of articles in the database, and may obtain the list of results 208 representing the one or more of the plurality of articles in the database in an order according to the determined score.
- the score may be related to a linear representation of the at least one attribute of the search topic.
- the ranking score or the linear representation may be calculated as a weighted sum as follows.
- $$\mathrm{score}(q, d) = w_1 \times \mathrm{topical\_similarity}_{qd} + w_2 \times \mathrm{aggregate\_likelihood\_being\_cited}_{qd} \qquad (4)$$ where $w_1$ and $w_2$ are the feature weights.
- the two features may be put into log-space to better fit the model; experimental results on the development set showed improved performance after the log-transformation.
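- a minimal sketch of putting a feature into log-space is shown below. The text does not say how zero-valued features are handled; adding a small constant `epsilon` before taking the natural logarithm is an assumption made here so that a zero aggregate likelihood stays finite.

```python
import math


def log_feature(value, epsilon=1e-6):
    """Natural log of a non-negative feature value, with a small constant
    (an assumed smoothing choice) added so that zero maps to a finite number."""
    return math.log(value + epsilon)


# Hypothetical feature values for three candidate articles.
raw_values = [0.9, 0.5, 0.0]
log_values = [log_feature(v) for v in raw_values]
print(log_values)
```

- the transformation is monotonic, so it preserves the ordering of the feature values while compressing large values and spreading out small ones.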
- the feature weight may be determined based on a linear classifier.
- the top N articles for each search query (i.e., an abstract) may be collected to train the model.
- the retrieved articles may be labelled with +1 if they appear in the reference lists of the query article, or −1 otherwise.
- a linear classifier may be used to learn the weights for the two features on this training data.
- the linear classifier may include a logistic regression method or a Support Vector machine for optimizing Mean Average Precision (SVM-MAP) method.
- Logistic regression may measure the relationship between document relevance (i.e., a binary indicator) and the two features by estimating probabilities using a logistic function.
- SVM-MAP is a learning technique which may train a support vector machine to directly optimize mean average precision. The inventors confirmed that consistent results may be obtained using either of these two linear classifiers.
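- the weight-learning step can be sketched with a small logistic regression fitted by batch gradient descent. This is an illustration only: the training rows, learning rate, and number of epochs are hypothetical, and the embodiments may use any logistic regression or SVM-MAP implementation.

```python
import math


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit feature weights and a bias by batch gradient descent on the log-loss.
    Labels follow the +1 / -1 convention used for the retrieved articles."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    targets = [1.0 if y == 1 else 0.0 for y in labels]
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, t in zip(features, targets):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # logistic function
            for k in range(n_features):
                grad_w[k] += (p - t) * x[k]
            grad_b += p - t
        weights = [w - learning_rate * g / len(features)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


# Hypothetical rows: (topical_similarity, aggregate_likelihood_being_cited).
features = [(0.9, 0.8), (0.8, 0.3), (0.7, 0.9), (0.2, 0.1), (0.1, 0.4), (0.3, 0.0)]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_logistic_regression(features, labels)
print(weights, bias)
```

- the learned `weights` play the role of $w_1$ and $w_2$ in the weighted sum; the bias does not affect the ranking of articles for a given query because it shifts every score equally.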
- $$\mathrm{topical\_similarity}_q = [\,0.5 \quad 0.1 \quad 0.1 \quad 0.3\,] \qquad (5)$$ where each element $a_i$ in the vector represents the topical similarity between the query text and $d_i$.
- the citation relationships can be represented by the citation matrix
- $$\mathrm{citation\_matrix} = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \qquad (6)$$ where each element $b_{ij}$ (i.e., the value in row $i$ and column $j$) denotes the existing citation relationship between $d_i$ and $d_j$.
- a topical citation network is first constructed based on the topical similarity and citation matrix as shown in equations (5) and (6).
- the solid lines with arrows represent the existing citation relationships among the documents in the corpus.
- each directed line denotes a document citing another document.
- $d_1$ cites both $d_2$ and $d_3$.
- the dotted lines represent the topical similarity between the query text and documents in the corpus.
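- the calculation of the aggregate likelihood of being cited may also be represented by a matrix multiplication of the topical similarity vector in equation (5) and the citation matrix in equation (6), which gives $c = [\,0 \quad 0.5 \quad 0.9 \quad 0\,]$, where each element $c_i$ is the aggregate likelihood of $d_i$ being cited.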
- the ranking score between the query text and $d_i$ equals $0.5 \times a_i + 0.5 \times c_i$.
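- the ranking scores of each of the articles relevant to the search query, computed from equations (5) and (6) with both weights equal to 0.5, are $[\,0.25 \quad 0.3 \quad 0.5 \quad 0.15\,]$ for $d_1$ to $d_4$ respectively.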
- the four documents will be recommended in the order of $d_3$, $d_2$, $d_1$, $d_4$.
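- the worked example can be reproduced with a short sketch using the values of equations (5) and (6) and equal weights of 0.5. The variable names are illustrative, and the row-cites-column reading of the citation matrix follows the statement that $d_1$ cites both $d_2$ and $d_3$.

```python
# Topical similarity between the query and d1..d4, from equation (5).
topical_similarity = [0.5, 0.1, 0.1, 0.3]

# Citation matrix from equation (6): row i cites column j.
citation_matrix = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]

w1, w2 = 0.5, 0.5  # feature weights used in the example
n = len(topical_similarity)

# Vector-matrix product: similarity-weighted count of citations each article receives.
aggregate_likelihood = [
    sum(topical_similarity[i] * citation_matrix[i][d] for i in range(n))
    for d in range(n)
]

# Weighted sum of the two features for each article.
scores = [w1 * a + w2 * c for a, c in zip(topical_similarity, aggregate_likelihood)]

# Rank articles by descending score.
ranking = sorted(range(n), key=lambda d: scores[d], reverse=True)
recommended = [f"d{d + 1}" for d in ranking]
print(recommended)  # ['d3', 'd2', 'd1', 'd4']
```

- $d_3$ ranks first despite its low topical similarity because it is cited by $d_1$, $d_2$ and $d_4$, including the two articles most similar to the query.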
- Referring to FIG. 3, there is shown a flow chart summarizing the entire process of the method for conducting a search as described above.
- the search method may be used in analyzing search queries in the form of a brief description such as a research abstract instead of the keyword form used in traditional search engines. Therefore, relevant results are less likely to be omitted due to typographic errors or imprecise keyword matching.
- the method does not require users to formulate their search queries based on keywords, and therefore it is not necessary for the users to try multiple queries using different keywords.
- search queries consisting of a brief project description such as an abstract may be used as the query input in accordance with the embodiments of the present invention.
- the abstract-based query not only contains richer information but also relieves users from the burden of identifying the most appropriate keywords.
- the query does not necessarily need to be a well-written abstract.
- a combination of keywords would be sufficient as the query input, as the method is based on the bag-of-words assumption, which ignores the sequence of words in a sentence.
- the recommended set of citations is scored and ranked by topical similarity and aggregate likelihood of being cited by articles similar to the abstract-based query.
- the searching process may involve only two attributes/features and does not require other attributes such as authors, co-authors, article venues, and citing snippets. It may be difficult and time-consuming to collect all these kinds of information in practice.
- the method may also be used in applications for automatically locating and recommending citations based on the text contents (such as one or more paragraphs in an abstract field).
- a word processor may be implemented to automatically locate and insert a list of references which may be relevant to an article.
- the inventors have evaluated the performances of the embodiments according to the present invention.
- the citation recommendation method according to one example embodiment has been implemented and evaluated on a standard dataset (i.e., the ACL Anthology Reference Corpus) in comparison with existing methods. It is observed that the present invention, with only two features involved, achieves a similar level of accuracy to existing methods which may involve more features, as mentioned above.
- the method is tested on a standard dataset, the ACL Anthology Reference Corpus (ACL-ARC), and is compared with other example methods.
- the baseline search methods incorporate many features such as topical similarity and author behavioral patterns. It is shown that the present method significantly outperforms both a text similarity baseline and other related models using similar features.
- articles that do not satisfy the following conditions were excluded.
- the included articles have full text with a document length exceeding 5 words; and have at least five references remaining after discarding the references to articles outside the processed corpus.
- articles published from 2000 to 2003 were used as a training set, articles published in 2004 as development set, and articles published from 2005 to 2006 as test set.
- Mean average precision may be applied to evaluate a ranked list across different queries in information retrieval (IR) systems. This measure is sensitive to the rank of all relevant documents, and gives the highest score when all correct predictions precede all incorrect predictions.
- MAP is defined as the arithmetic mean of average precision over a set of queries as shown in equation (10). Average precision is calculated for each query as the average of precision at every cutoff where a new relevant document is retrieved as:
- $$\mathrm{MAP} = \frac{\sum_{q \in Q} \mathrm{AveP}_q}{|Q|}, \quad (10)$$ where q is a query in a set of queries Q; and |Q| is the number of queries in the set.
- the following Table shows the comparisons of mean average precision of the method in accordance with the embodiments of the present invention with the baseline method on the development set.
- N is the number of articles collected for each abstract to train the model.
- the mean average precisions of the present invention are consistent across the four groups, that is, groups with different values of N or using different classifiers.
- SVM-MAP as the linear classifier
- the MAP of the present method is only 3 points smaller than that of the baseline method. Considering that the baseline method incorporates 19 various features, the comparable mean average precision of the present method using only 2 features is more advantageous.
- inventions are advantageous in providing a lightweight citation recommendation method based on the wisdom of crowds in citation choices.
- the method, which involves only two features (i.e., topical similarity and aggregate likelihood of being cited by similar articles), may deliver performance similar to or better than that of existing methods using many features.
- the method is highly efficient and relies on readily available information including article abstracts and reference lists, and the method is suitable for large-scale implementation.
- the embodiments described with reference to the Figures can be implemented as an application programming interface (API) or as a series of libraries for use by a developer or can be included within another software application, such as a terminal or personal computer operating system or a portable computing device operating system.
- as program modules include routines, programs, objects, components and data files assisting in the performance of particular functions, the skilled person will understand that the functionality of the software application may be distributed across a number of routines, objects or components to achieve the same functionality desired herein.
- any appropriate computing system architecture may be utilised. This will include standalone computers, network computers and dedicated hardware devices.
- where the terms “computing system” and “computing device” are used, these terms are intended to cover any appropriate arrangement of computer hardware capable of implementing the function described.
- database may include any form of organized or unorganized data storage devices implemented in either software, hardware or a combination of both which are able to implement the function described.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
wherein Q denotes a multinomial distribution of the search query q over the at least one relevant topic and D denotes a multinomial distribution of an article d in the database over the at least one relevant topic.
$$\text{aggregate\_likelihood\_being\_cited}_{qd} = \sum_{i}^{n-1} \text{topical\_similarity}_{qi} \cdot c_{id};$$
wherein i denotes an article in the database except for the article d in the database, $c_{id}$ denotes a binary variable which represents a citation relationship of the article i to the article d, and $\text{topical\_similarity}_{qi}$ denotes the topical similarity between the search query and the article i.
$$\text{aggregate\_likelihood\_normalized}_{qd} = \frac{\sum_{i}^{n-1} \text{topical\_similarity}_{qi} \cdot c_{id}}{\sum_{i}^{n-1} c_{id}}.$$
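The raw and normalized forms of this feature can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the argument names, the nested-list layout `cites[i][d]`, and the choice to return 0.0 for an article that nobody cites are all hypothetical and are not specified by the source.

```python
from typing import Sequence


def aggregate_likelihood(
    sim_q: Sequence[float],
    cites: Sequence[Sequence[int]],
    d: int,
    normalize: bool = False,
) -> float:
    """Aggregate likelihood of article d being cited, given a query q.

    sim_q[i]    -- topical similarity between the query and article i
    cites[i][d] -- 1 if article i cites article d, otherwise 0
    """
    total = 0.0
    n_citing = 0
    for i, sim in enumerate(sim_q):
        if i == d:
            continue  # the sum ranges over every article except d itself
        total += sim * cites[i][d]
        n_citing += cites[i][d]
    if not normalize:
        return total
    # Assumption: an article with no citing articles gets a normalized value of 0.
    return total / n_citing if n_citing else 0.0
```

The normalized form divides by the number of citing articles, so it behaves like an average similarity of the citing articles rather than a sum that grows with citation count.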
In this example, the score is related to a linear representation of the attributes of the search topic.
wherein Q denotes a multinomial distribution of the search query (i.e., an abstract) q over the relevant topics and D denotes a multinomial distribution of an article d in the database/corpus over the relevant topics.
$$\text{aggregate\_likelihood\_being\_cited}_{qd} = \sum_{i}^{n-1} \text{topical\_similarity}_{qi} \cdot c_{id} \quad (3)$$
wherein i denotes an article in the database except for the article d in the database, $c_{id}$ denotes a binary variable which represents a citation relationship of the article i to the article d, and $\text{topical\_similarity}_{qi}$ denotes the topical similarity between the search query and the article i. Alternatively or optionally, $c_{id}$ may be represented as a citation matrix containing a plurality of binary variables, each representing the citation relationship of the article i to the article d.
$$\text{score}(q,d) = w_1 \times \text{topical\_similarity}_{qd} + w_2 \times \text{aggregate\_likelihood\_being\_cited}_{qd} \quad (4)$$
where $w_1$ and $w_2$ are the feature weights. The two features may be put into log-space to better fit the model; experimental results on the development set showed improved performance after the log transformation.
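As a rough illustration of equation (4) and the optional log transformation, the sketch below scores and ranks a few candidate articles. The function names, the example feature values, the unit weights, and the small constant `eps` added before taking the logarithm (to avoid log of zero) are assumptions made for this example; the source does not state how zero-valued features are handled.

```python
import math


def score(topical_similarity, aggregate_likelihood, w1, w2,
          log_space=False, eps=1e-9):
    """Linear score of equation (4), optionally on log-transformed features."""
    f1, f2 = topical_similarity, aggregate_likelihood
    if log_space:
        # Assumption: eps keeps the logarithm defined when a feature is 0.
        f1, f2 = math.log(f1 + eps), math.log(f2 + eps)
    return w1 * f1 + w2 * f2


def rank(candidates, w1, w2, log_space=False):
    """Return article ids sorted by descending score.

    candidates maps an article id to (topical_similarity, aggregate_likelihood).
    """
    return sorted(
        candidates,
        key=lambda d: score(*candidates[d], w1, w2, log_space),
        reverse=True,
    )


# Hypothetical feature values for three candidate articles.
candidates = {"d1": (0.5, 0.3), "d2": (0.1, 0.8), "d3": (0.3, 0.0)}
linear_order = rank(candidates, w1=1.0, w2=1.0)
log_order = rank(candidates, w1=1.0, w2=1.0, log_space=True)
```

In this toy input the two orderings differ, which shows that the log transformation can change the ranking and not only the scale of the scores.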
$$\text{score}(q,d) = \sum_i w_i \times f_i(q,d) \quad (1)$$
wherein q denotes the search query, d denotes an article in the database, and $w_i$ denotes a feature weight assigned for each of the at least one attribute $f_i(q,d)$.
$$\text{topical\_similarity}_q = [0.5 \;\; 0.1 \;\; 0.1 \;\; 0.3] \quad (5),$$
where each element $a_i$ in the vector represents the topical similarity between the query text and $d_i$. The citation relationships can be represented by the citation matrix
where each element $b_{ij}$ in the matrix (i.e., the value in row i and column j) denotes the existing citation relationship between $d_i$ and $d_j$.
where each element $c_i$ represents the feature of the aggregate likelihood being cited for $d_i$.
where each element in the vector represents the ranking score of $d_i$ given the query text.
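The worked example above can be followed end to end in a few lines. The topical similarity vector is the one given in equation (5); the citation matrix `B` and the weights `w1` and `w2` below are hypothetical stand-ins, since the values used in the original example are not reproduced in this text. The helper name `vec_mat` is also an assumption.

```python
# Topical similarity between the query and articles d1..d4, from equation (5).
a = [0.5, 0.1, 0.1, 0.3]

# Hypothetical citation matrix: B[i][j] = 1 if article d_i cites article d_j.
# The diagonal is zero, so an article never contributes to its own feature.
B = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]


def vec_mat(vec, mat):
    """Row-vector times matrix: result[j] = sum_i vec[i] * mat[i][j]."""
    n = len(vec)
    return [sum(vec[i] * mat[i][j] for i in range(n)) for j in range(n)]


# Aggregate likelihood of being cited for each article (the vector c).
c = vec_mat(a, B)

# Ranking score per article with hypothetical unit weights.
w1, w2 = 1.0, 1.0
scores = [w1 * a_i + w2 * c_i for a_i, c_i in zip(a, c)]

# Article indices ordered from highest to lowest score.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Expressing the feature as a vector-matrix product computes the aggregate likelihood for every candidate article in one pass, which is convenient when the citation matrix is large and sparse.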
Statistic | Train | Dev | Test |
---|---|---|---|
Years | 2000-2003 | 2004 | 2005-2006 |
Articles | 619 | 318 | 864 |
References | 4,734 | 2,545 | 7,637 |
Refs/Article | 7.6 | 8 | 8.8 |
(9), where k denotes a point where a new relevant document is retrieved; m is the total number of relevant documents for a query.
where q is a query in a set of queries Q; and |Q| is the number of queries in the set.
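A minimal sketch of how average precision and MAP, as described above, might be computed for ranked citation lists. The function names and the input layout (a ranked list of article ids plus the set of truly cited ids per query) are assumptions for illustration.

```python
def average_precision(ranked, relevant):
    """Average of precision at each rank k where a relevant document appears.

    The sum is divided by m, the total number of relevant documents,
    so relevant documents that are never retrieved lower the score.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumption: a query with no relevant documents scores 0
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k  # precision at cutoff k
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Arithmetic mean of average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with their ranked recommendations and true citations.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_value = mean_average_precision(runs)
```

For the first query the relevant articles appear at ranks 1 and 3, giving (1/1 + 2/3) / 2; for the second the only relevant article is at rank 2, giving 1/2. A ranking that places every relevant article ahead of every irrelevant one yields an average precision of 1.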
Dev | N = 100 | N = 2000 |
---|---|---|
Logistic (Wisdom of Crowds) | 23.3 | 22.0 |
Logistic (Baseline) | 7.9 | 10.7 |
SVM-MAP (Wisdom of Crowds) | 22.6 | 22.5 |
SVM-MAP (Baseline) | 25.3 | 25.5 |
Logistic (Wisdom of Crowds) | Dev MAP |
---|---|
Topical similarity | 13.2 |
Aggregate likelihood of being cited by similar articles | 13.6 |
Both features | 22.0 |
Claims (20)
$$\text{aggregate\_likelihood\_being\_cited}_{qd} = \sum_{i}^{n-1} \text{topical\_similarity}_{qi} \cdot c_{id};$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/631,077 US10747759B2 (en) | 2017-06-23 | 2017-06-23 | System and method for conducting a textual data search |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/631,077 US10747759B2 (en) | 2017-06-23 | 2017-06-23 | System and method for conducting a textual data search |
Publications (2)
Publication Number | Publication Date |
---|---|
US20180373754A1 US20180373754A1 (en) | 2018-12-27 |
US10747759B2 US10747759B2 (en) | 2020-08-18 |
Family
ID=64693304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/631,077 Active 2037-07-16 US10747759B2 (en) | 2017-06-23 | 2017-06-23 | System and method for conducting a textual data search |
Country Status (1)
Country | Link |
---|---|
US (1) | US10747759B2 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109684272A (en) * | 2018-12-29 | 2019-04-26 | 国家电网有限公司 | Document storage method, system and terminal device |
CN110688474B (en) * | 2019-09-03 | 2023-03-14 | 西北工业大学 | Embedded representation obtaining and citation recommending method based on deep learning and link prediction |
US11205047B2 (en) * | 2019-09-05 | 2021-12-21 | Servicenow, Inc. | Hierarchical search for improved search relevance |
US11386164B2 (en) * | 2020-05-13 | 2022-07-12 | City University Of Hong Kong | Searching electronic documents based on example-based search query |
US20210391075A1 (en) * | 2020-06-12 | 2021-12-16 | American Medical Association | Medical Literature Recommender Based on Patient Health Information and User Feedback |
JP7622508B2 (en) * | 2021-03-26 | 2025-01-28 | 富士通株式会社 | Training data generation program, training data generation method, and training data generation device |
CN113392072B (en) * | 2021-06-25 | 2022-08-02 | 中国标准化研究院 | Standard knowledge service method, device, electronic equipment and storage medium |
CN113947290A (en) * | 2021-09-26 | 2022-01-18 | 海南电网有限责任公司 | Distribution network evaluation and review auxiliary method and system based on artificial intelligence |
- 2017-06-23 US US15/631,077 patent/US10747759B2/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5832476A (en) | 1994-06-29 | 1998-11-03 | Hitachi, Ltd. | Document searching method using forward and backward citation tables |
US20140289675A1 (en) * | 2009-08-20 | 2014-09-25 | Tyron Jerrod Stading | System and Method of Mapping Products to Patents |
US20130297590A1 (en) * | 2012-04-09 | 2013-11-07 | Eli Zukovsky | Detecting and presenting information to a user based on relevancy to the user's personal interest |
US9218344B2 (en) | 2012-06-29 | 2015-12-22 | Thomson Reuters Global Resources | Systems, methods, and software for processing, presenting, and recommending citations |
CN105589948A (en) | 2015-12-18 | 2016-05-18 | 重庆邮电大学 | Document citation network visualization and document recommendation method and system |
CN105787068A (en) | 2016-03-01 | 2016-07-20 | 上海交通大学 | Academic recommendation method and system based on citation network and user proficiency analysis |
Non-Patent Citations (4)
Title |
---|
D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 993-1022. |
S. Bethard, and D. Jurafsky. 2010. Who Should I Cite: Learning Literature Search Models from Citation Behavior. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM). |
T. Strohman, W. Bruce Croft, and D. Jensen. 2007. Recommending Citations for Academic Papers. In Proceedings of the 30th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR). |
Y. Yue, T. Finley, and F. Radlinski. 2007. A Support Vector Method for Optimizing Average Precision. In Proceedings of the 30th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR). |
Also Published As
Publication number | Publication date |
---|---|
US20180373754A1 (en) | 2018-12-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10747759B2 (en) | System and method for conducting a textual data search | |
US10146862B2 (en) | Context-based metadata generation and automatic annotation of electronic media in a computer network | |
EP2823410B1 (en) | Entity augmentation service from latent relational data | |
Bhagavatula et al. | Methods for exploring and mining tables on wikipedia | |
US7849104B2 (en) | Searching heterogeneous interrelated entities | |
US20160034514A1 (en) | Providing search results based on an identified user interest and relevance matching | |
Kalashnikov et al. | Web people search via connection analysis | |
US20080114750A1 (en) | Retrieval and ranking of items utilizing similarity | |
Im et al. | Linked tag: image annotation using semantic relationships between image tags | |
Hasibi et al. | On the reproducibility of the TAGME entity linking system | |
KR20180097120A (en) | Method for searching electronic document and apparatus thereof | |
US20240037375A1 (en) | Systems and Methods for Knowledge Distillation Using Artificial Intelligence | |
Muangprathub et al. | Document plagiarism detection using a new concept similarity in formal concept analysis | |
Devezas et al. | A review of graph-based models for entity-oriented search | |
KR20160120583A (en) | Knowledge Management System and method for data management based on knowledge structure | |
US20200394197A1 (en) | Systems and methods for federated search with dynamic selection and distributed relevance | |
Spahiu et al. | Topic profiling benchmarks in the linked open data cloud: Issues and lessons learned | |
Vo | New re-ranking approach in merging search results | |
CN115328945A (en) | Data asset retrieval method, electronic device and computer-readable storage medium | |
Zhang et al. | A semantics-based method for clustering of Chinese web search results | |
Wang et al. | Sequential text-term selection in vector space models | |
Abuoda et al. | Automatic Tag Recommendation for the UN Humanitarian Data Exchange. | |
Singh et al. | A study of similarity functions used in textual information retrieval in Wide Area Networks | |
Cheng et al. | Learning to rank relevant documents for information retrieval in bioengineering text corpora | |
Navarro Bullock et al. | Tagging data as implicit feedback for learning-to-rank |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: CITY UNIVERSITY OF HONG KONG, HONG KONG Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:XU, RUIYUN;CHEN, HAILIANG;ZHAO, J. LEON;REEL/FRAME:042793/0786 Effective date: 20170606 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2551); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY Year of fee payment: 4 |