CN104424279B - Method and apparatus for calculating text relevance - Google Patents

Method and apparatus for calculating text relevance

Info

Publication number
CN104424279B
Authority
CN
China
Prior art keywords
character string
word
character
characteristic value
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310388496.XA
Other languages
Chinese (zh)
Other versions
CN104424279A (en)
Inventor
赫南
张文斌
姚伶伶
王莉峰
何琪
张博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201310388496.XA
Publication of CN104424279A
Application granted
Publication of CN104424279B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3347: Query execution using vector based model
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374: Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a method and apparatus for calculating text relevance. The method includes: receiving a first character string and a second character string; calculating text relevance feature values of the first character string and the second character string, and semantic relevance feature values of the first character string and the second character string; and fitting, based on a logistic regression model, the text relevance feature values and the semantic relevance feature values into a relevance feature value of the first character string and the second character string. Embodiments of the present invention improve the accuracy of relevance prediction, save storage space and reduce cost.

Description

Method and apparatus for calculating text relevance
Technical field
Embodiments of the present invention relate to the technical field of Internet applications, and more particularly to a method and apparatus for calculating text relevance.
Background
With the rapid development of computer and network technology, the Internet plays an increasingly important role in daily life, study and work, and new Internet applications keep emerging.
Search advertising is an important business in the Internet advertising ecosystem. It relies on search engines and is, in essence, sold through keyword matching. In the promotion database, besides the title and description to be displayed, an advertiser also adds keywords that have a certain relevance to the advertisement (bid terms), and specifies the match type, the bid, and the target traffic to match (users with a matching retrieval intent). In the classical matching process, bid terms form a direct index to advertisements. When a user's query "matches" an advertiser's bid term, that is, when their relevance reaches a certain level, the preliminary condition for triggering the advertisement is considered met (ignoring other targeting and filtering stages for now), and the corresponding advertisement (title, description) can be pulled for further selection, such as click-through rate estimation, ad ranking and display strategy selection.
In the retrieval stage, the ad system takes the user's query string and applies a variety of online and offline strategies to match bid terms. The bid terms found here are all short texts that the advertiser specified when filling in the ad material and that relate to the advertisement's title and description. Measuring the relevance between a query and a candidate bid term (bidterm) in the online system is therefore, in essence, measuring the relevance between short texts.
Traditionally there are many methods based on literal string matching, and offline and online evaluation methods also differ; all of them have certain limitations. Sahami et al. of Google proposed using the web search results of a short text as a semantic expansion and calculating the semantic relevance between short texts on that basis, which works better than purely word-based approaches. Metzler of the University of Massachusetts, Dumais of Microsoft and others also tried a variety of short-text representations for calculating semantic relevance.
However, traditional calculation methods based on the word vector space model of documents face a feature sparsity problem on short texts. In addition, because the word segmentation result of a short text depends on the language model, different segmentations are not guaranteed to be consistent, which aggravates vector sparsity to some extent. Such methods therefore suffer from low relevance prediction accuracy.
Moreover, these methods need a large amount of storage space for word vectors, which wastes storage space and increases cost.
Summary of the invention
Embodiments of the present invention provide a method for calculating text relevance, so as to improve the accuracy of relevance prediction.
Embodiments of the present invention also provide an apparatus for calculating text relevance, so as to improve the accuracy of relevance prediction.
The technical solution of the embodiments of the present invention is as follows:
A method for calculating text relevance, the method including:
receiving a first character string and a second character string;
calculating text relevance feature values of the first character string and the second character string, and semantic relevance feature values of the first character string and the second character string;
fitting, based on a logistic regression model, the text relevance feature values and the semantic relevance feature values into a relevance feature value of the first character string and the second character string.
An apparatus for calculating text relevance, the apparatus including a character string receiving unit, a relevance feature value calculating unit and a relevance feature value fitting unit, wherein:
the character string receiving unit is configured to receive a first character string and a second character string;
the relevance feature value calculating unit is configured to calculate text relevance feature values of the first character string and the second character string, and semantic relevance feature values of the first character string and the second character string;
the relevance feature value fitting unit is configured to fit, based on a logistic regression model, the text relevance feature values and the semantic relevance feature values into a relevance feature value of the first character string and the second character string.
It can be seen from the above technical solution that, in embodiments of the present invention, a first character string and a second character string are received; text relevance feature values and semantic relevance feature values of the two character strings are calculated; and, based on a logistic regression model, these feature values are fitted into a relevance feature value of the first character string and the second character string. Embodiments of the present invention thus avoid calculation methods based on the word vector space model of documents and therefore avoid the feature sparsity problem, which improves the accuracy of relevance prediction, saves storage space and reduces cost.
Moreover, embodiments of the present invention use string-level text relevance measures such as edit distance and longest common subsequence as basic features. These express the text similarity between short strings along several dimensions and cope better with short texts that are non-standard or whose word segmentation is inaccurate or inconsistent.
In addition, embodiments of the present invention provide relevance features based on text classification and on probabilistic latent semantic analysis. These can fully mine the implicit relations between a short text and the words that make it up, so as to calculate the category relation and topic relation between two short texts and supplement the text relevance features.
In addition, embodiments of the present invention provide a relevance feature based on the web search results of words. The dictionary resources it depends on are of controllable size, and single-machine storage space and calculation speed are greatly improved, which makes lightweight online calculation of semantic relevance between short strings possible.
Brief description of the drawings
Fig. 1 is a flowchart of a method for calculating text relevance according to an embodiment of the present invention;
Fig. 2 is a structural diagram of an apparatus for calculating text relevance according to an embodiment of the present invention.
Detailed description of embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
Many applications involve calculating the relevance of two short texts. The relevance of two short texts refers to the degree to which they are semantically associated; they need not be literally similar. Relevance is a broader concept than similarity and is important in many products and systems. A short text is a character string of short length, for example no more than 38 Chinese characters in certain network applications.
A bid term (bidterm) is a term that an advertiser submits for bidding in a bidding advertisement system; a query is the search keyword that a user submits to a search engine. Queries and bid terms are generally short character strings, so both can be referred to as short texts.
Fig. 1 is a flowchart of a method for calculating text relevance according to an embodiment of the present invention.
As shown in Fig. 1, the method includes:
Step 101: Receive a first character string and a second character string.
Here, the first character string and the second character string are preferably both short texts. For example, they can be a query and a bid term, respectively.
Step 102: Calculate text relevance feature values of the first character string and the second character string, and semantic relevance feature values of the first character string and the second character string.
Text-level relevance features mainly measure the text similarity between short strings. They use only the text of the short strings and can be computed on the fly with efficient, optimized algorithms.
For example, a relevance feature value of the first character string and the second character string based on edit distance can be calculated, and/or a relevance feature value of the first character string and the second character string based on the longest common subsequence.
Semantic-level relevance features mainly measure the similarity of concept and meaning between short strings.
In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string includes:
building an industry category feature word dictionary (for example, a first-level industry category feature word dictionary);
for the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global inverse document frequency (IDF) weight of that word, and summing the results to obtain the category distribution of the first character string; for the second character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global IDF weight of that word, and summing the results to obtain the category distribution of the second character string;
calculating the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
Preferably, building the industry category feature word dictionary includes:
classifying each web page by full-text matching against a manually labeled set of industry category feature words;
for web pages that have been assigned a category, segmenting the full text into words, extracting category feature words, and merging the extracted category feature words into the set of industry category feature words, so as to build the industry category feature word dictionary.
In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string includes:
for the first character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the first character string by the global IDF weight of that word, and summing the results to obtain the topic distribution of the first character string; for the second character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the second character string by the global IDF weight of that word, and summing the results to obtain the topic distribution of the second character string;
calculating the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string includes: calculating a relevance feature value of the first character string and the second character string based on statistical machine translation.
In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string includes: calculating a word-granularity semantic relevance feature value of the first character string and the second character string based on web search results.
In practice, several calculation methods can be used at the same time to calculate the text relevance feature values of the first character string and the second character string. For example, both the relevance feature value based on edit distance and the relevance feature value based on the longest common subsequence can be calculated, and both then serve as text relevance feature values in the fitting calculation of step 103.
Similarly, several calculation methods can be used at the same time to calculate the semantic relevance feature values of the first character string and the second character string.
For example, calculating the semantic relevance feature values of the first character string and the second character string includes at least one of the following:
calculating the relevance feature value of the first character string and the second character string based on edit distance; calculating the relevance feature value of the first character string and the second character string based on the longest common subsequence; calculating the relevance feature value of the first character string and the second character string based on text classification; calculating the topic relevance feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA); calculating the relevance feature value of the first character string and the second character string based on statistical machine translation; calculating the word-granularity relevance feature value of the first character string and the second character string based on web search results.
All of the calculated semantic relevance feature values then take part in the fitting calculation of step 103.
Step 103: Based on a logistic regression model, fit the text relevance feature values and the semantic relevance feature values into a relevance feature value of the first character string and the second character string.
Here, a feature vector is constructed from the calculated text relevance feature values and semantic relevance feature values of the first character string and the second character string;
training samples are constructed from the feature vectors, and a binary logistic regression model is trained on these samples to obtain the weights of the text relevance feature values, the weights of the semantic relevance feature values, and the bias;
the relevance feature value is then calculated from the text relevance feature values and their weights, the semantic relevance feature values and their weights, and the bias.
The method for calculating text relevance according to embodiments of the present invention is described in more detail below.
The problem solved by the present invention is formally defined as follows:
Given two short texts $T_1$ and $T_2$, calculate the semantic relevance $R(T_1, T_2)$ that reflects their degree of semantic association, where $R(T_1, T_2) \in [0, 1]$.
For a short text $T$, the string length is denoted $|T|$ and the word segmentation result is written $T = t_1 t_2 \ldots t_n$; the word segmentation results of $T_1$ and $T_2$ are then $T_1 = t_{11} t_{12} \ldots t_{1n}$ and $T_2 = t_{21} t_{22} \ldots t_{2n}$, respectively.
Relevance features along several dimensions are first calculated for the two short texts, and a logistic regression model then fits the relevance feature scores of these dimensions into one final semantic relevance score.
The details are as follows:
The text relevance feature values between two short texts are text-level relevance features. Because text-level relevance features mainly measure the text similarity between short strings and use only the text of the short strings, they can be computed on the fly with efficient, optimized algorithms.
For example:
(1) Text relevance feature value based on edit distance
Edit distance, also known as Levenshtein distance, is the minimum number of edit operations needed to change one of two character strings into the other. The permitted edit operations are substituting one character for another, inserting a character, and deleting a character.
The edit distance $EditDist(T_1, T_2)$ of two short texts $T_1$ and $T_2$ can be calculated by a dynamic programming algorithm with time complexity $O(|T_1| \cdot |T_2|)$.
The relevance feature of two short texts based on edit distance is calculated as follows: [formula not reproduced in this text]
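The dynamic programming calculation described above can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, and the normalization in `edit_distance_feature` (one minus the distance divided by the longer length) is an assumption chosen so that the value falls in [0, 1]; it is not taken from the patent text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def edit_distance_feature(a: str, b: str) -> float:
    """ASSUMED normalization: 1 - distance / max length, a value in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest
```

Keeping only two rows of the table holds memory at O(|T2|) while the running time stays O(|T1| * |T2|).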
(2) Text relevance feature value based on the longest common subsequence
A subsequence of a character string is a string obtained by deleting some characters from it while keeping the remaining characters in their original order; the remaining characters need not be contiguous.
The longest common subsequence of two character strings is the longest of all the subsequences they share. The longest common subsequence $LCS(T_1, T_2)$ of two short texts $T_1$ and $T_2$ can be calculated by a dynamic programming algorithm with time complexity $O(|T_1| \cdot |T_2|)$.
The relevance feature of two short texts based on the longest common subsequence is calculated as follows: [formula not reproduced in this text]
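A minimal sketch of the longest common subsequence calculation follows. The function names are hypothetical, and dividing the LCS length by the longer string length in `lcs_feature` is an assumed normalization, not the formula from the patent.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one character
        prev = curr
    return prev[-1]


def lcs_feature(a: str, b: str) -> float:
    """ASSUMED normalization: LCS length / max length, a value in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return lcs_length(a, b) / longest
```

Unlike edit distance, this feature rewards characters that appear in the same order even when other characters are interleaved, which is one reason the two features can complement each other.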
The semantic relevance feature values between two short texts are semantic-level relevance features, which mainly measure the similarity of concept and meaning between short strings.
The semantic relevance feature values between two short texts can be calculated in the following ways:
(1) Semantic relevance feature value based on text classification
By way of example, embodiments of the present invention classify short texts mainly with a feature-word-based method, whose basic procedure is:
First, starting from an initial, manually labeled set of first-level industry category feature words (the set contains a small number of manually labeled first-level industry category feature words), hundreds of millions of web pages are classified one by one by full-text matching;
for web pages that have been assigned a category, the full text is segmented into words, category feature words are extracted, the weight contribution of each extracted category feature word to its category (that is, a weight vector) is calculated, and the category feature words extracted from the web pages are then merged into the set of first-level industry category feature words;
once feature words have been extracted from all web pages, a comprehensive set of first-level industry category feature words has been obtained automatically, which yields the first-level industry category feature word dictionary. The dictionary can be written as $p(c \mid w)$, where $c$ denotes a category and $w$ denotes a word; that is, each word has a category distribution.
Given two short texts $T_1$ and $T_2$, for each short text the category distribution of each word is obtained from $p(c \mid w)$; the category distribution of each word of the short text is multiplied by the global IDF weight of that word and the results are summed, which finally gives the category distribution $p(c \mid T)$ of the short text.
Using the cosine formula, the text classification similarity of the two short texts $T_1$ and $T_2$ is:
$$\cos\big(p(\cdot \mid T_1),\, p(\cdot \mid T_2)\big) = \frac{\sum_{c} p(c \mid T_1)\, p(c \mid T_2)}{\sqrt{\sum_{c} p(c \mid T_1)^2}\; \sqrt{\sum_{c} p(c \mid T_2)^2}}$$
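The IDF-weighted aggregation and cosine comparison described above can be sketched as follows. This is a sketch under assumptions: the function names, the dictionary layout (`word_dist` mapping each word to its category distribution, `idf` mapping each word to its weight) and the handling of unknown words are illustrative choices, not details from the patent.

```python
import math
from collections import defaultdict


def text_distribution(words, word_dist, idf):
    """Sum each word's category distribution p(c|w), scaled by the word's IDF weight."""
    total = defaultdict(float)
    for w in words:
        # ASSUMPTION: words missing from the dictionary contribute nothing.
        for category, p in word_dist.get(w, {}).items():
            total[category] += idf.get(w, 0.0) * p
    return dict(total)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no category evidence for at least one text
    return dot / (norm_u * norm_v)
```

The summed vector is not renormalized into a probability distribution here; cosine similarity does not depend on vector length, so the comparison is unaffected.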
(2) Semantic relevance feature value based on PLSA topic relevance
The PLSA model is an unsupervised machine learning model used to identify latent topic information in documents and to mine the latent semantic relations of documents. The PLSA model assumes that, when writing a document, the author first chooses the topic distribution of the document and then chooses suitable words according to that topic distribution, thereby forming a complete document. In mathematical terms:
A document is selected with probability $p(d)$, each document belongs to a topic with probability $p(z \mid d)$, and, given a topic, each word is generated with probability $p(w \mid z)$. The joint probability model for this process is:
$$p(d, w) = p(d)\, p(w \mid d)$$
$$p(w \mid d) = \sum_{z \in Z} p(w \mid z)\, p(z \mid d)$$
The PLSA model parameters are trained with the EM algorithm to obtain $p(z \mid d)$ and $p(w \mid z)$. Then $p(z \mid w)$ is obtained by Bayes' formula, $p(z \mid w) = p(w \mid z)\, p(z) / p(w)$.
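The Bayes inversion from $p(w \mid z)$ to $p(z \mid w)$ can be sketched as below. This is an illustrative sketch only: the function name and the nested-dictionary layout are hypothetical, the trained quantities are assumed to be available already, and $p(w)$ is computed as $\sum_z p(w \mid z)\, p(z)$.

```python
def topic_given_word(p_w_given_z, p_z):
    """Invert p(w|z) into p(z|w) with Bayes' formula.

    p_w_given_z: {topic: {word: p(w|z)}}
    p_z:         {topic: p(z)}
    returns:     {word: {topic: p(z|w)}}
    """
    vocabulary = {w for dist in p_w_given_z.values() for w in dist}
    p_z_given_w = {}
    for w in vocabulary:
        # Joint p(w, z) = p(w|z) * p(z) for every topic.
        joint = {z: p_w_given_z[z].get(w, 0.0) * p_z[z] for z in p_w_given_z}
        p_w = sum(joint.values())  # marginal p(w)
        if p_w > 0.0:
            p_z_given_w[w] = {z: v / p_w for z, v in joint.items()}
    return p_z_given_w
```

Because the result depends only on the vocabulary and not on any query, such a table could be computed offline once and looked up at query time.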
Given two short texts $T_1$ and $T_2$, for each short text the topic distribution of each word is obtained from $p(z \mid w)$; the topic distribution of each word of the short text is multiplied by the global IDF weight of that word and the results are summed, which gives the topic distribution $p(z \mid T)$ of the short text.
Using the cosine formula, the PLSA similarity of the two short texts is:
$$\cos\big(p(\cdot \mid T_1),\, p(\cdot \mid T_2)\big) = \frac{\sum_{z} p(z \mid T_1)\, p(z \mid T_2)}{\sqrt{\sum_{z} p(z \mid T_1)^2}\; \sqrt{\sum_{z} p(z \mid T_2)^2}}$$
(3) Semantic relevance feature value based on statistical machine translation
The idea of translation probability between bilingual sentence pairs in statistical machine translation lends itself naturally to modeling the relevance of short texts.
Given two short texts $T_1$ and $T_2$, let $P(T_1 \mid T_2)$ be the probability that $T_1$ occurs given $T_2$, that is, the likelihood.
Clearly, the more related $T_1$ and $T_2$ are, the larger the likelihood. Because texts vary endlessly, modeling the likelihood directly is difficult, so it is rewritten with Bayes' formula as follows:
$$P(T_1 \mid T_2) = \frac{P(T_2 \mid T_1)\, P(T_1)}{P(T_2)}$$
Here $P(T_2 \mid T_1)$ is the translation model of machine translation and represents the probability that $T_1$ is translated into $T_2$; $P(T_1)$ and $P(T_2)$ are the language models of $T_1$ and $T_2$ and describe the probability that $T_1$ and $T_2$, respectively, are valid short texts.
Under the bag-of-words (BOW) model assumption: [formula not reproduced in this text]
Here $P(t_{2j} \mid t_{1i})$ is the translation probability from word $t_{1i}$ to word $t_{2j}$, that is, the word alignment dictionary. The translation probabilities between word pairs can be trained on a parallel corpus with the EM algorithm.
In a specific application, both the translation model and the language model can be trained from large-scale web search logs and advertisers' bid terms with the open-source machine translation software Moses.
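To illustrate how a word alignment dictionary could be used under a bag-of-words assumption, the sketch below scores $P(T_2 \mid T_1)$ with an IBM Model 1 style factorization: each target word is explained by the average translation probability over all source words. This factorization, the smoothing floor, the function name and the table layout are all assumptions made for illustration; they are not the patent's formula and not an API of Moses.

```python
import math


def log_translation_prob(source, target, table, floor=1e-6):
    """log P(target | source) under an ASSUMED IBM Model 1 style factorization.

    source, target: lists of words
    table: {(source_word, target_word): translation probability}
    floor: hypothetical smoothing value for target words with no known alignment
    """
    if not source or not target:
        return float("-inf")
    total = 0.0
    for t in target:
        # Average the word translation probabilities over all source words.
        avg = sum(table.get((s, t), 0.0) for s in source) / len(source)
        total += math.log(max(avg, floor))
    return total
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.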
The relevance feature of two short texts $T_1$ and $T_2$ based on the machine translation model is designed as follows: [formula not reproduced in this text]
In statistical machine translation, this method works well for translation mapping between different languages. Within a single language (for example, when both short strings are Chinese), however, experiments show that the coverage of the translation dictionary is limited, and raising the coverage requires a large increase in parallel corpora. Embodiments of the present invention borrow the idea of machine translation to construct a relevance feature between short texts.
(4) Word-granularity semantic relevance feature value based on web search results
The core of the machine-translation-based relevance feature above is the word alignment dictionary. Inspired by this word-granularity mapping, embodiments of the present invention further provide a relevance feature based on the web search results of words to describe the relevance between short texts.
Given a word, the N feature words with the largest TF-IDF values are extracted from its web search results (N is 64 in the actual system), and the feature vector formed by the TF-IDF values of these feature words, $V(t) = (w_1, w_2, \ldots, w_n)$, serves as the representation of the word's meaning. The relevance of two words $t_1$ and $t_2$ based on their web search results is then defined as follows: [formula not reproduced in this text]
The relevance feature of two short texts $T_1$ and $T_2$ based on the web search results of words is designed as follows: [formula not reproduced in this text]
With word-granularity features, only the TF-IDF feature vectors of common words need to be stored, which greatly reduces disk space overhead; there is no need to store a massive number of long retrieval strings. Each retrieval string can be expressed through the features of its finer-grained words, and the relevance between short texts can be measured with the formula above.
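A possible shape for the word-granularity feature is sketched below. Both formulas used here are assumptions made for illustration: word relevance is taken to be the cosine of the two TF-IDF vectors, and text relevance is taken to be the average, over the words of the first text, of the best word relevance found in the second text. The function names and the `vectors` layout (word to sparse TF-IDF dict) are hypothetical.

```python
import math


def top_n(vector, n=64):
    """Keep the n feature words with the largest TF-IDF values."""
    ranked = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def word_relevance(v1, v2):
    """ASSUMED word relevance: cosine of two sparse TF-IDF vectors."""
    dot = sum(x * v2.get(k, 0.0) for k, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)


def text_relevance(words1, words2, vectors):
    """ASSUMED aggregation: average best match of each word in words1 against words2."""
    if not words1 or not words2:
        return 0.0
    best = [
        max(word_relevance(vectors.get(a, {}), vectors.get(b, {})) for b in words2)
        for a in words1
    ]
    return sum(best) / len(best)
```

Only one vector per vocabulary word is stored, so the storage requirement grows with the vocabulary size rather than with the number of distinct queries.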
According to above-mentioned algorithm, multiple correlative character values can be calculated(It is related and/or semantic related including text), Then these correlative character values can be merged to get up to constitute a total correlative character value.
It specifically includes:
According to aforementioned, the correlative character value of multiple and different dimensions, the spy of specific choice can be calculated between short string Sign includes but is not limited to:Editing distance, longest common subsequence, classification, PLSA topic model, word-based granularity correlation Deng all correlative character values are finally fitted to a total semantic dependency score value using Logic Regression Models.
The sample of the training corpus of semantic dependency model is usually the relevance score that two short texts are provided with editor, Wish model output is the relevance score between one 0 to 1.However, logistic regression is a disaggregated model, it is desirable that training The sample of corpus is feature vector and a class label, and model output is also a class label.
Embodiment of the present invention includes::
Multiple correlative character score values above-mentioned are calculated to the short text of each pair of editor's mark, a feature of composition to Amount;
M training examples are constituted with each feature vector, it, then will wherein if editor's marking is S (S ∈ [0,1]) The category label of a sample is 1, remaining sample is labeled as 0;
The weight w of each correlative character is obtained using the training of two sorted logic regression models1,w2...wnWith biasing b;
For giving two short text T1、T2, first calculate its multiple correlative character score value R above-mentioned1,R2...Rn, then Final relevance score, which is calculated, using Sigmoid function is
The input domain of the sigmoid function is (-∞, +∞) and its output range is (0, 1), which makes it well suited to calculating correlation scores.
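The scoring step can be sketched in a few lines. The function name and the numerically stable branch are illustrative choices, not part of the described embodiment; the weights and bias are assumed to come from the trained binary logistic regression model.

```python
import math


def correlation_score(features, weights, bias):
    """Sigmoid of the weighted sum of the feature scores R_1..R_n."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(w * r for w, r in zip(weights, features)) + bias
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

For example, `correlation_score([0.8, 0.6], [2.0, 1.5], -1.0)` applies hypothetical weights to two feature scores and returns a value between 0 and 1.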
Embodiments of the present invention can be applied in many fields. For example, in a production retrieval system for search advertising, the logistic regression model can be used for a preliminary selection of bid keywords: a threshold is set on the correlation score between short strings, keywords below it are filtered out, and the bid keywords semantically most relevant to the query string are retained as candidates.
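A minimal sketch of this preliminary selection is given below. `score_fn` stands in for the fitted correlation model, and the default threshold of 0.5 and the optional `top_k` cut-off are assumptions made for illustration only.

```python
def filter_bid_keywords(query, keywords, score_fn, threshold=0.5, top_k=None):
    """Keep keywords whose score against the query reaches the threshold.

    Returns (keyword, score) pairs sorted from most to least relevant.
    """
    scored = [(kw, score_fn(query, kw)) for kw in keywords]
    kept = [(kw, s) for kw, s in scored if s >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept if top_k is None else kept[:top_k]
```

In practice the threshold would be tuned on labeled data; a higher value trades candidate recall for precision.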
In conclusion being faced on short text in traditional calculation method based on word vector space model in document The sparse problem of feature.Simultaneously as the word segmentation result of short text depends on language model, different word segmentations are not ensured that Unanimously, the sparse of vector can also be aggravated to a certain extent.
For this problem, embodiment of the present invention proposes based on character strings levels such as editing distance, longest common subsequences Text relevant as basic feature, text similarity between they can express short string from multiple dimensions can preferably be handled very More short texts are lack of standardization, participle is inaccurate or inconsistent situation.
Moreover, traditional correlation calculation methods based on literal similarity mainly use the traditional BOW (bag-of-words) model. This model generally rests on the assumption that features are independent and measures the correlation of short texts according to how their feature vectors match. In practice, however, many associations often exist between features. In particular, when a word has several meanings or a meaning is expressed by several words, semantic drift occurs and the correlation calculation becomes inaccurate.
To address this problem, embodiments of the present invention propose correlation features based on text classification and probabilistic latent semantic analysis. These features can fully mine the implicit relationships between a short text and the words that make it up, and thereby calculate the category connection and the topic connection between two short texts, supplementing the text correlation features.
Moreover, traditional calculation methods based on the web search results of short texts use external resources to form a literal expansion of the short string. In terms of effect, the expansion result depends heavily on the relevance quality of the chosen search engine or similar product. In terms of performance, the volume of search results relied on is huge, and the corresponding results must be stored for every short string, which places very high demands on download and calculation speed. Two short texts that are synonymous but differ slightly in wording, or even only in word order, may yield very different search results, which must be stored separately. In addition, index results are updated regularly, so the stored expansion results must change with them; how to keep expansion quality from declining and how to balance data freshness against update overhead are unavoidable problems.
Embodiments of the present invention propose a correlation feature based on the web search results of words. The amount of dictionary resources relied on is controllable, and single-machine storage space and calculation speed are significantly improved, which makes it possible to compute lightweight semantic correlation between short strings online.
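The claims characterize the meaning of a word by the feature words with the largest TF-IDF values extracted from that word's web search results. A minimal sketch of that extraction follows; the tokenized search results, the global `idf` table, the `default_idf` fallback, and the choice of `top_k` are all hypothetical inputs.

```python
from collections import Counter


def word_sense_vector(result_tokens, idf, top_k=3, default_idf=1.0):
    """Keep the top_k feature words by TF-IDF from one word's search results.

    result_tokens: list of token lists, one per search result.
    idf: mapping from token to a global inverse document frequency weight.
    """
    counts = Counter(tok for doc in result_tokens for tok in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    tfidf = {
        tok: (n / total) * idf.get(tok, default_idf)
        for tok, n in counts.items()
    }
    # Sort by descending TF-IDF, breaking ties alphabetically for determinism.
    top = sorted(tfidf.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return dict(top)
```

Because only `top_k` entries are kept per word, the size of the stored dictionary is bounded by the vocabulary size times `top_k`, which is what keeps the resource footprint controllable.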
Based on the detailed analysis above, embodiments of the present invention also propose a text correlation calculation apparatus.
Fig. 2 is a structural diagram of the text correlation calculation apparatus according to an embodiment of the present invention.
As shown in Fig. 2, the apparatus includes a character string receiving unit 201, a correlation feature value calculating unit 202, and a correlation feature value fitting unit 203, wherein:
The character string receiving unit 201 is configured to receive a first character string and a second character string;
The correlation feature value calculating unit 202 is configured to calculate a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string;
The correlation feature value fitting unit 203 is configured to fit, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string.
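The division of labor among the three units can be illustrated with a small skeleton. The class name, field names, and the use of plain callables for the feature calculators and the fitting step are assumptions for illustration; they are not the structure shown in Fig. 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

FeatureFn = Callable[[str, str], float]


@dataclass
class CorrelationApparatus:
    text_features: Sequence[FeatureFn]       # text-side calculators (role of unit 202)
    semantic_features: Sequence[FeatureFn]   # semantic-side calculators (role of unit 202)
    fit: Callable[[Sequence[float]], float]  # fitted model (role of unit 203)

    def receive(self, first: str, second: str) -> Tuple[str, str]:
        """Role of unit 201: accept the two character strings."""
        if not isinstance(first, str) or not isinstance(second, str):
            raise TypeError("both inputs must be strings")
        return first, second

    def compute_features(self, first: str, second: str) -> List[float]:
        """Role of unit 202: one value per configured feature calculator."""
        calculators = (*self.text_features, *self.semantic_features)
        return [f(first, second) for f in calculators]

    def correlate(self, first: str, second: str) -> float:
        """Run the three roles in order and return the fitted value."""
        a, b = self.receive(first, second)
        return self.fit(self.compute_features(a, b))
```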
In one embodiment:
The correlation feature value calculating unit 202 is configured to calculate a correlation feature value of the first character string and the second character string based on edit distance, and/or to calculate a correlation feature value of the first character string and the second character string based on the longest common subsequence.
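Both string-level features can be computed with standard dynamic programming, as sketched below. Normalizing by the length of the longer string so that each feature lies in [0, 1] is an assumption; this embodiment does not state how the raw distance or subsequence length is turned into a feature value.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance_feature(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_feature(a, b):
    """Assumed normalization: LCS length / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest
```

Both functions operate on characters rather than segmented words, so they are unaffected by inconsistent word segmentation.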
In one embodiment:
The correlation feature value calculating unit 202 is configured to build a level-one industry category feature word dictionary; for the first character string, to obtain the category distribution of each word from the level-one industry category feature word dictionary, multiply the category distribution of each word by the global inverse document frequency (IDF) weight of that word, and accumulate the results to obtain the category distribution of the first character string; for the second character string, to obtain the category distribution of each word from the level-one industry category feature word dictionary, multiply the category distribution of each word by the global IDF weight of that word, and accumulate the results to obtain the category distribution of the second character string; and to calculate the cosine similarity of the category distributions of the first character string and the second character string, so as to obtain the semantic correlation feature value of the first character string and the second character string.
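A minimal sketch of this calculation is shown below. The per-word category distributions and IDF weights are hypothetical inputs that would come from the feature word dictionary and a global corpus; words missing from either table are simply skipped, which is an assumption.

```python
import math


def string_distribution(words, word_dist, idf):
    """IDF-weighted sum of the per-word category distributions."""
    total = {}
    for word in words:
        weight = idf.get(word, 0.0)
        for category, p in word_dist.get(word, {}).items():
            total[category] = total.get(category, 0.0) + weight * p
    return total


def cosine(u, v):
    """Cosine similarity of two sparse distributions (0.0 if either is empty)."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def category_correlation(words_a, words_b, word_dist, idf):
    """Cosine similarity of the two strings' category distributions."""
    dist_a = string_distribution(words_a, word_dist, idf)
    dist_b = string_distribution(words_b, word_dist, idf)
    return cosine(dist_a, dist_b)
```

The same aggregation would apply unchanged to per-word topic distributions from a PLSA model, with topics taking the place of categories.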
In one embodiment:
The correlation feature value calculating unit 202 is configured to classify each web page by full-text matching, based on a manually labeled level-one industry category feature word set; to perform full-text word segmentation on the web pages that have a category attribute and extract category feature words; and to merge the extracted category feature words into the level-one industry category feature word set, so as to build the level-one industry category feature word dictionary.
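The dictionary-building loop can be sketched as follows. The text does not specify how a page is matched or which words count as category feature words, so two rules are assumed here purely for illustration: a page takes the category whose seed words occur most often in it, and a word is extracted when it appears at least `min_count` times in one category's pages and never in another category's pages.

```python
from collections import Counter, defaultdict


def classify_page(tokens, seed_words):
    """Full-text matching: the category whose seed words occur most, or None."""
    hits = {
        cat: sum(1 for t in tokens if t in words)
        for cat, words in seed_words.items()
    }
    best = max(hits, key=hits.get, default=None)
    return best if best is not None and hits[best] > 0 else None


def build_dictionary(pages, seed_words, min_count=2):
    """Expand the manually labeled seed sets with words mined from pages."""
    by_category = defaultdict(Counter)
    for tokens in pages:
        cat = classify_page(tokens, seed_words)
        if cat is not None:
            by_category[cat].update(tokens)
    dictionary = {cat: set(words) for cat, words in seed_words.items()}
    for cat, counts in by_category.items():
        for word, n in counts.items():
            # Assumed rule: frequent in this category and absent from all others.
            elsewhere = any(
                word in other_counts
                for other, other_counts in by_category.items()
                if other != cat
            )
            if n >= min_count and not elsewhere:
                dictionary[cat].add(word)
    return dictionary
```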
In one embodiment:
The correlation feature value calculating unit 202 is configured, for the first character string, to obtain the topic distribution of each word, multiply the topic distribution of each word in the first character string by the global inverse document frequency (IDF) weight of that word, and accumulate the results to obtain the topic distribution of the first character string; for the second character string, to obtain the topic distribution of each word, multiply the topic distribution of each word in the second character string by the global IDF weight of that word, and accumulate the results to obtain the topic distribution of the second character string; and to calculate the cosine similarity of the topic distributions of the first character string and the second character string, so as to obtain the semantic correlation feature value of the first character string and the second character string.
In one embodiment:
The correlation feature value calculating unit 202 is configured to calculate a correlation feature value of the first character string and the second character string based on statistical machine translation, and/or to calculate a word-granularity semantic correlation feature value of the first character string and the second character string based on web search results.
In one embodiment:
The correlation feature value fitting unit 203 is configured to construct a feature vector from the calculated text correlation feature value and semantic correlation feature value of the first character string and the second character string; to construct training samples from the feature vector and train a binary logistic regression model on the training samples, obtaining the weight of the text correlation feature value, the weight of the semantic correlation feature value, and the bias; and to calculate the correlation feature value from the text correlation feature value and its weight, the semantic correlation feature value and its weight, and the bias.
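One way to train a binary classifier on graded editor scores is to replicate each feature vector and label a share of the copies as positive. The sketch below assumes that share is round(M × S) for an editor score S in [0, 1]; the value of M, the learning rate, the number of epochs, and the plain batch gradient descent are likewise illustrative assumptions rather than the described training procedure.

```python
import math


def expand_samples(feature_vector, score, m=10):
    """Turn one editor score in [0, 1] into m binary-labeled copies."""
    positives = round(m * score)
    return [(feature_vector, 1 if i < positives else 0) for i in range(m)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(samples, epochs=2000, lr=0.5):
    """Batch gradient descent on the mean log loss; returns (weights, bias)."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in samples:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b


def predict(x, w, b):
    """Fitted correlation value for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Because the positive fraction of each replicated group matches the editor score, the probability the classifier learns for that feature vector approaches the score itself.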
In one embodiment:
The correlation feature value calculating unit 202 is configured to perform at least one of the following calculations:
Calculating a correlation feature value of the first character string and the second character string based on edit distance;
Calculating a correlation feature value of the first character string and the second character string based on the longest common subsequence;
Calculating a correlation feature value of the first character string and the second character string based on text classification;
Calculating a topic correlation feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA);
Calculating a correlation feature value of the first character string and the second character string based on statistical machine translation;
Calculating a word-granularity correlation feature value of the first character string and the second character string based on web search results.
In fact, the text correlation calculation method proposed by embodiments of the present invention can be implemented in many forms. For example, following an application programming interface of a given specification, the method can be written as a plug-in installed on a server, or packaged as an application for users to download and use. When written as a plug-in, it can be implemented in various plug-in formats such as ocx, dll, and cab. The method can also be implemented through specific technologies such as Flash plug-ins, RealPlayer plug-ins, MMS plug-ins, MIDI staff plug-ins, and ActiveX plug-ins.
The text correlation calculation method proposed by embodiments of the present invention can be stored on various storage media in the form of stored instructions or instruction sets. These storage media include, but are not limited to: floppy disks, CDs, DVDs, hard disks, flash memory, USB flash drives, CF cards, SD cards, MMC cards, SM cards, Memory Sticks, and xD cards.
In addition, the text correlation calculation method proposed by embodiments of the present invention can be applied to storage media based on NAND flash, such as USB flash drives, CF cards, SD cards, SDHC cards, MMC cards, SM cards, Memory Sticks, and xD cards.
In summary, in embodiments of the present invention, a first character string and a second character string are received; a text correlation feature value of the first character string and the second character string and a semantic correlation feature value of the first character string and the second character string are calculated; and the text correlation feature value and the semantic correlation feature value are fitted, based on a logistic regression model, into a correlation feature value of the first character string and the second character string. Embodiments of the present invention thus avoid calculation methods based on the word vector space model of documents and therefore avoid the feature sparsity problem, which improves the accuracy of correlation prediction, saves storage space, and reduces cost.
Moreover, embodiments of the present invention propose string-level text correlation, such as edit distance and longest common subsequence, as basic features. These features express the literal similarity between short strings from multiple dimensions and better handle the many cases in which short texts are non-standard or word segmentation is inaccurate or inconsistent.
In addition, embodiments of the present invention propose correlation features based on text classification and probabilistic latent semantic analysis. These features can fully mine the implicit relationships between a short text and the words that make it up, and thereby calculate the category connection and the topic connection between two short texts, supplementing the text correlation features.
In addition, embodiments of the present invention propose a correlation feature based on the web search results of words. The amount of dictionary resources relied on is controllable, and single-machine storage space and calculation speed are significantly improved, which makes it possible to compute lightweight semantic correlation between short strings online.
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (14)

1. A text correlation calculation method, characterized in that the method comprises:
Receiving a first character string and a second character string;
Calculating a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string;
Fitting, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string;
Wherein calculating the semantic correlation feature value of the first character string and the second character string comprises:
Calculating a word-granularity correlation feature value of the first character string and the second character string based on web search results;
Wherein calculating the word-granularity correlation feature value of the first character string and the second character string based on web search results comprises:
Extracting, from the web search results of a word, the multiple feature words with the largest TF-IDF values, the feature vector formed by the TF-IDF values of these feature words serving as the characterization of the meaning of the word.
2. The text correlation calculation method according to claim 1, characterized in that calculating the text correlation feature value of the first character string and the second character string comprises:
Calculating a correlation feature value of the first character string and the second character string based on edit distance, and/or calculating a correlation feature value of the first character string and the second character string based on the longest common subsequence.
3. The text correlation calculation method according to claim 1, characterized in that calculating the semantic correlation feature value of the first character string and the second character string comprises:
Building an industry category feature word dictionary;
For the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global inverse document frequency (IDF) weight of that word, and accumulating the results to obtain the category distribution of the first character string; for the second character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global IDF weight of that word, and accumulating the results to obtain the category distribution of the second character string;
Calculating the cosine similarity of the category distributions of the first character string and the second character string, so as to obtain the semantic correlation feature value of the first character string and the second character string.
4. The text correlation calculation method according to claim 3, characterized in that
Building the industry category feature word dictionary comprises:
Classifying each web page by full-text matching, based on a manually labeled industry category feature word set;
Performing full-text word segmentation on the web pages that have a category attribute, extracting category feature words, and merging the extracted category feature words into the industry category feature word set, so as to build the industry category feature word dictionary.
5. The text correlation calculation method according to claim 1, characterized in that
Calculating the semantic correlation feature value of the first character string and the second character string comprises:
For the first character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the first character string by the global inverse document frequency (IDF) weight of that word, and accumulating the results to obtain the topic distribution of the first character string; for the second character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the second character string by the global IDF weight of that word, and accumulating the results to obtain the topic distribution of the second character string;
Calculating the cosine similarity of the topic distributions of the first character string and the second character string, so as to obtain the semantic correlation feature value of the first character string and the second character string.
6. The text correlation calculation method according to any one of claims 1-5, characterized in that fitting, based on the logistic regression model, the text correlation feature value and the semantic correlation feature value into the correlation feature value comprises:
Constructing a feature vector from the calculated text correlation feature value and semantic correlation feature value of the first character string and the second character string;
Constructing training samples from the feature vector and training a binary logistic regression model on the training samples, to obtain the weight of the text correlation feature value, the weight of the semantic correlation feature value, and the bias;
Calculating the correlation feature value from the text correlation feature value and its weight, the semantic correlation feature value and its weight, and the bias.
7. The text correlation calculation method according to any one of claims 1-5, characterized in that
Calculating the semantic correlation feature value of the first character string and the second character string comprises: calculating a correlation feature value of the first character string and the second character string based on text classification.
8. A text correlation calculation apparatus, characterized in that the apparatus comprises a character string receiving unit, a correlation feature value calculating unit, and a correlation feature value fitting unit, wherein:
The character string receiving unit is configured to receive a first character string and a second character string;
The correlation feature value calculating unit is configured to calculate a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string;
The correlation feature value fitting unit is configured to fit, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string;
Wherein calculating the semantic correlation feature value of the first character string and the second character string comprises:
Calculating a word-granularity correlation feature value of the first character string and the second character string based on web search results;
Wherein calculating the word-granularity correlation feature value of the first character string and the second character string based on web search results comprises:
Extracting, from the web search results of a word, the multiple feature words with the largest TF-IDF values, the feature vector formed by the TF-IDF values of these feature words serving as the characterization of the meaning of the word.
9. The text correlation calculation apparatus according to claim 8, characterized in that
The correlation feature value calculating unit is configured to calculate a correlation feature value of the first character string and the second character string based on edit distance, and/or to calculate a correlation feature value of the first character string and the second character string based on the longest common subsequence.
10. The text correlation calculation apparatus according to claim 8, characterized in that
The correlation feature value calculating unit is configured to build an industry category feature word dictionary; for the first character string, to obtain the category distribution of each word from the industry category feature word dictionary, multiply the category distribution of each word by the global inverse document frequency (IDF) weight of that word, and accumulate the results to obtain the category distribution of the first character string; for the second character string, to obtain the category distribution of each word from the industry category feature word dictionary, multiply the category distribution of each word by the global IDF weight of that word, and accumulate the results to obtain the category distribution of the second character string; and to calculate the cosine similarity of the category distributions of the first character string and the second character string, so as to obtain the semantic correlation feature value of the first character string and the second character string.
11. The text correlation calculation apparatus according to claim 10, characterized in that
The correlation feature value calculating unit is configured to classify each web page by full-text matching, based on a manually labeled industry category feature word set; to perform full-text word segmentation on the web pages that have a category attribute and extract category feature words; and to merge the extracted category feature words into the industry category feature word set, so as to build the industry category feature word dictionary.
12. The text correlation calculation apparatus according to claim 8, characterized in that
The correlation feature value calculating unit is configured, for the first character string, to obtain the topic distribution of each word, multiply the topic distribution of each word in the first character string by the global inverse document frequency (IDF) weight of that word, and accumulate the results to obtain the topic distribution of the first character string; for the second character string, to obtain the topic distribution of each word, multiply the topic distribution of each word in the second character string by the global IDF weight of that word, and accumulate the results to obtain the topic distribution of the second character string; and to calculate the cosine similarity of the topic distributions of the first character string and the second character string, so as to obtain the semantic correlation feature value of the first character string and the second character string.
13. The text correlation calculation apparatus according to any one of claims 8-12, characterized in that
The correlation feature value fitting unit is configured to construct a feature vector from the calculated text correlation feature value and semantic correlation feature value of the first character string and the second character string; to construct training samples from the feature vector and train a binary logistic regression model on the training samples, obtaining the weight of the text correlation feature value, the weight of the semantic correlation feature value, and the bias; and to calculate the correlation feature value from the text correlation feature value and its weight, the semantic correlation feature value and its weight, and the bias.
14. The text correlation calculation apparatus according to any one of claims 8-12, characterized in that
The correlation feature value calculating unit is configured to calculate a correlation feature value of the first character string and the second character string based on text classification.
CN201310388496.XA 2013-08-30 2013-08-30 A kind of correlation calculations method and apparatus of text Active CN104424279B (en)



Also Published As

Publication number Publication date
CN104424279A (en) 2015-03-18

Similar Documents

Publication Publication Date Title
CN104424279B (en) A kind of correlation calculations method and apparatus of text
Khan et al. A novel natural language processing (NLP)–based machine translation model for English to Pakistan sign language translation
CN106649818B (en) Application search intent identification method, device, application search method and server
US20200073882A1 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
Schubotz et al. Semantification of identifiers in mathematics for better math information retrieval
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
CN110378409A (en) Chinese-Vietnamese news document abstract generation method based on element association attention mechanism
CN104573028A (en) Intelligent question-answer implementing method and system
CN108319734A (en) A kind of product feature structure tree method for auto constructing based on linear combiner
Chen et al. Doctag2vec: An embedding based multi-label learning approach for document tagging
US20230111911A1 (en) Generation and use of content briefs for network content authoring
Hu et al. Self-supervised synonym extraction from the web.
CN110569355A (en) A combined method and system for opinion target extraction and target sentiment classification based on lexical chunks
CN104317837B (en) A kind of cross-module state search method based on topic model
Lin et al. NL2API: A framework for bootstrapping service recommendation using natural language queries
Laddha et al. Extracting aspect specific opinion expressions
Zhang et al. PKU paraphrase bank: A sentence-level paraphrase corpus for Chinese
US20190095525A1 (en) Extraction of expression for natural language processing
Hourrane et al. Using deep learning word embeddings for citations similarity in academic papers
CN103646017A (en) Acronym generating system for naming and working method thereof
CN111382333B (en) Case element extraction method in news text sentences based on case correlation joint learning and graph convolution
Rubtsova et al. Aspect extraction from reviews using conditional random fields
CN118395987A (en) BERT-based landslide hazard assessment named entity identification method of multi-neural network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant