CN104424279B - Text correlation calculation method and apparatus - Google Patents
- Publication number
- CN104424279B (application number CN201310388496.XA)
- Authority
- CN
- China
- Prior art keywords
- character string
- word
- character
- characteristic value
- text
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention propose a text correlation calculation method and apparatus. The method includes: receiving a first character string and a second character string; calculating a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string; and fitting, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string. Embodiments of the present invention improve the accuracy of correlation prediction, save storage space, and reduce cost.
Description
Technical field
Embodiments of the present invention relate to the technical field of Internet applications, and more particularly to a text correlation calculation method and apparatus.
Background technique
With the rapid development of computer and network technology, the Internet plays an ever larger role in daily life, study, and work. Applications on the Internet emerge one after another.
Search advertising is a very important business in the Internet advertising ecosystem. It relies on search engines and is, in essence, matching sold on the basis of keywords. In the business promotion database, besides providing the title and description that are displayed, an advertiser also adds keywords that have a certain correlation with the advertisement (purchase words), and specifies the match type and bid in order to target matching traffic (users whose retrieval intent is met). In the classical matching process, purchase words form a direct index to advertisements. When a user's query word matches an advertiser's purchase word and the correlation reaches a certain degree, the preliminary selection condition for triggering the advertisement is considered to be met (assuming other targeting and filtering steps are ignored for now), and the corresponding advertisement (title, description) can be pulled for further selection, such as click-through rate estimation, advertisement ranking, and display strategy selection.
In the retrieval (Retrieve) stage, the advertising system takes the user's query string and applies a variety of online and offline strategies to match purchase words. The purchase words found here are all short texts that the advertiser specified when filling in the material and that are related to the advertisement title and description. Measuring the correlation between a query word (query) and a candidate purchase word (bidterm) in the online system is, in essence, measuring the correlation between short texts.
Traditionally there are many methods based on literal string matching, and the offline and online evaluation methods also differ; all of them have certain limitations. Sahami et al. of Google proposed using the web search results of a short text as its semantic extension and calculating the semantic correlation between short texts on that basis, which works better than purely word-based approaches. Metzler of the University of Massachusetts, Dumais of Microsoft, and others also tried multiple short-text representation methods for calculating semantic correlation.
However, traditional calculation methods based on the document word vector space model face the problem of feature sparsity on short texts. In addition, because the word segmentation result of a short text depends on the language model, different word segmenters are not guaranteed to be consistent, which aggravates the sparsity of the vectors to some extent. Traditional calculation methods based on the document word vector space model therefore suffer from low correlation prediction accuracy.
Moreover, traditional calculation methods based on the document word vector space model need a large amount of storage space to store word vectors, which wastes storage space and raises cost.
Summary of the invention
An embodiment of the present invention proposes a text correlation calculation method that improves the accuracy of correlation prediction.
An embodiment of the present invention proposes a text correlation calculation apparatus that improves the accuracy of correlation prediction.
The technical solution of the embodiments of the present invention is as follows:
A text correlation calculation method, the method including:
Receiving a first character string and a second character string;
Calculating a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string;
Fitting, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string.
A text correlation calculation apparatus, the apparatus including a character string receiving unit, a correlation feature value calculating unit, and a correlation feature value fitting unit, wherein:
The character string receiving unit is configured to receive a first character string and a second character string;
The correlation feature value calculating unit is configured to calculate a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string;
The correlation feature value fitting unit is configured to fit, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string.
As can be seen from the above technical solution, in embodiments of the present invention a first character string and a second character string are received; a text correlation feature value and a semantic correlation feature value of the first character string and the second character string are calculated; and, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value are fitted into a correlation feature value of the first character string and the second character string. Embodiments of the present invention thus avoid the calculation method based on the document word vector space model and with it the feature sparsity problem, which improves the accuracy of correlation prediction, saves storage space, and reduces cost.
Moreover, embodiments of the present invention propose string-level text correlation measures such as edit distance and longest common subsequence as basic features. They express the text similarity between short strings along multiple dimensions and cope well with the many cases in which short texts are irregular or word segmentation is inaccurate or inconsistent.
In addition, embodiments of the present invention propose correlation features based on text classification and on probabilistic latent semantic analysis. These fully exploit the implicit relationship between a short text and the words that compose it, so that the category relationship and topic relationship between two short texts can be calculated, supplementing the text correlation features.
In addition, embodiments of the present invention propose a correlation feature based on per-word web search results. The number of dictionary resources it relies on is controllable, and single-machine storage space and calculation speed are significantly improved, which makes lightweight online calculation of semantic correlation between short strings possible.
Detailed description of the invention
Fig. 1 is a flowchart of the text correlation calculation method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the text correlation calculation apparatus according to an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
Many applications involve calculating the correlation of two short texts. The correlation of two short texts refers to the degree to which the two are semantically associated; they need not be literally similar. Correlation is a broader concept than similarity (Similarity) and is of great significance in many products and systems. A short text is a character string of relatively short length, for example no more than 38 Chinese characters in certain network applications.
A purchase word (Bidterm) is a word that an advertiser submits for bidding in a bid advertising system; a query word (Query) is a search keyword that a user submits to a search engine. Query words and purchase words are generally short character strings, so all query words and purchase words can be referred to as short texts.
Fig. 1 is a flowchart of the text correlation calculation method according to an embodiment of the present invention.
As shown in Fig. 1, the method includes:
Step 101: Receive a first character string and a second character string.
Here, the first character string and the second character string are preferably both short texts. For example, the first character string and the second character string can be a query word and a purchase word, respectively.
Step 102: Calculate the text correlation feature value of the first character string and the second character string, and the semantic correlation feature value of the first character string and the second character string.
The text-level correlation feature mainly measures the text similarity between short strings. It uses only the text information of the short strings and can be computed on the fly by efficient optimized algorithms.
For example, a correlation feature value based on edit distance can be calculated for the first character string and the second character string, and/or a correlation feature value based on the longest common subsequence.
The semantic-level correlation feature mainly measures the similarity in concept and meaning between short strings.
In one embodiment, calculating the semantic correlation feature value of the first character string and the second character string includes:
Constructing an industry category feature word dictionary (for example, a first-level industry category feature word dictionary);
For the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global inverse document frequency weight of that word, and summing the results to obtain the category distribution of the first character string; for the second character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global inverse document frequency weight of that word, and summing the results to obtain the category distribution of the second character string;
Calculating the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.
Preferably, constructing the industry category feature word dictionary includes:
Classifying each web page by full-text matching, based on a manually labeled industry category feature word set;
For web pages that have a category attribute, performing full-text word segmentation, extracting category feature words, and merging the extracted category feature words into the industry category feature word set, thereby constructing the industry category feature word dictionary.
In one embodiment, calculating the semantic correlation feature value of the first character string and the second character string includes:
For the first character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the first character string by the global inverse document frequency weight of that word, and summing the results to obtain the topic distribution of the first character string; for the second character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the second character string by the global inverse document frequency weight of that word, and summing the results to obtain the topic distribution of the second character string;
Calculating the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.
In one embodiment, calculating the semantic correlation feature value of the first character string and the second character string includes: calculating a correlation feature value of the first character string and the second character string based on statistical machine translation.
In one embodiment, calculating the semantic correlation feature value of the first character string and the second character string includes: calculating a word-granularity semantic correlation feature value of the first character string and the second character string based on web search results.
In fact, several calculation methods can be used at the same time to calculate text correlation feature values of the first character string and the second character string. For example, both a correlation feature value based on edit distance and a correlation feature value based on the longest common subsequence can be calculated for the first character string and the second character string, and both are then used as calculated text correlation feature values in the fitting calculation of step 103.
Similarly, several calculation methods can be used at the same time to calculate semantic correlation feature values of the first character string and the second character string.
For example, calculating the semantic correlation feature value of the first character string and the second character string includes at least one of the following:
Calculating a correlation feature value of the first character string and the second character string based on edit distance; calculating a correlation feature value of the first character string and the second character string based on the longest common subsequence; calculating a correlation feature value of the first character string and the second character string based on text classification; calculating a topic correlation feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA); calculating a correlation feature value of the first character string and the second character string based on statistical machine translation; calculating a word-granularity correlation feature value of the first character string and the second character string based on web search results.
All calculated semantic correlation feature values then take part in the fitting calculation of step 103.
Step 103: Based on a logistic regression model, fit the text correlation feature value and the semantic correlation feature value into the correlation feature value of the first character string and the second character string.
Here, a feature vector is constructed from the calculated text correlation feature value and semantic correlation feature value of the first character string and the second character string;
Training samples are constructed from the feature vectors, and a binary logistic regression model is trained on the training samples to obtain the weight of the text correlation feature value, the weight of the semantic correlation feature value, and the bias;
The correlation feature value is calculated from the text correlation feature value and its weight, the semantic correlation feature value and its weight, and the bias.
The text correlation calculation method of the embodiments of the present invention is described in more detail below.
The problem solved by the present invention is formally defined as follows:
Given two short texts T1 and T2, calculate the semantic correlation R(T1,T2) that reflects their degree of semantic association, where R(T1,T2)∈[0,1].
For a short text T, its string length is denoted |T| and its word segmentation result is written T=t1t2...tn; the word segmentation results of T1 and T2 are then T1=t11t12...t1n and T2=t21t22...t2m respectively, where n and m are the numbers of words in T1 and T2.
Correlation features along multiple dimensions are first calculated for the two short texts, and a logistic regression model is then used to fit the feature scores of the multiple dimensions into one final semantic correlation score.
The details are as follows:
The text correlation feature value between two short texts is the text-level correlation feature. Because the text-level correlation feature mainly measures the text similarity between short strings and uses only the text information of the short strings, it can be computed on the fly by efficient optimized algorithms.
For example:
(1) Calculating the text correlation feature value from the correlation based on edit distance
The edit distance (Edit Distance), also known as the Levenshtein distance, between two character strings is the minimum number of edit operations needed to change one into the other. The permitted edit operations are substituting one character for another, inserting a character, and deleting a character.
The edit distance EditDist(T1,T2) of two short texts T1 and T2 can be calculated by a dynamic programming algorithm with time complexity O(|T1|*|T2|).
The calculation formula of the correlation feature of two short texts based on edit distance is as follows:
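The formula itself is not preserved in this text. As a minimal sketch, the dynamic programming computation of EditDist(T1,T2) can be written as below. The normalization `1 - EditDist / max(|T1|, |T2|)` and the function names are assumptions made for illustration, not necessarily the formula used in the patent.

```python
def edit_distance(t1, t2):
    """Levenshtein distance via dynamic programming, O(|t1| * |t2|) time."""
    prev = list(range(len(t2) + 1))  # distances from the empty prefix of t1
    for i, c1 in enumerate(t1, 1):
        cur = [i]  # deleting the first i characters of t1
        for j, c2 in enumerate(t2, 1):
            cur.append(min(
                prev[j] + 1,               # delete c1
                cur[j - 1] + 1,            # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def edit_distance_feature(t1, t2):
    """Assumed normalization: 1.0 for identical strings, 0.0 for disjoint ones."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(t1, t2) / longest
```

Keeping only two rows of the table limits memory to O(|T2|) while leaving the time complexity unchanged.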
(2) Calculating the text correlation feature value from the correlation based on the longest common subsequence
A subsequence of a character string is a string obtained by deleting some of its characters without changing the order of the remaining characters; the remaining characters need not be contiguous.
The longest common subsequence of two character strings is the longest of all the subsequences they share. The longest common subsequence LCS(T1,T2) of two short texts T1 and T2 can be calculated by a dynamic programming algorithm with time complexity O(|T1|*|T2|).
The calculation formula of the correlation feature of two short texts based on the longest common subsequence is as follows:
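The formula itself is not preserved in this text. The sketch below computes the length of LCS(T1,T2) by dynamic programming. Dividing that length by the longer string's length is an assumed normalization chosen for illustration, and the function names are hypothetical.

```python
def lcs_length(t1, t2):
    """Length of the longest common subsequence, O(|t1| * |t2|) time."""
    prev = [0] * (len(t2) + 1)
    for c1 in t1:
        cur = [0]
        for j, c2 in enumerate(t2, 1):
            if c1 == c2:
                cur.append(prev[j - 1] + 1)           # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # skip a character on one side
        prev = cur
    return prev[-1]


def lcs_feature(t1, t2):
    """Assumed normalization: LCS length over the longer string's length."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return lcs_length(t1, t2) / longest
```

Unlike edit distance, this measure tolerates inserted characters between matching ones, which is one reason the two features complement each other.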
The semantic correlation feature value between two short texts is the semantic-level correlation feature, which mainly measures the similarity in concept and meaning between short strings.
The semantic correlation feature value between two short texts can be calculated in the following ways:
(1) Calculating the semantic correlation feature value from the correlation feature based on text classification
By way of example, embodiments of the present invention mainly use a feature-word-based method to classify short texts. The basic process is:
First, based on an initial, manually labeled first-level industry category feature word set (which contains a small number of manually labeled first-level industry category feature words), hundreds of millions of web pages are classified by full-text matching;
For web pages that have a category attribute, full-text word segmentation is performed, category feature words are extracted, and the weight contribution of each extracted category feature word to its category (i.e. a weight vector) is calculated; the category feature words extracted from the web pages are then merged into the first-level industry category feature word set;
Once feature words have been extracted from all web pages, a comprehensive first-level industry category feature word set has been obtained automatically, and the first-level industry category feature word dictionary is thereby constructed. The dictionary is described by the formula p(c|w), where c denotes a category and w denotes a word; that is, each word has a category distribution.
Given two short texts T1 and T2, for each short text the category distribution of each word can be obtained from p(c|w). The category distribution of each word of the short text is multiplied by the global IDF weight of that word and the results are summed, which yields the category distribution p(c|T) of the short text.
Using the cosine formula, the text classification similarity of two short texts T1 and T2 is:
cos(p(c|T1), p(c|T2)) = ∑c p(c|T1) p(c|T2) / ( sqrt(∑c p(c|T1)^2) * sqrt(∑c p(c|T2)^2) )
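The aggregation and cosine steps above can be sketched as follows. The dictionary layout (`word_dist` mapping each word to its p(c|w) distribution), the `idf` table, and the default IDF of 1.0 for unknown words are assumptions made for illustration.

```python
import math
from collections import defaultdict


def text_distribution(words, word_dist, idf):
    """Sum each word's distribution p(c|w), weighted by that word's IDF."""
    total = defaultdict(float)
    for word in words:
        weight = idf.get(word, 1.0)  # assumed default for words without an IDF
        for label, prob in word_dist.get(word, {}).items():
            total[label] += prob * weight
    return dict(total)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classification_feature(words1, words2, word_dist, idf):
    """Cosine similarity of the two texts' aggregated category distributions."""
    return cosine(text_distribution(words1, word_dist, idf),
                  text_distribution(words2, word_dist, idf))
```

The same two helpers apply unchanged to topic distributions p(z|w), since only the meaning of the labels differs.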
(2) Calculating the semantic correlation feature value from the topic correlation feature based on PLSA
The PLSA model is an unsupervised machine learning model used to identify latent topic (Topic) information in documents and to mine the latent semantic relationships of documents. The PLSA model assumes that when a user writes a document, the topic distribution of the document is chosen first, and suitable words are then chosen according to that topic distribution to form a complete document. In mathematical terms:
The probability of selecting a document is p(d); each document belongs to a topic with probability p(z|d); and, given a topic, each word is generated with probability p(w|z). The joint probability model formed by this process is:
p(d,w) = p(d) p(w|d)
p(w|d) = ∑z∈Z p(w|z) p(z|d);
The PLSA model parameters are trained with the EM algorithm to obtain p(z|d) and p(w|z). By Bayes' formula, p(z|w) = p(w|z) p(z) / p(w), which gives p(z|w).
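The Bayes inversion in the last step can be illustrated with a small sketch. It assumes that p(w|z) and the topic prior p(z) are already available from a trained PLSA model; the nested-dict layout and the function name are hypothetical.

```python
def topic_given_word(p_w_given_z, p_z):
    """Compute p(z|w) = p(w|z) p(z) / p(w), with p(w) = sum_z p(w|z) p(z)."""
    vocabulary = {w for dist in p_w_given_z.values() for w in dist}
    result = {}
    for word in vocabulary:
        # Joint probability p(w, z) for every topic z.
        joint = {z: p_w_given_z[z].get(word, 0.0) * p_z[z] for z in p_z}
        p_w = sum(joint.values())  # marginal p(w)
        if p_w > 0.0:
            result[word] = {z: value / p_w for z, value in joint.items()}
    return result
```

Each entry of the result is a distribution over topics for one word, which is the per-word input needed to build the topic distribution of a short text.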
Given two short texts T1 and T2, for each short text the topic distribution of each word can be obtained from p(z|w). The topic distribution of each word of the short text is multiplied by the global IDF weight of that word and the results are summed, which yields the topic distribution p(z|T) of the short text.
Using the cosine formula, the PLSA similarity of the two short texts is:
cos(p(z|T1), p(z|T2)) = ∑z p(z|T1) p(z|T2) / ( sqrt(∑z p(z|T1)^2) * sqrt(∑z p(z|T2)^2) )
(3) Calculating the semantic correlation feature value from the correlation feature based on statistical machine translation
The idea of translation probability between bilingual sentence pairs in statistical machine translation carries over naturally to modeling the correlation of short texts.
Given two short texts T1 and T2, the probability that T1 occurs given T2 is P(T1|T2), i.e. the likelihood.
Clearly, the more related T1 and T2 are, the greater the likelihood. Because texts vary enormously, modeling the likelihood directly is difficult, so it is rewritten with Bayes' formula as follows:
P(T1|T2) = P(T2|T1) P(T1) / P(T2)
Here P(T2|T1) is the translation model in machine translation and denotes the probability that T1 is translated as T2; P(T1) and P(T2) are the language models of T1 and T2 respectively, and characterize the probability that T1 and T2, respectively, are legitimate short texts.
Based on the BOW model assumption, then
Here P(t2j|t1i) is the translation probability from word t1i to word t2j, i.e. the word alignment dictionary. The translation probabilities between word pairs can be obtained by training on a parallel corpus with the EM algorithm.
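The bag-of-words formula itself is not preserved in this text. Purely as an illustration, the sketch below uses an IBM Model 1 style factorization, in which each word of T2 is generated by averaging its translation probabilities over the words of T1. This factorization, the smoothing floor, and all names are assumptions, not the patent's stated formula.

```python
import math


def bow_translation_log_prob(source_words, target_words, table, floor=1e-6):
    """Log of an assumed bag-of-words translation probability P(T2|T1).

    table maps (source_word, target_word) to a word translation probability.
    floor is a hypothetical smoothing value for unseen word pairs.
    """
    if not source_words or not target_words:
        return float("-inf")
    log_prob = 0.0
    for target in target_words:
        # Average over all source words, since word order is ignored.
        average = sum(table.get((source, target), 0.0)
                      for source in source_words) / len(source_words)
        log_prob += math.log(max(average, floor))
    return log_prob
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.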
In a specific application, both the translation model and the language model can be trained from large-scale web search logs and advertisers' purchase words, using the open-source machine translation software Moses.
The calculation formula of the correlation feature of two short texts T1 and T2 based on the machine translation model is designed as follows:
In statistical machine translation, this method maps translations between different languages well. Within a single language (for example, when both are Chinese short strings), however, experiments show that the coverage of the translation dictionary is limited, and raising the coverage requires a considerably larger amount of parallel corpus. Embodiments of the present invention borrow the idea of machine translation to construct a correlation feature between short texts.
(4) Calculating the semantic correlation feature value from the word-granularity correlation feature based on web search results
The core of the machine translation based correlation feature above is the word alignment dictionary. Inspired by this word-granularity mapping, embodiments of the present invention further propose a correlation feature based on per-word web search results to characterize the correlation between short texts.
Given a word, the N feature words with the largest TF-IDF values are extracted from its web search results (N is 64 in the real system), and the feature vector V(t)=(w1,w2...wn) formed by the TF-IDF values of these feature words serves as the representation of the word's meaning. The correlation of two words t1 and t2 based on per-word web search results is then defined by the following formula:
The calculation formula of the correlation feature of two short texts T1 and T2 based on per-word web search results is designed as follows:
A word-granularity feature only requires storing the TF-IDF feature vectors of common words, which greatly reduces disk space overhead; the huge number of long retrieval strings does not need to be stored. Each retrieval string can be expressed by the features of its finer-grained words, and the correlation between short texts can be measured with the formula above.
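Neither formula is preserved in this text, so the following is only a sketch under stated assumptions: word-level correlation is taken to be the cosine similarity of the two TF-IDF vectors, and text-level correlation is taken to be the symmetric average of each word's best match in the other text. Both choices and all names are hypothetical.

```python
import math


def top_features(tfidf, n=64):
    """Keep the n feature words with the largest TF-IDF values."""
    ranked = sorted(tfidf.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def word_similarity(v1, v2):
    """Assumed word-level measure: cosine similarity of sparse vectors."""
    dot = sum(value * v2.get(key, 0.0) for key, value in v1.items())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def text_similarity(words1, words2, vectors):
    """Assumed text-level measure: symmetric average of best word matches."""
    def directed(source, target):
        best = [max((word_similarity(vectors.get(s, {}), vectors.get(t, {}))
                     for t in target), default=0.0)
                for s in source]
        return sum(best) / len(best) if best else 0.0
    return 0.5 * (directed(words1, words2) + directed(words2, words1))
```

Only one vector per vocabulary word is stored, so storage grows with the vocabulary rather than with the number of distinct retrieval strings.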
With the above algorithms, multiple correlation feature values (text correlation and/or semantic correlation) can be calculated, and these correlation feature values can then be merged into one overall correlation feature value.
Specifically:
As described above, correlation feature values along multiple different dimensions can be calculated between short strings. The features selected include, but are not limited to, edit distance, longest common subsequence, classification, the PLSA topic model, and word-granularity correlation. A logistic regression model is finally used to fit all correlation feature values into one overall semantic correlation score.
A sample in the training corpus of the semantic correlation model is usually a pair of short texts together with a correlation score given by an editor, and the desired model output is a correlation score between 0 and 1. Logistic regression, however, is a classification model: it requires each training sample to be a feature vector with a class label, and the model output is also a class label.
Embodiments of the present invention include:
For each pair of short texts labeled by an editor, calculating the multiple correlation feature scores described above to form a feature vector;
Constructing M training samples from each feature vector; if the editor's score is S (S∈[0,1]), a portion of these samples determined by S is labeled with class 1 and the remaining samples are labeled 0;
Training a binary logistic regression model to obtain the weights w1,w2...wn of the correlation features and the bias b;
For two given short texts T1 and T2, first calculating their correlation feature scores R1,R2...Rn as described above, and then calculating the final correlation score with the Sigmoid function:
R(T1,T2) = 1 / (1 + exp(-(w1*R1 + w2*R2 + ... + wn*Rn + b)))
The input domain of the Sigmoid function is (-∞,+∞) and its output lies in (0,1), within the required range [0,1], so it is well suited to calculating correlation scores.
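The training and scoring steps above can be sketched in plain Python. The rule of labeling round(M*S) copies as 1, the default M of 10, and the batch gradient descent trainer are assumptions made for illustration; the source does not state the exact count or the optimizer.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def expand_scored_pair(features, score, m=10):
    """Turn one editor-scored pair into m binary-labelled samples.

    Assumption: round(m * score) copies are labelled 1, the rest 0.
    """
    positives = round(m * score)
    return [(features, 1)] * positives + [(features, 0)] * (m - positives)


def train(samples, epochs=2000, learning_rate=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in samples:
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def relevance(features, weights, bias):
    """Final score: sigmoid of the weighted feature sum plus the bias."""
    return sigmoid(sum(w * r for w, r in zip(weights, features)) + bias)
```

Replicating each scored pair lets a binary classifier approximate a graded score, because the predicted probability of class 1 then tends toward the fraction of positive copies.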
Embodiments of the present invention can be applied in many fields. For example, they can be applied in a real retrieval system for search advertising: the logistic regression model performs a preliminary selection of purchase words, a threshold is set on the correlation score between short strings for filtering, and the purchase words most semantically related to the query string are retained as candidates.
In conclusion being faced on short text in traditional calculation method based on word vector space model in document
The sparse problem of feature.Simultaneously as the word segmentation result of short text depends on language model, different word segmentations are not ensured that
Unanimously, the sparse of vector can also be aggravated to a certain extent.
To address this problem, embodiments of the present invention propose string-level text correlation measures, such as edit distance and longest common subsequence, as basic features. These express the textual similarity between short strings from multiple dimensions and handle better the many cases where short texts are non-standard or their word segmentation is inaccurate or inconsistent.
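As an illustration of these two string-level features, the sketch below computes edit distance and longest common subsequence length with standard dynamic programming. Normalizing both by the length of the longer string is an assumption made here to obtain scores in [0, 1]; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """1 minus the edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_similarity(a, b):
    """LCS length divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest
```

Both measures operate on characters, so they need no word segmentation, which is the property the paragraph above relies on.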
Furthermore, traditional correlation calculation methods based on literal similarity mainly use the traditional bag-of-words (BOW) model, which generally rests on the assumption that features are independent and measures the correlation of short texts by how well their feature vectors match. In practice, however, features are often interrelated. In particular, when a word has several meanings or a meaning is expressed by several words, semantic drift occurs and the correlation calculation becomes inaccurate.
To address this problem, embodiments of the present invention propose correlation features based on text classification and probabilistic latent semantic analysis. These fully mine the implicit relations between a short text and the words that make it up, so that the category relation and the topic relation between two short texts can be calculated, supplementing the text correlation features.
Furthermore, traditional calculation methods based on the web search results of short texts use external resources to form a literal expansion of the short string. In terms of effectiveness, the expansion result depends heavily on the relevance quality of the selected search engine or similar product. In terms of performance, the volume of search results relied upon is huge, and the corresponding results must be stored for every short string, which places high demands on download and calculation speed. Two short texts that are synonymous but differ slightly in wording, or even only in word order, may have very different search results that must be stored separately. In addition, index results are updated regularly and the stored expansion results must change with them; how to keep the expansion quality from declining and how to balance data updates against update overhead are unavoidable problems.
Embodiments of the present invention propose correlation features based on word-level web search results. The number of dictionary resources relied upon is controllable, and both single-machine storage space and calculation speed are greatly improved, which makes lightweight online calculation of semantic correlation between short strings possible.
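A minimal sketch of the word-level representation follows, based on the description in claim 1: the feature words with the largest TF-IDF values are extracted from a word's web search results, and their TF-IDF values form the vector that characterizes the word's meaning. The function names, the pooled token list, the precomputed `idf` table, and the use of cosine similarity to compare two such vectors are all assumptions for illustration.

```python
import math
from collections import Counter


def top_tfidf_vector(result_tokens, idf, k=3):
    """Build a sparse vector for one word from the tokens of its search results.

    result_tokens: tokens pooled from the word's search results (assumed input).
    idf: token -> inverse document frequency (assumed precomputed).
    Returns the k tokens with the largest TF-IDF values and those values.
    """
    tf = Counter(result_tokens)
    total = sum(tf.values())
    scores = {t: (c / total) * idf.get(t, 0.0) for t, c in tf.items()}
    # Sort by descending score; break ties alphabetically for determinism.
    top = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:k]
    return dict(top)


def sparse_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)
```

Because only k values are kept per word, the stored dictionary grows with the vocabulary rather than with the number of distinct short strings, which matches the storage argument made above.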
Based on the above detailed analysis, embodiments of the present invention also propose a text correlation calculation apparatus.
Fig. 2 is a structural diagram of the text correlation calculation apparatus according to an embodiment of the present invention.
As shown in Fig. 2, the apparatus includes a character string receiving unit 201, a correlation feature value computing unit 202, and a correlation feature value fitting unit 203, wherein:
The character string receiving unit 201 is configured to receive a first character string and a second character string;
The correlation feature value computing unit 202 is configured to calculate the text correlation feature value of the first character string and the second character string, and the semantic correlation feature value of the first character string and the second character string;
The correlation feature value fitting unit 203 is configured to fit, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into the correlation feature value of the first character string and the second character string.
In one embodiment:
The correlation feature value computing unit 202 is configured to calculate an edit-distance-based correlation feature value of the first character string and the second character string, and/or a correlation feature value of the first character string and the second character string based on the longest common subsequence.
In one embodiment:
The correlation feature value computing unit 202 is configured to construct a level-one industry category feature word dictionary; for the first character string, obtain the category distribution of each word from the level-one industry category feature word dictionary, multiply the category distribution of each word by the global inverse document frequency weight of that word, and sum the results to obtain the category distribution of the first character string; for the second character string, obtain the category distribution of each word from the level-one industry category feature word dictionary, multiply the category distribution of each word by the global inverse document frequency weight of that word, and sum the results to obtain the category distribution of the second character string; and calculate the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.
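The category-distribution feature just described can be sketched as follows. The dictionary layout (`word_dist` mapping a word to a list of per-category weights), the `idf` table, and the choice to skip words missing from the dictionary are assumptions made for this example.

```python
import math


def string_distribution(words, word_dist, idf, n_categories):
    """Sum each word's category distribution, weighted by its global IDF.

    words: segmented words of one character string.
    word_dist: word -> list of n_categories weights (assumed dictionary layout).
    idf: word -> global inverse document frequency weight.
    """
    total = [0.0] * n_categories
    for word in words:
        dist = word_dist.get(word)
        if dist is None:
            continue  # words outside the dictionary contribute nothing
        weight = idf.get(word, 0.0)
        for i, p in enumerate(dist):
            total[i] += weight * p
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)
```

The same two functions would apply unchanged if the per-word vectors were topic distributions rather than category distributions.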
In one embodiment:
The correlation feature value computing unit 202 is configured to classify each web page by full-text matching against a manually labeled level-one industry category feature word set; perform full-text word segmentation on the web pages that have been assigned a category, extract category feature words, and merge the extracted category feature words into the level-one industry category feature word set to construct the level-one industry category feature word dictionary.
In one embodiment:
The correlation feature value computing unit 202 is configured to, for the first character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the first character string by the global inverse document frequency weight of that word, and sum the results to obtain the topic distribution of the first character string; for the second character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the second character string by the global inverse document frequency weight of that word, and sum the results to obtain the topic distribution of the second character string; and calculate the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.
In one embodiment:
The correlation feature value computing unit 202 is configured to calculate a correlation feature value of the first character string and the second character string based on statistical machine translation, and/or a word-granularity semantic correlation feature value of the first character string and the second character string based on web search results.
In one embodiment:
The correlation feature value fitting unit 203 is configured to construct a feature vector from the calculated text correlation feature value and semantic correlation feature value of the first character string and the second character string; construct training samples from the feature vector and train a binary logistic regression model on the training samples to obtain the weight of the text correlation feature value, the weight of the semantic correlation feature value, and the bias; and calculate the correlation feature value from the text correlation feature value and its weight, the semantic correlation feature value and its weight, and the bias.
In one embodiment:
The correlation feature value computing unit 202 is configured to perform at least one of the following:
Calculating an edit-distance-based correlation feature value of the first character string and the second character string;
Calculating a correlation feature value of the first character string and the second character string based on the longest common subsequence;
Calculating a correlation feature value of the first character string and the second character string based on text classification;
Calculating a topic correlation feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA);
Calculating a correlation feature value of the first character string and the second character string based on statistical machine translation;
Calculating a word-granularity correlation feature value of the first character string and the second character string based on web search results.
In practice, the text correlation calculation method proposed by embodiments of the present invention can be implemented in many forms. For example, following an application programming interface of a given specification, the method can be written as a plug-in installed on a server, or packaged as an application for users to download and use. When written as a plug-in, it can be implemented in plug-in formats such as ocx, dll, and cab. The method can also be implemented through specific technologies such as a Flash plug-in, RealPlayer plug-in, MMS plug-in, MIDI staff plug-in, or ActiveX plug-in.
The text correlation calculation method proposed by embodiments of the present invention can be stored on various storage media in the form of instructions or instruction sets. These storage media include, but are not limited to: floppy disk, CD, DVD, hard disk, flash memory, USB flash drive, CF card, SD card, MMC card, SM card, Memory Stick, xD card, etc.
In addition, the text correlation calculation method proposed by embodiments of the present invention can be applied to storage media based on NAND flash, such as USB flash drive, CF card, SD card, SDHC card, MMC card, SM card, Memory Stick, xD card, etc.
In conclusion in embodiments of the present invention, in embodiments of the present invention, receiving the first character string and the second word
Symbol string;Calculate the text relevant characteristic value and the first character string and the second character string of the first character string and the second character string
Semantic dependency characteristic value;The text relevant characteristic value and semantic dependency characteristic value are fitted by logic-based regression model
At the correlative character value of the first character string and the second character string.It can be seen that embodiment of the present invention is avoided based on document
The calculation method of middle word vector space model, therefore the sparse problem of feature is avoided, to improve the standard of correlation prediction
True rate, and saved memory space and reduced costs.
Moreover, embodiments of the present invention propose string-level text correlation measures, such as edit distance and longest common subsequence, as basic features. These express the textual similarity between short strings from multiple dimensions and handle better the many cases where short texts are non-standard or their word segmentation is inaccurate or inconsistent.
In addition, embodiments of the present invention propose correlation features based on text classification and probabilistic latent semantic analysis, which fully mine the implicit relations between a short text and the words that make it up, so that the category relation and the topic relation between two short texts can be calculated, supplementing the text correlation features.
In addition, embodiments of the present invention propose correlation features based on word-level web search results. The number of dictionary resources relied upon is controllable, and both single-machine storage space and calculation speed are greatly improved, which makes lightweight online calculation of semantic correlation between short strings possible.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (14)
1. A text correlation calculation method, characterized in that the method comprises:
Receiving a first character string and a second character string;
Calculating a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string;
Fitting, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string;
Wherein calculating the semantic correlation feature value of the first character string and the second character string comprises:
Calculating a word-granularity correlation feature value of the first character string and the second character string based on web search results;
Wherein calculating the word-granularity correlation feature value of the first character string and the second character string based on web search results comprises:
Extracting, from the web search results of a word, a plurality of feature words with the largest TF-IDF values, the feature vector formed by the TF-IDF values of these feature words serving as a representation of the meaning of the word.
2. The text correlation calculation method according to claim 1, characterized in that calculating the text correlation feature value of the first character string and the second character string comprises:
Calculating an edit-distance-based correlation feature value of the first character string and the second character string, and/or calculating a correlation feature value of the first character string and the second character string based on the longest common subsequence.
3. The text correlation calculation method according to claim 1, characterized in that calculating the semantic correlation feature value of the first character string and the second character string comprises:
Constructing an industry category feature word dictionary;
For the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global inverse document frequency weight of that word, and summing the results to obtain the category distribution of the first character string;
For the second character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global inverse document frequency weight of that word, and summing the results to obtain the category distribution of the second character string;
Calculating the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.
4. The text correlation calculation method according to claim 3, characterized in that
Constructing the industry category feature word dictionary comprises:
Classifying each web page by full-text matching against a manually labeled industry category feature word set;
Performing full-text word segmentation on the web pages that have been assigned a category, extracting category feature words, and merging the extracted category feature words into the industry category feature word set to construct the industry category feature word dictionary.
5. The text correlation calculation method according to claim 1, characterized in that
Calculating the semantic correlation feature value of the first character string and the second character string comprises:
For the first character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the first character string by the global inverse document frequency weight of that word, and summing the results to obtain the topic distribution of the first character string; for the second character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the second character string by the global inverse document frequency weight of that word, and summing the results to obtain the topic distribution of the second character string;
Calculating the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.
6. The text correlation calculation method according to any one of claims 1-5, characterized in that fitting, based on the logistic regression model, the text correlation feature value and the semantic correlation feature value into the correlation feature value comprises:
Constructing a feature vector from the calculated text correlation feature value and semantic correlation feature value of the first character string and the second character string;
Constructing training samples from the feature vector, and training a binary logistic regression model on the training samples to obtain the weight of the text correlation feature value, the weight of the semantic correlation feature value, and the bias;
Calculating the correlation feature value from the text correlation feature value and its weight, the semantic correlation feature value and its weight, and the bias.
7. The text correlation calculation method according to any one of claims 1-5, characterized in that
Calculating the semantic correlation feature value of the first character string and the second character string comprises: calculating a correlation feature value of the first character string and the second character string based on text classification.
8. A text correlation calculation apparatus, characterized in that the apparatus comprises a character string receiving unit, a correlation feature value computing unit, and a correlation feature value fitting unit, wherein:
The character string receiving unit is configured to receive a first character string and a second character string;
The correlation feature value computing unit is configured to calculate a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string;
The correlation feature value fitting unit is configured to fit, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string;
Wherein calculating the semantic correlation feature value of the first character string and the second character string comprises:
Calculating a word-granularity correlation feature value of the first character string and the second character string based on web search results;
Wherein calculating the word-granularity correlation feature value of the first character string and the second character string based on web search results comprises:
Extracting, from the web search results of a word, a plurality of feature words with the largest TF-IDF values, the feature vector formed by the TF-IDF values of these feature words serving as a representation of the meaning of the word.
9. The text correlation calculation apparatus according to claim 8, characterized in that
The correlation feature value computing unit is configured to calculate an edit-distance-based correlation feature value of the first character string and the second character string, and/or a correlation feature value of the first character string and the second character string based on the longest common subsequence.
10. The text correlation calculation apparatus according to claim 8, characterized in that
The correlation feature value computing unit is configured to construct an industry category feature word dictionary; for the first character string, obtain the category distribution of each word from the industry category feature word dictionary, multiply the category distribution of each word by the global inverse document frequency weight of that word, and sum the results to obtain the category distribution of the first character string; for the second character string, obtain the category distribution of each word from the industry category feature word dictionary, multiply the category distribution of each word by the global inverse document frequency weight of that word, and sum the results to obtain the category distribution of the second character string; and calculate the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.
11. The text correlation calculation apparatus according to claim 10, characterized in that
The correlation feature value computing unit is configured to classify each web page by full-text matching against a manually labeled industry category feature word set; perform full-text word segmentation on the web pages that have been assigned a category, extract category feature words, and merge the extracted category feature words into the industry category feature word set to construct the industry category feature word dictionary.
12. The text correlation calculation apparatus according to claim 8, characterized in that
The correlation feature value computing unit is configured to, for the first character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the first character string by the global inverse document frequency weight of that word, and sum the results to obtain the topic distribution of the first character string; for the second character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the second character string by the global inverse document frequency weight of that word, and sum the results to obtain the topic distribution of the second character string; and calculate the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.
13. The text correlation calculation apparatus according to any one of claims 8-12, characterized in that
The correlation feature value fitting unit is configured to construct a feature vector from the calculated text correlation feature value and semantic correlation feature value of the first character string and the second character string; construct training samples from the feature vector and train a binary logistic regression model on the training samples to obtain the weight of the text correlation feature value, the weight of the semantic correlation feature value, and the bias; and calculate the correlation feature value from the text correlation feature value and its weight, the semantic correlation feature value and its weight, and the bias.
14. The text correlation calculation apparatus according to any one of claims 8-12, characterized in that
The correlation feature value computing unit is configured to calculate a correlation feature value of the first character string and the second character string based on text classification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310388496.XA CN104424279B (en) | 2013-08-30 | 2013-08-30 | A kind of correlation calculations method and apparatus of text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310388496.XA CN104424279B (en) | 2013-08-30 | 2013-08-30 | A kind of correlation calculations method and apparatus of text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104424279A (en) | 2015-03-18 |
CN104424279B (en) | 2018-11-20 |
Family
ID=52973259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310388496.XA Active CN104424279B (en) | 2013-08-30 | 2013-08-30 | A kind of correlation calculations method and apparatus of text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104424279B (en) |
Families Citing this family (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106445963B (en) * | 2015-08-10 | 2021-11-23 | 北京奇虎科技有限公司 | Advertisement index keyword automatic generation method and device of APP platform |
US9928236B2 (en) * | 2015-09-18 | 2018-03-27 | Mcafee, Llc | Systems and methods for multi-path language translation |
CN106776493B (en) * | 2015-11-19 | 2020-03-03 | 腾讯科技(深圳)有限公司 | Information filtering method and information filtering device |
US10217025B2 (en) | 2015-12-22 | 2019-02-26 | Beijing Qihoo Technology Company Limited | Method and apparatus for determining relevance between news and for calculating relevance among multiple pieces of news |
CN105528335B (en) * | 2015-12-22 | 2018-10-09 | 北京奇虎科技有限公司 | The method and apparatus for determining correlation between news |
CN105630766B (en) * | 2015-12-22 | 2018-11-06 | 北京奇虎科技有限公司 | Correlation calculations method and apparatus between more news |
CN105630767B (en) * | 2015-12-22 | 2018-06-15 | 北京奇虎科技有限公司 | The comparative approach and device of a kind of text similarity |
CN105528336B (en) * | 2015-12-23 | 2018-09-21 | 北京奇虎科技有限公司 | The method and apparatus that more mark posts determine article correlation |
CN105654346A (en) * | 2015-12-30 | 2016-06-08 | 芜湖乐锐思信息咨询有限公司 | Analysis system based on product refinement operation |
CN105550905A (en) * | 2015-12-30 | 2016-05-04 | 芜湖乐锐思信息咨询有限公司 | Product selling analysis system based on network |
CN105678571A (en) * | 2015-12-30 | 2016-06-15 | 芜湖乐锐思信息咨询有限公司 | Networked product planning analysis system based on Internet |
CN105550904A (en) * | 2015-12-30 | 2016-05-04 | 芜湖乐锐思信息咨询有限公司 | Product layout analysis system based on network operation |
CN105427138A (en) * | 2015-12-30 | 2016-03-23 | 芜湖乐锐思信息咨询有限公司 | Neural network model-based product market share analysis method and system |
CN106951422B (en) * | 2016-01-07 | 2021-05-28 | 腾讯科技(深圳)有限公司 | Webpage training method and device, and search intention identification method and device |
CN105930468B (en) * | 2016-04-22 | 2019-05-17 | 江苏金鸽网络科技有限公司 | A kind of rule-based information correlativity determination method |
CN106095845B (en) * | 2016-06-02 | 2021-04-06 | 腾讯科技(深圳)有限公司 | Text classification method and device |
CN107590146A (en) * | 2016-07-06 | 2018-01-16 | 北京搜狗科技发展有限公司 | A kind of prescription matching process and device, a kind of device for prescription matching |
CN106339371B (en) * | 2016-08-30 | 2019-04-30 | 齐鲁工业大学 | A method and device for English-Chinese word meaning mapping based on word vector |
CN106484678A (en) * | 2016-10-13 | 2017-03-08 | 北京智能管家科技有限公司 | A kind of short text similarity calculating method and device |
CN106657016A (en) * | 2016-11-10 | 2017-05-10 | 北京奇艺世纪科技有限公司 | Illegal user name recognition method and system |
CN108205757B (en) * | 2016-12-19 | 2022-05-27 | 创新先进技术有限公司 | Method and device for verifying legality of electronic payment service |
CN108241867B (en) * | 2016-12-26 | 2022-10-25 | 阿里巴巴集团控股有限公司 | Classification method and device |
CN108268465A (en) * | 2016-12-30 | 2018-07-10 | 广东精点数据科技股份有限公司 | A kind of text search technology towards mixed data model |
CN108388480B (en) * | 2017-02-03 | 2021-06-11 | 百度在线网络技术(北京)有限公司 | Short string correlation verification method and device |
CN107066443A (en) * | 2017-03-27 | 2017-08-18 | 成都优译信息技术股份有限公司 | Multilingual sentence similarity acquisition methods and system are applied to based on linear regression |
CN107301248B (en) * | 2017-07-19 | 2020-07-21 | 百度在线网络技术(北京)有限公司 | Word vector construction method and device of text, computer equipment and storage medium |
CN109325509B (en) * | 2017-07-31 | 2023-01-17 | 北京国双科技有限公司 | Similarity determination method and device |
CN107844559A (en) * | 2017-10-31 | 2018-03-27 | 国信优易数据有限公司 | A kind of file classifying method, device and electronic equipment |
CN110019801B (en) * | 2017-12-01 | 2021-03-23 | 北京搜狗科技发展有限公司 | Text relevance determining method and device |
CN108182182B (en) * | 2017-12-27 | 2021-09-10 | 传神语联网网络科技股份有限公司 | Method and device for matching documents in translation database and computer readable storage medium |
CN108536800B (en) * | 2018-04-03 | 2022-04-19 | 有米科技股份有限公司 | Text classification method, system, computer device and storage medium |
CN110738220B (en) * | 2018-07-02 | 2022-09-30 | 百度在线网络技术(北京)有限公司 | Method and device for analyzing emotion polarity of sentence and storage medium |
CN110895656B (en) * | 2018-09-13 | 2023-12-29 | 北京橙果转话科技有限公司 | Text similarity calculation method and device, electronic equipment and storage medium |
CN110929498B (en) * | 2018-09-20 | 2023-05-09 | 中国移动通信有限公司研究院 | Calculation method and device for short text similarity, and readable storage medium |
CN109522551B (en) * | 2018-11-09 | 2024-02-20 | 天津新开心生活科技有限公司 | Entity linking method and device, storage medium and electronic equipment |
CN109271641B (en) * | 2018-11-20 | 2023-09-08 | 广西三方大供应链技术服务有限公司 | Text similarity calculation method and device and electronic equipment |
CN111460110B (en) * | 2019-01-22 | 2023-04-25 | 阿里巴巴集团控股有限公司 | Abnormal text detection method, abnormal text sequence detection method and device |
CN109947919B (en) * | 2019-03-12 | 2020-05-15 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating text matching model |
CN111191087B (en) * | 2019-12-31 | 2023-11-07 | 歌尔股份有限公司 | Character matching method, terminal device and computer readable storage medium |
CN111382255B (en) * | 2020-03-17 | 2023-08-01 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for question-answering processing |
CN111522918A (en) * | 2020-04-24 | 2020-08-11 | 天津易维数科信息科技有限公司 | Data aggregation method and device, electronic equipment and computer readable storage medium |
CN112749252B (en) * | 2020-07-14 | 2023-11-03 | 腾讯科技(深圳)有限公司 | Text matching method and related device based on artificial intelligence |
CN112185573B (en) * | 2020-09-25 | 2023-11-03 | 志诺维思(北京)基因科技有限公司 | Similar character string determining method and device based on LCS and TF-IDF |
CN113392212A (en) * | 2021-01-14 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Service knowledge graph construction method and device, electronic equipment and storage medium |
CN113239666B (en) * | 2021-05-13 | 2023-09-29 | 深圳市智灵时代科技有限公司 | Text similarity calculation method and system |
CN113254596B (en) * | 2021-06-22 | 2021-10-08 | 湖南大学 | User quality inspection requirement classification method and system based on rule matching and deep learning |
CN114139559B (en) * | 2021-12-01 | 2025-01-28 | 中科合肥技术创新工程院 | A cross-domain method for measuring the comparability of bilingual texts |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101777042A (en) * | 2010-01-21 | 2010-07-14 | 西南科技大学 | Neural network and tag library-based statement similarity algorithm |
CN102184169A (en) * | 2011-04-20 | 2011-09-14 | 北京百度网讯科技有限公司 | Method, device and equipment used for determining similarity information among character string information |
CN102332012A (en) * | 2011-09-13 | 2012-01-25 | 南方报业传媒集团 | Chinese text sorting method based on correlation study between sorts |
CN102622338A (en) * | 2012-02-24 | 2012-08-01 | 北京工业大学 | Computer-assisted computing method of semantic distance between short texts |
CN103049569A (en) * | 2012-12-31 | 2013-04-17 | 武汉传神信息技术有限公司 | Text similarity matching method on basis of vector space model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8812493B2 (en) * | 2008-04-11 | 2014-08-19 | Microsoft Corporation | Search results ranking using editing distance and document information |
2013-08-30: CN application CN201310388496.XA, patent CN104424279B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN104424279A (en) | 2015-03-18 |
Similar Documents
Publication | Title |
---|---|
CN104424279B (en) | A kind of correlation calculations method and apparatus of text |
Khan et al. | A novel natural language processing (NLP)–based machine translation model for English to Pakistan sign language translation |
CN106649818B (en) | Application search intent identification method, device, application search method and server |
US20200073882A1 (en) | Artificial intelligence based corpus enrichment for knowledge population and query response |
Schubotz et al. | Semantification of identifiers in mathematics for better math information retrieval |
CN104794169B (en) | A kind of subject terminology extraction method and system based on sequence labelling model |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine |
CN106997382A (en) | Innovation intention label automatic marking method and system based on big data |
CN110378409A (en) | It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method |
CN104573028A (en) | Intelligent question-answer implementing method and system |
CN108319734A (en) | A kind of product feature structure tree method for auto constructing based on linear combiner |
Chen et al. | Doctag2vec: An embedding based multi-label learning approach for document tagging |
US20230111911A1 (en) | Generation and use of content briefs for network content authoring |
Hu et al. | Self-supervised synonym extraction from the web. |
CN110569355A (en) | A combined method and system for opinion target extraction and target sentiment classification based on lexical chunks |
CN104317837B (en) | A kind of cross-module state search method based on topic model |
Lin et al. | NL2API: A framework for bootstrapping service recommendation using natural language queries |
Laddha et al. | Extracting aspect specific opinion expressions |
Zhang et al. | PKU paraphrase bank: A sentence-level paraphrase corpus for Chinese |
US20190095525A1 (en) | Extraction of expression for natural language processing |
Hourrane et al. | Using deep learning word embeddings for citations similarity in academic papers |
CN103646017A (en) | Acronym generating system for naming and working method thereof |
CN111382333B (en) | Case element extraction method in news text sentences based on case correlation joint learning and graph convolution |
Rubtsova et al. | Aspect extraction from reviews using conditional random fields |
CN118395987A (en) | BERT-based landslide hazard assessment named entity identification method of multi-neural network |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
GR01 | Patent grant |