CN104424279B

CN104424279B - Method and apparatus for calculating text correlation

Info

Publication number: CN104424279B
Application number: CN201310388496.XA
Authority: CN
Inventors: 赫南; 张文斌; 姚伶伶; 王莉峰; 何琪; 张博
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2013-08-30
Filing date: 2013-08-30
Publication date: 2018-11-20
Anticipated expiration: 2033-08-30
Also published as: CN104424279A

Abstract

Embodiments of the present invention provide a method and an apparatus for calculating text correlation. The method includes: receiving a first character string and a second character string; calculating a text correlation feature value and a semantic correlation feature value of the first character string and the second character string; and fitting, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string. Embodiments of the present invention improve the accuracy of correlation prediction, save storage space, and reduce cost.

Description

Method and apparatus for calculating text correlation

Technical field

Embodiments of the present invention relate to the technical field of Internet applications, and more particularly to a method and an apparatus for calculating text correlation.

Background technique

With the rapid development of computer and network technology, the Internet plays an increasingly important role in daily life, study, and work. New applications on the Internet emerge one after another.

Search advertising is a very important business in the Internet advertising ecosystem. It depends on search engines and is, in essence, sold by keyword matching. In the promotion database, in addition to the title and description displayed with an advertisement, an advertiser adds keywords that have a certain correlation with the advertisement (bid terms), and specifies the match type, the bid, and the targeting so as to match the target traffic (users whose search intent fits). In the classical matching process, bid terms form a direct index to advertisements. When a user's query "matches" an advertiser's bid term and the correlation reaches a certain level, the preliminary selection condition for triggering the advertisement is considered met (ignoring other targeting and filtering stages for now), and the corresponding advertisement (title, description) can be pulled for further selection, such as click-through rate estimation, advertisement ranking, and display strategy selection.

In the retrieval stage, the advertising system uses the user's query string and applies a variety of online and offline strategies to match bid terms. The bid terms found here are all short texts that the advertiser specified when filling in the material and that relate to the advertisement title and description. Measuring the correlation between a query and a candidate bid term (bidterm) in the online system is, in essence, measuring the correlation between short texts.

Traditionally, there are many methods based on literal string matching, and offline and online evaluation methods differ; all of them have certain limitations. Sahami et al. of Google proposed using the web search results of a short text as a semantic expansion and calculating the semantic correlation between short texts on that basis, which works better than purely word-based methods. Metzler of the University of Massachusetts, Dumais of Microsoft, et al. also tried several short-text representations for calculating semantic correlation.

However, traditional calculation methods based on the word vector space model of documents face the feature sparsity problem on short texts. At the same time, because the word segmentation result of a short text depends on the language model, different segmentations are not guaranteed to be consistent, which aggravates vector sparsity to some extent. Such methods therefore suffer from low accuracy of correlation prediction.

Moreover, these methods need a large amount of storage space to store term vectors, which wastes storage space and raises cost.

Summary of the invention

Embodiments of the present invention provide a method for calculating text correlation, so as to improve the accuracy of correlation prediction.

Embodiments of the present invention provide an apparatus for calculating text correlation, so as to improve the accuracy of correlation prediction.

The technical solutions of the embodiments of the present invention are as follows:

A method for calculating text correlation, the method including:

receiving a first character string and a second character string;

calculating a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string; and

fitting, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string.

An apparatus for calculating text correlation, the apparatus including a character string receiving unit, a correlation feature value calculating unit, and a correlation feature value fitting unit, wherein:

the character string receiving unit is configured to receive a first character string and a second character string;

the correlation feature value calculating unit is configured to calculate a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string; and

the correlation feature value fitting unit is configured to fit, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string.

As can be seen from the above technical solutions, in embodiments of the present invention, a first character string and a second character string are received; a text correlation feature value and a semantic correlation feature value of the two character strings are calculated; and the two feature values are fitted, based on a logistic regression model, into a correlation feature value of the two character strings. Embodiments of the present invention thus avoid calculation methods based on the word vector space model of documents and therefore avoid the feature sparsity problem, which improves the accuracy of correlation prediction, saves storage space, and reduces cost.

Moreover, embodiments of the present invention use string-level text correlations, such as edit distance and longest common subsequence, as basic features. These features express the text similarity between short strings from multiple dimensions and handle well the many cases in which short texts are non-standard or word segmentation is inaccurate or inconsistent.

In addition, embodiments of the present invention propose correlation features based on text classification and probabilistic latent semantic analysis. These features fully mine the implicit relationship between a short text and the words that make it up, so as to calculate the category relationship and the topic relationship between two short texts, supplementing the text correlation features.

In addition, embodiments of the present invention propose a correlation feature based on the web search results of words. The amount of dictionary resources it depends on is controllable, and single-machine storage space and calculation speed are both significantly improved, which makes lightweight online calculation of semantic correlation between short strings possible.

Brief description of the drawings

Fig. 1 is a flowchart of a text correlation calculation method according to an embodiment of the present invention;

Fig. 2 is a structural diagram of a text correlation calculation apparatus according to an embodiment of the present invention.

Detailed description of embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

Many applications involve calculating the correlation of two short texts. The correlation of two short texts refers to the degree to which the two are semantically associated; they need not be literally similar. Correlation is a broader concept than similarity and is important in many products and systems. A short text is a character string of short length, for example, no more than 38 Chinese characters in some network applications.

A bid term (bidterm) is a keyword that an advertiser submits for bidding in a bidding advertisement system; a query is the search keyword that a user submits to a search engine. Queries and bid terms are typically short character strings, so both can be referred to as short texts.

Fig. 1 is a flowchart of a text correlation calculation method according to an embodiment of the present invention.

As shown in Fig. 1, the method includes:

Step 101: Receive a first character string and a second character string.

Here, the first character string and the second character string are preferably both short texts. For example, they can be a query and a bid term, respectively.

Step 102: Calculate a text correlation feature value of the first character string and the second character string, and a semantic correlation feature value of the first character string and the second character string.

Text-level correlation features mainly measure the text similarity between short strings. They use only the text information of the short strings and can be computed on the fly by efficient, optimized algorithms.

For example, a correlation feature value of the first character string and the second character string based on edit distance can be calculated, and/or a correlation feature value of the first character string and the second character string based on longest common subsequence can be calculated.

Semantic-level correlation features mainly measure the similarity in concept and meaning between short strings.

In one embodiment, calculating the semantic correlation feature value of the first character string and the second character string includes:

constructing an industry category feature word dictionary (for example, a first-level industry category feature word dictionary);

for the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global inverse document frequency (IDF) weight of that word, and summing the results to obtain the category distribution of the first character string; for the second character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the global IDF weight of that word, and summing the results to obtain the category distribution of the second character string;

calculating the cosine similarity between the category distributions of the first character string and the second character string to obtain the semantic correlation feature value of the first character string and the second character string.

Preferably, constructing the industry category feature word dictionary includes:

classifying each web page by full-text matching against a manually annotated set of industry category feature words;

for each web page that has been assigned a category, segmenting the full text into words, extracting category feature words, and merging the extracted category feature words into the industry category feature word set, so as to construct the industry category feature word dictionary.

For the first character string, the topic distribution of each word is obtained, the topic distribution of every word in the first character string is multiplied by the global inverse document frequency (IDF) weight of that word, and the results are summed to obtain the topic distribution of the first character string; for the second character string, the topic distribution of each word is obtained, the topic distribution of every word in the second character string is multiplied by the global IDF weight of that word, and the results are summed to obtain the topic distribution of the second character string;

the cosine similarity between the topic distributions of the first character string and the second character string is calculated to obtain the semantic correlation feature value of the first character string and the second character string.

In one embodiment, calculating the semantic correlation feature value of the first character string and the second character string includes: calculating a correlation feature value of the first character string and the second character string based on statistical machine translation.

In one embodiment, calculating the semantic correlation feature value of the first character string and the second character string includes: calculating a word-granularity semantic correlation feature value of the first character string and the second character string based on web search results.

In fact, several calculation methods can be used at the same time to calculate text correlation feature values of the first character string and the second character string. For example, a correlation feature value based on edit distance and a correlation feature value based on longest common subsequence can both be calculated, and both then serve as calculated text correlation feature values in the fitting calculation of step 103.

Similarly, several calculation methods can be used at the same time to calculate semantic correlation feature values of the first character string and the second character string.

For example, calculating the semantic correlation feature value of the first character string and the second character string includes at least one of the following:

calculating a correlation feature value of the first character string and the second character string based on edit distance; calculating a correlation feature value of the first character string and the second character string based on longest common subsequence; calculating a correlation feature value of the first character string and the second character string based on text classification; calculating a topic correlation feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA); calculating a correlation feature value of the first character string and the second character string based on statistical machine translation; calculating a word-granularity correlation feature value of the first character string and the second character string based on web search results.

All calculated semantic correlation feature values then take part in the fitting calculation of step 103.

Step 103: Fit, based on a logistic regression model, the text correlation feature value and the semantic correlation feature value into a correlation feature value of the first character string and the second character string.

Here, a feature vector is constructed from the calculated text correlation feature value and semantic correlation feature value of the first character string and the second character string;

training examples are constructed from the feature vector, and a binary logistic regression model is trained on them to obtain the weight of the text correlation feature value, the weight of the semantic correlation feature value, and a bias;

the correlation feature value is calculated from the text correlation feature value and its weight, the semantic correlation feature value and its weight, and the bias.

The text correlation calculation method of the embodiments of the present invention is described in more detail below.

The problem solved by the invention is formally defined as follows:

Given two short texts T₁ and T₂, calculate the semantic correlation R(T₁, T₂) that reflects their degree of semantic association, where R(T₁, T₂) ∈ [0, 1].

For a short text T, its string length is denoted |T| and its word segmentation result is denoted T = t₁t₂...tₙ; the word segmentation results of T₁ and T₂ are then T₁ = t₁₁t₁₂...t₁ₙ and T₂ = t₂₁t₂₂...t₂ₙ, respectively.

Correlation features of multiple dimensions are first calculated for the two short texts, and a logistic regression model then fits the correlation feature scores of the multiple dimensions into a final semantic correlation score.

Specifically:

The text correlation feature values between two short texts are the text-level correlation features. Because they mainly measure the text similarity between short strings and use only the text information of the short strings, they can be computed on the fly by efficient, optimized algorithms.

For example:

(1) Calculating a text correlation feature value based on edit distance

Edit distance, also known as Levenshtein distance, is the minimum number of edit operations needed to change one character string into another. The permitted edit operations are substituting one character for another, inserting a character, and deleting a character.

The edit distance EditDist(T₁, T₂) of two short texts T₁ and T₂ can be computed by a dynamic programming algorithm with time complexity O(|T₁|*|T₂|).

The correlation feature of two short texts based on edit distance is calculated as follows:
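The dynamic programming computation of EditDist(T₁, T₂) can be sketched in Python as below. The function names are hypothetical, and the normalization in `edit_distance_feature` (one minus the distance divided by the longer string's length) is an assumption chosen only so that the feature falls in [0, 1]; it should not be read as the formula of this embodiment.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


def edit_distance_feature(t1: str, t2: str) -> float:
    """ASSUMED normalization: 1 - EditDist / max(|T1|, |T2|), a value in [0, 1]."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(t1, t2) / longest
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of the second string.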

(2) Calculating a text correlation feature value based on longest common subsequence

A subsequence of a character string is the string that remains after deleting some (possibly zero) characters from it; the remaining characters keep their order but need not be contiguous.

The longest common subsequence of two character strings is the longest of all the subsequences they share. The longest common subsequence LCS(T₁, T₂) of two short texts T₁ and T₂ can be computed by a dynamic programming algorithm with time complexity O(|T₁|*|T₂|).

The correlation feature of two short texts based on longest common subsequence is calculated as follows:
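A minimal Python sketch of the longest common subsequence computation follows. The function names are hypothetical, and dividing the LCS length by the longer string's length in `lcs_feature` is an assumed normalization used for illustration, not the formula of this embodiment.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    # prev[j] is the LCS length of the prefix of a processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop one character
        prev = curr
    return prev[-1]


def lcs_feature(t1: str, t2: str) -> float:
    """ASSUMED normalization: LCS length / max(|T1|, |T2|), a value in [0, 1]."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return lcs_length(t1, t2) / longest
```

Because a subsequence tolerates gaps, this feature stays high when one short text is the other with a few extra characters inserted, a case in which segmentation-based features can break down.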

The semantic correlation feature values between two short texts are the semantic-level correlation features, which mainly measure the similarity in concept and meaning between short strings.

The semantic correlation feature values between two short texts can be calculated in the following ways:

(1) Calculating a semantic correlation feature value based on text classification

By way of example, embodiments of the present invention classify short texts mainly with a feature-word-based method. The basic process is:

First, based on an initial, manually annotated set of first-level industry category feature words (the set contains a small number of manually annotated first-level industry category feature words), hundreds of millions of web pages are classified by full-text matching;

each web page that has been assigned a category is segmented into words over its full text, category feature words are extracted, the weight contribution of each extracted category feature word to its category is calculated (that is, a weight vector), and the category feature words extracted from the web pages are merged into the first-level industry category feature word set;

once feature words have been extracted from all web pages, a comprehensive first-level industry category feature word set has been obtained automatically, from which the first-level industry category feature word dictionary is built. The dictionary can be described by the formula p(c|w), where c denotes a category and w denotes a word; that is, each word has a category distribution.

Given two short texts T₁ and T₂, for each short text the category distribution of each word is obtained from p(c|w); the category distribution of each word is multiplied by the global IDF weight of that word and the results are summed, giving the category distribution p(c|T) of the short text.

Using the cosine formula, the text classification similarity of the two short texts T₁ and T₂ is:

$$\cos\big(p(c|T_1),\, p(c|T_2)\big) = \frac{\sum_{c} p(c|T_1)\, p(c|T_2)}{\sqrt{\sum_{c} p(c|T_1)^2}\,\sqrt{\sum_{c} p(c|T_2)^2}}$$
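The IDF-weighted aggregation and the cosine comparison described above can be sketched as follows. The function names, the dictionary-of-dictionaries layout for p(c|w), and the choice to ignore words missing from the dictionary are assumptions made for illustration.

```python
import math
from collections import defaultdict


def text_distribution(words, word_dist, idf):
    """Sum each word's distribution p(c|w), weighted by the word's global IDF.

    words:     segmented words of one short text
    word_dist: {word: {category: p(c|w)}}, a hypothetical dictionary layout
    idf:       {word: global IDF weight}
    Words missing from either mapping contribute nothing (an assumption).
    """
    total = defaultdict(float)
    for w in words:
        for category, p in word_dist.get(w, {}).items():
            total[category] += idf.get(w, 0.0) * p
    return dict(total)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty distribution is treated as unrelated
    return dot / (norm_u * norm_v)
```

The same two helpers apply unchanged to topic distributions if `word_dist` holds p(z|w) instead of p(c|w).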

(2) Calculating a semantic correlation feature value based on PLSA topic correlation

The PLSA model is an unsupervised machine learning model used to identify latent topic information in documents and to mine their latent semantic relationships. The PLSA model assumes that, when writing a document, a user first chooses the topic distribution of the document and then chooses suitable words according to that topic distribution, thereby forming a complete document. In mathematical terms:

A document is selected with probability p(d); each document belongs to a topic with probability p(z|d); and given a topic, each word is generated with probability p(w|z). This process gives the joint probability model:

$$p(d, w) = p(d)\, p(w|d)$$

$$p(w|d) = \sum_{z \in Z} p(w|z)\, p(z|d)$$
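The two equations above can be evaluated directly once p(w|z) and p(z|d) are known. The sketch below uses hypothetical function names and a nested-dictionary layout for the probability tables; it shows only the evaluation of the mixture, not the EM training that estimates these tables.

```python
def word_given_doc(p_w_given_z, p_z_given_d):
    """p(w|d) = sum over topics z of p(w|z) * p(z|d).

    p_w_given_z: {topic: {word: p(w|z)}}
    p_z_given_d: {topic: p(z|d)} for one document d
    """
    result = {}
    for z, pz in p_z_given_d.items():
        for w, pw in p_w_given_z[z].items():
            result[w] = result.get(w, 0.0) + pw * pz
    return result


def joint(p_d, p_w_given_z, p_z_given_d):
    """p(d, w) = p(d) * p(w|d) for every word w of one document d."""
    return {w: p_d * p for w, p in word_given_doc(p_w_given_z, p_z_given_d).items()}
```

If every p(w|z) and p(z|d) is a proper distribution, the resulting p(w|d) also sums to 1 over the vocabulary.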

Given two short texts T₁ and T₂, for each short text the topic distribution of each word is obtained from p(z|w); the topic distribution of every word of the short text is multiplied by the global IDF weight of that word and the results are summed, giving the topic distribution p(z|T) of the short text.

Using the cosine formula, the PLSA similarity of the two short texts is:

$$\cos\big(p(z|T_1),\, p(z|T_2)\big) = \frac{\sum_{z} p(z|T_1)\, p(z|T_2)}{\sqrt{\sum_{z} p(z|T_1)^2}\,\sqrt{\sum_{z} p(z|T_2)^2}}$$

(3) Calculating a semantic correlation feature value based on statistical machine translation

The idea of translation probability between bilingual sentence pairs in statistical machine translation carries over naturally to correlation modeling of short texts.

Given two short texts T₁ and T₂, the probability that T₁ occurs given T₂ is P(T₁|T₂), that is, the likelihood.

Clearly, the more related T₁ and T₂ are, the larger the likelihood. Because texts vary endlessly, modeling the likelihood directly is difficult, so it is rewritten with Bayes' formula as follows:

$$P(T_1|T_2) = \frac{P(T_2|T_1)\, P(T_1)}{P(T_2)}$$

where P(T₂|T₁) is the translation model of machine translation and represents the probability that T₁ is translated into T₂; P(T₁) and P(T₂) are the language models of T₁ and T₂, respectively, and describe the probability that T₁ and T₂ are each a legitimate short text.

Under the bag-of-words (BOW) model assumption:

where P(t_{2j}|t_{1i}) is the translation probability from word t_{1i} to word t_{2j}, that is, the word alignment dictionary. Translation probabilities between word pairs can be trained on a parallel corpus with the EM algorithm.

In a specific application, both the translation model and the language model can be trained with the open-source machine translation software Moses on large-scale web search logs and advertisers' bid terms.

The correlation feature of two short texts T₁ and T₂ based on the machine translation model is designed as follows:

In statistical machine translation, this method works well for translation mapping between different languages. Within a single language (for example, between two Chinese short strings), however, experiments show that the coverage of the translation dictionary is limited, and raising coverage requires a large increase in parallel corpora. Embodiments of the present invention borrow the idea of machine translation to construct one correlation feature between short texts.

(4) Calculating a word-granularity semantic correlation feature value based on web search results

The core of the machine-translation-based correlation feature above is the word alignment dictionary. Inspired by this word-level mapping, embodiments of the present invention further propose a correlation feature based on the web search results of words to characterize the correlation between short texts.

Given a word, the N feature words with the largest TF-IDF values are extracted from its web search results (N is 64 in the actual system), and the feature vector V(t) = (w₁, w₂, ..., w_N) formed from the TF-IDF values of these feature words serves as a representation of the word's meaning. The correlation of two words t₁ and t₂ based on the web search results of words is then defined as follows:

The correlation feature of two short texts T₁ and T₂ based on the web search results of words is designed as follows:

A word-granularity feature only requires storing the TF-IDF feature vectors of common words, which greatly reduces disk space overhead; there is no need to store a massive number of long query strings. Each query string can be expressed through the features of its finer-grained words, and the correlation between short texts can be measured with the formula above.
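The word-level representation described above can be sketched as follows: keep the N feature words with the largest TF-IDF values, then compare two words through their sparse vectors. The function names are hypothetical, and the use of cosine similarity as the word-to-word measure is an assumption; how word-level scores are combined into a short-text score is not covered by this sketch.

```python
import heapq
import math


def top_n_features(tfidf_by_word, n=64):
    """Keep the n feature words with the largest TF-IDF values.

    tfidf_by_word: {feature word: TF-IDF value} computed from the web search
    results of one word (how those values are computed is outside this sketch).
    """
    return dict(heapq.nlargest(n, tfidf_by_word.items(), key=lambda kv: kv[1]))


def word_correlation(v1, v2):
    """ASSUMED measure: cosine similarity of two sparse TF-IDF vectors."""
    dot = sum(x * v2.get(k, 0.0) for k, x in v1.items())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```

Capping each vector at N entries is what keeps the per-word storage bounded, independent of how many search results a word returns.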

With the algorithms above, multiple correlation feature values (text correlation and/or semantic correlation) can be calculated; these can then be merged into one overall correlation feature value.

Specifically:

As described above, correlation feature values of several different dimensions can be calculated between short strings. The features selected include, but are not limited to, edit distance, longest common subsequence, classification, the PLSA topic model, and word-granularity correlation. Finally, a logistic regression model fits all correlation feature values into one overall semantic correlation score.

A sample in the training corpus of the semantic correlation model is usually a pair of short texts together with a correlation score given by an editor, and the desired model output is a correlation score between 0 and 1. Logistic regression, however, is a classification model: it requires each training sample to be a feature vector with a class label, and its output is also a class label.

Embodiments of the present invention therefore proceed as follows:

for each editor-annotated pair of short texts, the multiple correlation feature scores described above are calculated to form a feature vector;

M training examples are built from each feature vector; if the editor's score is S (S ∈ [0, 1]), M·S of these examples are labeled 1 and the remaining examples are labeled 0;

a binary logistic regression model is trained to obtain the weights w₁, w₂, ..., wₙ of the correlation features and the bias b;

for two given short texts T₁ and T₂, the multiple correlation feature scores R₁, R₂, ..., Rₙ described above are first calculated, and the final correlation score is then calculated with the sigmoid function as

$$R(T_1, T_2) = \frac{1}{1 + e^{-\left(\sum_{i=1}^{n} w_i R_i + b\right)}}$$

The sigmoid function takes inputs in (-∞, +∞) and produces outputs in (0, 1), which makes it well suited to calculating correlation scores.
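The training and scoring steps above can be sketched in plain Python. All function names are hypothetical; rounding M·S to the nearest integer and fitting by batch gradient descent are assumptions made for illustration, and any binary logistic regression trainer could be substituted.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def expand_examples(features, score, m=10):
    """Turn one (feature vector, editor score S) pair into m labeled examples.

    ASSUMPTION: round(m * S) copies are labeled 1 and the rest are labeled 0.
    """
    positives = round(m * score)
    return [(features, 1)] * positives + [(features, 0)] * (m - positives)


def train(examples, lr=0.5, epochs=1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in examples:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(examples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(examples)
    return w, b


def correlation_score(features, w, b):
    """Final score: sigmoid of the weighted feature sum plus the bias."""
    return sigmoid(sum(wi * ri for wi, ri in zip(w, features)) + b)
```

Replicating each feature vector with a mix of labels 1 and 0 lets a binary classifier learn from graded scores: for a single replicated vector, the fitted probability approaches the fraction of copies labeled 1, which is approximately S.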

Embodiments of the present invention can be applied in many fields. For example, in a production search advertising retrieval system, the logistic regression model can be used for the preliminary selection of bid terms: a threshold is set on the correlation score between short strings to filter candidates, and the bid terms most semantically related to the query string are retained as candidates.
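The threshold-based preliminary selection can be sketched as below. The function name, the default threshold of 0.5, and the optional `top_k` cut-off are hypothetical; `score_fn` stands for whatever function returns the fitted correlation score of a query and a bid term.

```python
def select_bid_terms(query, bid_terms, score_fn, threshold=0.5, top_k=None):
    """Keep bid terms whose correlation score with the query meets the threshold.

    score_fn(query, term) is assumed to return a correlation score in [0, 1].
    Results are sorted from most to least related; top_k optionally caps them.
    """
    scored = [(term, score_fn(query, term)) for term in bid_terms]
    kept = [(term, s) for term, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]
```

In practice the threshold trades recall for precision: a lower value passes more candidates on to later stages such as click-through rate estimation.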

In conclusion being faced on short text in traditional calculation method based on word vector space model in document The sparse problem of feature.Simultaneously as the word segmentation result of short text depends on language model, different word segmentations are not ensured that Unanimously, the sparse of vector can also be aggravated to a certain extent.

For this problem, embodiment of the present invention proposes based on character strings levels such as editing distance, longest common subsequences Text relevant as basic feature, text similarity between they can express short string from multiple dimensions can preferably be handled very More short texts are lack of standardization, participle is inaccurate or inconsistent situation.

Moreover, tradition is based on literal similar correlation calculations method, traditional BOW (bag-of-words) is mainly utilized Model is typically found on the basis of feature independently assumes, the correlation of short text is measured according to the match condition of feature vector Property, but in practice, many times there is many incidence relations between feature, are especially encountering polysemy and one It when the more words of justice, can semantically offset, association is caused to calculate inaccuracy.

For this problem, embodiment of the present invention proposes the correlation based on text classification, the analysis of probability implicit semantic is special Sign.It can sufficiently excavate the implication relation between short text and the word for constituting short text, thus calculate two short texts it Between classification connection and theme contact, formed and the feature of text relevant supplemented.

Moreover, traditional calculation method based on short text Webpage searching result, is formed using external resource Literal extension to short string.From effect, spreading result depends critically upon the correlation of the products such as selected search engine Quality.From performance, the search result huge amount relied on, each short string requires to store corresponding as a result, to downloading It is required with calculating speed very high；Two it is synonymous but it is literal have slight difference or even the different short text of word order, search result can also It can differ widely, and need to store respectively.In addition, indexed results are also that can regularly update, the spreading result of respective stored Need to change therewith, how to guarantee extend quality do not decline, how equilibrium data update update expense, cannot all avoid Problem.

Embodiments of the present invention propose relevance features based on the web search results of individual words. The number of dictionary resources relied upon is controllable, and single-machine storage space and calculation speed improve considerably, which makes lightweight online semantic relevance calculation between short strings possible.
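A word-level representation of this kind might be built as sketched below. Claim 1 states only that the feature words with the largest TF-IDF values are extracted from a word's web search results and that their TF-IDF values form a feature vector representing the word's meaning. Everything else here is an assumption for illustration: the search results are taken as an already tokenized list, the IDF table is supplied by the caller, and comparing two word vectors by cosine similarity is one plausible choice rather than a step stated in the text.

```python
import math
from collections import Counter


def word_sense_vector(result_tokens, idf, top_k=3):
    """Keep the top_k tokens by TF-IDF from one word's search results.

    result_tokens: tokens from the word's search results (hypothetical input).
    idf: mapping token -> inverse document frequency (hypothetical input).
    Returns a sparse vector as a dict token -> TF-IDF value.
    """
    tf = Counter(result_tokens)
    total = sum(tf.values())
    if total == 0:
        return {}
    scores = {t: (c / total) * idf.get(t, 0.0) for t, c in tf.items()}
    # Sort by descending score, then by token so that ties are deterministic.
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return dict(top)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)
```

Storing one short vector per dictionary word, rather than full search results per short string, is what keeps the resource count bounded in this sketch.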

Based on the detailed analysis above, embodiments of the present invention also propose a text relevance calculation device.

Fig. 2 is a structural diagram of a text relevance calculation device according to an embodiment of the present invention.

As shown in Fig. 2, the device includes a character string receiving unit 201, a relevance feature value calculation unit 202, and a relevance feature value fitting unit 203, wherein:

The character string receiving unit 201 is configured to receive a first character string and a second character string;

The relevance feature value calculation unit 202 is configured to calculate a textual relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string;

The relevance feature value fitting unit 203 is configured to fit, based on a logistic regression model, the textual relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string.

In one embodiment:

The relevance feature value calculation unit 202 is configured to calculate a relevance feature value of the first character string and the second character string based on edit distance, and/or a relevance feature value of the first character string and the second character string based on longest common subsequence.

In one embodiment:

The relevance feature value calculation unit 202 is configured to: build a first-level industry category feature word dictionary; for the first character string, obtain the category distribution of each word from the first-level industry category feature word dictionary, multiply each word's category distribution by that word's global inverse document frequency (IDF) weight, and sum the results to obtain the category distribution of the first character string; for the second character string, obtain the category distribution of each word from the first-level industry category feature word dictionary, multiply each word's category distribution by that word's global IDF weight, and sum the results to obtain the category distribution of the second character string; and calculate the cosine similarity between the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
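The weighted accumulation and cosine comparison described above can be sketched as follows. This is an illustration under stated assumptions: the character strings are taken as already segmented into words, and the dictionary (`word_dist`) and IDF table (`idf`) are hypothetical inputs whose contents are invented for the example.

```python
import math
from collections import defaultdict


def string_distribution(words, word_dist, idf):
    """Sum each word's distribution scaled by the word's global IDF weight.

    words: segmented words of one character string.
    word_dist: mapping word -> {category: probability} (hypothetical dictionary).
    idf: mapping word -> global IDF weight (hypothetical table).
    """
    total = defaultdict(float)
    for w in words:
        weight = idf.get(w, 0.0)
        for category, p in word_dist.get(w, {}).items():
            total[category] += weight * p
    return dict(total)


def cosine_similarity(u, v):
    """Cosine similarity of two distributions stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


# Toy data, invented for illustration only.
word_dist = {
    "flight": {"travel": 1.0},
    "hotel": {"travel": 0.8, "real_estate": 0.2},
    "loan": {"finance": 1.0},
}
idf = {"flight": 2.0, "hotel": 1.0, "loan": 3.0}

d1 = string_distribution(["flight", "hotel"], word_dist, idf)
d2 = string_distribution(["hotel"], word_dist, idf)
d3 = string_distribution(["loan"], word_dist, idf)
```

Here `cosine_similarity(d1, d2)` is high because both strings concentrate on the same category, while `cosine_similarity(d1, d3)` is zero because the strings share no category. The same weighted-sum-then-cosine pattern applies to any per-word distribution.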

In one embodiment:

The relevance feature value calculation unit 202 is configured to: classify each web page by full-text matching against a manually annotated first-level industry category feature word set; and, for web pages that have been assigned a category attribute, perform full-text word segmentation, extract category feature words, and merge the extracted category feature words into the first-level industry category feature word set, thereby building the first-level industry category feature word dictionary.
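The dictionary-building loop above can be sketched as follows. The text does not specify how full-text matching decides a category or how category feature words are selected, so both rules here are assumptions: a page is assigned the category whose seed words it hits most often, and a token becomes a feature word if it occurs at least `min_count` times in pages of that category. Pages are assumed to be already segmented into tokens.

```python
from collections import Counter


def classify_page(tokens, seed_words):
    """Assumed rule: pick the category with the most seed-word hits."""
    hits = {cat: sum(1 for t in tokens if t in words)
            for cat, words in seed_words.items()}
    best = max(hits, key=hits.get, default=None)
    return best if best is not None and hits[best] > 0 else None


def build_dictionary(pages, seed_words, min_count=2):
    """Expand the manually annotated seed set with words from classified pages.

    pages: list of token lists, one per web page (hypothetical input).
    seed_words: mapping category -> set of manually annotated feature words.
    min_count: assumed frequency threshold for accepting a new feature word.
    """
    dictionary = {cat: set(words) for cat, words in seed_words.items()}
    counts = {cat: Counter() for cat in seed_words}
    for tokens in pages:
        category = classify_page(tokens, seed_words)
        if category is not None:  # only pages that received a category
            counts[category].update(tokens)
    for category, counter in counts.items():
        dictionary[category].update(
            t for t, c in counter.items() if c >= min_count)
    return dictionary
```

A production selection rule would more likely weigh how specific a token is to one category. The raw count threshold is used here only to keep the sketch short.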

In one embodiment:

The relevance feature value calculation unit 202 is configured to: for the first character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the first character string by that word's global inverse document frequency (IDF) weight, and sum the results to obtain the topic distribution of the first character string; for the second character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the second character string by that word's global IDF weight, and sum the results to obtain the topic distribution of the second character string; and calculate the cosine similarity between the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.

In one embodiment:

The relevance feature value calculation unit 202 is configured to calculate a relevance feature value of the first character string and the second character string based on statistical machine translation, and/or a word-granularity semantic relevance feature value of the first character string and the second character string based on web search results.

In one embodiment:

The relevance feature value fitting unit 203 is configured to: construct a feature vector from the calculated textual relevance feature value and semantic relevance feature value of the first character string and the second character string; construct training samples from the feature vectors and train a binary logistic regression model on the training samples to obtain the weight of the textual relevance feature value, the weight of the semantic relevance feature value, and the bias; and calculate the relevance feature value from the textual relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.
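The fitting step can be sketched with a small binary logistic regression trained by batch gradient descent. The text specifies only that a binary logistic regression model yields per-feature weights and a bias. The optimizer, learning rate, epoch count, and toy training data below are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    samples: feature vectors, e.g. [textual_feature, semantic_feature].
    labels: 1 for a relevant pair, 0 for an irrelevant pair.
    """
    n = len(samples[0])
    m = len(samples)
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def relevance(x, w, b):
    """Fitted relevance value: sigmoid of the weighted feature sum plus bias."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy training pairs, invented for illustration only.
samples = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
labels = [1, 1, 0, 0]
w, b = train(samples, labels)
```

After training, `relevance([0.9, 0.9], w, b)` lies above 0.5 and `relevance([0.1, 0.1], w, b)` lies below it, so the fitted value can be thresholded or used directly as a score between 0 and 1.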

In one embodiment:

The relevance feature value calculation unit 202 is configured to perform at least one of the following calculations:

Calculating a relevance feature value of the first character string and the second character string based on edit distance;

Calculating a relevance feature value of the first character string and the second character string based on longest common subsequence;

Calculating a relevance feature value of the first character string and the second character string based on text classification;

Calculating a topic relevance feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA);

Calculating a relevance feature value of the first character string and the second character string based on statistical machine translation;

Calculating a word-granularity relevance feature value of the first character string and the second character string based on web search results.

In practice, the text relevance calculation method proposed by embodiments of the present invention can be implemented in many forms. For example, following an application programming interface of a given specification, the method can be written as a plug-in installed on a server, or packaged as an application for users to download and use. When written as a plug-in, it can be implemented in various plug-in formats such as ocx, dll, and cab. The method can also be implemented through specific technologies such as a Flash plug-in, RealPlayer plug-in, MMS plug-in, MI staff plug-in, or ActiveX plug-in.

The text relevance calculation method proposed by embodiments of the present invention can be stored on various storage media as instructions or instruction sets. These storage media include, but are not limited to: floppy disks, CDs, DVDs, hard disks, flash memory, USB flash drives, CF cards, SD cards, MMC cards, SM cards, Memory Sticks, xD cards, and the like.

In addition, the text relevance calculation method proposed by embodiments of the present invention can be applied to storage media based on NAND flash, such as USB flash drives, CF cards, SD cards, SDHC cards, MMC cards, SM cards, Memory Sticks, xD cards, and the like.

In summary, in embodiments of the present invention, a first character string and a second character string are received; a textual relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string, are calculated; and, based on a logistic regression model, the textual relevance feature value and the semantic relevance feature value are fitted into a relevance feature value of the first character string and the second character string. It can be seen that embodiments of the present invention avoid calculation methods based on the vector space model of words in documents, and therefore avoid the feature sparsity problem. This improves the accuracy of relevance prediction, saves storage space, and reduces cost.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A text relevance calculation method, characterized in that the method comprises:

receiving a first character string and a second character string;

calculating a textual relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string;

fitting, based on a logistic regression model, the textual relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string;

wherein calculating the semantic relevance feature value of the first character string and the second character string comprises:

calculating a word-granularity relevance feature value of the first character string and the second character string based on web search results;

wherein calculating the word-granularity relevance feature value of the first character string and the second character string based on web search results comprises:

extracting, from the web search results of a word, a plurality of feature words with the largest TF-IDF values, the feature vector formed by the TF-IDF values of these feature words serving as a representation of the meaning of the word.

2. The text relevance calculation method according to claim 1, characterized in that calculating the textual relevance feature value of the first character string and the second character string comprises:

calculating a relevance feature value of the first character string and the second character string based on edit distance, and/or calculating a relevance feature value of the first character string and the second character string based on longest common subsequence.

3. The text relevance calculation method according to claim 1, characterized in that calculating the semantic relevance feature value of the first character string and the second character string comprises:

building an industry category feature word dictionary;

for the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying each word's category distribution by that word's global inverse document frequency weight, and summing the results to obtain the category distribution of the first character string; for the second character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying each word's category distribution by that word's global inverse document frequency weight, and summing the results to obtain the category distribution of the second character string;

calculating the cosine similarity between the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.

4. The text relevance calculation method according to claim 3, characterized in that

building the industry category feature word dictionary comprises:

classifying each web page by full-text matching against a manually annotated industry category feature word set;

for web pages that have been assigned a category attribute, performing full-text word segmentation, extracting category feature words, and merging the extracted category feature words into the industry category feature word set, thereby building the industry category feature word dictionary.

5. The text relevance calculation method according to claim 1, characterized in that

for the first character string, the topic distribution of each word is obtained, the topic distribution of each word in the first character string is multiplied by that word's global inverse document frequency weight, and the results are summed to obtain the topic distribution of the first character string; for the second character string, the topic distribution of each word is obtained, the topic distribution of each word in the second character string is multiplied by that word's global inverse document frequency weight, and the results are summed to obtain the topic distribution of the second character string;

the cosine similarity between the topic distributions of the first character string and the second character string is calculated to obtain the semantic relevance feature value of the first character string and the second character string.

6. The text relevance calculation method according to any one of claims 1-5, characterized in that fitting, based on the logistic regression model, the textual relevance feature value and the semantic relevance feature value into the relevance feature value comprises:

constructing a feature vector from the calculated textual relevance feature value and semantic relevance feature value of the first character string and the second character string;

constructing training samples from the feature vectors, and training a binary logistic regression model on the training samples to obtain the weight of the textual relevance feature value, the weight of the semantic relevance feature value, and the bias;

calculating the relevance feature value from the textual relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.

7. The text relevance calculation method according to any one of claims 1-5, characterized in that

calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a relevance feature value of the first character string and the second character string based on text classification.

8. A text relevance calculation device, characterized in that the device comprises a character string receiving unit, a relevance feature value calculation unit, and a relevance feature value fitting unit, wherein:

the relevance feature value calculation unit is configured to calculate a textual relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string;

the relevance feature value fitting unit is configured to fit, based on a logistic regression model, the textual relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string;

9. The text relevance calculation device according to claim 8, characterized in that

the relevance feature value calculation unit is configured to calculate a relevance feature value of the first character string and the second character string based on edit distance, and/or a relevance feature value of the first character string and the second character string based on longest common subsequence.

10. The text relevance calculation device according to claim 8, characterized in that

the relevance feature value calculation unit is configured to: build an industry category feature word dictionary; for the first character string, obtain the category distribution of each word from the industry category feature word dictionary, multiply each word's category distribution by that word's global inverse document frequency weight, and sum the results to obtain the category distribution of the first character string; for the second character string, obtain the category distribution of each word from the industry category feature word dictionary, multiply each word's category distribution by that word's global inverse document frequency weight, and sum the results to obtain the category distribution of the second character string; and calculate the cosine similarity between the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.

11. The text relevance calculation device according to claim 10, characterized in that

the relevance feature value calculation unit is configured to: classify each web page by full-text matching against a manually annotated industry category feature word set; and, for web pages that have been assigned a category attribute, perform full-text word segmentation, extract category feature words, and merge the extracted category feature words into the industry category feature word set, thereby building the industry category feature word dictionary.

12. The text relevance calculation device according to claim 8, characterized in that

the relevance feature value calculation unit is configured to: for the first character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the first character string by that word's global inverse document frequency weight, and sum the results to obtain the topic distribution of the first character string; for the second character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the second character string by that word's global inverse document frequency weight, and sum the results to obtain the topic distribution of the second character string; and calculate the cosine similarity between the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.

13. The text relevance calculation device according to any one of claims 8-12, characterized in that

the relevance feature value fitting unit is configured to: construct a feature vector from the calculated textual relevance feature value and semantic relevance feature value of the first character string and the second character string; construct training samples from the feature vectors and train a binary logistic regression model on the training samples to obtain the weight of the textual relevance feature value, the weight of the semantic relevance feature value, and the bias; and calculate the relevance feature value from the textual relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.

14. The text relevance calculation device according to any one of claims 8-12, characterized in that

the relevance feature value calculation unit is configured to calculate a relevance feature value of the first character string and the second character string based on text classification.