US8266150B1 - Scalable document signature search engine - Google Patents
Scalable document signature search engine Download PDFInfo
- Publication number
- US8266150B1 US8266150B1 US12/364,134 US36413409A US8266150B1 US 8266150 B1 US8266150 B1 US 8266150B1 US 36413409 A US36413409 A US 36413409A US 8266150 B1 US8266150 B1 US 8266150B1
- Authority
- US
- United States
- Prior art keywords
- signature
- index
- signatures
- search
- file identifier
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Links
- 230000009977 dual effect Effects 0.000 claims abstract description 6
- 230000001174 ascending effect Effects 0.000 claims description 16
- 238000013500 data storage Methods 0.000 claims description 14
- 238000004422 calculation algorithm Methods 0.000 abstract description 9
- 238000000034 method Methods 0.000 description 123
- 230000008569 process Effects 0.000 description 107
- 230000006870 function Effects 0.000 description 27
- 238000001514 detection method Methods 0.000 description 19
- 101150035983 str1 gene Proteins 0.000 description 11
- 238000013459 approach Methods 0.000 description 9
- 238000005516 engineering process Methods 0.000 description 5
- 102100035353 Cyclin-dependent kinase 2-associated protein 1 Human genes 0.000 description 4
- 230000008901 benefit Effects 0.000 description 4
- 238000003780 insertion Methods 0.000 description 4
- 230000037431 insertion Effects 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 230000007246 mechanism Effects 0.000 description 3
- 230000009466 transformation Effects 0.000 description 3
- 230000000007 visual effect Effects 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000012217 deletion Methods 0.000 description 2
- 230000037430 deletion Effects 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000005259 measurement Methods 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 230000000135 prohibitive effect Effects 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 229910052710 silicon Inorganic materials 0.000 description 2
- XUIMIQQOPSSXEZ-UHFFFAOYSA-N Silicon Chemical compound [Si] XUIMIQQOPSSXEZ-UHFFFAOYSA-N 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000003990 capacitor Substances 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 238000004590 computer program Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 230000001186 cumulative effect Effects 0.000 description 1
- 238000012804 iterative process Methods 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 229910044991 metal oxide Inorganic materials 0.000 description 1
- 150000004706 metal oxides Chemical class 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000006855 networking Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000002265 prevention Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 238000010845 search algorithm Methods 0.000 description 1
- 239000010703 silicon Substances 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 239000000758 substrate Substances 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
Definitions
- the present application is related to co-pending U.S. application Ser. No. 11/361,340, “Matching Engine with Signature Generation,” filed Feb. 24, 2006 by Liwei Ren et al., the disclosure of which is hereby incorporated by reference.
- the present application is also related to co-pending U.S. application Ser. No. 12/069,659, “Document Matching Engine with Asymmetric Signature Generation,” filed Feb. 11, 2008 by Liwei Ren et al., the disclosure of which is hereby incorporated by reference.
- the present invention generally relates to the field of search engine technologies with application, for example, to the area of data leakage prevention.
- an enterprise search engine is a software system to search relevant documents with given query statements.
- the enterprise search engine typically consists of a crawler, an indexer, a searcher and a query engine.
- the crawler gathers documents from pre-assigned locations and dumps them into document repositories.
- the indexer fetches documents from the document repositories, creates indices from the documents, and stores the indices into an index database.
- the searcher searches the index database and returns a list of relevant documents (referenced as “hits”) in response to a specific query.
- the query engine parses a query expression provided by a user and sends query commands to searcher for processing.
- the conventional search system 100 may fetch documents from one or more document sources 105 ( a - n ) that are stored in a document repository 110 .
- the documents from document sources 105 ( a - n ) are indexed by a search engine 120 , and the indexed documents 122 are stored in an index database 124 .
- a user 150 seeking information may use a query composer 130 to compose a query to search documents 126 in the search engine 120 .
- the search may then be conducted by the search engine 120 against the indexed documents 122 in the index database 124 .
- a match or matches i.e. “hits”
- the search engine 120 returns the matching indexed documents as search results 135 that are presented to the user 150 .
- One embodiment relates to a computer apparatus for managing an index of documents.
- the apparatus includes data storage configured to store computer-readable instructions and data, and a processor configured to execute computer-readable instructions and to access said data storage.
- a file identifier-signature index is stored on the data storage, the file identifier-signature index comprising index headers and a signature array.
- a signature-file identifier index is stored on the data storage, the signature-file identifier index comprising signature and file identifier data pairs.
- the apparatus includes data storage configured to store computer-readable instructions and data, and a processor configured to execute computer-readable instructions and to access said data storage.
- Computer-readable code is configured to implement a search engine.
- the search engine is configured to apply a weak search if configured with a weak searching index from an index engine and to apply a strong search if configured with a strong searching index from the index engine, wherein both the weak and strong searching indices are derived from the set of documents.
- FIG. 1 illustrates an example of a conventional architecture of a search engine.
- FIG. 2A depicts an embodiment of an architecture of a matching engine with signature generation.
- FIG. 2B illustrates deployment of the matching engine with signature generation in a distributed computing environment.
- FIG. 2C depicts an embodiment of an architecture of a matching engine with asymmetric signature generation in accordance with an embodiment of the invention.
- FIG. 3 illustrates a first embodiment of a signature generation process for use with English language documents in accordance with an embodiment of the invention.
- FIG. 4 illustrates a second embodiment of a signature generation process for use with universal transformation format encoded documents in accordance with an embodiment of the invention.
- FIG. 5 illustrates a first embodiment of a relevance detection process in accordance with an embodiment of the invention.
- FIG. 6 illustrates a second embodiment of a relevance detection process in accordance with an embodiment of the invention.
- FIG. 7 shows components of a signature index in accordance with an embodiment of the invention.
- FIG. 8 show the structure of an array of index headers in accordance with an embodiment of the invention.
- FIG. 9 shows an array of signatures grouped by file ID (fid) in accordance with an embodiment of the invention.
- FIG. 10 shows an array of pairs ⁇ sig, fid> in ascending order in accordance with an embodiment of the invention.
- FIG. 11 provides a logical presentation of a weak search index in accordance with an embodiment of the invention.
- FIG. 12 provides a logical presentation of a strong search index in accordance with an embodiment of the invention.
- FIG. 13 shows the architecture of signature index and search in accordance with an embodiment of the invention.
- FIG. 14 depicts operations for the creation and insertion of new records into an fid-signature index in accordance with an embodiment of the invention.
- FIG. 15 depicts operations for the creation and insertion of new records into a signature-fid index in accordance with an embodiment of the invention.
- FIG. 16 shows operations for deleting existing records in accordance with an embodiment of the invention.
- FIG. 17 shows operations for updating existing records in accordance with an embodiment of the invention.
- FIG. 18 depicts the creation of weak and strong searching indices in accordance with an embodiment of the invention.
- FIG. 19 depicts a weak signature search in accordance with an embodiment of the invention.
- FIG. 20 depicts a strong signature search in accordance with an embodiment of the invention.
- FIG. 21 is a schematic diagram depicting an example computer apparatus 700 which may be configured to perform various method steps in accordance with an embodiment of the invention.
- the conventional search system discussed in the Background section has various limitations and inefficiencies.
- the conventional search system lacks an accurate and efficient measurement of the document relevance.
- a conventional search system generally returns a large list of documents, most of which may not be relevant at all. Thus, the precision rate of retrieval is low. Returning a large list of documents is a common problem of conventional search engine technologies because the query presented by key terms is unable to precisely depict the documents that users are trying to retrieve.
- Another disadvantage with the direct application of conventional search systems is that they typically measure the relevance of documents through models that are often inaccurate or that are highly computing intensive. Examples of these inaccurate and resource intensive models include a term vector-space model, a probabilistic model, a latent semantic space model, and the like.
- the conventional data structure of a signature index has disadvantages and inefficiencies. These shortcomings include that the search performance may be slow for a conventional index structure.
- the conventional nonlinear index data structure is unsuitable for incremental signature updates based on delta technology.
- the index data takes up a large amount of disk space, and the search APIs (application programming interfaces) have a large footprint and so consumes a large amount of memory.
- FIG. 2A illustrates an architecture of a matching engine with signature generation, as disclosed in U.S. application Ser. No. 11/361,340.
- one or more document resources 205 may be collected (or stored) in a document repository 210 .
- the architecture is structured to pre-process the tokens from the document, select the most informative tokens, and, based on the informative tokens, generate signatures associated with the document.
- the architecture also is configured to ensure uniqueness of the generated signatures with respect to an input document context.
- the architecture is further configured to limit the number of signatures collected while keeping the stability of the collection across varied versions of the same document.
- the signature is a value, for example, a hash representation corresponding to particular information or a string of ASCII characters in accordance with the selected tokens.
- documents 205 may be collected manually or through use of a crawler.
- a crawler may be configured to visit all assigned document sources to collect documents, assigns a unique document identifier (ID) to each document that is collected, and then place the unique document ID and document into the document repository 210 .
- ID unique document identifier
- a signature generator 215 generates a list of signatures from particular documents in the document repository 210 .
- Signatures are strings or values that one makes from the unique information that represents a document. This representative information is unique to the document and stable when the document has moderate changes.
- the signature generator 215 may be configured to store one or more signature generation processes.
- the signature generator 215 may be further configured to select and execute one of the stored processes based on a type of document to be processed.
- a signature generation process is adapted (configured) for use with English language documents, for example, in ASCII code, and is further described with respect to FIG. 3 .
- the process can also apply to other languages that may use lower cases, stop-words and stemming, for example, Romance and Latin languages.
- Another embodiment of a signature generation process is adapted for use with documents in UTF-8 (universal transformation format) encoding for any language supported by Unicode, and is further described with respect to FIG. 4 .
- an indexer 222 indexes the document with unique document identifier (ID) and the signatures generated by the signature generator 215 .
- ID unique document identifier
- the result is an indexed document (by indexer 222 ) that is stored in an index database 224 of a search engine 220 .
- a user 250 may use a query writer 230 to compose a query expression based on the signatures generated by the signature generator 215 . It is noted that the input document provided by a user 250 provides a query input. The user 250 does not need to know what the signatures are; rather, the user 250 only needs to know what the input document is. The user 250 passes the input document to signature generator 215 . The signatures output from the signature generator 215 are passed to the query writer 230 for query composition. The composed query is then passed to a searcher 226 (search mechanism) for searching documents.
- searcher 226 search mechanism
- the searcher 226 in the search engine 220 searches the index database 224 using the query provided through the query writer 230 .
- the searcher returns a list of possible relevant documents 226 (“hits”) to a relevance detection engine 240 .
- the relevance detection engine 240 calculates a relevance (e.g., in percentage terms) between the input document and the hits.
- the relevance detection engine 240 is configured to include one or more processes for a relevance calculation (or analysis).
- a first embodiment of a relevance detection process is further described with respect to FIG. 5 .
- a second embodiment of relevance detection process is further described with respect to FIG. 6 . It is noted that the relevance detection engine 240 can select and/or implement either of these processes. For example, for small documents, the first embodiment of the relevance detection process may be deployed and for larger documents, e.g., greater than 10 megabytes (MB) in size, the second embodiment of the relevance detection process may be deployed.
- MB megabytes
- the matching engine architecture discussed above may be used to efficiently find a limited set of highly-relevant documents. For example, given a query to find documents related to document D with a relevance percentage X %, the matching engine efficiently searches a list of documents ⁇ D 1 , . . . , D b ⁇ from a document repository and returns a set of documents from the list which have a relevance greater than X % in relation to document D.
- the matching engine architecture of FIG. 2A may be used to efficiently find a limited set of highly-relevant documents, there are shortcomings with the architecture.
- the architecture has limitations in relation to bandwidth constraints and scalability.
- the signature (index) database 224 is generated at the computer system 265 of the indexer 222 .
- the searcher 226 also needs access to the signature (index) database 224 in order to execute the document search queries. Therefore, a copy of the signature (index) database 224 needs to be downloaded to the computer system 270 of the searcher 226 .
- the number of documents in the document depository may grow to be a very large number, for example, in excess of a million documents or more.
- the signature (index) database 224 becomes larger and larger.
- the network bandwidth cost of downloading a large signature (index) database 224 may become prohibitive and problematic.
- the data storage space required at the computer system 270 of the searcher 226 to store a large signature (index) database 224 may also become prohibitive and problematic.
- the present application discloses a solution which substantially improves the scalability of the matching engine.
- the matching engine architecture of FIG. 2A is symmetric in that the signature generator 215 is configured to be the same for both the indexer 222 and the searcher 226 .
- the solution disclosed herein breaks that symmetry.
- FIG. 2C depicts an embodiment of an architecture of a matching engine with asymmetric signature generation in accordance with the present invention. As indicated in the diagram, the two signature generators 215 -A (for the indexer) and 215 -B (for the searcher) are now configured differently.
- the two signature generators 215 -A (for the indexer) and 215 -B (for the searcher) may use a same signature generation algorithm or process, but they may be configured differently from each other in that they may use different input parameters for the signature generation process.
- Symmetric signature generators have the same inputs (use the same parameters) to generate the signatures for any given text:
- asymmetric signature generators have different inputs (use different parameters) to generate the signatures for any given text.
- the input parameter N may be different for the two signature generators, while the input parameter M may be the same.
- the tolerance level defines the expectation that the match engine is able to identify variations of any document.
- the tolerance level may be presented in percentile. For example, a tolerance level of 30% means that match engine is able to identify a version of a document even if the content has been changed up to 30%.
- the following six tolerance levels are defined.
- this specific implementation defines the two functions Get-N-for-GeneratorA and Get-N-for-GeneratorB according to Table 1 shown below, depending on the tolerance level (Levels 1 through 6) and the text size (in various size ranges).
- N generally increases with increasing tolerance level and, in this implementation, is assumed to be a number which is a power of 2 (i.e. 2, 4, 8, 16, 32, . . . ). Given a selected tolerance level, N generally increases with increasing text size. Moreover, given a particular tolerance level and text size, N for generator A is generally smaller than N for generator B.
- the above-discussed specific embodiment selects different numbers N in an adaptive manner depending on the text size while applying a same number M. Applicants have determined that this embodiment may be utilized to advantageously reduce a volume of the index (signature) database 224 while maintaining almost the same accuracy and performance of symmetric signature generation.
- asymmetric signature generation may be used so as to reduce the signature database volume while maintaining almost the same accuracy and performance of symmetric signature generation.
- searchers 226 may be configured at various protection points of a network. Placement of the searchers 226 at protection points of a network may be used, for example, to protect against leakage of sensitive data from an institutional network. Protection points of a network may include, for example, internet protocol routers, wireless access points, certain input/output (e.g., data storage or networking) interfaces of host computers (e.g., desktops, laptops), mobile devices and so on.
- the database may be a conventional database, for example, a flat file or relationship database.
- various embodiments of the processes disclosed herein may be implemented using one or more software elements.
- a software element (or modules) may refer to any software structures arranged to perform certain operations.
- the software elements may include program instructions and/or data adapted for execution by a hardware element, such as a processor.
- Program instructions may include an organized list of commands comprising words, values or symbols arranged in a predetermined syntax, that when executed, may cause a processor to perform a corresponding set of operations.
- a signature generator extracts a set of signatures from any relevant document.
- Each signature may be, for example, an ASCII string with fixed length.
- FIG. 3 illustrates a first embodiment of a signature generation process in accordance with the present invention.
- This embodiment illustrates generating signatures from an English document encoded in ASCII codes. Note the following regarding this process.
- Steps 310 , 315 , 320 , and 325 pre-process tokens from the documents.
- Steps 330 , 335 , 340 , 345 , and 350 select the most informative tokens.
- Steps 355 and 360 generate the signatures.
- step 355 ensures signatures generated are unique to the input document context, and step 360 limits the number of signatures collected while keeping the stability of the collection across varied versions of the same document.
- the process begins with inputting 305 the document.
- the process parses 310 the document to generate (or create) an initial list of one or more tokens (a token list).
- a token includes text in the document separated by a predefined character characteristic. Examples of predefined character characteristics include a delimiter. Once tokens are separated, functions such as stemming, stop-word or lower case analysis can be applied.
- Lower casing 315 is a function that converts each letter of a token to a lower case character.
- the process also stems 320 each token of the token list. It is noted that word stemming is a process to identify or extract core roots from a word.
- the process applies 325 a stop-word-list to each token of the list to formulate a new first token list (L 1 ).
- the stop words are words that are considered as carrying no information. Examples of stop words include ‘the’, ‘are’, ‘do’, ‘am’ and the like.
- the process stems each member of a stop-word-list.
- the process then calculates (or generates) 340 a ranking score of each token in the second token list L 2 .
- a score function measures the importance of a token in the text by the frequency and also its assigned weight.
- weight( ) may be a pre-defined function. In one embodiment, its value is a ‘1’, although in alternative embodiments its value may be some pre-assigned number, e.g., 6.8, if the token contains some special characters like ‘ ⁇ ’, ‘_’ and ‘@’.
- the score function may be determined by S j *Weight(T j ). The score function may be used to evenly distribute tokens over the document to get better scores. This is determined by [P(j,S j ) ⁇ P(j,1)]/Sqrt (D j ).
- the process sorts 345 the second token list L 2 by the calculated scores and then selects (or picks) 350 the top M tokens by score from that list (L 2 ).
- M can be any integer and may be predefined within the system or may be selected as an input into the system.
- the top M tokens by score from the second token list L 2 creates a third token list L 3 .
- For each token T j of the third token list L 3 generate 355 signatures out of its occurrences and the neighboring tokens in L 1 .
- This process also can be represented as:
- the process sorts the list ⁇ F j,1 , F j,2 , . . . F j,Sj ⁇ and selects 360 the top N signatures from this sorted list. It is noted that N can be any integer and may be predefined within the system or may be selected as an input into the system.
- N can be any integer and may be predefined within the system or may be selected as an input into the system.
- M*N selected signatures are gathered (or collected) 365 .
- the process then outputs 370 the collection of signatures.
- FIG. 4 illustrates a second embodiment of a signature generation process in accordance with the present invention.
- the second embodiment includes a process that inputs 405 , for example, a textual document of any language in plain UTF-8 format (universal transformation format) and a list of characters in UTF-8 alphabet that we consider as being informative.
- other inputs may include some number M, which corresponds to a number of characters with top ranking scores, and some number N, which corresponds to a maximum signature number for each character.
- Other optional inputs may include a constant integer CHAR_NEIGHBOR, which can have a predefined value, for example, 30. This constant integer defines a size of a character's neighbor in a text string, which will be used to generate signatures.
- Another input is a selection rate R. It has a predefined range between 0 and 1, for example, 0.20. The selection rate is a number for use of selecting a subset out of a set.
- Yet another input may be an empty signature list S.
- the process normalizes 410 the document by scanning the document to remove the characters that are not informative.
- a non-informative character is a UTF-8 character that does not contribute to the text context. They may provide other purposes such as formatting. For example, if a string has n consecutive spaces, then n ⁇ 1 spaces are considered non-informative. Other examples of non-informative characters include control (CTRL) characters and returns.
- CTRL control
- the process then scans 415 the normalized document to record the occurrences of each character, c, in the UTF-8 alphabet.
- the position of the occurrences is denoted as P(1,c), P(2,c), . . . , P(n,c).
- the score function measures an importance of a character in the text by its frequency.
- the score function also ensures that the characters that are evenly distributed over the document get better scores.
- a calculation for achieving this includes: [ P ( n,c ) ⁇ P (1 ,c )]/Sqrt( D ).
- the process continues with sorting 420 the character alphabet by score and then selects (or picks) 425 the M characters with top scores.
- This generated list may be denoted as character list L. It is noted that M can be any integer and may be predefined within the system or may be selected as an input into the system as previously described.
- the process For each character c in the character list L, at each occurrence p of character c, the process calculates its neighbor. In particular, the process values by taking its left and right character and concatenating all the encoding bytes together to form an integer v. This neighbor-value v and the occurrence p make a pair (v, p).
- the process assigns 430 a value of 1 to a variable j.
- Variable j is an enumeration of the list L.
- members of L may be processed one by one. In the illustrated process, this structure is used to realize a concept of “each” and is incrementally increased 435 . In turn, this forms 440 a list L i (c) of pairs for each character c in the character list L.
- the size of the list L 1 (c) may be denoted as N(c).
- the process counts the repeats m of each neighbor-value v in the list to form 445 a second list L 2 (c) with triplets (m, v, p).
- the size of the second list L 2 (c) also may be denoted as N(c).
- Each list L 2 (c) is sorted 450 by (m, v), where m is the first comparison parameter and v is the second comparison parameter.
- the process selects (or picks) 455 the top K(c) triplets from the second sorted list L 2 (c), where K(c) ⁇ R*N(c). This forms a third list L 3 (c).
- the process calculates 460 its hash value by a hash function, hash(p), which generates hash value with the neighboring characters surrounding the occurrence position p.
- hash(p) may be the conventional Karp-Rabin hash function.
- the number of neighboring characters is determined by CHAR_NEIGHBOR.
- the process sorts 465 the third list L 3 (c) by hash value and selects (picks) 470 up to N triplets from the top of sorted list L 3 (c) to form a fourth list L 4 (c).
- N can be any integer and may be predefined within the system or may be selected as an input into the system as previously noted.
- the process For each triplet (m, v, p) in L 4 (c), the process generates 475 a signature using the characters surrounding the occurrence position p and add it into signature list S. It is noted that process described is iterative, and therefore, is iterated for all characters c in list L.
- the signature generator is a unique configuration that beneficially replaces the roles of keywords when composing queries.
- the signature generator is efficient because it reduces the size of hits. This increases the performance of the matching engine.
- the signature generator improves the search precision rate of the matching engine.
- the signature generator can be structured to be language-independent, thus expanding the scope of documents available for search.
- signatures play a particular role in a search engine in a manner that may be more useful than conventional keywords.
- Signatures are abstracted from documents in a manner as described herein to characterize/represent documents better than keywords. Hence, they are more relevant to the documents than keywords.
- signatures may be different than keywords in that a signature is strongly relevant to a document while a keyword is not necessarily so, two irrelevant documents do not share any signature while they could own the same single keyword, and signatures achieve better search precision rates than keywords.
- a system in accordance with the present invention also may include opportunities for relevance detection.
- each document can be considered as a string of characters (ASCII, Unicode, etc.) of an alphabet.
- ASCII a string of characters
- Unicode Unicode
- the relevance of two documents is strongly related to the similarity of two strings.
- SIM is applied to measure the similarity of str2 to str1.
- str1 “AAAAACCCCCCCCBBBBBBDDDDDDAAAAAALLLLLLL”
- str2 “CCCCCCCCCZZZZAAAAAAABBBBTTTTLLL”.
- the example above illustrates one embodiment of similarity of two strings that is actually defined by substring copies from str1 to str2 with a minimum size requirement of each copy.
- text documents there are many characters that are not necessarily contributing to the document context. For example, extra space and invisible characters are not informative at all. Hence, these useless characters are first removed from the documents before applying the function SIM.
- This process may be referenced as string normalization. For example, the string “There are some useless characters in this sentence !” can be normalized as “There are some useless characters in this sentence!”. In this example, there are unneeded (or useless) spaces between words in the original sentence and only one space between words after normalization.
- the suffix strings are sorted. String X has n+1 suffix strings. These can be sorted lexicographically by any means. Suffix sorting is a conventional algorithm problem known to those skilled in the art.
- FIG. 5 illustrates a first embodiment of a relevance detection process in accordance with the present invention.
- the process starts with input 505 of an initial document (e.g., referenced as doc) plus one or more additional documents, plus an integer M.
- an initial document e.g., referenced as doc
- additional documents may be a list of text documents to be matched.
- the additional documents may be referenced as doc 1 (or doc — 1) through doc m (or doc_m), where m is the number of additional documents and M is an integer corresponding to a minimum substring match length.
- M can be any integer and may be predefined within the system or may be selected as an input into the system as previously described.
- the process normalizes 510 all the documents, initial doc plus additional doc s , through doc n , to get strings str, str 1 (or str — 1) through str m (or str_m).
- the process sorts 515 the suffixes of str with an array IDX to record the suffix string positions. It is noted that array IDX is known in conventional suffix sorting algorithms.
- the process next searches 535 a maximum matching length of string str and S(str k , P).
- the above illustrates an example of a function searchMaxMatchLen to search the suffix string (of string str) which shares the longest common prefix substring with another string str2.
- This function is implemented by a binary search.
- the function getMaxMatchSize is to get the longest common prefix among two strings.
- the output advantageously presents a similarity in percentages between an input document and a list of additional documents. For example, as illustrated above there is given a threshold percentage x % and an input document to find the documents in the stored index document database.
- the process beneficially generates the signatures of the input document by signature generator.
- the searcher searches the index database using the signatures and returns a list of documents (hits), each of which shares at least one common signature with the input documents.
- the relevance detection process calculates the similarity between the input document and each document in the list. These are output as SIM 1 , . . . , SIM m .
- FIG. 6 it illustrates a second embodiment of a relevance detection process in accordance with the present invention.
- the process begins with an input 605 of an initial text document, referenced as doc, and a list of text documents to be matched to the doc, plus an integer M.
- the list of text documents is referenced as doc 1 , . . . , doc m .
- m is the number of text documents and M is a minimum substring match length.
- M can be any integer and may be predefined within the system or may be selected as an input into the system as previously described.
- the process normalizes 610 doc, doc 1 , . . . , doc m to generate (or produce) strings str, str 1 , . . . , str m .
- the process assigns 615 a prime number, Q, which is larger than the size of string str and is referenced as L.
- Q a prime number
- the process allocates an array H with size Q for a hash table with chaining capability to resolve collisions of hash values.
- the hash function HT_FUN is to calculate a hash value of a substring of the string str, which starts at position j and with a length M.
- a conventional Karp-Rabin hash function may be applied.
- V(s) getMaxMatchSize(str+s,L ⁇ s, str k +P, L k ⁇ P) to get the maximal matching length of two sub-strings.
- V maximum(V(s)).
- V represents the length of the largest prefix string of S(str k ,P) and this prefix is also a substring of string str.
- the relevance detection engine beneficially is configured to determine document relevance in percentage measurements.
- the configuration is structured so that irrelevant documents included in hits can be filtered out by a percentage threshold. This increases search engine utilization and provides results having a greater degree of acceptance.
- the relevance detection engine is beneficially structured to provide a document filter. It calculates a relevance (or similarity) between a given document and a list of other documents based on the definition of document relevance. The relevance is given in percentages. For a given threshold X %, the engine filters out the documents in the list that have relevance less than X %.
- the present application discloses an advantageous logical structure for a signature index that includes search tables.
- An exemplary system for creating, managing and using the signature index are discussed herein.
- the system consists of index engine and search engine.
- the search tables are compact in disk usage and suitable for diff computation, making large scale deployment feasible.
- the search engine based on the search tables is extremely efficient in searching signatures from large search tables storing millions of signatures.
- the number of possible signatures is 91 8 , which is 4,702,525,276,151,521, close to 4,702 trillion.
- the size of the signature space is 4,702,525,276,151,521. This is a huge space which is sufficient to accommodate signatures of billions of files. Assume that there are one billion files and each file has at most 128 signatures. With those assumptions, there would be at most 128 billions of signatures which is quite a small number in comparison with the 4,700 trillion size signature space.
- FIG. 7 shows components of a signature index in accordance with an embodiment of the invention.
- the signature index has a hierarchical structure and comprises a managing index and a searching index.
- the managing index has two parts, the fid-signature index and the signature-fid index.
- the fid-signature index is used to add signatures into index.
- the signature-fid index is generated from the fid-signature index, and the signature-fid index is used to generate the searching index.
- the fid-signature index includes index headers (index heads) and a signature array.
- the index headers may be stored in an index header file
- the signature array may be stored in a signature file.
- the index headers are shown in FIG. 8 . Signatures belonging to the same fid are stored consecutively together in the signature array. As such, the index headers indicate, for each fid (file identifier or file ID), the number of signatures belonging to a fid (# sig), the position (st_pos) of the first signature for this fid in the signature array, and the position (en_pos) of the last signature for this fid in the signature array.
- the index headers are arranged in ascending order based on the fid value. When a new fid is added, its entry must be inserted in the appropriate position to keep the order.
- the index header array may be in memory during the process of record creation. It may be saved in an index header file.
- the signature array comprises an array of signatures grouped by fid as shown in FIG. 9 .
- the signatures Before the signature array is saved as a signature file, the signatures may be sorted in ascending order of fid such that the signatures in the signature file are stored in ascending order of fid (i.e. in ascending order by the value of the fid).
- the signatures in the array may be grouped by fid, but not necessarily in ascending order of fid when the array is in memory.
- a new record with its fid and associated signatures
- its signatures may be appended at the end of the signature array.
- the signature-fid index comprises an array of pairs ⁇ sig, fid>. In accordance with an embodiment of the invention, these signature-fid pairs are ordered according to the signatures in ascending order.
- the signature-fid index may be saved in a file called the signature-fid file.
- the searching index comprises a weak searching index and a strong searching index.
- FIG. 11 provides a logical presentation of a weak searching index in accordance with an embodiment of the invention
- FIG. 12 provides a logical presentation of a strong searching index in accordance with an embodiment of the invention.
- the weak searching index includes a meta data array and a weak search table.
- the weak search table includes signatures that are stored in ascending order based on the signatures themselves. The ordered signatures are organized in blocks.
- the weak search table meta data may comprise start (st_pos) and end (en_pos) positions for each block.
- each signature may be a string of 8 ASCII characters, for example, and the signatures in the same block may have the same first two characters.
- the meta-data array would have 8,281 elements with each element corresponding to a block of signatures grouped by their first two bytes (for example). If a block has no signature shown in the array, the starting and ending positions may be both zero.
- the weak search table is preferably implemented such that it does not contain duplicate signatures.
- the strong searching index also includes meta data and a strong search table.
- the strong searching table comprises a sorted sig-fid cell array.
- each entry in the strong search table includes a signature and associated fids.
- the entries are stored in ascending order based on the signatures themselves, and the ordered signatures are organized in blocks.
- the associated fids are the identifiers for the files (documents) that contain the signature for that entry.
- the strong search table meta-data may comprise start (st_pos) and end (en_pos) positions for each block of cells. As shown in FIG.
- the strong searching index has two parts—one part is an array of elements called sig-fid elements may be called the strong search table and the other part is a meta-data array with each element corresponding to a block of signatures grouped by the first two bytes (for example).
- a sig-fid element has three fields ⁇ signature, fid 1 , fid 2 >. If a signature belongs to only one fid in index, the fid 2 of its associated element has a null value. If a signature belongs to more than two fid, it has more than one associated element.
- the sig-fid array may be sorted in the ascending order of signatures, and it may contain duplicated signatures.
- the meta-data array may be defined in the same way as with the case of the weak searching index.
- the strong search table is used in strong search of signatures.
- the signature index shown in FIG. 7 may be stored in seven files.
- Two of the files are the index header file and the signature file for the fid-signature index.
- a third file is the signature-fid file for the signature-fid index.
- Fourth and fifth files are the weak meta data file and the weak search file (weak search table) for the weak searching index.
- the sixth and seventh files are the strong meta data file and the strong search file (strong search table) for the strong searching index.
- the weak and strong search files are pushed (transmitted and stored) from the server where the index engine (indexer) is hosted to client computers where the search engine (searcher) resides.
- the difference between different versions of these two search tables may be calculated.
- the difference for the meta-data files may not need to be calculated. Instead, the meta-data files may be compressed and pushed to client computers directly.
- either weak search files or strong search files, but not both, may be pushed to a particular client computer. Whether a weak or strong search file is pushed to a particular client is configurable.
- FIG. 13 shows the architecture of signature index and search in accordance with an embodiment of the invention.
- the architecture includes an index engine and a search engine.
- the index engine is responsible for the index lifecycle management which includes the operations such as creation/insertion/update/deletion.
- the index engine maintains the managing index and creates the searching indices.
- data stored so as to be accessible to the index engine include the fid-signature index, the signature-fid index, the strong searching index and the weak searching index.
- the search engine is responsible for signature search.
- Each search engine uses either a weak searching index or a strong searching index.
- the search engine may provide weak searching by utilizing a weak search table and associated meta data or may provide strong searching by utilizing a strong search table and associated meta data.
- the searching index files are pushed to each client that runs the search engine.
- the searching index files are designed in the way for efficient incremental signature update using delta or diff technology.
- the index engine is configured to receive new fid and signature data and to create and insert records of that data, as appropriate, into the fid-signature index.
- FIG. 14 depicts the process for creating and inserting new records.
- n sets of fid and signature data may be received, and fid-signature indices l through n may be created therefrom.
- the following steps may be used to create a managing index directly from fid and signature data.
- create the fid-signature index which includes both array of index headers and signature array, in memory and save into files.
- fid-signature indices may be merged pair-wise, in multiple steps as may be necessary, to create a single fid-signature index 1402 .
- the following steps may be used to merge two managing indices into one. First, merge two signature files into one signature file. Second, merge the two index header files into one index header file according to the new signature file. Finally, merge the two signature-fid files into one signature-fid file.) Thereafter, the single fid-signature index 1402 may be merged with the existing fid-signature index 1404 so as to generate the final (updated) fid-signature index 1406 . This final merge step effectively inserts the new fid-signature data into the existing fid-signature index.
- the index engine is configured to receive new fid and signature data and to create and insert records of that data, as appropriate, into the signature-fid index.
- the process for creating and inserting new records is depicted in FIG. 15 . This process may start with the fid-signature indices l through n which were created as shown in FIG. 14 . These fid-signature indices may be each transformed so as to create corresponding signature-fid indices l through n. The signature-fid indices may then be merged pair-wise, in multiple steps as may be necessary, to create a single signature-fid index 1502 .
- the cumulative signature-fid index 1502 may be merged with the existing signature-fid index 1504 so as to generate the final (updated) signature-fid index 1506 .
- This final merge step effectively inserts the new signature-fid data into the existing signature-fid index.
- the index engine is further configured to receive an instruction as to a set of fids to be deleted and to delete such obsolete records of that data, as appropriate, from the fid-signature and the signature-fid indices.
- the process for deleting records is depicted in FIG. 16 .
- a set of fid to be deleted is received by the index engine.
- the index engine then retrieves the array of relevant ⁇ fid,sig> elements from the fid-signature index of the managing index.
- the index engine then converts the array of relevant ⁇ fid,sig> elements into a sorted array of relevant ⁇ sig,fid> elements.
- the index engine may then delete all relevant ⁇ fid,sig> elements from the fid-signature index of the managing index by deleting all signatures belonging to the set of fid from the signature file and updating the index header file accordingly. This results in the new (updated) fid-signature index.
- the index engine deletes all relevant ⁇ sig,fid> elements from the signature-fid index. This results in the new (updated) signature-fid index.
- the index engine is further configured to update existing records of data, as appropriate, in the fid-signature and signature-fid indices.
- the process for updating records is depicted in FIG. 17 .
- a set of fid with their signatures to be updated is received by the index engine. This may be formed as an array A 1 of ⁇ fid,sig>.
- the index engine retrieves an array A 2 of ⁇ fid,sig> from the existing fid-signature index of the managing index. The index engine may then convert A 1 and A 2 into two sorted arrays of ⁇ sig,fid> which may be denoted as B 1 and B 2 , respectively.
- the index engine may then update the managing index as follows: update all signatures belonging to the set of fid in the signature file (the signature array of the fid-signature index) with the information provided by A 1 and A 2 ; update the index header file accordingly; and update all relevant ⁇ sig,fid> elements in the signature-fid index with the information provided by B 1 and B 2 .
- the index engine creates new (strong and weak) searching indices from the final signature-fid index.
- the creation of new weak and strong searching indices from the final signature-fid index is depicted in FIG. 18 . It is a straightforward process.
- search engine is a separate component from the index engine (indexer).
- the search engine may be configured to provide one of two searching operations, depending on which searching index is sent to the search engine.
- the search engine is configured to search an input array of sorted signatures based on the weak searching index.
- the search engine first determines to which sub-space (block) each signature belongs based on the first two bytes. For all signatures in the same sub-space, the search engine uses the meta-data to go directly to the corresponding block in the sorted signature array and execute a multiple value binary search (MVBS) within the block.
- MVBS multiple value binary search
- the procedure for a weak signature search may be outlined as follows. First, the search engine receives an array of sorted signatures as input. The signatures of the input signature array are assigned into sub-spaces based on the first two bytes of each signature. Then, each sub-space may be processed (walked through) with the following steps.
- the search engine is configured to search an input array of sorted signatures based on the strong searching index. For each signature, the block containing it can be determined based on its first two bytes, and then the MVBS algorithm is performed within the block. If the signature is found in a cell, then the neighboring cells are checked so as to find all the instances of the signature and get all fids to which the signature belongs. This is done due to the possible duplication of the signatures in the strong search table. This strong search procedure is advantageously efficient.
- a multiple value binary search may be utilized.
- Inputs are the sorted reference list R[1, . . . , N] and the target list T[1, . . . , M].
- a regular binary search is applied against the reference list R to determine whether t belongs to R.
- a regular binary search is applied M times against a reference list of size N. Applicants have determined that this first solution is inefficient.
- Second solution Again, inputs are the sorted reference list R[1, . . . , N] and the target list T[1, . . . , M].
- This solution is an iterative process starting from binary searching the first element T[1] against R. If T[1] exists in R, any element of R less than T[1] is removed from the reference list for the next search. Otherwise, the reference list R is not changed. The procedure then continues to the next element T[2] and so on with the same reference updating mechanism. See “Multiple Values Search Algorithm,” by Arabic Sharif and Aman Ullah Khan, Journal of Information & Communication Technology , Vol. 1, No. 2 (Fall 2007), pp. 49-58. Applicants have determined that this solution is generally more efficient than the first solution in that the size of the reference list becomes smaller after each successful search.
- the present application discloses the following new solution to the multiple value binary search (MVBS) problem.
- This innovative solution has the advantage of dual binary searching with respect to both the target list and the reference list.
- this third solution may be referred to as a multi-value dual binary search.
- the second solution performs solo binary searching in regard only to its reference list.
- our third solution searches for each element of T from R, its reference list has a smaller size than the corresponding reference list of the second solution. Based on our testing, our multi-value dual binary search procedure is more efficient than the second solution, especially for a very large reference list.
- our procedure starts with an empty list S.
- MVB-SEARCH is a recursive function which returns the total number of matched items with the following prototype and logic.
- FIG. 21 is a schematic diagram depicting an example computer apparatus 2100 which may be configured to perform various method steps in accordance with an embodiment of the invention. Other designs for the computer apparatus may be used in alternate embodiments. As discussed above, embodiments of the present invention may be performed by multiple computer apparatus 2100 communicatively interconnected by a network.
- the computer apparatus 2100 comprises a processor 2102 , a computer-readable memory system 2104 , a storage interface 2108 , a network interface 2110 , and other interfaces 2112 . These system components are interconnected through the use of an interconnection network (such as a system bus or other interconnection system) 2106 .
- the memory 2104 may be configured to include, in addition to other components and data, processor-executable instructions to perform various method steps disclosed herein.
- the storage interface 2108 may be used to connect storage devices 2114 to the computer apparatus 2100 .
- the network interface 2110 may be used to communicate with other computers 2118 by way of an external network 2116 .
- the other interfaces may interface to various devices, for example, a display 2120 , a keyboard 2122 , and other devices.
- a user is provided mechanisms, e.g., by receiving and/or transmitting control signals, to control access to particular information as described herein.
- control signals e.g., by receiving and/or transmitting control signals
- these benefits accrue regardless of whether all or portions of components, e.g., server systems, to support their functionality are located locally or remotely relative to the user.
- a hardware element may refer to any hardware structures arranged to perform certain operations.
- the hardware elements may include any analog or digital electrical or electronic elements fabricated on a substrate.
- the fabrication may be performed using silicon-based integrated circuit (IC) techniques, such as complementary metal oxide semiconductor (CMOS), bipolar, and bipolar CMOS (BiCMOS) techniques, for example.
- CMOS complementary metal oxide semiconductor
- BiCMOS bipolar CMOS
- Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth.
- processors microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth.
- ASIC application specific integrated circuits
- PLD programmable logic devices
- DSP digital signal processors
- FPGA field programmable gate array
- the embodiments are not limited in this context.
- a software element may refer to any software structures arranged to perform certain operations.
- the software elements may include program instructions and/or data adapted for execution by a hardware element, such as a processor.
- Program instructions may include an organized list of commands comprising words, values or symbols arranged in a predetermined syntax, that when executed, may cause a processor to perform a corresponding set of operations.
- the software may be written or coded using a programming language. Examples of programming languages may include C, C++, BASIC, Perl, Matlab, Pascal, Visual BASIC, JAVA, ActiveX, assembly language, machine code, and so forth.
- the software may be stored using any type of computer-readable media or machine-readable media. Furthermore, the software may be stored on the media as source code or object code. The software may also be stored on the media as compressed and/or encrypted data.
- Examples of software may include any software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.
- the embodiments are not limited in this context.
- Some embodiments may be implemented, for example, using any computer-readable media, machine-readable media, or article capable of storing software.
- the media or article may include any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, such as any of the examples described with reference to a memory.
- the media or article may comprise memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, Compact Disk Read Only Memory (CD-ROM), Compact Disk Recordable (CD-R), Compact Disk Rewriteable (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disk (DVD), subscriber identify module, tape, cassette, or the like.
- the instructions may include any suitable type of code, such as source code, object code, compiled code, interpreted code, executable code, static code, dynamic code, and the like.
- the instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language, such as C, C++, Java, BASIC, Perl, Matlab, Pascal, Visual BASIC, JAVA, ActiveX, assembly language, machine code, and so forth.
- suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language such as C, C++, Java, BASIC, Perl, Matlab, Pascal, Visual BASIC, JAVA, ActiveX, assembly language, machine code, and so forth.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Signature Generator A | S = ExtractSignature(T, M, N) | ||
Signature Generator B | S = ExtractSignature(T, M, N) | ||
where M and N are the same.
For example, a particular implementation may set M=5 and N=50 uniformly for all input text T.
-
- Level 1: 5%
- Level2: 10%
- Level 3: 20%
- Level 4: 40%
- Level 5: 50%
- Level 6: Best Effort
TABLE 1 | ||||
N for | N for | |||
Level | Text Size Range in KB | M | generator A | generator B |
1 | 0-10K | 4 | 2 | 32 |
1 | 10-20K | 4 | 4 | 64 |
1 | 20-30K | 4 | 4 | 64 |
1 | 30-50K | 4 | 4 | 128 |
1 | 50-70K | 4 | 8 | 128 |
1 | 70-80K | 4 | 8 | 128 |
1 | 80-100K | 4 | 8 | 128 |
1 | 100-500K | 4 | 16 | 1024 |
1 | >500K | 4 | 1024 | 1024 |
2 | 0-10K | 4 | 2 | 32 |
2 | 10-20K | 4 | 4 | 64 |
2 | 20-30K | 4 | 8 | 64 |
2 | 30-50K | 4 | 8 | 128 |
2 | 50-70K | 4 | 16 | 128 |
2 | 70-80K | 4 | 16 | 256 |
2 | 80-100K | 4 | 16 | 256 |
2 | 100-500K | 4 | 64 | 1024 |
2 | >500K | 4 | 1024 | 1024 |
3 | 0-10K | 4 | 4 | 32 |
3 | 10-20K | 4 | 8 | 64 |
3 | 20-30K | 4 | 16 | 64 |
3 | 30-50K | 4 | 16 | 128 |
3 | 50-70K | 4 | 32 | 256 |
3 | 70-80K | 4 | 32 | 256 |
3 | 80-100K | 4 | 32 | 1024 |
3 | 100-500K | 4 | 128 | 1024 |
3 | >500K | 4 | 1024 | 1024 |
4 | 0-10K | 4 | 8 | 64 |
4 | 10-20K | 4 | 16 | 256 |
4 | 20-30K | 4 | 16 | 256 |
4 | 30-50K | 4 | 32 | 1024 |
4 | 50-70K | 4 | 128 | 1024 |
4 | 70-80K | 4 | 128 | 1024 |
4 | 80-100K | 4 | 128 | 1024 |
4 | 100-500K | 4 | 1024 | 1024 |
4 | >500K | 4 | 1024 | 1024 |
5 | 0-10K | 4 | 16 | 64 |
5 | 10-20K | 4 | 32 | 256 |
5 | 20-30K | 4 | 32 | 256 |
5 | 30-50K | 4 | 32 | 1024 |
5 | 50-70K | 4 | 128 | 1024 |
5 | 70-80K | 4 | 128 | 1024 |
5 | 80-100K | 4 | 128 | 1024 |
5 | 100-500K | 4 | 1024 | 1024 |
5 | >500K | 4 | 1024 | 1024 |
6 | 0-10K | 4 | 16 | 256 |
6 | 10-20K | 4 | 32 | 256 |
6 | 20-30K | 4 | 32 | 1024 |
6 | 30-50K | 4 | 64 | 1024 |
6 | 50-70K | 4 | 128 | 1024 |
6 | 70-80K | 4 | 128 | 1024 |
6 | 80-100K | 4 | 128 | 1024 |
6 | 100-500K | 4 | 1024 | 1024 |
6 | >500K | 4 | 1024 | 1024 |
-
- Generator A should generate much less number of signatures than generator B.
- The numbers of signatures extracted by both Generators A and B should be set accordingly based on the input text size.
- Numbers of signatures extracted by both A and B should be set accordingly based on pre-defined tolerance level which is configurable by the system.
-
- Theorem: Let T be any text, and n and m be two numbers with power of 2. If n<m, the following result holds:
S(T,n)<S(T,m) - which means the set S(T,n) is included in S(T,m).
- Corollary: Let T1 and T2 be two versions of the same text, and n and m be two numbers with power of 2. If n<m, the following result holds:
S(T 1 ,n)∩S(T 2 ,n)<S(T 1 ,n)∩S(T 2 ,m)<S(T 1 ,m)∩S(T 2 ,m) - where S(T1, n)∩S(T2, m) is exactly what the asymmetric signature generation presents the match accuracy. Here n and m can be considered as input numbers for generator A and B respectively.
- Theorem: Let T be any text, and n and m be two numbers with power of 2. If n<m, the following result holds:
L 1 ={t 1 ,t 2 , . . . ,t m}
L 2 ={T 1 ,T 2 , . . . ,T n},
-
- where denote Ti˜<P(i,1), P(i,2), . . . , P(i,Si)> to mark the positions of occurrences and where i=1, . . . , n and where S1+S2+ . . . +Sn=m.
Score(T j)=[P(j,S j)−P(j,1)]*S j*Weight(T j)/Sqrt(D j),
where D j =[P(j,2)−P(j,1)]2 +[P(j,3)−P(j,2)]2 . . . +[P(j,S j)−P(j,S j−1)] 2
In addition, a score function measures the importance of a token in the text by the frequency and also its assigned weight. It is noted that weight( ) may be a pre-defined function. In one embodiment, its value is a ‘1’, although in alternative embodiments its value may be some pre-assigned number, e.g., 6.8, if the token contains some special characters like ‘−’, ‘_’ and ‘@’. The score function may be determined by Sj*Weight(Tj). The score function may be used to evenly distribute tokens over the document to get better scores. This is determined by [P(j,Sj)−P(j,1)]/Sqrt (Dj).
-
- For each kε{P(j,1), P(j,2), . . . , P(j,Si)}, pick its neighboring 2d tokens in L1 and concatenate them together to form a string, that's tk+d+ . . . +tk−1+tk+tk+1+ . . . +tk+d.
- Encoding this string gives us a signature Fj,k.
Score(c)=Sqrt(n)*[P(n,c)−P(1,c)]/Sqrt(D)
where D=[(P(2,c)−P(1,c)]2+[(P(3,c)−P(2,c)]2+ . . . +[(P(n,c)−P(n−1,c)]2. The score function measures an importance of a character in the text by its frequency. The score function also ensures that the characters that are evenly distributed over the document get better scores. A calculation for achieving this includes:
[P(n,c)−P(1,c)]/Sqrt(D).
SIM(str2,str1)=(sum of lengths of all substrings in S)/(length of str2)*100%
It is advised that the function SIM is not symmetric, i.e., SIM(str1,str2)≠SIM(str2,str1). For example, consider str1=“AAAAACCCCCCCCBBBBBBDDDDDDAAAAAALLLLLLL” and str2=“CCCCCCCCCZZZZZAAAAAAABBBBTTTTLLL”. The required minimum of substring length may be set, for example, as M=4. Then S={“AAAAAA”,“CCCCCCCC”,“BBBB”} the substrings of str2 is what is needed to calculate a similarity:
SIM(str2,str1)=18/27=67%.
-
- int searchMaxMatchLen (int IDX, int start, int end, char *str, int len, char *str2, int len2){
- int i, j;
- if(end-start<2) {
- i=getMaxMatchSize(str+IDX[start], len −IDX[start], str2, len2);
- j=getMaxMatchSize(str+IDX[end], len −IDX[end], str2, len2);
- if(i>j)
- return i;
- else
- return j;}
- i=start+(end-start)/2;
- if(strncmp(str+IDX[i], str2, minimum(len−IDX[i], len2))<0)
- return searchMaxMatchLen (IDX, i, end, str, len, str2, len2);
- else
- return searchMaxMatchLen (IDX, i, start, str, len, str2, len2);}
- int getMaxMatchSize(char *str, int len, char *str2, int len2){
- int i;
- for(i=0; (i<len) && (i<len2); i++)
- if(str[i] !=str2[i]) break;
- return i;}
-
- A signature is a string of 8 ASCII characters with ASCII values ranging from 33 to 123 inclusively. That is, each character in a signature can assume one of 91 values.
- Each document may have multiple signatures.
- A unique ID is assigned to each document. This unique ID is called the document ID or file ID. The two terms document ID and file ID are used alternatively herein.
- For a given collection of documents, one has a set of pairs <id, s>, where id and s are a document ID and a signature respectively. And each unique id may be relevant to multiple signatures.
- Our example search problem is defined as follows:
- All the pairs <id,s> are indexed into well-defined data structures.
- These data structures are referred to as the signature index.
- For any given sequence of signatures S={s1, s2, . . . , son} that may be in ascending order, we may retrieve all document IDs, which may be paired with any signature in S, against the signature index. This is called a strong search.
- For any given sequence of signatures S={s1, s2, . . . , sn} that may be in ascending order, we may search whether the signature index contains any of the signatures in S. This is called a weak search.
- All the pairs <id,s> are indexed into well-defined data structures.
- Two separate system entities are provided to support the signature search problem.
- An Indexer (index engine) which builds signature index files from the file IDs and the signatures.
- A Searcher (search engine) which executes both strong searches and weak searches.
-
- Divide the whole signature space into 8,281 sub-spaces. Each sub-space is defined by the first two bytes of signatures. Hence, the size of each sub-space is 916=60,414,540,641, which is around 60 billion signatures.
- A signature block consists of signatures in the same sub-space. The signatures in a block have the same values in the first 2 characters. There are 912=8,281 signature blocks, corresponding to 8,281 sub-spaces.
To search for a signature, the search engine first determines to which sub-space the signature belongs, then the search engine directly searches through signatures of the corresponding block in the search table. The signatures of each block are sorted in the ascending order in search table. A binary search may then be applied, which is a very fast procedure. In particular, since we are searching a sequence of signatures in ascending order, a multiple value binary search (MVBS) may be utilized for rapid searching performance. A particularly efficient MVBS is disclosed below.
-
- For all target signatures in this sub-space (the target array S), which is again an array of sorted signatures, by using meta-data, the search engine may identify an array B of sorted signatures that belongs to the corresponding block from the weak table. The array B is loaded into memory from the file. The loading may be performed using only one input/output (I/O) operation.
- The search engine may then search the target array S against the reference array B by applying the MVBS algorithm.
MVB-SEARCH(S,R[L1,...,H1], T[L2,...,H2] ) | ||
BEGIN | ||
IF L2>H2 THEN RETURN 0 | ||
## We take truncated value for the following value | ||
M2 = (L2 + H2)/2 | ||
L=L1 | ||
H=H1 | ||
WHILE (L≦H ) DO | ||
BEGIN | ||
## We take truncated value for the following value | ||
M1=(L+H)/2 | ||
IF T[M2]>R[M1] THEN L=M1+1 | ||
ELSE IF T[M2]<R[M1] THEN H=M1−1 | ||
ELSE | ||
BEGIN | ||
ADD T[M2] TO S | ||
V=1+MVB-SEARCH(S,R[L1,...,M1−1], T[L2,...,M2−1]) | ||
V=V+ MVB-SEARCH(S,R[M1+1,...,H1], T[M2+1,...,H2]) | ||
RETURN V | ||
END ELSE | ||
END WHILE | ||
V=MVB-SEARCH(S,R[L1,...,H], T[L2,...,M2−1]) | ||
V=V+ MVB-SEARCH(S,R[L,...,H1], T[M2+1,...,H2]) | ||
RETURN V | ||
END MVB-SEARCH | ||
-
- m1=(s+e)/2;
- if(T[m2]>R[m1]) s=m1+1;
- else if(T[m2]<R[m1]) e=m1−1;
- else {
- printf(“T[% d]=% d is in reference list\n”,m2,T[m2]);
- return 1+mv_bsearch(R,s1,m1−1,T,s2,m2−1)
- +mv_bsearch(R,m1+1,e1,T,m2+1,e2);
- }
}
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/364,134 US8266150B1 (en) | 2009-02-02 | 2009-02-02 | Scalable document signature search engine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/364,134 US8266150B1 (en) | 2009-02-02 | 2009-02-02 | Scalable document signature search engine |
Publications (1)
Publication Number | Publication Date |
---|---|
US8266150B1 true US8266150B1 (en) | 2012-09-11 |
Family
ID=46760763
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/364,134 Active 2030-05-14 US8266150B1 (en) | 2009-02-02 | 2009-02-02 | Scalable document signature search engine |
Country Status (1)
Country | Link |
---|---|
US (1) | US8266150B1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130238590A1 (en) * | 2012-03-12 | 2013-09-12 | Oracle International Corporation | System and method for supporting heterogeneous solutions and management with an enterprise crawl and search framework |
US20140082183A1 (en) * | 2012-09-14 | 2014-03-20 | Salesforce.Com, Inc. | Detection and handling of aggregated online content using characterizing signatures of content items |
US20140269281A1 (en) * | 2013-03-14 | 2014-09-18 | Cavium, Inc. | Apparatus and Method for Providing Sort Offload |
US9043908B1 (en) | 2013-04-18 | 2015-05-26 | Trend Micro Incorporated | Detection of encryption and compression applications |
US9237581B2 (en) | 2013-03-14 | 2016-01-12 | Cavium, Inc. | Apparatus and method for media access control scheduling with a sort hardware coprocessor |
US20160110292A1 (en) * | 2014-10-21 | 2016-04-21 | Samsung Electronics Co., Ltd. | Efficient key collision handling |
US9706564B2 (en) | 2013-03-14 | 2017-07-11 | Cavium, Inc. | Apparatus and method for media access control scheduling with a priority calculation hardware coprocessor |
WO2020201926A1 (en) * | 2019-03-31 | 2020-10-08 | Cortica Ltd. | Invariant signature generation and pattern detection |
US20210357373A1 (en) * | 2020-05-12 | 2021-11-18 | Couchbase, Inc. | Efficient indexing for querying arrays in databases |
US11227004B2 (en) | 2016-02-11 | 2022-01-18 | Ebay Inc. | Semantic category classification |
US11698921B2 (en) * | 2018-09-17 | 2023-07-11 | Ebay Inc. | Search system for providing search results using query understanding and semantic binary signatures |
US12038956B2 (en) * | 2013-09-20 | 2024-07-16 | Fulcrum Management Solutions Ltd. | System and method for thought object selection by custom filtering and computed diversification |
US20240354320A1 (en) * | 2013-09-20 | 2024-10-24 | Fulcrum Management Solutions Ltd. | System and method for diverse thought object selection using llm and embedding models |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6493709B1 (en) | 1998-07-31 | 2002-12-10 | The Regents Of The University Of California | Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment |
US20030182310A1 (en) * | 2002-02-04 | 2003-09-25 | Elizabeth Charnock | Method and apparatus for sociological data mining |
US20040199491A1 (en) * | 2003-04-04 | 2004-10-07 | Nikhil Bhatt | Domain specific search engine |
US7031972B2 (en) | 2003-07-21 | 2006-04-18 | Innopath Software, Inc. | Algorithms for block-level code alignment of software binary files |
US20060253439A1 (en) | 2005-05-09 | 2006-11-09 | Liwei Ren | Matching engine for querying relevant documents |
US7516130B2 (en) | 2005-05-09 | 2009-04-07 | Trend Micro, Inc. | Matching engine with signature generation |
-
2009
- 2009-02-02 US US12/364,134 patent/US8266150B1/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6493709B1 (en) | 1998-07-31 | 2002-12-10 | The Regents Of The University Of California | Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment |
US20030182310A1 (en) * | 2002-02-04 | 2003-09-25 | Elizabeth Charnock | Method and apparatus for sociological data mining |
US20040199491A1 (en) * | 2003-04-04 | 2004-10-07 | Nikhil Bhatt | Domain specific search engine |
US7031972B2 (en) | 2003-07-21 | 2006-04-18 | Innopath Software, Inc. | Algorithms for block-level code alignment of software binary files |
US20060253439A1 (en) | 2005-05-09 | 2006-11-09 | Liwei Ren | Matching engine for querying relevant documents |
US7516130B2 (en) | 2005-05-09 | 2009-04-07 | Trend Micro, Inc. | Matching engine with signature generation |
Non-Patent Citations (10)
Title |
---|
Anagnostopoulos, A. et al., "Sampling Search-Engine Results," Proceedings of the 14th International Conference on World Wide Web, WWW 2005, May 10-14, 2005, pp. 245-256, Chiba, Japan. |
Chakrabarti, et al. "Scalable feature selection, classification and signature generation for organizing large text databases into hierachical taxonomies", 1998, pp. 163-178, vol. 7, No. 3, VLDB Journal. |
Chen, L., et al., "Template Detection for Large Scale Search Engines," SAC '06, Apr. 23-27, 2006, 5 pages, Dijon, France. |
Chen. J., et al., "Knowledge Discovery and Data Mlning Based on Power Plant Real-Time Database: A Survey," Proceedings of International Conference on Power Engineering, Oct. 8-12, 2001, pp. 1-5, Xi'an, China. |
Hamilton. N., "The Mechanics of a Deep Net Metasearch Engine," Proceedings of the 12th International World Wide Web Conference, 2003, 2 pages. |
Jessop, M., et al., Pattern Matching Against Distributed Datasets, 2004, 6 pages. |
Lai, W.C., et al., "An Anatomy of a Large-Scale Image Search Engine," IEEE MSE, Dec. 2002, 4 pages, Irvine. |
Lavrenko, V., et al., "Relevance Models for Topic Detection and Tracking," 2002, 6 pages. |
Muhammad Sharif, et al. "Multiple Values Search Algorithm" 2007-Spring 2008, pp. 49-58, vol. 1, No. 2, Journal of Information & Communication Technology. |
Pallickara, S. et al., "Incorporating an XML Matching Engine in Distributed Brokering Systems," Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, (PDPTA '03) 2003, pp. 1-7. |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9405780B2 (en) | 2012-03-12 | 2016-08-02 | Oracle International Corporation | System and method for providing a global universal search box for the use with an enterprise crawl and search framework |
US9361330B2 (en) | 2012-03-12 | 2016-06-07 | Oracle International Corporation | System and method for consistent embedded search across enterprise applications with an enterprise crawl and search framework |
US9524308B2 (en) | 2012-03-12 | 2016-12-20 | Oracle International Corporation | System and method for providing pluggable security in an enterprise crawl and search framework environment |
US9098540B2 (en) | 2012-03-12 | 2015-08-04 | Oracle International Corporation | System and method for providing a governance model for use with an enterprise crawl and search framework environment |
US9189507B2 (en) | 2012-03-12 | 2015-11-17 | Oracle International Corporation | System and method for supporting agile development in an enterprise crawl and search framework environment |
US20130238590A1 (en) * | 2012-03-12 | 2013-09-12 | Oracle International Corporation | System and method for supporting heterogeneous solutions and management with an enterprise crawl and search framework |
US9286337B2 (en) * | 2012-03-12 | 2016-03-15 | Oracle International Corporation | System and method for supporting heterogeneous solutions and management with an enterprise crawl and search framework |
US20140082183A1 (en) * | 2012-09-14 | 2014-03-20 | Salesforce.Com, Inc. | Detection and handling of aggregated online content using characterizing signatures of content items |
US9237581B2 (en) | 2013-03-14 | 2016-01-12 | Cavium, Inc. | Apparatus and method for media access control scheduling with a sort hardware coprocessor |
US9706564B2 (en) | 2013-03-14 | 2017-07-11 | Cavium, Inc. | Apparatus and method for media access control scheduling with a priority calculation hardware coprocessor |
US20140269281A1 (en) * | 2013-03-14 | 2014-09-18 | Cavium, Inc. | Apparatus and Method for Providing Sort Offload |
US9043908B1 (en) | 2013-04-18 | 2015-05-26 | Trend Micro Incorporated | Detection of encryption and compression applications |
US20240354320A1 (en) * | 2013-09-20 | 2024-10-24 | Fulcrum Management Solutions Ltd. | System and method for diverse thought object selection using llm and embedding models |
US12038956B2 (en) * | 2013-09-20 | 2024-07-16 | Fulcrum Management Solutions Ltd. | System and method for thought object selection by custom filtering and computed diversification |
US20160110292A1 (en) * | 2014-10-21 | 2016-04-21 | Samsung Electronics Co., Ltd. | Efficient key collision handling |
US9846642B2 (en) * | 2014-10-21 | 2017-12-19 | Samsung Electronics Co., Ltd. | Efficient key collision handling |
US11227004B2 (en) | 2016-02-11 | 2022-01-18 | Ebay Inc. | Semantic category classification |
US11698921B2 (en) * | 2018-09-17 | 2023-07-11 | Ebay Inc. | Search system for providing search results using query understanding and semantic binary signatures |
WO2020201926A1 (en) * | 2019-03-31 | 2020-10-08 | Cortica Ltd. | Invariant signature generation and pattern detection |
US11416458B2 (en) * | 2020-05-12 | 2022-08-16 | Couchbase, Inc. | Efficient indexing for querying arrays in databases |
US20210357373A1 (en) * | 2020-05-12 | 2021-11-18 | Couchbase, Inc. | Efficient indexing for querying arrays in databases |
US12079181B2 (en) | 2020-05-12 | 2024-09-03 | Couchbase, Inc. | Efficient indexing for querying arrays in databases |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8266150B1 (en) | Scalable document signature search engine | |
US7860853B2 (en) | Document matching engine using asymmetric signature generation | |
US7747642B2 (en) | Matching engine for querying relevant documents | |
US7516130B2 (en) | Matching engine with signature generation | |
JP3581652B2 (en) | Data retrieval system and method and its use in search engines | |
US6240409B1 (en) | Method and apparatus for detecting and summarizing document similarity within large document sets | |
JP3566111B2 (en) | Symbol dictionary creation method and symbol dictionary search method | |
JP5128101B2 (en) | Method, apparatus and system for supporting indexing and searching taxonomy with large full-text index | |
WO2000007094A9 (en) | Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment | |
Gog et al. | Large-scale pattern search using reduced-space on-disk suffix arrays | |
US20120124060A1 (en) | Method and system of identifying adjacency data, method and system of generating a dataset for mapping adjacency data, and an adjacency data set | |
JP5072832B2 (en) | Signature generation and matching engine with relevance | |
KR100818742B1 (en) | Document retrieval method using relevance of index word's location information in document | |
Cortez et al. | A flexible approach for extracting metadata from bibliographic citations | |
KR100659370B1 (en) | Method for Forming Document DV by Information Thesaurus Matching and Information Retrieval Method | |
Malki | Comprehensive study and comparison of information retrieval indexing techniques | |
Baeza-Yates et al. | Modeling text databases | |
Kim et al. | Efficient processing of substring match queries with inverted variable-length gram indexes | |
Shah | Review of indexing techniques applied in information retrieval | |
Guo et al. | Text Matching and Categorization: Mining Implicit Semantic Knowledge from Tree‐Shape Structures | |
Fan et al. | Mining collective pair data from the web | |
MAHMOUD et al. | A METHOD FOR ENHANCING WEB SEARCH RESULTS USING CONTEXT-BASED INDEXING | |
Tsay et al. | A scalable approach for Chinese term extraction | |
Bhimireddy et al. | A survey to fix the threshold and implementation for detecting duplicate web documents | |
Soyemi et al. | Database Record Duplicate Detection System using Simil Algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: TREND MICRO INCORPORATED, JAPAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIN, YINGQIANG;REN, LIWEI;TAN, DEHUA;REEL/FRAME:022586/0371 Effective date: 20090130 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 12 |