US8650195B2 - Region based information retrieval system - Google Patents
Region based information retrieval system
- Publication number
- US8650195B2 (application US13/072,647)
- Authority
- US
- United States
- Prior art keywords
- region
- sets
- regions
- documents
- region set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
- G06F17/30675
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
Definitions
- This invention relates to information retrieval systems, including search engines and organization of large document collections.
- Information retrieval systems are used to find relevant information from large data sets. Universities and public libraries use information retrieval systems to provide access to books, journals and other documents, whereas large enterprises use information retrieval systems to provide internal access to their large collections of internal documents. Web search engines (e.g. Google) are the most visible information retrieval systems.
- A typical implementation of an information retrieval system might include 1) a document collection subsystem, 2) an indexing subsystem, and 3) a searching and ranking subsystem.
- A typical document collection subsystem (e.g. a web crawler) takes a list of document references (e.g. URLs) and retrieves documents from the locations indicated in these document references. After retrieval, the documents along with the corresponding document references are passed on to the indexing subsystem.
- The documents are also parsed, and any document references found within the documents are extracted. These document references are then added to the lists which the document collection subsystem uses for retrieving further documents.
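- As a rough illustration of the collection loop described above, the following Python sketch retrieves each document reference once and queues any references found in the retrieved documents. The `collect` function, the injected `fetch` callable, the `max_docs` limit, and the in-memory `repository` are hypothetical names chosen for this example; a real document collection subsystem would retrieve over a network and record the retrieval status.

```python
import re
from collections import deque

def collect(seed_refs, fetch, max_docs=100):
    """Fetch each reference once and queue references found in fetched documents."""
    queue = deque(seed_refs)
    seen = set(seed_refs)
    collected = {}
    while queue and len(collected) < max_docs:
        ref = queue.popleft()
        text = fetch(ref)
        if text is None:
            continue  # retrieval failed; nothing to pass on for this reference
        collected[ref] = text  # hand-off point to the indexing subsystem
        # Extract document references (here: anything that looks like a URL).
        for found in re.findall(r"https?://\S+", text):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return collected

# Hypothetical in-memory stand-in for remote document repositories.
repository = {
    "http://example.com/a": "see http://example.com/b and http://example.com/missing",
    "http://example.com/b": "back to http://example.com/a",
}
collected = collect(["http://example.com/a"], repository.get)
print(list(collected))
```

- The `seen` set keeps the loop from retrieving the same reference twice, even when documents link back to each other.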
- A typical indexing subsystem takes the documents with their corresponding document references and uses them to create and update a searchable index, in which all the associations between the documents and individual words and other data are stored in such a way as to enable efficient lookups.
- The documents or the words are often ranked in order to determine which documents are the most relevant to a given word.
- A typical searching and ranking subsystem uses the search information (e.g. keywords typed into a Google search box) to look up matches in the searchable index, and retrieves and ranks the resulting set of document references. Sometimes the actual documents or extracts of the documents are also part of the results.
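- A minimal sketch of such a searchable index, assuming a plain inverted index from lowercase words to document references, might look as follows. The function names and sample documents are hypothetical, and ranking is omitted for brevity.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document references containing it."""
    index = defaultdict(set)
    for ref, text in documents.items():
        for word in text.lower().split():
            index[word].add(ref)
    return index

def search(index, query):
    """Return the references of documents containing every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)  # sorted only to make the output stable

docs = {
    "http://example.com/a": "region based retrieval",
    "http://example.com/b": "document retrieval system",
}
index = build_index(docs)
print(search(index, "retrieval"))
print(search(index, "region retrieval"))
```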
- A region based information retrieval system improves on conventional information retrieval systems by breaking down documents into one or more regions and processing the additional information available at a region level of analysis.
- When looking at regions, it becomes possible to quickly distinguish between groups of related documents, quickly ignore or focus on certain information, track recent evolutions of documents, and understand the historical relationships, heritage, and versions of these documents. This is all possible whether or not the document publishers specify where the content originally came from.
- FIG. 1 is a flowchart illustrating an example of an implementation adding documents to a region based information retrieval system.
- FIG. 2 is a flowchart illustrating an example of an implementation adding a region (identified above) to a region based information retrieval system.
- FIG. 3 is a flowchart illustrating an example of an implementation retrieving information from a region based information retrieval system.
- FIG. 4 is a system diagram of an example of an implementation of a region based information retrieval system.
- The portion of a document that is considered to be a region may be determined by separations found in the document structure, encoding or formatting, and/or by identifying regions that have duplicates or near-duplicates within the same document or in other documents. Note that regions can fully or partially contain other regions, even in the same documents, and that each complete document can itself be considered a region. Examples of regions in a text document include a phrase, a sentence, a paragraph, or a set of paragraphs. Examples of regions in a video include a segment, a dialogue, a scene, or any other part of the video.
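- To make the idea of structure-based regions concrete, the sketch below splits a text document into paragraph regions and then into sentence regions nested inside a paragraph. The `Region` type, the use of empty lines as paragraph separators, and the naive sentence rule are assumptions made for this example only.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    doc_ref: str
    start: int  # offset of the first character of the region
    end: int    # offset one past the last character of the region
    text: str

def paragraph_regions(doc_ref, text):
    """One region per paragraph; paragraphs are separated by empty lines."""
    return [Region(doc_ref, m.start(), m.end(), m.group())
            for m in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text)]

def sentence_regions(paragraph):
    """Naive sentence regions (ending at '.', '!' or '?') inside a paragraph region."""
    return [Region(paragraph.doc_ref, paragraph.start + m.start(),
                   paragraph.start + m.end(), m.group())
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", paragraph.text)]

text = "Alpha one. Alpha two.\n\nBeta."
paragraphs = paragraph_regions("doc1", text)
sentences = sentence_regions(paragraphs[0])
whole_document = Region("doc1", 0, len(text), text)  # the complete document is also a region
print([p.text for p in paragraphs])
print([s.text for s in sentences])
```

- Because every region carries its offsets into the document, a sentence region can be recognized as contained in its paragraph region, which in turn is contained in the whole-document region.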
- Region sets are sets of duplicate or near-duplicate regions in multiple documents, as well as the sets of any unique regions.
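- The text does not prescribe how near-duplicates are detected. One common technique, used here purely as an illustrative stand-in, is to compare sets of overlapping word sequences (shingles) with the Jaccard similarity. The shingle size `k` and the `threshold` value are arbitrary assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences in the text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(text_a, text_b, threshold=0.3):
    """Treat two regions as near-duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(a), shingles(b)))
print(is_near_duplicate(a, b))
```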
- A data structure that describes a region set would contain representations of the content in the regions and an indication of the locations of all the related regions within the documents.
- A simple example of a region set includes a case where two paragraphs of specific text are present in three different documents. The two paragraphs would be considered a region in each of the documents, and the region set would contain some representation of the two paragraphs and references to where the regions exist in each of the three documents.
- Data structures describing documents, regions, and region sets can contain additional meta information about the related documents, regions, or region sets, which includes information such as content types, sizes, languages, URLs, timestamps, as well as information like how closely the related regions match.
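- The two-paragraph example above can be sketched as a small data structure. The class names, the field names, and the use of a SHA-256 hash of normalized text as the content representation are assumptions for illustration; the text only requires some representation of the content plus the locations of the regions.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class RegionLocation:
    doc_ref: str
    start: int  # character offset where the region begins in the document
    end: int    # character offset one past the end of the region

@dataclass
class RegionSet:
    content_key: str                               # representation of the shared content
    locations: list = field(default_factory=list)  # where the region occurs
    meta: dict = field(default_factory=dict)       # e.g. content type, language, timestamps

def make_content_key(text):
    """Hypothetical content representation: hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

shared = "First shared paragraph.\n\nSecond shared paragraph."
docs = {
    "doc1": shared + "\n\nOnly in doc1.",
    "doc2": "Intro.\n\n" + shared,
    "doc3": shared,
}
region_set = RegionSet(make_content_key(shared), meta={"content_type": "text/plain"})
for ref, text in docs.items():
    start = text.find(shared)
    if start != -1:
        region_set.locations.append(RegionLocation(ref, start, start + len(shared)))
print([(loc.doc_ref, loc.start) for loc in region_set.locations])
```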
- The relationship between region sets can be characterized as 1) enclosed overlap relationships, where the content of one region set (the subset region set) is a subset of the content of another region set (the superset region set), 2) non-enclosed overlap relationships, where some content exists in both region sets, while both region sets also have content which is not in the other region set, 3) other relationships, where other information, such as the presence of an explicit reference (e.g. URL to a document containing a region set), can be used to establish a relationship between region sets, and 4) no relationship, for region sets which have no direct relationship.
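- The four relationship types can be expressed as a small classification function. This is a sketch under the assumption that the content of a region set can be modeled as a set of content units (for example paragraph identifiers); the `Relationship` enum and the `explicit_reference` flag are hypothetical.

```python
from enum import Enum

class Relationship(Enum):
    ENCLOSED_OVERLAP = "enclosed overlap"
    NON_ENCLOSED_OVERLAP = "non-enclosed overlap"
    OTHER = "other"
    NONE = "none"

def classify(content_a, content_b, explicit_reference=False):
    """Classify the relationship between two region sets given their content units."""
    shared = content_a & content_b
    if shared and (content_a <= content_b or content_b <= content_a):
        return Relationship.ENCLOSED_OVERLAP      # one content is a subset of the other
    if shared:
        return Relationship.NON_ENCLOSED_OVERLAP  # both have content the other lacks
    if explicit_reference:
        return Relationship.OTHER                 # e.g. a URL pointing at the other region set
    return Relationship.NONE

print(classify({"p1", "p2"}, {"p1", "p2", "p3"}))
print(classify({"p1", "p2"}, {"p2", "p3"}))
```

- In this sketch two region sets with identical content are also reported as an enclosed overlap, since each is trivially a subset of the other.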
- Region set relationships are used to establish region set graphs, which are directed graphs of region sets based on the relationships between region sets.
- Any related meta information can be used to determine the strength and type of these relationships, as well as to help with any clustering of the graphs.
- A simple example of a region set graph includes a case where a region set (region set A) represents three paragraphs of specific text which are present in two different documents and where another region set (region set B) represents two of these paragraphs in five different documents (including the previously mentioned two documents).
- Region set B would be considered in an enclosed overlap relationship with region set A, and region set B would be a subset of region set A. This relationship would be represented in the region set graph.
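- The example of region sets A and B can be written out as a tiny directed graph. The dictionary layout and the choice to point each edge from the subset region set to the superset region set are assumptions; the text does not fix an edge direction.

```python
region_sets = {
    "A": {"content": {"p1", "p2", "p3"}, "documents": {"doc1", "doc2"}},
    "B": {"content": {"p1", "p2"},
          "documents": {"doc1", "doc2", "doc3", "doc4", "doc5"}},
}

def build_graph(region_sets):
    """Return directed edges (subset_id, superset_id) for every enclosed overlap."""
    edges = set()
    for a, set_a in region_sets.items():
        for b, set_b in region_sets.items():
            # '<' on sets tests for a proper subset.
            if a != b and set_a["content"] < set_b["content"]:
                edges.add((a, b))
    return edges

graph = build_graph(region_sets)
print(graph)
```

- Note that the subset region set B occurs in more documents than A: every document containing all three paragraphs also contains the two-paragraph region.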
- A region-based information retrieval system can create a searchable index of region sets for efficiently finding the most appropriate region sets which have regions related to particular search terms.
- Methodologies currently used to create a searchable index of documents can be used to create a searchable index of region sets.
- Region set graphs can be used to optimize the searchable index for efficient region set retrieval.
- When region based information retrieval systems receive a request for information, a searching and ranking subsystem first identifies any matching region sets using the searchable index, and then organizes the region sets into region set clusters of closely related region sets based on the relationships in the region set graphs. The importance of all or some of the region sets in each region set cluster can then be used to determine the importance of each region set cluster.
- Methodologies currently used to establish the importance of a single document in search results can be used to establish the importance of a region set.
- The aggregate information about all documents containing a region in a region set can help determine the importance of the related region set or any region set closely related in the region set graph.
- The searching and ranking subsystem creates a list of region set clusters. This list can be ranked according to the importance of the region set clusters or any other sorting criteria.
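- One way to picture the clustering and ranking step is to treat clusters as connected groups of matching region sets in the region set graph. The sketch below makes two assumptions that are not stated in the text: clusters are connected components, and a cluster is as important as its most important region set. All names and scores are hypothetical.

```python
def cluster_and_rank(matches, edges, importance):
    """Group matching region sets into clusters and rank the clusters.

    matches: region set ids returned by the index lookup
    edges: (id, id) relationships taken from the region set graph
    importance: dict mapping region set id to a numeric score
    """
    matches = set(matches)
    neighbors = {m: set() for m in matches}
    for a, b in edges:
        if a in matches and b in matches:
            neighbors[a].add(b)
            neighbors[b].add(a)
    clusters, seen = [], set()
    for start in sorted(matches):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:  # depth-first walk over related region sets
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(neighbors[node] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    # Assumption: a cluster's importance is that of its most important member.
    clusters.sort(key=lambda c: max(importance[r] for r in c), reverse=True)
    return clusters

ranked = cluster_and_rank(
    ["A", "B", "C", "D"],
    [("B", "A"), ("X", "C")],
    {"A": 2, "B": 5, "C": 9, "D": 1},
)
print(ranked)
```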
- This resulting list of region set clusters can be used in a number of different ways, which can be done separately or in combination.
- One use of the list of region set clusters is to display the list directly.
- The selection of which data or data samples to show can be based on many factors, including importance of a region set, position in the region set graphs, timestamps (newest, oldest), and other meta information by itself or in aggregate.
- Each entry in the list of region set clusters can optionally be expanded in order to show detailed information.
- This information can include a list of the related region sets and the related documents within a region set cluster. This information can be organized by relationship, time, importance, sub-search, etc.
- An alternative way to organize the search results is to construct a list of document references for documents which contain samples of each region set cluster.
- Additional clustering could be performed in order to organize region set clusters from the same documents or from the same origin into single clusters.
- Results can be combined with meta information and other known information about originality of regions or documents in order to more precisely determine answers about the lineage of information and to questions like “where did this specific content go?” and “where did this specific content come from?”.
- Any methodology currently used for displaying search results can be used alongside or instead of the search result display described here.
- FIG. 1 shows an example of an implementation adding documents to a region based information retrieval system.
- The document collection subsystem first retrieves a document 102 based on a document reference. Then, at 106, it updates the document database with the content of the document and the results of the document analysis 104, after which it updates the document reference database with the retrieval status and any other document references found in the document.
- The system finds a set of regions using separations in the document structure and adds these regions to the system (108, 110, and 112).
- The details of adding a region to the system can be handled by the indexing subsystem as explained using FIG. 2 below.
- The system finds another set of regions by comparing the content of the document to the content of the regions (114, 116, and 118) using the index of the region set database. This process will identify regions which have full or partial duplicates as well as full or partial near-duplicates in the existing region set database. As in the previous step, the details of adding a region to the system can be handled as described using FIG. 2 below.
- The system continues by getting either a new document reference from the document reference database, or a document reference for a document that should be revisited by the document collection subsystem 120.
- FIG. 2 shows an example of an implementation adding a region (identified above) to a region based information retrieval system.
- The region is first examined for any matches in the region set database 202.
- This information may already be available from the region identification step 114.
- If the region has a full duplicate or full near-duplicate in an existing region set 204, then the region is simply added to the region set 206.
- If the region has partial duplicate(s) or partial near-duplicate(s) in one or more existing region sets 208, then a new region set is created 214 using the region currently examined, as well as a list of full duplicate or full near-duplicate regions established by taking the matching portions of the regions in the otherwise partially matching region sets 212.
- If the region has neither a full duplicate, a full near-duplicate 204, a partial duplicate, nor a partial near-duplicate 208 in the existing region sets, then the region is unique, and a new region set is created with only this region in it 210.
- The region set graph database is updated to incorporate any new relationships with the other region sets 216.
- The final step is to update the searchable index of region sets database based on the information just added to the other databases 218.
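- The decision flow of FIG. 2 can be sketched in a simplified form. The sketch assumes exact matching only (no near-duplicates), models a region as a set of content units, and treats a partial match as the case where the new region is a proper subset of an existing region set's content. The function name, the dictionary layout, and the generated ids are hypothetical, and the index update of step 218 is left out.

```python
def add_region(region_sets, graph_edges, doc_ref, units):
    """Add one region found in doc_ref; return the id of the region set it belongs to.

    region_sets: dict id -> {"content": frozenset, "documents": set}
    graph_edges: set of (subset_id, superset_id) pairs
    """
    units = frozenset(units)
    # 202: look for matches among the existing region sets.
    full = [rid for rid, rs in region_sets.items() if rs["content"] == units]
    if full:
        # 204 -> 206: full duplicate, so the region joins the existing region set.
        region_sets[full[0]]["documents"].add(doc_ref)
        return full[0]
    supersets = [rid for rid, rs in region_sets.items() if units < rs["content"]]
    documents = {doc_ref}
    for rid in supersets:
        # 208 -> 212: the matching portions of partially matching regions
        # are full duplicates of the new region.
        documents |= region_sets[rid]["documents"]
    new_id = f"rs{len(region_sets) + 1}"
    # 214 (partial match) or 210 (unique region): create a new region set.
    region_sets[new_id] = {"content": units, "documents": documents}
    for rid in supersets:
        graph_edges.add((new_id, rid))  # 216: record the enclosed overlap
    return new_id

sets, edges = {}, set()
first = add_region(sets, edges, "doc1", {"p1", "p2", "p3"})
again = add_region(sets, edges, "doc2", {"p1", "p2", "p3"})
partial = add_region(sets, edges, "doc3", {"p1", "p2"})
unique = add_region(sets, edges, "doc4", {"p9"})
print(first, again, partial, unique, edges)
```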
- FIG. 3 shows an example of an implementation retrieving information from a region based information retrieval system.
- Search terms are looked up in the index of region sets 302 to find matching region sets.
- The resulting region sets are organized into region set clusters based on the relationships between the region sets as indicated by the region set graphs 304.
- The information in the region set clusters, the related region sets, the related region set graphs, and the related documents is used to rank the clusters according to expected relevance to the end user 306.
- The region sets within each region set cluster are then organized using meta information and the relationships in the region set graph 308 in order to be able to illustrate the relationships between the results.
- Information from the related documents (e.g. sample content, timestamp)
- region sets (e.g. number of document matches)
- region set graphs (e.g. 10 region sets with 35 documents appear older)
- FIG. 4 illustrates an example of an implementation of a region based information retrieval system.
- The document collection subsystem 402 uses the document references in the document reference database 404 to determine which documents to collect. It then collects the documents from document repositories 406 and stores the documents with all related meta information in the document database 408.
- The document reference database 404 is continuously updated based on any additional information in newly collected documents or their meta information.
- The document reference database 404 can be updated from external sources.
- The region finding, splitting and graphing subsystem 410 creates and maintains the region set database 412 and the region set graph database 414.
- Documents from the document database 408 are analyzed, regions are identified, the region set database 412 is updated with the new regions and their meta information, and the region set graph database 414 is updated with the relationships between the updated region sets and other region sets.
- The indexing subsystem 416 creates and maintains the searchable index of region sets database 418 based on the region set database 412 and the region set graph database 414.
- The searchable region index facilitates fast searching for content in the region sets.
- The searching and ranking subsystem 420 performs the actual searching for content using the searchable index of region sets database 418, and then combines the relevant information from the region set database 412 and the region set graph database 414 in order to present the end-users 422 with any requested search results.
- Another implementation could use sets of words, phrases, or non-contiguous regions as an alternative to contiguous regions described above.
- Another implementation could establish additional regions based on the relationship between regions and between regions and the respective documents. This includes additional regions which consist of one or more existing regions present in the same document.
- Another implementation could use time-information from documents, such as creation time, modification time, or time information in the document data, in order to establish a time-based order to the graph of region sets.
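- As a small sketch of such a time-based order, the code below sorts region sets by the earliest known timestamp among the documents that contain them. The function names and sample dates are hypothetical, and using the earliest document timestamp as the ordering key is only one possible choice.

```python
from datetime import date

def first_seen(documents, timestamps):
    """Earliest known timestamp among a region set's documents, or None."""
    known = [timestamps[d] for d in documents if d in timestamps]
    return min(known) if known else None

def order_by_time(region_sets, timestamps):
    """Sort region set ids from oldest to newest first appearance; undated ones go last."""
    def key(rid):
        seen = first_seen(region_sets[rid], timestamps)
        return (seen is None, seen or date.max, rid)
    return sorted(region_sets, key=key)

timestamps = {
    "doc1": date(2010, 3, 26),
    "doc2": date(2011, 3, 25),
    "doc3": date(2009, 1, 1),
}
region_sets = {
    "A": {"doc1", "doc2"},
    "B": {"doc2"},
    "C": {"doc3", "doc1"},
    "D": {"doc9"},  # no timestamp known for this document
}
ordered = order_by_time(region_sets, timestamps)
print(ordered)
```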
- Another implementation could extend region set clusters to include additional region sets with no query matches, as long as they have relevant relationships to matching region sets in the region set graph.
- Another implementation could extend region sets to include groups of region sets where either the same or related documents contain the regions in the region set. The documents could be related by, e.g., coming from the same source.
- Another implementation could interpret translations of data to indicate a duplicate or near-duplicate.
- Another implementation could interpret rewording, rephrasing, recompilation, or other identifiable transformations of data to indicate a duplicate or near-duplicate.
- Another implementation could apply to documents which are partially or fully binaries, pictures, videos, other digital media, or combinations thereof.
- An example could be where a picture embedded in one document is a near-duplicate of a second picture after cropping has been applied to the second picture.
- Another implementation could include uncompressing, decrypting, or otherwise unpacking documents into one or more documents before the processing is applied.
- Another implementation could send search keywords or keywords within retrieved documents to an alternative information retrieval system (e.g. Yahoo or Google) to obtain an additional list of document references to feed the document collection subsystem. This would enhance the completeness of the results.
- a subset region appears to be a summary
- a subset region appears to be a quote
- a URL reference appears to be a backlink
- a superset region appears to be an aggregated list.
- Another implementation could fully or partially flatten the graph of region sets based on the collection of documents to facilitate quick search results.
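- The text does not define what flattening means in detail. One possible reading, shown here only as an assumption, is to precompute for every region set all region sets reachable through the graph, so that a search does not have to traverse the graph at query time. The `flatten` function and the edge direction (subset to superset) are hypothetical.

```python
def flatten(edges):
    """Return a dict mapping each region set id to all ids reachable by following edges."""
    direct = {}
    for src, dst in edges:
        direct.setdefault(src, set()).add(dst)
        direct.setdefault(dst, set())
    closure = {}
    for node in direct:
        reached, stack = set(), list(direct[node])
        while stack:
            nxt = stack.pop()
            if nxt not in reached:
                reached.add(nxt)
                stack.extend(direct[nxt])
        closure[node] = reached
    return closure

# C is a subset of B, and B is a subset of A.
flat = flatten([("C", "B"), ("B", "A")])
print(flat)
```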
- Another implementation presents the search results using an interactive graph instead of a list.
- Each subsystem and database could be implemented in various ways, including using mobile devices, single CPUs, single servers, distributed server-farms, or computing clouds—such as the Amazon Elastic Compute Cloud (Amazon EC2) or the Google App Engine.
- Each database could be implemented in various ways, including using SQL database(s), NoSQL database(s), file system(s), distributed storage, RAID system(s), disk(s), tape, flash device(s).
- Another implementation could split up one or more of the databases into multiple databases or combine the databases into fewer databases.
- Another implementation could split up one or more of the subsystems into multiple subsystems or combine the subsystems into fewer subsystems.
- Another implementation of the region finding, splitting and graphing subsystem could organize the data less, and leave some processing (e.g. region set organization and graphing) to the searching and ranking subsystem, e.g. computing the relevant region set graphs for the region sets after the search is done.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (19)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/072,647 US8650195B2 (en) | 2010-03-26 | 2011-03-25 | Region based information retrieval system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US31827210P | 2010-03-26 | 2010-03-26 | |
US13/072,647 US8650195B2 (en) | 2010-03-26 | 2011-03-25 | Region based information retrieval system |
Publications (2)
Publication Number | Publication Date |
---|---|
US20110238664A1 (en) | 2011-09-29 |
US8650195B2 (en) | 2014-02-11 |
Family
ID=44657531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/072,647 (Expired - Fee Related) US8650195B2 (en) | Region based information retrieval system | 2010-03-26 | 2011-03-25 |
Country Status (1)
Country | Link |
---|---|
US (1) | US8650195B2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140366085A1 (en) * | 2008-06-10 | 2014-12-11 | Object Security Llc | Method and system for rapid accreditation/re-accreditation of agile it environments, for example service oriented architecture (soa) |
US10002256B2 (en) * | 2014-12-05 | 2018-06-19 | GeoLang Ltd. | Symbol string matching mechanism |
US10678866B1 (en) | 2016-09-30 | 2020-06-09 | Vasumathi Ranganathan | Rules driven content network for tracking, tracing, auditing and life cycle management of information artifacts |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103164435B (en) * | 2011-12-13 | 2016-03-09 | 北大方正集团有限公司 | A kind of acquisition method of network data and system |
US9922022B2 (en) * | 2016-02-01 | 2018-03-20 | Microsoft Technology Licensing, Llc. | Automatic template generation based on previous documents |
US10839149B2 (en) | 2016-02-01 | 2020-11-17 | Microsoft Technology Licensing, Llc. | Generating templates from user's past documents |
Citations (103)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5313616A (en) | 1990-09-18 | 1994-05-17 | 88Open Consortium, Ltd. | Method for analyzing calls of application program by inserting monitoring routines into the executable version and redirecting calls to the monitoring routines |
US5343527A (en) | 1993-10-27 | 1994-08-30 | International Business Machines Corporation | Hybrid encryption method and system for protecting reusable software components |
US5469354A (en) | 1989-06-14 | 1995-11-21 | Hitachi, Ltd. | Document data processing method and apparatus for document retrieval |
US5577249A (en) | 1992-07-31 | 1996-11-19 | International Business Machines Corporation | Method for finding a reference token sequence in an original token string within a database of token strings using appended non-contiguous substrings |
US5745900A (en) | 1996-08-09 | 1998-04-28 | Digital Equipment Corporation | Method for indexing duplicate database records using a full-record fingerprint |
US5765152A (en) | 1995-10-13 | 1998-06-09 | Trustees Of Dartmouth College | System and method for managing copyrighted electronic media |
US5774883A (en) | 1995-05-25 | 1998-06-30 | Andersen; Lloyd R. | Method for selecting a seller's most profitable financing program |
US5892900A (en) | 1996-08-30 | 1999-04-06 | Intertrust Technologies Corp. | Systems and methods for secure transaction management and electronic rights protection |
US5893095A (en) | 1996-03-29 | 1999-04-06 | Virage, Inc. | Similarity engine for content-based retrieval of images |
US5909677A (en) | 1996-06-18 | 1999-06-01 | Digital Equipment Corporation | Method for determining the resemblance of documents |
US5917912A (en) | 1995-02-13 | 1999-06-29 | Intertrust Technologies Corporation | System and methods for secure transaction management and electronic rights protection |
US5924090A (en) | 1997-05-01 | 1999-07-13 | Northern Light Technology Llc | Method and apparatus for searching a database of records |
US5953006A (en) | 1992-03-18 | 1999-09-14 | Lucent Technologies Inc. | Methods and apparatus for detecting and displaying similarities in large data sets |
US5958051A (en) | 1996-11-27 | 1999-09-28 | Sun Microsystems, Inc. | Implementing digital signatures for data streams and data archives |
US6029002A (en) | 1995-10-31 | 2000-02-22 | Peritus Software Services, Inc. | Method and apparatus for analyzing computer code using weakest precondition |
US6035402A (en) | 1996-12-20 | 2000-03-07 | Gte Cybertrust Solutions Incorporated | Virtual certificate authority |
US6072493A (en) | 1997-03-31 | 2000-06-06 | Bellsouth Corporation | System and method for associating services information with selected elements of an organization |
US6112203A (en) | 1998-04-09 | 2000-08-29 | Altavista Company | Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis |
US6119124A (en) | 1998-03-26 | 2000-09-12 | Digital Equipment Corporation | Method for clustering closely resembling data objects |
US6138113A (en) | 1998-08-10 | 2000-10-24 | Altavista Company | Method for identifying near duplicate pages in a hyperlinked database |
US6148401A (en) | 1997-02-05 | 2000-11-14 | At&T Corp. | System and method for providing assurance to a host that a piece of software possesses a particular property |
US6189146B1 (en) | 1998-03-18 | 2001-02-13 | Microsoft Corporation | System and method for software licensing |
US6226618B1 (en) | 1998-08-13 | 2001-05-01 | International Business Machines Corporation | Electronic content delivery system |
US6240409B1 (en) | 1998-07-31 | 2001-05-29 | The Regents Of The University Of California | Method and apparatus for detecting and summarizing document similarity within large document sets |
US6249769B1 (en) | 1998-11-02 | 2001-06-19 | International Business Machines Corporation | Method, system and program product for evaluating the business requirements of an enterprise for generating business solution deliverables |
US6260141B1 (en) | 1997-09-19 | 2001-07-10 | Hyo Joon Park | Software license control system based on independent software registration server |
US6263348B1 (en) | 1998-07-01 | 2001-07-17 | Serena Software International, Inc. | Method and apparatus for identifying the existence of differences between two files |
US6275223B1 (en) | 1998-07-08 | 2001-08-14 | Nortel Networks Limited | Interactive on line code inspection process and tool |
US6282698B1 (en) | 1998-02-09 | 2001-08-28 | Lucent Technologies Inc. | Detecting similarities in Java sources from bytecodes |
US6285999B1 (en) | 1997-01-10 | 2001-09-04 | The Board Of Trustees Of The Leland Stanford Junior University | Method for node ranking in a linked database |
US6330670B1 (en) | 1998-10-26 | 2001-12-11 | Microsoft Corporation | Digital rights management operating system |
WO2002027486A1 (en) | 2000-09-28 | 2002-04-04 | Curl Corporation | Methods and apparatus for generating unique identifiers for software components |
US6381698B1 (en) | 1997-05-21 | 2002-04-30 | At&T Corp | System and method for providing assurance to a host that a piece of software possesses a particular property |
US6397205B1 (en) | 1998-11-24 | 2002-05-28 | Duquesne University Of The Holy Ghost | Document categorization and evaluation via cross-entrophy |
US20020138764A1 (en) | 2001-02-01 | 2002-09-26 | Jacobs Bruce A. | System and method for an automatic license facility |
US20020138477A1 (en) | 2000-07-26 | 2002-09-26 | Keiser Richard G. | Configurable software system and user interface for automatically storing computer files |
US20020138441A1 (en) | 2001-03-21 | 2002-09-26 | Thomas Lopatic | Technique for license management and online software license enforcement |
US6480959B1 (en) | 1997-12-05 | 2002-11-12 | Jamama, Llc | Software system and associated methods for controlling the use of computer programs |
US6480834B1 (en) | 1999-11-17 | 2002-11-12 | Serena Software, Inc. | Method and apparatus for serving files from a mainframe to one or more clients |
US6493709B1 (en) | 1998-07-31 | 2002-12-10 | The Regents Of The University Of California | Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment |
US20020188608A1 (en) | 2001-06-12 | 2002-12-12 | Nelson Dean S. | Automated license dependency resolution and license generation |
US6546114B1 (en) | 1999-09-07 | 2003-04-08 | Microsoft Corporation | Technique for detecting a watermark in a marked image |
US6557105B1 (en) | 1999-04-14 | 2003-04-29 | Tut Systems, Inc. | Apparatus and method for cryptographic-based license management |
US6574348B1 (en) | 1999-09-07 | 2003-06-03 | Microsoft Corporation | Technique for watermarking an image and a resulting watermarked image |
US20030126456A1 (en) | 2001-11-14 | 2003-07-03 | Siemens Aktiengesellschaft | Method for licensing software |
US6615209B1 (en) | 2000-02-22 | 2003-09-02 | Google, Inc. | Detecting query-specific duplicate documents |
US20030164849A1 (en) * | 2002-03-01 | 2003-09-04 | Iparadigms, Llc | Systems and methods for facilitating the peer review process |
US6658423B1 (en) | 2001-01-24 | 2003-12-02 | Google, Inc. | Detecting duplicate and near-duplicate files |
US6658626B1 (en) | 1998-07-31 | 2003-12-02 | The Regents Of The University Of California | User interface for displaying document comparison information |
US6735490B2 (en) | 2001-10-12 | 2004-05-11 | General Electric Company | Method and system for automated integration of design analysis subprocesses |
US20040162827A1 (en) | 2003-02-19 | 2004-08-19 | Nahava Inc. | Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently |
US20050015343A1 (en) | 2002-09-11 | 2005-01-20 | Norihiro Nagai | License management device, license management method, and computer program |
US6862696B1 (en) | 2000-05-03 | 2005-03-01 | Cigital | System and method for software certification |
US20050060643A1 (en) | 2003-08-25 | 2005-03-17 | Miavia, Inc. | Document similarity detection and classification system |
US20050125358A1 (en) | 2003-12-04 | 2005-06-09 | Black Duck Software, Inc. | Authenticating licenses for legally-protectable content based on license profiles and content identifiers |
US20050165718A1 (en) * | 2004-01-26 | 2005-07-28 | Fontoura Marcus F. | Pipelined architecture for global analysis and index building |
US6928419B2 (en) | 1994-11-23 | 2005-08-09 | Contentguard Holdings, Inc. | Method and apparatus for repackaging portions of digital works as new digital works |
US6954747B1 (en) | 2000-11-14 | 2005-10-11 | Microsoft Corporation | Methods for comparing versions of a program |
US6976170B1 (en) | 2001-10-15 | 2005-12-13 | Kelly Adam V | Method for detecting plagiarism |
US6983371B1 (en) | 1998-10-22 | 2006-01-03 | International Business Machines Corporation | Super-distribution of protected digital content |
US20060015465A1 (en) | 2004-07-13 | 2006-01-19 | Hiroshi Kume | Apparatus, method and program for license information ascertainment |
US20060020571A1 (en) * | 2004-07-26 | 2006-01-26 | Patterson Anna L | Phrase-based generation of document descriptions |
US20060122983A1 (en) | 2004-12-03 | 2006-06-08 | King Martin T | Locating electronic instances of documents based on rendered instances, document fragment digest generation, and digest based document fragment determination |
US7062468B2 (en) | 2000-04-28 | 2006-06-13 | Hillegass James C | Licensed digital material distribution system and method |
US20060155975A1 (en) | 2002-11-22 | 2006-07-13 | Koninklijke Philips Electronics N.V. | Method and apparatus for processing conditonal branch instructions |
US7085996B2 (en) | 2001-10-18 | 2006-08-01 | International Business Corporation | Apparatus and method for source compression and comparison |
US7197156B1 (en) | 1998-09-25 | 2007-03-27 | Digimarc Corporation | Method and apparatus for embedding auxiliary information within original data |
US7228427B2 (en) | 2000-06-16 | 2007-06-05 | Entriq Inc. | Method and system to securely distribute content via a network |
US20070157311A1 (en) | 2005-12-29 | 2007-07-05 | Microsoft Corporation | Security modeling and the application life cycle |
US20070162890A1 (en) | 2005-12-29 | 2007-07-12 | Microsoft Corporation | Security engineering and the application life cycle |
US20070174296A1 (en) * | 2006-01-17 | 2007-07-26 | Andrew Gibbs | Method and system for distributing a database and computer program within a network |
US7254587B2 (en) | 2004-01-12 | 2007-08-07 | International Business Machines Corporation | Method and apparatus for determining relative relevance between portions of large electronic documents |
US20070244915A1 (en) * | 2006-04-13 | 2007-10-18 | Lg Electronics Inc. | System and method for clustering documents |
US20070299825A1 (en) | 2004-09-20 | 2007-12-27 | Koders, Inc. | Source Code Search Engine |
US20080040316A1 (en) | 2004-03-31 | 2008-02-14 | Lawrence Stephen R | Systems and methods for analyzing boilerplate |
US20080044016A1 (en) | 2006-08-04 | 2008-02-21 | Henzinger Monika H | Detecting duplicate and near-duplicate files |
US7343297B2 (en) | 2001-06-15 | 2008-03-11 | Microsoft Corporation | System and related methods for managing and enforcing software licenses |
US7346839B2 (en) | 2003-09-30 | 2008-03-18 | Google Inc. | Information retrieval based on historical data |
US7346621B2 (en) | 2004-05-14 | 2008-03-18 | Microsoft Corporation | Method and system for ranking objects based on intra-type and inter-type relationships |
US7383269B2 (en) | 2003-09-12 | 2008-06-03 | Accenture Global Services Gmbh | Navigating a software project repository |
US20080222142A1 (en) * | 2007-03-08 | 2008-09-11 | Utopio, Inc. | Context based data searching |
US7483860B2 (en) | 2002-03-08 | 2009-01-27 | Pace Anti-Piracy | Method and system for managing software licenses |
US7490319B2 (en) | 2003-11-04 | 2009-02-10 | Kimberly-Clark Worldwide, Inc. | Testing tool comprising an automated multidimensional traceability matrix for implementing and validating complex software systems |
US20090043767A1 (en) | 2007-08-07 | 2009-02-12 | Ashutosh Joshi | Approach For Application-Specific Duplicate Detection |
US7503035B2 (en) | 2003-11-25 | 2009-03-10 | Software Analysis And Forensic Engineering Corp. | Software tool for detecting plagiarism in computer source code |
US7552093B2 (en) | 2003-12-04 | 2009-06-23 | Black Duck Software, Inc. | Resolving license dependencies for aggregations of legally-protectable content |
US20090171958A1 (en) | 2002-08-14 | 2009-07-02 | Anderson Iv Robert | Computer-Based System and Method for Generating, Classifying, Searching, and Analyzing Standardized Text Templates and Deviations from Standardized Text Templates |
US7568109B2 (en) | 2003-09-11 | 2009-07-28 | Ipx, Inc. | System for software source code comparison |
US7627613B1 (en) | 2003-07-03 | 2009-12-01 | Google Inc. | Duplicate document detection in a web crawler system |
US7676465B2 (en) | 2006-07-05 | 2010-03-09 | Yahoo! Inc. | Techniques for clustering structurally similar web pages based on page features |
US7680773B1 (en) * | 2005-03-31 | 2010-03-16 | Google Inc. | System for automatically managing duplicate documents when crawling dynamic documents |
US7681045B2 (en) | 2006-10-12 | 2010-03-16 | Black Duck Software, Inc. | Software algorithm identification |
US7707157B1 (en) | 2004-03-25 | 2010-04-27 | Google Inc. | Document near-duplicate detection |
US7707433B2 (en) | 1998-05-14 | 2010-04-27 | Purdue Research Foundation | Method and system for secure computational outsourcing and disguise |
US7716216B1 (en) | 2004-03-31 | 2010-05-11 | Google Inc. | Document ranking based on semantic distance between terms in a document |
US7734627B1 (en) | 2003-06-17 | 2010-06-08 | Google Inc. | Document similarity detection |
US7757097B2 (en) | 1999-09-03 | 2010-07-13 | Purdue Research Foundation | Method and system for tamperproofing software |
US7783976B2 (en) | 2005-10-24 | 2010-08-24 | Fujitsu Limited | Method and apparatus for comparing documents, and computer product |
US7797245B2 (en) | 2005-03-18 | 2010-09-14 | Black Duck Software, Inc. | Methods and systems for identifying an area of interest in protectable content |
US7900042B2 (en) | 2001-06-26 | 2011-03-01 | Ncipher Corporation Limited | Encrypted packet inspection |
US8001462B1 (en) * | 2009-01-30 | 2011-08-16 | Google Inc. | Updating search engine document index based on calculated age of changed portions in a document |
US8010538B2 (en) | 2006-05-08 | 2011-08-30 | Black Duck Software, Inc. | Methods and systems for reporting regions of interest in content files |
US8010803B2 (en) | 2006-10-12 | 2011-08-30 | Black Duck Software, Inc. | Methods and apparatus for automated export compliance |
2011-03-25: US application US13/072,647 (US8650195B2), status: Expired - Fee Related
Patent Citations (108)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5469354A (en) | 1989-06-14 | 1995-11-21 | Hitachi, Ltd. | Document data processing method and apparatus for document retrieval |
US5313616A (en) | 1990-09-18 | 1994-05-17 | 88Open Consortium, Ltd. | Method for analyzing calls of application program by inserting monitoring routines into the executable version and redirecting calls to the monitoring routines |
US5953006A (en) | 1992-03-18 | 1999-09-14 | Lucent Technologies Inc. | Methods and apparatus for detecting and displaying similarities in large data sets |
US5577249A (en) | 1992-07-31 | 1996-11-19 | International Business Machines Corporation | Method for finding a reference token sequence in an original token string within a database of token strings using appended non-contiguous substrings |
US5343527A (en) | 1993-10-27 | 1994-08-30 | International Business Machines Corporation | Hybrid encryption method and system for protecting reusable software components |
US6928419B2 (en) | 1994-11-23 | 2005-08-09 | Contentguard Holdings, Inc. | Method and apparatus for repackaging portions of digital works as new digital works |
US5917912A (en) | 1995-02-13 | 1999-06-29 | Intertrust Technologies Corporation | System and methods for secure transaction management and electronic rights protection |
US5774883A (en) | 1995-05-25 | 1998-06-30 | Andersen; Lloyd R. | Method for selecting a seller's most profitable financing program |
US5765152A (en) | 1995-10-13 | 1998-06-09 | Trustees Of Dartmouth College | System and method for managing copyrighted electronic media |
US6029002A (en) | 1995-10-31 | 2000-02-22 | Peritus Software Services, Inc. | Method and apparatus for analyzing computer code using weakest precondition |
US5893095A (en) | 1996-03-29 | 1999-04-06 | Virage, Inc. | Similarity engine for content-based retrieval of images |
US5909677A (en) | 1996-06-18 | 1999-06-01 | Digital Equipment Corporation | Method for determining the resemblance of documents |
US6230155B1 (en) | 1996-06-18 | 2001-05-08 | Altavista Company | Method for determining the resemblance of documents |
US5745900A (en) | 1996-08-09 | 1998-04-28 | Digital Equipment Corporation | Method for indexing duplicate database records using a full-record fingerprint |
US5892900A (en) | 1996-08-30 | 1999-04-06 | Intertrust Technologies Corp. | Systems and methods for secure transaction management and electronic rights protection |
US5958051A (en) | 1996-11-27 | 1999-09-28 | Sun Microsystems, Inc. | Implementing digital signatures for data streams and data archives |
US6035402A (en) | 1996-12-20 | 2000-03-07 | Gte Cybertrust Solutions Incorporated | Virtual certificate authority |
US6285999B1 (en) | 1997-01-10 | 2001-09-04 | The Board Of Trustees Of The Leland Stanford Junior University | Method for node ranking in a linked database |
US6148401A (en) | 1997-02-05 | 2000-11-14 | At&T Corp. | System and method for providing assurance to a host that a piece of software possesses a particular property |
US6072493A (en) | 1997-03-31 | 2000-06-06 | Bellsouth Corporation | System and method for associating services information with selected elements of an organization |
US5924090A (en) | 1997-05-01 | 1999-07-13 | Northern Light Technology Llc | Method and apparatus for searching a database of records |
US6381698B1 (en) | 1997-05-21 | 2002-04-30 | At&T Corp | System and method for providing assurance to a host that a piece of software possesses a particular property |
US6260141B1 (en) | 1997-09-19 | 2001-07-10 | Hyo Joon Park | Software license control system based on independent software registration server |
US6480959B1 (en) | 1997-12-05 | 2002-11-12 | Jamama, Llc | Software system and associated methods for controlling the use of computer programs |
US6282698B1 (en) | 1998-02-09 | 2001-08-28 | Lucent Technologies Inc. | Detecting similarities in Java sources from bytecodes |
US6189146B1 (en) | 1998-03-18 | 2001-02-13 | Microsoft Corporation | System and method for software licensing |
US6119124A (en) | 1998-03-26 | 2000-09-12 | Digital Equipment Corporation | Method for clustering closely resembling data objects |
US6112203A (en) | 1998-04-09 | 2000-08-29 | Altavista Company | Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis |
US7707433B2 (en) | 1998-05-14 | 2010-04-27 | Purdue Research Foundation | Method and system for secure computational outsourcing and disguise |
US6393438B1 (en) | 1998-06-19 | 2002-05-21 | Serena Software International, Inc. | Method and apparatus for identifying the existence of differences between two files |
US6263348B1 (en) | 1998-07-01 | 2001-07-17 | Serena Software International, Inc. | Method and apparatus for identifying the existence of differences between two files |
US6275223B1 (en) | 1998-07-08 | 2001-08-14 | Nortel Networks Limited | Interactive on line code inspection process and tool |
US6493709B1 (en) | 1998-07-31 | 2002-12-10 | The Regents Of The University Of California | Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment |
US6240409B1 (en) | 1998-07-31 | 2001-05-29 | The Regents Of The University Of California | Method and apparatus for detecting and summarizing document similarity within large document sets |
US6658626B1 (en) | 1998-07-31 | 2003-12-02 | The Regents Of The University Of California | User interface for displaying document comparison information |
US6138113A (en) | 1998-08-10 | 2000-10-24 | Altavista Company | Method for identifying near duplicate pages in a hyperlinked database |
US6226618B1 (en) | 1998-08-13 | 2001-05-01 | International Business Machines Corporation | Electronic content delivery system |
US7197156B1 (en) | 1998-09-25 | 2007-03-27 | Digimarc Corporation | Method and apparatus for embedding auxiliary information within original data |
US6983371B1 (en) | 1998-10-22 | 2006-01-03 | International Business Machines Corporation | Super-distribution of protected digital content |
US6330670B1 (en) | 1998-10-26 | 2001-12-11 | Microsoft Corporation | Digital rights management operating system |
US6249769B1 (en) | 1998-11-02 | 2001-06-19 | International Business Machines Corporation | Method, system and program product for evaluating the business requirements of an enterprise for generating business solution deliverables |
US6397205B1 (en) | 1998-11-24 | 2002-05-28 | Duquesne University Of The Holy Ghost | Document categorization and evaluation via cross-entrophy |
US6557105B1 (en) | 1999-04-14 | 2003-04-29 | Tut Systems, Inc. | Apparatus and method for cryptographic-based license management |
US7757097B2 (en) | 1999-09-03 | 2010-07-13 | Purdue Research Foundation | Method and system for tamperproofing software |
US6546114B1 (en) | 1999-09-07 | 2003-04-08 | Microsoft Corporation | Technique for detecting a watermark in a marked image |
US6574348B1 (en) | 1999-09-07 | 2003-06-03 | Microsoft Corporation | Technique for watermarking an image and a resulting watermarked image |
US6480834B1 (en) | 1999-11-17 | 2002-11-12 | Serena Software, Inc. | Method and apparatus for serving files from a mainframe to one or more clients |
US6615209B1 (en) | 2000-02-22 | 2003-09-02 | Google, Inc. | Detecting query-specific duplicate documents |
US7779002B1 (en) | 2000-02-22 | 2010-08-17 | Google Inc. | Detecting query-specific duplicate documents |
US7062468B2 (en) | 2000-04-28 | 2006-06-13 | Hillegass James C | Licensed digital material distribution system and method |
US6862696B1 (en) | 2000-05-03 | 2005-03-01 | Cigital | System and method for software certification |
US7228427B2 (en) | 2000-06-16 | 2007-06-05 | Entriq Inc. | Method and system to securely distribute content via a network |
US20020138477A1 (en) | 2000-07-26 | 2002-09-26 | Keiser Richard G. | Configurable software system and user interface for automatically storing computer files |
WO2002027486A1 (en) | 2000-09-28 | 2002-04-04 | Curl Corporation | Methods and apparatus for generating unique identifiers for software components |
US6954747B1 (en) | 2000-11-14 | 2005-10-11 | Microsoft Corporation | Methods for comparing versions of a program |
US6658423B1 (en) | 2001-01-24 | 2003-12-02 | Google, Inc. | Detecting duplicate and near-duplicate files |
US20080162478A1 (en) | 2001-01-24 | 2008-07-03 | William Pugh | Detecting duplicate and near-duplicate files |
US20020138764A1 (en) | 2001-02-01 | 2002-09-26 | Jacobs Bruce A. | System and method for an automatic license facility |
US20020138441A1 (en) | 2001-03-21 | 2002-09-26 | Thomas Lopatic | Technique for license management and online software license enforcement |
US20020188608A1 (en) | 2001-06-12 | 2002-12-12 | Nelson Dean S. | Automated license dependency resolution and license generation |
US7343297B2 (en) | 2001-06-15 | 2008-03-11 | Microsoft Corporation | System and related methods for managing and enforcing software licenses |
US7900042B2 (en) | 2001-06-26 | 2011-03-01 | Ncipher Corporation Limited | Encrypted packet inspection |
US6735490B2 (en) | 2001-10-12 | 2004-05-11 | General Electric Company | Method and system for automated integration of design analysis subprocesses |
US6976170B1 (en) | 2001-10-15 | 2005-12-13 | Kelly Adam V | Method for detecting plagiarism |
US7085996B2 (en) | 2001-10-18 | 2006-08-01 | International Business Corporation | Apparatus and method for source compression and comparison |
US20030126456A1 (en) | 2001-11-14 | 2003-07-03 | Siemens Aktiengesellschaft | Method for licensing software |
US20030164849A1 (en) * | 2002-03-01 | 2003-09-04 | Iparadigms, Llc | Systems and methods for facilitating the peer review process |
US7483860B2 (en) | 2002-03-08 | 2009-01-27 | Pace Anti-Piracy | Method and system for managing software licenses |
US20090171958A1 (en) | 2002-08-14 | 2009-07-02 | Anderson Iv Robert | Computer-Based System and Method for Generating, Classifying, Searching, and Analyzing Standardized Text Templates and Deviations from Standardized Text Templates |
US20050015343A1 (en) | 2002-09-11 | 2005-01-20 | Norihiro Nagai | License management device, license management method, and computer program |
US20060155975A1 (en) | 2002-11-22 | 2006-07-13 | Koninklijke Philips Electronics N.V. | Method and apparatus for processing conditonal branch instructions |
US20040162827A1 (en) | 2003-02-19 | 2004-08-19 | Nahava Inc. | Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently |
US7734627B1 (en) | 2003-06-17 | 2010-06-08 | Google Inc. | Document similarity detection |
US7627613B1 (en) | 2003-07-03 | 2009-12-01 | Google Inc. | Duplicate document detection in a web crawler system |
US20050060643A1 (en) | 2003-08-25 | 2005-03-17 | Miavia, Inc. | Document similarity detection and classification system |
US7568109B2 (en) | 2003-09-11 | 2009-07-28 | Ipx, Inc. | System for software source code comparison |
US7383269B2 (en) | 2003-09-12 | 2008-06-03 | Accenture Global Services Gmbh | Navigating a software project repository |
US7346839B2 (en) | 2003-09-30 | 2008-03-18 | Google Inc. | Information retrieval based on historical data |
US7490319B2 (en) | 2003-11-04 | 2009-02-10 | Kimberly-Clark Worldwide, Inc. | Testing tool comprising an automated multidimensional traceability matrix for implementing and validating complex software systems |
US7503035B2 (en) | 2003-11-25 | 2009-03-10 | Software Analysis And Forensic Engineering Corp. | Software tool for detecting plagiarism in computer source code |
US7552093B2 (en) | 2003-12-04 | 2009-06-23 | Black Duck Software, Inc. | Resolving license dependencies for aggregations of legally-protectable content |
US20050125358A1 (en) | 2003-12-04 | 2005-06-09 | Black Duck Software, Inc. | Authenticating licenses for legally-protectable content based on license profiles and content identifiers |
US7254587B2 (en) | 2004-01-12 | 2007-08-07 | International Business Machines Corporation | Method and apparatus for determining relative relevance between portions of large electronic documents |
US20050165718A1 (en) * | 2004-01-26 | 2005-07-28 | Fontoura Marcus F. | Pipelined architecture for global analysis and index building |
US7707157B1 (en) | 2004-03-25 | 2010-04-27 | Google Inc. | Document near-duplicate detection |
US20080040316A1 (en) | 2004-03-31 | 2008-02-14 | Lawrence Stephen R | Systems and methods for analyzing boilerplate |
US7716216B1 (en) | 2004-03-31 | 2010-05-11 | Google Inc. | Document ranking based on semantic distance between terms in a document |
US7346621B2 (en) | 2004-05-14 | 2008-03-18 | Microsoft Corporation | Method and system for ranking objects based on intra-type and inter-type relationships |
US20060015465A1 (en) | 2004-07-13 | 2006-01-19 | Hiroshi Kume | Apparatus, method and program for license information ascertainment |
US20060020571A1 (en) * | 2004-07-26 | 2006-01-26 | Patterson Anna L | Phrase-based generation of document descriptions |
US20070299825A1 (en) | 2004-09-20 | 2007-12-27 | Koders, Inc. | Source Code Search Engine |
US20060122983A1 (en) | 2004-12-03 | 2006-06-08 | King Martin T | Locating electronic instances of documents based on rendered instances, document fragment digest generation, and digest based document fragment determination |
US7797245B2 (en) | 2005-03-18 | 2010-09-14 | Black Duck Software, Inc. | Methods and systems for identifying an area of interest in protectable content |
US7680773B1 (en) * | 2005-03-31 | 2010-03-16 | Google Inc. | System for automatically managing duplicate documents when crawling dynamic documents |
US20100174686A1 (en) * | 2005-03-31 | 2010-07-08 | Anurag Acharya | Generating Equivalence Classes and Rules for Associating Content with Document Identifiers |
US7783976B2 (en) | 2005-10-24 | 2010-08-24 | Fujitsu Limited | Method and apparatus for comparing documents, and computer product |
US20070157311A1 (en) | 2005-12-29 | 2007-07-05 | Microsoft Corporation | Security modeling and the application life cycle |
US20070162890A1 (en) | 2005-12-29 | 2007-07-12 | Microsoft Corporation | Security engineering and the application life cycle |
US20070174296A1 (en) * | 2006-01-17 | 2007-07-26 | Andrew Gibbs | Method and system for distributing a database and computer program within a network |
US20070244915A1 (en) * | 2006-04-13 | 2007-10-18 | Lg Electronics Inc. | System and method for clustering documents |
US8010538B2 (en) | 2006-05-08 | 2011-08-30 | Black Duck Software, Inc. | Methods and systems for reporting regions of interest in content files |
US7676465B2 (en) | 2006-07-05 | 2010-03-09 | Yahoo! Inc. | Techniques for clustering structurally similar web pages based on page features |
US20080044016A1 (en) | 2006-08-04 | 2008-02-21 | Henzinger Monika H | Detecting duplicate and near-duplicate files |
US7681045B2 (en) | 2006-10-12 | 2010-03-16 | Black Duck Software, Inc. | Software algorithm identification |
US8010803B2 (en) | 2006-10-12 | 2011-08-30 | Black Duck Software, Inc. | Methods and apparatus for automated export compliance |
US20080222142A1 (en) * | 2007-03-08 | 2008-09-11 | Utopio, Inc. | Context based data searching |
US20090043767A1 (en) | 2007-08-07 | 2009-02-12 | Ashutosh Joshi | Approach For Application-Specific Duplicate Detection |
US8001462B1 (en) * | 2009-01-30 | 2011-08-16 | Google Inc. | Updating search engine document index based on calculated age of changed portions in a document |
Non-Patent Citations (63)
Title |
---|
"PB Code Analyzer Feature List" [online], Ecocion, Inc., 2010, [retrieved on Jun. 11, 2012], retrieved from the Internet <URL: http://www.ecocion.com/sites/default/files/PB%20Code/20Analyzer%20Feature%20List.pdf>. |
"The Quest for an Open Source Genome" [online], Black Duck Software, Inc., 2007 [retrieved on Feb. 23, 2010], retrieved from the Internet <URL: http://www.wetzelconsultingllc.com/Open-Source-Genome.pdf>. |
Baeza-Yates et al., "A New Approach to Text Searching", Communications of the ACM 35, Oct. 1992, pp. 74-82. |
Bernstein et al., "Accurate discovery of co-derivative documents via duplicate text detection", Proceedings of the String Processing and Information Retrieval Symposium, Oct. 2004, Padua, Italy, pp. 55-67. |
Bernstein et al., "Redundant Documents and Search Effectiveness", CIKM'05, Proceedings of the 14th ACM international conference on Information and knowledge management, Oct. 31-Nov. 5, 2005, Bremen, Germany, pp. 736-743. |
Brin et al., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Seventh International World-Wide Web Conference (WWW 1998), Apr. 14-18, 1998, Brisbane, Australia. |
Broder et al., "Syntactic Clustering of the Web" [online], SRC (Digital Systems Research Center) Technical Note #1997-015, Jul. 1997 [retrieved on Jun. 9, 2009], retrieved from the Internet <URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.3239&rep=rep1&type=pdf>. |
Broder, "Algorithms for duplicate documents" [online], Presentation at Princeton University, Princeton, New Jersey, USA, Feb. 2005 [retrieved on Jan. 31, 2010], retrieved from the Internet <URL: http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf>. |
Broder, "On the resemblance and containment of documents", Compression and Complexity of Sequences '97, Sequences '97: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, Jun. 11-13, 1997, pp. 21-29. |
Burkhard et al., "Some Approaches to Best-Match File Searching", Communications of the ACM, Apr. 1973, vol. 16, No. 4, pp. 230-236. |
Chang et al., "Theoretical and Empirical Comparisons of Approximate String Matching Algorithms" [online], University of California, Berkeley EECS Technical Reports [retrieved on Oct. 10, 2010], retrieved from the Internet <URL: http://www.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-653.pdf>. |
Chen et al., "Efficient String Matching Algorithms for Combinatorial Universal Denoising", Proceedings of the 2005 Data Compression Conference (DCC'05), Snowbird, Utah, 2005, pp. 153-162. |
Clifford et al., "A Fast, Randomised, Maximal Subset Matching Algorithm for Document-Level Music Retrieval", SOFSEM '07 Proceedings of the 33rd conference on Current Trends in Theory and Practice of Computer Science, copyright 2006 University of Victoria. |
Clough, "Plagiarism in natural and programming languages: an overview of current tools and technologies" [online], Jul. 2000 [retrieved on Jun. 11, 2012], retrieved from the Internet <URL: http://ir.shef.ac.uk/cloughie/papers/plagiarism2000.pdf>. |
Cohen, "Recursive Hashing Functions for n-Grams", ACM Transactions on Information Systems, vol. 15, No. 3, Jul. 1997, pp. 291-320. |
Cole et al., "Verifying Candidate Matches in Sparse and Wildcard Matching" [online], STOC 02, May 19-21, 2002, Montreal, Quebec, Canada [retrieved on Jun. 11, 2012], retrieved from the Internet <URL: http://hariharan-ramesh.com/papers/dontcares.pdf>. |
Damashek, "Gauging Similarity with n-Grams: Language-Independent Categorization of Text", Science, New Series, vol. 267, No. 5199, Feb. 10, 1995, pp. 843-848. |
Eastlake et al., "US Secure Hash Algorithm 1 (SHA 1)" [online], The Internet Engineering Task Force (IETF), Network Working Group, Request for Comments: 3174, Sep. 2001 [retrieved on Jun. 11, 2012], retrieved from the Internet <URL: http://tools.ietf.org/pdf/rfc3760>, pp. 1-22. |
Farringdon, "Analysing for Authorship: A Guide to the Cusum Technique" [online], Introduction document originally on http://members.aol.com/qsums/QsumIntroduction.html [retrieved on Jun. 11, 2012], retrieved via Internet Archive from the Internet <URL: http://web.archive.org/web/20041013020613/http://members.aol.com/qsums/QsumIntroduction.html>. |
Ganguly et al., "A new randomized algorithm for Document Fingerprinting" [online], Indian Institute of Technology, Kanpur, UP, India [retrieved on Jan. 31, 2010], retrieved from the Internet <URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.9163&rep=rep1&type=pdf>. |
Gansner et al., "A Technique for Drawing Directed Graphs", IEEE Transactions on Software Engineering archive vol. 19 Issue 3, Mar. 1993, pp. 214-230. |
Hartig et al., "Publishing and Consuming Provenance Metadata on the Web of Linked Data", Proceedings of the 3rd International Provenance and Annotation Workshop (IPAW), Troy, New York, USA, Jun. 2010. |
Hartig, "Provenance Information in the Web of Data", LDOW2009, Apr. 20, 2009, Madrid, Spain. |
Hua et al., "Variable-Stride Multi-Pattern Matching for Scalable Deep Packet Inspection", Proceedings of IEEE INFOCOM 2009, pp. 415-423. |
Kamps et al., "Best-Match Querying from Document-Centric XML" [online], Seventh International Workshop on the Web and Databases (WebDB 2004), Jun. 17-18, 2004, Paris, France [retrieved on Jun. 11, 2012], retrieved from the Internet <URL: http://webdb2004.cs.columbia.edu/papers/4-3.pdf>. |
Karp et al., "Efficient randomized pattern-matching algorithms", IBM Journal of Research and Development vol. 31 Issue 2, Mar. 1987, pp. 249-260. |
Kleinberg, "An Impossibility Theorem for Clustering", Advances in Neural Information Processing Systems 15, Proceedings of the 2002 Conference, pp. 446-453, MIT Press, Cambridge, Massachusetts, USA. |
Kurland et al., "Respect My Authority! HITS Without Hyperlinks, Utilizing Cluster-Based Language Models", SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, Aug. 2006, Seattle, Washington, USA, pp. 83-90. |
Lopresti, "Models and Algorithms for Duplicate Document Detection", Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99), Bangalore, India, pp. 297-300. |
Manber, "Finding Similar Files in a Large File System", WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, California, USA, pp. 1-10. |
Manku et al., "Detecting Near-Duplicates for Web Crawling" [online], WWW 2007, May 8-12, 2007, Banff, Alberta, Canada [retrieved on Jun. 11, 2012], retrieved from the Internet <URL: http://www2007.org/papers/paper215.pdf>. |
Mens et al., "Software Evolution, Part I, Understanding and Analysing Software Evolution", Software Evolution, 2008, pp. 15-90, Springer. |
Meziane et al., "A Document Management Methodology Based on Similarity Contents", Elsevier, 2003, Salford, United Kingdom, Information Sciences vol. 158 (2004), pp. 15-36. |
Moussiades et al., "PDetect: a Clustering Approach for Detecting Plagiarism in Source Code Datasets", The Computer Journal vol. 48 No. 6, 2005, pp. 651-661. |
Mozgovoy et al., "Fast Plagiarism Detection System", Proceedings of the International Symposium on String Processing and Information Retrieval (SPIRE2005), Buenos Aires, Argentina, Nov. 2005 (Lecture Notes in Computer Science), pp. 267-270. |
Parapar et al., "Winnowing-Based Text Clustering", CIKM'08, Proceedings of the 17th ACM conference on Information and knowledge management, Oct. 26-30, 2008, Napa Valley, California, USA, pp. 1353-1354. |
Paul et al., "A Framework for Source Code Search using Program Patterns", IEEE Transactions on Software Engineering, vol. 20, No. 6, Jun. 1994, pp. 463-475. |
Pérez, "Provenance: From eScience to the Web of Data" [online], Presentation at Centre for Intelligent Information Technologies (CETINIA), Universidad Rey Juan Carlos, Madrid, Spain, Nov. 2009, [retrieved on Jun. 11, 2012], retrieved from the Internet <URL: http://www.cetinia.urjc.es/sites/default/files/userfiles/file/invited-lectures/Provenance-from%20eScience-to-Web-of-Data-17-11-09.pdf>. |
Prechelt et al., "Finding plagiarisms among a set of programs with JPlag", Resubmission to J. of Universal Computer Science, Nov. 28, 2001, pp. 1-23. |
Rabin, "Fingerprinting by Random Polynomials" [online], Center for Research in Computing Technology Harvard University Report TR-15-81, 1981 [retrieved on Oct. 10, 2010], retrieved from the Internet <URL: http://www.xmailserver.org/rabin.pdf>. |
Schleimer et al., "Winnowing: Local Algorithms for Document Fingerprinting", Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data 2003, pp. 76-85. |
Seo et al., "Local Text Reuse Detection", Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '08), pp. 571-578. |
Shivakumar et al., "Finding near-replicas of documents on the web", International Workshop on the Web and Databases (WebDB 1998), Mar. 27-28, 1998, Valencia, Spain. |
Shixia et al., "An LOD Model for Graph Visualization and its Application in Web Navigation", Proceedings of the 7th Asia-Pacific web conference on Web Technologies Research and Development (APWeb'05), pp. 441-452. |
Si et al., "Check: A Document Plagiarism Detection System", Symposium on Applied Computing, Proceedings of the 1997 ACM symposium on Applied computing, 1997, pp. 70-77. |
Sutinen et al., "On Using q-Gram Locations in Approximate String Matching", Proceedings of the Third Annual European Symposium on Algorithms (ESA '95), pp. 327-340, Springer-Verlag, London, United Kingdom. |
Whale, "Identification of Program Similarity in Large Populations", The Computer Journal, vol. 33, No. 2, 1990, pp. 140-146. |
Wise et al., "YAP3: Improved Detection of Similarities in Computer Program and Other Texts," SIGCSEB'96: SIGCSE Bulletin (ACM Special Interest Group on Computer Science Education), pp. 130-134. |
Zeng et al., "Learning to Cluster Web Search Results" [online], Presentation 2004 [retrieved on Sep. 6, 2009], retrieved from the Internet <URL: http://klpl.re.pusan.ac.kr/seminar/2004/winter/%EC%84%B8%EB%AF%B8%EP%82%98%5B1%5D.ppt>. |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140366085A1 (en) * | 2008-06-10 | 2014-12-11 | Object Security Llc | Method and system for rapid accreditation/re-accreditation of agile it environments, for example service oriented architecture (soa) |
US9729576B2 (en) * | 2008-06-10 | 2017-08-08 | Object Security Llc | Method and system for rapid accreditation/re-accreditation of agile IT environments, for example service oriented architecture (SOA) |
US10116704B2 (en) | 2008-06-10 | 2018-10-30 | Object Security Llc | Method and system for rapid accreditation/re-accreditation of agile IT environments, for example service oriented architecture (SOA) |
US10560486B2 (en) | 2008-06-10 | 2020-02-11 | Object Security Llc | Method and system for rapid accreditation/re-accreditation of agile it environments, for example service oriented architecture (SOA) |
US10002256B2 (en) * | 2014-12-05 | 2018-06-19 | GeoLang Ltd. | Symbol string matching mechanism |
US10657267B2 (en) | 2014-12-05 | 2020-05-19 | GeoLang Ltd. | Symbol string matching mechanism |
US10678866B1 (en) | 2016-09-30 | 2020-06-09 | Vasumathi Ranganathan | Rules driven content network for tracking, tracing, auditing and life cycle management of information artifacts |
Also Published As
Publication number | Publication date |
---|---|
US20110238664A1 (en) | 2011-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8244700B2 (en) | | Rapid update of index metadata |
US9348890B2 (en) | | System and method of search indexes using key-value attributes to searchable metadata |
US7890521B1 (en) | | Document-based synonym generation |
US8090708B1 (en) | | Searching indexed and non-indexed resources for content |
US8140482B2 (en) | | Using RSS archives |
US8606780B2 (en) | | Image re-rank based on image annotations |
US20120002884A1 (en) | | Method and apparatus for managing video content |
US20130191414A1 (en) | | Method and apparatus for performing a data search on multiple user devices |
US20200159783A1 (en) | | Method of and system for updating search index database |
US8650195B2 (en) | | Region based information retrieval system |
US9842158B2 (en) | | Clustering web pages on a search engine results page |
US20080059432A1 (en) | | System and method for database indexing, searching and data retrieval |
US20090210389A1 (en) | | System to support structured search over metadata on a web index |
JPWO2014050002A1 (en) | | Query similarity evaluation system, evaluation method, and program |
WO2012129152A2 (en) | | Annotating schema elements based associating data instances with knowledge base entities |
CN107870915B (en) | | Indication of search results |
US8527518B2 (en) | | Inverted indexes with multiple language support |
Barrio et al. | | Sampling strategies for information extraction over the deep web |
Koren et al. | | Searching and navigating petabyte-scale file systems based on facets |
US9773035B1 (en) | | System and method for an annotation search index |
US8875007B2 (en) | | Creating and modifying an image wiki page |
CN114402316A (en) | | System and method for federated search using dynamic selection and distributed correlations |
Rats et al. | | Using of cloud computing, clustering and document-oriented database for enterprise content management |
WO2011093691A2 (en) | | A semantic organization and retrieval system and methods thereof |
Sampada et al. | | Performance Analysis of Multidimensional Indexing in Keyword Search |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.) |
| | FEPP | Fee payment procedure | Free format text: SURCHARGE FOR LATE PAYMENT, MICRO ENTITY (ORIGINAL EVENT CODE: M3554) |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, MICRO ENTITY (ORIGINAL EVENT CODE: M3551) Year of fee payment: 4 |
| | FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: MICROENTITY |
| | LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: MICROENTITY |
| | STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20220211 |