US6910029B1 - System for weighted indexing of hierarchical documents - Google Patents
- Publication number
- US6910029B1
- Authority
- US
- United States
- Prior art keywords
- metadata
- weight
- file
- index
- elements
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99932—Access augmentation or optimizing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99943—Generating database or data structure, e.g. via user interface
Definitions
- the present invention relates generally to indexing Web-based hierarchical documents for search.
- HTML hypertext markup language
- XML extensible markup language
- a search engine typically has three main parts.
- the first part is a crawler, which accesses Web documents and gathers information about the documents.
- the information is summarized either by the document producer or by the crawler, with each summary being arranged in a hierarchy and being referred to as “metadata”.
- the metadata is “marked up” by means of tags, i.e., each item of information in the hierarchy is labelled by a corresponding tag, to identify the item of information.
- an index engine indexes the metadata.
- the index essentially is a catalogue of the metadata.
- a query executor portion of the search engine responds to a user query by accessing the indexed metadata and returning the names (also referred to as “uniform resource locators”, or URLs) of documents that satisfy the query.
- the metadata that a crawler creates includes not only data about document content, which is useful to a query executor during the search phase, but also includes internally useful information such as the name of the crawler, date of the crawl, and so on.
- the metadata summary is marked up with tags that identify the various elements in the summary.
- the tags are not necessarily useful to the query executor, but rather, in the context of the query phase, constitute noise.
- the present invention understands that, depending on the document type, some information as identified by the tags happens to be more useful in the context of the query phase than other information.
- current indexing engines do not separate tags from the data identified by the tags, nor do they provide a means for weighting relatively important information more highly than less important information, nor do they provide a means for eliminating completely useless (from a query execution standpoint) information from the index.
- current indexing engines do not optimize the subsequent performance of query executors. The present invention recognizes the above-noted problems and provides the solutions disclosed herein.
- a general purpose computer is programmed according to the inventive steps herein to weight metadata elements for indexing hierarchical documents.
- the invention can also be embodied as an article of manufacture—a machine component—that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to execute the present logic.
- This invention is realized in a critical machine component that causes a digital processing apparatus to perform the inventive method steps herein.
- the invention can be implemented by a computer system including a general purpose computer and logic executable by the computer for undertaking method acts.
- These method acts include receiving metadata representing at least one document that is accessible via a wide area computer network.
- the metadata includes plural elements, and the method embodied by the logic includes weighting at least some elements in accordance with a weighting scheme to render weighted metadata.
- the logic provides the weighted metadata to an index engine.
- the system can include the index engine, which generates an index based on the weighted metadata, and a crawler that generates the metadata. Also, a query executor can be included to access the index to execute queries for documents.
- At least one specification file preferably is generated for at least one respective metadata document class defining a metadata hierarchy.
- the specification file defines a specification hierarchy that matches the metadata hierarchy. More preferably, plural specification files are generated for respective plural classes.
- the specification file includes at least one higher element having an associated higher weight and at least one lower element.
- the lower element is hierarchically lower than the higher element, and the lower element has an associated weight attribute.
- the lower element has a default weight equal to the higher weight when the weight attribute is null, and otherwise has a weight equal to a value in the weight attribute.
- the metadata is arranged in a hierarchical metadata file having plural tags with associated metadata elements.
- One preferred way in which the weighting is undertaken includes, for each metadata element in the metadata file, accessing a corresponding weight in the specification file. Metadata elements, but not tags, are written out in accordance with the respective weights. Thus, a metadata element is not written out if its respective weight is zero, and a metadata element is written out twice if its respective weight is two.
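The write-out rule just described (data only, no tags, repeated according to weight) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `write_weighted`, the sample tags, and the choice to treat tags missing from the weight table as weight zero are all assumptions made for the example.

```python
def write_weighted(elements, weights):
    """Return index text in which each element's data appears `weight` times.

    `elements` is a list of (tag, data) pairs; `weights` maps tag -> int.
    Only the data is emitted, never the tag.
    """
    out = []
    for tag, data in elements:
        # A weight of 0 drops the element entirely. Treating tags that are
        # missing from `weights` as 0 is an assumption of this sketch.
        out.extend([data] * weights.get(tag, 0))
    return " ".join(out)


# Hypothetical sample data, not taken from the patent.
elements = [
    ("title", "weighted indexing"),
    ("crawler", "bot-7"),
    ("body", "hierarchical documents"),
]
weights = {"title": 2, "crawler": 0, "body": 1}

print(write_weighted(elements, weights))
# weighted indexing weighted indexing hierarchical documents
```

The crawler name is dropped because its weight is zero, and the title data appears twice because its weight is two.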
- a computer-implemented method for indexing documents includes generating a specification file for each of a plurality of document classes defining respective metadata hierarchies. Also, the method includes receiving at least one metadata file representative of at least one document, and parsing the metadata file in accordance with the specification file to write out data to an index file in markup language (e.g., HTML or XML) in accordance with weights defined by the specification file. No markup tags associated with the data are written out to the index file.
- the index file can then be sent to an indexing engine of a Web search engine that can be selected from the group including full text indexing engines, value indexing engines, and path expression indexing engines.
- a computer program device includes a computer program storage device readable by a digital processing apparatus.
- a program is on the program storage device.
- the program includes instructions that can be executed by the digital processing apparatus.
- the program device includes computer readable code means for receiving at least one metadata file representative of a document on the World Wide Web.
- the metadata file includes tags and associated data elements.
- Computer readable code means are provided for writing only data elements to an index file, with a data element being written “n” times to the index file wherein “n” is a weight associated with the data element.
- FIG. 1 is a schematic diagram showing the system of the present invention.
- FIG. 2 is a flow chart showing the overall logic.
- FIG. 3 is a flow chart showing the logic for generating the specification files.
- FIG. 4 is a flow chart showing the details of the logic for parsing a metadata file using a specification file.
- a system for weighted indexing of hierarchical documents contained in Web sites 12 that are accessible to software-implemented browsers 13 of user computers 14 via the Internet 16 .
- the hierarchical documents can include various document classes, including HTML pages, XML pages, newsgroup articles, JAVA programs, binary data, images, database data, and so on.
- a search engine server 18 also accesses the Web sites 12 .
- the search engine server 18 can be a GCS server or Hotbot server or other appropriate search engine.
- the preferred server 18 includes a crawler 20 that crawls the Web sites 12 to generate metadata files representative of the content of the Web sites 12 .
- the metadata files can less preferably be manually generated and made available at a Web site 12 .
- the crawler 20 communicates with a weighting module 22 of the present invention, which weights elements in the metadata files in accordance with a weighting scheme to render an index file representing weighted metadata.
- the weighting can be conditional, wherein the weight of an element is a function of the weight of another element or elements or wherein the weight of an element otherwise depends on the weight of another element or elements. If the weight of an element is zero, the element is essentially eliminated.
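One way to picture conditional weighting is to let a weight be either a fixed integer or a function of other elements' weights. The sketch below is an assumption-laden illustration: the text does not say how conditions are expressed, so the callable representation, the name `resolve_conditional`, and the sample element names are all hypothetical.

```python
def resolve_conditional(weights):
    """Resolve a mapping of name -> int or name -> callable(absolute) -> int.

    In this sketch a conditional weight may depend only on the absolute
    (integer) weights, so a single pass is enough to resolve everything.
    """
    absolute = {name: w for name, w in weights.items() if not callable(w)}
    resolved = dict(absolute)
    for name, w in weights.items():
        if callable(w):
            # Clamp at zero: a weight of zero eliminates the element.
            resolved[name] = max(0, int(w(absolute)))
    return resolved


# Hypothetical element names and rules, not taken from the patent.
weights = {
    "title": 2,
    "abstract": lambda a: a["title"] + 1,            # function of another weight
    "footer": lambda a: 0 if a["title"] > 1 else 1,  # depends on another weight
}

print(resolve_conditional(weights))
# {'title': 2, 'abstract': 3, 'footer': 0}
```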
- the index file, which can be rendered in the form of an HTML or other markup language document, is then provided to a conventional index engine 24 which generates an index based on the metadata.
- If a user of the user computer 14 generates a request for Web documents using a keyword search entered by means of an input device 26 such as a mouse or keyboard, the request is sent via the Internet 16 to a query executor 28 that is associated with the search engine server 18 .
- the query executor 28 accesses the index to execute the requests (queries) for documents, and returns a list of documents satisfying the request to the user computer 14 for display on an output device 30 such as a monitor or printer.
- both the user computer 14 and search server 18 are digital processing apparatus, such as personal computers made by International Business Machines Corporation (IBM) of Armonk, N.Y., or any other computer, including computers sold under trademarks such as AS400, with accompanying IBM Network Stations.
- the computers may be Unix computers, or OS/2 servers, or Windows NT servers, or IBM workstations or IBM laptop computers.
- the flow charts herein illustrate the structure of the weighting module 22 of the present invention as embodied in computer program software.
- the flow charts illustrate the structures of logic elements, such as computer program code elements or electronic logic circuits, that function according to this invention.
- the invention is practiced in its essential embodiment by a machine component that renders the logic elements in a form that instructs a digital processing apparatus (that is, a computer) to perform a sequence of function steps corresponding to those shown.
- the weighting module 22 may be a computer program that is executed by a processor within the server 18 as a series of computer-executable instructions.
- these instructions may reside, for example, in RAM of the server 18 , or the instructions may be stored on a DASD array, magnetic tape, electronic read-only memory, or other appropriate data storage device.
- the computer-executable instructions may be lines of XML code.
- a hierarchical metadata file is received, preferably from the crawler 20 , although it could be received from another source.
- at block 34, data entries, i.e., elements in the metadata file, are weighted in accordance with the weighting scheme.
- the weighting at block 34 produces weighted metadata which, as more fully disclosed below, preferably is in the form of an HTML-formatted index file. Other formats, including XML, can be used.
- the weighted metadata is sent to the index engine 24 .
- the index engine 24 uses the weighted metadata to generate an index at block 38 .
- This index can then be used at block 40 by the query executor 28 to respond to a search request (query) by retrieving URLs of documents that satisfy the request.
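To see why repeating an element in the index file acts as a weight, consider a full-text engine that counts term occurrences. The sketch below assumes such a frequency-based engine, which the text does not specify; the function name `term_frequencies` is hypothetical, and the sample text is the example index line given in this document (without its final period).

```python
from collections import Counter


def term_frequencies(index_text):
    """Count lowercase, whitespace-separated terms in an index file's text."""
    return Counter(index_text.lower().split())


index_text = (
    "This will occur twice This will occur twice "
    "This string will occur only once"
)
tf = term_frequencies(index_text)

# Terms from the element written out twice get a higher count than terms
# from the element written out once.
print(tf["twice"], tf["once"])
# 2 1
```

Under this assumption, a conventional index engine needs no knowledge of the weighting scheme: the weights are already encoded in how often each data element appears.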
- FIGS. 3 and 4 show the details of how the preferred embodiment undertakes the overall logic of FIG. 2 .
- a DO loop is entered for each class of documents.
- a specification file is created for the class under test.
- the specification file has the same schema, i.e., hierarchical format, as the metadata file for the class under test.
- the hierarchy of the specification file matches the hierarchy of the metadata file.
- the schema for the specification files (in XML) is as follows (for a full-text indexing implementation):
- the specification language includes two types of XML elements, i.e., “elt” (the occurrence of an element in a metadata file in the associated document class) and “attr” (the occurrence of an attribute in a metadata file in the associated document class).
- elements “elt” and “attr” have a string attribute (“name” in the schema), which is required, and a weight attribute (“count” in the schema), which is optional and which can be conditional, i.e., the weight of one element can depend on or be a function of the weight of another element.
- the string attribute specifies the name of the element or attribute that is being weighted, whereas the weight attribute specifies the weight (in absolute terms or conditional terms as a function of another element) that is to be given to the corresponding “attr” or “elt” element. Since the weight attribute preferably is optional, in the preferred embodiment when the weight attribute is not specified the corresponding element inherits the weight value from the parent node, i.e., the next higher node in the hierarchy.
- in the example specification file, the root element “summary” has a weight attribute of zero. This means that by default, nothing under it will be indexed, unless the default is overridden. This indeed is the case of “resource” and “summarizer”, both of which have weights of one, i.e., the weight attributes of these elements have been provided and are one.
- “Description” inherits the weight of zero from its parent node “summary”, but its child node “strings” does not default to zero, but rather has been associated with a weight of two in its weight attribute.
- every element under “strings” has a weight of two, e.g., the weight of the element “Seq” is two, unless overridden in accordance with the above principles, e.g., the weight attribute of element “L1”, which is a child node of the element “seq”, has been provided as one, to override the default weight defined by the parent node “seq”.
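The inheritance rule described above can be checked mechanically. The sketch below walks a copy of this document's example specification file (DOCTYPE line omitted) and prints the effective weight of every node, inheriting from the parent whenever “count” is absent. The function name `effective_weights` and the path notation (with `@` marking attributes) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Copy of the example specification file, without the DOCTYPE line.
SPEC = """
<full-text>
  <elt name="summary" count="0">
    <elt name="Description">
      <attr name="resource" count="1"/>
      <attr name="summarizer" count="1"/>
      <elt name="strings" count="2">
        <elt name="Seq">
          <elt name="L1" count="1"/>
        </elt>
      </elt>
    </elt>
  </elt>
</full-text>
"""


def effective_weights(node, inherited=0, path=""):
    """Yield (path, weight) for every elt/attr node below `node`."""
    for child in node:
        count = child.get("count")
        # No weight attribute: inherit the weight of the parent node.
        weight = inherited if count is None else int(count)
        sep = "/@" if child.tag == "attr" else "/"
        child_path = path + sep + child.get("name")
        yield child_path, weight
        yield from effective_weights(child, weight, child_path)


for p, w in effective_weights(ET.fromstring(SPEC.strip())):
    print(p, w)
# /summary 0
# /summary/Description 0
# /summary/Description/@resource 1
# /summary/Description/@summarizer 1
# /summary/Description/strings 2
# /summary/Description/strings/Seq 2
# /summary/Description/strings/Seq/L1 1
```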
- the metadata file is received for indexing.
- the appropriate specification file for the class of the metadata file is accessed.
- both files are scanned top-down (hierarchically) simultaneously, it being recalled that the specification file is constructed to have the same format as the metadata file for that class.
- the metadata file is parsed using the specification file, such that only elements and attributes that have a weight of one or more are written out to an index file at block 54 in accordance with the respective weights contained in the specification file.
- the “strings” element in the metadata file, having a weight of two, is written out twice to the index file, whereas the “L1” element in the metadata file is written out only once, having a weight of one.
- the index file appears as:
- the present invention can be used with any indexing protocol, and using one specification file per document class allows element indexing to be tuned to each class. Additionally, the present invention works well with XML/RDF (hierarchical-based and graph-based) encoded material.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
<!ELEMENT full-text (elt)*>
<!ENTITY % digit "(0|1|2|3|4|5|6|7|8|9)">
<!ENTITY % number "NMTOKEN">
<!ENTITY % boolean "(true|false)">
<!ELEMENT elt (elt | attr)*>
<!ATTLIST elt
    name  CDATA     #REQUIRED
    count %number;  #IMPLIED>
<!ELEMENT attr EMPTY>
<!ATTLIST attr
    name  CDATA     #REQUIRED
    count %number;  #IMPLIED>
METADATA FILE
<summary date="01/01/00"
         by="http://foobar.com">
  <description resource="http://www.important.com"
               summarizer="http://www.gcs.ibm.com">
    This string will not occur at all
    <strings>
      This will occur twice
      <seq>
        <L1>This string will occur only once</L1>
      </seq>
    </strings>
  </description>
</summary>
SPECIFICATION FILE
<?xml version="1.0" ?>
<!DOCTYPE full-text SYSTEM "fulltext.dtd">
<full-text>
  <elt name="summary" count="0">
    <elt name="Description">
      <attr name="resource" count="1"/>
      <attr name="summarizer" count="1"/>
      <elt name="strings" count="2">
        <elt name="Seq">
          <elt name="L1" count="1"/>
        </elt>
      </elt>
    </elt>
  </elt>
</full-text>
- http://www.important.com http://www.gcs.ibm.com This will occur twice This will occur twice This string will occur only once.
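The parsing logic of FIG. 4 can be sketched end to end using the example metadata and specification files. This is a minimal illustration under stated assumptions, not the patented code: the function names (`build_index`, `emit`, and the helpers) are hypothetical, element names are matched case-insensitively because the specification file says “Description” and “Seq” where the metadata file says “description” and “seq”, and attributes not listed in the specification file are simply skipped.

```python
import xml.etree.ElementTree as ET

METADATA = """
<summary date="01/01/00" by="http://foobar.com">
  <description resource="http://www.important.com"
               summarizer="http://www.gcs.ibm.com">
    This string will not occur at all
    <strings>
      This will occur twice
      <seq>
        <L1>This string will occur only once</L1>
      </seq>
    </strings>
  </description>
</summary>
"""

SPEC = """
<full-text>
  <elt name="summary" count="0">
    <elt name="Description">
      <attr name="resource" count="1"/>
      <attr name="summarizer" count="1"/>
      <elt name="strings" count="2">
        <elt name="Seq">
          <elt name="L1" count="1"/>
        </elt>
      </elt>
    </elt>
  </elt>
</full-text>
"""


def spec_children(spec_node, kind):
    """Map lowercased name -> spec child of the given kind ('elt' or 'attr')."""
    return {c.get("name").lower(): c for c in spec_node if c.tag == kind}


def weight_of(spec_node, inherited):
    """Explicit count if present, otherwise the weight inherited from the parent."""
    count = spec_node.get("count")
    return inherited if count is None else int(count)


def emit(meta, spec, inherited, out):
    """Walk metadata and spec top-down together, appending weighted data to `out`."""
    weight = weight_of(spec, inherited)

    def add_text(text):
        # Write the data (never the tag) `weight` times; weight 0 writes nothing.
        if text and text.strip():
            out.extend([" ".join(text.split())] * weight)

    attrs = spec_children(spec, "attr")
    for name, value in meta.attrib.items():
        attr_spec = attrs.get(name.lower())
        if attr_spec is not None:  # unlisted attributes are skipped (assumption)
            out.extend([value] * weight_of(attr_spec, weight))

    add_text(meta.text)
    elts = spec_children(spec, "elt")
    for child in meta:
        child_spec = elts.get(child.tag.lower())
        if child_spec is not None:
            emit(child, child_spec, weight, out)
        add_text(child.tail)  # text after a child belongs to this element


def build_index(metadata_xml, spec_xml):
    meta_root = ET.fromstring(metadata_xml.strip())
    spec_root = ET.fromstring(spec_xml.strip())
    out = []
    top = spec_children(spec_root, "elt").get(meta_root.tag.lower())
    if top is not None:
        emit(meta_root, top, 0, out)
    return " ".join(out)


print(build_index(METADATA, SPEC))
```

With these inputs the sketch writes the two attribute values once each, the “strings” text twice, and the “L1” text once, in that order. The “date” and “by” attributes and the text directly under “description” are left out because their effective weight is zero.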
Claims (21)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/510,054 US6910029B1 (en) | 2000-02-22 | 2000-02-22 | System for weighted indexing of hierarchical documents |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/510,054 US6910029B1 (en) | 2000-02-22 | 2000-02-22 | System for weighted indexing of hierarchical documents |
Publications (1)
Publication Number | Publication Date |
---|---|
US6910029B1 true US6910029B1 (en) | 2005-06-21 |
Family
ID=34652247
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/510,054 Expired - Lifetime US6910029B1 (en) | 2000-02-22 | 2000-02-22 | System for weighted indexing of hierarchical documents |
Country Status (1)
Country | Link |
---|---|
US (1) | US6910029B1 (en) |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030088593A1 (en) * | 2001-03-21 | 2003-05-08 | Patrick Stickler | Method and apparatus for generating a directory structure |
US20030182274A1 (en) * | 2000-07-27 | 2003-09-25 | Young-June Oh | Navigable search engine |
US20030233618A1 (en) * | 2002-06-17 | 2003-12-18 | Canon Kabushiki Kaisha | Indexing and querying of structured documents |
US20040044659A1 (en) * | 2002-05-14 | 2004-03-04 | Douglass Russell Judd | Apparatus and method for searching and retrieving structured, semi-structured and unstructured content |
US20040044680A1 (en) * | 2002-03-25 | 2004-03-04 | Thorpe Jonathan Richard | Data structure |
US20040237037A1 (en) * | 2003-03-21 | 2004-11-25 | Xerox Corporation | Determination of member pages for a hyperlinked document with recursive page-level link analysis |
US20040243554A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis |
US20040243557A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a weighted and (WAND) |
US20040243645A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations |
US20040243556A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system (CAS) |
US20040243560A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, including an annotation inverted file system facilitating indexing and searching |
US20050033733A1 (en) * | 2001-02-26 | 2005-02-10 | Ori Software Development Ltd. | Encoding semi-structured data for efficient search and browsing |
US20050086583A1 (en) * | 2000-01-28 | 2005-04-21 | Microsoft Corporation | Proxy server using a statistical model |
US20050188300A1 (en) * | 2003-03-21 | 2005-08-25 | Xerox Corporation | Determination of member pages for a hyperlinked document with link and document analysis |
US20050198077A1 (en) * | 2003-12-24 | 2005-09-08 | Van Der Heijden Antonius Nicolaas A. | Method, computer system, computer program and computer program product for storage and retrieval of data files in a data storage means |
US20050210006A1 (en) * | 2004-03-18 | 2005-09-22 | Microsoft Corporation | Field weighting in text searching |
US20050235197A1 (en) * | 2003-07-11 | 2005-10-20 | Computer Associates Think, Inc | Efficient storage of XML in a directory |
US20060004717A1 (en) * | 2004-07-01 | 2006-01-05 | Microsoft Corporation | Dispersing search engine results by using page category information |
US20060074871A1 (en) * | 2004-09-30 | 2006-04-06 | Microsoft Corporation | System and method for incorporating anchor text into ranking search results |
US20060095446A1 (en) * | 2004-10-29 | 2006-05-04 | Hewlett-Packard Development Company, L.P. | Methods for indexing data, systems, software and apparatus relating thereto |
US20060136411A1 (en) * | 2004-12-21 | 2006-06-22 | Microsoft Corporation | Ranking search results using feature extraction |
US20060200460A1 (en) * | 2005-03-03 | 2006-09-07 | Microsoft Corporation | System and method for ranking search results using file types |
US20060294100A1 (en) * | 2005-03-03 | 2006-12-28 | Microsoft Corporation | Ranking search results using language types |
WO2007005382A2 (en) | 2005-06-29 | 2007-01-11 | Microsoft Corporation | Sensing, storing, indexing, and retrieving data leveraging measures of user activity, attention, and interest |
US20070022126A1 (en) * | 2005-07-21 | 2007-01-25 | Caterpillar Inc. | Method and apparatus for updating an asset catalog |
US20070038622A1 (en) * | 2005-08-15 | 2007-02-15 | Microsoft Corporation | Method ranking search results using biased click distance |
US20070073894A1 (en) * | 2005-09-14 | 2007-03-29 | O Ya! Inc. | Networked information indexing and search apparatus and method |
US20080005150A1 (en) * | 2006-06-30 | 2008-01-03 | Microsoft Corporation | Type definition language for defining content-index from a rich structured WinFS data type |
US20080010238A1 (en) * | 2006-07-07 | 2008-01-10 | Microsoft Corporation | Index having short-term portion and long-term portion |
US20080028022A1 (en) * | 2006-07-25 | 2008-01-31 | Jun Nakagawa | Method and system for supporting responding to inquiry regarding digital content |
US20080281781A1 (en) * | 2007-05-07 | 2008-11-13 | Lei Zhao | Searching document sets with differing metadata schemata |
US20090106223A1 (en) * | 2007-10-18 | 2009-04-23 | Microsoft Corporation | Enterprise relevancy ranking using a neural network |
US20100076964A1 (en) * | 2007-12-18 | 2010-03-25 | Daniel Joseph Parrell | Instance-Class-Attribute Matching Web Page Ranking |
US7761448B2 (en) | 2004-09-30 | 2010-07-20 | Microsoft Corporation | System and method for ranking search results using click distance |
US7827181B2 (en) | 2004-09-30 | 2010-11-02 | Microsoft Corporation | Click distance determination |
US20110252040A1 (en) * | 2010-04-07 | 2011-10-13 | Oracle International Corporation | Searching document object model elements by attribute order priority |
CN103530384A (en) * | 2013-10-21 | 2014-01-22 | 济南政和科技有限公司 | Internet information resource quick searching method |
US8738635B2 (en) | 2010-06-01 | 2014-05-27 | Microsoft Corporation | Detection of junk in search result ranking |
US8812493B2 (en) | 2008-04-11 | 2014-08-19 | Microsoft Corporation | Search results ranking using editing distance and document information |
US8843486B2 (en) | 2004-09-27 | 2014-09-23 | Microsoft Corporation | System and method for scoping searches using index keys |
US20150149385A1 (en) * | 2007-04-16 | 2015-05-28 | Ebay Inc. | Visualization of reputation ratings |
US9348912B2 (en) | 2007-10-18 | 2016-05-24 | Microsoft Technology Licensing, Llc | Document length as a static relevance feature for ranking search results |
US9495462B2 (en) | 2012-01-27 | 2016-11-15 | Microsoft Technology Licensing, Llc | Re-ranking search results |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5111384A (en) * | 1990-02-16 | 1992-05-05 | Bull Hn Information Systems Inc. | System for performing dump analysis |
US5893101A (en) * | 1994-06-08 | 1999-04-06 | Systems Research & Applications Corporation | Protection of an electronically stored image in a first color space by the alteration of digital component in a second color space |
US5983176A (en) * | 1996-05-24 | 1999-11-09 | Magnifi, Inc. | Evaluation of media content in media files |
US5991459A (en) * | 1992-01-22 | 1999-11-23 | Eastman Kodak Company | Method of modifying a time-varying image sequence by estimation of velocity vector fields |
US6151624A (en) * | 1998-02-03 | 2000-11-21 | Realnames Corporation | Navigating network resources based on metadata |
US6199081B1 (en) * | 1998-06-30 | 2001-03-06 | Microsoft Corporation | Automatic tagging of documents and exclusion by content |
US6208988B1 (en) * | 1998-06-01 | 2001-03-27 | Bigchalk.Com, Inc. | Method for identifying themes associated with a search query using metadata and for organizing documents responsive to the search query in accordance with the themes |
US6240407B1 (en) * | 1998-04-29 | 2001-05-29 | International Business Machines Corp. | Method and apparatus for creating an index in a database system |
US6295529B1 (en) * | 1998-12-24 | 2001-09-25 | Microsoft Corporation | Method and apparatus for indentifying clauses having predetermined characteristics indicative of usefulness in determining relationships between different texts |
US20020013782A1 (en) * | 2000-02-18 | 2002-01-31 | Daniel Ostroff | Software program for internet information retrieval, analysis and presentation |
US6345288B1 (en) * | 1989-08-31 | 2002-02-05 | Onename Corporation | Computer-based communication system and method using metadata defining a control-structure |
US6351755B1 (en) * | 1999-11-02 | 2002-02-26 | Alta Vista Company | System and method for associating an extensible set of data with documents downloaded by a web crawler |
US6353823B1 (en) * | 1999-03-08 | 2002-03-05 | Intel Corporation | Method and system for using associative metadata |
US6389412B1 (en) * | 1998-12-31 | 2002-05-14 | Intel Corporation | Method and system for constructing integrated metadata |
Non-Patent Citations (3)
Title |
---|
K. Selcuk Candan, Huan Liu, and Reshma Suvarna, Resource Description Framework: Metadata and Its Applications, SIGKDD Explorations, vol. 3, pp. 6-18. * |
Publication: "An Efficiently Updatable Index Scheme for Structured Documents." Kanemoto et al. IEEE. pp. 991-996. 1998. |
Publication: "An Indexing Model for Structured Documents to Support Queries on Content, Structure and Attributes" Dao. IEEE. pp. 89-97. 1998. |
Cited By (73)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7603616B2 (en) | 2000-01-28 | 2009-10-13 | Microsoft Corporation | Proxy server using a statistical model |
US20050086583A1 (en) * | 2000-01-28 | 2005-04-21 | Microsoft Corporation | Proxy server using a statistical model |
US20030182274A1 (en) * | 2000-07-27 | 2003-09-25 | Young-June Oh | Navigable search engine |
US20050033733A1 (en) * | 2001-02-26 | 2005-02-10 | Ori Software Development Ltd. | Encoding semi-structured data for efficient search and browsing |
US8489597B2 (en) * | 2001-02-26 | 2013-07-16 | Ori Software Development Ltd. | Encoding semi-structured data for efficient search and browsing |
US20030088593A1 (en) * | 2001-03-21 | 2003-05-08 | Patrick Stickler | Method and apparatus for generating a directory structure |
US7200627B2 (en) * | 2001-03-21 | 2007-04-03 | Nokia Corporation | Method and apparatus for generating a directory structure |
US20040044680A1 (en) * | 2002-03-25 | 2004-03-04 | Thorpe Jonathan Richard | Data structure |
US7587419B2 (en) * | 2002-03-25 | 2009-09-08 | Sony United Kingdom Limited | Video metadata data structure |
US20040044659A1 (en) * | 2002-05-14 | 2004-03-04 | Douglass Russell Judd | Apparatus and method for searching and retrieving structured, semi-structured and unstructured content |
US7756857B2 (en) * | 2002-06-17 | 2010-07-13 | Canon Kabushiki Kaisha | Indexing and querying of structured documents |
US20030233618A1 (en) * | 2002-06-17 | 2003-12-18 | Canon Kabushiki Kaisha | Indexing and querying of structured documents |
US20040237037A1 (en) * | 2003-03-21 | 2004-11-25 | Xerox Corporation | Determination of member pages for a hyperlinked document with recursive page-level link analysis |
US20050188300A1 (en) * | 2003-03-21 | 2005-08-25 | Xerox Corporation | Determination of member pages for a hyperlinked document with link and document analysis |
US20040243645A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations |
US8280903B2 (en) | 2003-05-30 | 2012-10-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND) |
US20040243557A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a weighted and (WAND) |
US20040243554A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis |
US20040243560A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, including an annotation inverted file system facilitating indexing and searching |
US7512602B2 (en) | 2003-05-30 | 2009-03-31 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a weighted and (WAND) |
US7139752B2 (en) | 2003-05-30 | 2006-11-21 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations |
US7146361B2 (en) * | 2003-05-30 | 2006-12-05 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND) |
US20040243556A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system (CAS) |
US20070112763A1 (en) * | 2003-05-30 | 2007-05-17 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a weighted and (WAND) |
US20050235197A1 (en) * | 2003-07-11 | 2005-10-20 | Computer Associates Think, Inc | Efficient storage of XML in a directory |
US7792855B2 (en) * | 2003-07-11 | 2010-09-07 | Computer Associates Think, Inc. | Efficient storage of XML in a directory |
US20050198077A1 (en) * | 2003-12-24 | 2005-09-08 | Van Der Heijden Antonius Nicolaas A. | Method, computer system, computer program and computer program product for storage and retrieval of data files in a data storage means |
US8706686B2 (en) * | 2003-12-24 | 2014-04-22 | Split-Vision Kennis B.V. | Method, computer system, computer program and computer program product for storage and retrieval of data files in a data storage means |
US7584221B2 (en) | 2004-03-18 | 2009-09-01 | Microsoft Corporation | Field weighting in text searching |
US20050210006A1 (en) * | 2004-03-18 | 2005-09-22 | Microsoft Corporation | Field weighting in text searching |
US20060004717A1 (en) * | 2004-07-01 | 2006-01-05 | Microsoft Corporation | Dispersing search engine results by using page category information |
US7428530B2 (en) * | 2004-07-01 | 2008-09-23 | Microsoft Corporation | Dispersing search engine results by using page category information |
US8843486B2 (en) | 2004-09-27 | 2014-09-23 | Microsoft Corporation | System and method for scoping searches using index keys |
US8082246B2 (en) | 2004-09-30 | 2011-12-20 | Microsoft Corporation | System and method for ranking search results using click distance |
US7827181B2 (en) | 2004-09-30 | 2010-11-02 | Microsoft Corporation | Click distance determination |
US7761448B2 (en) | 2004-09-30 | 2010-07-20 | Microsoft Corporation | System and method for ranking search results using click distance |
US20060074871A1 (en) * | 2004-09-30 | 2006-04-06 | Microsoft Corporation | System and method for incorporating anchor text into ranking search results |
US7739277B2 (en) | 2004-09-30 | 2010-06-15 | Microsoft Corporation | System and method for incorporating anchor text into ranking search results |
US20060095446A1 (en) * | 2004-10-29 | 2006-05-04 | Hewlett-Packard Development Company, L.P. | Methods for indexing data, systems, software and apparatus relating thereto |
US8892564B2 (en) * | 2004-10-29 | 2014-11-18 | Hewlett-Packard Development Company, L.P. | Indexing for data having indexable and non-indexable parent nodes |
US20060136411A1 (en) * | 2004-12-21 | 2006-06-22 | Microsoft Corporation | Ranking search results using feature extraction |
US7716198B2 (en) * | 2004-12-21 | 2010-05-11 | Microsoft Corporation | Ranking search results using feature extraction |
US20060200460A1 (en) * | 2005-03-03 | 2006-09-07 | Microsoft Corporation | System and method for ranking search results using file types |
US7792833B2 (en) | 2005-03-03 | 2010-09-07 | Microsoft Corporation | Ranking search results using language types |
US20060294100A1 (en) * | 2005-03-03 | 2006-12-28 | Microsoft Corporation | Ranking search results using language types |
EP1897002A4 (en) * | 2005-06-29 | 2016-09-14 | Microsoft Technology Licensing Llc | Sensing, storing, indexing, and retrieving data leveraging measures of user activity, attention, and interest |
WO2007005382A2 (en) | 2005-06-29 | 2007-01-11 | Microsoft Corporation | Sensing, storing, indexing, and retrieving data leveraging measures of user activity, attention, and interest |
US20070022126A1 (en) * | 2005-07-21 | 2007-01-25 | Caterpillar Inc. | Method and apparatus for updating an asset catalog |
US20070038622A1 (en) * | 2005-08-15 | 2007-02-15 | Microsoft Corporation | Method ranking search results using biased click distance |
US7599917B2 (en) | 2005-08-15 | 2009-10-06 | Microsoft Corporation | Ranking search results using biased click distance |
US20070073894A1 (en) * | 2005-09-14 | 2007-03-29 | O Ya! Inc. | Networked information indexing and search apparatus and method |
US7590654B2 (en) | 2006-06-30 | 2009-09-15 | Microsoft Corporation | Type definition language for defining content-index from a rich structured WinFS data type |
US20080005150A1 (en) * | 2006-06-30 | 2008-01-03 | Microsoft Corporation | Type definition language for defining content-index from a rich structured WinFS data type |
US20080010238A1 (en) * | 2006-07-07 | 2008-01-10 | Microsoft Corporation | Index having short-term portion and long-term portion |
US7562114B2 (en) * | 2006-07-25 | 2009-07-14 | International Business Machines Corporation | Method and system for supporting responding to inquiry regarding digital content |
US20080028022A1 (en) * | 2006-07-25 | 2008-01-31 | Jun Nakagawa | Method and system for supporting responding to inquiry regarding digital content |
US11030662B2 (en) | 2007-04-16 | 2021-06-08 | Ebay Inc. | Visualization of reputation ratings |
US20150149385A1 (en) * | 2007-04-16 | 2015-05-28 | Ebay Inc. | Visualization of reputation ratings |
US11763356B2 (en) | 2007-04-16 | 2023-09-19 | Ebay Inc. | Visualization of reputation ratings |
US10127583B2 (en) * | 2007-04-16 | 2018-11-13 | Ebay Inc. | Visualization of reputation ratings |
US20080281781A1 (en) * | 2007-05-07 | 2008-11-13 | Lei Zhao | Searching document sets with differing metadata schemata |
US7711729B2 (en) * | 2007-05-07 | 2010-05-04 | Microsoft Corporation | Searching a document based on a customer defined metadata schemata |
US9348912B2 (en) | 2007-10-18 | 2016-05-24 | Microsoft Technology Licensing, Llc | Document length as a static relevance feature for ranking search results |
US7840569B2 (en) | 2007-10-18 | 2010-11-23 | Microsoft Corporation | Enterprise relevancy ranking using a neural network |
US20090106223A1 (en) * | 2007-10-18 | 2009-04-23 | Microsoft Corporation | Enterprise relevancy ranking using a neural network |
US20100076964A1 (en) * | 2007-12-18 | 2010-03-25 | Daniel Joseph Parrell | Instance-Class-Attribute Matching Web Page Ranking |
US8812493B2 (en) | 2008-04-11 | 2014-08-19 | Microsoft Corporation | Search results ranking using editing distance and document information |
US20110252040A1 (en) * | 2010-04-07 | 2011-10-13 | Oracle International Corporation | Searching document object model elements by attribute order priority |
US9460232B2 (en) * | 2010-04-07 | 2016-10-04 | Oracle International Corporation | Searching document object model elements by attribute order priority |
US8738635B2 (en) | 2010-06-01 | 2014-05-27 | Microsoft Corporation | Detection of junk in search result ranking |
US9495462B2 (en) | 2012-01-27 | 2016-11-15 | Microsoft Technology Licensing, Llc | Re-ranking search results |
CN103530384A (en) * | 2013-10-21 | 2014-01-22 | 济南政和科技有限公司 | Internet information resource quick searching method |
CN103530384B (en) * | 2013-10-21 | 2017-01-25 | 政和科技股份有限公司 | Internet information resource quick searching method |
Similar Documents
Publication | Title |
---|---|
US6910029B1 (en) | System for weighted indexing of hierarchical documents |
US6487566B1 (en) | Transforming documents using pattern matching and a replacement language |
US6480865B1 (en) | Facility for adding dynamism to an extensible markup language |
US8046681B2 (en) | Techniques for inducing high quality structural templates for electronic documents |
US6654734B1 (en) | System and method for query processing and optimization for XML repositories |
US7536389B1 (en) | Techniques for crawling dynamic web content |
US20090125529A1 (en) | Extracting information based on document structure and characteristics of attributes |
US7747610B2 (en) | Database system and methodology for processing path based queries |
EP1367504B1 (en) | Method and computer system for indexing structured documents |
US7370061B2 (en) | Method for querying XML documents using a weighted navigational index |
US6799184B2 (en) | Relational database system providing XML query support |
US7299221B2 (en) | Progressive relaxation of search criteria |
US7409634B2 (en) | Method and apparatus for end-to-end content publishing system using XML with an object dependency graph |
US6105043A (en) | Creating macro language files for executing structured query language (SQL) queries in a relational database via a network |
US20020010709A1 (en) | Method and system for distilling content |
US6658624B1 (en) | Method and system for processing documents controlled by active documents with embedded instructions |
US8341144B2 (en) | Selecting and presenting user search results based on user information |
US20100169311A1 (en) | Approaches for the unsupervised creation of structural templates for electronic documents |
US20090248707A1 (en) | Site-specific information-type detection methods and systems |
US20060070022A1 (en) | URL mapping with shadow page support |
US7698329B2 (en) | Method for improving quality of search results by avoiding indexing sections of pages |
US20070005606A1 (en) | Approach for requesting web pages from a web server using web-page specific cookie data |
US20070016605A1 (en) | Mechanism for computing structural summaries of XML document collections in a database system |
Frey | Indexing ajax web applications |
Buraga et al. | Search Semi-Structured Data on Web |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SUNDARESAN, NEELAKANTAN;REEL/FRAME:010632/0479 Effective date: 20000207 |
FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
FPAY | Fee payment | Year of fee payment: 4 |
REMI | Maintenance fee reminder mailed | |
FPAY | Fee payment | Year of fee payment: 8 |
SULP | Surcharge for late payment | Year of fee payment: 7 |
FPAY | Fee payment | Year of fee payment: 12 |