US20090204638A1 - Automated client sitemap generation - Google Patents
Automated client sitemap generation
- Publication number
- US20090204638A1 US12/028,502
- Authority
- US
- United States
- Prior art keywords
- url
- web
- sitemap
- web page
- same domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Definitions
- Embodiments of the present invention relate to methods, systems, and computer-storage media for automated generation of a sitemap for a web site.
- a universal resource locator (URL) for a web site is received, the web site having a plurality of web pages with which it is associated, that is, web pages having the same domain as the web site URL.
- Log files are analyzed to ascertain whether each web page has been previously crawled.
- Other files, downloaded from the root site, contain permission controls and are analyzed to determine which web pages may be crawled and/or indexed.
- the permitted, not-previously-crawled web pages are subsequently crawled and the structure of the web site, that is the linking of the pages between one another, is ascertained.
- a current sitemap is generated that provides the hierarchy and related details in the form of metadata.
- the sitemap file is then written to a disk and may then be sent to search engines as generated or in a compressed format. Certain embodiments can implement the generation of a new sitemap any time the web site is modified.
- FIG. 1 is a block diagram of an exemplary computing environment suitable for use in implementing embodiments of the present invention
- FIG. 2 is a flowchart of a method suitable for generating a current sitemap of a web site, in accordance with an embodiment of the present invention
- FIG. 3 is a flowchart of a method suitable for calculating a priority value for a web page, in accordance with an embodiment of the present invention
- FIG. 4 is a flowchart of a method suitable for calculating a modification frequency for a web page, in accordance with an embodiment of the present invention
- FIG. 5 is a flowchart of a method suitable for generating a sitemap file for a web site, in accordance with an embodiment of the present invention.
- FIG. 6 is a flowchart of a method suitable for generating a sitemap for a web site, in accordance with an embodiment of the present invention.
- Embodiments of the present invention relate to methods, systems, and computer storage media having computer-executable instructions embodied thereon that, when executed, perform methods for generating a sitemap file for a web site in an automated manner.
- server log files are analyzed in conjunction with the present web site structure being crawled.
- Specified files denote the permissible pages to crawl and crawling occurs in accordance with such permissions.
- Determined values may be modified manually if desired, or compared with previous sitemap files and server log files to refine values.
- the web site structure and metadata are subsequently used to generate a sitemap file for the web site.
- the sitemap file may be sent to one or more specified search engines.
- Embodiments further provide for compression of the sitemap file prior to transmission to a search engine if needed. Additionally, embodiments provide for an updated sitemap file to be generated each time a web page having the same domain as the web site URL is modified.
- an exemplary operating environment for implementing embodiments of the present invention is shown and designated generally as computing device 100.
- Computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of modules illustrated.
- Embodiments may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device.
- program modules, including routines, programs, objects, modules, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types.
- Embodiments may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, specialty computing devices, etc.
- Embodiments may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
- computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112 , one or more processors 114 , one or more presentation modules 116 , input/output (I/O) ports 118 , I/O modules 120 , and an illustrative power supply 122 .
- Bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof).
- FIG. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments. Distinction is not made between such categories as “workstation,” “server,” “laptop,” “hand-held device,” etc., as all are contemplated within the scope of FIG. 1 and reference to “computer” or “computing device.”
- Computing device 100 typically includes a variety of computer-readable media.
- computer-readable media may comprise Random Access Memory (RAM); Read Only Memory (ROM); Electrically Erasable Programmable Read Only Memory (EEPROM); flash memory or other memory technologies; CDROM, digital versatile disks (DVD) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, carrier wave or any other medium that can be used to encode desired information and be accessed by computing device 100.
- Memory 112 includes computer-storage media in the form of volatile and/or nonvolatile memory.
- the memory may be removable, non-removable, or a combination thereof.
- Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc.
- Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O modules 120 .
- Presentation module(s) 116 present data indications to a user or other device.
- Exemplary presentation modules include a display device, speaker, printing module, vibrating module, etc.
- I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O modules 120 , some of which may be built in.
- Illustrative modules include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
- In FIG. 2, a flow chart illustrating a method, in accordance with an embodiment hereof, for automated sitemap file generation for a web site in accordance with the web site URL, is shown and designated generally as reference numeral 200.
- a web site for which a sitemap is to be generated is received.
- such receipt comprises receiving the URL for the web site, although it will be understood by those of ordinary skill in the art that any web site identifier from which the web site URL may be ascertained may be received in accordance with embodiments hereof.
- a sitemap file is generated, as more fully described below, based upon those web pages having the same domain as the web site URL.
- the sitemap file generation would be limited to web pages that have this specific root domain, such as www.mywebsite.com/index or www.mywebsite.com/faqs.html, etc. Any pages linked to the root domain that differ in domain name will not be included in the sitemap file generated. Thus, in the above example, if a page was linked to www.archive.mywebsite.com, then the sitemap generation would not include this page or related links.
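The same-domain rule above can be sketched as a host comparison. This is a minimal illustration, not the patent's implementation: the helper names `host_of` and `same_domain` are hypothetical, and treating a scheme-less URL as plain HTTP is an assumption made only so that the example URLs from the text can be parsed.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host of a URL, tolerating a missing scheme."""
    if "://" not in url:
        # Assumption: scheme-less URLs such as "www.mywebsite.com/index"
        # are treated as plain HTTP so that urlsplit can find the host.
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()


def same_domain(site_url: str, page_url: str) -> bool:
    """True only when page_url has exactly the same host as site_url."""
    return host_of(site_url) == host_of(page_url)


site = "www.mywebsite.com"
print(same_domain(site, "www.mywebsite.com/faqs.html"))       # included
print(same_domain(site, "www.archive.mywebsite.com/old.html"))  # excluded
```

An exact host match is the strictest reading of "same domain"; it is what excludes www.archive.mywebsite.com in the example above.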
- one or more files are analyzed, as indicated at block 212 .
- the web server logs, that is, files that log user visits to web pages based on respective URLs, are analyzed to discover those URLs that have not previously been crawled.
- a list of URLs is built from which to seed the crawler.
- Each URL in the log file is examined and compared to a list of URLs already present in the corresponding data structure. If the URL is a URL that is not in the data structure, it is added.
- a list of URLs that act as a starting point for the crawler is generated.
- analysis of the files as indicated at block 212 may include not only analysis to discover those URLs that haven't previously been crawled but analysis of several different types of files which are capable of being examined for different forms of information.
- the log files may be analyzed to determine the number of visits a particular web page has received.
- the log files may be analyzed to determine a total number of log file entries, that is, a total number of visits to any URL logged in the log files.
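The log-file steps above (building a seed list, counting visits per page, and counting total entries) might be sketched as follows. The log layout is an assumption: the text does not name a format, so this sketch assumes Common Log Format style request fields. The function name `analyze_log` and the `base` parameter are hypothetical.

```python
import re
from collections import Counter

# Assumption: each log line contains a quoted request such as
# "GET /index HTTP/1.1", as in Common Log Format.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def analyze_log(log_lines, base="http://www.mywebsite.com"):
    """Return (seed URLs in first-seen order, visits per URL, total entries)."""
    seen = set()        # data structure of URLs already recorded
    seeds = []          # starting points for the crawler
    visits = Counter()  # number of visits per URL
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like requests
        url = base + match.group(1)
        visits[url] += 1
        if url not in seen:  # add the URL only if it is not yet present
            seen.add(url)
            seeds.append(url)
    return seeds, visits, sum(visits.values())


sample = [
    '192.0.2.1 - - [08/Feb/2008:10:00:00 +0000] "GET /index HTTP/1.1" 200 512',
    '192.0.2.2 - - [08/Feb/2008:10:01:00 +0000] "GET /faqs.html HTTP/1.1" 200 900',
    '192.0.2.3 - - [08/Feb/2008:10:02:00 +0000] "GET /index HTTP/1.1" 200 512',
]
seeds, visits, total = analyze_log(sample)
print(seeds, dict(visits), total)
```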
- files downloaded from the root site that grant or deny permission to spiders to crawl portions of a web site may be analyzed. These files delineate which web pages, links, and subsequent paths may or may not be crawled and similarly may or may not be included in a sitemap structure. Only those web pages where the crawler is invited to go are crawled (as more fully described below with reference to reference numeral 214). Before crawling begins, an attempt is made to retrieve and parse the robots.txt file, and a data structure is created of all off-limits base URLs. The off-limits data structure is strictly adhered to once crawling begins.
- analysis is not limited to log and permission files. For instance, previous sitemaps may also be analyzed for structure or for gathering details of metadata. Thus, it will be understood and appreciated by those of ordinary skill in the art that the analysis indicated at block 212 is meant to be illustrative and not restrictive, as any other files from which relevant information may be gathered may be analyzed within the spirit and scope of embodiments of the present invention.
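As a rough illustration of the permission check, the sketch below parses robots.txt text with Python's standard `urllib.robotparser` and returns a predicate that the crawler can consult. The function name `build_permission_checker` and the user-agent string `sitemap-bot` are hypothetical; fetching the file from the root site is left out so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def build_permission_checker(robots_txt: str, user_agent: str = "sitemap-bot"):
    """Parse robots.txt content and return a function url -> bool."""
    parser = RobotFileParser()
    # parse() takes the file's lines; here they come from a string
    # that is assumed to have been downloaded already.
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


robots = "User-agent: *\nDisallow: /private/\n"
allowed = build_permission_checker(robots)
print(allowed("http://www.mywebsite.com/index"))
print(allowed("http://www.mywebsite.com/private/report.html"))
```

Here the parser's internal rule list plays the role of the off-limits data structure described above.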
- the permissible web pages having the same domain as the web site URL are crawled.
- the permissible web pages are crawled in a traditional manner by loading each web page URL identified in the log file analysis (block 212). Each link on the web page is examined to see if the link has already been crawled. If it has not, the link is followed. This process is repeated until all the web pages have been examined and, effectively, the tree of pages comprising the web site structure has been crawled.
- relevant data items are gathered about the web site, that is, data items that may aid in generating the sitemap file.
- One such data item is the web page URL itself.
- the URL is the primary piece of information and each unique URL gathered forms an entry in the sitemap file.
- Other data items may include, without limitation, link information.
- To enable later use in determining a priority value for the web page (as more fully described below with reference to FIG. 3), the number of links each web page has from other web pages having the same domain as the URL may be gathered, as well as the number of web pages having the same domain as the URL.
- the link counter may be incremented each time a new link to a URL is discovered during the crawling step.
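The crawl and the link counter described above can be sketched as a breadth-first traversal. This is an illustrative sketch only: `fetch_links` and `allowed` are hypothetical callables supplied by the caller (the first is assumed to return the same-domain links found on a page, the second to apply the permission rules), and an in-memory dictionary stands in for real HTTP fetching.

```python
from collections import deque


def crawl(seeds, fetch_links, allowed):
    """Crawl from the seed URLs; return (crawled pages, inbound link counts)."""
    crawled = set()
    inbound = {}  # link counter: URL -> number of links discovered to it
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in crawled or not allowed(url):
            continue  # already examined, or off-limits
        crawled.add(url)
        for link in fetch_links(url):
            # Increment the counter each time a link to a URL is discovered.
            inbound[link] = inbound.get(link, 0) + 1
            if link not in crawled:
                queue.append(link)  # follow links not yet crawled
    return crawled, inbound


# Hypothetical three-page site used only for demonstration.
site = {"/": ["/a", "/b"], "/a": ["/b"], "/b": ["/"]}
pages, counts = crawl(["/"], lambda u: site.get(u, []), lambda u: True)
print(sorted(pages), counts)
```

The size of `pages` gives the number of web pages in the domain, and `counts` gives the per-page link count that the priority calculation later relies on.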
- a relational structure of the web site is determined by examining the relationships between each permitted web page having the same domain as the web site URL.
- the relational structure takes into account the web pages that are a part of the domain, as well as the interconnections between the web pages.
- a hierarchical “picture” of the web site starts to form in terms of links between web pages and the routes through which the web pages may be reached.
- Metadata can constitute a variety of information associated with the web pages including, without limitation, the frequency at which a page is modified, the relative importance or priority ranking of the page, whether a site administrator or other user has manually altered the modification frequency and/or priority value, and the like.
- metadata may be determined automatically and/or set manually by a user. The analysis of two portions of metadata, priority value and modification frequency, is described in further detail below with reference to FIGS. 3 and 4 , respectively. However, this list is not meant to be exhaustive, but merely to show exemplary items of metadata that may be analyzed.
- a current sitemap of the web site is generated, as indicated at block 220 .
- the generated sitemap may be created using a markup language, for example and not by way of limitation, extensible markup language (XML). Standard formats can be followed so that the sitemap conforms to protocols maximizing web site accessibility.
- www.sitemap.org allows web users to provide a standard sitemap coded in XML conforming to protocols accepted by many major search engines.
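A minimal sketch of writing the gathered entries as XML follows. It assumes the standard format meant here is the public Sitemaps XML protocol, whose namespace and element names (`urlset`, `url`, `loc`, `changefreq`, `priority`) are used below; the function name `build_sitemap` and the dictionary layout of each entry are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the public Sitemaps XML protocol (assumed to be the
# standard format the text refers to).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize entries (dicts with 'loc' and optional metadata) to XML."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        if "changefreq" in entry:
            ET.SubElement(url, "changefreq").text = entry["changefreq"]
        if "priority" in entry:
            ET.SubElement(url, "priority").text = f"{entry['priority']:.1f}"
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    {"loc": "http://www.mywebsite.com/index", "changefreq": "daily", "priority": 1.0},
    {"loc": "http://www.mywebsite.com/faqs.html", "priority": 0.5},
])
print(xml_text)
```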
- the sitemap file may be written to disk.
- the file may optionally be compressed, for instance, utilizing the gzip compression algorithm, as known to those of ordinary skill in the art.
- the sitemap file generally must contain no more than 50,000 URLs and must be less than 10 MB in size before compression is applied (compression is used to reduce the upload time to the search engines). If the data for the sitemap has more than the 50,000 URLs or the sitemap file grows over the 10 MB file size limit, then multiple sitemap files may be created along with a sitemap index file.
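The size limits and optional compression described above might be handled as in the following sketch. The helper names are hypothetical, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption; the generation of the sitemap index file itself is not shown.

```python
import gzip

MAX_URLS = 50_000
MAX_BYTES = 10 * 1024 * 1024  # assumption: 10 MB read as 10 * 1024 * 1024 bytes


def split_urls(urls, max_urls=MAX_URLS):
    """Split a URL list into chunks, one chunk per sitemap file."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


def within_limits(url_count, sitemap_xml):
    """Check the URL count and the uncompressed size of one sitemap file."""
    return url_count <= MAX_URLS and len(sitemap_xml.encode("utf-8")) < MAX_BYTES


def compress(sitemap_xml):
    """gzip-compress a sitemap to reduce upload time."""
    return gzip.compress(sitemap_xml.encode("utf-8"))


urls = [f"http://www.mywebsite.com/page{i}" for i in range(120_001)]
chunks = split_urls(urls)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk would become its own sitemap file, with the sitemap index file listing the location of every chunk.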
- the user may also be provided with the capability to save the sitemap in a text format. Although considered legacy, some sites still utilize text-based sitemaps.
- it may be desired to let one or more specified search engines know of the sitemap by transmitting the current, up-to-date sitemap that is generated.
- the search engine can be “pinged” with the URL to the latest sitemap file or index, as desired.
- methods in accordance with embodiments of the present invention may provide functionality for verifying a sitemap file by comparing the file to the standard format, e.g., XML format, for a sitemap.
- the file will either pass or fail. If the file fails, then a list of errors may be generated allowing the user to correct the sitemap format, for instance, prior to informing a search engine of the sitemap file.
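A pass/fail check that produces a list of errors could look roughly like the sketch below. It performs only a few structural checks against the public Sitemaps XML namespace, which is assumed to be the standard format in question; it is not a full schema validation, and the function name `validate_sitemap` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # assumed standard namespace


def validate_sitemap(xml_text):
    """Return a list of error strings; an empty list means the file passes."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    errors = []
    if root.tag != NS + "urlset":
        errors.append("root element is not <urlset> in the sitemap namespace")
    for index, url in enumerate(root.findall(NS + "url"), start=1):
        loc = url.find(NS + "loc")
        if loc is None or not (loc.text or "").strip():
            errors.append(f"<url> #{index} has no <loc>")
        priority = url.find(NS + "priority")
        if priority is not None:
            try:
                in_range = 0.0 <= float(priority.text) <= 1.0
            except (TypeError, ValueError):
                in_range = False
            if not in_range:
                errors.append(f"<url> #{index} has a priority outside 0.0 to 1.0")
    return errors


bad = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       "<url><priority>2</priority></url></urlset>")
for problem in validate_sitemap(bad):
    print(problem)
```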
- each web page may be compared against the top X (where X is a number that varies based on computing device performance) search engine optimization rules and suggestions may be offered to the web site owner of changes that may allow their site to better optimize page ranking within a search engine, or the like.
- with the method 200, the interaction required of a site administrator or webmaster is diminished. Rather than requiring extensive user input, information that is already available is combined with algorithms, discussed hereinafter, to systematically generate the sitemap file. Additionally, a site administrator or other user may generate the sitemap locally, that is, as a client-oriented tool, rather than relying on a served application. In some embodiments, the above method can be incorporated into the generation and upkeep of a web site. Thus, modifications to the web site can lead to an automatically-generated sitemap that is current.
- In FIG. 3, a flow chart illustrating a method for determining a priority value for one or more web pages is shown and designated generally as reference numeral 300.
- the priority value is calculated during the metadata analysis step 218 of FIG. 2 .
- a web page for which a priority value is desired is received, typically via receipt of the web page URL.
- a priority value for each web page being crawled is determined and a specific indication that such value is desired for a particular web page is not necessary.
- the number of visits a particular web page has received, as well as the total number of log file entries are determined, as indicated at blocks 312 and 314 , respectively.
- these values have already been determined as part of the log file analysis indicated at block 212 of FIG. 2 and, accordingly, at the time of the priority value calculation, are merely recalled. However, if one or more of these values was not determined as part of the log file analysis indicated at block 212 of FIG. 2 , such values may be determined via log file analysis at or near the time the priority value is being calculated.
- the number of links the web page has from other web pages having the same domain as the URL, as well as the number of web pages having the same domain as the URL are determined, as indicated at blocks 316 and 318 , respectively.
- these values have already been determined as part of the crawling indicated at block 214 of FIG. 2 and, accordingly, at the time of the priority value calculation, are merely recalled. However, if one or more of these values was not determined as part of the crawling indicated at block 214 of FIG. 2 , such values may be determined at or near the time the priority value is being calculated.
- a priority value is calculated, as indicated at step 320 .
- such calculation may be performed utilizing the following formula:
- $$\text{Priority} = \frac{\text{URL link count}}{\text{Total number of URLs}} + \frac{\text{URL log file count}}{\text{Total number of log file entries}}$$
- the number of pages linking to the particular page as a proportion of the total pages in the domain, and the number of visits to the page as a proportion of total visits, both help to determine that web page's priority ranking.
- this value can be normalized to fall between zero and one so that standard values can be determined across multiple domains. This is indicated at block 322 .
- the largest priority value may be utilized to calculate a multiplier that brings that value to one. All priority values may then be multiplied by the multiplier to achieve a final priority value for each web page.
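The priority calculation and normalization described above can be sketched as follows. The function names are hypothetical, and treating a zero denominator as contributing zero is an assumption added to keep the sketch safe on empty inputs; the text does not address that case.

```python
def raw_priority(link_count, total_urls, log_count, total_log_entries):
    """Link share plus visit share, per the formula in the text."""
    # Assumption: a zero denominator contributes 0 rather than raising.
    link_share = link_count / total_urls if total_urls else 0.0
    visit_share = log_count / total_log_entries if total_log_entries else 0.0
    return link_share + visit_share


def normalize(priorities):
    """Scale all values so that the largest priority becomes exactly 1."""
    largest = max(priorities.values(), default=0.0)
    if largest == 0:
        return dict(priorities)  # nothing to scale
    multiplier = 1.0 / largest
    return {url: value * multiplier for url, value in priorities.items()}


raw = {
    "/index": raw_priority(4, 4, 100, 100),    # 1.0 + 1.0 = 2.0
    "/faqs.html": raw_priority(1, 4, 25, 100),  # 0.25 + 0.25 = 0.5
}
print(normalize(raw))
```

Because each share lies between zero and one, the raw sum lies between zero and two, which is why the normalization step is needed before the value is written to the sitemap.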
- a novel aspect of the present invention is the ability of a priority ranking to be generated for the sitemap file without user intervention. Thus, numerous calculations could fall within the scope and spirit of the invention.
- user modification may be permitted, if desired.
- whether or not the value has been modified may be, in and of itself, a portion of the metadata associated with the web page that may be analyzed, for instance, at block 218 of FIG. 2 , as well as the priority value itself.
- In FIG. 4, a flowchart illustrating a method for determining the frequency with which a web page is modified is shown and designated generally as reference numeral 400.
- the modification frequency is calculated during the metadata analysis step 218 of FIG. 2 .
- a web page for which a modification frequency is desired is received, typically via receipt of the web page URL.
- a modification frequency for each web page being crawled is determined and a specific indication that such value is desired for a particular web page is not necessary.
- basic metrics of the web page are determined as indicated at block 412 . These include, the time of the last modification to the web page, as well as a current time.
- the difference between the current time and last modification time is calculated to ascertain a time delta value.
- Once this delta value is known, it is compared with one or more preset threshold values, as indicated at block 416.
- the types of time periods and threshold values may vary according to variations in implementation. As an example, if the delta value is less than 24 hours, the update frequency could be deemed daily, while a value less than 10 days could be deemed weekly. Likewise, values less than four hours and less than two months could be deemed hourly and monthly, respectively. Once again, these values are meant to convey illustration only and are not intended to limit the scope of embodiments of the present invention.
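The threshold comparison can be sketched with the illustrative values from the text. Two details are assumptions not stated in the source: "two months" is taken as 60 days, and anything older falls back to the label `yearly`. The function name `change_frequency` is hypothetical.

```python
from datetime import datetime, timedelta

# Illustrative thresholds from the text, checked from smallest to largest.
THRESHOLDS = [
    (timedelta(hours=4), "hourly"),
    (timedelta(hours=24), "daily"),
    (timedelta(days=10), "weekly"),
    (timedelta(days=60), "monthly"),  # assumption: two months taken as 60 days
]


def change_frequency(last_modified, now, fallback="yearly"):
    """Map the time since the last modification to a frequency label."""
    delta = now - last_modified  # the time delta value
    for limit, label in THRESHOLDS:
        if delta < limit:
            return label
    return fallback  # assumption: older pages get the fallback label


now = datetime(2008, 2, 8, 12, 0)
for age in (timedelta(hours=1), timedelta(hours=5), timedelta(days=3), timedelta(days=30)):
    print(age, change_frequency(now - age, now))
```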
- a previous sitemap may be analyzed to compare values and determine if refinements to the modification frequency are necessary.
- previous values could be used to determine in which frequency category a web page may be placed. Using an average of previous values with the current, calculated value may aid in producing a more accurate calculation with a larger sampling of modifications.
- a current update frequency value can be associated with the web page as another portion of metadata available for analysis, for instance, at block 218 of FIG. 2 .
- In FIG. 5, a flow chart illustrating a method for generating a sitemap and notifying search engines of such sitemap is shown and designated generally as reference numeral 500.
- one or more log files associated with the URL for which sitemap generation is desired are received.
- one or more files controlling permission for programmed crawling of the web pages having the same domain as the web site URL are received, as indicated at block 512 .
- data in the received files is analyzed to determine which web pages have not been previously crawled and for which of the non-crawled subset of web pages crawling is permitted. This is indicated at block 514.
- the permitted web pages are then crawled, as indicated at block 516 .
- the sitemap file structure is determined, as indicated at block 518 , for instance, by analyzing the relationships between web pages having the same domain as the web site URL.
- metadata values including, without limitation, priority values and modification frequencies, are determined. This is indicated at block 520 .
- a sitemap file for the web site is subsequently generated, as indicated at block 522 . If desired, one or more specified search engines may be notified, as indicated at block 524 .
- In FIG. 6, a flow chart illustrating a method for generating a sitemap, in accordance with an embodiment of the present invention, is shown and designated generally as reference numeral 600.
- the URL for a web site for which sitemap generation is desired is received, although it will be understood by those of ordinary skill in the art that any web site identifier from which the web site URL may be ascertained may be received in accordance with embodiments hereof.
- the web pages having the same domain as the web site URL are crawled in accordance with log file and control permissions, as described hereinabove with reference to FIG. 2 .
- permitted web pages having the same domain as the web site URL are crawled. Once all permitted web pages have been crawled, a relational structure of the web pages is determined, as indicated at block 614 .
- web page priority values for each permitted web page associated with the web site are calculated, for instance, utilizing the method described with reference to FIG. 3.
- a change or modification frequency for each web page associated with the web site is calculated, for instance, utilizing the method described with reference to FIG. 4 .
- a sitemap is generated, as indicated at block 620 .
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
- The proliferation of the web pages available on the Internet has produced striations in production quality and complexity among web sites. Web sites for individuals and very small businesses can be fairly simple, with few hierarchical levels and relatively static “structures.” Some content changes may be of minor significance to the structure of the site and may not necessitate changes to the associated sitemap. Extensive changes create different issues. Budgets may be limited or resources scarce enough that business owners act as their own webmasters. Therefore, overhauling structure and creating a need for a new sitemap is cost-prohibitive from a resource standpoint. This can be contrasted with large businesses that may have enormous web sites with significant complexity. These undertakings are generally tackled with much larger financial wherewithal. Some manifestations include entire departments dedicated to only the web upkeep function. Other companies may outsource such flexibility, but at an expense. Thus, web sites that do undergo large-scale modification either have dedicated staff or adequate resources to document the changes. For those web sites falling in between, complexity accompanying adaptability may be required without the resources to properly document the modifications.
- Embodiments of the present invention relate to methods, systems, and computer-storage media for automated generation of a sitemap for a web site. A universal resource locator (URL) for a web site is received, the web site having a plurality of web pages with which it is associated, that is, web pages having the same domain as the web site URL. Log files are analyzed to ascertain whether each web page has been previously crawled. Other files, downloaded from the root site, contain permission controls and are analyzed to determine which web pages may be crawled and/or indexed. The permitted, not-previously-crawled web pages are subsequently crawled and the structure of the web site, that is the linking of the pages between one another, is ascertained. Other items of metadata, such as web page modification frequency or priority values, also are determined. Once the structure and metadata are available, a current sitemap is generated that provides the hierarchy and related details in the form of metadata. The sitemap file is then written to a disk and may then be sent to search engines as generated or in a compressed format. Certain embodiments can implement the generation of a new sitemap any time the web site is modified.
- This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
- Embodiments are described in detail below with reference to the attached drawing figures, wherein:
- FIG. 1 is a block diagram of an exemplary computing environment suitable for use in implementing embodiments of the present invention;
- FIG. 2 is a flowchart of a method suitable for generating a current sitemap of a web site, in accordance with an embodiment of the present invention;
- FIG. 3 is a flowchart of a method suitable for calculating a priority value for a web page, in accordance with an embodiment of the present invention;
- FIG. 4 is a flowchart of a method suitable for calculating a modification frequency for a web page, in accordance with an embodiment of the present invention;
- FIG. 5 is a flowchart of a method suitable for generating a sitemap file for a web site, in accordance with an embodiment of the present invention; and
- FIG. 6 is a flowchart of a method suitable for generating a sitemap for a web site, in accordance with an embodiment of the present invention.
- The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and/or “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
- Embodiments of the present invention relate to methods, systems, and computer storage media having computer-executable instructions embodied thereon that, when executed, perform methods for generating a sitemap file for a web site in an automated manner. Upon receiving an indication for a web site or universal resource locator (URL) domain, server log files are analyzed in conjunction with the present web site structure being crawled. Specified files denote the permissible pages to crawl and crawling occurs in accordance with such permissions. Once the web site structure (i.e., the relational structure of web pages having the same domain as the web site URL) has been determined, items of metadata such as web page priority ranking and modification frequency are automatically determined, that is, without user intervention, for each web page comprising the structure. Determined values may be modified manually if desired, or compared with previous sitemap files and server log files to refine values. The web site structure and metadata are subsequently used to generate a sitemap file for the web site. In embodiments, the sitemap file may be sent to one or more specified search engines. Embodiments further provide for compression of the sitemap file prior to transmission to a search engine if needed. Additionally, embodiments provide for an updated sitemap file to be generated each time a web page having the same domain as the web site URL is modified.
- Having briefly described an overview of embodiments of the present invention, an exemplary operating environment suitable for implementing embodiments hereof is described below.
- Referring to the drawings in general, and initially to
FIG. 1 in particular, an exemplary operating environment for implementing embodiments of the present invention is shown and designated generally ascomputing device 100.Computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should thecomputing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of modules/modules illustrated. - Embodiments may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules including routines, programs, objects, modules, data structures, and the like, refer to code that performs particular tasks, or implement particular abstract data types. Embodiments may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, specialty computing devices, etc. Embodiments may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
- With continued reference to FIG. 1, computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation modules 116, input/output (I/O) ports 118, I/O modules 120, and an illustrative power supply 122. Bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of FIG. 1 are shown with lines for the sake of clarity, in reality, delineating various modules is not so clear, and metaphorically, the lines would more accurately be grey and fuzzy. For example, one may consider a presentation module such as a display device to be an I/O module. Also, processors have memory. The inventors hereof recognize that such is the nature of the art, and reiterate that the diagram of FIG. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments. Distinction is not made between such categories as “workstation,” “server,” “laptop,” “hand-held device,” etc., as all are contemplated within the scope of FIG. 1 and reference to “computer” or “computing device.”
- Computing device 100 typically includes a variety of computer-readable media. By way of example, and not limitation, computer-readable media may comprise Random Access Memory (RAM); Read Only Memory (ROM); Electronically Erasable Programmable Read Only Memory (EEPROM); flash memory or other memory technologies; CDROM, digital versatile disks (DVD) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, carrier wave or any other medium that can be used to encode desired information and be accessed by computing device 100.
- Memory 112 includes computer-storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O modules 120. Presentation module(s) 116 present data indications to a user or other device. Exemplary presentation modules include a display device, speaker, printing module, vibrating module, etc. I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O modules 120, some of which may be built in. Illustrative modules include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
- Turning now to
FIG. 2, a flow chart illustrating a method, in accordance with an embodiment hereof, for automated sitemap file generation for a web site in accordance with the web site URL, is shown and designated generally as reference numeral 200. Initially, as indicated at block 210, a web site for which a sitemap is to be generated is received. Generally, such receipt comprises receiving the URL for the web site, although it will be understood by those of ordinary skill in the art that any web site identifier from which the web site URL may be ascertained may be received in accordance with embodiments hereof. Utilizing the web site URL domain, a sitemap file is generated, as more fully described below, based upon those web pages having the same domain as the web site URL. As an example, if the web site for which a sitemap is to be generated had a domain name of www.mywebsite.com, the sitemap file generation would be limited to web pages that have this specific root domain, such as www.mywebsite.com/index or www.mywebsite.com/faqs.html, etc. Any pages linked to the root domain that differ in domain name will not be included in the generated sitemap file. Thus, in the above example, if a page were linked to www.archive.mywebsite.com, the sitemap generation would not include this page or related links.
- Once the root domain is specified, one or more files are analyzed, as indicated at block 212. Initially, the web server logs, that is, files that log user visits to web pages based on respective URLs, are analyzed to discover those URLs that have not previously been crawled. By scanning the web server logs, a list of URLs is built from which to seed the crawler. Each URL in the log file is examined and compared to a list of URLs already present in the corresponding data structure. If the URL is not already in the data structure, it is added. Upon completion of this process, a list of URLs that act as a starting point for the crawler has been generated.
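The same-domain restriction and the log-based seeding described above can be illustrated with a short sketch. It is only one possible reading: the function names `same_domain` and `seed_urls` are hypothetical, the log is assumed to have already been reduced to a sequence of absolute URLs, and "same domain" is interpreted here as an exact host match, which is what the www.archive.mywebsite.com example suggests.

```python
from urllib.parse import urlsplit


def same_domain(url: str, root_host: str) -> bool:
    """Return True only when the URL's host matches the root host exactly."""
    host = urlsplit(url).hostname
    return host is not None and host.lower() == root_host.lower()


def seed_urls(logged_urls, root_host, known=()):
    """Build the crawler's seed list from logged URLs.

    Each logged URL is compared against the set of URLs already recorded
    and is added only if it is new and on the root host.
    """
    seen = set(known)
    seeds = []
    for url in logged_urls:
        if same_domain(url, root_host) and url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds


logged = [
    "http://www.mywebsite.com/index",
    "http://www.mywebsite.com/faqs.html",
    "http://www.archive.mywebsite.com/old.html",  # different host: skipped
    "http://www.mywebsite.com/index",             # duplicate: skipped
]
print(seed_urls(logged, "www.mywebsite.com"))
```

Passing previously crawled URLs through the `known` argument leaves only the URLs that have not yet been crawled in the seed list.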
- In embodiments, analysis of the files as indicated at block 212 may include not only analysis to discover those URLs that haven't previously been crawled but analysis of several different types of files which are capable of being examined for different forms of information. By way of example, and not limitation, the log files may be analyzed to determine the number of visits a particular web page has received. Likewise, the log files may be analyzed to determine a total number of log file entries, that is, a total number of visits to any URL logged in the log files.
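As a rough illustration of the two counts mentioned above, the sketch below tallies visits per URL and the total number of log entries. The log format is not specified in this description, so the input is assumed to be a sequence with one requested URL per logged visit; `visit_counts` is a hypothetical helper name.

```python
from collections import Counter


def visit_counts(logged_urls):
    """Return (visits per URL, total number of log entries)."""
    per_url = Counter(logged_urls)
    total = sum(per_url.values())
    return per_url, total


entries = [
    "http://www.mywebsite.com/index",
    "http://www.mywebsite.com/faqs.html",
    "http://www.mywebsite.com/index",
]
per_url, total = visit_counts(entries)
print(per_url["http://www.mywebsite.com/index"], total)
```

Both values can be kept alongside the seed list so that they are available later when a priority value is calculated.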
- In addition to log files, files downloaded from the root site that grant or deny permission to spiders to crawl portions of a web site, such as a “robots.txt” file, may be analyzed. These files delineate which web pages, links, and subsequent paths may or may not be crawled and, similarly, may or may not be included in a sitemap structure. Only those web pages where the crawler is invited to go are crawled (as more fully described below with reference to reference numeral 214). Before crawling begins, an attempt is made to retrieve and parse the robots.txt file, and a data structure is created of all off-limits base URLs. The off-limits data structure is strictly adhered to once crawling begins. It should be noted that some web sites choose not to utilize a specific robots.txt file but instead individually mark web pages as off-limits by using a robots Meta tag in the HTML of the web page. The robots Meta tag informs the crawler that it should not index the page, not follow the links contained within the page, or both. In accordance with embodiments of the present invention, any and all such identifiable permissions are followed.
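One way to honor robots.txt permissions is shown in the sketch below, which uses Python's standard `urllib.robotparser` module in place of the hand-built off-limits data structure described above. The sample robots.txt content, the user-agent string `sitemap-sketch`, and the helper name `permitted` are illustrative assumptions, and robots Meta tags are not handled here.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def permitted(urls, robots_txt, agent="sitemap-sketch"):
    """Return only the URLs that the robots.txt rules allow this agent to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


candidates = [
    "http://www.mywebsite.com/faqs.html",
    "http://www.mywebsite.com/private/admin.html",
]
print(permitted(candidates, ROBOTS_TXT))
```

Filtering the seed list this way before crawling begins keeps off-limits paths out of both the crawl and the resulting sitemap.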
- It should be further noted that analysis is not limited to log and permission files. For instance, previous sitemaps may also be analyzed for structure or for gathering details of metadata. Thus, it will be understood and appreciated by those of ordinary skill in the art that the analysis indicated at block 212 is meant to be illustrative and not restrictive, as any other files from which relevant information may be gathered may be analyzed within the spirit and scope of embodiments of the present invention.
- Subsequently, as indicated at block 214, the permissible web pages having the same domain as the web site URL are crawled. In embodiments, the permissible web pages are crawled in a traditional manner by loading the web site URL together with the URLs identified by the log file analysis (block 212). Each link on the web page is examined to see if the link has already been crawled. If it has not, the link is followed. This process is repeated until all the web pages have been examined and, effectively, the tree of pages comprising the web site structure has been crawled. During web page crawling, relevant data items are gathered about the web site, that is, data items that may aid in generating the sitemap file. One such data item is the web page URL itself. The URL is the primary piece of information, and each unique URL gathered forms an entry in the sitemap file. Other data items may include, without limitation, link information. To enable later use in determining a priority value for the web page (as more fully described below with reference to FIG. 3), the number of links each web page has from other web pages having the same domain as the URL may be gathered, as well as the number of web pages having the same domain as the URL. A link counter may be incremented each time a new link to a URL is discovered during the crawling step.
- Next, as indicated at
block 216, a relational structure of the web site is determined by examining the relationships between each permitted web page having the same domain as the web site URL. The relational structure thus takes into account the web pages that are a part of the domain, as well as the interconnections between the web pages. In this way, a hierarchical “picture” of the web site starts to form in terms of links between web pages and the routes through which the web pages may be reached.
- Next, as indicated at
block 218, one or more items of metadata related to the web pages comprising the web site are analyzed. This metadata can constitute a variety of information associated with the web pages including, without limitation, the frequency at which a page is modified, the relative importance or priority ranking of the page, whether a site administrator or other user has manually altered the modification frequency and/or priority value, and the like. In embodiments, such metadata may be determined automatically and/or set manually by a user. The analysis of two portions of metadata, priority value and modification frequency, is described in further detail below with reference to FIGS. 3 and 4, respectively. However, this list is not meant to be exhaustive, but merely to show exemplary items of metadata that may be analyzed. Other examples include plug-ins required by a page, file size of or associated with a page, whether access to a page requires a security login, and the like. Any and all such forms of metadata, and any combinations thereof, are contemplated to be within the scope of embodiments of the present invention.
- Once the relational structure and metadata for a web site are known, a current sitemap of the web site is generated, as indicated at
block 220. The generated sitemap may be created using a markup language, for example and not by way of limitation, extensible markup language (XML). Standard formats can be followed so that the sitemap conforms to protocols maximizing web site accessibility. As an example, the format offered at www.sitemap.org allows web users to provide a standard sitemap coded in XML conforming to protocols accepted by many major search engines.
- Once the sitemap is generated, the sitemap file may be written to disk. The file may optionally be compressed, for instance, utilizing the gzip compression algorithm, as known to those of ordinary skill in the art. In this embodiment, the sitemap file generally must contain no more than 50,000 URLs and must be less than 10 MB in size before compression is applied (compression is used to reduce the upload time to the search engines). If the data for the sitemap has more than 50,000 URLs or the sitemap file grows over the 10 MB file size limit, then multiple sitemap files may be created along with a sitemap index file. For legacy considerations, the user may also be provided with the capability to save the sitemap in a text format. Although considered legacy, some sites still utilize text-based sitemaps.
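The generation step can be sketched as follows. This is a minimal illustration rather than the described implementation: it assumes the XML layout of the public sitemaps.org protocol (namespace `http://www.sitemaps.org/schemas/sitemap/0.9`), takes entries as hypothetical `(loc, changefreq, priority)` tuples, and uses made-up helper names. The sitemap index file is omitted.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace of the public sitemap protocol (assumed target format).
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000                # per-file URL limit stated in the text
MAX_BYTES = 10 * 1024 * 1024     # 10 MB limit, measured before compression


def build_sitemap(entries):
    """Serialize (loc, changefreq, priority) tuples to sitemap XML bytes."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def chunk(entries, size=MAX_URLS):
    """Split entries into lists of at most `size` items, one per sitemap file."""
    entries = list(entries)
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def within_size_limit(xml_bytes):
    """Check the uncompressed size against the 10 MB limit."""
    return len(xml_bytes) < MAX_BYTES


pages = [("http://www.mywebsite.com/index", "daily", 1.0),
         ("http://www.mywebsite.com/faqs.html", "monthly", 0.4)]
for part in chunk(pages):
    xml_bytes = build_sitemap(part)
    if within_size_limit(xml_bytes):
        compressed = gzip.compress(xml_bytes)  # optional, reduces upload time
```

A caller would write each `compressed` payload (or the uncompressed bytes) to its own file and, when more than one file results, list them in a sitemap index file.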
- Once the file has been written, it may be desirable to let one or more specified search engines know by transmitting the current, up-to-date sitemap that has been generated. To inform a search engine of a sitemap change, the search engine can be “pinged” with the URL of the latest sitemap file or index, as desired.
- If desired, methods in accordance with embodiments of the present invention may provide functionality for verifying a sitemap file by comparing the file to the standard format, e.g., XML format, for a sitemap. The file will either pass or fail. If the file fails, then a list of errors may be generated, allowing the user to correct the sitemap format, for instance, prior to informing a search engine of the sitemap file.
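A pass/fail check of this kind might look like the sketch below. It is deliberately simplified: rather than validating against a full schema, it only checks well-formedness, the root element, the URL count, and the presence of a location for each entry, again assuming the sitemaps.org XML layout. The function name `verify_sitemap` and the error messages are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix in ElementTree's "{uri}tag" notation (assumed format).
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def verify_sitemap(xml_bytes, max_urls=50_000):
    """Return a list of error strings; an empty list means the file passes."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    errors = []
    if root.tag != NS + "urlset":
        errors.append(f"unexpected root element: {root.tag}")
    urls = root.findall(NS + "url")
    if len(urls) > max_urls:
        errors.append(f"too many URLs: {len(urls)} > {max_urls}")
    for index, url in enumerate(urls, start=1):
        loc = url.find(NS + "loc")
        if loc is None or not (loc.text or "").strip():
            errors.append(f"entry {index}: missing <loc>")
    return errors


good = (b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b'<url><loc>http://www.mywebsite.com/index</loc></url></urlset>')
print(verify_sitemap(good))
```

Returning the full list of errors, rather than stopping at the first one, matches the idea of letting the user correct everything before a search engine is informed.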
- Additionally, if desired, during the crawling of the web pages (as indicated at block 214), the HTML and page structure of each of the pages associated with the web site may be analyzed for search engine optimization opportunities. For instance, each web page may be compared against the top X (where X is a number that varies based on computing device performance) search engine optimization rules, and suggestions may be offered to the web site owner of changes that may allow the site to achieve a better page ranking within a search engine, or the like.
- Utilizing the
method 200, the interaction required of a site administrator or webmaster is diminished. Rather than requiring extensive user input, information that is already available is combined with algorithms, discussed hereinafter, to systematically generate the sitemap file. Additionally, a site administrator or other user may generate the sitemap locally, that is, as a client-oriented tool, rather than relying on a served application. In some embodiments, the above method can be incorporated into the generation and upkeep of a web site. Thus, modifications to the web site can lead to an automatically-generated sitemap that is current.
- Turning now to
FIG. 3, a flow chart illustrating a method for determining a priority value for one or more web pages is shown and designated generally as reference numeral 300. Typically, the priority value is calculated during the metadata analysis step 218 of FIG. 2. Initially, as indicated at block 310, a web page for which a priority value is desired is received, typically via receipt of the web page URL. In embodiments, a priority value for each web page being crawled is determined and a specific indication that such value is desired for a particular web page is not necessary. Subsequently, the number of visits a particular web page has received, as well as the total number of log file entries (that is, a total number of visits to any URL logged in the log files), are determined, as indicated in FIG. 3. Typically, these values are determined as part of the log file analysis indicated at block 212 of FIG. 2 and, accordingly, at the time of the priority value calculation, are merely recalled. However, if one or more of these values was not determined as part of the log file analysis indicated at block 212 of FIG. 2, such values may be determined via log file analysis at or near the time the priority value is being calculated.
- Referring back to
FIG. 3, prior to, subsequent to, or contemporaneous with gathering the metadata values from the log files, the number of links the web page has from other web pages having the same domain as the URL, as well as the number of web pages having the same domain as the URL, are determined, as indicated in FIG. 3. Typically, these values are determined as part of the crawling indicated at block 214 of FIG. 2 and, accordingly, at the time of the priority value calculation, are merely recalled. However, if one or more of these values was not determined as part of the crawling indicated at block 214 of FIG. 2, such values may be determined at or near the time the priority value is being calculated.
- Once the relevant data items have been determined, a priority value is calculated, as indicated at
step 320. In embodiments, such calculation may be performed utilizing the following formula: -
- Thus, both the number of links to the particular page as a proportion of the total pages in the domain and the number of visits to the page as a proportion of total visits help to determine that web page's priority ranking.
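As an illustration only, the sketch below combines the two proportions just described and applies the optional zero-to-one normalization discussed with reference to block 322. How the two proportions are combined is not stated in this text, so simply summing them is an assumption made here; the function names and sample numbers are likewise hypothetical.

```python
def raw_priority(links_to_page, total_pages, page_visits, total_visits):
    """Combine the link share and the visit share of one page.

    Summing the two shares is an assumption of this sketch, not a formula
    taken from the description.
    """
    link_share = links_to_page / total_pages if total_pages else 0.0
    visit_share = page_visits / total_visits if total_visits else 0.0
    return link_share + visit_share


def normalize(priorities):
    """Scale all values so that the largest priority becomes one."""
    largest = max(priorities.values(), default=0.0)
    if largest == 0:
        return dict(priorities)
    multiplier = 1.0 / largest
    return {url: value * multiplier for url, value in priorities.items()}


raw = {
    "/index": raw_priority(links_to_page=2, total_pages=4,
                           page_visits=30, total_visits=100),
    "/faqs.html": raw_priority(links_to_page=1, total_pages=4,
                               page_visits=15, total_visits=100),
}
print(normalize(raw))
```

Because the multiplier is derived from the largest value within one site, the normalized values are comparable across domains in the sense described: each site's most important page maps to one.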
- If desired, this value can be normalized to fall between zero and one so that standard values can be determined across multiple domains. This is indicated at block 322. In embodiments, the largest priority value may be utilized to calculate a multiplier that scales that value to one. All priority values may then be multiplied by the multiplier to achieve a respective final priority value for each. It should be noted that a novel aspect of the present invention is the ability of a priority ranking to be generated for the sitemap file without user intervention. Thus, numerous calculations could fall within the scope and spirit of the invention. Once a priority value has been generated, however, user modification may be permitted, if desired. In embodiments, whether or not the value has been modified may be, in and of itself, a portion of the metadata associated with the web page that may be analyzed, for instance, at block 218 of FIG. 2, as well as the priority value itself.
- Turning to the flowchart of
FIG. 4, a method for determining the frequency with which a web page is modified is shown and designated generally as reference numeral 400. Typically, the modification frequency is calculated during the metadata analysis step 218 of FIG. 2. Initially, as indicated at block 410, a web page for which a modification frequency is desired is received, typically via receipt of the web page URL. In embodiments, a modification frequency for each web page being crawled is determined and a specific indication that such value is desired for a particular web page is not necessary. Subsequently, basic metrics of the web page are determined, as indicated at block 412. These include the time of the last modification to the web page, as well as a current time. To determine the value for the last modified time, it is necessary to look at the source for the web page. This step may not always be possible, as it depends on how the web pages are generated. However, most web pages are generally stored as a file that is named the same as the page name in the URL (e.g., page.html, page.aspx, etc.).
- Next, as indicated at
block 414, the difference between the current time and last modification time is calculated to ascertain a time delta value. Once this delta value is known, it is compared with one or more preset threshold values, as indicated at block 416. The types of time periods and threshold values may vary according to variations in implementation. As an example, if the delta value is less than 24 hours, the update frequency could be deemed to be daily, while a value less than 10 days could be deemed weekly. Values less than four hours and less than two months could be deemed hourly and monthly, respectively. Once again, these values are provided for illustration only and are not intended to limit the scope of embodiments of the present invention.
- Next, as indicated at
block 418, a previous sitemap may be analyzed to compare values and determine if refinements to the modification frequency are necessary. As an example, if a current value is very close to a threshold value, previous values could be used to determine in which frequency category a web page may be placed. Using an average of previous values with the current, calculated value may aid in producing a more accurate calculation with a larger sampling of modifications. Upon completion, a current update frequency value can be associated with the web page as another portion of metadata available for analysis, for instance, at block 218 of FIG. 2.
- Referring now to
FIG. 5, a flow chart illustrating a method for generating a sitemap and notifying search engines of such sitemap is shown and designated generally as reference numeral 500. Initially, as indicated at block 510, one or more log files associated with the URL for which sitemap generation is desired are received. Likewise, one or more files controlling permission for programmed crawling of the web pages having the same domain as the web site URL are received, as indicated at block 512. Next, data in the received files is analyzed to determine which web pages have not been previously crawled and for which of the non-crawled subset of web pages crawling is permitted. This is indicated at block 514. The permitted web pages are then crawled, as indicated at block 516.
- Subsequently, as indicated at
block 518, for instance, by analyzing the relationships between web pages having the same domain as the web site URL. Likewise, metadata values, including, without limitation, priority values and modification frequencies, are determined. This is indicated at block 520. Utilizing the file structure, metadata values and any other relevant data items, a sitemap file for the web site is subsequently generated, as indicated at block 522. If desired, one or more specified search engines may be notified, as indicated at block 524.
- With reference now to
FIG. 6, a flow chart illustrating a method for generating a sitemap, in accordance with an embodiment of the present invention, is shown and designated generally as reference numeral 600. Initially, as indicated at block 610, the URL for a web site for which sitemap generation is desired is received. Generally, such receipt comprises receiving the URL for the web site, although it will be understood by those of ordinary skill in the art that any web site identifier from which the web site URL may be ascertained may be received in accordance with embodiments hereof. Subsequently, the web pages having the same domain as the web site URL are crawled in accordance with log file and control permissions, as described hereinabove with reference to FIG. 2. Next, as indicated at block 612, permitted web pages having the same domain as the web site URL are crawled. Once all permitted web pages have been crawled, a relational structure of the web pages is determined, as indicated at block 614.
- Subsequently, as indicated at
block 616, web page priority values for each permitted web page associated with the web site are calculated, for instance, utilizing the method described with reference to FIG. 3. Likewise, as indicated at block 618, a change or modification frequency for each web page associated with the web site is calculated, for instance, utilizing the method described with reference to FIG. 4. Subsequently, utilizing the relational structure, priority values, modification frequencies and any other relevant data items, a sitemap is generated, as indicated at block 620.
- The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.
- From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages which are obvious and inherent to the system and method. It will be understood that certain features and sub-combinations are of utility and may be employed without reference to other features and sub-combinations. This is contemplated by and is within the scope of the claims.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/028,502 US8126869B2 (en) | 2008-02-08 | 2008-02-08 | Automated client sitemap generation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/028,502 US8126869B2 (en) | 2008-02-08 | 2008-02-08 | Automated client sitemap generation |
Publications (2)
Publication Number | Publication Date |
---|---|
US20090204638A1 (en) | 2009-08-13 |
US8126869B2 (en) | 2012-02-28 |
Family
ID=40939800
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/028,502 Expired - Fee Related US8126869B2 (en) | 2008-02-08 | 2008-02-08 | Automated client sitemap generation |
Country Status (1)
Country | Link |
---|---|
US (1) | US8126869B2 (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090171993A1 (en) * | 2007-09-28 | 2009-07-02 | Xcerion Ab | Network operating system |
US20100235339A1 (en) * | 2009-02-09 | 2010-09-16 | PixelSilk | Search Advice Systems and Methods |
US20120005187A1 (en) * | 2010-07-02 | 2012-01-05 | Philippe Chavanne | Web Site Content Management Techniques |
US20120284252A1 (en) * | 2009-10-02 | 2012-11-08 | David Drai | System and Method For Search Engine Optimization |
US8402013B2 (en) | 2010-06-25 | 2013-03-19 | Microsoft Corporation | Rich site maps |
US8898296B2 (en) | 2010-04-07 | 2014-11-25 | Google Inc. | Detection of boilerplate content |
EP2707808A4 (en) * | 2011-05-13 | 2015-10-21 | Microsoft Technology Licensing Llc | USE OF QUERY LOOKING PROTOCOLS FOR DOMAIN RECOGNITION IN UNDERSTANDING SPOKEN LANGUAGE |
US20160004674A1 (en) * | 2014-07-04 | 2016-01-07 | Yandex Europe Ag | Method of and system for determining creation time of a web resource |
CN105260469A (en) * | 2015-10-16 | 2016-01-20 | 广州神马移动信息科技有限公司 | Sitemap processing method, apparatus and device |
US9286378B1 (en) * | 2012-08-31 | 2016-03-15 | Facebook, Inc. | System and methods for URL entity extraction |
US9330093B1 (en) * | 2012-08-02 | 2016-05-03 | Google Inc. | Methods and systems for identifying user input data for matching content to user interests |
US9430567B2 (en) | 2012-06-06 | 2016-08-30 | International Business Machines Corporation | Identifying unvisited portions of visited information |
US9558176B2 (en) | 2013-12-06 | 2017-01-31 | Microsoft Technology Licensing, Llc | Discriminating between natural language and keyword language items |
US9934319B2 (en) | 2014-07-04 | 2018-04-03 | Yandex Europe Ag | Method of and system for determining creation time of a web resource |
CN108255831A (en) * | 2016-12-28 | 2018-07-06 | 航天信息股份有限公司 | A kind of method and system for being used to generate site maps for website |
US20190228105A1 (en) * | 2018-01-24 | 2019-07-25 | Rocket Fuel Inc. | Dynamic website content optimization |
US10678869B2 (en) * | 2013-05-31 | 2020-06-09 | Verizon Media Inc. | Systems and methods for selective distribution of online content |
CN114095234A (en) * | 2021-11-17 | 2022-02-25 | 北京知道创宇信息技术股份有限公司 | Honeypot generation method, honeypot generation device, server and computer-readable storage medium |
US11366862B2 (en) * | 2019-11-08 | 2022-06-21 | Gap Intelligence, Inc. | Automated web page accessing |
US11709909B1 (en) * | 2022-01-31 | 2023-07-25 | Walmart Apollo, Llc | Systems and methods for maintaining a sitemap |
US11838851B1 (en) | 2014-07-15 | 2023-12-05 | F5, Inc. | Methods for managing L7 traffic classification and devices thereof |
US11895138B1 (en) * | 2015-02-02 | 2024-02-06 | F5, Inc. | Methods for improving web scanner accuracy and devices thereof |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2281246A4 (en) * | 2008-04-17 | 2012-07-25 | Google Inc | Generating sitemaps |
CN105543831B (en) * | 2015-12-14 | 2018-10-09 | 华南理工大学 | It is a kind of alkalinity tungstates passivating solution and be passivated chemical plating Ni-P coating method |
CN105525282B (en) * | 2016-01-15 | 2018-04-13 | 华南理工大学 | A kind of alkalescence chromium-free passivation liquid and its normal temperature passivated Electroless Plating Ni P layers of method |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6516337B1 (en) * | 1999-10-14 | 2003-02-04 | Arcessa, Inc. | Sending to a central indexing site meta data or signatures from objects on a computer network |
US6525748B1 (en) * | 1996-07-17 | 2003-02-25 | Microsoft Corporation | Method for downloading a sitemap from a server computer to a client computer in a web environment |
US20040267739A1 (en) * | 2000-02-24 | 2004-12-30 | Dowling Eric Morgan | Web browser with multilevel functions |
US6957383B1 (en) * | 1999-12-27 | 2005-10-18 | International Business Machines Corporation | System and method for dynamically updating a site map and table of contents for site content changes |
US20060070004A1 (en) * | 2004-09-30 | 2006-03-30 | Microsoft Corporation | System and method for unified navigation |
US20060101330A1 (en) * | 2004-11-08 | 2006-05-11 | Taiwan Semiconductor Manufacturing Company, Ltd. | Browser sitemap viewer |
US20070050338A1 (en) * | 2005-08-29 | 2007-03-01 | Strohm Alan C | Mobile sitemaps |
US20070124506A1 (en) * | 2005-10-27 | 2007-05-31 | Brown Douglas S | Systems, methods, and media for dynamically generating a portal site map |
US20070244883A1 (en) * | 2006-04-14 | 2007-10-18 | Websidestory, Inc. | Analytics Based Generation of Ordered Lists, Search Engine Fee Data, and Sitemaps |
US7599920B1 (en) * | 2006-10-12 | 2009-10-06 | Google Inc. | System and method for enabling website owners to manage crawl rate in a website indexing system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080178122A1 (en) | 2006-02-03 | 2008-07-24 | Crown Partners,Llc | System and method for website configuration and management |
EP1840765A1 (en) | 2006-03-02 | 2007-10-03 | Indigen Solutions SARL | Process for extracting data from a web site |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8738567B2 (en) * | 2007-09-28 | 2014-05-27 | Xcerion Aktiebolag | Network file system with enhanced collaboration features |
US20090171993A1 (en) * | 2007-09-28 | 2009-07-02 | Xcerion Ab | Network operating system |
US20090192969A1 (en) * | 2007-09-28 | 2009-07-30 | Xcerion Aktiebolag | Network operating system |
US11838358B2 (en) | 2007-09-28 | 2023-12-05 | Xcerion Aktiebolag | Network operating system |
US9344497B2 (en) | 2007-09-28 | 2016-05-17 | Xcerion Aktiebolag | State management of applications and data |
US8112460B2 (en) * | 2007-09-28 | 2012-02-07 | Xcerion Aktiebolag | Framework for applying rules |
US9071623B2 (en) | 2007-09-28 | 2015-06-30 | Xcerion Aktiebolag | Real-time data sharing |
US8234315B2 (en) * | 2007-09-28 | 2012-07-31 | Xcerion Aktiebolag | Data source abstraction system and method |
US20090192992A1 (en) * | 2007-09-28 | 2009-07-30 | Xcerion Aktiebolag | Network operating system |
US20100235339A1 (en) * | 2009-02-09 | 2010-09-16 | PixelSilk | Search Advice Systems and Methods |
US10346483B2 (en) * | 2009-10-02 | 2019-07-09 | Akamai Technologies, Inc. | System and method for search engine optimization |
US20120284252A1 (en) * | 2009-10-02 | 2012-11-08 | David Drai | System and Method For Search Engine Optimization |
US8898296B2 (en) | 2010-04-07 | 2014-11-25 | Google Inc. | Detection of boilerplate content |
US8402013B2 (en) | 2010-06-25 | 2013-03-19 | Microsoft Corporation | Rich site maps |
US20120005187A1 (en) * | 2010-07-02 | 2012-01-05 | Philippe Chavanne | Web Site Content Management Techniques |
EP2707808A4 (en) * | 2011-05-13 | 2015-10-21 | Microsoft Technology Licensing Llc | USE OF QUERY LOOKING PROTOCOLS FOR DOMAIN RECOGNITION IN UNDERSTANDING SPOKEN LANGUAGE |
US9430567B2 (en) | 2012-06-06 | 2016-08-30 | International Business Machines Corporation | Identifying unvisited portions of visited information |
US10671584B2 (en) | 2012-06-06 | 2020-06-02 | International Business Machines Corporation | Identifying unvisited portions of visited information |
US9916337B2 (en) | 2012-06-06 | 2018-03-13 | International Business Machines Corporation | Identifying unvisited portions of visited information |
US9330093B1 (en) * | 2012-08-02 | 2016-05-03 | Google Inc. | Methods and systems for identifying user input data for matching content to user interests |
US9286378B1 (en) * | 2012-08-31 | 2016-03-15 | Facebook, Inc. | System and methods for URL entity extraction |
US12056195B2 (en) | 2013-05-31 | 2024-08-06 | Yahoo Ad Tech Llc | Systems and methods for selective distribution of online content |
US11704372B2 (en) | 2013-05-31 | 2023-07-18 | Yahoo Ad Tech Llc | Systems and methods for selective distribution of online content |
US11042593B2 (en) | 2013-05-31 | 2021-06-22 | Verizon Media Inc. | Systems and methods for selective distribution of online content |
US10678869B2 (en) * | 2013-05-31 | 2020-06-09 | Verizon Media Inc. | Systems and methods for selective distribution of online content |
US9558176B2 (en) | 2013-12-06 | 2017-01-31 | Microsoft Technology Licensing, Llc | Discriminating between natural language and keyword language items |
US9934319B2 (en) | 2014-07-04 | 2018-04-03 | Yandex Europe Ag | Method of and system for determining creation time of a web resource |
US20160004674A1 (en) * | 2014-07-04 | 2016-01-07 | Yandex Europe Ag | Method of and system for determining creation time of a web resource |
US9692804B2 (en) * | 2014-07-04 | 2017-06-27 | Yandex Europe Ag | Method of and system for determining creation time of a web resource |
US11838851B1 (en) | 2014-07-15 | 2023-12-05 | F5, Inc. | Methods for managing L7 traffic classification and devices thereof |
US11895138B1 (en) * | 2015-02-02 | 2024-02-06 | F5, Inc. | Methods for improving web scanner accuracy and devices thereof |
CN105260469A (en) * | 2015-10-16 | 2016-01-20 | 广州神马移动信息科技有限公司 | Sitemap processing method, apparatus and device |
CN108255831A (en) * | 2016-12-28 | 2018-07-06 | 航天信息股份有限公司 | Method and system for generating a site map for a website |
US20190228105A1 (en) * | 2018-01-24 | 2019-07-25 | Rocket Fuel Inc. | Dynamic website content optimization |
US11366862B2 (en) * | 2019-11-08 | 2022-06-21 | Gap Intelligence, Inc. | Automated web page accessing |
CN114095234A (en) * | 2021-11-17 | 2022-02-25 | 北京知道创宇信息技术股份有限公司 | Honeypot generation method, honeypot generation device, server and computer-readable storage medium |
US11709909B1 (en) * | 2022-01-31 | 2023-07-25 | Walmart Apollo, Llc | Systems and methods for maintaining a sitemap |
US20230244742A1 (en) * | 2022-01-31 | 2023-08-03 | Walmart Apollo, Llc | Systems and methods for maintaining a sitemap |
Also Published As
Publication number | Publication date |
---|---|
US8126869B2 (en) | 2012-02-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8126869B2 (en) | | Automated client sitemap generation |
US9436719B2 (en) | | Updating an inverted index in a real time fashion |
US9355177B2 (en) | | Web crawler scheduler that utilizes sitemaps from websites |
US6638314B1 (en) | | Method of web crawling utilizing crawl numbers |
US11599499B1 (en) | | Third-party indexable text |
US7917503B2 (en) | | Specifying relevance ranking preferences utilizing search scopes |
US6959326B1 (en) | | Method, system, and program for gathering indexable metadata on content at a data repository |
US8285702B2 (en) | | Content analysis simulator for improving site findability in information retrieval systems |
CN1755676B (en) | | System and method for batched indexing of network documents |
US20100191744A1 (en) | | Ranking functions using document usage statistics |
US7925641B2 (en) | | Indexing web content of a runtime version of a web page |
US7653654B1 (en) | | Method and system for selectively accessing files accessible through a network |
US20080027971A1 (en) | | Method and system for populating an index corpus to a search engine |
US20080168037A1 (en) | | Integrating enterprise search systems with custom access control application programming interfaces |
US20090119329A1 (en) | | System and method for providing visibility for dynamic webpages |
US8260766B2 (en) | | Embedded communication of link information |
US20050216845A1 (en) | | Utilizing cookies by a search engine robot for document retrieval |
US8775443B2 (en) | | Ranking of business objects for search engines |
US8073861B2 (en) | | Identifying opportunities for effective expansion of the content of a collaboration application |
US20030163465A1 (en) | | Processing information about occurrences of multiple types of events in a consistent manner |
US20080208831A1 (en) | | Controlling search indexing |
Chala et al. | | Hybrid Method of Ranking Query Results in Search Engines |
CA2545366A1 (en) | | Method and system for populating an index corpus to a search engine |
Abd Wahab | | A project submitted to the Faculty of Information Technology in partial fulfillment of the requirements for the degree Master of Science (Intelligent Knowledge Based System) |
Jackson | | Difficulties in Electronic Publication Archival Processing for State Governments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: MICROSOFT CORPORATION, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HOLLIER, IAN V.;HIEMSTRA, MARTINA;SIGNING DATES FROM 20080128 TO 20080207;REEL/FRAME:020920/0331 |
| FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| AS | Assignment | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034542/0001 Effective date: 20141014 |
| CC | Certificate of correction | |
| FPAY | Fee payment | Year of fee payment: 4 |
| FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| FP | Lapsed due to failure to pay maintenance fee | Effective date: 20200228 |