CN102236719A - Page search engine based on page classification and quick search method - Google Patents
- Publication number
- CN102236719A
- Authority
- CN
- China
- Prior art keywords
- webpage
- classification
- search
- display
- search engine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a search engine with classified display and a fast retrieval method. The search engine includes a server-side classification module that classifies each web page according to the Chinese Library Classification and stores the classification index in the web page index library. The result display module shows, in separate columns, the web page index entries that match the keywords and the web page categories related to the keywords. By using web page classification information, the search engine helps users find the pages they are interested in more quickly and more accurately, by category.
Description
Technical field
The invention belongs to the technical field of search engine optimization, and in particular relates to a web page search engine based on web page classification and a quick search method.
Background
With the rapid development of Internet technology, large numbers of new web pages appear every day and the number of pages on the Internet is growing sharply. Web pages contain a wealth of information, so being able to quickly find the information of interest within this mass of pages is particularly important. If useful information could not be found on the web within a limited, tolerable time, the Internet would not have developed so rapidly or had such wide influence.
An Internet search engine extracts the information contained in web pages and stores it in a database. When a user enters keywords on the search page, the search engine retrieves the pages related to those keywords from the database and presents them to the user. The earliest search engine can be traced back to the Archie system, designed and implemented in 1990 by three students at McGill University in Canada. The World Wide Web did not yet exist at that time, and Archie was not a search engine in today's sense: by entering a file name, a user could find out which FTP servers held a downloadable file with that name. With the emergence of the web, many Internet search engines of different kinds appeared one after another, such as WebCrawler, Excite, Infoseek, AltaVista, and Yahoo. WebCrawler, implemented in 1994 by Brian Pinkerton at the University of Washington, was the first full-text search engine; it could search for any word on any web page.
Yahoo, strictly speaking, is a directory-style search engine rather than a full-text search engine. Its directory pages collect a large number of categorized websites. Page owners can register their own pages in the directory under the appropriate category, and other users can then browse the directory to find pages of interest. Other web directories include the Sina directory and DMOZ (www.dmoz.org). To make better use of the Internet for promoting their products, and to have their product pages appear earlier in search results, many commercial users began to study the rules by which websites are ranked in directory pages. They tried various methods of adjusting their own site's ranking there, so that the site would be found more easily by search engines and appear higher in search results. This gave rise to search engine optimization. In addition, search engine crawlers often start from these directory pages, so pages listed in them are more likely to be crawled and thus to appear in search results. Directory pages are not true search engines, but they can be deliberately exploited to manipulate search results.
Unlike directory-style search engines, Google uses the PageRank technique, which factors the links a page receives from other pages (especially frequently visited ones) into the priority of search results, so that pages linked to by more pages are more likely to be found. This greatly improves the relevance of search results to the input keywords. Commercial search engines such as Google and Baidu use this principle.
A search engine generally involves the following steps: web crawling, word segmentation, indexing, and searching. For each page, a relevance degree is established between the page and the segmented words it contains, based on how often each word occurs, and this is stored in an index database for use at search time. When a search is run, the pages associated with the input keywords are generally listed in order of relevance, so the most relevant pages come first and are the easiest for the user to find. Users typically enter only a few keywords; two or three is very common. As a result, the pages returned are often not the ones the user actually wanted, and users frequently browse page after page of results without finding them. To better deliver the results users really need, search engine companies have adopted a number of methods. Baidu and Bing can present results by type, such as images, videos, and news, so that users can search more accurately within those types. Google and Youdao offer the same and can also filter results by time, for example showing only pages updated within the past day, week, month, or year. Search engines also add, next to each result, a link to similar pages, or show a summary or snapshot of each result, so that the user can tell whether a page is of interest without visiting it. These methods make better, more accurate results more likely, but the result pages may still contain many results of no interest. This information needs to be filtered so that, as far as possible, only pages of interest are presented to the user.
Summary of the invention
The purpose of the invention is to provide a search engine with classified display, solving the problems in the prior art that the results displayed by a search engine often fail to serve the purpose of the user's search, or that the retrieved information is too cluttered for the user to find accurate information.
To solve these problems in the prior art, the technical solution provided by the invention is:
A search engine with classified display, comprising on the server side:
a web crawling and preprocessing module, used to automatically collect web pages from the network, preprocess them to convert the page information into formatted, computer-readable text, and periodically update page information and crawl new pages;
an indexing module, used to segment the formatted text produced by the web crawling and preprocessing module into words, and to build a web page index library that records, for each page, a relevance degree derived from the segmented words it contains and the frequency with which they occur;
a query module, used to respond to query requests from the client and search the web page index library built by the indexing module to obtain a list of search results matching the query request;
and on the client side:
a result display module, used to let the user enter a keyword query request, obtain from the server-side query module the list of search results matching the keywords, and display it to the user in descending order of relevance;
characterized in that the search engine further comprises a server-side classification module, used to classify each web page according to the Chinese Library Classification and store the classification index in the web page index library; and in that the result display module shows, in separate columns, the web page index entries matching the keywords and the web page categories related to the keywords.
Preferably, each web page on the search engine's server side is classified according to the Chinese Library Classification.
Preferably, the classification is performed manually or by machine learning using a particle swarm optimization algorithm.
Preferably, the result display module includes two display windows arranged as columns: the first display window shows the list of web pages matching the keywords, and the second display window shows the web page categories related to the keywords.
Preferably, the two display windows are arranged side by side, with the second display window on the left and the first display window on the right; when a web page list item has focus in the first display window, the corresponding web page category in the second display window is shown in bold or in color.
Preferably, each web page list item in the first display window has a hyperlink to the corresponding website; when the user hovers the mouse over a list item, the category of that item is shown in bold or changes color in the second display window, and a snapshot of the page is shown to the right of the list item.
Preferably, when the user selects a web page list item in the first display window, the client directly opens the corresponding website link for the user to browse; when the user selects a web page category in the second display window, the first display window shows, grouped together, the list of pages belonging to the selected category.
Preferably, the second display window further includes web page update time options, placed below the web page categories for the user to choose from.
The invention also provides a method for quickly finding web page search results based on web page classification, characterized in that the method comprises the following steps:
(1) receiving a query request from the client;
(2) in response to the query request, executing the query on a web page index library, which indexes for each page the relevance degree accumulated from the segmented words it contains and their frequencies, and in which each page is classified, to obtain a list of search results matching the query request;
(3) sorting the search list in descending order of relevance, and generating a page that presents the sorted search list and the web page categories of the search list to the user in two columns.
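Step (3) can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the `Hit` record, the `build_result_page` function, the example URLs, and the category labels are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    """One search result (hypothetical record layout)."""
    url: str
    relevance: float
    categories: Tuple[str, ...]  # illustrative category labels


def build_result_page(hits: List[Hit]) -> dict:
    """Sort hits by relevance (descending) and derive the category column."""
    ranked = sorted(hits, key=lambda h: h.relevance, reverse=True)
    # Collect categories in order of first appearance, without duplicates.
    seen = {}
    for hit in ranked:
        for cat in hit.categories:
            seen.setdefault(cat, None)
    return {"results": [h.url for h in ranked], "categories": list(seen)}


hits = [
    Hit("http://example.com/a", 0.4, ("technology",)),
    Hit("http://example.com/b", 0.9, ("education",)),
    Hit("http://example.com/c", 0.7, ("technology", "education")),
]
page = build_result_page(hits)
```

Here `page["results"]` feeds the result column and `page["categories"]` feeds the category column; how the two columns are rendered is left open.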
The invention can provide users with more accurate search results. By using web page classification information to display search results by category, it lets users find the pages they are interested in more quickly. This method of quickly finding web page search results based on web page classification can be combined with existing commercial search engines, so that users can quickly locate the information they need within the results those engines return.
After pages have been crawled, segmented, and indexed, and a relevance degree has been established between each page and the segmented words it contains and their frequencies, each page is classified according to the Chinese Library Classification. A page may belong to one or several categories. All of this information is stored in the database for use at search time.
At search time, the search engine's display page is divided into a left pane and a right pane: the left shows web page classification information and the right shows search results. After the user enters keywords, the initial search results appear on the right, and the category of the first-ranked page is shown on the left. The left pane also contains direct links back to the initial search results and their page numbers. Selecting another page in the results on the right causes the left pane to show the category of that page. Double-selecting any page on the right opens it for the user to browse. Selecting a category on the left causes the right pane to show those results, out of all the search engine's results, that belong to that category. The links shown on the right then all belong to the category the user is interested in, so the user can find the required information more quickly and more accurately. Methods already used by search engines, such as giving a summary or snapshot of each result, can of course be combined with this approach.
While pages are being crawled, segmented, and indexed, and the relevance degree between each page and its segmented words and their frequencies is being established, a step is added as shown in Figure 1 that classifies each page according to the Chinese Library Classification and stores the result in the database. The Chinese Library Classification used in this patent can equally be replaced by another web page classification scheme. Pages can be classified manually, or by machine learning methods such as a particle swarm optimization algorithm.
When a search is run, the results returned by the search engine are displayed as shown in Figure 2. The search results appear on the right, and the left pane indicates the web page category by showing the corresponding category in bold or in color. The upper part of the left pane can also show direct links to the initial search results and their page numbers. These links let the right pane return at any time to the search engine's own results; for example, choosing the second page makes the right pane show the second page of the search engine's results directly. When results are first displayed, the left pane shows the category of the first-ranked page. If one of the links on the right is selected, the category of that page is shown on the left. Selecting a category on the left makes the right pane show the links of all pages, out of all the search engine's results, that belong to that category. Double-selecting a link on the right opens the page directly for browsing. If the user already knows the category of the page being sought, the user can browse directly to that category in the left pane and select it. The right pane then shows only the links in the search results that belong to that category. In this way users can quickly and directly find the pages they are interested in.
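The interaction between the two panes can be sketched as three small functions. This is an assumption-laden illustration: the result record layout (`url`, `categories`) and the function names `initial_view`, `categories_for`, and `filter_by_category` are hypothetical and do not come from the patent.

```python
def initial_view(results):
    """First display: all results on the right, first result's categories on the left."""
    first = results[0] if results else None
    return {
        "right": [r["url"] for r in results],
        "left_highlight": list(first["categories"]) if first else [],
    }


def categories_for(results, url):
    """Categories to highlight on the left when a result is selected on the right."""
    for r in results:
        if r["url"] == url:
            return list(r["categories"])
    return []


def filter_by_category(results, category):
    """Results to show on the right after a category is selected on the left."""
    return [r["url"] for r in results if category in r["categories"]]


# Results are assumed to arrive already ranked by the search engine.
ranked = [
    {"url": "http://example.com/1", "categories": ["economics"]},
    {"url": "http://example.com/2", "categories": ["technology", "economics"]},
    {"url": "http://example.com/3", "categories": ["technology"]},
]
view = initial_view(ranked)
tech_only = filter_by_category(ranked, "technology")
```

Filtering preserves the original rank order, so the category view is a subset of the search engine's own ordering rather than a re-ranking.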
Methods already used by search engines can be combined with this approach to serve users better, such as giving a summary or snapshot of each result, or showing only pages updated within the past day, week, month, or year.
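An update-time filter of this kind might look like the following sketch. The `updated` field, the `WINDOWS` table, and the approximation of a month as 30 days and a year as 365 days are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical window definitions; month and year lengths are approximations.
WINDOWS = {
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def updated_within(results, window, now):
    """Keep only results whose last update falls inside the chosen window."""
    cutoff = now - WINDOWS[window]
    return [r for r in results if r["updated"] >= cutoff]


now = datetime(2011, 7, 25, 12, 0)
results = [
    {"url": "http://example.com/fresh", "updated": datetime(2011, 7, 25, 8, 0)},
    {"url": "http://example.com/recent", "updated": datetime(2011, 7, 20, 8, 0)},
    {"url": "http://example.com/old", "updated": datetime(2010, 1, 1, 8, 0)},
]
last_week = updated_within(results, "week", now)
```

Passing `now` explicitly keeps the function deterministic; a caller would normally supply the current time.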
Compared with the solutions in the prior art, the advantage of the invention is:
By using web page classification information, the invention helps users find the pages they are interested in more quickly and more accurately through a search engine, by category.
Description of drawings
The invention is further described below with reference to the drawings and an embodiment:
Figure 1 is a flow chart of the server-side operation of the search engine with classified display according to the invention;
Figure 2 is a flow chart of the client-side operation of the search engine with classified display according to the invention.
Detailed description
The above solution is further described below with reference to a specific embodiment. It should be understood that the embodiment is intended to illustrate the invention, not to limit its scope. The implementation conditions used in the embodiment can be further adjusted according to the conditions of the specific manufacturer, and implementation conditions not specified are generally those of routine experiments.
Example
In the search engine with classified display according to the invention, the server side includes: a web crawling and preprocessing module, used to automatically collect web pages from the network, preprocess them to convert the page information into formatted, computer-readable text, and periodically update page information and crawl newly created pages; an indexing module, used to segment the formatted text produced by the web crawling and preprocessing module into words, and to build a web page index library that records, for each page, a relevance degree derived from the segmented words it contains and the frequency with which they occur; a query module, used to respond to query requests from the client and search the web page index library built by the indexing module to obtain a list of search results matching the query request; and a classification module, used to classify each web page according to the Chinese Library Classification and store the classification index in the web page index library.
As shown in Figures 1 and 2, crawling uses the Sina directory page (http://dir.iask.com/) as the starting page. New pages are crawled in order according to the list of hyperlinks on that page, and the new hyperlinks found on each crawled page are added to the hyperlink list to supply links for subsequent crawling. Each crawled page is marked in the hyperlink list as already crawled in the current pass, which avoids crawling the same page repeatedly or entering an endless loop. The crawled page information is preprocessed, for example by removing the markup from HTML files, to obtain the page text. In the first crawl, all crawled pages and their corresponding text are stored in the database for use in subsequent steps. Pages that could not be fetched are also recorded in the database but marked as nonexistent. After the first crawl, pages are crawled again at regular intervals (for example at 10 p.m. every night). If a page has been updated, its information is fetched, processed, and stored in the database; if it has not, the page is not fetched and its content in the database is left unchanged. For updated pages, the hyperlinks they contain are checked for new pages; any new pages are added to the hyperlink list and crawled. During crawling, the number of times each page is linked to by other pages is recorded and stored in the database. Finally, all stored pages are classified according to their content using the Chinese Library Classification, and the categories are stored in the database.
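The crawl loop described above (a growing hyperlink list, a crawled mark to avoid loops, a record of missing pages, and an inbound-link count) can be sketched as follows. This is a minimal sketch, not the embodiment's code: `crawl` and `fetch_links` are hypothetical names, and the network is replaced by an in-memory link table so that the example runs offline.

```python
from collections import Counter, deque


def crawl(seed, fetch_links, max_pages=1000):
    """Breadth-first crawl from `seed`.

    `fetch_links(url)` is a caller-supplied function (an assumption of this
    sketch) returning the list of hyperlinks on a page, or None if the page
    could not be fetched.
    """
    frontier = deque([seed])   # the hyperlink list still to be crawled
    visited = set()            # pages already crawled in this pass
    missing = set()            # pages recorded as nonexistent
    inlinks = Counter()        # how many other pages link to each page
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue           # skip pages already marked as crawled
        visited.add(url)
        links = fetch_links(url)
        if links is None:
            missing.add(url)
            continue
        for target in dict.fromkeys(links):  # de-duplicate, keep order
            if target != url:
                inlinks[target] += 1         # count each linking page once
            if target not in visited:
                frontier.append(target)
    return visited, missing, inlinks


# A tiny in-memory "web" standing in for real HTTP fetches.
web = {
    "seed": ["a", "b"],
    "a": ["b", "seed"],
    "b": ["a", "gone"],
}
visited, missing, inlinks = crawl("seed", web.get)
```

The `visited` set is what prevents the endless loop between `a` and `b`; the inbound-link counts are kept for use in ranking.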
In the database, each newly crawled page is segmented into words using established segmentation dictionaries (such as a two-character word dictionary, a three-character word dictionary, a proper noun dictionary, and so on). The relevance degree between the page and each word is then established from the frequency with which the word occurs in the page, which in turn gives the relevance degree between each word and all the pages that contain it, for use in keyword search. When the user enters the keywords to search for, the query module finds the pages related to the keywords and sorts them. The order is determined by the relevance degree between the page and the input keywords, the number of keywords the page is associated with, and the number of times the page is linked to by other pages. The more keywords a page is associated with, the higher its relevance degree, and the more often it is linked to by other pages, the higher it ranks.
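A minimal sketch of the index and ranking just described is given below. Several parts are assumptions: whitespace splitting stands in for dictionary-based word segmentation, relative term frequency stands in for the relevance degree, and the three ranking factors are combined as a sort key (matched keywords, then summed relevance, then inbound links) because the text does not give a formula. The names `build_index` and `search` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Map each term to {url: relative frequency of the term in that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        terms = text.lower().split()   # stand-in for dictionary segmentation
        for term, n in Counter(terms).items():
            index[term][url] = n / len(terms)
    return index


def search(index, inlinks, query):
    """Return URLs ranked by matched keywords, relevance, then inbound links."""
    scores = defaultdict(lambda: [0, 0.0])  # url -> [matched, summed relevance]
    for term in set(query.lower().split()):
        for url, rel in index.get(term, {}).items():
            scores[url][0] += 1
            scores[url][1] += rel
    return sorted(
        scores,
        key=lambda u: (scores[u][0], scores[u][1], inlinks.get(u, 0)),
        reverse=True,
    )


pages = {
    "p1": "search engine search",
    "p2": "engine design",
    "p3": "web page classification",
}
index = build_index(pages)
ranked = search(index, {"p2": 5}, "search engine")
```

In this example `p1` matches both keywords and so ranks above `p2`, even though `p2` has more inbound links; a different weighting of the three factors would be equally consistent with the description.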
用户端包括结果显示模块,用于供用户输入关键词查询请求,并从服务器端的查询模块获得与关键词相匹配的搜索结果列表,并按照关联度由大到小的顺序排列后展示给用户;所述结果显示模块通过分栏显示与关键词相匹配的网页索引和与关键词相关的网页分类。The client includes a result display module, which is used for the user to input a keyword query request, and obtains a list of search results matching the keyword from the query module on the server side, and displays them to the user after being arranged in descending order of relevance; The result display module displays the web page index matching the keyword and the web page classification related to the keyword by columns.
服务器端每个网页按照中国图书馆图书分类法进行分类,分类采用人工分类。用户端结果显示模块包括两栏显示窗口,第一显示窗口用于显示与关键词相匹配的网页列表;第二显示窗口用于显示与关键词相关的网页分类。所述两栏显示窗口呈左右设置,左侧为第二显示窗口,右侧为第一显示窗口;第一显示窗口聚焦当前显示网页列表项时,第二显示窗口相应网页类别加粗或加色。所述第一显示窗口内网页列表项具有与相应网页网站链接的超链接,当用户通过鼠标停留在网页列表项时,相应网页列表项对应的网页类别在第二显示窗口内加粗或变色,相应网页列表项右侧呈现与相应网页列表项相关的网页快照。Each webpage on the server side is classified according to the book classification method of the Chinese library, and the classification adopts manual classification. The user terminal result display module includes two display windows. The first display window is used to display the list of webpages matching the keyword; the second display window is used to display the classification of webpages related to the keyword. The two-column display windows are arranged left and right, the left side is the second display window, and the right side is the first display window; when the first display window focuses on the currently displayed webpage list item, the corresponding webpage category of the second display window is bolded or colored . The webpage list item in the first display window has a hyperlink linked to the corresponding webpage website, and when the user stays on the webpage list item by the mouse, the corresponding webpage category of the corresponding webpage list item is bolded or discolored in the second display window, A webpage snapshot related to the corresponding webpage list item is displayed on the right side of the corresponding webpage list item.
当用户选择第一显示窗口内的网页列表项时,用户端将直接打开相应网页网站链接供用户浏览;当用户选择第二显示窗口内的网页分类时,第一显示窗口聚类显示用户选择的同一网页分类的网页列表。所述第二显示窗口还包括网页更新时间选项,所述网页更新时间选项设置在网页分类下端供用户选择。When the user selects the webpage list item in the first display window, the user terminal will directly open the corresponding webpage website link for the user to browse; A list of web pages in the same web category. The second display window also includes a webpage update time option, and the webpage update time option is set at the bottom of the webpage category for users to choose.
进行查询时,服务器端先接收用户端的查询请求;然后响应查询请求,在索引有每一网页与它所包含的分词及分词出现的频率累计的关联度并对每一网页进行分类的网页索引库上执行查询以获得与查询请求匹配的搜索结果列表;最后将搜索列表按照关联度进行降序排列并将排序后的搜索列表和搜索列表的网页类别分成两栏生成网页呈现给用户。When performing a query, the server first receives the query request from the client; then responds to the query request, and indexes the web page index library that has the cumulative association degree between each web page and the words it contains and the frequency of occurrence of the word, and classifies each web page Execute a query on the search engine to obtain a search result list that matches the query request; finally, sort the search list in descending order according to the degree of relevance and divide the sorted search list and the web page category of the search list into two columns to generate a web page and present it to the user.
The above examples only illustrate the technical concept and features of the present invention. Their purpose is to enable those familiar with this technology to understand the content of the invention and implement it accordingly; they do not limit the scope of protection of the invention. All equivalent changes or modifications made in accordance with the spirit of the present invention shall fall within its scope of protection.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011102076463A CN102236719A (en) | 2011-07-25 | 2011-07-25 | Page search engine based on page classification and quick search method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011102076463A CN102236719A (en) | 2011-07-25 | 2011-07-25 | Page search engine based on page classification and quick search method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102236719A (en) | 2011-11-09 |
Family
ID=44887365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011102076463A Pending CN102236719A (en) | 2011-07-25 | 2011-07-25 | Page search engine based on page classification and quick search method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102236719A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050144162A1 (en) * | 2003-12-29 | 2005-06-30 | Ping Liang | Advanced search, file system, and intelligent assistant agent |
CN1716255A (en) * | 2004-07-01 | 2006-01-04 | 微软公司 | Dispersing search engine results by using page category information |
CN101014954A (en) * | 2004-09-07 | 2007-08-08 | 因特曼股份有限公司 | Information search provision apparatus and information search provision system |
CN101261629A (en) * | 2008-04-21 | 2008-09-10 | 上海大学 | Specific Information Search Method Based on Automatic Classification Technology |
CN102012922A (en) * | 2010-11-30 | 2011-04-13 | 无锡快度信息技术有限公司 | Modeling method for industrial application model of universal vertical search engine |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521267A (en) * | 2011-11-21 | 2012-06-27 | 沈文策 | In-station information searching method and system |
CN102521267B (en) * | 2011-11-21 | 2014-01-22 | 沈文策 | In-station information searching method and system |
CN103365924A (en) * | 2012-04-09 | 2013-10-23 | 北京大学 | Method, device and terminal for searching information |
CN103365924B (en) * | 2012-04-09 | 2016-04-06 | 北京大学 | A kind of method of internet information search, device and terminal |
CN103631791A (en) * | 2012-08-22 | 2014-03-12 | 腾讯科技(深圳)有限公司 | Information fusion classification display method and system |
CN103631791B (en) * | 2012-08-22 | 2017-04-12 | 腾讯科技(深圳)有限公司 | Information fusion classification display method and system |
CN103064880A (en) * | 2012-11-23 | 2013-04-24 | 覃文浩 | Method, device and system based on searching information for providing users with website choice |
CN103064880B (en) * | 2012-11-23 | 2016-12-21 | 覃文浩 | A kind of methods, devices and systems providing a user with website selection based on search information |
CN103258000A (en) * | 2013-03-29 | 2013-08-21 | 北界创想(北京)软件有限公司 | Method and device for clustering high-frequency keywords in webpages |
CN103914529B (en) * | 2014-03-31 | 2016-03-16 | 百度在线网络技术(北京)有限公司 | Search exhibiting method and device |
WO2015149506A1 (en) * | 2014-03-31 | 2015-10-08 | 百度在线网络技术(北京)有限公司 | Search display method and device |
CN103914529A (en) * | 2014-03-31 | 2014-07-09 | 百度在线网络技术(北京)有限公司 | Search displaying method and search displaying device |
CN103995881A (en) * | 2014-05-28 | 2014-08-20 | 百度在线网络技术(北京)有限公司 | Method and device for showing search results |
CN103995881B (en) * | 2014-05-28 | 2018-04-13 | 百度在线网络技术(北京)有限公司 | Search result shows method and device |
CN104123366A (en) * | 2014-07-23 | 2014-10-29 | 谢建平 | Search method and server |
CN104572871A (en) * | 2014-12-19 | 2015-04-29 | 乐视网信息技术(北京)股份有限公司 | Method and device for searching based on index table |
CN104572879A (en) * | 2014-12-19 | 2015-04-29 | 乐视网信息技术(北京)股份有限公司 | Method and device for updating index table and method and device for searching based on index table |
CN106708901A (en) * | 2015-11-17 | 2017-05-24 | 北京国双科技有限公司 | Clustering method and device of search terms in website |
CN107301193A (en) * | 2016-04-15 | 2017-10-27 | 北京慧点科技有限公司 | A kind of information query method for MSN |
CN107343104A (en) * | 2017-07-19 | 2017-11-10 | 北京小米移动软件有限公司 | Handle the method, apparatus and terminal device of Information on Collection |
CN109948032A (en) * | 2017-08-21 | 2019-06-28 | 李华林 | Web search results ranking device, search engine and browser based on user preference |
CN108920671A (en) * | 2018-07-06 | 2018-11-30 | 佛山市灏金赢科技有限公司 | A kind of web page tag lookup method and system |
CN110059243A (en) * | 2019-03-21 | 2019-07-26 | 广东瑞恩科技有限公司 | Data optimization engine method, apparatus, equipment and computer readable storage medium |
CN110059243B (en) * | 2019-03-21 | 2024-05-07 | 广东瑞恩科技有限公司 | Data engine optimization method, device, equipment and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102236719A (en) | Page search engine based on page classification and quick search method | |
CN102364473B (en) | Netnews search system and method based on geographic information and visual information | |
CN103020164B (en) | Semantic search method based on multi-semantic analysis and personalized sequencing | |
He et al. | Crawling deep web entity pages | |
US7617193B2 (en) | Interactive user-controlled relevance ranking retrieved information in an information search system | |
US10133823B2 (en) | Automatically providing relevant search results based on user behavior | |
CN105022827B (en) | A kind of Web news dynamic aggregation method of domain-oriented theme | |
US7475074B2 (en) | Web search system and method thereof | |
US20150278226A1 (en) | Matching and recommending relevant videos and media to individual search engine results | |
US20140372451A1 (en) | Discovering and scoring relationships extracted from human generated lists | |
US20090172514A1 (en) | Method and system for searching text-containing documents | |
CN102591948B (en) | A method and system for improving search results based on user behavior analysis | |
KR101355945B1 (en) | On line context aware advertising apparatus and method | |
CA2637239A1 (en) | System for searching | |
JP2008538149A (en) | Rating method, search result organizing method, rating system, and search result organizing system | |
CN110569273A (en) | A patent retrieval system and method based on relevance ranking | |
US20140280174A1 (en) | Interactive user-controlled search direction for retrieved information in an information search system | |
US7620631B2 (en) | Pyramid view | |
CN105912662A (en) | Coreseek-based vertical search engine research and optimization method | |
CN106599299A (en) | Determining method and device of website key words | |
Barrio et al. | Sampling strategies for information extraction over the deep web | |
CN107908681A (en) | A method, system, device and medium for searching similar websites | |
US20030018617A1 (en) | Information retrieval using enhanced document vectors | |
WO2019056727A1 (en) | Display method and apparatus for organization name search formula, device and storage medium | |
Manral et al. | An innovative approach for online meta search engine optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20111109 |