US20040123237A1 - Example-based concept-oriented data extraction method - Google Patents
Example-based concept-oriented data extraction method
- Publication number
- US20040123237A1 (US10/442,300)
- Authority
- US
- United States
- Prior art keywords
- concept
- token
- exemplary
- graph
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99936—Pattern matching access
Definitions
- the example-based concept-oriented data extraction method consists of two phases: an example labeling phase for labeling an exemplary data string, and a data extraction phase for extracting targeted data from an untested data string.
- the data string is preferably a text-based document, such as an article, or hypertext markup language (HTML) source codes.
- Each character in the data string could be an ASCII code, a binary value between 0 and 255, or a two-byte Unicode.
- the exemplary data string and the untested data string are HTML source codes, and each character is an ASCII code.
- the method disclosed in the first embodiment is applied in a computer system for extracting the targeted data, where the computer system has a concept labeling module 11 , a concept detection module 12 , and a concept sequence selection module 14 as shown in FIG. 1.
- the data of interest, namely the targeted data, is the value of the exchange rate of buying US dollars in cash (i.e., 33.44500) from the example web page as shown in FIG. 3.
- the HTML source code (i.e., the exemplary data string) of the example web page is captured first in the example labeling phase (step A 01 ).
- the computer system tokenizes the HTML source code into a plurality of tokens as an exemplary token sequence 21 as shown in FIG. 4 (step A 02 ), wherein each token has been sequentially assigned an index.
- the exemplary token sequence 21 has 40 tokens sequentially assigned an index from 1 to 40, and each token is a HTML tag or a text segment between HTML tags. It is noted that if the source documents are free texts, the exemplary data string could be tokenized into words.
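The tokenization step can be sketched as follows. This is a minimal illustration, assuming that a token is either one HTML tag or the non-blank text between two tags; the function name `tokenize_html` and the sample string are hypothetical and are not taken from FIG. 3 or FIG. 4.

```python
import re

# Assumption: a token is one HTML tag, or a run of text between two tags.
TOKEN_PATTERN = re.compile(r"<[^>]+>|[^<]+")


def tokenize_html(source: str) -> list:
    """Split HTML source into (index, token) pairs; indices start at 1."""
    pieces = (match.group(0).strip() for match in TOKEN_PATTERN.finditer(source))
    tokens = [piece for piece in pieces if piece]  # drop whitespace-only text
    return list(enumerate(tokens, start=1))


# Hypothetical miniature page, far shorter than the 40-token example.
sample = (
    "<html><body><table><tr><td>USD</td><td>33.44500</td></tr>"
    "</table></body></html>"
)
exemplary_token_sequence = tokenize_html(sample)
```

For free text, the same function shape would apply with a word-level pattern in place of the tag-level one.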
- the user uses the concept labeling module 11 to specify the exemplary token sequence 21 into a plurality of specific concepts, each consisting of at least one token as a concept sequence, and labels each specific concept as a tuple (step A 03 ) to tell the computer system which targeted data to be extracted, wherein each tuple represents an example.
- the specific concept could be a target concept pointed to the targeted data, or a filler concept pointed to the contextual data of the targeted data.
- the format of a tuple can be recorded as follows:
- the beginning index is the first token in the specific concept
- the ending index is the last token in the specific concept
- the associated concept recognizer 112 is adapted for recognizing the token sequence having the same format with the tuple, for example, a number recognizer, or a dynamically generated recognizer (DGR) . . . etc.
- the associated concept recognizer 112 is either user-provided (i.e., provided by the user) or system-provided (i.e., provided by the computer system).
- GUI: graphical user interface
- Tuple T A1 (Target, BUY_CASH, 32, 32, Number_Recognizer).
- the tuple T A1 means that this example is a target concept.
- the name of this target concept is “BUY_CASH”. This target concept spans from 32nd token to 32nd token, and this concept should be recognized by the recognizer named “Number_Recognizer” adapted for recognizing numeral characters.
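A tuple of this form can be modeled as a small record. The sketch below is one possible representation, not the patent's implementation: the class name `ConceptTuple`, the field names, and the regular expression standing in for "Number_Recognizer" are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a recognizer is a predicate over a candidate token sequence.
Recognizer = Callable[[List[str]], bool]


def number_recognizer(tokens: List[str]) -> bool:
    """Hypothetical stand-in for Number_Recognizer: one purely numeric token."""
    return len(tokens) == 1 and re.fullmatch(r"\d+(\.\d+)?", tokens[0]) is not None


@dataclass(frozen=True)
class ConceptTuple:
    concept_type: str       # "Target" or "Filler"
    name: str               # concept name, e.g. "BUY_CASH"
    begin: int              # 1-based index of the first token in the concept
    end: int                # 1-based index of the last token in the concept
    recognizer: Recognizer  # associated concept recognizer

    def span_length(self) -> int:
        """Number of tokens covered by this concept."""
        return self.end - self.begin + 1


# The two user-labeled examples of the first embodiment.
T_A1 = ConceptTuple("Target", "BUY_CASH", 32, 32, number_recognizer)
T_A2 = ConceptTuple("Filler", "SELL_CASH", 35, 35, number_recognizer)
```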
- Labeling a filler concept is similar to labeling a target concept, except that the Concept-Type is “Filler”.
- the user may specify a filler concept with the following tuple:
- Tuple T A2 (Filler, SELL_CASH, 35, 35, Number_Recognizer).
- the user wishes the exchange rate of selling US dollars in cash not to be extracted from the web page by the computer system, and therefore specifies it as a filler concept to be regarded as the contextual data of the targeted data.
- the computer system will automatically label the tokens which are not included in the labeled examples as filler concepts.
- Each automatically labeled filler concept will be associated to some (at least one) dynamically generated recognizers (DGR).
- Tuple T A3 (Filler, <html>*<td>, 1, 31, _DGR{<html>*<td>});
- Tuple T A4 (Filler, </td><td>, 33, 34, _DGR{</td><td>});
- Tuple T A5 (Filler, </td>*</html>, 36, 40, _DGR{</td>*</html>}).
- Tuple T A3 means that the segment spanning from the first token (i.e., leading token) to the 31st token (i.e., ending token) is a filler concept whose name is “<html>*<td>”, which is automatically generated by combining the content of the first token “<html>” and the content of the 31st token “<td>”.
- the last element (i.e., Associated-Concept-Recognizer) in tuple T A3 indicates that this filler concept can be recognized by a recognizer driven by the rule “<html>*<td>”, where the wildcard “*” represents any positive number of tokens.
- tuple T A4 means that the segment spanning from the 33rd token to the 34th token is a filler concept whose name is “</td><td>”
- tuple T A5 means that the segment spanning from the 36th token to the 40th token is a filler concept whose name is “</td>*</html>”.
- There is no wildcard “*” in tuple T A4 because the corresponding segment consists of only two tokens.
- If the associated concept recognizer 112 is constructed according to the leading two tokens and the ending two tokens of the filler concept, the following three tuples will be produced for the three uncovered segments:
- Tuple T A6 (Filler, <html><body>*</td><td>, 1, 31, _DGR{<html><body>*</td><td>});
- Tuple T A7 (Filler, </td><td>, 33, 34, _DGR{</td><td>});
- Tuple T A8 (Filler, </td></tr>*</body></html>, 36, 40, _DGR{</td></tr>*</body></html>}).
- There is no wildcard in tuple T A7 because the number of tokens in the corresponding segment is less than four. In fact, tuple T A7 is the same as tuple T A4 .
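The automatic labeling of uncovered segments can be sketched as below. The function names `dgr_rule` and `filler_tuples`, the plain-tuple output, and the placeholder tokens are hypothetical. One assumption is made explicit in the code: the wildcard is emitted only when at least one token lies between the leading k and ending k tokens, which is consistent with “*” standing for a positive number of tokens.

```python
from typing import List, Tuple


def dgr_rule(segment: List[str], k: int = 1) -> str:
    """Build a filler rule from the leading k and ending k tokens of a segment.

    Assumption: a wildcard appears only if some token is left in between.
    """
    if len(segment) > 2 * k:
        return "".join(segment[:k]) + "*" + "".join(segment[-k:])
    return "".join(segment)


def filler_tuples(tokens: List[str], labeled: List[Tuple[int, int]], k: int = 1):
    """Return (type, name, begin, end, recognizer) for each uncovered segment.

    `labeled` holds the 1-based inclusive (begin, end) spans already labeled.
    """
    fillers = []
    cursor = 1  # first token index not yet covered
    sentinel = (len(tokens) + 1, len(tokens) + 1)
    for begin, end in sorted(labeled) + [sentinel]:
        if cursor < begin:
            rule = dgr_rule(tokens[cursor - 1 : begin - 1], k)
            fillers.append(("Filler", rule, cursor, begin - 1, "_DGR{" + rule + "}"))
        cursor = max(cursor, end + 1)
    return fillers


# Hypothetical 40-token sequence; "x", "y" and the *_VALUE strings are placeholders.
tokens = (
    ["<html>", "<body>"] + ["x"] * 27
    + ["</td>", "<td>", "BUY_VALUE", "</td>", "<td>", "SELL_VALUE"]
    + ["</td>", "</tr>", "y", "</body>", "</html>"]
)
one_token_fillers = filler_tuples(tokens, [(32, 32), (35, 35)], k=1)
two_token_fillers = filler_tuples(tokens, [(32, 32), (35, 35)], k=2)
```

With `k=1` the generated names correspond to tuples T A3 to T A5, and with `k=2` to tuples T A6 to T A8.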
- the concept labeling module 11 can construct tuples for the uncovered segments with different criteria, such as constructing concept recognizer according to leading three tokens and ending three tokens. Furthermore, if the user specifies a specific concept without explicitly associating it to an associated concept recognizer 112 , the concept labeling module 11 of the computer system will automatically produce one according to the tokens on the left-hand side and on the right-hand side of the specific concept specified by the user.
- an exemplary concept graph 111 as shown in FIG. 5 is constructed by the concept labeling module 11 according to the beginning token index and the ending token index of the tuples specified by the user and automatically generated by the computer system (step A 04 ).
- the exemplary concept graph 111 is constructed according to tuple T A1 and tuple T A2 assigned by the user, and tuple T A3 , tuple T A4 , tuple T A5 , tuple T A6 , and tuple T A8 automatically generated by the computer system, where tuple T A7 is ignored because it is the same as tuple T A4 .
- the tuple T A1 is the target concept 31 while the other tuples are filler concepts.
- the example labeling phase would be accomplished after constructing the exemplary concept graph 111 .
- After obtaining the exemplary concept graph 111 by labeling the exemplary token sequence 21 , the computer system is ready to capture an untested data string from new web pages (step A 05 ).
- the untested data string, namely, the HTML source code
- the concept detection module 12 refers to the associated concept recognizers 112 defined by the tuples in the example labeling phase for detecting a plurality of concept candidates from the untested token sequence 22 (step A 07 ).
- the format of a concept candidate can be recorded as follows:
- the concept detection module 12 then constructs a (simplified) preliminary concept graph 121 as shown in FIG. 8 (step A 08 ).
- the isolated concept candidate represents that it cannot be connected with any other concept candidates.
- the ending token index of the concept candidate “<html>*<td>(1,5)” is “5” (that is, the last token of the concept candidate is the 5th token in the untested token sequence 22 ), and there is no concept candidate having “6” as its beginning token index (that is, no concept candidate whose first token is the 6th token in the untested token sequence 22 ).
- the concept candidate “<html>*<td>(1,5)” is therefore not a valid one; namely, it is an isolated concept candidate and need not be depicted in the preliminary concept graph 121 .
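Dropping isolated candidates can be sketched as a reachability check. The sketch assumes a stricter reading than the text states: a candidate is kept only if it lies on some chain of adjacent candidates that covers the whole token sequence. The `Candidate` type, the function name, and the candidate list are hypothetical and do not reproduce FIG. 7.

```python
from typing import List, NamedTuple, Set


class Candidate(NamedTuple):
    name: str
    begin: int  # 1-based index of the first token
    end: int    # 1-based index of the last token


def prune_isolated(candidates: List[Candidate], n_tokens: int) -> List[Candidate]:
    """Keep candidates lying on a chain that covers tokens 1..n_tokens."""
    # Forward pass: start positions reachable by chaining candidates from token 1.
    reachable_starts: Set[int] = {1}
    for cand in sorted(candidates, key=lambda c: c.begin):
        if cand.begin in reachable_starts:
            reachable_starts.add(cand.end + 1)
    # Backward pass: end positions from which the last token can still be reached.
    finishing_ends: Set[int] = {n_tokens}
    for cand in sorted(candidates, key=lambda c: c.end, reverse=True):
        if cand.end in finishing_ends:
            finishing_ends.add(cand.begin - 1)
    return [
        cand
        for cand in candidates
        if cand.begin in reachable_starts and cand.end in finishing_ends
    ]


# Hypothetical candidates over a 12-token untested sequence.
candidates = [
    Candidate("<html>*<td>", 1, 5),   # isolated: no candidate begins at token 6
    Candidate("<html>*<td>", 1, 7),
    Candidate("BUY_CASH", 3, 3),      # isolated: nothing ends at token 2
    Candidate("BUY_CASH", 8, 8),
    Candidate("</td>*</html>", 9, 12),
]
preliminary_nodes = prune_isolated(candidates, n_tokens=12)
```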
- the preliminary concept graph 121 is then sent to the concept sequence selection module 14 for comparing the preliminary concept graph 121 in FIG. 8 with the exemplary concept graph 111 in FIG. 5 by applying the dynamic programming technique (step A 09 ) to determine an optimal hypothetical concept sequence 141 as shown in FIG. 9 (step A 10 ).
- the matched target concept 33 (identical to the target concept 32 shown in FIG. 9) is captured from the optimal hypothetical concept sequence 141 so as to extract the target data (step A 11 ). Therefore, the 32nd token in the untested token sequence 22 belonging to the target concept “BUY_CASH” shown in FIG. 6 is captured according to the target concept 32 for extracting the exchange rate of buying US dollars in cash “33.45500”.
- In step A 09 of this embodiment, which uses the dynamic programming technique, the costs of different edit operations, such as deleting a concept x, inserting a concept x, and substituting a concept x with y, are defined.
- x is a hypothetical concept
- y is an exemplary concept
- n x is the number of tokens in the hypothetical concept x
- n y is the number of tokens in the exemplary concept y
- ⁇ is a variable to control the sensitivity of cost on the difference between n x and n y .
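The dynamic programming comparison can be illustrated with a standard edit-distance table over two concept sequences. Everything numeric here is a placeholder: the deletion and insertion costs, the substitution cost, and the way the `sensitivity` parameter scales the difference between n x and n y are assumptions chosen for illustration, not the cost definitions of this method. The sketch also compares one hypothetical sequence with one exemplary sequence rather than two graphs.

```python
from typing import List, Tuple

Concept = Tuple[str, int]  # (concept name, number of tokens in the concept)


def substitution_cost(x: Concept, y: Concept, sensitivity: float = 0.5) -> float:
    """Placeholder cost of substituting hypothetical concept x with exemplary y."""
    (name_x, n_x), (name_y, n_y) = x, y
    if name_x != name_y:
        return 2.0  # assumed: as expensive as one deletion plus one insertion
    return sensitivity * abs(n_x - n_y) / max(n_x, n_y)


def alignment_cost(
    hypothetical: List[Concept],
    exemplary: List[Concept],
    sensitivity: float = 0.5,
    indel: float = 1.0,
) -> float:
    """Minimum total edit cost between the two concept sequences."""
    rows, cols = len(hypothetical) + 1, len(exemplary) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * indel
    for j in range(1, cols):
        table[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j] + indel,  # drop a hypothetical concept
                table[i][j - 1] + indel,  # account for an unmatched exemplary concept
                table[i - 1][j - 1]
                + substitution_cost(hypothetical[i - 1], exemplary[j - 1], sensitivity),
            )
    return table[-1][-1]


exemplary = [("<html>*<td>", 31), ("BUY_CASH", 1), ("</td><td>", 2),
             ("SELL_CASH", 1), ("</td>*</html>", 5)]
# Two hypothetical concept sequences read off an untested page (made-up lengths).
hypotheses = [
    [("<html>*<td>", 25), ("BUY_CASH", 1), ("</td><td>", 2),
     ("SELL_CASH", 1), ("</td>*</html>", 5)],
    [("<html>*<td>", 25), ("</td>*</html>", 5)],
]
best = min(hypotheses, key=lambda h: alignment_cost(h, exemplary))
```

Under these placeholder costs, the hypothesis that keeps the same concept names with a slightly shorter leading filler is preferred over the one that loses the target concept.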
- the second embodiment is basically the same as the first embodiment, except that the computer system further has a concept building module 13 ′.
- the concept labeling module 11 ′ also defines some concept rules 113 ′ for specifying composite concepts composed of a plurality of specific concepts.
- the computer system also captures the HTML source code of the example web page as shown in FIG. 12 (step B 01 ), and tokenizes the HTML source code into a plurality of tokens as the exemplary token sequence 21 ′ (step B 02 ). Then, the exemplary token sequence 21 ′ is converted into a plurality of specific concepts as a concept sequence (step B 03 ).
- the tuples specified by the user in step B 03 comprise the following 7 tuples:
- Tuple T B1 (Target, NAME, 18, 18, Three_Uppercase_Letter_Recognizer);
- Tuple T B2 (Target, BUY, 21, 21, Number_Recognizer);
- Tuple T B3 (Target, SELL, 24, 24, Number_Recognizer);
- Tuple T B4 (Target, RECORD, 18, 24, GenerateNormalRule[T B1 , T B2 , T B3 ]);
- Tuple T B5 (Target, RECORD, 29, 35, DoNothing);
- Tuple T B6 (Target, RECORD+, 18, 35, GenerateNormalRule[T B4 , T B5 ]);
- Tuple T B7 (Target, RECORD+, 18, 57, DoNothing).
- Tuple T B4 indicates that the sequence of tokens 18 to 24 is an example of the composite concept “RECORD”.
- the associated concept recognizer 112 ′ in tuple T B4 (i.e., GenerateNormalRule[T B1 , T B2 , T B3 ]) is a command, which indicates that this “RECORD” example consists of the examples specified by tuple T B1 , tuple T B2 , and tuple T B3 .
- tuple T B4 is associated from the 18th token to the 24th token
- tuple T B1 is associated with the 18th token
- tuple T B2 is associated with the 21st token
- tuple T B3 is only associated with the 24th token in the exemplary token sequence 21 ′
- the computer system will automatically generate the following filler concepts (assuming that the computer system automatically generates the recognizer of a filler concept according to the first token and the last token of the filler concept):
- Tuple T B8 (Filler, </td><td>, 19, 20, _DGR{</td><td>});
- Tuple T B9 (Filler, </td><td>, 22, 23, _DGR{</td><td>}).
- the computer system will automatically generate the following context-free rule to define the format of the composite concept of tuple T B4 :
- Rule C 1 : RECORD → NAME </td><td> BUY </td><td> SELL.
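How a command such as GenerateNormalRule could turn labeled spans into a context-free rule can be sketched as follows. The function name `generate_normal_rule`, the span representation, and the check that the children tile the parent span without gaps are assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (concept name, beginning index, ending index)


def generate_normal_rule(parent: Span, concepts: List[Span]) -> str:
    """Order the concepts inside the parent span and emit 'PARENT → children'."""
    name, begin, end = parent
    inside = sorted(
        (c for c in concepts if begin <= c[1] and c[2] <= end),
        key=lambda c: c[1],
    )
    # Assumption: the children must cover the parent span exactly, with no gaps.
    cursor = begin
    for child_name, child_begin, child_end in inside:
        if child_begin != cursor:
            raise ValueError(f"gap or overlap before {child_name!r}")
        cursor = child_end + 1
    if cursor != end + 1:
        raise ValueError("children do not reach the end of the parent span")
    return f"{name} → " + " ".join(c[0] for c in inside)


# Children of tuple T B4: the three targets plus the two generated fillers.
children = [
    ("NAME", 18, 18),
    ("BUY", 21, 21),
    ("SELL", 24, 24),
    ("</td><td>", 19, 20),
    ("</td><td>", 22, 23),
]
rule_c1 = generate_normal_rule(("RECORD", 18, 24), children)
```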
- Tuple T B5 indicates that the sequence of tokens 29 to 35 is also an example of the “RECORD” concept, like tuple T B4 . Because rule C 1 has already been generated by the computer system, there is no need to define a new context-free rule for tuple T B5 . Thus, the associated concept recognizer 112 ′ is the command “DoNothing”, which tells the computer system not to automatically generate an associated concept recognizer 112 ′ for this example.
- Tuple T B6 indicates that the sequence of tokens 18 to 35 is an example of a special kind of composite concept “RECORD+”, which represents one or more records.
- This special kind of composite concept is also named aggregate concept.
- the associated concept recognizer 112 ′ in tuple T B6 (i.e., GenerateNormalRule[T B4 , T B5 ])
- the computer system automatically generates the following tuple:
- Tuple T B10 (Filler, </td>*<td>, 25, 28, _DGR{</td>*<td>}).
- The last user-specified tuple (i.e., tuple T B7 ) indicates that the sequence of tokens 18 to 57 is also an example of the aggregate concept “RECORD+”, like tuple T B6 . Once all user-specified tuples have been processed, the system generates the following tuples to cover the uncovered segments in the exemplary token sequence 21 ′:
- Tuple T B11 (Filler, <html><body>*<tr><td>, 1, 17, _DGR{<html><body>*<tr><td>});
- Tuple T B12 (Filler, </td></tr>*</body></html>, 58, 62, _DGR{</td></tr>*</body></html>}).
- an exemplary concept graph 111 ′ as shown in FIG. 14 is constructed by the concept labeling module 11 ′ according to the beginning token index and the ending token index of the tuples specified by the user and automatically generated by the computer system (step B 05 ). It should be noted that the concepts specified by tuple T B1 to tuple T B6 are not used to construct the exemplary concept graph 111 ′ because the concept specified by tuple T B7 covers the concepts specified by tuple T B1 to tuple T B6 .
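The selection of tuples that actually enter the exemplary concept graph can be sketched as a filter that discards every span lying inside a strictly larger span. The function name and the span representation are hypothetical; spans with an identical range are both kept, on the assumption that they are alternatives rather than covers of one another.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (concept name, beginning index, ending index)


def uncovered_by_larger(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not contained in a strictly larger span."""

    def strictly_covers(outer: Span, inner: Span) -> bool:
        same_range = (outer[1], outer[2]) == (inner[1], inner[2])
        return outer[1] <= inner[1] and inner[2] <= outer[2] and not same_range

    kept = [s for s in spans if not any(strictly_covers(t, s) for t in spans)]
    return sorted(kept, key=lambda s: (s[1], s[2]))


# Spans of tuples T B1 to T B12 in the second embodiment.
all_spans = [
    ("NAME", 18, 18), ("BUY", 21, 21), ("SELL", 24, 24),
    ("RECORD", 18, 24), ("RECORD", 29, 35),
    ("RECORD+", 18, 35), ("RECORD+", 18, 57),
    ("</td><td>", 19, 20), ("</td><td>", 22, 23), ("</td>*<td>", 25, 28),
    ("<html><body>*<tr><td>", 1, 17),
    ("</td></tr>*</body></html>", 58, 62),
]
graph_nodes = uncovered_by_larger(all_spans)
```

On these spans, only the aggregate concept of tuple T B7 and the two outer fillers of tuples T B11 and T B12 remain.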
- the computer system begins to extract targeted data from new documents.
- the computer system first captures the HTML source code of the web page as shown in FIG. 15 (step B 06 ), and tokenizes the HTML source code into the untested token sequence 22 ′ as shown in FIG. 16 (step B 07 ).
- the concept detection module 12 ′ defines concept candidates (step B 08 ) and constructs the preliminary concept graph 121 ′ as shown in FIG. 17 (step B 09 )
- the preliminary concept graph 121 ′ is then converted into a hypothetical concept graph 131 ′ as shown in FIG. 18 by the concept building module 13 ′ using the concept rules 113 ′ generated in the example labeling phase (step B 10 ).
- the hypothetical concept graph 131 ′ is then sent to the concept sequence selection module 14 ′ for comparing the hypothetical concept graph 131 ′ in FIG. 18 with the exemplary concept graph 111 ′ in FIG. 14 by applying the dynamic programming technique (step B 11 ) to determine an optimal hypothetical concept sequence 141 ′ as shown in FIG. 19 (step B 12 ).
- the matched target concept 33 ′ (identical to the target concept 32 ′ shown in FIG. 19) is extracted from the optimal hypothetical concept sequence 141 ′ (step B 13 ).
- the matched target concept 33 ′ in this embodiment is an aggregate concept, and it has to be spanned as illustrated in FIG.
- the computer system can extract the currency names (corresponding to the target concept “NAME”), the exchange rates of buying cash (corresponding to the target concept “BUY”), and the exchange rates of selling cash (corresponding to the target concept “SELL”) from the untested token sequence 22 ′.
- the user only needs to specify the targeted data of interest as examples in the example labeling phase (namely, the user is allowed to specify target concepts and filler concepts) for enabling the computer system to efficiently predict the location of the targeted data in the untested token sequence based on the user-specified examples.
- the method described in the present invention still can accurately extract targeted data.
- the example web page shown in FIG. 12 records the exchange rates of four kinds of currencies with 62 tokens being tokenized
- the untested web page shown in FIG. 15 records the exchange rates of only three kinds of currencies with 51 tokens being tokenized.
- the present invention provides a highly practical and flexible method which is able to extract data of interest without asking the user to specify too many examples.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
- 1. Field of the Invention
- The present invention relates to the technical field of data extraction, and, more particularly, to an example-based concept-oriented data extraction method that is adapted to free texts, such as the source code of web pages, or general articles.
- 2. Description of Related Art
- Currently, with the development of information technology and the explosion of the World Wide Web (WWW), more and more information is available online. Many web pages provide useful information, such as weather forecasts, stock quotes, book prices, and so on. However, these web pages are human-readable but not machine-understandable, so the information on them is hard for a machine to manipulate. One way to handle web information more effectively is to extract data from web pages to populate databases for further manipulation.
- The conventional approach for extracting data from web pages is to write application-specific programs, named wrappers, which are able to locate data of interest on particular web pages. But writing wrappers is a tedious, error-prone, and time-consuming process. Furthermore, the wrapper programs are usually not reusable once the formatting convention of the targeted web pages changes. In such case, the painful wrapper writing process has to be repeated.
- Conventionally, many methods have been proposed to facilitate generating wrappers automatically or semi-automatically for solving the laborious and error-prone problems of handcrafting wrappers. These methods can be classified into two approaches. The first approach is developing languages specially designed to assist users in constructing wrappers. The other approach is using labeled examples to generate wrappers.
- Although using specially designed languages to build wrappers can somewhat reduce the effort, it still inherits the drawbacks of manually building wrappers with general purpose languages, such as Perl and Java. The example-based approach, by contrast, consists of two phases: rule induction and data extraction. In the rule induction phase, some possible contextual rules are generated to specify the local contextual patterns around the labeled data. In the data extraction phase, these contextual rules are used to locate and extract the targeted data on new web pages. This approach is based on the assumption that the induced contextual rules are able to precisely locate the targeted data. However, due to imperfect rule induction or insufficient examples, the induced rules sometimes also locate undesired data. Errors of this kind may propagate and make the data extractor fail to grab the targeted data, even though the contexts of the targeted data satisfy the contextual rules. Besides, in the prior art, the representation form of contextual rules is predefined and the induced rules must be strictly obeyed. As a result, the user must label a lot of examples so that all possible contexts around the targeted data can be taken into account.
- The object of the present invention is to provide an example-based concept-oriented data extraction method for simplifying the process of establishing specific concepts and concept rules of examples so as to increase practicality without handling the complicated contextual patterns of the data of interest.
- Another object of the present invention is to provide an example-based concept-oriented data extraction method for defining the exemplary data string as flexible and practical examples to accurately and efficiently extract the targeted data from the untested data string after the format of the untested data string has been changed without bothering the user to specify a lot of examples.
- To achieve these and other objects of the present invention, the example-based concept-oriented data extraction method comprises a first procedure for labeling an exemplary data string, and a second procedure for extracting targeted data from an untested data string.
- The first procedure comprises the steps of: (A1) capturing an exemplary data string; (A2) tokenizing the exemplary data string into a plurality of tokens as an exemplary token sequence, each token having an index; (A3) specifying the exemplary token sequence as a plurality of specific concepts, each being labeled to be a tuple and consisting of at least one token, the specific concept being selected from the group of a target concept and a filler concept, the target concept pointing to the targeted data of interest, the filler concept pointing to the contextual data of the targeted data, each tuple having a format including concept type, a concept name, a beginning index of the first token in the specific concept, an ending index of the last token in the specific concept, and an associated concept recognizer of the specific concept, wherein the associated concept recognizer is provided to recognize the possible token sequence of the specific concept; and (A4) constructing an exemplary concept graph of the exemplary data string according to the tuples.
- The second procedure comprises the steps of: (B1) capturing an untested data string; (B2) tokenizing the untested data string into a plurality of tokens as an untested token sequence; (B3) using the associated concept recognizers defined by the tuples for detecting a plurality of concept candidates, wherein each concept candidate having a format including the beginning index and the ending index of the corresponding token sequence, and the concept name of the concept candidate; (B4) constructing a preliminary concept graph of the untested token sequence according to the concept candidates; and (B5) determining an optimal hypothetical concept sequence by comparing the exemplary concept graph with the preliminary concept graph and capturing at least one matched target concept from the optimal hypothetical concept sequence for extracting the targeted data.
- Other objects, advantages, and novel features of the invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings.
- FIG. 1 is a functional block diagram of the first embodiment according to the present invention;
- FIG. 2 is a flowchart of the first embodiment according to the present invention;
- FIG. 3 is a schematic drawing of an example web page in the first embodiment according to the present invention;
- FIG. 4 schematically illustrates the exemplary token sequence of the example web page illustrated in FIG. 3;
- FIG. 5 is an exemplary concept graph of the first embodiment according to the present invention;
- FIG. 6 schematically illustrates an untested token sequence of the first embodiment according to the present invention;
- FIG. 7 is a list of concept candidates of the first embodiment according to the present invention;
- FIG. 8 is a preliminary concept graph of the first embodiment according to the present invention;
- FIG. 9 schematically illustrates an optimal hypothetical concept sequence of the first embodiment according to the present invention;
- FIG. 10 is a functional block diagram of the second embodiment according to the present invention;
- FIG. 11 is a flowchart of the second embodiment according to the present invention;
- FIG. 12 is a schematic drawing of an example web page in the second embodiment according to the present invention;
- FIG. 13 schematically illustrates the exemplary token sequence of the example web page illustrated in FIG. 12;
- FIG. 14 is an exemplary concept graph of the second embodiment according to the present invention;
- FIG. 15 is a schematic drawing of an untested web page in the second embodiment according to the present invention;
- FIG. 16 schematically illustrates the untested token sequence of the untested web page illustrated in FIG. 15;
- FIG. 17 is a preliminary concept graph of the second embodiment according to the present invention;
- FIG. 18 is a hypothetical concept graph of the second embodiment according to the present invention;
- FIG. 19 schematically illustrates an optimal hypothetical concept sequence of the second embodiment according to the present invention; and
- FIG. 20 is a schematic drawing of the structure of the matched target concept of the second embodiment according to the present invention.
- The example-based concept-oriented data extraction method according to the present invention consists of two phases: an example labeling phase for labeling an exemplary data string, and a data extraction phase for extracting targeted data from an untested data string. The data string is preferably a text-based document, such as an article or hypertext markup language (HTML) source code. Each character in the data string could be an ASCII code, a binary value between 0 and 255, or a two-byte Unicode character. In the following two preferred embodiments, the exemplary data string and the untested data string are HTML source code, and each character is an ASCII code.
- Referring to FIGS. 1 and 2, the method disclosed in the first embodiment is applied in a computer system for extracting the targeted data, where the computer system has a concept labeling module 11, a concept detection module 12, and a concept sequence selection module 14 as shown in FIG. 1. In this embodiment, the data of interest, namely, the targeted data, is the value of the exchange rate of buying US dollars in cash (i.e., 33.44500) from the example web page as shown in FIG. 3. Since the web page contains not only the exchange rates of buying and selling US dollars in check and in cash, but also many tags that regulate the format of the web page, the HTML source code (i.e., the exemplary data string) of the example web page must be captured first in the example labeling phase (step A01) so that the computer system can precisely extract the exchange rate of buying US dollars in cash from web pages with this kind of format. Then, the computer system tokenizes the HTML source code into a plurality of tokens as an exemplary token sequence 21 as shown in FIG. 4 (step A02), wherein each token is sequentially assigned an index. In this embodiment, the exemplary token sequence 21 has 40 tokens sequentially indexed from 1 to 40, and each token is an HTML tag or a text segment between HTML tags. It is noted that if the source documents are free texts, the exemplary data string could be tokenized into words.
- Then, the user uses the concept labeling module 11 to divide the exemplary token sequence 21 into a plurality of specific concepts, each consisting of at least one token, as a concept sequence, and labels each specific concept as a tuple (step A03) to tell the computer system which targeted data is to be extracted, wherein each tuple represents an example. In this embodiment, a specific concept could be a target concept pointing to the targeted data, or a filler concept pointing to the contextual data of the targeted data. The format of a tuple can be recorded as follows:
- (Concept-Type, Concept-Name, Beginning-Token-Index, Ending-Token-Index, Associated-Concept-Recognizer),
- wherein the beginning token index is the index of the first token in the specific concept, the ending token index is the index of the last token in the specific concept, and the associated concept recognizer 112 is adapted for recognizing a token sequence having the same format as the tuple, for example, a number recognizer or a dynamically generated recognizer (DGR). The associated concept recognizer 112 is either user-provided (i.e., provided by the user) or system-provided (i.e., provided by the computer system).
- The user uses the graphical user interface (GUI) tool in the first embodiment to facilitate the process of specifying the exchange rate of buying US dollars in cash “33.44500” as a target concept to be recorded as the following tuple:
- Tuple TA1: (Target, BUY_CASH, 32, 32, Number_Recognizer).
- The tuple TA1 means that this example is a target concept. The name of this target concept is “BUY_CASH”. This target concept spans from the 32nd token to the 32nd token (i.e., it covers a single token), and it should be recognized by the recognizer named “Number_Recognizer”, which is adapted for recognizing numeric characters.
- Furthermore, the user is also allowed to specify the filler concepts. Labeling a filler concept is similar to labeling a target concept, except that the Concept-Type is “Filler”. For example, the user may specify a filler concept with the following tuple:
- Tuple TA2: (Filler, SELL_CASH, 35, 35, Number_Recognizer).
- That is, the user wishes the exchange rate of selling US dollars in cash not to be extracted from the web page by the computer system, and therefore specifies it as a filler concept to be regarded as the contextual data of the targeted data.
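- The tokenization of step A02 and the tuple format of step A03 can be sketched as follows. This is only an illustration under stated assumptions: the names `tokenize` and `ConceptTuple`, the regular expression, the choice to drop whitespace-only segments, and the sample HTML fragment are not taken from the patent.

```python
import re
from dataclasses import dataclass
from typing import List


def tokenize(html: str) -> List[str]:
    """Split HTML source into tags and the text segments between tags."""
    parts = re.findall(r"<[^>]*>|[^<]+", html)
    # Assumption: whitespace-only text segments are not kept as tokens.
    return [p.strip() for p in parts if p.strip()]


@dataclass(frozen=True)
class ConceptTuple:
    concept_type: str  # "Target" or "Filler"
    name: str          # Concept-Name
    begin: int         # Beginning-Token-Index (1-based)
    end: int           # Ending-Token-Index (1-based, inclusive)
    recognizer: str    # Associated-Concept-Recognizer


# Tuples TA1 and TA2 as given in the text.
TA1 = ConceptTuple("Target", "BUY_CASH", 32, 32, "Number_Recognizer")
TA2 = ConceptTuple("Filler", "SELL_CASH", 35, 35, "Number_Recognizer")

# Hypothetical fragment, not the actual page of FIG. 3.
sample_tokens = tokenize("<tr><td>USD</td><td>33.44500</td></tr>")
```

Storing the indices as 1-based and inclusive mirrors the convention used by the tuples in the text, so a single-token concept has equal beginning and ending indices.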
- After the user labels the target concepts and the filler concepts in the exemplary token sequence 21, the computer system automatically labels the tokens that are not included in the labeled examples as filler concepts. Each automatically labeled filler concept is associated with at least one dynamically generated recognizer (DGR). In this embodiment, after the user specifies the above two examples, three segments of tokens remain uncovered. They are treated as filler concepts and recorded as the tuples listed below:
- Tuple TA3: (Filler, <html>*<td>, 1, 31, _DGR{<html>*<td>});
- Tuple TA4: (Filler, </td><td>, 33, 34, _DGR{</td><td>}); and
- Tuple TA5: (Filler, </td>*</html>, 36, 40, _DGR{</td>*</html>}).
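- How the three uncovered segments follow from the two user-labeled examples can be sketched as below. The function name `uncovered_segments` is hypothetical; the spans (32, 32) and (35, 35) and the length of 40 tokens come from the text.

```python
def uncovered_segments(labeled, n_tokens):
    """Return the (begin, end) spans of tokens not covered by any labeled span.

    Indices are 1-based and inclusive, as in the tuples of the text.
    """
    gaps, next_free = [], 1
    for begin, end in sorted(labeled):
        if begin > next_free:
            gaps.append((next_free, begin - 1))
        next_free = max(next_free, end + 1)
    if next_free <= n_tokens:
        gaps.append((next_free, n_tokens))
    return gaps


# Spans of tuples TA1 and TA2 in the 40-token exemplary sequence.
gaps = uncovered_segments([(32, 32), (35, 35)], 40)
```

The resulting spans are the ones recorded in tuples TA3, TA4, and TA5.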
- Tuple TA3 means that the segment spanning from the first token (i.e., leading token) to the 31st token (i.e., ending token) is a filler concept whose name is “<html>*<td>”, which is automatically generated by combining the content of the first token “<html>” and the content of the 31st token “<td>”. The last element (i.e., Associated-Concept-Recognizer) in tuple TA3 indicates that this filler concept can be recognized by a recognizer driven by the rule “<html>*<td>”, where the wildcard “*” represents any positive number of tokens. Likewise, tuple TA4 means that the segment spanning from the 33rd token to the 34th token is a filler concept whose name is “</td><td>”, and tuple TA5 means that the segment spanning from the 36th token to the 40th token is a filler concept whose name is “</td>*</html>”. There is no wildcard “*” in tuple TA4 because the corresponding segment consists of only two tokens. In the same way, if the associated concept recognizer 112 is constructed according to the leading two tokens and the ending two tokens of the filler concept, the following three tuples will be produced for the three uncovered segments:
- Tuple TA6: (Filler, <html><body>*</td><td>, 1, 31, _DGR{<html><body>*</td><td>});
- Tuple TA7: (Filler, </td><td>, 33, 34, _DGR{</td><td>}); and
- Tuple TA8: (Filler, </td></tr>*</body></html>, 36, 40, _DGR{</td></tr>*</body></html>}).
- It is also noted that there is no wildcard in tuple TA7 because the number of tokens in the corresponding segment is less than four. In fact, tuple TA7 is the same as tuple TA4.
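- The way a DGR rule is named from the leading and ending tokens of a segment can be sketched as follows. The function `dgr_rule` is hypothetical, and so is the behavior for a segment of exactly 2k tokens, which the text does not cover; the middle token `</table>` in the example is an assumption, since the text only fixes the two leading and two ending tokens of that segment.

```python
def dgr_rule(segment_tokens, k=1):
    """Build a DGR rule from the leading k and ending k tokens of a segment."""
    if len(segment_tokens) <= 2 * k:
        # No token is left for the wildcard, which stands for at least one
        # token, so the rule is assumed to list every token of the segment.
        return "".join(segment_tokens)
    return "".join(segment_tokens[:k]) + "*" + "".join(segment_tokens[-k:])


# Tokens 33 to 34, as implied by tuples TA4 and TA7.
short_segment = ["</td>", "<td>"]
# Tokens 36 to 40; the middle token is an assumption for illustration.
tail_segment = ["</td>", "</tr>", "</table>", "</body>", "</html>"]

rule_ta4 = dgr_rule(short_segment, k=1)
rule_ta7 = dgr_rule(short_segment, k=2)
rule_ta5 = dgr_rule(tail_segment, k=1)
rule_ta8 = dgr_rule(tail_segment, k=2)
```

With these inputs the two-token segment yields the same rule for k=1 and k=2, which is why tuple TA7 duplicates tuple TA4.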
- Of course, the concept labeling module 11 can construct tuples for the uncovered segments with different criteria, such as constructing the concept recognizer according to the leading three tokens and the ending three tokens. Furthermore, if the user specifies a specific concept without explicitly associating it with an associated concept recognizer 112, the concept labeling module 11 of the computer system will automatically produce one according to the tokens on the left-hand side and on the right-hand side of the specific concept specified by the user.
- After all tokens in the exemplary token sequence 21 have been specified as specific concepts, an exemplary concept graph 111 as shown in FIG. 5 is constructed by the concept labeling module 11 according to the beginning token index and the ending token index of the tuples specified by the user and automatically generated by the computer system (step A04). Referring to FIG. 5, the exemplary concept graph 111 is constructed according to tuple TA1 and tuple TA2 assigned by the user, and tuple TA3, tuple TA4, tuple TA5, tuple TA6, and tuple TA8 automatically generated by the computer system, where tuple TA7 is ignored because it is the same as tuple TA4. The tuple TA1 is the target concept 31 while the other tuples are filler concepts. The example labeling phase is complete once the exemplary concept graph 111 has been constructed.
- After obtaining the exemplary concept graph 111 by labeling the exemplary token sequence 21, the computer system is ready to capture an untested data string from new web pages (step A05). In the data extraction phase, the untested data string (namely, the HTML source code) is also tokenized into an untested token sequence 22 as shown in FIG. 6 before further processing (step A06). Then, the concept detection module 12 refers to the associated concept recognizers 112 defined by the tuples in the example labeling phase for detecting a plurality of concept candidates from the untested token sequence 22 (step A07). The format of a concept candidate can be recorded as follows:
- Concept-Name (Beginning-Token-Index, Ending-Token-Index).
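- Candidate detection in step A07 can be sketched as follows, with each candidate held as a (name, begin, end) triple in the format above. The function names, the regular expression standing in for “Number_Recognizer”, and the ten-token sequence are assumptions for illustration; the sequence is not the one of FIG. 6.

```python
import re


def is_number(token):
    """Stand-in for Number_Recognizer: accepts plain decimal numbers."""
    return re.fullmatch(r"\d+(\.\d+)?", token) is not None


def detect_single_token(tokens, recognizers):
    """Emit (name, i, i) for every token accepted by a concept's recognizer."""
    found = []
    for name, accepts in recognizers.items():
        for index, token in enumerate(tokens, start=1):  # 1-based indices
            if accepts(token):
                found.append((name, index, index))
    return found


def detect_wildcard(tokens, head, tail):
    """Emit candidates for the rule head*tail, where * spans one or more tokens."""
    name = head + "*" + tail
    found = []
    for i, first in enumerate(tokens, start=1):
        if first != head:
            continue
        for j in range(i + 2, len(tokens) + 1):  # leave at least one token for *
            if tokens[j - 1] == tail:
                found.append((name, i, j))
    return found


tokens = ["<html>", "<tr>", "<td>", "33.45", "</td>",
          "<td>", "33.95", "</td>", "</tr>", "</html>"]
numbers = detect_single_token(
    tokens, {"BUY_CASH": is_number, "SELL_CASH": is_number})
fillers = detect_wildcard(tokens, "<html>", "<td>")
```

Because both concepts share one recognizer, every numeric token produces a candidate for each of them, so candidates overlap and a later selection step has to decide between them.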
- The detailed list of concept candidates of this embodiment is shown in FIG. 7. Since the user associates the tuple TA1 of the target concept “BUY_CASH” with the associated concept recognizer 112 “Number_Recognizer”, the following four concept candidates are generated from the untested token sequence 22: BUY_CASH(26,26), BUY_CASH(29,29), BUY_CASH(32,32), BUY_CASH(35,35). Likewise, since the user also associates the tuple TA2 of the filler concept “SELL_CASH” with the associated concept recognizer 112 “Number_Recognizer”, the following four concept candidates are also generated: SELL_CASH(26,26), SELL_CASH(29,29), SELL_CASH(32,32), SELL_CASH(35,35). Other concept candidates are generated in the same way. It should be noted that different concept candidates might cover the same token(s).
- After excluding the isolated concept candidates in FIG. 7, the concept detection module 12 then constructs a (simplified) preliminary concept graph 121 as shown in FIG. 8 (step A08). In this embodiment, an isolated concept candidate is one that cannot be connected with any other concept candidate. For example, the ending token index of the concept candidate “<html>*<td>(1,5)” is “5” (that is, the last token of the concept candidate is the 5th token in the untested token sequence 22), and there is no concept candidate having “6” as its beginning token index (that is, no concept candidate whose first token is the 6th token in the untested token sequence 22). Thus, the concept candidate “<html>*<td>(1,5)” is not a valid one; namely, it is an isolated concept candidate and need not be depicted in the preliminary concept graph 121.
- The preliminary concept graph 121 is then sent to the concept sequence selection module 14, which compares the preliminary concept graph 121 in FIG. 8 with the exemplary concept graph 111 in FIG. 5 by applying the dynamic programming technique (step A09) to determine an optimal hypothetical concept sequence 141 as shown in FIG. 9 (step A10). Finally, the matched target concept 33 (identical to the target concept 32 shown in FIG. 9) is captured from the optimal hypothetical concept sequence 141 so as to extract the targeted data (step A11). Therefore, the 32nd token in the untested token sequence 22 belonging to the target concept “BUY_CASH” shown in FIG. 6 is captured according to the target concept 32 for extracting the exchange rate of buying US dollars in cash “33.45500”.
- where x is a hypothetical concept, y is an exemplary concept, and x=y means that x and y have the same concept name; otherwise, x and y have different concept names.
- where nx is the number of tokens in the hypothetical concept x, ny is the number of tokens in the exemplary concept y, and Ε is a variable that controls the sensitivity of the cost to the difference between nx and ny.
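- The selection of an optimal hypothetical concept sequence by dynamic programming (steps A09 and A10) can be sketched as follows. This is a simplified stand-in, not the patent's procedure: it uses unit insertion, deletion, and substitution costs in place of the cost functions described above, and it compares the candidates against a single exemplary concept sequence rather than a full exemplary concept graph. The function name, the candidate list, and the ten-token length are assumptions.

```python
from functools import lru_cache


def best_sequence(candidates, exemplar, n_tokens):
    """Pick the candidate path covering tokens 1..n_tokens whose concept names
    have the smallest unit-cost edit distance to the exemplary sequence."""
    by_begin = {}
    for cand in candidates:
        by_begin.setdefault(cand[1], []).append(cand)

    @lru_cache(maxsize=None)
    def solve(pos, j):
        # pos: next token to cover; j: exemplary concepts already consumed.
        if pos == n_tokens + 1 and j == len(exemplar):
            return (0, ())
        best = (float("inf"), ())
        if j < len(exemplar):  # leave exemplar[j] unmatched
            cost, path = solve(pos, j + 1)
            if cost + 1 < best[0]:
                best = (cost + 1, path)
        for cand in by_begin.get(pos, []):
            name, _, end = cand
            cost, path = solve(end + 1, j)  # extra hypothetical concept
            if cost + 1 < best[0]:
                best = (cost + 1, (cand,) + path)
            if j < len(exemplar):  # align the candidate with exemplar[j]
                cost, path = solve(end + 1, j + 1)
                step = 0 if name == exemplar[j] else 1
                if cost + step < best[0]:
                    best = (cost + step, (cand,) + path)
        return best

    return solve(1, 0)


exemplar = ["<html>*<td>", "BUY_CASH", "</td><td>", "SELL_CASH", "</td>*</html>"]
candidates = [
    ("<html>*<td>", 1, 3), ("<html>*<td>", 1, 6),
    ("BUY_CASH", 4, 4), ("BUY_CASH", 7, 7),
    ("SELL_CASH", 4, 4), ("SELL_CASH", 7, 7),
    ("</td><td>", 5, 6),
    ("</td>*</html>", 5, 10), ("</td>*</html>", 8, 10),
]
cost, path = best_sequence(candidates, exemplar, 10)
targets = [c for c in path if c[0] == "BUY_CASH"]
```

A candidate that lies on no complete path from the first token to the last one, such as an isolated candidate, can never be part of the returned sequence, so the search ignores it without a separate pruning pass.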
- With reference to FIGS. 10 and 11, the second embodiment is basically the same as the first embodiment, except that the computer system further has a concept building module 13′. In this embodiment, the concept labeling module 11′ also defines some concept rules 113′ for specifying composite concepts composed of a plurality of specific concepts. In this way, the user can label the data of interest efficiently even if the targeted data are all the elements in a table. For example, if the user wants to extract all of the currency names (including USD, JPD, AUD, and EUR) and the exchange rates of different currencies as targeted data of the example web page as shown in FIG. 12, he/she must specify a total of 12 target concepts as examples (including 4 currency names, 4 rates of buying cash, and 4 rates of selling cash) according to the aforementioned method disclosed in the first embodiment. To relieve this burden, context-free rules are used to enable the user to define and specify composite concepts easily in the second embodiment. Thus, to extract the 12 target concepts in the example web page, the user only needs to specify 7 tuples.
- Referring to FIG. 11, in the example labeling phase of this embodiment, the computer system also captures the HTML source code of the example web page as shown in FIG. 12 (step B01), and tokenizes the HTML source code into a plurality of tokens as the exemplary token sequence 21′ (step B02). Then, the exemplary token sequence 21′ is converted into a plurality of specific concepts as a concept sequence (step B03). The tuples specified by the user in step B03 comprise the following 7 tuples:
- Tuple TB1: (Target, NAME, 18, 18, Three_Uppercase_Letter_Recognizer);
- Tuple TB2: (Target, BUY, 21, 21, Number_Recognizer);
- Tuple TB3: (Target, SELL, 24, 24, Number_Recognizer);
- Tuple TB4: (Target, RECORD, 18, 24, GenerateNormalRule[TB1, TB2, TB3]);
- Tuple TB5: (Target, RECORD, 29, 35, DoNothing);
- Tuple TB6: (Target, RECORD+, 18, 35, GenerateNormalRule[TB4, TB5]); and
- Tuple TB7: (Target, RECORD+, 18, 57, DoNothing).
- Tuple TB4 indicates that the sequence of tokens 18 to 24 is an example of the composite concept “RECORD”. The associated concept recognizer 112′ in tuple TB4 (i.e., GenerateNormalRule[TB1, TB2, TB3]) is a command, which indicates that this “RECORD” example consists of the examples specified by tuple TB1, tuple TB2, and tuple TB3. Since tuple TB4 spans from the 18th token to the 24th token, while tuple TB1 covers only the 18th token, tuple TB2 only the 21st token, and tuple TB3 only the 24th token in the exemplary token sequence 21′, the computer system will automatically generate the following filler concepts (assuming that the computer system automatically generates the recognizer of a filler concept according to the first token and the last token of the filler concept):
- Tuple TB8: (Filler, </td><td>, 19, 20, _DGR{</td><td>}); and
- Tuple TB9: (Filler, </td><td>, 22, 23, _DGR{</td><td>}).
- Therefore, the computer system will automatically generate the following context-free rule to define the format of the composite concept of tuple TB4:
- Rule C1: RECORD→NAME</td><td>BUY</td><td>SELL.
- That is, the format of the specific concept “RECORD” has to follow the format defined by rule C1.
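- The derivation of rule C1 from tuple TB4 and its parts can be sketched as follows. The helper names `filler_name` and `build_rule` are hypothetical, and the token contents at indices 19, 20, 22, and 23 are inferred from the names of tuples TB8 and TB9 rather than read from FIG. 13.

```python
def filler_name(gap_tokens):
    """Name a filler from its first and last token (wildcard only if longer than two)."""
    if len(gap_tokens) <= 2:
        return "".join(gap_tokens)
    return gap_tokens[0] + "*" + gap_tokens[-1]


def build_rule(composite, span, parts, token_at):
    """Derive the right-hand side of a rule for a composite concept.

    parts: (name, begin, end) triples inside span; uncovered gaps become fillers.
    """
    begin, end = span
    rhs, pos = [], begin
    for name, p_begin, p_end in sorted(parts, key=lambda p: p[1]):
        if p_begin > pos:
            rhs.append(filler_name([token_at[i] for i in range(pos, p_begin)]))
        rhs.append(name)
        pos = p_end + 1
    if pos <= end:
        rhs.append(filler_name([token_at[i] for i in range(pos, end + 1)]))
    return composite, rhs


token_at = {19: "</td>", 20: "<td>", 22: "</td>", 23: "<td>"}
parts = [("NAME", 18, 18), ("BUY", 21, 21), ("SELL", 24, 24)]  # TB1, TB2, TB3
lhs, rhs = build_rule("RECORD", (18, 24), parts, token_at)
```

The two gaps found inside the span correspond to the automatically generated tuples TB8 and TB9, and the resulting right-hand side has the same order of concepts as rule C1.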
- Tuple TB5 indicates that the sequence of tokens 29 to 35 is also an example of the “RECORD” concept, like tuple TB4. Because rule C1 has already been generated by the computer system, it is not necessary to define a new context-free rule for tuple TB5. Thus, the associated concept recognizer 112′ is the command “DoNothing”, which tells the computer system not to automatically generate an associated concept recognizer 112′ for this example.
- Tuple TB6 indicates that the sequence of tokens 18 to 35 is an example of a special kind of composite concept “RECORD+”, which represents one or more records. This special kind of composite concept is also called an aggregate concept. The associated concept recognizer 112′ in tuple TB6 (i.e., GenerateNormalRule[TB4, TB5]) is a command, which specifies that this “RECORD+” example consists of the examples specified by tuple TB4 and tuple TB5. As a result, the computer system automatically generates the following tuple:
- Tuple TB10: (Filler, </td>*<td>, 25, 28, _DGR{</td>*<td>}).
- This command also tells the computer system to generate rules for this “RECORD+” example. Therefore, the computer system generates the following context-free rules to represent the concept of one or more records:
- Rule C2: (RECORD+→RECORD); and
- Rule C3: (RECORD+→RECORD+</td>*<td>RECORD).
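- Rules C1 to C3 together describe one record followed by any number of further records, each preceded by the filler “</td>*<td>”. A hand-written recognizer equivalent to these rules can be sketched as below. It works on a sequence of concept names rather than on tokens, uses 0-based list positions, and the function names and the leading and trailing filler names in the example are assumptions.

```python
RECORD_RHS = ["NAME", "</td><td>", "BUY", "</td><td>", "SELL"]  # rule C1
SEPARATOR = "</td>*<td>"  # filler between records in rule C3


def match_record(names, i):
    """Return the position just past a RECORD starting at i, or None."""
    j = i + len(RECORD_RHS)
    return j if names[i:j] == RECORD_RHS else None


def match_record_plus(names, i):
    """Match RECORD+ at i; return (end position, record start positions) or None."""
    end = match_record(names, i)  # rule C2: a single RECORD is a RECORD+
    if end is None:
        return None
    starts = [i]
    # Rule C3, unrolled: RECORD+ followed by a separator and another RECORD.
    while end < len(names) and names[end] == SEPARATOR:
        nxt = match_record(names, end + 1)
        if nxt is None:
            break
        starts.append(end + 1)
        end = nxt
    return end, starts


names = (["<html><body>*<tr><td>"]
         + RECORD_RHS + [SEPARATOR] + RECORD_RHS + [SEPARATOR] + RECORD_RHS
         + ["</td></tr>*</body></html>"])
result = match_record_plus(names, 1)
```

Since the left-recursive rule C3 is written here as a loop, the same recognizer accepts one, three, or four records, and the recorded start positions allow each record to be expanded back into its NAME, BUY, and SELL parts.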
- The last user-specified tuple (i.e., tuple TB7) indicates that the sequence of tokens 18 to 57 is also an example of the aggregate concept “RECORD+”, like tuple TB6. Since all user-specified tuples have been processed, the system will generate the following tuples to cover the uncovered segments in the exemplary token sequence 21′:
- Tuple TB11: (Filler, <html><body>*<tr><td>, 1, 17, _DGR{<html><body>*<tr><td>}); and
- Tuple TB12: (Filler, </td></tr>*</body></html>, 58, 62, _DGR{</td></tr>*</body></html>}).
- It should be noted that although all the concept rules in this embodiment are generated by the computer system, the user is also allowed to define concept rules to specify the relation between tuples.
- After all tokens in the exemplary token sequence 21′ have been specified as specific concepts, an exemplary concept graph 111′ as shown in FIG. 14 is constructed by the concept labeling module 11′ according to the beginning token index and the ending token index of the tuples specified by the user and automatically generated by the computer system (step B05). It should be noted that the concepts specified by tuple TB1 to tuple TB6 are not used to construct the exemplary concept graph 111′ because the concept specified by tuple TB7 covers the concepts specified by tuple TB1 to tuple TB6.
- After the example labeling phase is finished, the computer system begins to extract targeted data from new documents. In the data extraction phase, the computer system first captures the HTML source code of the web page as shown in FIG. 15 (step B06), and tokenizes the HTML source code into the untested token sequence 22′ as shown in FIG. 16 (step B07). After the concept detection module 12′ detects concept candidates (step B08) and constructs the preliminary concept graph 121′ as shown in FIG. 17 (step B09), the preliminary concept graph 121′ is converted into a hypothetical concept graph 131′ as shown in FIG. 18 by the concept building module 13′ using the concept rules 113′ generated in the example labeling phase (step B10).
- The hypothetical concept graph 131′ is then sent to the concept sequence selection module 14′, which compares the hypothetical concept graph 131′ in FIG. 18 with the exemplary concept graph 111′ in FIG. 14 by applying the dynamic programming technique (step B11) to determine an optimal hypothetical concept sequence 141′ as shown in FIG. 19 (step B12). Next, the matched target concept 33′ (identical to the target concept 32′ shown in FIG. 19) is extracted from the optimal hypothetical concept sequence 141′ (step B13). Because the matched target concept 33′ in this embodiment is an aggregate concept, it has to be spanned as illustrated in FIG. 20 based on the previously determined concept rules (step B14) so that the computer system can extract the targeted data (step B15). That is, with reference to FIG. 16, the computer system can extract the currency names (corresponding to the target concept “NAME”), the exchange rates of buying cash (corresponding to the target concept “BUY”), and the exchange rates of selling cash (corresponding to the target concept “SELL”) from the untested token sequence 22′.
- According to the above description, in the present invention the user only needs to specify the targeted data of interest as examples in the example labeling phase (namely, the user is allowed to specify target concepts and filler concepts) to enable the computer system to efficiently predict the location of the targeted data in the untested token sequence based on the user-specified examples. Even if the layout of the web page changes slightly, the method described in the present invention can still accurately extract the targeted data. For example, in the second embodiment, the example web page shown in FIG. 12 records the exchange rates of four kinds of currencies and is tokenized into 62 tokens, while the untested web page shown in FIG. 15 records the exchange rates of only three kinds of currencies and is tokenized into 51 tokens. Nevertheless, the computer system can still precisely extract the exchange rates of USD, JPD, and EUR according to the definition of specific concepts and concept rules. Therefore, the present invention provides a highly practical and flexible method that is able to extract data of interest without asking the user to specify too many examples.
- Although the present invention has been explained in relation to its preferred embodiment, it is to be understood that many other possible modifications and variations can be made without departing from the spirit and scope of the invention as hereinafter claimed.
Claims (18)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW091137158A TWI221989B (en) | 2002-12-24 | 2002-12-24 | Example-based concept-oriented data extraction method |
TW91137158 | 2002-12-24 |
Publications (2)
Publication Number | Publication Date |
---|---|
US20040123237A1 (en) | 2004-06-24 |
US7107524B2 (en) | 2006-09-12 |
Family ID: 32590622
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/442,300 Expired - Fee Related US7107524B2 (en) | 2002-12-24 | 2003-05-21 | Computer implemented example-based concept-oriented data extraction method |
Country Status (2)
Country | Link |
---|---|
US (1) | US7107524B2 (en) |
TW (1) | TWI221989B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050251536A1 (en) * | 2004-05-04 | 2005-11-10 | Ralph Harik | Extracting information from Web pages |
US20110258213A1 (en) * | 2006-10-30 | 2011-10-20 | Noblis, Inc. | Method and system for personal information extraction and modeling with fully generalized extraction contexts |
US20150067810A1 (en) * | 2010-11-04 | 2015-03-05 | Ratinder Paul Singh Ahuja | System and method for protecting specified data combinations |
US9652529B1 (en) * | 2004-09-30 | 2017-05-16 | Google Inc. | Methods and systems for augmenting a token lexicon |
US10367786B2 (en) | 2008-08-12 | 2019-07-30 | Mcafee, Llc | Configuration management for a capture/registration system |
CN111428052A (en) * | 2020-03-30 | 2020-07-17 | 中国科学技术大学 | Method for constructing educational concept graph with multiple relations from multi-source data |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7254577B2 (en) * | 2004-06-29 | 2007-08-07 | International Business Machines Corporation | Methods, apparatus and computer programs for evaluating and using a resilient data representation |
US7630972B2 (en) * | 2007-01-05 | 2009-12-08 | Yahoo! Inc. | Clustered search processing |
US20090012841A1 (en) * | 2007-01-05 | 2009-01-08 | Yahoo! Inc. | Event communication platform for mobile device users |
US20080235260A1 (en) * | 2007-03-23 | 2008-09-25 | International Business Machines Corporation | Scalable algorithms for mapping-based xml transformation |
US20090043736A1 (en) * | 2007-08-08 | 2009-02-12 | Wook-Shin Han | Efficient tuple extraction from streaming xml data |
US8880537B2 (en) | 2009-10-19 | 2014-11-04 | Gil Fuchs | System and method for use of semantic understanding in storage, searching and providing of data or other content information |
US8625782B2 (en) * | 2010-02-09 | 2014-01-07 | Mitsubishi Electric Research Laboratories, Inc. | Method for privacy-preserving computation of edit distance of symbol sequences |
WO2019077405A1 (en) | 2017-10-17 | 2019-04-25 | Handycontract, LLC | Method, device, and system, for identifying data elements in data structures |
US11475209B2 (en) | 2017-10-17 | 2022-10-18 | Handycontract Llc | Device, system, and method for extracting named entities from sectioned documents |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6304870B1 (en) * | 1997-12-02 | 2001-10-16 | The Board Of Regents Of The University Of Washington, Office Of Technology Transfer | Method and apparatus of automatically generating a procedure for extracting information from textual information sources |
US20020016796A1 (en) * | 2000-06-23 | 2002-02-07 | Hurst Matthew F. | Document processing method, system and medium |
US6532469B1 (en) * | 1999-09-20 | 2003-03-11 | Clearforest Corp. | Determining trends using text mining |
US6901441B2 (en) * | 2000-07-12 | 2005-05-31 | International Business Machines Corporation | Knowledge sharing between heterogeneous devices |
- 2002-12-24: TW application TW091137158A, patent TWI221989B (en), status: not active (IP Right Cessation)
- 2003-05-21: US application US10/442,300, patent US7107524B2 (en), status: not active (Expired - Fee Related)
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050251536A1 (en) * | 2004-05-04 | 2005-11-10 | Ralph Harik | Extracting information from Web pages |
US7519621B2 (en) * | 2004-05-04 | 2009-04-14 | Pagebites, Inc. | Extracting information from Web pages |
US9652529B1 (en) * | 2004-09-30 | 2017-05-16 | Google Inc. | Methods and systems for augmenting a token lexicon |
US20110258213A1 (en) * | 2006-10-30 | 2011-10-20 | Noblis, Inc. | Method and system for personal information extraction and modeling with fully generalized extraction contexts |
US9177051B2 (en) * | 2006-10-30 | 2015-11-03 | Noblis, Inc. | Method and system for personal information extraction and modeling with fully generalized extraction contexts |
US10367786B2 (en) | 2008-08-12 | 2019-07-30 | Mcafee, Llc | Configuration management for a capture/registration system |
US20150067810A1 (en) * | 2010-11-04 | 2015-03-05 | Ratinder Paul Singh Ahuja | System and method for protecting specified data combinations |
US9794254B2 (en) * | 2010-11-04 | 2017-10-17 | Mcafee, Inc. | System and method for protecting specified data combinations |
US10313337B2 (en) * | 2010-11-04 | 2019-06-04 | Mcafee, Llc | System and method for protecting specified data combinations |
US10666646B2 (en) * | 2010-11-04 | 2020-05-26 | Mcafee, Llc | System and method for protecting specified data combinations |
US11316848B2 (en) | 2010-11-04 | 2022-04-26 | Mcafee, Llc | System and method for protecting specified data combinations |
CN111428052A (en) * | 2020-03-30 | 2020-07-17 | 中国科学技术大学 | Method for constructing educational concept graph with multiple relations from multi-source data |
Also Published As
Publication number | Publication date |
---|---|
US7107524B2 (en) | 2006-09-12 |
TWI221989B (en) | 2004-10-11 |
TW200411414A (en) | 2004-07-01 |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: INDUSTRIAL TECHNOLOGY RESEARCH INSTITUTE, TAIWAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: LIN, YI-CHUNG; CHIU, CHUNG-JEN; REEL/FRAME: 014100/0341. Effective date: 20030505 |
FPAY | Fee payment | Year of fee payment: 4 |
FPAY | Fee payment | Year of fee payment: 8 |
FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.) |
LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 20180912 |