US20140156567A1 - System and method for automatic document classification in ediscovery, compliance and legacy information clean-up - Google Patents
System and method for automatic document classification in ediscovery, compliance and legacy information clean-up
- Publication number
- US20140156567A1 (Application US13/693,075)
- Authority
- US
- United States
- Prior art keywords
- document
- machine learning
- classification
- information
- model representation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N99/005
- G06N5/02—Knowledge representation; Symbolic representation
- G06F16/35—Clustering; Classification (information retrieval of unstructured textual data)
- G06F16/355—Creation or modification of classes or clusters
- G06N20/00—Machine learning
Description
- the present invention generally relates to systems and methods for document classification, and more particularly to systems and methods for automatic document classification for electronic discovery (eDiscovery), compliance, clean-up of legacy information, and the like.
- the illustrative embodiments of the present invention provide improved systems and methods that address limitations of what is referred to as a bag-of-words (BOW) approach.
- the illustrative systems and methods can provide automatic document classification for eDiscovery, compliance, legacy-information clean-up, and the like, while allowing for usage of various machine-learning approaches, and the like, in multi-lingual environments, and the like.
- a system, method, and computer program product for automatic document classification including an extraction module configured to extract structural, syntactical and/or semantic information from a document and normalize the extracted information; a machine learning module configured to generate a model representation for automatic document classification based on feature vectors built from the normalized and extracted semantic information for supervised and/or unsupervised clustering or machine learning; and a classification module configured to select a non-classified document from a document collection, and via the extraction module extract normalized structural, syntactical and/or semantic information from the selected document, and generate via the machine learning module a model representation of the selected document based on feature vectors, and match the model representation of the selected document against the machine learning model representation to generate a document category, and/or classification for display to a user.
- the extracted information includes named entities, properties of entities, noun-phrases, facts, events, and/or concepts.
- the extraction module employs text-mining, language identification, gazetteers, regular expressions, noun-phrase identification with part-of-speech taggers, and/or statistical models and rules, and is configured to identify patterns, and the patterns include libraries, and/or algorithms shared among cases, and which can be tuned for a specific case, to generate case-specific semantic information.
- the extracted information is normalized by using normalization rules, groupers, thesauri, taxonomies, and/or string-matching algorithms.
- the model representation of the document is a TF-IDF document representation of the extracted information, and the clustering or machine learning includes a classifier based on decision trees, support vector machines (SVM), naïve-Bayes classifiers, k-nearest neighbors, rules-based classification, linear discriminant analysis (LDA), Maximum Entropy Markov Models (MEMM), scatter-gather clustering, and/or hierarchical agglomerative clustering (HAC).
- FIG. 1 illustrates a system for automatic document classification
- FIG. 2 illustrates a process of assignment of a unique identifier per document and extraction and storage of various structural, syntactical and semantic information from each individual document;
- FIG. 3 illustrates a machine learning process with training and testing of a machine learning model
- FIG. 4 illustrates an automatic classification process of new documents with a machine learning model
- FIG. 5 illustrates a process to create a meta data record for each document
- FIG. 6 illustrates a process to create a unique document identifier (ID) for each document and store the ID in meta data information storage;
- FIG. 7 illustrates a process to extract and store various structural, syntactical and semantic information from a document
- FIG. 8 illustrates a data structure to extract and store various structural, syntactical and semantic information from a document
- FIG. 9 illustrates a data structure to manually or otherwise label training and test documents for machine learning;
- FIG. 10 illustrates a process to train a machine learning model with a supervised or unsupervised machine learning algorithm
- FIG. 11 illustrates a process to test a machine learning model for a supervised or unsupervised machine learning algorithm
- FIG. 12 illustrates a process to classify new documents with a machine learning model
- FIG. 13 illustrates a process to extract textual content from a document
- FIG. 14 illustrates an overview of structural, syntactical and semantic information that can be extracted from documents to represent feature vectors for machine learning
- FIG. 15 illustrates an overview of creation of a feature vector from extracted information; and
- FIG. 16 illustrates an overview of a bag-of-words (BOW) approach and creation of feature vectors for machine learning.
- the present invention includes recognition that the ongoing information explosion is reaching epic proportions and has earned its own name: Big Data.
- Big Data encompasses both challenges and opportunities.
- the opportunity, as focused on by many parties, is to use the collective Big Data to predict and recognize patterns and behavior and to increase revenue and optimize business processes.
- But there is also a dark side to Big Data: requirements for eDiscovery, compliance, legacy-information clean-up, governance, privacy and storage can lead to enormous costs and unexpected or unknown risks.
- New data formats (e.g., multimedia, in particular), different languages, cloud and other off-site locations and the continual increase in regulations and legislation—which may contradict previous protocols—add even more complexity to this puzzle.
- Applying content analytics helps to assuage the dark side of Big Data.
- Content analytics such as text-mining and machine-learning technology from the field of artificial intelligence can be used very effectively to manage Big Data.
- Consider tasks, for example, such as identifying exact and near-duplicates, structuring and enriching the content of text and multimedia data, identifying relevant (e.g., semantic) information, facts and events, and ultimately, automatically clustering and classifying information, and the like.
- Content-analytics can be used for any suitable type of application where unstructured data needs to be classified, categorized or sorted.
- Other examples are early-case assessment and legal review in eDiscovery (e.g., also known as machine-assisted review, technology-assisted review or predictive coding), and enforcement of existing rules, policies and regulations in compliance.
- But also identifying privacy-sensitive information, legacy-information clean-up and information valuation in enterprise information management are good examples.
- As a result of these content analytics efforts, users can explore and understand repositories of Big Data better and also apply combinations of advanced search and data visualization techniques more easily.
- Both supervised and unsupervised machine learning techniques can be used to classify documents automatically and reveal more complex insights into Big Data.
- a machine learning model can be trained with a seed set of documents (e.g., samples), which are often annotated documents for particular information categories or known information patterns. Based on these training documents, a machine learning algorithm can derive a model that can classify other documents into the intended classes, or identify temporal, geographical, correlational or hierarchical patterns from these training examples.
- Machine learning is not perfect: the more document categories there are, the lower the quality of the document classification can be. This is logical, as it is easier to differentiate only black from white than to differentiate 1,000 shades of gray. The absence of sufficient relevant training documents will also lower the quality of classification. The number of required training documents grows faster than the number of categorization classes: for 2 times more classes, one may need 4 times more training documents.
- Machine learning and other artificial intelligence techniques used to predict patterns and behavior are not based on “hocus pocus”: they are based on solid mathematical and statistical frameworks in combination with common-sense or biology-inspired heuristics.
- In the case of text-mining, there is an extra complication: the content of textual documents has to be translated, so to speak, into numbers (e.g., probabilities, mathematical notions such as vectors, etc.) that machine learning algorithms can interpret. The choices that are made during this translation can strongly influence the results of the machine learning algorithms.
- the documents can be converted into a manageable representation. Typically, they are represented by so-called feature vectors.
- a feature vector can include a number of dimensions of features and corresponding weights.
- the process of choosing features for the feature vector is called feature selection.
- the commonly used representation can be referred to as a bag-of-words (BOW), where each word is a feature in the vector and the weights are either 1 if the word is present in the document or 0 if not.
- More complex weighting schemes are, for example, Term Frequency-Inverse Document Frequency (TF-IDF), and the like, which gives different weights based on frequency of words in a document and in the overall collection.
- the TF-IDF approach provides a numerical measure of the importance of a particular word to a document in a corpus of documents.
- the advantage of this technique is that the value increases proportionally to the number of times the given word occurs in the document, but decreases if the word occurs more often in the whole corpus of documents. This relates to the fact that the distributions of words in different languages vary widely.
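- As a rough numerical illustration of the TF-IDF weighting described above, the following Python sketch computes the weights for a toy corpus (a minimal sketch assuming whitespace tokenization and the classic tf·log(N/df) formulation; the corpus and function name are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count documents, not occurrences
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in tf.items()})
    return vectors

corpus = [text.lower().split() for text in (
    "the contract was signed in amsterdam",
    "the contract covers data storage",
    "machine learning classifies the documents",
)]
# "the" occurs in every document, so its weight is log(3/3) = 0
for vector in tfidf_vectors(corpus):
    print(vector)
```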
- the bag-of-words model has several practical limitations, for example, including: (1) typically, it is not possible to use the approach on documents that use different languages within and in-between documents; (2) machine-learning models typically cannot be re-used between different cases, so one has to start all over again for each case; (3) the model typically cannot handle dynamic collections, so when new documents are added to a case, one has to start all over again; (4) when the model does not perform well enough, one has to start training all over again with a better training set; (5) typically there is no possibility to patch the model, nor is there a guarantee of success; and (6) in applications where defensibility in court and clarity are important, such as eDiscovery, compliance, legacy-information clean-up, and the like, an additional complication of the bag-of-words approach is that it is hard to understand and explain to an audience in layman's terms.
- the bag-of-words model also has several technical limitations that may result in having completely different documents ending up in the exact same vector for machine learning, and having documents with the same meaning ending up as completely different vectors.
- the high-dimensional feature vectors are very sparse, that is, most of the dimensions can be (e.g., close to) zero. This opens up the opportunity for data compression, but also causes machine learning problems, such as a very high computational complexity, resulting in relatively huge memory and processing requirements, over-fitting (e.g., random error and noise in the training set is used instead of the actual underlying relationships to derive the machine learning model from the training examples), rounding errors (e.g., multiplying very small probabilities over and over again may result in a floating-point underflow), and the like.
- the bag-of-words approach takes simplicity one step too far.
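- The technical limitation noted above (different documents, identical vectors) follows directly from the fact that a BOW vector discards word order, as this toy Python fragment shows (vocabulary and sentences invented for illustration):

```python
def bag_of_words(text, vocabulary):
    """Binary bag-of-words: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocab = ["defendant", "plaintiff", "sued", "the"]
print(bag_of_words("the plaintiff sued the defendant", vocab))  # [1, 1, 1, 1]
print(bag_of_words("the defendant sued the plaintiff", vocab))  # [1, 1, 1, 1]
# Opposite meanings, identical vectors.
```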
- Variant Identification and Grouping: it is sometimes needed to recognize variant names as different forms of the same entity, giving accurate entity counts as well as the location of all suitable appearances of a given entity. For example, one may need to recognize that the word "Smith", in an example, refers to the "Joe Smith" identified earlier, and therefore group them together as aliases of the same entity.
- Normalization: normalizes entities such as dates, currencies, and measurements into standard formats, taking the guesswork out of the metadata creation, search, data mining, and link analysis processes (a sketch follows this list).
- Entity Boundary Detection: will the technology consider "Mr. and Ms. John Jones" as one or two entities? And what will the processor consider to be the start and end of an excerpt, such as "VP John M. P. Kaplan-Jones, Ph.D. M.D."?
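- A minimal Python sketch of the date-normalization idea follows (the patterns, the assumed day-month order of the second format, and the helper name are illustrative, not the actual normalization rules):

```python
import re
from datetime import datetime

DATE_FORMATS = [
    ("%B %d, %Y", re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")),  # January 3, 2012
    ("%d-%m-%Y", re.compile(r"\d{2}-\d{2}-\d{4}")),            # 03-01-2012 (day-month assumed)
]

def normalize_dates(text):
    """Rewrite date expressions into ISO 8601 so that variant surface forms
    of the same date normalize to one canonical value."""
    for fmt, pattern in DATE_FORMATS:
        for match in pattern.findall(text):
            try:
                text = text.replace(match, datetime.strptime(match, fmt).date().isoformat())
            except ValueError:
                pass  # matched the shape but is not a real date
    return text

print(normalize_dates("Signed January 3, 2012; effective 03-01-2012."))
# -> "Signed 2012-01-03; effective 2012-01-03."
```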
- the text can include various references and co-references.
- Various anaphora and co-references can be disambiguated before it is possible to fully understand and extract the more complex patterns of events.
- the following list shows examples of these (e.g., mutual) references:
- Pronominal Anaphora: he, she, we, oneself, etc.
- Apposition: the additional information given to an entity, such as "John Doe, the father of Peter Doe".
- Predicate Nominative: the additional description given to an entity, for example "John Doe, who is the chairman of the soccer club".
- Identical Sets: a number of reference sets referring to equivalent entities, such as "Giants", "the best team", and the "group of players", which all refer to the same group of people (a toy grouping sketch follows this list).
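- A toy Python sketch of resolving such references to one canonical entity (the alias table is hand-made for illustration; in practice it would be produced by the grammatical and matching analysis described below):

```python
from collections import Counter

# Hand-made alias table for illustration only.
ALIASES = {
    "joe smith": "PERSON:Joe Smith",
    "smith": "PERSON:Joe Smith",
    "mr. smith": "PERSON:Joe Smith",
    "giants": "GROUP:Giants",
    "the best team": "GROUP:Giants",
}

def canonical_entities(mentions):
    """Map surface mentions to canonical entity IDs and count them."""
    return Counter(ALIASES.get(m.lower(), "UNKNOWN:" + m) for m in mentions)

print(canonical_entities(["Joe Smith", "Smith", "Mr. Smith", "Giants", "the best team"]))
```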
- Documents are represented by extracted semantic information, such as (e.g., named) entities, properties of entities, noun-phrases, facts, events and other high-level concepts, and the like. Extraction is done by using any known techniques from text-mining, for example: language identification, gazetteers, regular expressions, noun-phrase identification with part-of-speech taggers, statistical models and rules, and the like, to identify more complex patterns. These patterns, libraries, algorithms, and the like, can be shared among cases, but can also be fine-tuned for a specific case, so only case-specific relevant semantic information is extracted. Extracted information can be normalized by using any suitable type of known technique, for example, including normalization rules, groupers, thesauri, taxonomies, string-matching algorithms, and the like.
- vectors built of the normalized and extracted semantic information are used as feature vectors for any suitably known supervised or unsupervised clustering and machine learning technique.
- machine-learning algorithms include Decision Trees, Support Vector Machines (SVM), Naïve-Bayes Classifiers, k-Nearest Neighbors, rules-based classification, Scatter-Gather Clustering, or Hierarchical Agglomerative Clustering (HAC), and the like.
- Feature vectors can be built for specific types of cases by extracting only suitable information that is relevant for the case. This can make machine learning more defensible in court and create more clarity in applications, for example, such as eDiscovery, compliance, legacy-information clean-up, and the like.
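- The following hedged Python sketch illustrates this core idea, using scikit-learn's DictVectorizer and LinearSVC as stand-ins for the feature-vector construction and SVM classifier (the feature names, counts, and labels are invented; this is a sketch of the approach, not the actual implementation):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Normalized semantic features per document, i.e. the output of the
# extraction and normalization steps (contents invented for illustration).
train_features = [
    {"PERSON:john_doe": 2, "CONCEPT:contract": 3, "DATE:2012-01-03": 1},
    {"PERSON:jane_roe": 1, "CONCEPT:invoice": 4, "CURRENCY:EUR": 2},
]
train_labels = ["legal", "finance"]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(train_features)   # sparse feature vectors
model = LinearSVC().fit(X, train_labels)

new_doc = {"CONCEPT:contract": 1, "PERSON:john_doe": 1}
print(model.predict(vectorizer.transform([new_doc])))  # typically ['legal']
```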
- the automatic document classification system 100 provides for automatically extracting structural, semantic, syntactic, and the like, information from relevant training models, based on, for example, entities, facts, events, concepts, and the like, to train a machine learning model, and the like, and use the derived machine learning model for automatic document classification, and the like.
- the system includes for example, a document storage 113 (e.g., a computer storage device, etc.) including one or more document collections 111 , one or more document meta data information storage 109 and one or more machine learning models 304 , accessed through one or more servers 101 , 103 and 105 .
- the system 100 can be used for (1) automatic extraction of structural, semantic, and syntactic information from relevant training models, for example, based on entities, facts, events, concepts, and the like, (2) training of a machine learning model, and the like, and (3) using the derived machine learning model for automatic document classification, and the like, into various trained categories, and the like.
- One or more local computers 121 can provide connectivity to one or more users 123 and 125 , for example, via a local-area network (LAN), and the like.
- one or more remote computers 127 can provide connectivity to one or more remote users 117 and 119 , for example, via the Internet, an Intranet, a wide-area network (WAN) 115 , and the like.
- the computers 121 and 127 connect to the document storage 113 to allow the one or more users 123, 125, 119 and 117 to manually or automatically access the document collection 111, view documents, document groups, document meta information, training documents, training results, the machine learning model, document classifications, and the like.
- the servers 101 , 103 and 105 communicate with the computer storage 113 to extract meta data information 109 for each document in the document collection 111 , to create unique document identifiers for each document, to label the document meta data 109 with the document identifiers of the document groups, to create a machine learning model 304 , and to automatically train the machine learning model 304 and use this machine learning model 304 for automatic document classification of other documents (e.g., documents not used for training the machine learning model), test the quality of the machine learning model 304 with pre-labeled test documents from the document collection 111 , and the like.
- other documents e.g., documents not used for training the machine learning model
- the users 123 , 125 , 119 and 117 can access the document collection 111 by using the computers 121 and 127 connected over a LAN or the Internet or Intranet 115 .
- the system can show the content of the documents 111 , the meta information of the documents in the meta information storage 109 , the training documents (e.g., selection from 111 ), the machine learning model 304 , and the labels of the automatically categorized documents from 111 in the meta data storage 109 .
- FIG. 2 illustrates a process 200 of the assignment of a unique identifier per document and the extraction, and the storage of various types of structural, syntactical and semantic information from each individual document.
- a record in the meta information storage 109 is created and stored.
- a unique document identifier (ID), for example, such as a unique serial number or an MD-5, SHA-1, SHA-2, or SHA-128 hash value, and the like, is created.
- the unique identifier is stored in a record in the meta data information storage 109 that belongs to the corresponding document in the document collection database 111 .
- in step 217, for each document in the document collection 111, various types of structural, syntactic, and semantic information are extracted by using certain user settings from a database 201, as set for the various information extraction techniques by, for example, a user or a group of users 203.
- the extracted information is stored in a record in the meta data information storage 109 that belongs to the corresponding document in the document collection database 111 .
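- A minimal Python sketch of the identifier and storage steps, using one of the hash functions mentioned above (SHA-1) and a plain dictionary standing in for the meta data information storage 109 (names illustrative):

```python
import hashlib

meta_store = {}  # stands in for meta data information storage 109

def register_document(path):
    """Create a meta data record keyed by the SHA-1 digest of the file content."""
    with open(path, "rb") as f:
        doc_id = hashlib.sha1(f.read()).hexdigest()
    meta_store[doc_id] = {"path": path, "extracted": {}, "labels": []}
    return doc_id
```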
- FIG. 3 illustrates a machine learning process 300 with training and testing of the machine learning model.
- a user or a group of users 310 manually or otherwise identify a set of relevant training and testing documents from the document collection 111 .
- the set of training documents need not include documents in common with the set of testing documents. These sets can be mutually exclusive.
- Selection of relevant training material can also be done by using clustering or concept search techniques that cluster similar documents for certain document categories, for example, by self-organization or vector decomposition techniques (e.g., Hierarchical Clustering algorithms, Kohonen Self-Organizing maps, linear discriminant analysis (LDA), etc.), and the like.
- a user or a group of users 311 manually or otherwise tag the selected training and testing documents from document collection 111 with the corresponding document categories.
- the machine learning model 304 is trained by using a vector representation created from the records with the extracted information for each document in the meta information storage 109 , together with the document categorization label, which exists for each document from the training set in the document collection database 111 .
- Both supervised and unsupervised machine learning algorithms can be used, for example, such as Support Vector Machines (SVM), k-Nearest Neighbor (kNN), naïve Bayes, Decision Rules, k-means, Hierarchical Clustering, Kohonen self-organizing feature maps, linear discriminant analysis (LDA), and the like.
- the machine learning model 304 is tested by comparing recognized categories with pre-labeled categories from the test documents in document database 111. This testing can be done by a user or a user group 313. Results of step 305 can be reported, for example, in terms of precision, recall, f-values, and the like, and other best-practice measurements from the field of information retrieval.
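- A hedged Python sketch of this train-and-test loop, using scikit-learn and reporting the precision, recall and f-values mentioned above (the split ratio and choice of SVM are illustrative assumptions):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_and_test(feature_dicts, labels):
    """Train an SVM on extracted-information vectors and report precision,
    recall and F1 on a held-out, stratified test split."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(feature_dicts)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0, stratify=labels)
    model = LinearSVC().fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    return vectorizer, model
```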
- FIG. 4 illustrates an automatic classification process 400 of new documents with the machine learning model 304.
- a non-classified document is selected from document collection 111 .
- This can be a document that is also not part of the training or test set of documents used in 300 .
- documents are classified.
- Process 404 includes a number of steps. For example, at step 217 , the various structural, syntactical and semantic information for the selected document is obtained from the meta data information store. This information is converted into a vector representation in step 402 and then matched against the machine learning model 304 . From the machine learning model 304 , a document category or classification is obtained in step 403 .
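- Continuing the sketch above, classifying a non-classified document then amounts to vectorizing its extracted information with the same vectorizer and matching it against the trained model:

```python
def classify(vectorizer, model, feature_dict):
    """Vectorize one non-classified document and return its category."""
    return model.predict(vectorizer.transform([feature_dict]))[0]
```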
- FIG. 5 illustrates a process 500 explaining in more detail step 211 from process 200 to create a meta data record for each document.
- a meta data record is created in step 503 .
- Each document 501 can hold a unique corresponding meta data record 507 , for example, illustrated as documents with linked meta data records in 505 .
- FIG. 6 illustrates a process 600 explaining in more detail step 213 from process 200 to create a unique document identifier (ID) for each document and store the ID in the meta data information storage.
- each document 501 holds a unique corresponding meta data record 507, illustrated as documents with meta data records in 505, and has associated therewith a unique identifier, for example, such as a unique serial number or an MD-5, SHA-1, SHA-2, or SHA-128 hash value, and the like, representing the document.
- Each document 601 can include a unique corresponding meta data record with a unique identifier 607 , for example, illustrated as documents with linked meta data records in 603 .
- FIG. 7 illustrates a process 700 to extract and store various types of structural, syntactical and semantic information from a document, and explaining in more detail step 217 from process 200 .
- step 701 reads user preferences related to the information extraction from user settings and preferences in database 201 .
- the document textual content is extracted at step 703; optionally, non-relevant information (e.g., numbers, reading signs, and other data that is not relevant to distinguish the document category) is filtered out at step 704; and a language-dependent part-of-speech tagging is implemented to assign a linguistic category to each word, for example, such as NOUN, VERB, DETERMINER, PRONOUN, ADVERB, PROPER NOUN, CONJUNCTION, and the like, so as to find linguistic structures, for example, such as VERB PHRASES, NOUN PHRASES, and the like.
- Step 705 also can include automatic language recognition.
- Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare—in natural languages (e.g., as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb: "The sailor dogs the barmaid."
- performing grammatical tagging can indicate that "dogs" is a verb, and not the more common plural noun, since one of the words must be the main verb, and the noun reading is less likely following "sailor" (e.g., sailor !→ dogs). Semantic analysis can then extrapolate that "sailor" and "barmaid" implicate "dogs" as (1) in the nautical context (e.g., sailor <verb> barmaid), and (2) an action applied to the object "barmaid" (e.g., [subject] dogs → barmaid).
- “dogs” is a nautical term meaning “fastens (e.g., a watertight barmaid) securely; applies a dog to”.
- a proper linguistic grammatical analysis is very relevant to solve linguistic boundary problems (e.g., where does a named entity start and end, and which parts of a sentence belong to the named entity; e.g., the words in "Mr. John D. Jones Ph.D. Jr." are all part of one entity) and to find out if a named entity is one entity or a conjunction of entities (e.g., "Mr. and Mrs. Jones" are two entities, and "State University of New York" is one entity).
- Grammatical analysis also helps resolving co-reference and pronoun ambiguity, which is advantageous for the machine learning later on.
- the named entities in a document can be obtained reliably in step 706 .
- techniques for example, such as gazetteers, dictionaries, regular expressions, rules, patterns, Hidden Markov Models, Support Vector Machines, Maximal Entropy Models and other suitable statistics, and the like, can be used to classify the named entities into structural, syntactic, semantic and pragmatic classes, for example, such as person, city, country, address, job title, credit card number, social security number, and the like, but also more complex relations, for example, such as sentiments, locations, problems, route, concepts, facts, events, and thousands more such roles and meanings, and the like, in step 709 .
- users can select what extracted information is most relevant for the application in step 707 and use this in the steps 708 and 709 before the data is entered into the machine learning model.
- all suitable entities are normalized, for example, including the following functions: normalization of entities such as names, dates, numbers and (e.g., email) addresses; having textual entities refer to the same real-world object in a database; semantic normalization (e.g., of meaning); resolving synonyms and homonyms; and stemming of verbs and nouns. Normalization can reduce the number of unique entities by up to 80%, and greatly improves the quality of the machine learning.
- the extracted information, which is a result of all of the previous sub-steps of step 217, is stored for each document in the meta data information storage 109.
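- A minimal Python sketch of pattern-based extraction and normalization for a few of the classes named above (the regular expressions are simplified illustrations, not production-grade patterns):

```python
import re

ENTITY_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract_entities(text):
    """Pattern-based extraction into semantic classes; real systems add
    gazetteers, statistical models and case-specific rules."""
    found = {}
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.findall(text):
            # Normalize separators and case so variants group together.
            norm = match.lower() if label == "EMAIL" else re.sub(r"[ -]", "", match)
            found.setdefault(label, set()).add(norm)
    return found

print(extract_entities("SSN 078-05-1120, card 4111 1111 1111 1111, JDoe@Example.com"))
```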
- FIG. 8 illustrates a data structure to extract and store various types of structural, syntactical and semantic information from a document 800 .
- step 217 is specified in more detail, and in particular as to how the extracted various types of structural, syntactical and semantic information is stored in the meta information data store.
- the additional structural, syntactical and semantic information 801 is stored in each unique record 607 linked to a document in sub-process 805 .
- FIG. 9 illustrates a data structure 900 to manually label training and test documents for machine learning.
- step 302 is specified in more detail.
- a relevant set of training and testing documents is selected by a user or a group of users using manual and/or automatic techniques, for example, such as intuition, statistics, cross validation, maximum likelihood estimation, clustering, self-organization, feature extraction and feature selection methods, and the like.
- Each document 501 is then labeled manually or otherwise with the class or classes 900 it belongs to. This additional information is stored in the unique record 607 that exists for each document 501 in sub-process 905 .
- FIG. 10 illustrates a process to train the machine learning model with a supervised or unsupervised machine learning algorithm 1000 .
- step 303 is explained in more detail.
- mathematical vectors can be created from the categorical data in step 1001, as explained in more detail in FIG. 16.
- these vectors are used as feature vectors for any known supervised or unsupervised clustering and machine learning technique, and the like, at step 1002 .
- Machine-learning algorithms that can be used, for example, include Decision Trees, Support Vector Machines (SVM), Naïve-Bayes Classifiers, k-Nearest Neighbors, rules-based classification, Scatter-Gather Clustering, Linear Discriminant Analysis (LDA), or Hierarchical Agglomerative Clustering (HAC), and the like.
- a machine learning model 304 is obtained that can be used for automatic document classification.
- the machine learning model 304 can be a binary classifier, with one trained classifier per category or a multi-class classifier trained to recognize multiple classes with one classifier, and the like.
- FIG. 11 illustrates a process 1100 to test the machine learning model for a supervised or unsupervised machine learning algorithm.
- step 305 for testing the machine learning model is explained in more detail.
- a feature vector is created from the extracted structural, syntactical and semantic information that is stored in the meta data records 109 for each test document in step 1101.
- this vector is then mapped against the machine learning model 304 and the machine learning model 304 returns a recognized document class.
- the vector of the test document is compared to each classifier and a value representing the measure of recognition is returned.
- the test document can be included or excluded for one or more classes.
- the recognized classes are returned as one or more return values in a predefined range, for example, where higher values represent a match and lower values represent a mismatch with each category.
- the values of the classes which are a best match for the vector of the test document are returned.
- the name of the class(es) with the highest returned values can be resolved to a categorical value by using information in 109, in step 1103.
- test results can be expressed in terms of precision and recall, or a combination of precision and recall, and the like, for example, the so-called f-values, or eleven points of precision based on an arithmetic average.
- FIG. 12 illustrates a process 1200 to classify new documents with the machine learning model.
- step 404 is explained in more detail.
- a feature vector is created from the extracted structural, syntactical and semantic information that is stored in the meta data records 109 for each new document in step 1201.
- this vector is then mapped against the machine learning model 304 and the machine learning model 304 returns a recognized document class.
- the vector of the document is compared to each classifier and a value representing the measure of recognition is returned.
- the document can be included or excluded for one or more classes.
- the recognized classes are returned as one or more return values in a predefined range, where higher values represent a match, and lower values represent a mismatch with each category.
- the values of the classes which are a best match for the vector of the test document are returned.
- the name of the class(es) with the highest returned values is resolved to a categorical value by using information in 109, in step 1203.
- the system then returns the recognized document class(es) in step 1204.
- FIG. 13 illustrates a process 1300 to extract textual content from a document.
- step 702 is explained in more detail.
- the document is first opened in step 1305 , and all suitable textual content is extracted from the document in step 1307 .
- In addition to all suitable visible text and document layout, this also includes any suitable type of non-visible textual information, for example, such as file security, document properties, project properties, user information, file system and storage properties, and any other suitable type of hidden or meta data information, and the like.
- the low-level document encoding (e.g., UNICODE, Windows code pages, ASCII, ANSI, etc.) is then normalized into one common text representation (e.g., often 16-bit UNICODE), and the sequence of the words (e.g., left-right for Roman, right-left for Arabic, and top-down for Asian languages) and the layout sequence are normalized.
- the result is a uniform textual representation of all suitable textual content of a document in 1311 .
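- A minimal Python sketch of the encoding-normalization idea (the codec list is an illustrative assumption; real systems combine format-specific text extractors with proper encoding detection):

```python
import unicodedata

def uniform_text(raw: bytes) -> str:
    """Decode raw bytes into one common Unicode representation."""
    for codec in ("utf-8", "utf-16", "cp1252"):
        try:
            text = raw.decode(codec)
            break
        except UnicodeDecodeError:
            continue
    else:
        text = raw.decode("latin-1")  # last resort; never fails
    return unicodedata.normalize("NFC", text)

print(uniform_text("Zürich café".encode("cp1252")))  # -> "Zürich café"
```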
- FIG. 14 is an illustrative overview 1400 of structural, syntactical and semantic information that can be extracted from documents to represent the feature vectors for machine learning.
- examples of named entities such as CITY, COMPANY, COUNTRY and CURRENCY, and the like, but also more relatively complex patterns, such as sentiments, problems, and the like, can be derived.
- extracted information can be anything that is relevant, for example, structural, syntactical or semantic information, and the like, that is unique for a particular document class. Extracted information can include inference by rules, so if information types A and B occur in a document, then the system can infer that the document is about information C, and also tag the document with that value.
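- A tiny Python sketch of such inference by rules (rule contents invented for illustration):

```python
# If every feature in the condition set occurs in a document, add the tag.
RULES = [
    ({"CONCEPT:route", "ENTITY:CITY"}, "CONCEPT:travel"),
    ({"ENTITY:CURRENCY", "CONCEPT:invoice"}, "CONCEPT:payment"),
]

def apply_rules(features):
    derived = set(features)
    for conditions, conclusion in RULES:
        if conditions <= derived:
            derived.add(conclusion)
    return derived

print(apply_rules({"CONCEPT:route", "ENTITY:CITY", "ENTITY:DATE"}))
```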
- FIG. 15 is an illustrative overview of the creation of a feature vector 1500 from the extracted information.
- each categorical value can be represented by a numerical value, and the like. This numerical value can be unique for each value a category can hold in the entire data set.
- once this representation is created, the quality of the machine learning and clustering is best when all of the suitable available addressing space is used. This process can be implemented automatically, by taking into account all suitable values for all different categories of the extracted structural, syntactic and semantic information, and the like.
- users can also select the most relevant categories of extracted information, and the most relevant values per category, as input for the feature vectors, thereby greatly reducing the number of dimensions of the feature vectors, which focuses the model on the most relevant (e.g., most distinguishing) features per document class, and thus reduces the complexity of the machine learning model and the training, testing and classification time.
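- A minimal Python sketch of this categorical-to-numerical mapping (the index-assignment scheme is one illustrative choice):

```python
def build_index(docs):
    """Assign each distinct (category, value) pair one vector dimension."""
    pairs = sorted({(cat, val) for doc in docs
                    for cat, vals in doc.items() for val in vals})
    return {pair: i for i, pair in enumerate(pairs)}

def to_vector(doc, index):
    """Binary feature vector over the (category, value) addressing space."""
    vector = [0] * len(index)
    for cat, vals in doc.items():
        for val in vals:
            if (cat, val) in index:
                vector[index[(cat, val)]] = 1
    return vector

docs = [{"CITY": ["Amsterdam"], "CONCEPT": ["contract"]},
        {"CITY": ["Amsterdam"], "PERSON": ["John Doe"]}]
index = build_index(docs)
for doc in docs:
    print(to_vector(doc, index))  # [1, 1, 0] and [1, 0, 1]
```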
- FIG. 16 is an illustrative overview 1600 of the bag-of-words approach and creation of feature vectors for machine learning. From this example, it can be seen that very different sentences can obtain similar vector representations and, vice versa, sentences with similar meanings can obtain very different vectors. This disadvantageous effect severely degrades the quality of the machine learning and clustering algorithms, as compared to the systems and methods of the present invention.
- the above-described devices and subsystems of the illustrative embodiments can include, for example, any suitable servers, workstations, PCs, laptop computers, PDAs, Internet appliances, handheld devices, cellular telephones, smart phones, wireless devices, other devices, and the like, capable of performing the processes of the illustrative embodiments.
- the devices and subsystems of the illustrative embodiments can communicate with each other using any suitable protocol and can be implemented using one or more programmed computer systems or devices.
- One or more interface mechanisms can be used with the illustrative embodiments, including, for example, Internet access, telecommunications in any suitable form (e.g., voice, modem, and the like), wireless communications media, and the like.
- employed communications networks or links can include one or more wireless communications networks, cellular communications networks, G3 communications networks, Public Switched Telephone Networks (PSTNs), Packet Data Networks (PDNs), the Internet, intranets, a combination thereof, and the like.
- the devices and subsystems of the illustrative embodiments are for illustrative purposes, as many variations of the specific hardware used to implement the illustrative embodiments are possible, as will be appreciated by those skilled in the relevant art(s).
- the functionality of one or more of the devices and subsystems of the illustrative embodiments can be implemented via one or more programmed computer systems or devices.
- a single computer system can be programmed to perform the special purpose functions of one or more of the devices and subsystems of the illustrative embodiments.
- two or more programmed computer systems or devices can be substituted for any one of the devices and subsystems of the illustrative embodiments. Accordingly, principles and advantages of distributed processing, such as redundancy, replication, and the like, also can be implemented, as desired, to increase the robustness and performance of the devices and subsystems of the illustrative embodiments.
- the devices and subsystems of the illustrative embodiments can store information relating to various processes described herein. This information can be stored in one or more memories, such as a hard disk, optical disk, magneto-optical disk, RAM, and the like, of the devices and subsystems of the illustrative embodiments.
- One or more databases of the devices and subsystems of the illustrative embodiments can store the information used to implement the illustrative embodiments of the present inventions.
- the databases can be organized using data structures (e.g., records, tables, arrays, fields, graphs, trees, lists, and the like) included in one or more memories or storage devices listed herein.
- the processes described with respect to the illustrative embodiments can include appropriate data structures for storing data collected and/or generated by the processes of the devices and subsystems of the illustrative embodiments in one or more databases thereof.
- All or a portion of the devices and subsystems of the illustrative embodiments can be conveniently implemented using one or more general purpose computer systems, microprocessors, digital signal processors, micro-controllers, and the like, programmed according to the teachings of the illustrative embodiments of the present inventions, as will be appreciated by those skilled in the computer and software arts.
- Appropriate software can be readily prepared by programmers of ordinary skill based on the teachings of the illustrative embodiments, as will be appreciated by those skilled in the software art.
- the devices and subsystems of the illustrative embodiments can be implemented on the World Wide Web.
- the devices and subsystems of the illustrative embodiments can be implemented by the preparation of application-specific integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be appreciated by those skilled in the electrical art(s).
- the illustrative embodiments are not limited to any specific combination of hardware circuitry and/or software.
- the illustrative embodiments of the present inventions can include software for controlling the devices and subsystems of the illustrative embodiments, for driving the devices and subsystems of the illustrative embodiments, for enabling the devices and subsystems of the illustrative embodiments to interact with a human user, and the like.
- software can include, but is not limited to, device drivers, firmware, operating systems, development tools, applications software, and the like.
- Such computer readable media further can include the computer program product of an embodiment of the present inventions for performing all or a portion (if processing is distributed) of the processing performed in implementing the inventions.
- Computer code devices of the illustrative embodiments of the present inventions can include any suitable interpretable or executable code mechanism, including but not limited to scripts, interpretable programs, dynamic link libraries (DLLs), Java classes and applets, complete executable programs, Common Object Request Broker Architecture (CORBA) objects, and the like. Moreover, parts of the processing of the illustrative embodiments of the present inventions can be distributed for better performance, reliability, cost, and the like.
- the devices and subsystems of the illustrative embodiments can include computer readable medium or memories for holding instructions programmed according to the teachings of the present inventions and for holding data structures, tables, records, and/or other data described herein.
- Computer readable medium can include any suitable medium that participates in providing instructions to a processor for execution. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, transmission media, and the like.
- Non-volatile media can include, for example, optical or magnetic disks, magneto-optical disks, and the like.
- Volatile media can include dynamic memories, and the like.
- Transmission media can include coaxial cables, copper wire, fiber optics, and the like.
- Transmission media also can take the form of acoustic, optical, electromagnetic waves, and the like, such as those generated during radio frequency (RF) communications, infrared (IR) data communications, and the like.
- Common forms of computer-readable media can include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other suitable magnetic medium, a CD-ROM, CDRW, DVD, any other suitable optical medium, punch cards, paper tape, optical mark sheets, any other suitable physical medium with patterns of holes or other optically recognizable indicia, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other suitable memory chip or cartridge, a carrier wave or any other suitable medium from which a computer can read.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
Description
- 1. Field of the Invention
- The present invention generally relates to systems and methods for document classification, and more particularly to systems and methods for automatic document classification for electronic discovery (eDiscovery), compliance, clean-up of legacy information, and the like.
- 2. Discussion of the Background
- In recent years, various types of document classification systems and methods have been developed. However, with such document classification systems and methods, there is still a need to provide improved systems and methods that addresses limitations of what is referred to as a bag-of-word (BOW) approach.
- Therefore, there is a need for a method and system that addresses the above and other problems with document classification systems and methods. The above and other problems are addressed by the illustrative embodiments of the present invention, which provide improved systems and methods that addresses limitations of what is referred to as a bag-of-word (BOW) approach. Advantageously, the illustrative systems and methods can provide automatic document classification for eDiscovery, compliance, legacy-information clean-up, and the like, while allowing for usage of various machine-learning approaches, and the like, in multi-lingual environments, and the like.
- Accordingly, in illustrative aspects of the present invention there is provided a system, method, and computer program product for automatic document classification, including an extraction module configured to extract structural, syntactical and/or semantic information from a document and normalize the extracted information; a machine learning module configured to generate a model representation for automatic document classification based on feature vectors built from the normalized and extracted semantic information for supervised and/or unsupervised clustering or machine learning; and a classification module configured to select a non-classified document from a document collection, and via the extraction module extract normalized structural, syntactical and/or semantic information from the selected document, and generate via the machine learning module a model representation of the selected document based on feature vectors, and match the model representation of the selected document against the machine learning model representation to generate a document category, and/or classification for display to a user.
- The extracted information includes named entities, properties of entities, noun-phrases, facts, events, and/or concepts.
- The extraction module employs text-mining, language identification, gazetteers, regular expressions, noun-phrase identification with part-of-speech taggers, and/or statistical models and rules, and is configured to identify patterns, and the patterns include libraries, and/or algorithms shared among cases, and which can be tuned for a specific case, to generate case-specific semantic information.
- The extracted information is normalized by using normalization rules, groupers, thesauri, taxonomies, and/or string-matching algorithms.
- The model representation of the document is a TF-IDF document representation of the extracted information, and the clustering or machine learning includes a classifier based on decision trees, support vector machines (SVM), naïve-bayes classifiers, k-nearest neighbors, rules-based classification, Linear discriminant analysis (LDA), Maximum Entropy Markov Model (MEMM), scatter-gather clustering, and/or hierarchical agglomerate clustering (HAC).
- Still other aspects, features, and advantages of the present invention are readily apparent from the following detailed description, by illustrating a number of illustrative embodiments and implementations, including the best mode contemplated for carrying out the present invention. The present invention is also capable of other and different embodiments, and its several details can be modified in various respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and descriptions are to be regarded as illustrative in nature, and not as restrictive.
- The embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
- FIG. 1 illustrates a system for automatic document classification;
- FIG. 2 illustrates a process of assignment of a unique identifier per document and extraction and storage of various structural, syntactical and semantic information from each individual document;
- FIG. 3 illustrates a machine learning process with training and testing of a machine learning model;
- FIG. 4 illustrates an automatic classification process of new documents with a machine learning model;
- FIG. 5 illustrates a process to create a meta data record for each document;
- FIG. 6 illustrates a process to create a unique document identifier (ID) for each document and store the ID in meta data information storage;
- FIG. 7 illustrates a process to extract and store various structural, syntactical and semantic information from a document;
- FIG. 8 illustrates a data structure to extract and store various structural, syntactical and semantic information from a document;
- FIG. 9 illustrates a data structure to manually or otherwise label training and test documents for machine learning;
- FIG. 10 illustrates a process to train a machine learning model with a supervised or unsupervised machine learning algorithm;
- FIG. 11 illustrates a process to test a machine learning model for a supervised or unsupervised machine learning algorithm;
- FIG. 12 illustrates a process to classify new documents with a machine learning model;
- FIG. 13 illustrates a process to extract textual content from a document;
- FIG. 14 illustrates an overview of structural, syntactical and semantic information that can be extracted from documents to represent feature vectors for machine learning;
- FIG. 15 illustrates an overview of creation of a feature vector from extracted information; and
- FIG. 16 illustrates an overview of a bag-of-words (BOW) approach and creation of feature vectors for machine learning. - The present invention includes recognition that the ongoing information explosion is reaching epic proportions and has earned its own name: Big Data. Big Data encompasses both challenges and opportunities. The opportunity, as focused on by many parties, is to use the collective Big Data to predict and recognize patterns and behavior, and to increase revenue and optimize business processes. But there is also a dark side to Big Data: requirements for eDiscovery, compliance, legacy-information clean-up, governance, privacy and storage can lead to enormous costs and unexpected or unknown risks. New data formats (e.g., multimedia in particular), different languages, cloud and other off-site locations, and the continual increase in regulations and legislation (which may contradict previous protocols) add even more complexity to this puzzle.
- Applying content analytics helps to assuage the dark side of Big Data. Content analytics such as text-mining and machine-learning technology from the field of artificial intelligence can be used very effectively to manage Big Data. Consider tasks, for example, such as identifying exact and near-duplicates, structuring and enriching the content of text and multimedia data, identifying relevant (e.g., semantic) information, facts and events, and ultimately, automatically clustering and classifying information, and the like.
- Content analytics can be used for any suitable type of application where unstructured data needs to be classified, categorized or sorted. Examples include early case assessment and legal review in eDiscovery (e.g., also known as machine-assisted review, technology-assisted review or predictive coding) and enforcement of existing rules, policies and regulations in compliance. Identifying privacy-sensitive information, legacy-information clean-up and information valuation in enterprise information management are further good examples. As a result of these content-analytics efforts, users can better explore and understand repositories of Big Data and more easily apply combinations of advanced search and data-visualization techniques.
- Both supervised and unsupervised machine learning techniques can be used to classify documents automatically and reveal more complex insights into Big Data. A machine learning model can be trained with a seed set of documents (e.g., samples), which are often annotated documents for particular information categories or known information patterns. Based on these training documents, a machine learning algorithm can derive a model that classifies other documents into the taught classes, or temporal, geographical, correlational or hierarchical patterns can be identified from these training examples.
- Machine learning is not perfect: the more document categories there are, the lower the quality of the document classification can be. This is logical, as it is easier to differentiate only black from white than it is to differentiate 1,000 shades of gray. The absence of sufficient relevant training documents will also lower the quality of classification. The number of required training documents grows faster than the number of categorization classes: for 2 times more classes, one may need 4 times more training documents.
- Machine learning and other artificial intelligence techniques used to predict patterns and behavior are not based on “hocus pocus”: they are based on solid mathematical and statistical frameworks in combination with common-sense or biology-inspired heuristics. In the case of text-mining, there is an extra complication: the content of textual documents has to be translated, so to speak, into numbers (e.g., probabilities, mathematical notions such as vectors, etc.) that machine learning algorithms can interpret. The choices that are made during this translation can highly influence the results of the machine learning algorithms.
- During a pre-processing step, the documents can be converted into a manageable representation. Typically, they are represented by so-called feature vectors. A feature vector can include a number of dimensions of features and corresponding weights. The process of choosing features for the feature vector is called feature selection. In text representation, the commonly used representation is referred to as a bag-of-words (BOW), where each word is a feature in the vector and the weights are either 1 if the word is present in the document or 0 if not. More complex weighting schemes, for example, Term Frequency-Inverse Document Frequency (TF-IDF), and the like, give words different weights based on their frequency in a document and in the overall collection. The TF-IDF approach provides a numerical measure of the importance of a particular word to a document in a corpus of documents. The advantage of this technique is that the value increases proportionally to the number of times the given word occurs in the document, but decreases if the word occurs more often in the whole corpus of documents. This relates to the fact that the distributions of words in different documents and languages vary extremely.
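By way of illustration only, the TF-IDF weighting described above might be sketched as follows (a minimal sketch; the toy corpus and token lists are assumptions for illustration, not data from the invention):

```python
import math

# toy corpus: three tokenized documents (illustrative only)
docs = [["the", "train", "leaves", "the", "station"],
        ["the", "sailor", "dogs", "the", "barmaid"],
        ["train", "tickets", "for", "the", "same", "train"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)              # term frequency within this document
    df = sum(1 for d in corpus if term in d)     # number of documents containing the term
    return tf * math.log(len(corpus) / df)       # weight rises with tf, falls as df grows

print(tf_idf("train", docs[2], docs))  # frequent here, rarer elsewhere: a high weight
print(tf_idf("the", docs[0], docs))    # occurs in every document: idf = log(1) = 0
```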
- The bag-of-words model has several practical limitations, for example: (1) typically, it is not possible to use the approach on documents that use different languages within and in-between documents; (2) machine-learning models typically cannot be re-used between different cases, so one has to start all over again for each case; (3) the model typically cannot handle dynamic collections: when new documents are added to a case, one has to start all over again; (4) when the model does not perform well enough, one has to start training all over again with a better training set; (5) typically there is no possibility to patch the model, nor is there a guarantee of success; and (6) in applications where defensibility in court and clarity are important, such as eDiscovery, compliance, legacy-information clean-up, and the like, an additional complication of the bag-of-words approach is that it is hard to understand and explain to an audience in layman's terms.
- The bag-of-word model also has several technical limitations that may result in having completely different documents ending up in the exact same vector for machine learning and having documents with the same meaning ending up as completely different vectors. Also, the high-dimensional feature vectors are very sparse, that is, most of the dimensions can be (e.g., close to) zero. This opens up the opportunity for data compression, but also causes machine learning problems, such as a very high computational complexity, resulting in relatively huge memory and processing requirements, over-fitting (e.g., random error and noise in the training set is used instead of the actual underlying relationships to derive the machine learning model from the training examples), rounding errors (e.g., multiplying very small probabilities over and over again may result in a floating-point underflow), and the like.
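The failure modes noted above (completely different texts mapping to the exact same vector) can be made concrete with a small sketch; the sentences are invented for illustration:

```python
from collections import Counter

# same words, different order: identical bag-of-words vectors
print(Counter("a good book".split()) == Counter("book a good".split()))  # True

# opposite meanings, same bag of words: also identical vectors
s1 = "the movie was not good it was bad".split()
s2 = "the movie was good it was not bad".split()
print(Counter(s1) == Counter(s2))  # True, although the meanings are opposite
```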
- Moreover, the most serious structural limitation of the bag-of-words approach is that all words (e.g., maybe with the exception of a list of high-frequency noise words) are more or less dumped into a mathematical model, without additional knowledge or interpretation of linguistic patterns and properties, such as word order (e.g., "a good book" versus "book a good"), synonyms, spelling and syntactical variations, co-reference and pronoun resolution, or negations, and the like. The bag-of-words approach therefore takes simplicity one step too far. Just a few examples of the problems and limitations include: (1) Variant Identification and Grouping: it is often necessary to recognize variant names as different forms of the same entity, giving accurate entity counts as well as the locations of all appearances of a given entity. For example, one may need to recognize that the word "Smith" refers to the "Joe Smith" identified earlier, and therefore group them together as aliases of the same entity. (2) Normalization: normalizing entities such as dates, currencies, and measurements into standard formats takes the guesswork out of the metadata creation, search, data mining, and link analysis processes. (3) Entity Boundary Detection: will the technology consider "Mr. and Ms. John Jones" as one or two entities? And what will the processor consider to be the start and end of an excerpt, such as "VP John M. P. Kaplan-Jones, Ph.D. M.D."?
- Such basic operations will not only dramatically reduce the size of the data set, they will also result in better data analysis and visualization: entities that would not be related without normalization can be the missing link between two data sets, especially if they are written differently in different parts of the data set or are not properly recognized as singular or plural entities. In addition, another limitation of the bag-of-words approach is the absence of anaphora and co-reference resolution: the linguistic problem of associating pairs of linguistic expressions that refer to the same entities in the real world.
- For example, consider the following text:
- “A man walks to the station and tries to catch the train. His name is John Doe. Later he meets his colleague, who has just bought a card for the same train. They work together at the Rail Company as technical employees and they are going to a meeting with colleagues in New York.”
- The text includes various references and co-references, which must be disambiguated before it is possible to fully understand and extract the more complex patterns of events. The following list shows examples of these (e.g., mutual) references; a small resolution sketch follows the list:
- Pronominal Anaphora: he, she, we, oneself, etc.
- Proper Name Co-reference: For example, multiple references to the same name.
- Apposition: the additional information given to an entity, such as “John Doe, the father of Peter Doe”.
- Predicate Nominative: the additional description given to an entity, for example “John Doe, who is the chairman of the soccer club”.
- Identical Sets: A number of reference sets referring to equivalent entities, such as “Giants”, “the best team”, and the “group of players” which all refer to the same group of people.
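To make the resolution task concrete, here is a deliberately naive pronoun-resolution sketch; the name list stands in for the output of a named-entity step, real resolvers also use syntax, gender and number, and all names are invented:

```python
NAMES = {"John", "Mary"}                    # assumed output of an earlier entity step
PRONOUNS = {"he", "she", "him", "her", "his"}

def resolve(tokens):
    last, out = None, []
    for tok in tokens:
        if tok in NAMES:
            last = tok                      # remember the most recent named entity
            out.append(tok)
        elif tok.lower() in PRONOUNS and last:
            out.append(f"<{last}>")         # substitute the presumed antecedent
        else:
            out.append(tok)
    return " ".join(out)

print(resolve("John walks to the station . Later he meets a colleague .".split()))
# John walks to the station . Later <John> meets a colleague .
```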
- Natural language is not a jumbled bag of words, and ignoring simple linguistic structures, such as synonyms, spelling and syntax variations, co-reference and pronoun resolution, or negations, and the like, means that machine learning based on the so-called bag-of-words feature extraction is limited from the start. To many end users in eDiscovery, Governance, Enterprise Information Archiving and other Information Management initiatives, such a built-in limitation is unacceptable.
- Even with limited natural language processing (NLP) techniques, one can do a better job than the bag-of-words approach, recognize and disambiguate much more relevant linguistic information, and build better feature vectors for the machine-learning process. As a result, the overall performance of the machine-learning system can easily be increased.
- Documents are represented by extracted semantic information, such as (e.g., named) entities, properties of entities, noun-phrases, facts, events and other high-level concepts, and the like. Extraction is done by using any known techniques from text-mining, for example: language identification, gazetteers, regular expressions, noun-phrase identification with part-of-speech taggers, statistical models and rules, and the like, to identify more complex patterns. These patterns, libraries, algorithms, and the like, can be shared among cases, but can also be fine-tuned for a specific case, so only case-specific relevant semantic information is extracted. Extracted information can be normalized by using any suitable type of known technique, for example, including normalization rules, groupers, thesauri, taxonomies, string-matching algorithms, and the like.
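A minimal sketch of such an extraction step, combining a toy gazetteer with two regular-expression patterns (the gazetteer entries and patterns are assumptions, not the invention's actual shared libraries):

```python
import re

GAZETTEER = {"new york": "CITY", "utrecht": "CITY", "rail company": "COMPANY"}
PATTERNS = {"SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
            "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")}

def extract(text):
    found, low = [], text.lower()
    for phrase, label in GAZETTEER.items():        # gazetteer lookup
        if phrase in low:
            found.append((label, phrase))
    for label, pattern in PATTERNS.items():        # regular-expression patterns
        found.extend((label, hit) for hit in pattern.findall(text))
    return found

print(extract("John Doe (SSN 123-45-6789) met colleagues in New York on 3/12/2012."))
# [('CITY', 'new york'), ('SSN', '123-45-6789'), ('DATE', '3/12/2012')]
```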
- Next, instead of using the bag-of-words TF-IDF document representation, vectors built from the normalized and extracted semantic information are used as feature vectors for any suitable known supervised or unsupervised clustering and machine learning technique. Examples of machine-learning algorithms that can be used include Decision Trees, Support Vector Machines (SVM), Naïve-Bayes Classifiers, k-Nearest Neighbors, rules-based classification, Scatter-Gather Clustering, or Hierarchical Agglomerate Clustering (HAC), and the like.
- This approach has several benefits over the bag-of-words approach: (1) the dimensionality of the derived feature vectors is orders of magnitude smaller than that of bag-of-words feature vectors. As a result, machine-learning training can be much faster (e.g., a huge benefit for dynamic collections), compression need not be employed (e.g., with its risks of information loss), and the risks of over-fitting and rounding errors are much smaller, or even absent. The system can also handle document collections with very different document types better than a bag-of-words approach (e.g., very different length, structure, writing style, vocabulary, sentence length, etc.). (2) It is possible to use this approach on documents that use different languages within and in-between documents: extracted semantic information can be translated by using machine translation and multi-lingual glossaries. (3) Different documents need not end up as similar feature vectors for machine learning: machine-learning feature vectors can be much better because of the application, understanding and resolution of basic linguistic operations such as normalization, negation, and co-reference/anaphora resolution. As a result, the performance of the machine learning can easily increase by double-digit percentages. (4) There is a significant chance that the derived machine-learning models can be re-used between different cases, as they are based on high-level semantic information that need not rely on the actual words used in the original documents; as a result, one need not start all over again for each case. (5) Feature vectors can be built for specific types of cases by extracting only information that is relevant for the case. This can make machine learning more defensible in court and create more clarity in applications, for example, such as eDiscovery, compliance, legacy-information clean-up, and the like.
- Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views, and more particularly to
FIG. 1 thereof, there is illustrated an automatic document classification system 100, according to an illustrative embodiment. In FIG. 1, generally, the automatic document classification system 100 provides for automatically extracting structural, semantic, syntactic, and the like, information from relevant training models, based on, for example, entities, facts, events, concepts, and the like, to train a machine learning model, and the like, and use the derived machine learning model for automatic document classification, and the like. - The system includes, for example, a document storage 113 (e.g., a computer storage device, etc.) including one or
more document collections 111, one or more document meta data information storages 109, and one or more machine learning models 304, accessed through one or more servers. The system 100 can be used for (1) automatic extraction of structural, semantic, and syntactic information from relevant training models, for example, based on entities, facts, events, concepts, and the like, (2) training of a machine learning model, and the like, and (3) using the derived machine learning model for automatic document classification, and the like, into various trained categories, and the like. - One or more
local computers 121 can provide connectivity to one or more users, and one or more remote computers 127 can provide connectivity to one or more remote users. The computers connect to the document storage 113 and allow the one or more users to search the document collection 111, view documents, document groups, document meta information, training documents, training results, the machine learning model, document classifications, and the like. - The
servers access the computer storage 113 to extract meta data information 109 for each document in the document collection 111, to create unique document identifiers for each document, to label the document meta data 109 with the document identifiers of the document groups, to create a machine learning model 304, to automatically train the machine learning model 304 and use this machine learning model 304 for automatic document classification of other documents (e.g., documents not used for training the machine learning model), to test the quality of the machine learning model 304 with pre-labeled test documents from the document collection 111, and the like. - As described above, the
users can search the document collection 111 by using the computers over the Intranet 115. When a document is found, the system can show the content of the documents 111, the meta information of the documents in the meta information storage 109, the training documents (e.g., a selection from 111), the machine learning model 304, and the labels of the automatically categorized documents from 111 in the meta data storage 109.
- FIG. 2 illustrates a process 200 of the assignment of a unique identifier per document and the extraction and storage of various types of structural, syntactical and semantic information from each individual document. In FIG. 2, at step 211, for each document from the document collection 111, a record in the meta information storage 109 is created and stored. At step 213, for each document in the document collection 111, a unique document identifier (ID), for example, a unique serial number or an MD-5, SHA-1, SHA-2, SHA-128 hash value, and the like, is created. The unique identifier is stored in a record in the meta data information storage 109 that belongs to the corresponding document in the document collection database 111. At step 217, for each document in the document collection 111, various types of structural, syntactic, and semantic information are extracted by using user settings from a database 201, as set by using various information extraction techniques by, for example, a user or a group of users 203. In step 221, the extracted information is stored in a record in the meta data information storage 109 that belongs to the corresponding document in the document collection database 111.
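Steps 211 and 213 might be sketched as follows, using a hash digest as the unique identifier (the file name is hypothetical and the record layout is an assumption):

```python
import hashlib

def make_metadata_record(path):
    # unique document ID from a SHA-1 digest of the raw bytes;
    # MD-5 or SHA-2 digests work the same way
    with open(path, "rb") as f:
        doc_id = hashlib.sha1(f.read()).hexdigest()
    return {"id": doc_id, "path": path, "extracted": {}}  # extraction results added later

record = make_metadata_record("contract_0001.pdf")        # hypothetical document
```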
- FIG. 3 illustrates a machine learning process 300 with training and testing of the machine learning model. At step 301, a user or a group of users 310 manually or otherwise identifies a set of relevant training and testing documents from the document collection 111. The set of training documents need not include documents in common with the set of testing documents; these sets can be mutually exclusive. Selection of relevant training material can also be done by using clustering or concept-search techniques that cluster similar documents for certain document categories, for example, by self-organization or vector-decomposition techniques (e.g., Hierarchical Clustering algorithms, Kohonen Self-Organizing Maps, linear discriminant analysis (LDA), etc.), and the like. - At
step 302, a user or a group of users 311 manually or otherwise tags the selected training and testing documents from document collection 111 with the corresponding document categories. At step 303, the machine learning model 304 is trained by using a vector representation created from the records with the extracted information for each document in the meta information storage 109, together with the document categorization label, which exists for each document from the training set in the document collection database 111. Both supervised and unsupervised machine learning algorithms can be used, for example, Support Vector Machines (SVM), k-Nearest Neighbor (kNN), naïve Bayes, Decision Rules, k-means, Hierarchical Clustering, Kohonen self-organizing feature maps, linear discriminant analysis (LDA), and the like. - At
step 305, the machine learning model 304 is tested by comparing recognized categories with pre-labeled categories from documents in the test set in document database 111. This testing can be done by a user or a user group 313. Results of step 305 can be reported, for example, in terms of precision, recall, f-values, and the like, and other best-practice measurements from the field of information retrieval.
- FIG. 4 illustrates an automatic classification process 400 of new documents with the machine learning model 304. At step 401, a non-classified document is selected from document collection 111. This can be a document that is also not part of the training or test set of documents used in process 300. Accordingly, in process 404, documents are classified. Process 404 includes a number of steps. For example, at step 217, the various structural, syntactical and semantic information for the selected document is obtained from the meta data information store. This information is converted into a vector representation in step 402 and then matched against the machine learning model 304. From the machine learning model 304, a document category or classification is obtained in step 403.
- FIG. 5 illustrates a process 500, explaining in more detail step 211 from process 200, to create a meta data record for each document. In FIG. 5, for each document or set of documents 501, which originate from the document collection database 111, a meta data record is created in step 503. Each document 501 can hold a unique corresponding meta data record 507, for example, illustrated as documents with linked meta data records in 505.
- FIG. 6 illustrates a process 600, explaining in more detail step 213 from process 200, to create a unique document identifier (ID) for each document and store the ID in the meta data information storage. In FIG. 6, each document 501 holds a unique corresponding meta data record 507, illustrated as documents with meta data records in 505, and has associated therewith a unique identifier, for example, a unique serial number or an MD-5, SHA-1, SHA-2, SHA-128 hash value, and the like, representing the document. Each document 601 can include a unique corresponding meta data record with a unique identifier 607, for example, illustrated as documents with linked meta data records in 603.
- FIG. 7 illustrates a process 700 to extract and store various types of structural, syntactical and semantic information from a document, explaining in more detail step 217 from process 200. In FIG. 7, after starting the process, step 701 reads user preferences related to the information extraction from the user settings and preferences in database 201. At step 217, for each document, the document textual content is extracted at step 703; optionally, non-relevant information (e.g., numbers, reading signs, and other data that is not relevant to distinguish the document category) is filtered out at step 704; and language-dependent part-of-speech tagging is implemented at step 705 to assign a linguistic category to each word, for example, NOUN, VERB, DETERMINER, PRONOUN, ADVERB, PROPER NOUN, CONJUNCTION, and the like, and to find linguistic structures, for example, VERB PHRASES, NOUN PHRASES, and the like. - Step 705 also can include automatic language recognition. Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare: in natural languages (e.g., as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:
- The sailor dogs the barmaid.
- Accordingly, performing grammatical tagging can indicate that "dogs" is a verb, and not the more common plural noun, since one of the words must be the main verb, and the noun reading is less likely following "sailor" (e.g., sailor !→ dogs). Semantic analysis can then extrapolate that "sailor" and "barmaid" implicate "dogs" as (1) in the nautical context (e.g., sailor→<verb>→barmaid), and (2) an action applied to the object "barmaid" (e.g., [subject] dogs→barmaid). In this context, "dogs" is a nautical term meaning "fastens (e.g., a watertight door) securely; applies a dog to".
- “Dogged”, on the other hand, can be either an adjective or a past-tense verb. Just which parts of speech a word can represent varies greatly. Trained linguists can identify the grammatical parts of speech to various fine degrees depending on the tagging system. Schools commonly teach that there are nine parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, plural, possessive, and singular forms can be distinguished. In many languages, words are also marked for their “case” (e,g, role as subject, object, etc.), grammatical gender, and so on; while verbs are marked for tense, aspect, and other things. Automatic part-of-speech tagging can be done by various techniques, for example, such as Hidden Markov Models (HMM), Finite State Grammars (FSG), Dependency Grammars and various other suitable types of linguistic parsers, and the like.
- A proper linguistic grammatical analysis is very relevant to solve linguistic boundary problems (e.g., where does a named entity start and end, and which parts of a sentence belong to the named entity; e.g., all words in "Mr. John D. Jones Ph.D. Jr." are part of one entity) and to find whether a named entity is one entity or a conjunction of entities (e.g., "Mr. and Mrs. Jones" are two entities, and "State University of New York" is one entity).
- Grammatical analysis also helps resolve co-reference and pronoun ambiguity, which is advantageous for the machine learning later on. After the speech tagging in step 705, the named entities in a document can be obtained reliably in step 706. For each named entity, techniques, for example, gazetteers, dictionaries, regular expressions, rules, patterns, Hidden Markov Models, Support Vector Machines, Maximal Entropy Models and other suitable statistics, and the like, can be used in step 709 to classify the named entities into structural, syntactic, semantic and pragmatic classes, for example, person, city, country, address, job title, credit card number, social security number, and the like, but also more complex relations, for example, sentiments, locations, problems, routes, concepts, facts, events, and thousands more such roles and meanings. In step 709, it is also advantageous to resolve the found co-references and pronouns and replace them with the value of the named entity that they refer to. For example, consider the following text: - "A man walks to the station and tries to catch a train. His name is Jan Jansen. Later he meets his colleague, who has just bought a ticket for the same train. They work together at the Rail Company as technical employees. They are going to a meeting with colleagues in Utrecht."
- After co-reference and pronoun resolution, this text looks as follows:
- A man <Jan Jansen> walks to the station and <Jan Jansen> tries to catch a train. His name is Jan Jansen. Later he <Jan Jansen> meets his colleague <John Johnson>, who <John Johnson> has just bought a ticket for the same train <as Jan Jansen>. His name is John Johnson. They <Jan Jansen and John Johnson> work together at the Rail Company as technical employees. They <Jan Jansen and John Johnson> are going to a meeting with colleagues in Utrecht.
- Without co-reference and pronoun resolution, the following patterns would not have been detected and could not have been taught to the machine learning process:
- <Jan Jansen> walks to the station, <Jan Jansen> tries to catch a train, <Jan Jansen> meets his colleague <John Johnson>, <John Johnson> has just bought a ticket for the same train <as Jan Jansen>, <Jan Jansen and John Johnson> work together at the Rail Company as technical employees, and <Jan Jansen and John Johnson> are going to a meeting with colleagues in Utrecht.
- Based on the final purpose of the classification and machine learning problem, users can select what extracted information is most relevant for the application in
step 707 and use this in the steps that follow. - In
step 710, all suitable entities are normalized, for example, by the following functions: normalization of entities such as names, dates, numbers and (e.g., email) addresses; making textual entities refer to the same real-world object in a database; semantic normalization (e.g., of meaning); resolving synonyms and homonyms; and stemming of verbs and nouns. Normalization can reduce the number of unique entities by up to 80% and greatly improves the quality of the machine learning.
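A small sketch of step 710-style normalization, with an illustrative date rule and synonym table (both are assumptions; a production system would use the thesauri and groupers described earlier):

```python
import re

DATE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")
SYNONYMS = {"nyc": "new york", "big apple": "new york"}  # toy thesaurus

def normalize(entity):
    e = entity.strip().lower()
    m = DATE.fullmatch(e)
    if m:                                    # dates to one standard (ISO-style) format
        month, day, year = m.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return SYNONYMS.get(e, e)                # collapse known aliases

for raw in ["NYC", "Big Apple ", "3/12/2012"]:
    print(raw, "->", normalize(raw))
```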
step 217, for each document is stored in the metadata information storage 109. -
- FIG. 8 illustrates a data structure 800 to extract and store various types of structural, syntactical and semantic information from a document. In FIG. 8, step 217 is specified in more detail, in particular as to how the extracted various types of structural, syntactical and semantic information are stored in the meta information data store. For each document 501, the additional structural, syntactical and semantic information 801 is stored in each unique record 607 linked to a document in sub-process 805.
- FIG. 9 illustrates a data structure 900 to manually or otherwise label training and test documents for machine learning. In FIG. 9, step 302 is specified in more detail. For each class, or for a set of classes, a relevant set of training and testing documents is selected by a user or a group of users using manual and/or automatic techniques, for example, intuition, statistics, cross-validation, maximum-likelihood estimation, clustering, self-organization, feature extraction and feature selection methods, and the like. Each document 501 is then labeled, manually or otherwise, with the class or classes 900 it belongs to. This additional information is stored in the unique record 607 that exists for each document 501 in sub-process 905.
- FIG. 10 illustrates a process 1000 to train the machine learning model with a supervised or unsupervised machine learning algorithm. In FIG. 10, step 303 is explained in more detail. In order to train the machine learning model, either supervised or unsupervised, to predict the category of a document from the extracted meta data information, mathematical vectors are first created from the categorical data in step 1001, as explained in more detail in FIG. 16. Next, these vectors are used as feature vectors for any known supervised or unsupervised clustering and machine learning technique at step 1002. Machine-learning algorithms that can be used include, for example, Decision Trees, Support Vector Machines (SVM), Naïve-Bayes Classifiers, k-Nearest Neighbors, rules-based classification, Scatter-Gather Clustering, linear discriminant analysis (LDA), or Hierarchical Agglomerate Clustering (HAC), and the like. At the end of such a process, a machine learning model 304 is obtained that can be used for automatic document classification. The machine learning model 304 can be a binary classifier, with one trained classifier per category, or a multi-class classifier trained to recognize multiple classes with one classifier, and the like.
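Steps 1001 and 1002 might be sketched with scikit-learn as follows; the semantic feature dictionaries, category labels and predicted class are invented for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# extracted, normalized semantic information per training document (illustrative)
train_features = [{"PERSON:jan_jansen": 1, "COMPANY:rail_company": 1, "CITY:utrecht": 1},
                  {"CONCEPT:invoice": 1, "CURRENCY:usd": 2}]
train_labels = ["hr", "finance"]

vectorizer = DictVectorizer()              # step 1001: categorical data to vectors
X = vectorizer.fit_transform(train_features)
model = LinearSVC().fit(X, train_labels)   # step 1002: train the classifier

new_doc = {"CONCEPT:invoice": 1, "CURRENCY:eur": 1}
print(model.predict(vectorizer.transform(new_doc)))  # plausibly ['finance'] here
```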
- FIG. 11 illustrates a process 1100 to test the machine learning model for a supervised or unsupervised machine learning algorithm. In FIG. 11, step 305 for testing the machine learning model is explained in more detail. For each document in the test set, a feature vector is created in step 1101 from the extracted structural, syntactical and semantic information that is stored in the meta data records 109 for that test document. In step 1102, this vector is then mapped against the machine learning model 304, and the machine learning model 304 returns a recognized document class. - When binary classifiers are used, the vector of the test document is compared to each classifier and a value representing the measure of recognition is returned. By using a (e.g., user-definable) threshold, the test document can be included or excluded for one or more classes. The recognized classes are returned as one or more return values in a predefined range, for example, where higher values represent a match and lower values represent a mis-match with each category. In the case of a multi-class classifier, the values of the classes that are the best match for the vector of the test document are returned. In both cases, the name of the class(es) with the highest returned values can be resolved, by using information in 109, to a categorical value in
step 1103. - Next, the recognized document class is compared to the pre-labeled document category in
step 1104. A user or a group ofusers 1105 can then compare the results and obtain an overall set of test results representing the quality of the machine learning model in 1106. Test results, for example, can be expressed in terms of precision and recall, in a combination of precision and recall, and the like, for example, the so-called f-values of eleven points of precision based on an arithmetic average. -
- FIG. 12 illustrates a process 1200 to classify new documents with the machine learning model. In FIG. 12, step 404 is explained in more detail. For each new document, a feature vector is created in step 1201 from the extracted structural, syntactical and semantic information that is stored in the meta data records 109 for that document. In step 1202, this vector is then mapped against the machine learning model 304, and the machine learning model 304 returns a recognized document class. - When binary classifiers are used, the vector of the document is compared to each classifier and a value representing the measure of recognition is returned. By using a (e.g., user-definable) threshold, the document can be included or excluded for one or more classes. The recognized classes are returned as one or more return values in a predefined range, where higher values represent a match and lower values represent a mis-match with each category. In the case of a multi-class classifier, the values of the classes that are the best match for the vector of the document are returned. In both cases, the name of the class(es) with the highest returned values is resolved, by using information in 109, to a categorical value in
step 1203. The system then returns the recognized document class(es) in step 1204.
- FIG. 13 illustrates a process 1300 to extract textual content from a document. In FIG. 13, step 702 is explained in more detail. When the content of a document is extracted, the document is first opened in step 1305, and all suitable textual content is extracted from the document in step 1307. Next to all visible text and document layout, this also includes any type of non-visible textual information, for example, file security, document properties, project properties, user information, file system and storage properties, and any other type of hidden or meta data information, and the like. In the process 1300, low-level document encoding (e.g., UNICODE, Windows code pages, ASCII, ANSI, etc.) is resolved and normalized to one common text representation (e.g., often 16-bit UNICODE), and the sequence of the words (e.g., left-to-right for Roman, right-to-left for Arabic, and top-down for Asian languages) and the layout sequence are normalized. The result, in 1311, is a uniform textual representation of all textual content of a document.
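The encoding-normalization part of process 1300 might be sketched as follows (the encoding list is an assumption; a real extractor also handles format-specific parsing and hidden meta data):

```python
import unicodedata

def to_uniform_text(raw_bytes):
    # try common encodings in order; latin-1 always succeeds as a last resort
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            text = raw_bytes.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    return unicodedata.normalize("NFC", text)  # one common Unicode representation

print(to_uniform_text("Señor Müller".encode("cp1252")))
```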
- FIG. 14 is an illustrative overview 1400 of structural, syntactical and semantic information that can be extracted from documents to represent the feature vectors for machine learning. In FIG. 14, examples of named entities, such as CITY, COMPANY, COUNTRY and CURRENCY, and the like, but also relatively more complex patterns, such as sentiments, problems, and the like, can be derived. In principle, the extracted information can be anything that is relevant, for example, structural, syntactical or semantic information, and the like, that is unique for a particular document class. Extracted information can also include inference by rules: if information of type A and type B occurs in a document, then the system can infer that the document is about information C and also tag the document with that value.
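The rule-based inference just described ("if type A and type B occur, tag the document with C") might be sketched as follows, with invented entity types and tags:

```python
RULES = [({"CURRENCY", "ACCOUNT_NUMBER"}, "FINANCIAL_RECORD"),
         ({"PERSON", "SSN"}, "PRIVACY_SENSITIVE")]

def infer_tags(entity_types):
    # fire every rule whose required types are all present in the document
    return [tag for required, tag in RULES if required <= entity_types]

print(infer_tags({"PERSON", "SSN", "CITY"}))  # ['PRIVACY_SENSITIVE']
```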
- FIG. 15 is an illustrative overview of the creation of a feature vector 1500 from the extracted information. When categorical data is represented in a mathematical model, each categorical value can be represented by a numerical value, which can be unique for each value a category can hold in the entire data set. When this representation is created, the quality of the machine learning and clustering is best when all of the available addressing space is used. This process can be implemented automatically, by taking into account all values for all different categories of the extracted structural, syntactic and semantic information. - In this process, users can also select the most relevant categories of extracted information, and the most relevant values per category, as input for the feature vectors, thereby greatly reducing the number of dimensions of the feature vectors. This focuses the model on the most relevant (e.g., most distinguishing) features per document class, and thus reduces the complexity of the machine learning model and the training, testing and classification time.
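The automatic numbering of categorical values might look like this sketch, where each (category, value) pair receives one unique dimension so the addressing space is used without gaps (the sample documents are invented):

```python
def build_index(extracted_docs):
    index = {}
    for doc in extracted_docs:
        for category, values in doc.items():
            for value in values:
                index.setdefault((category, value), len(index))  # next free dimension
    return index

docs = [{"CITY": ["utrecht"], "PERSON": ["jan jansen"]},
        {"CITY": ["new york"], "COMPANY": ["rail company"]}]
print(build_index(docs))
# {('CITY', 'utrecht'): 0, ('PERSON', 'jan jansen'): 1, ('CITY', 'new york'): 2, ...}
```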
- FIG. 16 is an illustrative overview 1600 of the bag-of-words approach and the creation of feature vectors for machine learning. From this example, it can be seen that very different sentences obtain similar vector representations and, vice versa, sentences with the same meaning obtain different vectors. This disadvantageous effect highly disturbs and confuses the machine learning and clustering algorithms, as compared to the systems and methods of the present invention. - The above-described devices and subsystems of the illustrative embodiments can include, for example, any suitable servers, workstations, PCs, laptop computers, PDAs, Internet appliances, handheld devices, cellular telephones, smart phones, wireless devices, other devices, and the like, capable of performing the processes of the illustrative embodiments. The devices and subsystems of the illustrative embodiments can communicate with each other using any suitable protocol and can be implemented using one or more programmed computer systems or devices.
- One or more interface mechanisms can be used with the illustrative embodiments, including, for example, Internet access, telecommunications in any suitable form (e.g., voice, modem, and the like), wireless communications media, and the like. For example, employed communications networks or links can include one or more wireless communications networks, cellular communications networks, G3 communications networks, Public Switched Telephone Networks (PSTNs), Packet Data Networks (PDNs), the Internet, intranets, a combination thereof, and the like.
- It is to be understood that the devices and subsystems of the illustrative embodiments are for illustrative purposes, as many variations of the specific hardware used to implement the illustrative embodiments are possible, as will be appreciated by those skilled in the relevant art(s). For example, the functionality of one or more of the devices and subsystems of the illustrative embodiments can be implemented via one or more programmed computer systems or devices.
- To implement such variations as well as other variations, a single computer system can be programmed to perform the special purpose functions of one or more of the devices and subsystems of the illustrative embodiments. On the other hand, two or more programmed computer systems or devices can be substituted for any one of the devices and subsystems of the illustrative embodiments. Accordingly, principles and advantages of distributed processing, such as redundancy, replication, and the like, also can be implemented, as desired, to increase the robustness and performance of the devices and subsystems of the illustrative embodiments.
- The devices and subsystems of the illustrative embodiments can store information relating to various processes described herein. This information can be stored in one or more memories, such as a hard disk, optical disk, magneto-optical disk, RAM, and the like, of the devices and subsystems of the illustrative embodiments. One or more databases of the devices and subsystems of the illustrative embodiments can store the information used to implement the illustrative embodiments of the present inventions. The databases can be organized using data structures (e.g., records, tables, arrays, fields, graphs, trees, lists, and the like) included in one or more memories or storage devices listed herein. The processes described with respect to the illustrative embodiments can include appropriate data structures for storing data collected and/or generated by the processes of the devices and subsystems of the illustrative embodiments in one or more databases thereof.
- All or a portion of the devices and subsystems of the illustrative embodiments can be conveniently implemented using one or more general purpose computer systems, microprocessors, digital signal processors, micro-controllers, and the like, programmed according to the teachings of the illustrative embodiments of the present inventions, as will be appreciated by those skilled in the computer and software arts. Appropriate software can be readily prepared by programmers of ordinary skill based on the teachings of the illustrative embodiments, as will be appreciated by those skilled in the software art. Further, the devices and subsystems of the illustrative embodiments can be implemented on the World Wide Web. In addition, the devices and subsystems of the illustrative embodiments can be implemented by the preparation of application-specific integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be appreciated by those skilled in the electrical art(s). Thus, the illustrative embodiments are not limited to any specific combination of hardware circuitry and/or software.
- Stored on any one or on a combination of computer readable media, the illustrative embodiments of the present inventions can include software for controlling the devices and subsystems of the illustrative embodiments, for driving the devices and subsystems of the illustrative embodiments, for enabling the devices and subsystems of the illustrative embodiments to interact with a human user, and the like. Such software can include, but is not limited to, device drivers, firmware, operating systems, development tools, applications software, and the like. Such computer readable media further can include the computer program product of an embodiment of the present inventions for performing all or a portion (if processing is distributed) of the processing performed in implementing the inventions. Computer code devices of the illustrative embodiments of the present inventions can include any suitable interpretable or executable code mechanism, including but not limited to scripts, interpretable programs, dynamic link libraries (DLLs), Java classes and applets, complete executable programs, Common Object Request Broker Architecture (CORBA) objects, and the like. Moreover, parts of the processing of the illustrative embodiments of the present inventions can be distributed for better performance, reliability, cost, and the like.
- As stated above, the devices and subsystems of the illustrative embodiments can include computer readable medium or memories for holding instructions programmed according to the teachings of the present inventions and for holding data structures, tables, records, and/or other data described herein. Computer readable medium can include any suitable medium that participates in providing instructions to a processor for execution. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, transmission media, and the like. Non-volatile media can include, for example, optical or magnetic disks, magneto-optical disks, and the like. Volatile media can include dynamic memories, and the like. Transmission media can include coaxial cables, copper wire, fiber optics, and the like. Transmission media also can take the form of acoustic, optical, electromagnetic waves, and the like, such as those generated during radio frequency (RF) communications, infrared (IR) data communications, and the like. Common forms of computer-readable media can include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other suitable magnetic medium, a CD-ROM, CDRW, DVD, any other suitable optical medium, punch cards, paper tape, optical mark sheets, any other suitable physical medium with patterns of holes or other optically recognizable indicia, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other suitable memory chip or cartridge, a carrier wave or any other suitable medium from which a computer can read.
- While the present inventions have been described in connection with a number of illustrative embodiments, and implementations, the present inventions are not so limited, but rather cover various modifications, and equivalent arrangements, which fall within the purview of the appended claims.
Claims (15)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/693,075 US9235812B2 (en) | 2012-12-04 | 2012-12-04 | System and method for automatic document classification in ediscovery, compliance and legacy information clean-up |
US14/989,969 US10565502B2 (en) | 2012-12-04 | 2016-01-07 | System and method for automatic document classification in eDiscovery, compliance and legacy information clean-up |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/693,075 US9235812B2 (en) | 2012-12-04 | 2012-12-04 | System and method for automatic document classification in ediscovery, compliance and legacy information clean-up |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/989,969 Continuation US10565502B2 (en) | 2012-12-04 | 2016-01-07 | System and method for automatic document classification in eDiscovery, compliance and legacy information clean-up |
Publications (2)
Publication Number | Publication Date |
---|---|
US20140156567A1 true US20140156567A1 (en) | 2014-06-05 |
US9235812B2 US9235812B2 (en) | 2016-01-12 |
Family
ID=50826469
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/693,075 Active 2033-11-11 US9235812B2 (en) | 2012-12-04 | 2012-12-04 | System and method for automatic document classification in ediscovery, compliance and legacy information clean-up |
US14/989,969 Active 2035-09-25 US10565502B2 (en) | 2012-12-04 | 2016-01-07 | System and method for automatic document classification in eDiscovery, compliance and legacy information clean-up |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/989,969 Active 2035-09-25 US10565502B2 (en) | 2012-12-04 | 2016-01-07 | System and method for automatic document classification in eDiscovery, compliance and legacy information clean-up |
Country Status (1)
Country | Link |
---|---|
US (2) | US9235812B2 (en) |
Families Citing this family (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9122681B2 (en) | 2013-03-15 | 2015-09-01 | Gordon Villy Cormack | Systems and methods for classifying electronic information using advanced active learning techniques |
RU2583716C2 (en) * | 2013-12-18 | 2016-05-10 | Общество с ограниченной ответственностью "Аби ИнфоПоиск" | Method of constructing and detecting theme hull structure
US10445374B2 (en) * | 2015-06-19 | 2019-10-15 | Gordon V. Cormack | Systems and methods for conducting and terminating a technology-assisted review |
FI20165240A (en) * | 2016-03-22 | 2017-09-23 | Utopia Analytics Oy | PROCEDURES, SYSTEMS AND RESOURCES FOR MODERATING CONTENTS |
US10740563B2 (en) * | 2017-02-03 | 2020-08-11 | Benedict R. Dugan | System and methods for text classification |
US10657368B1 (en) | 2017-02-03 | 2020-05-19 | Aon Risk Services, Inc. Of Maryland | Automatic human-emulative document analysis |
US10755045B2 (en) | 2017-03-03 | 2020-08-25 | Aon Risk Services, Inc. Of Maryland | Automatic human-emulative document analysis enhancements |
US10534825B2 (en) | 2017-05-22 | 2020-01-14 | Microsoft Technology Licensing, Llc | Named entity-based document recommendations |
US10848494B2 (en) | 2017-08-14 | 2020-11-24 | Microsoft Technology Licensing, Llc | Compliance boundaries for multi-tenant cloud environment |
US10984180B2 (en) | 2017-11-06 | 2021-04-20 | Microsoft Technology Licensing, Llc | Electronic document supplementation with online social networking information |
CN107832305A (en) * | 2017-11-28 | 2018-03-23 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating information |
CN110045960B (en) * | 2018-01-16 | 2022-02-18 | 腾讯科技(深圳)有限公司 | Chip-based instruction set processing method and device and storage medium |
US12020160B2 (en) | 2018-01-19 | 2024-06-25 | International Business Machines Corporation | Generation of neural network containing middle layer background |
CN110390094B (en) * | 2018-04-20 | 2023-05-23 | 伊姆西Ip控股有限责任公司 | Method, electronic device and computer program product for classifying documents |
CN109086375B (en) * | 2018-07-24 | 2021-10-22 | 武汉大学 | A short text topic extraction method based on word vector enhancement |
US11763321B2 (en) | 2018-09-07 | 2023-09-19 | Moore And Gasperecz Global, Inc. | Systems and methods for extracting requirements from regulatory content |
US11270226B2 (en) * | 2018-10-01 | 2022-03-08 | International Business Machines Corporation | Hybrid learning-based ticket classification and response |
US11580301B2 (en) * | 2019-01-08 | 2023-02-14 | Genpact Luxembourg S.à r.l. II | Method and system for hybrid entity recognition |
US10614345B1 (en) | 2019-04-12 | 2020-04-07 | Ernst & Young U.S. Llp | Machine learning based extraction of partition objects from electronic documents |
US11720809B2 (en) | 2019-06-05 | 2023-08-08 | The Ronin Project, Inc. | Modeling for complex outcomes using clustering and machine learning algorithms |
US11113518B2 (en) | 2019-06-28 | 2021-09-07 | Eygs Llp | Apparatus and methods for extracting data from lineless tables using Delaunay triangulation and excess edge removal |
US11915465B2 (en) | 2019-08-21 | 2024-02-27 | Eygs Llp | Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks |
US11507828B2 (en) * | 2019-10-29 | 2022-11-22 | International Business Machines Corporation | Unsupervised hypernym induction machine learning |
US10810709B1 (en) | 2019-11-21 | 2020-10-20 | Eygs Llp | Systems and methods for improving the quality of text documents using artificial intelligence |
US11625934B2 (en) | 2020-02-04 | 2023-04-11 | Eygs Llp | Machine learning based end-to-end extraction of tables from electronic documents |
US11410447B2 (en) | 2020-06-19 | 2022-08-09 | Bank Of America Corporation | Information security assessment translation engine |
US12106051B2 (en) * | 2020-07-16 | 2024-10-01 | Optum Technology, Inc. | Unsupervised approach to assignment of pre-defined labels to text documents |
US10956673B1 (en) | 2020-09-10 | 2021-03-23 | Moore & Gasperecz Global Inc. | Method and system for identifying citations within regulatory content |
US12105728B2 (en) | 2020-09-14 | 2024-10-01 | DeepSee.ai Inc. | Extensible data objects for use in machine learning models |
US11797770B2 (en) | 2020-09-24 | 2023-10-24 | UiPath, Inc. | Self-improving document classification and splitting for document processing in robotic process automation |
US11314922B1 (en) | 2020-11-27 | 2022-04-26 | Moore & Gasperecz Global Inc. | System and method for generating regulatory content requirement descriptions |
US20220147814A1 (en) | 2020-11-09 | 2022-05-12 | Moore & Gasperecz Global Inc. | Task specific processing of regulatory content |
US20220309276A1 (en) * | 2021-03-29 | 2022-09-29 | International Business Machines Corporation | Automatically classifying heterogenous documents using machine learning techniques |
US11798258B2 (en) | 2021-05-03 | 2023-10-24 | Bank Of America Corporation | Automated categorization and assembly of low-quality images into electronic documents |
US11704352B2 (en) | 2021-05-03 | 2023-07-18 | Bank Of America Corporation | Automated categorization and assembly of low-quality images into electronic documents |
US11941357B2 (en) | 2021-06-23 | 2024-03-26 | Optum Technology, Inc. | Machine learning techniques for word-based text similarity determinations |
US12204862B2 (en) * | 2021-07-16 | 2025-01-21 | Microsoft Technology Licensing, Llc | Modular self-supervision for document-level relation extraction |
US20230102198A1 (en) * | 2021-09-30 | 2023-03-30 | Intuit Inc. | Artificial intelligence based compliance document processing |
US11977841B2 (en) | 2021-12-22 | 2024-05-07 | Bank Of America Corporation | Classification of documents |
CN114266255B (en) * | 2022-03-01 | 2022-05-17 | 深圳壹账通科技服务有限公司 | Corpus classification method, apparatus, device and storage medium based on clustering model |
US12231511B2 (en) | 2022-04-01 | 2025-02-18 | Blackberry Limited | Event data processing |
US12229294B2 (en) * | 2022-04-01 | 2025-02-18 | Blackberry Limited | Event data processing |
US11989240B2 (en) | 2022-06-22 | 2024-05-21 | Optum Services (Ireland) Limited | Natural language processing machine learning frameworks trained using multi-task training routines |
US12112132B2 (en) | 2022-06-22 | 2024-10-08 | Optum Services (Ireland) Limited | Natural language processing machine learning frameworks trained using multi-task training routines |
US12111869B2 (en) | 2022-08-08 | 2024-10-08 | Bank Of America Corporation | Identifying an implementation of a user-desired interaction using machine learning |
US11823477B1 (en) | 2022-08-30 | 2023-11-21 | Moore And Gasperecz Global, Inc. | Method and system for extracting data from tables within regulatory content |
US20240370464A1 (en) * | 2023-05-03 | 2024-11-07 | Portal Innovations, LLC | Cognition management system and methods for managing research and development activity |
US20250068592A1 (en) * | 2023-08-22 | 2025-02-27 | Wells Fargo Bank, N.A. | Automatic file creation and location selection standardization |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090144277A1 (en) * | 2007-12-03 | 2009-06-04 | Microsoft Corporation | Electronic table of contents entry classification and labeling scheme |
- 2012-12-04: US application US13/693,075 filed, granted as US9235812B2 (en), status: Active
- 2016-01-07: US application US14/989,969 filed, granted as US10565502B2 (en), status: Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8078573B2 (en) * | 2005-05-31 | 2011-12-13 | Google Inc. | Identifying the unifying subject of a set of facts |
US8082248B2 (en) * | 2008-05-29 | 2011-12-20 | Rania Abouyounes | Method and system for document classification based on document structure and written style |
US8713007B1 (en) * | 2009-03-13 | 2014-04-29 | Google Inc. | Classifying documents using multiple classifiers |
US20110066585A1 (en) * | 2009-09-11 | 2011-03-17 | Arcsight, Inc. | Extracting information from unstructured data and mapping the information to a structured schema using the naïve Bayesian probability model
US20120278336A1 (en) * | 2011-04-29 | 2012-11-01 | Malik Hassan H | Representing information from documents |
Non-Patent Citations (4)
Title |
---|
Akers, Steve, Jennifer Keadle Mason, and Peter L. Mansmann. "An Intelligent Approach to E-Discovery." DESI IV: The ICAIL 2011 Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings, 2011. *
Barnett, Thomas, et al. "Machine learning classification for document review." ICAIL 2009 Workshop on Supporting Search and Sensemaking for Electronically Stored Information in Discovery, 2009. Retrieved Jul. 24, 2009. *
Kershaw, Anne. "Automated document review proves its reliability." Digital Discovery & e-Evidence 5.11 (2005). *
Scholtes, Johannes Cornelis. "Unsupervised learning and the information retrieval problem." Proceedings of the 1991 IEEE International Joint Conference on Neural Networks, IEEE, 1991. *
Cited By (139)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11282000B2 (en) | 2010-05-25 | 2022-03-22 | Open Text Holdings, Inc. | Systems and methods for predictive coding |
US11023828B2 (en) | 2010-05-25 | 2021-06-01 | Open Text Holdings, Inc. | Systems and methods for predictive coding |
US9916392B2 (en) * | 2012-12-07 | 2018-03-13 | Tencent Technology (Shenzhen) Company Limited | Method, system, and storage medium for displaying media content applicable to social platform |
US20150269275A1 (en) * | 2012-12-07 | 2015-09-24 | Tencent Technology (Shenzhen) Company Limited | Method, system, and storage medium for displaying media content applicable to social platform |
US20160071119A1 (en) * | 2013-04-11 | 2016-03-10 | Longsand Limited | Sentiment feedback |
US20160170983A1 (en) * | 2013-07-30 | 2016-06-16 | Nippon Telegraph And Telephone Corporation | Information management apparatus and information management method |
US12200002B2 (en) | 2013-08-09 | 2025-01-14 | Intellective Ai, Inc. | Cognitive information security using a behavior recognition system |
US10735446B2 (en) * | 2013-08-09 | 2020-08-04 | Intellective Ai, Inc. | Cognitive information security using a behavioral recognition system |
US9973523B2 (en) * | 2013-08-09 | 2018-05-15 | Omni Ai, Inc. | Cognitive information security using a behavioral recognition system |
US20190124101A1 (en) * | 2013-08-09 | 2019-04-25 | Omni Ai, Inc. | Cognitive information security using a behavioral recognition system |
US10187415B2 (en) | 2013-08-09 | 2019-01-22 | Omni Ai, Inc. | Cognitive information security using a behavioral recognition system |
US11818155B2 (en) | 2013-08-09 | 2023-11-14 | Intellective Ai, Inc. | Cognitive information security using a behavior recognition system |
US11991194B2 (en) | 2013-08-09 | 2024-05-21 | Intellective Ai, Inc. | Cognitive neuro-linguistic behavior recognition system for multi-sensor data fusion |
US20170163672A1 (en) * | 2013-08-09 | 2017-06-08 | Omni Ai, Inc. | Cognitive information security using a behavioral recognition system
US20160246870A1 (en) * | 2013-10-31 | 2016-08-25 | Raghu Anantharangachar | Classifying a document using patterns |
US10552459B2 (en) * | 2013-10-31 | 2020-02-04 | Micro Focus Llc | Classifying a document using patterns |
US20150254791A1 (en) * | 2014-03-10 | 2015-09-10 | Fmr Llc | Quality control calculator for document review |
US20150278195A1 (en) * | 2014-03-31 | 2015-10-01 | Abbyy Infopoisk Llc | Text data sentiment analysis method |
CN104199959A (en) * | 2014-09-18 | 2014-12-10 | 浪潮软件集团有限公司 | A text classification method for Internet tax-related data |
US20170162189A1 (en) * | 2015-05-08 | 2017-06-08 | International Business Machines Corporation | Semi-supervised learning of word embeddings |
US9672814B2 (en) * | 2015-05-08 | 2017-06-06 | International Business Machines Corporation | Semi-supervised learning of word embeddings |
US9659560B2 (en) * | 2015-05-08 | 2017-05-23 | International Business Machines Corporation | Semi-supervised learning of word embeddings |
US9947314B2 (en) * | 2015-05-08 | 2018-04-17 | International Business Machines Corporation | Semi-supervised learning of word embeddings |
US20160350672A1 (en) * | 2015-05-26 | 2016-12-01 | Textio, Inc. | Using Machine Learning to Predict Outcomes for Documents |
US10607152B2 (en) * | 2015-05-26 | 2020-03-31 | Textio, Inc. | Using machine learning to predict outcomes for documents |
US11270229B2 (en) * | 2015-05-26 | 2022-03-08 | Textio, Inc. | Using machine learning to predict outcomes for documents |
US11288295B2 (en) | 2015-06-02 | 2022-03-29 | Green Market Square Limited | Utilizing word embeddings for term matching in question answering systems |
US20160358094A1 (en) * | 2015-06-02 | 2016-12-08 | International Business Machines Corporation | Utilizing Word Embeddings for Term Matching in Question Answering Systems |
US20160357855A1 (en) * | 2015-06-02 | 2016-12-08 | International Business Machines Corporation | Utilizing Word Embeddings for Term Matching in Question Answering Systems |
US10467268B2 (en) * | 2015-06-02 | 2019-11-05 | International Business Machines Corporation | Utilizing word embeddings for term matching in question answering systems |
US10467270B2 (en) * | 2015-06-02 | 2019-11-05 | International Business Machines Corporation | Utilizing word embeddings for term matching in question answering systems |
US11048934B2 (en) | 2015-08-27 | 2021-06-29 | Longsand Limited | Identifying augmented features based on a Bayesian analysis of a text document
WO2017039684A1 (en) * | 2015-09-04 | 2017-03-09 | Hewlett Packard Enterprise Development Lp | Classifier |
US11403550B2 (en) | 2015-09-04 | 2022-08-02 | Micro Focus Llc | Classifier |
US9817812B2 (en) * | 2015-11-05 | 2017-11-14 | Abbyy Production Llc | Identifying word collocations in natural language texts |
US20170132205A1 (en) * | 2015-11-05 | 2017-05-11 | Abbyy Infopoisk Llc | Identifying word collocations in natural language texts |
US20170132539A1 (en) * | 2015-11-11 | 2017-05-11 | Tata Consultancy Services Limited | Systems and methods for governance, risk, and compliance analytics for competitive edge |
US11561869B2 (en) * | 2015-11-16 | 2023-01-24 | Kyndryl, Inc. | Optimized disaster-recovery-as-a-service system |
US10599701B2 (en) | 2016-02-11 | 2020-03-24 | Ebay Inc. | Semantic category classification |
US11227004B2 (en) | 2016-02-11 | 2022-01-18 | Ebay Inc. | Semantic category classification |
WO2017139575A1 (en) * | 2016-02-11 | 2017-08-17 | Ebay Inc. | Semantic category classification |
US10078688B2 (en) | 2016-04-12 | 2018-09-18 | Abbyy Production Llc | Evaluating text classifier parameters based on semantic features |
US10599731B2 (en) * | 2016-04-26 | 2020-03-24 | Baidu Usa Llc | Method and system of determining categories associated with keywords using a trained model |
US11599709B2 (en) * | 2016-05-19 | 2023-03-07 | Palo Alto Research Center Incorporated | Natural language web browser |
US20170337177A1 (en) * | 2016-05-19 | 2017-11-23 | Palo Alto Research Center Incorporated | Natural language web browser |
US10776399B1 (en) * | 2016-06-06 | 2020-09-15 | Casepoint LLC | Document classification prediction and content analytics using artificial intelligence |
US10878335B1 (en) * | 2016-06-14 | 2020-12-29 | Amazon Technologies, Inc. | Scalable text analysis using probabilistic data structures |
US9792282B1 (en) | 2016-07-11 | 2017-10-17 | International Business Machines Corporation | Automatic identification of machine translation review candidates |
US10839315B2 (en) * | 2016-08-05 | 2020-11-17 | Yandex Europe Ag | Method and system of selecting training features for a machine learning algorithm |
US10635727B2 (en) | 2016-08-16 | 2020-04-28 | Ebay Inc. | Semantic forward search indexing of publication corpus |
JP2018034347A (en) * | 2016-08-30 | 2018-03-08 | 京セラドキュメントソリューションズ株式会社 | Image formation apparatus and character drawing program |
JP2018034348A (en) * | 2016-08-30 | 2018-03-08 | 京セラドキュメントソリューションズ株式会社 | Image formation apparatus and character drawing program |
US11205103B2 (en) | 2016-12-09 | 2021-12-21 | The Research Foundation for the State University | Semisupervised autoencoder for sentiment analysis |
WO2018142266A1 (en) * | 2017-01-31 | 2018-08-09 | Mocsy Inc. | Information extraction from documents |
EP3577570A4 (en) * | 2017-01-31 | 2020-12-02 | Mocsy Inc. | Information extraction from documents |
US11748501B2 (en) * | 2017-02-03 | 2023-09-05 | Adobe Inc. | Tagging documents with security policies |
CN109145095A (en) * | 2017-06-16 | 2019-01-04 | 贵州小爱机器人科技有限公司 | Place-name information matching method, information matching method, device and computer equipment
WO2019018732A1 (en) * | 2017-07-21 | 2019-01-24 | Pearson Education, Inc. | Systems and methods for automated feature-based alert triggering |
US10938592B2 (en) | 2017-07-21 | 2021-03-02 | Pearson Education, Inc. | Systems and methods for automated platform-based algorithm monitoring |
US10621975B2 (en) * | 2017-09-05 | 2020-04-14 | International Business Machines Corporation | Machine training for native language and fluency identification |
US10431203B2 (en) * | 2017-09-05 | 2019-10-01 | International Business Machines Corporation | Machine training for native language and fluency identification |
US20190073997A1 (en) * | 2017-09-05 | 2019-03-07 | International Business Machines Corporation | Machine training for native language and fluency identification |
US10002301B1 (en) * | 2017-09-19 | 2018-06-19 | King Fahd University Of Petroleum And Minerals | System, apparatus, and method for Arabic handwriting recognition
US10055660B1 (en) | 2017-09-19 | 2018-08-21 | King Fahd University Of Petroleum And Minerals | Arabic handwriting recognition utilizing bag of features representation |
US10163019B1 (en) | 2017-09-19 | 2018-12-25 | King Fahd University Of Petroleum And Minerals | Arabic handwriting recognition system and method |
US10176391B1 (en) | 2017-09-19 | 2019-01-08 | King Fahd University Of Petroleum And Minerals | Discrete hidden Markov model basis for Arabic handwriting recognition
US11531927B2 (en) * | 2017-11-28 | 2022-12-20 | Adobe Inc. | Categorical data transformation and clustering for machine learning using natural language processing |
US20190164083A1 (en) * | 2017-11-28 | 2019-05-30 | Adobe Inc. | Categorical Data Transformation and Clustering for Machine Learning using Natural Language Processing |
US20190180327A1 (en) * | 2017-12-08 | 2019-06-13 | Arun BALAGOPALAN | Systems and methods of topic modeling for large scale web page classification |
US20240281458A1 (en) * | 2018-01-08 | 2024-08-22 | Magic Number, Inc. | Interactive patent visualization systems and methods |
US11977722B2 (en) | 2018-01-08 | 2024-05-07 | Magic Number, Inc. | Interactive patent visualization systems and methods |
US20210216578A1 (en) * | 2018-01-08 | 2021-07-15 | Magic Number, Inc. | Interactive patent visualization systems and methods |
US11977571B2 (en) * | 2018-01-08 | 2024-05-07 | Magic Number, Inc. | Interactive patent visualization systems and methods |
CN108415928A (en) * | 2018-01-18 | 2018-08-17 | 郝宁宁 | A book recommendation method and system based on weighted hybrid k-nearest neighbor algorithms
CN108710620A (en) * | 2018-01-18 | 2018-10-26 | 郝宁宁 | A user-based k-nearest neighbor book recommendation method and system
US10884769B2 (en) | 2018-02-17 | 2021-01-05 | Adobe Inc. | Photo-editing application recommendations |
US11036811B2 (en) * | 2018-03-16 | 2021-06-15 | Adobe Inc. | Categorical data transformation and clustering for machine learning using data repository systems |
US20190286747A1 (en) * | 2018-03-16 | 2019-09-19 | Adobe Inc. | Categorical Data Transformation and Clustering for Machine Learning using Data Repository Systems |
US10409805B1 (en) | 2018-04-10 | 2019-09-10 | Icertis, Inc. | Clause discovery for validation of documents |
US10162850B1 (en) * | 2018-04-10 | 2018-12-25 | Icertis, Inc. | Clause discovery for validation of documents |
WO2019200014A1 (en) * | 2018-04-10 | 2019-10-17 | Icertis, Inc. | Clause discovery for validation of documents |
US10984122B2 (en) * | 2018-04-13 | 2021-04-20 | Sophos Limited | Enterprise document classification |
US11783069B2 (en) | 2018-04-13 | 2023-10-10 | Sophos Limited | Enterprise document classification |
US11288385B2 (en) | 2018-04-13 | 2022-03-29 | Sophos Limited | Chain of custody for enterprise documents |
CN108664633A (en) * | 2018-05-15 | 2018-10-16 | 南京大学 | A method for text classification using diversified text features
US10902066B2 (en) | 2018-07-23 | 2021-01-26 | Open Text Holdings, Inc. | Electronic discovery using predictive filtering |
US12299051B2 (en) | 2018-07-23 | 2025-05-13 | Open Text Holdings, Inc. | Systems and methods of predictive filtering using document field values |
CN109144999A (en) * | 2018-08-02 | 2019-01-04 | 东软集团股份有限公司 | A data positioning method, device, storage medium, and program product
WO2020028109A1 (en) * | 2018-08-03 | 2020-02-06 | Intuit Inc. | Automated document extraction and classification |
US11698921B2 (en) | 2018-09-17 | 2023-07-11 | Ebay Inc. | Search system for providing search results using query understanding and semantic binary signatures |
US20200097883A1 (en) * | 2018-09-26 | 2020-03-26 | International Business Machines Corporation | Dynamically evolving textual taxonomies |
CN113168586A (en) * | 2018-11-02 | 2021-07-23 | 威尔乌集团 | Text classification and management |
US11269496B2 (en) * | 2018-12-06 | 2022-03-08 | Canon Kabushiki Kaisha | Information processing apparatus, control method, and storage medium |
US12020130B2 (en) | 2018-12-24 | 2024-06-25 | Icertis, Inc. | Automated training and selection of models for document analysis |
US10936974B2 (en) | 2018-12-24 | 2021-03-02 | Icertis, Inc. | Automated training and selection of models for document analysis |
US11151501B2 (en) | 2019-02-19 | 2021-10-19 | Icertis, Inc. | Risk prediction based on automated analysis of documents |
US10726374B1 (en) | 2019-02-19 | 2020-07-28 | Icertis, Inc. | Risk prediction based on automated analysis of documents |
WO2020204535A1 (en) * | 2019-03-29 | 2020-10-08 | 주식회사 워트인텔리전스 | Machine learning-based user-customized automatic patent document classification method, device, and system |
US11880400B2 (en) | 2019-03-29 | 2024-01-23 | Wert Intelligence Co., Ltd. | Machine learning-based user-customized automatic patent document classification method, device, and system |
CN113711206A (en) * | 2019-03-29 | 2021-11-26 | 韦尔特智力株式会社 | Method, device and system for automatically classifying user-customized patent documents based on machine learning |
US11182540B2 (en) | 2019-04-23 | 2021-11-23 | Textio, Inc. | Passively suggesting text in an electronic document |
US12248503B2 (en) | 2019-05-29 | 2025-03-11 | Iron Mountain Incorporated | Systems and methods for cloud content-based document clustering and classification integration |
WO2020243532A1 (en) * | 2019-05-29 | 2020-12-03 | Iron Mountain Incorporated | Systems and methods for cloud content-based document clustering and classification integration |
US11537668B2 (en) * | 2019-08-14 | 2022-12-27 | Proofpoint, Inc. | Using a machine learning system to process a corpus of documents associated with a user to determine a user-specific and/or process-specific consequence index |
US12038984B2 (en) | 2019-08-14 | 2024-07-16 | Proofpoint, Inc. | Using a machine learning system to process a corpus of documents associated with a user to determine a user-specific and/or process-specific consequence index |
US11709878B2 (en) * | 2019-10-14 | 2023-07-25 | Microsoft Technology Licensing, Llc | Enterprise knowledge graph |
US20210110278A1 (en) * | 2019-10-14 | 2021-04-15 | Microsoft Technology Licensing, Llc | Enterprise knowledge graph |
US11734582B2 (en) * | 2019-10-31 | 2023-08-22 | Sap Se | Automated rule generation framework using machine learning for classification problems |
US11397756B2 (en) * | 2019-11-04 | 2022-07-26 | Hon Hai Precision Industry Co., Ltd. | Data archiving method and computing device implementing same |
WO2021107447A1 (en) * | 2019-11-25 | 2021-06-03 | 주식회사 데이터마케팅코리아 | Document classification method for marketing knowledge graph, and apparatus therefor |
CN111177375A (en) * | 2019-12-16 | 2020-05-19 | 医渡云(北京)技术有限公司 | Electronic document classification method and device |
CN111401058A (en) * | 2020-03-12 | 2020-07-10 | 广州大学 | Attribute value extraction method and device based on named entity recognition tool |
US20210326523A1 (en) * | 2020-04-16 | 2021-10-21 | Capital One Services, Llc | Conciseness reconstruction of a content presentation via natural language processing |
US11651152B2 (en) * | 2020-04-16 | 2023-05-16 | Capital One Services, Llc | Conciseness reconstruction of a content presentation via natural language processing |
US20230237257A1 (en) * | 2020-04-16 | 2023-07-27 | Capital One Services, Llc | Conciseness reconstruction of a content presentation via natural language processing |
US11042700B1 (en) * | 2020-04-16 | 2021-06-22 | Capital One Services, Llc | Conciseness reconstruction of a content presentation via natural language processing |
US12026459B2 (en) * | 2020-04-16 | 2024-07-02 | Capital One Services, Llc | Conciseness reconstruction of a content presentation via natural language processing |
CN111813870A (en) * | 2020-06-01 | 2020-10-23 | 武汉大学 | Machine learning algorithm resource sharing method and system based on unified description expression |
US20220019579A1 (en) * | 2020-07-20 | 2022-01-20 | Microsoft Technology Licensing, Llc. | Enterprise knowledge graphs using multiple toolkits |
US11573967B2 (en) * | 2020-07-20 | 2023-02-07 | Microsoft Technology Licensing, Llc | Enterprise knowledge graphs using multiple toolkits |
US12086546B2 (en) * | 2020-07-20 | 2024-09-10 | Microsoft Technology Licensing, Llc | Enterprise knowledge graphs using enterprise named entity recognition |
US12182725B2 (en) | 2020-07-20 | 2024-12-31 | Microsoft Technology Licensing, Llc | Enterprise knowledge graphs using user-based mining |
US11544323B2 (en) | 2020-07-20 | 2023-01-03 | Microsoft Technology Licensing, Llc | Annotations for enterprise knowledge graphs using multiple toolkits |
US20220019740A1 (en) * | 2020-07-20 | 2022-01-20 | Microsoft Technology Licensing, Llc | Enterprise knowledge graphs using enterprise named entity recognition |
US20220059085A1 (en) * | 2020-08-18 | 2022-02-24 | Bank Of America Corporation | Multi-pipeline language processing platform |
US11551674B2 (en) * | 2020-08-18 | 2023-01-10 | Bank Of America Corporation | Multi-pipeline language processing platform |
US11741163B2 (en) * | 2020-09-14 | 2023-08-29 | Box, Inc. | Mapping of personally-identifiable information to a person based on natural language coreference resolution |
US12008045B2 (en) | 2020-09-14 | 2024-06-11 | Box, Inc. | Mapping of personally-identifiable information to a person-based on traversal of a graph |
US11055666B2 (en) * | 2020-11-09 | 2021-07-06 | The Abstract Operations Company | Systems and methods for automation of corporate workflow processes via machine learning techniques |
US11630853B2 (en) * | 2021-01-29 | 2023-04-18 | Snowflake Inc. | Metadata classification |
US11853329B2 (en) | 2021-01-29 | 2023-12-26 | Snowflake Inc. | Metadata classification |
CN112905790A (en) * | 2021-02-04 | 2021-06-04 | 中国建设银行股份有限公司 | Method, device and system for extracting qualitative indexes of supervision events |
US11593440B1 (en) | 2021-11-30 | 2023-02-28 | Icertis, Inc. | Representing documents using document keys |
US11361034B1 (en) | 2021-11-30 | 2022-06-14 | Icertis, Inc. | Representing documents using document keys |
CN114596621A (en) * | 2022-05-10 | 2022-06-07 | 慧医谷中医药科技(天津)股份有限公司 | Tongue picture data processing method and system based on machine vision |
KR102784878B1 (en) | 2022-09-27 | 2025-03-24 | 한국딥러닝 주식회사 | Deep learning natural language processing based unstructured document understanding system |
KR20240043843A (en) * | 2022-09-27 | 2024-04-04 | 한국딥러닝 주식회사 | Deep learning natural language processing based unstructured document understanding system |
US20240126872A1 (en) * | 2022-10-12 | 2024-04-18 | Institute For Information Industry | Labeling method for information security detection rules and tactic, technique and procedure labeling device for the same |
US20240202215A1 (en) * | 2022-12-18 | 2024-06-20 | Concentric Software, Inc. | Method and electronic device to assign appropriate semantic categories to documents with arbitrary granularity |
Also Published As
Publication number | Publication date |
---|---|
US10565502B2 (en) | 2020-02-18 |
US20160117589A1 (en) | 2016-04-28 |
US9235812B2 (en) | 2016-01-12 |
Similar Documents
Publication | Title
---|---
US10565502B2 (en) | System and method for automatic document classification in eDiscovery, compliance and legacy information clean-up
Chen et al. | A comparative study of automated legal text classification using random forests and deep learning
Eke et al. | Sarcasm identification in textual data: systematic review, research challenges and open directions
Bilal et al. | Sentiment classification of Roman-Urdu opinions using Naïve Bayesian, Decision Tree and KNN classification techniques
Zhang et al. | Authorship identification from unstructured texts
Shanmugavadivel et al. | An analysis of machine learning models for sentiment analysis of Tamil code-mixed data
US20150120738A1 (en) | System and method for document classification based on semantic analysis of the document
US20160188568A1 (en) | System and method for determining the meaning of a document with respect to a concept
Shahade et al. | Multi-lingual opinion mining for social media discourses: an approach using deep learning based hybrid fine-tuned Smith algorithm with Adam optimizer
Zheng et al. | A review on authorship attribution in text mining
Shekhar et al. | An effective cybernated word embedding system for analysis and language identification in code-mixed social media text
Low et al. | Decoding violence against women: analysing harassment in Middle Eastern literature with machine learning and sentiment analysis
Dhankhar et al. | A statistically based sentence scoring method using mathematical combination for extractive Hindi text summarization
Malik et al. | NLP techniques, tools, and algorithms for data science
Yusuf et al. | A technical review of the state-of-the-art methods in aspect-based sentiment analysis
Elloumi et al. | General learning approach for event extraction: Case of management change event
Haq et al. | A semi-supervised approach for aspect category detection and aspect term extraction from opinionated text
Bhattacharjee et al. | Named entity recognition: A survey for Indian languages
Padia et al. | UMBC at SemEval-2018 Task 8: Understanding text about malware
Yadlapalli et al. | Advanced Twitter sentiment analysis using supervised techniques and minimalistic features
Jahnavi et al. | A cogitate study on text mining
Contreras et al. | Ontology learning using hybrid machine learning algorithms for disaster risk management
Elarnaoty et al. | Machine learning implementations in Arabic text classification
Gurram et al. | String Kernel-based techniques for native language identification
Sharma et al. | An Optimized Approach for Sarcasm Detection Using Machine Learning Classifier
Legal Events
Code | Title | Description
---|---|---
AS | Assignment | Owner name: MSC INTELLECTUAL PROPERTIES B.V., NETHERLANDS. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: SCHOLTES, JOHANNES CORNELIS. Reel/Frame: 033314/0634. Effective date: 2014-07-15
STCF | Information on status: patent grant | Free format text: PATENTED CASE
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4
AS | Assignment | Owner name: WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT, CALIFORNIA. Free format text: SECURITY INTEREST; ASSIGNORS: MSC INFORMATION RETRIEVAL TECHNOLOGIES B.V.; MSC INTELLECTUAL PROPERTIES B.V.; ZYLAB TECHNOLOGIES B.V.; AND OTHERS. Reel/Frame: 057484/0493. Effective date: 2021-09-15
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8
AS | Assignment | Owner name: MSC INTELLECTUAL PROPERTIES B.V., NETHERLANDS. Free format text: RELEASE OF SECURITY INTERESTS IN PATENTS RECORDED AT R/F 057484/0493; ASSIGNOR: WELLS FARGO BANK, NATIONAL ASSOCIATION. Reel/Frame: 064853/0202. Effective date: 2023-08-29