
Artificial Intelligence: from labs to software products

CIOL Bureau

Laura Ramos



In the areas of text mining and automated classification, artificial intelligence (AI) has moved out of the lab and into software products that support portals, information retrieval, Web content management (WCM) systems and the like. While these techniques have a long way to go before they truly emulate human reasoning, AI methods such as Bayesian probability algorithms, Support Vector Machines (SVMs), neural networks and clustering can help knowledge workers organize, mine and discover important relationships in large bodies of documents. Groups required to organize large amounts of legacy content as part of a portal or WCM initiative should examine taxonomy building, concept extraction, automated classification and mapping tools with AI roots to speed the process of determining which content is relevant or a good candidate for semiformal or formal content control.



Technologies that strive to "understand the meaning in text" fall broadly into two schools: one based on understanding linguistic meaning, the other on mathematics, statistics and pattern recognition. AI tends to dominate the statistical approaches because, at some point, linguistic techniques require libraries of language-specific rules to decipher grammatical constructs and language morphology if highly relevant results are to be achieved. AI can also enhance statistical models by using feedback to strengthen trained models. System designers should understand the four most common approaches to using AI-based technology in text classification and concept extraction so they can more easily compare vendor claims and understand why demonstrations achieve the results they do.



Bayesian Probability



Originating from the work of an 18th-century minister named Thomas Bayes, Bayesian probability is one of the earliest methods used to classify documents and is the foundation of many algorithms, like those used in products from Autonomy, Stratify and Inktomi/Quiver. Typically, it builds statistical models of a topic from the words found within training sets of documents closely representing the topic. The software classifies new documents by comparing an individual document's model to the trained topic model and assigning a probability that estimates how closely it matches the topic. The software can assign documents to multiple topics because it can see beyond individual words to the underlying patterns within the documents and identify key themes. Bayesian methods can be efficient because they use only the models, not all documents in the training sets, when classifying content. They can take time to set up and tune, however, as they may require as many as 50 documents to train a topic effectively.
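To make the idea concrete, here is a minimal sketch of Bayesian-style topic classification using the open-source scikit-learn library. It is an illustration only, not the algorithm inside any vendor's product; the topics, training documents and probability threshold are assumptions invented for the example.

```python
# Minimal sketch: Bayesian (multinomial naive Bayes) topic classification.
# Training documents, topic labels and the threshold are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Small hand-made training sets, a few documents per topic.
train_docs = [
    "quarterly earnings revenue profit forecast",
    "balance sheet cash flow audit",
    "server outage network latency firewall",
    "database backup patch security vulnerability",
]
train_topics = ["finance", "finance", "it-operations", "it-operations"]

# Build word-count models of each topic from the training set.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
model = MultinomialNB()
model.fit(X_train, train_topics)

# Classify a new document: the model assigns a probability per topic,
# so a document can be routed to every topic above a chosen threshold.
new_doc = ["the audit flagged a security vulnerability in the payroll database"]
probs = model.predict_proba(vectorizer.transform(new_doc))[0]
for topic, p in zip(model.classes_, probs):
    if p >= 0.30:  # illustrative threshold for multi-topic assignment
        print(f"{topic}: {p:.2f}")
```

Note that classification uses only the trained model, not the original training documents, which is the efficiency point made above.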



Support Vector Machine



SVMs are a more recent approach, somewhat similar to Bayesian methods, and companies like Microsoft, Mohomine, Verity and YellowBrix use this technique as the basis for the algorithms in their classification offerings.



Similar to Bayesian methods, SVMs try to determine how closely a document matches a topic. They differ in that an SVM calculates whether a document falls within the margins that mathematically separate groups of documents along a number of dimensions, where these dimensions are typically defined by the occurrence of words or patterns but can be extended to other features. SVMs have produced more precise results in academic competitions because Bayesian systems use simplifications that SVMs do not. For example, Bayesian techniques tend to weigh all training set documents equally, while SVMs can use negative examples and give different weights to outlying training documents. SVMs can determine the closest matches to topics by focusing on certain dimensions, but can be slow when used to classify a document into a large number of topics, although vendors work on tuning performance regularly. The quality of training sets becomes more important since outliers can influence performance in accidental or negative ways.
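The sketch below shows the same kind of task handled with a linear SVM, again using scikit-learn as a generic stand-in for the vendor implementations named above. The choice of TF-IDF features, the sample documents and the negative examples are assumptions made for illustration.

```python
# Minimal sketch: classifying documents with a linear SVM.
# Feature choice (TF-IDF) and the training data are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "quarterly earnings revenue profit forecast",    # positive example
    "balance sheet cash flow audit",                 # positive example
    "server outage network latency firewall",        # negative example
    "holiday schedule cafeteria menu announcement",  # negative example
]
labels = ["finance", "finance", "not-finance", "not-finance"]

# Each document becomes a point in a high-dimensional space, one
# dimension per term; the SVM finds the margin that separates the groups.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
svm = LinearSVC()
svm.fit(X_train, labels)

new_doc = ["profit forecast revised after the audit"]
print(svm.predict(vectorizer.transform(new_doc)))            # predicted topic
print(svm.decision_function(vectorizer.transform(new_doc)))  # distance from the margin
```

The negative examples in the training set illustrate the point made above: unlike a simple Bayesian setup, the SVM explicitly uses documents that do not belong to the topic when placing its separating margin.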



Neural Networks



Neural networks differ from Bayesian and SVM techniques because they can use new documents to update and modify models in a more automated fashion. Training models are arranged in a computational network of nodes that receive inputs and compute an output or response. Nodes need not be arranged hierarchically, so more nuanced relationships can be captured. The nodes calculate key topics from the words and act more like concept extractors or tagging systems than classifiers. Because they try to show all the possibilities, not only the best, they can be used to determine the relationship of similar topics to each other (rather than documents to a topic). They can work with visualization techniques to allow users to navigate between nodes and classified documents. Because of their network focus, neural networks can behave like "black boxes": new training documents can affect results in unexpected or unintended ways, and the many connections between nodes make it difficult to understand why. This sometimes results in classifications or tags that cannot easily be explained, even in retrospect. While Convera is well known to have based its products in part on neural networks, other vendors like SER use this technique as well because it works with a variety of inputs, such as images or bitmaps, as well as documents.
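As a rough illustration of the network-of-nodes idea, the sketch below trains a very small feed-forward network on word features and prints the full set of topic scores rather than a single best match. It is a toy stand-in for the larger, typically proprietary networks used by the vendors mentioned above; the layer size, training data and labels are assumptions for the example.

```python
# Minimal sketch: a small feed-forward neural network over word features.
# Layer sizes, documents and labels are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

train_docs = [
    "merger acquisition shareholder vote",
    "stock buyback dividend increase",
    "patch release bug fix regression test",
    "kernel upgrade driver compatibility",
]
labels = ["corporate-finance", "corporate-finance", "software", "software"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs).toarray()

# One hidden layer of nodes; each node combines weighted word inputs
# and passes its activation forward to the output layer.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X_train, labels)

new_doc = ["board approves dividend increase after the merger"]
probs = net.predict_proba(vectorizer.transform(new_doc).toarray())[0]

# Rather than a single best-match answer, the full score vector can be
# surfaced so users see every candidate topic, not only the top one.
for topic, p in zip(net.classes_, probs):
    print(f"{topic}: {p:.2f}")
```

The learned weights inside the hidden layer are not directly interpretable, which is a small-scale version of the "black box" behavior described above.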



Clustering, K-Means or K-Nearest-Neighbors



This technique can be used to cluster a large number of documents into groups in an unsupervised, automatic manner. Clustering tries to find a specified number of "nearest" or most similar documents and arrange them into groups. Administrators typically control how many groups are formed and, in some systems, have ready access to information describing why documents were clustered together. Inxight has been an early pioneer of this technique, and newcomers like LingoMotors, Quiver, RecomMind and Stratify use it – sometimes in conjunction with natural language/semantic techniques – for training set development, results navigation or automated hierarchy building. Once the clusters are formed, the algorithm can be used to classify documents into one or more of the clusters, depending on how closely the documents match.
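A minimal sketch of this workflow with k-means clustering is shown below, again using scikit-learn purely for illustration; the corpus and the choice of three clusters are assumptions, and real systems would expose far richer controls and explanations to administrators.

```python
# Minimal sketch: unsupervised k-means clustering of documents.
# The sample corpus and the number of clusters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "quarterly earnings revenue forecast",
    "profit margin balance sheet audit",
    "server outage network latency",
    "firewall patch security vulnerability",
    "cafeteria menu holiday schedule",
    "office party announcement schedule",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# The administrator chooses how many groups to form (here, 3).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

for doc, cid in zip(docs, cluster_ids):
    print(cid, doc)

# Once clusters exist, a new document can be assigned to the nearest one.
new_doc = ["audit of the revenue forecast"]
print(kmeans.predict(vectorizer.transform(new_doc)))
```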



This method produces good results when there are enough documents to clearly delineate each topic. Depending on the implementation, it can have problems scaling as the number of documents and topics increases because the engine must use all documents during classification, which is memory intensive. Some systems summarize documents to reduce this load, but summarization can introduce errors of omission.



Increasingly, vendors are combining a number of these techniques in their offerings. Clustering helps assemble training sets for classification models or serves as input to automated taxonomy building processes. Neural networks can augment tagging, visualization or taxonomy building systems. SVMs and Bayesian techniques are often the foundation of the complex algorithms companies develop to automatically classify or find related documents, whether in conjunction with indices and statistical models or in real time without reliance on a taxonomy or controlled vocabulary. Any of these techniques will work well on a particular set of documents, but no single technique works well on all of them. Evaluators should be skeptical of vendor claims about setup costs and maintenance effort: these techniques can help automate classification processes, but human intervention and supervision are still required. Test the requirements for system initiation and manageability in pilots and through customer reference calls. Understanding the principles underlying these AI-based technologies, and how vendors assemble them, will help IT managers evaluate which will work best for their particular set of documents or content profiles.
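One common combination, clustering to bootstrap training sets that then seed a supervised classifier, is sketched below. As with the earlier examples, the corpus, cluster count, placeholder cluster names and choice of models are illustrative assumptions rather than any vendor's actual pipeline, and the human-review step is exactly the kind of supervision the paragraph above warns is still required.

```python
# Minimal sketch of combining techniques: cluster first to propose
# topic groups, then train a supervised classifier on those groups.
# Corpus, cluster count and model choices are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

legacy_docs = [
    "quarterly earnings revenue forecast",
    "profit margin balance sheet audit",
    "server outage network latency",
    "firewall patch security vulnerability",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(legacy_docs)

# Step 1: unsupervised clustering proposes candidate topic groups.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2 (human in the loop): an editor reviews and names the clusters.
names = {0: "cluster-a", 1: "cluster-b"}  # placeholder labels an editor would supply
labels = [names[c] for c in clusters]

# Step 3: the reviewed groups become training sets for a classifier
# that routes new content automatically.
classifier = LinearSVC().fit(X, labels)
print(classifier.predict(vectorizer.transform(["audit of the network outage"])))
```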
