BANGALORE, INDIA: We deal with gigabytes of data in our daily lives, and the volume only keeps increasing with each passing day. The gap between the data that is generated and the data that is actually analyzed is also growing, so you need techniques that make things easier. Machine learning is one such technique: it searches a very large space of possible hypotheses to determine the one that best fits the observed data and any prior knowledge held by the learning system. Data mining augments the search for, and understanding of, electronically stored data.
What is WEKA?
Waikato Environment for Knowledge Analysis (WEKA), developed at the University of Waikato, New Zealand, is a collection of machine learning algorithms with data preprocessing tools to provide input to these algorithms. The tool was developed in Java and runs on Linux as well as Windows. It can also be used to develop and analyze new machine learning algorithms.
It is open source software distributed under the terms of the GNU General Public License. The input to the machine learning algorithms is in the form of a relational table in the ARFF format. Weka comes with API documentation generated using Javadoc. More details on Weka and its usage are available across a few chapters of the book by Ian H. Witten and Eibe Frank, 'Data Mining: Practical Machine Learning Tools and Techniques', 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
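As a small taste of the API, the Java sketch below loads an ARFF file and prints a summary of the dataset. The file name mydata.arff is a placeholder; the classes used come from the weka.core packages of a standard Weka distribution.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Load a dataset from an ARFF file (the file name is a placeholder).
        Instances data = DataSource.read("mydata.arff");

        // Treat the last attribute as the class attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // Print the relation name, number of instances and per-attribute statistics.
        System.out.println(data.toSummaryString());
    }
}
```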
How it helps
Some key features of WEKA include:
Preprocess: Weka has file format converters for spreadsheets, C4.5 file formats and serialized instances. It can also open a URL and use HTTP to download an ARFF file from the Web, or open a database using JDBC and retrieve instances using SQL. It also provides a set of filters, for example to delete specified attributes from a dataset (see the sketch after this feature list).
Cluster: Weka shows the clusters and the number of instances in each cluster. Thereafter it determines the majority class in each cluster and gives the confusion matrix.
Associate: Weka contains three algorithms for determining association rules: Apriori, predictive Apriori and filtered associators. It has no methods for evaluating such rules.
Attribute Selection: Weka gives access to several methods for attribute selection, which involves an attribute evaluator and a search method. Attribute selection can be performed using the full training set or cross-validation.
[Figure: In the Preprocess tab, you can view the attributes in the input file, the properties of the selected attribute, and a visualisation of the class distribution for each attribute.]
[Figure: Building a Naïve Bayes classifier with 10-fold cross-validation. The correctly classified instances can be viewed by right-clicking on the classifier entry in the Results window.]
Visualization: Weka displays a matrix of two-dimensional scatter plots of every pair of attributes.
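To give a feel for the filters mentioned under Preprocess, here is a minimal Java sketch that deletes the first attribute of a loaded dataset using the Remove filter. The file name mydata.arff is a placeholder, and the snippet assumes a standard Weka distribution on the classpath.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttribute {
    public static void main(String[] args) throws Exception {
        // Load the dataset (the file name is a placeholder).
        Instances data = DataSource.read("mydata.arff");

        // Configure the Remove filter to delete the first attribute.
        Remove remove = new Remove();
        remove.setAttributeIndices("1");   // 1-based attribute index
        remove.setInputFormat(data);

        // Apply the filter and obtain a new, reduced dataset.
        Instances filtered = Filter.useFilter(data, remove);
        System.out.println(filtered.toSummaryString());
    }
}
```

The same filter can also be applied interactively through the Filter panel of the Preprocess tab.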
Preparing input
A major part of the effort in data mining/machine learning goes into preparing the input. In order to analyze data using Weka, you need to prepare it in the Attribute-Relation File Format (ARFF) and then load it into the Explorer. Spreadsheets, Comma Separated Value (CSV) files and databases can be converted to ARFF. An ARFF file has an @relation tag, @attribute tags and a @data tag to represent the dataset name, the attribute information and the values, respectively.
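A minimal, made-up ARFF file illustrating these tags might look like the following (the relation name, attributes and values are purely illustrative):

```
% Toy example -- all names and values are illustrative
@relation weather_toy

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, 96, yes
```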
Classifying data
Weka is best used through the graphical user interface called 'Explorer' rather than through the command-line interface. The other two interfaces are the 'Knowledge Flow' interface, which supports design configurations for streamed data processing, and the 'Experimenter', which helps users compare a variety of learning techniques. In this example, we use an ARFF file named age.arff, whose attributes are a few selected words and whose @data section contains their number of occurrences per 10,000 words in blogs written by bloggers belonging to various age groups.
1. Open the file you want to analyze using the 'Open file' option in the Preprocess tab of the Weka Explorer, i.e. open the age data file, age.arff.
2. Once the input file has been opened, all attributes in the input file are shown in the Attributes window. Properties of the selected attribute, such as its name, type and number of missing values, are displayed in the 'Selected attribute' window. Here, you can select the attributes that you want to include in the working relation, e.g. for age prediction.
3. Select the classifier algorithm in the Classify tab. In this example, we selected Naïve Bayes with 10-fold cross-validation. Next, click on Start. The result is displayed in the Classifier Output window, as shown in the figure on the left.
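The same experiment can also be run programmatically. The sketch below is a minimal Java equivalent of steps 1-3; it assumes that the age-group class is the last attribute of age.arff, builds a Naïve Bayes model and evaluates it with 10-fold cross-validation.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AgePrediction {
    public static void main(String[] args) throws Exception {
        // Load the age dataset; we assume the age-group class is the last attribute.
        Instances data = DataSource.read("age.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate Naive Bayes with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Print the summary, per-class accuracy and the confusion matrix.
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}
```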
[Figure: Output of the Naïve Bayes classifier, in terms of errors, accuracy by class and the confusion matrix, on the age dataset.]
[Figure: View of an ARFF dataset, which consists of a list of instances, with the attribute values for each instance separated by commas.]
Analyzing the result
The result displays a summary of the dataset followed by the algorithm used to analyze it. It also gives the predictive performance of the machine learning algorithm applied to the dataset. Thereafter, the confusion matrix displays the number of instances classified properly and those misclassified. The classification error is reported as the mean absolute error and the root mean squared error of the class probability estimates.
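For reference, these two error figures follow the standard definitions, computed over the class probability estimates, where p_i is the predicted and a_i the actual value for instance i out of n:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert p_i - a_i \rvert,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2}
```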
Processing huge datasets
If the dataset is very large, running to a few thousand attributes and a few lakh records, Weka can run into an 'OutOfMemory' exception. Most Java virtual machines allocate a default maximum heap that is much smaller than the RAM available on the machine, but the memory available to the virtual machine can be extended by setting the appropriate options. Alternatively, Weka offers several filters for re-sampling a dataset and generating a new dataset of reduced size. Besides, unlike most classifiers, which require all the data before they can be trained, there are schemes that can be trained in an incremental fashion rather than only in batch mode. Such a setup loads the dataset incrementally and feeds it to the classifier instance by instance.
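The heap can be raised with the standard JVM option when launching Weka, e.g. java -Xmx2g -jar weka.jar. For the incremental route, the sketch below streams instances from a hypothetical huge.arff through NaiveBayesUpdateable, one of Weka's updateable classifiers, instead of loading the whole file into memory at once.

```java
import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class IncrementalTraining {
    public static void main(String[] args) throws Exception {
        // Read only the header of the (hypothetical) large ARFF file.
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("huge.arff"));
        Instances structure = loader.getStructure();
        structure.setClassIndex(structure.numAttributes() - 1);

        // NaiveBayesUpdateable can be trained instance by instance.
        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure);

        // Stream instances one at a time and update the model.
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
            nb.updateClassifier(current);
        }
        System.out.println(nb);
    }
}
```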
Conclusion
It is difficult for a single machine learning tool to suit all data mining requirements, and the universal learner is still a distant dream. In order to obtain an accurate model of real datasets, the learning algorithm must match the domain. Data mining is an experimental science, and Weka provides a workbench of data preprocessing tools and machine learning algorithms to experiment with. Weka helps in realizing the goal of data mining by predicting missing values and validating that the predicted values are correct.
Abhinav Gupta & Sumit Goswami