HP, IISc give push to Document Image Processing in Indian scripts

CIOL Bureau

BANGALORE: Two years from now, computer users may be able to convert handwritten or printed literature in Kannada, Telugu and other Indian languages into editable documents or audio files, or even translate local-language text into English, in seconds.

Document Image Processing (DIP) has gained importance as a way to let the paper and digital worlds coexist seamlessly. A few modern enterprises depend entirely on the digital world, but most government and company work still exists on paper, which wastes time and space and adds maintenance costs.

This is where document image processing comes in: it lets a machine deal with paper by breaking through the paper's opacity. A paper document can be digitized with a scanner or camera to enter the IT world. However, unless the resulting images are intelligently processed, they retain many of the bottlenecks of paper.

Though DIP has worked well for English and a few other languages, thanks to the development of Optical Character Recognition (OCR) software, there has been no organized effort to develop the technology for any Indian script, and experts in the field have been scattered, working for different organizations.

Perhaps the first-ever effort to bring linguistic experts, technicians and researchers together to develop technologies that make Indian languages available for DIP is being made at the Summer School on DIP, organized at the Indian Institute of Science, Bangalore, in association with HP Labs, India.

"Unlike English, which has been successfully adopted for DIP, Indian languages aren't easy to make recognizable by character, as each word can come in different forms, conveying different meanings. The Summer School is an effort to establish a society of resources who otherwise operate on different projects in their respective organizations," says R.N. Sitaram, coordinator of the summer school and senior research scientist, HP Labs.

Efforts to make Kannada or Telugu adaptable to DIP are still at an early stage. The most challenging issues faced by experts relate to making the machine recognize the various forms of words. Only after these issues are sorted out can OCR software be developed for any language, Sitaram observed.

According to A.G. Ramakrishnan, one of the coordinators of the Summer School on DIP and an associate professor in the Department of Electrical Engineering, IISc, OCR can recover valuable information and format it in reusable form. Information can be gathered from old paper files, resumes and applications, forms, address labels, etc. OCR can also help digitize libraries, thus saving time and money.

People are used to paper and value it as an instrument that makes a transaction authentic, as with receipts, bills, transaction certificates, land records and legal agreements. Converting these documents to electronic format can benefit governments and organizations through speedier execution of work and savings in cost and time, besides bringing more transparency to their work.