Financial Daily from THE HINDU group of publications
Tuesday, Mar 09, 2004




C-DAC tool for editing scanned Malayalam text

Vinson Kurian

Thiruvananthapuram, March 8

THE Centre for Development of Advanced Computing (C-DAC), Thiruvananthapuram, has developed an Optical Character Recognition (OCR) system that converts scanned images of printed Malayalam documents to editable text.

The Malayalam OCR, christened 'Nayana', is a multi-font system that works across a range of font sizes. The system has a recognition speed of 50 characters per second, says Ms K.G. Sulochana, Deputy Director, Resource Centre for Indian Language Technology Solution (RCILTS-Malayalam) at C-DAC.

RCILTS-Malayalam is one of the 13 resource centres set up across the country by the Union Ministry of Communications and Information Technology under the Technology Development for Indian Languages (TDIL) programme. These resource centres seek to take information technologies to the masses in their local languages, and cater to all the constitutionally recognised Indian languages as well as some foreign languages. The language of focus at C-DAC Thiruvananthapuram is Malayalam, the official language of the State.

The system consists of a pre-processing module, the OCR engine and a post-processing module, Ms Sulochana said. The pre-processing tasks performed by the first module include noise removal, conversion of grey-scale images to binary, skew detection and correction, and line, word and character segmentation. The scanned grey-tone images are converted into two-tone (binary) images using a histogram-based 'thresholding' approach. Skew detection is done using a projection profile-based technique. Once the skew angle has been estimated, the skew is corrected by rotating the image by that angle in the opposite direction.
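
The two pre-processing steps named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not C-DAC's code: the article does not say which histogram rule Nayana uses, so Otsu's method stands in as one common histogram-based choice, and the "sharpness" score for the projection profile (sum of squared row counts) is likewise an assumption. The function names are hypothetical.

```python
import math

def otsu_threshold(pixels):
    """Pick a global threshold from the grey-level histogram (Otsu's method, assumed)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0 or n_light == 0:
            continue  # one class is empty, so this split is meaningless
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance, up to a constant factor.
        score = n_dark * n_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t

def ink_points(grey_rows, threshold):
    """Binarise: pixels at or below the threshold count as ink; return (x, y) pairs."""
    return [(x, y) for y, row in enumerate(grey_rows)
            for x, value in enumerate(row) if value <= threshold]

def estimate_skew(points, candidates_deg=range(-5, 6)):
    """Return the candidate angle (degrees) whose horizontal projection is sharpest."""
    best_angle, best_score = 0, -1
    for deg in candidates_deg:
        a = math.radians(deg)
        profile = {}
        for x, y in points:
            # Row index of the ink pixel after undoing a skew of `deg` degrees.
            row = round(y * math.cos(a) - x * math.sin(a))
            profile[row] = profile.get(row, 0) + 1
        # Text lines collapse into few, tall peaks when the angle is right.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = deg, score
    return best_angle
```

Given a grey-scale page as a list of rows of 0-255 values, `otsu_threshold` would be called on the flattened pixels, `ink_points` on the page with that threshold, and `estimate_skew` on the resulting points. The sketch searches whole-degree candidates only; a real system would search more finely and then rotate the image by the negative of the estimate.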

The OCR engine is based on the feature extraction method of character recognition. Feature extraction can be equated to finding a set of vectors that effectively represent the information content of a character. The features are selected in such a way that they help discriminate between characters. A multistage classification procedure is used, which reduces the processing time while maintaining accuracy.
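
To make the idea of feature vectors and staged classification concrete, here is a toy Python sketch. The article does not describe Nayana's actual features or classifier stages, so everything here is an assumption for illustration: the features are zone ink densities, the first stage is a cheap ink-count filter that shortlists candidates, and the second stage is a nearest-prototype match on the full vector. The prototype shapes and labels are hypothetical.

```python
def zone_features(glyph, zones=2):
    """Split a square binary glyph into zones x zones blocks; return ink fraction per block."""
    step = len(glyph) // zones
    feats = []
    for zr in range(zones):
        for zc in range(zones):
            ink = sum(glyph[r][c] == "#"
                      for r in range(zr * step, (zr + 1) * step)
                      for c in range(zc * step, (zc + 1) * step))
            feats.append(ink / (step * step))
    return feats

def ink_count(glyph):
    """Cheap coarse feature: total number of ink pixels."""
    return sum(row.count("#") for row in glyph)

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def classify(glyph, prototypes, tolerance=3):
    """Two-stage match: coarse ink-count filter, then nearest zone-feature vector."""
    n = ink_count(glyph)
    # Stage 1: discard prototypes whose ink count is far from the sample's.
    shortlist = {label: proto for label, proto in prototypes.items()
                 if abs(ink_count(proto) - n) <= tolerance}
    if not shortlist:
        shortlist = prototypes  # fall back to a full search
    # Stage 2: compare full feature vectors only for the shortlisted classes.
    feats = zone_features(glyph)
    return min(shortlist,
               key=lambda label: squared_distance(feats, zone_features(shortlist[label])))

# Hypothetical 4x4 prototypes; a real system would hold one or more per character.
PROTOTYPES = {
    "vertical_bar":   [".#..", ".#..", ".#..", ".#.."],
    "horizontal_bar": ["....", "####", "....", "...."],
    "box":            ["####", "#..#", "#..#", "####"],
}
```

The time saving comes from the first stage: the more expensive vector comparison runs only against the shortlist rather than every known character.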

After passing through the different stages of the classifier, the character is identified and the corresponding character code is assigned. A training module is incorporated in the OCR engine to recognise characters that differ from normal characters in shape and style (for example, decorative fonts). In the post-processing module, linguistic rules are applied to the recognised text to correct classification errors.

For example, certain characters never occur at the beginning of a word and, if found there, are remapped appropriately. Similarly, dependent vowel signs can occur only with consonants or consonant conjuncts; if found along with vowels or soft consonants, they are remapped into consonants or conjuncts similar in shape to the vowel sign. Independent vowels occur only at the beginning of a word and, if found anywhere else, are mapped to a consonant or ligature of similar shape.
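
Rules of this kind can be sketched as a position check plus a lookalike table. The Python below is a simplified illustration, not the Nayana rule set: it classifies characters by approximate Unicode Malayalam ranges, ignores decomposed two-part vowel signs, and leaves a character unchanged when the table has no entry for it. The single table entry (anusvara remapped to the letter TTHA, both roughly circular) is an assumed example; the article does not list the actual pairs.

```python
def is_independent_vowel(ch):
    return "\u0D05" <= ch <= "\u0D14"   # A .. AU (approximate range)

def is_consonant(ch):
    return "\u0D15" <= ch <= "\u0D39"   # KA .. HA

def is_vowel_sign(ch):
    return "\u0D3E" <= ch <= "\u0D4C"   # dependent vowel signs AA .. AU

# Marks that cannot start a word: anusvara, visarga, virama.
NON_INITIAL_MARKS = {"\u0D02", "\u0D03", "\u0D4D"}

def breaks_rule(ch, prev):
    """True if `ch` is illegal given the previous (already corrected) character."""
    if prev is None:                       # word-initial position
        return ch in NON_INITIAL_MARKS or is_vowel_sign(ch)
    if is_independent_vowel(ch):           # independent vowels only start words
        return True
    if is_vowel_sign(ch):                  # vowel signs must follow a consonant
        return not is_consonant(prev)
    return False

def fix_word(word, lookalikes):
    """Replace rule-breaking characters with their lookalike, if one is listed."""
    out = []
    for ch in word:
        prev = out[-1] if out else None
        if breaks_rule(ch, prev):
            ch = lookalikes.get(ch, ch)    # no entry: keep the character as-is
        out.append(ch)
    return "".join(out)

# Assumed example pair only: anusvara (U+0D02) -> letter TTHA (U+0D20).
LOOKALIKES = {"\u0D02": "\u0D20"}
```

Calling `fix_word` on each recognised word with a fuller table would apply all three rules in a single left-to-right pass, checking each character against the corrected text so far.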

Extensive testing has been done on approximately 500 pages of printed documents of varying quality. The system has undergone certification testing at the Electronics Test and Development Centre (ETDC), Chennai.

The OCR can be integrated with a Malayalam text-to-speech system to produce a text-reading system for the visually challenged. Other application areas include publishing, content creation, digital libraries and corpus development.


Copyright © 2004, The Hindu Business Line. Republication or redissemination of the contents of this screen are expressly prohibited without the written consent of The Hindu Business Line