Financial Daily from THE HINDU group of publications Tuesday, Mar 09, 2004 |
C-DAC tool for editing scanned Malayalam text
Vinson Kurian
Thiruvananthapuram, March 8
THE Centre for Development of Advanced Computing (C-DAC), Thiruvananthapuram, has developed an Optical Character Recognition (OCR) system that converts scanned images of printed Malayalam documents into editable text. The Malayalam OCR, christened `Nayana', is a multi-font system that works across a range of font sizes. The system has a recognition speed of 50 characters per second, says Ms K.G. Sulochana, Deputy Director, Resource Centre for Indian Language Technology Solutions (RCILTS-Malayalam) at C-DAC.
RCILTS-Malayalam is one of the 13 resource centres set up across the country by the Union Ministry of Communications and Information Technology under the Technology Development for Indian Languages (TDIL) programme. These resource centres seek to take information technologies to the masses in their local languages and cater to all the constitutionally recognised Indian languages as well as some foreign languages. The language of focus at C-DAC Thiruvananthapuram is Malayalam, the official language of the State.
The OCR system consists of a pre-processing module, the OCR engine and a post-processing module, Ms Sulochana said. The pre-processing tasks performed by the first module include noise removal, conversion of the grey-scale image to binary, skew detection and correction, and line, word and character segmentation. The scanned images in grey tone are converted into two-tone (binary) images using a histogram-based `thresholding' approach. Skew detection is done using a projection profile-based technique. Once the skew angle has been estimated, the skew is corrected by rotating the image through that angle in the opposite direction.
The OCR engine is based on the feature extraction method of character recognition. Feature extraction can be equated to finding a set of vectors that effectively represent the information content of a character. The features are selected in such a way that they help discriminate between characters.
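The binarisation and skew-estimation steps described above can be illustrated with a minimal sketch. It is not C-DAC's implementation. The article says only that thresholding is histogram-based and that skew detection uses projection profiles, so the choice of Otsu's criterion, the sum-of-squares sharpness score, the candidate-angle grid, and every function name here (including the `synthetic_page` helper used to make a test image) are assumptions made for illustration.

```python
import math


def histogram_threshold(gray):
    """Choose a grey-level threshold from the image histogram.

    Uses Otsu's between-class variance criterion, which is an assumption:
    the article only says the approach is histogram-based.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    weighted_total = sum(level * n for level, n in enumerate(hist))

    best_level, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += hist[level]
        dark_sum += level * hist[level]
        light_count = total - dark_count
        if dark_count == 0:
            continue
        if light_count == 0:
            break
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_level, best_score = level, score
    return best_level


def binarize(gray, threshold):
    """Map pixels at or below the threshold to ink (1), the rest to paper (0)."""
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


def estimate_skew(binary, candidate_angles):
    """Return the candidate angle (degrees) whose row projection is sharpest."""
    ink = [(x, y) for y, row in enumerate(binary)
           for x, value in enumerate(row) if value]

    def sharpness(angle_deg):
        a = math.radians(angle_deg)
        sin_a, cos_a = math.sin(a), math.cos(a)
        profile = {}
        for x, y in ink:
            # Row index of the pixel after rotating the page back by the angle.
            r = round(y * cos_a - x * sin_a)
            profile[r] = profile.get(r, 0) + 1
        # The sum of squares is largest when ink piles into few rows.
        return sum(n * n for n in profile.values())

    return max(candidate_angles, key=sharpness)


def synthetic_page(skew_deg, width=200, height=80):
    """Hypothetical test page: three one-pixel 'text lines' drawn at a skew."""
    page = [[200] * width for _ in range(height)]
    slope = math.tan(math.radians(skew_deg))
    for baseline in (10, 30, 50):
        for x in range(width):
            y = round(baseline + x * slope)
            if 0 <= y < height:
                page[y][x] = 20
    return page


page = synthetic_page(3.0)
level = histogram_threshold(page)
binary = binarize(page, level)
angles = [i * 0.5 for i in range(-10, 11)]  # -5.0 to 5.0 in 0.5 degree steps
print("threshold:", level)
print("estimated skew:", estimate_skew(binary, angles))
```

The idea behind the projection profile is that when text lines are horizontal, ink falls into a few dense rows separated by empty gaps, so the profile is at its most peaked at the true skew angle. A production system would search a finer grid of angles and work on real scanned pages rather than a synthetic one.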
A multistage classification procedure is used, which reduces the processing time while maintaining accuracy. After passing through the different stages of the classifier, the character is identified and the corresponding character code is assigned. A training module is incorporated in the OCR engine to recognise characters that differ from normal characters in shape and style (for example, decorative fonts).
In the post-processing module, linguistic rules are applied to the recognised text to correct classification errors. For example, certain characters never occur at the beginning of a word and, if found there, are remapped appropriately. Similarly, dependent vowel signs can occur only with consonants or consonant conjuncts; if found along with vowels or soft consonants, they are remapped into consonants or conjuncts similar in shape to the vowel sign. Independent vowels occur only at the beginning of a word and, if found anywhere else, are mapped into a consonant or ligature of similar shape.
Extensive testing has been done on approximately 500 pages of printed documents of varying quality. The system has undergone certification testing at the Electronics Test and Development Centre (ETDC), Chennai.
The OCR can be integrated with a Malayalam text-to-speech system to produce a text reading system for the visually challenged. Other application areas include the publishing sector, content creation, digital libraries and corpus development.
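The positional rules of the post-processing module described earlier can be sketched as follows. This is a simplified illustration, not C-DAC's rule set. It assumes the recognised text is NFC-normalised Unicode in logical order, uses the code point ranges of the Unicode Malayalam block, and takes the shape-similarity table as a caller-supplied dictionary, since the article does not publish the actual remapping table. The function name `apply_rules` and the pairs in `placeholder_map` are hypothetical placeholders and make no claim about which glyphs really look alike.

```python
# Code point ranges within the Unicode Malayalam block (U+0D00..U+0D7F).
INDEPENDENT_VOWELS = range(0x0D05, 0x0D15)  # MALAYALAM LETTER A .. AU
CONSONANTS = range(0x0D15, 0x0D3A)          # MALAYALAM LETTER KA .. HA
VOWEL_SIGNS = range(0x0D3E, 0x0D4D)         # MALAYALAM VOWEL SIGN AA .. AU


def apply_rules(word, similar_shape):
    """Remap characters that break two positional rules from the article.

    Rule 1: an independent vowel is valid only as the first character.
    Rule 2: a dependent vowel sign must follow a consonant.
    Offending characters are replaced using `similar_shape`; characters
    with no entry in the table are left unchanged.
    """
    corrected = []
    for index, ch in enumerate(word):
        code = ord(ch)
        previous = corrected[-1] if corrected else None
        if code in INDEPENDENT_VOWELS and index > 0:
            ch = similar_shape.get(ch, ch)
        elif code in VOWEL_SIGNS and (
            previous is None or ord(previous) not in CONSONANTS
        ):
            ch = similar_shape.get(ch, ch)
        corrected.append(ch)
    return "".join(corrected)


# Hypothetical table: LETTER A -> LETTER KA, VOWEL SIGN AA -> LETTER RRA.
placeholder_map = {"\u0d05": "\u0d15", "\u0d3e": "\u0d31"}

# KA followed by an independent vowel: the vowel is out of place and remapped.
print(apply_rules("\u0d15\u0d05", placeholder_map))
# KA followed by a vowel sign: a valid sequence, returned unchanged.
print(apply_rules("\u0d15\u0d3e", placeholder_map))
```

Keeping the similarity table outside the function means the rule logic can stay fixed while the table is tuned against the confusions a particular classifier actually makes.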
Copyright © 2004, The Hindu Business Line. Republication or redissemination of the contents of this screen are expressly prohibited without the written consent of The Hindu Business Line.