Lecture 8. Optical Character Recognition. Qurat-ul-Ain (Ainie) Akram, Sarmad Hussain. Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan. Syllable String Creation using lookup table.
Optical Character Recognition
Qurat-ul-Ain (Ainie) Akram, Sarmad Hussain
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science
University of Engineering and Technology, Lahore, Pakistan
Lecture 8
ISSALE 2014 2
Syllable String Creation using lookup table
Syllable String | Main body ID | Diacritics1_ID | Diacritics2_ID | ….
تا | 500 | 2
و | 501
پتھر | 200 | 2 | 1 | 2
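The lookup step can be sketched in Python as follows. This is a minimal sketch, assuming the table is keyed by the main body ID together with the ordered diacritic IDs. The names `LOOKUP_TABLE` and `create_syllable_string` are hypothetical, and the three entries are only the ones shown on the slide.

```python
from typing import Optional, Sequence

# Hypothetical in-memory form of the slide's lookup table:
# (main body ID, ordered diacritic IDs) -> syllable string.
LOOKUP_TABLE = {
    (500, (2,)): "تا",
    (501, ()): "و",
    (200, (2, 1, 2)): "پتھر",
}


def create_syllable_string(main_body_id: int,
                           diacritic_ids: Sequence[int] = ()) -> Optional[str]:
    """Return the syllable string for the recognized IDs, or None if no entry matches."""
    return LOOKUP_TABLE.get((main_body_id, tuple(diacritic_ids)))
```

For example, `create_syllable_string(500, [2])` looks up the first row of the table. Treating the diacritic order as significant is an assumption; if order should not matter, the key could use a sorted tuple instead.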
Project Presentation
1. Front Page
– Optical Character Recognition (in English)
– Optical Character Recognition (in Your Language)
– Document Image
– Output of OCR (Recognized Syllable Strings of OCR)
– Syllable String Recognition Accuracy (Correctly Recognized Syllables / Total Syllables * 100)
– Group Members Name
1. Preprocessing
– Line Segmentation
  • Samples of line segmentation
  • Line segmentation accuracy results
  • Samples of incorrect line segmentation
– Syllable/Ligature Segmentation
  • Samples of Syllable/Ligature segmentation
  • Syllable/Ligature Segmentation Accuracy Results
  • Samples of incorrect Syllable/Ligature segmentation
Total Lines | Correct Lines | Incorrect Lines | % Accuracy
Total Syllables | Correct Syllables | Incorrect Syllables | % Accuracy
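The slides do not name a line segmentation method. One common approach is a horizontal projection profile, sketched below under that assumption: a row with no ink pixels is treated as a gap between text lines. The function name `segment_lines` and the 0/1 list-of-lists image format are hypothetical choices for illustration.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of consecutive rows containing ink.

    `image` is a binarized page: 1 = ink pixel, 0 = background.
    """
    lines = []
    start = None
    for row_index, row in enumerate(image):
        has_ink = sum(row) > 0  # horizontal projection of this row
        if has_ink and start is None:
            start = row_index  # a text line begins here
        elif not has_ink and start is not None:
            lines.append((start, row_index - 1))  # line ended on the previous row
            start = None
    if start is not None:  # page ends while still inside a text line
        lines.append((start, len(image) - 1))
    return lines
```

On real scans, diacritics above or below a line can be separated from it by a blank row, so a practical version would probably merge runs that are closer than some minimum gap.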
• Pre-processing
– Main body and diacritics disambiguation
Total main bodies | Correctly classified as main bodies | % Accuracy
Total diacritics | Correctly classified as diacritics | % Accuracy
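The slides do not say how main bodies are told apart from diacritics. As an illustration only, the sketch below uses a size heuristic: within one syllable, any connected component much smaller than the largest one is treated as a diacritic. The function name, the `ratio` parameter, and its default of 0.3 are all assumptions, not values from the lecture.

```python
from typing import Dict, List


def split_main_body_and_diacritics(areas: List[int],
                                   ratio: float = 0.3) -> Dict[str, List[int]]:
    """Label each connected component of one syllable by its pixel area.

    Components smaller than `ratio` times the largest area are treated as
    diacritics. The returned lists hold indices into `areas`.
    """
    result: Dict[str, List[int]] = {"main_body": [], "diacritics": []}
    if not areas:
        return result
    threshold = ratio * max(areas)  # assumed cut-off, relative to the largest component
    for index, area in enumerate(areas):
        key = "main_body" if area >= threshold else "diacritics"
        result[key].append(index)
    return result
```

Comparing these labels with hand-labelled components gives the counts needed for the two accuracy tables above.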
• Classification and Recognition
– Data Description
  • 15 Main body Types (DataSet-1)
    – Training Data (35 Tokens)
    – Testing Data (15 Tokens)
    – Image samples
  • Document Images (DataSet-2)
    – Testing Data
      » X Tokens of Y main body Types
      » X Tokens of Y diacritics Types
      » Image sample
Main body Type | Total tokens in document images | Total unique syllables in document images
500 | 15 | 4
• Classification and recognition results
– Recognition Results on DataSet-1 using Decision Trees
  • Main body recognition accuracy
  • Diacritics recognition accuracy
– Recognition Results on DataSet-1 using Tesseract
  • Main body recognition accuracy
  • Diacritics recognition accuracy
Class Type | Total Samples, Test data (15 Tokens) | Correctly Recognized | % Accuracy
Class Type | Total Samples, Test data (15 Tokens) | Correctly Recognized | % Accuracy
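The rows of these result tables can be filled in from the classifier's predictions on the test tokens. The sketch below is one way to do that, assuming each test sample is available as a (true label, predicted label) pair. The function name `per_class_accuracy` is hypothetical, and the code is independent of whether the predictions come from Decision Trees or Tesseract.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def per_class_accuracy(
    pairs: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[int, int, float]]:
    """Map each class type to (total samples, correctly recognized, % accuracy).

    `pairs` holds (true_label, predicted_label) for every test sample.
    """
    totals: Counter = Counter()
    correct: Counter = Counter()
    for true_label, predicted_label in pairs:
        totals[true_label] += 1
        if true_label == predicted_label:
            correct[true_label] += 1
    return {
        label: (totals[label], correct[label],
                100.0 * correct[label] / totals[label])
        for label in totals
    }
```

Each dictionary entry corresponds to one table row: Class Type, Total Samples, Correctly Recognized, % Accuracy.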
• Classification and recognition results
– Recognition Results on DataSet-2 using Decision Trees
  • Main body recognition accuracy
  • Diacritics recognition accuracy
OR
– Recognition Results on DataSet-2 using Tesseract
  • Main body recognition accuracy
  • Diacritics recognition accuracy
Class Type | Total Samples | Correctly Recognized | % Accuracy
Class Type | Total Samples | Correctly Recognized | % Accuracy
• Post-processing
– Syllable String Creation
– Syllable String Recognition Accuracy
Syllable String | Main body ID | Diacritics1_ID | Diacritics2_ID | ….
تا | 500 | 2
و | 501
Syllable Type | Total Samples | Correctly Recognized | % Accuracy
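Overall syllable string recognition accuracy is the number of correctly recognized syllables divided by the total number of syllables, times 100. The sketch below computes it under the assumption that the recognized strings and the expected (ground truth) strings are already aligned one-to-one. The function name is hypothetical.

```python
from typing import Sequence


def syllable_string_accuracy(recognized: Sequence[str],
                             expected: Sequence[str]) -> float:
    """Percentage of positions where the recognized syllable string equals the expected one."""
    if len(recognized) != len(expected):
        raise ValueError("recognized and expected must be aligned one-to-one")
    if not expected:
        return 0.0  # no syllables to score
    correct = sum(1 for r, e in zip(recognized, expected) if r == e)
    return 100.0 * correct / len(expected)
```

If segmentation errors insert or drop syllables, the two sequences no longer line up, and an alignment step (for example edit distance) would be needed before counting.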
Deliverables to submit
1. Presentation slides
2. OCR Complete Code
   1. Line segmentation
   2. Syllable segmentation
   3. Recognition of diacritics and main bodies
   4. Syllable string creation using lookup Table
   5. Output.txt file generation
3. Data Set-1
4. Data Set-2
5. Tesseract Traineddata file
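The slides do not specify the layout of Output.txt. A minimal sketch of the file generation step, assuming one text line per segmented document line with the recognized syllable strings separated by spaces and encoded as UTF-8, might look like this. The function name `write_output` is hypothetical.

```python
from pathlib import Path
from typing import List


def write_output(lines: List[List[str]], path: str = "Output.txt") -> None:
    """Write one text line per document line, syllable strings separated by spaces."""
    text = "\n".join(" ".join(syllables) for syllables in lines)
    # UTF-8 is assumed so that Urdu (or other non-ASCII) syllables are preserved.
    Path(path).write_text(text + "\n", encoding="utf-8")
```

Here `lines` would be the output of syllable string creation, grouped by the document line each syllable came from.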
Document Image Creation
• Syllable_of_MB1_Samples_1 Syllable_of_MB2_Samples_1 Syllable_of_MB3_Samples_1 Syllable_of_MB4_Samples_1 Syllable_of_MB5_Samples_1 … Syllable_of_MB15_Samples_1
• Syllable_of_MB1_Samples_2 Syllable_of_MB2_Samples_2 Syllable_of_MB3_Samples_2 Syllable_of_MB4_Samples_2 Syllable_of_MB5_Samples_2 … Syllable_of_MB15_Samples_2
• Syllable_of_MB1_Samples_3 Syllable_of_MB2_Samples_3 Syllable_of_MB3_Samples_3 Syllable_of_MB4_Samples_3 Syllable_of_MB5_Samples_3 … Syllable_of_MB15_Samples_3
• Syllable_of_MB1_Samples_4 Syllable_of_MB2_Samples_4 Syllable_of_MB3_Samples_4 Syllable_of_MB4_Samples_4 Syllable_of_MB5_Samples_4 … Syllable_of_MB15_Samples_4
• …
• Syllable_of_MB1_Samples_15 Syllable_of_MB2_Samples_15 Syllable_of_MB3_Samples_15 Syllable_of_MB4_Samples_15 Syllable_of_MB5_Samples_15 … Syllable_of_MB15_Samples_15
Syllable = MB + Diacritics or Syllable = MB
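The layout can be generated programmatically. The sketch below assumes each row of the document image holds one syllable per main body type, with row k using sample k, so that 15 rows give 15 tokens of each of the 15 main body types. The function name `document_image_layout` is hypothetical; the labels follow the slide's naming pattern.

```python
from typing import List


def document_image_layout(num_main_bodies: int = 15,
                          num_samples: int = 15) -> List[List[str]]:
    """Row k holds sample k of one syllable for every main body type, in MB order."""
    return [
        [f"Syllable_of_MB{mb}_Samples_{sample}"
         for mb in range(1, num_main_bodies + 1)]
        for sample in range(1, num_samples + 1)
    ]
```

Each label stands for a syllable image, which is either a main body alone or a main body plus its diacritics.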