AUTOMATIC TABLE RECOGNITION AND EXTRACTION FROM HETEROGENEOUS DOCUMENTS

dc.contributor.author BABATUNDE, FLORENCE FOLAKE
dc.date.accessioned 2020-11-02T09:41:38Z
dc.date.available 2020-11-02T09:41:38Z
dc.date.issued 2015-11
dc.identifier.uri http://196.220.128.81:8080/xmlui/handle/123456789/852
dc.description M.TECH THESIS en_US
dc.description.abstract Manually extracting data from tables is expensive and time-consuming for a large collection. Automatic table recognition and extraction provide scalability and usability for digital libraries and their collections, and large-scale extraction of table data could be useful for data mining applications. In this thesis, a system for recognising and extracting tables from heterogeneous documents (documents of different types, formats and structures) was developed. All documents other than HyperText Markup Language (HTML) documents were first pre-processed and converted to HTML, after which an algorithm recognised the table portion of each document. A Hidden Markov Model (HMM) was applied to the HTML code. The model was trained and tested with five hundred and twenty-six (526) self-generated tables: three hundred and twenty-one (321) for training and two hundred and five (205) for testing. The Viterbi algorithm was implemented for the testing part. PHP and MySQL were used for the implementation. The system was evaluated in terms of accuracy, precision, recall and F-measure; it achieved an overall accuracy of 88.8%, precision of 96.8%, recall of 91.7% and F-measure of 88.8%. Since all the evaluation indices were high, the results indicate that the model is useful for solving the problem of automatic table recognition and extraction. en_US
dc.description.sponsorship FEDERAL UNIVERSITY OF TECHNOLOGY AKURE en_US
dc.language.iso en en_US
dc.publisher FEDERAL UNIVERSITY OF TECHNOLOGY AKURE en_US
dc.subject Data extraction en_US
dc.subject heterogeneous documents en_US
dc.subject Table Extraction (TE) en_US
dc.subject Automatic table recognition and extraction en_US
dc.title AUTOMATIC TABLE RECOGNITION AND EXTRACTION FROM HETEROGENEOUS DOCUMENTS en_US
dc.type Thesis en_US
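
Illustrative note: the abstract states that a Hidden Markov Model was applied to the HTML code and that the Viterbi algorithm was used in testing, but this record does not give the model's states, observation symbols or probabilities. The sketch below is therefore only a generic Viterbi decoder under assumed values. The two states ("table", "other"), the token vocabulary and every probability are hypothetical, and Python is used purely for illustration (the thesis implementation used PHP and MySQL).

```python
import math

# Hypothetical two-state model: each HTML token lies inside or outside a table.
STATES = ("table", "other")

# All probabilities are made-up illustration values, not figures from the thesis.
START = {"table": 0.2, "other": 0.8}
TRANS = {
    "table": {"table": 0.8, "other": 0.2},
    "other": {"table": 0.1, "other": 0.9},
}
EMIT = {
    "table": {"<table>": 0.2, "<tr>": 0.3, "<td>": 0.4, "<p>": 0.05, "text": 0.05},
    "other": {"<table>": 0.02, "<tr>": 0.02, "<td>": 0.02, "<p>": 0.44, "text": 0.5},
}


def viterbi(observations):
    """Return the most likely state sequence for a list of observed tokens."""
    if not observations:
        return []
    # Work in log space so long sequences do not underflow.
    first = observations[0]
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this step.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            current[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            pointers[s] = best
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    tokens = ["<p>", "<td>", "<td>", "<td>"]
    print(list(zip(tokens, viterbi(tokens))))
```

In a real system the start, transition and emission probabilities would be estimated from labelled training tables rather than written by hand, and the observation vocabulary would have to cover every token the pre-processing step can produce.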

