AUTOMATIC TABLE RECOGNITION AND EXTRACTION FROM HETEROGENEOUS DOCUMENTS

BABATUNDE, FLORENCE FOLAKE

dc.contributor.author	BABATUNDE, FLORENCE FOLAKE
dc.date.accessioned	2020-11-02T09:41:38Z
dc.date.available	2020-11-02T09:41:38Z
dc.date.issued	2015-11
dc.identifier.uri	http://196.220.128.81:8080/xmlui/handle/123456789/852
dc.description	M.TECH THESIS	en_US
dc.description.abstract	Manually extracting data from tables is expensive and time-consuming for a large collection. Automatic table recognition and extraction provides scalability and usability for digital libraries and their collections, and large-scale extraction of table data could be useful for data mining applications. In this thesis, a system for recognising and extracting tables from heterogeneous documents (documents of different types, formats and structures) was developed. All the heterogeneous documents, except the HyperText Markup Language (HTML) documents, were first pre-processed and converted to HTML code, after which an algorithm recognises the table portion of each document. A Hidden Markov Model (HMM) was applied to the HTML code. The model was trained and tested with five hundred and twenty-six (526) self-generated tables: three hundred and twenty-one (321) for training and two hundred and five (205) for testing. The Viterbi algorithm was implemented for the testing part. PHP and MySQL were used for the implementation. The system was evaluated in terms of accuracy, precision, recall and F-measure; it achieved an overall accuracy of 88.8%, precision of 96.8%, recall of 91.7% and F-measure of 88.8%. Since all the evaluation indices were high, the results indicate that the model is useful for solving the problem of automatic table recognition and extraction.	en_US
dc.description.sponsorship	FEDERAL UNIVERSITY OF TECHNOLOGY AKURE	en_US
dc.language.iso	en	en_US
dc.publisher	FEDERAL UNIVERSITY OF TECHNOLOGY AKURE	en_US
dc.subject	Data extraction	en_US
dc.subject	Heterogeneous documents	en_US
dc.subject	Table Extraction (TE)	en_US
dc.subject	Automatic table recognition and extraction	en_US
dc.title	AUTOMATIC TABLE RECOGNITION AND EXTRACTION FROM HETEROGENEOUS DOCUMENTS	en_US
dc.type	Thesis	en_US