Abstract:
Manually extracting data from tables is expensive and time-consuming for large collections. Automatic table recognition and extraction provide scalability and usability for digital libraries and their collections, and large-scale extraction of tabular data could be useful for data mining applications. In this thesis, a system for recognising and extracting tables from heterogeneous documents (documents of different types, formats and structures) was developed. All documents other than HyperText Markup Language (HTML) documents were first pre-processed and converted to HTML, after which an algorithm recognises the table portions of the documents. A Hidden Markov Model (HMM) was applied to the HTML code. The model was trained and tested with five hundred and twenty-six (526) self-generated tables: three hundred and twenty-one (321) for training and two hundred and five (205) for testing. The Viterbi algorithm was implemented for the testing phase. PHP and MySQL were used for the implementation. The system was evaluated in terms of accuracy, precision, recall and F-measure; it achieved an overall accuracy of 88.8%, precision of 96.8%, recall of 91.7% and F-measure of 88.8%. Since all the evaluation indices were high, the results indicate that the model is useful for solving the problem of automatic table recognition and extraction.
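
To illustrate the decoding step mentioned above, the following is a minimal sketch of Viterbi decoding over a sequence of HTML tag tokens. It is not the thesis implementation, which was written in PHP; Python is used here only for brevity. The two states (`other`, `table`), the token vocabulary and every probability in the toy model are assumptions made up for this example, not values learned from the 321 training tables.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    if not observations:
        return []

    # Work in log space so long sequences do not underflow.
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximises the path score.
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: all numbers below are illustrative assumptions.
states = ["other", "table"]
start_p = {"other": 0.8, "table": 0.2}
trans_p = {
    "other": {"other": 0.7, "table": 0.3},
    "table": {"other": 0.2, "table": 0.8},
}
emit_p = {
    "other": {"p": 0.6, "div": 0.3, "table": 0.05, "tr": 0.025, "td": 0.025},
    "table": {"p": 0.05, "div": 0.05, "table": 0.2, "tr": 0.3, "td": 0.4},
}

tokens = ["p", "table", "tr", "td", "td", "p"]
labels = viterbi(tokens, states, start_p, trans_p, emit_p)
print(labels)  # ['other', 'table', 'table', 'table', 'table', 'other']
```

In this sketch the contiguous run of `table` labels marks the span that would be handed to an extraction step; how the thesis system actually tokenises HTML and defines its states is not specified in the abstract.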