Abstract:
Big data has become ubiquitous: in many real-life applications, high-throughput machines and techniques for data gathering and curation make high volumes of valuable data, of wide variety and differing veracity, available at high velocity. In bioinformatics, terabytes of bio-sequence data can now be generated within a few hours using next-generation sequencing (NGS) technologies. Although researchers have developed several algorithms for bio-sequence motif mining, these algorithms are grossly inadequate for the recent data explosion in bioinformatics and biomedicine. This research therefore proposes a scalable parallel algorithm, based on association rule mining, for high-performance computing in bioinformatics. Specifically, the algorithm uses fault-tolerant resilient distributed datasets (RDDs) in the Apache Spark computing framework to mine sequence motifs from large, uncertain repositories of deoxyribonucleic acid (DNA) sequences. The proposed algorithm is divided into four stages. In Stage I, the bio-sequence data is read from a text file on the local disk, converted to an RDD, and partitioned across the memory of the worker (slave) nodes. In Stage II, a map operator is applied to each RDD partition on the worker nodes (N) to generate pairs of all subsequences of the input sequence, retaining only unique pairs on each node. In Stage III, frequent patterns are generated on each RDD partition using the minimum support threshold and the uncertainty distance function. In Stage IV, the frequent subsequences on each RDD partition across all worker nodes are grouped by length and sent to the master node. The algorithm was implemented in Scala on a standalone Apache Spark 1.6.1 cluster of five (5) homogeneous distributed-memory systems running the 64-bit Ubuntu 15.0 operating system, connected through a low-latency InfiniBand and Gigabit Ethernet network.
The algorithm was evaluated using three datasets. With the bacteria dataset, the average speedup of the algorithm for motifs of length two to six is 8.900 seconds, while results from the Drosophila ananassae RNA Chromosome and human genome GRCh38.p7 DNA datasets show an average speedup of 4.486 seconds for motifs of the same lengths. Thus, for a large input dataset, the speed on the cluster is approximately five times the speed on a single node. The complexity of the algorithm is O(n^2) on a single node and O(log_5 n) on a cluster of 5 nodes. The proposed algorithm was also benchmarked against IMRSPM, MCES and PMSPMR, and experimental results show that it extracts more accurate motifs in less time.
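The four stages described in the abstract can be illustrated with a minimal single-process sketch. It is written in plain Python rather than Scala on Spark, so RDD partitions are simulated as ordinary lists and no cluster is involved. All function names and parameters are hypothetical. Two points are assumptions, since the abstract does not specify them: "subsequences" are taken to be contiguous substrings emitted as (subsequence, 1) pairs, and the uncertainty distance function is replaced by exact matching against the minimum support threshold.

```python
from collections import Counter, defaultdict


def partition(sequences, num_workers):
    """Stage I: split the input sequences round-robin across simulated workers."""
    return [sequences[i::num_workers] for i in range(num_workers)]


def subsequence_pairs(seq, min_len, max_len):
    """Stage II: emit one (subsequence, 1) pair per distinct contiguous subsequence."""
    unique = {
        seq[i:i + k]
        for k in range(min_len, max_len + 1)
        for i in range(len(seq) - k + 1)
    }
    return [(sub, 1) for sub in unique]


def frequent_in_partition(part, min_support, min_len, max_len):
    """Stage III: count support inside one partition and keep frequent patterns.

    Assumption: exact matching stands in for the uncertainty distance function.
    """
    counts = Counter()
    for seq in part:
        for sub, one in subsequence_pairs(seq, min_len, max_len):
            counts[sub] += one
    return {sub: c for sub, c in counts.items() if c >= min_support}


def group_by_length(partition_results):
    """Stage IV: merge per-partition results on the 'master', grouped by length."""
    grouped = defaultdict(dict)
    for result in partition_results:
        for sub, count in result.items():
            bucket = grouped[len(sub)]
            bucket[sub] = bucket.get(sub, 0) + count
    return {length: dict(bucket) for length, bucket in grouped.items()}


def mine_motifs(sequences, num_workers=2, min_support=2, min_len=2, max_len=3):
    """Run the four stages in order and return {length: {motif: support}}."""
    parts = partition(sequences, num_workers)
    local = [frequent_in_partition(p, min_support, min_len, max_len) for p in parts]
    return group_by_length(local)


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TACG", "GACG"]
    print(mine_motifs(toy, num_workers=2, min_support=2))
```

On this toy input the motifs AC, CG and ACG are frequent in both simulated partitions, so each ends with a merged support of 4, while GA is frequent in only one partition and ends with a support of 2. Because the threshold is applied per partition, as in Stage III, a merged count covers only the partitions in which the pattern was locally frequent.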