Interactivity for Big Data: Preprocessing genomic data with MapReduce

Reference:

Matti Niemenmaa. Interactivity for Big Data: Preprocessing genomic data with MapReduce, June 2011. Bachelor's thesis.

Abstract:

Next-generation sequencing projects are generating vast amounts of genomic data. It is impractical to analyse these several-terabyte datasets without leveraging cloud computing. Interactive applications such as visualization, in which latency needs to be minimized, are particularly affected by the dataset size. A cloud-hosted backend, though it provides the necessary computational power, brings latency issues of its own.

This thesis explains how the interactive zooming feature of the Chipster data analysis and visualization platform can be made performant on large datasets by preprocessing genome data in the cloud. The implementation of a summarizing tool and its supporting library, Hadoop-BAM, is described. The programming model used, MapReduce, is explained, as well as some details concerning the Hadoop framework on which the tools are built. In particular, a heuristic approach to splitting the genomic data files for distributed processing is presented and compared to an indexing-based strategy.

Finally, experimental timings are shown: notably, a 50 gigabyte dataset can be summarized in well under an hour using only eight worker nodes. In addition, the heuristic splitting method is found to perform comparably to indexing without incurring the additional cost of computing the index.

Keywords:

BAM, Chipster, cloud, Hadoop, interactive

Suggested BibTeX entry:

@misc{NiemenmaaBachelor2011,
    author = {Matti Niemenmaa},
    language = {eng},
    month = jun,
    note = {Bachelor's thesis.},
    title = {Interactivity for {B}ig {D}ata: Preprocessing genomic data with {M}ap{R}educe},
    year = {2011},
}

PDF (271 kB)