
Computational Systems Biology and Medicine

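Our active research topics include

  • Translational medicine on the metabolic level

  • Retrieval and visualization of relevant experiments

  • Integration of heterogeneous biomedical data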

Introduction

High-throughput measurement techniques allow genome-wide studies of biological function. Gene expression, gene regulation, protein content, protein interaction, and metabolic profiles can be measured and combined with sequence information. The major challenge is to extract meaningful findings from large, noisy, high-dimensional, heterogeneous, and incomplete data sets. We develop new computational data analysis methods that exploit prior knowledge and previous experiments in biomedical research.

Translational medicine on the metabolic level

Translational medicine attempts to bring basic research findings to clinical practice. One necessary step in this process is to translate inferences made on the molecular level in model organisms, for example about metabolites, into inferences about humans. Metabolomics is the study of the set of all metabolites found in a tissue sample. Metabolite concentrations are strongly affected by diseases and drugs, and hence they usefully complement genomic, proteomic, and transcriptomic measurements in studies of the biological state of an organism. We have developed new computational methods for mapping observed metabolomics data between model organisms and humans.

  • Time series alignment. We devised computational methods, based on hidden Markov models (HMMs), for describing dynamic differences between time-series measurements of two populations.

  • Small sample-size multi-way analysis. A recurring problem in analyzing biomedical data is finding the effects of multiple covariates. Because classical multivariate analysis of variance (MANOVA) cannot be applied when the number of variables far exceeds the number of observations, we developed a Bayesian model for multi-way analysis of small sample-size, high-dimensional data sets.

  • Multi-way, multi-source analysis. An extended model deals with different data views (here: tissues) that have different variable spaces. It is able to find the multi-way covariate effects and to partition them into shared and source-specific effects.
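
The limitation of classical MANOVA mentioned above can be illustrated with a small numerical sketch. MANOVA test statistics rely on inverting the pooled within-group scatter matrix, whose rank cannot exceed the number of samples minus the number of groups. The data sizes, the random data, and the function name below are assumptions chosen for illustration only; this is not the Bayesian model described above.

```python
import numpy as np

def pooled_within_scatter(X, groups):
    """Sum of within-group scatter matrices, as used by classical MANOVA."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    n_variables = X.shape[1]
    W = np.zeros((n_variables, n_variables))
    for g in np.unique(groups):
        Xg = X[groups == g]
        centered = Xg - Xg.mean(axis=0)  # remove the group mean
        W += centered.T @ centered
    return W

# Hypothetical setting: 12 samples in 2 groups, 200 measured variables.
rng = np.random.default_rng(0)
n_samples, n_variables = 12, 200
X = rng.normal(size=(n_samples, n_variables))
groups = np.repeat([0, 1], n_samples // 2)

W = pooled_within_scatter(X, groups)
rank_W = np.linalg.matrix_rank(W)

# The scatter matrix is 200 x 200 but its rank is bounded by
# n_samples - n_groups, so it is singular and cannot be inverted.
print("matrix size:", W.shape, "rank:", rank_W)
```

Because the matrix is rank deficient, the classical test statistics are undefined in this regime, which is one motivation for models that regularize or reduce dimensionality.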

Retrieval and visualization of relevant experiments

Large repositories of genome-wide measurement data raise the research question of how to relate different data sets systematically. Reusing data sets increases the statistical power of new studies and makes it possible to place biological results in the context of previous studies. Most repositories provide keyword search for retrieving similarly annotated studies. To complement it, we developed machine learning methods that relate gene expression studies through their actual measurement data, along with visualization tools for exploring and interpreting the results. In the REx project (Retrieval of Relevant Experiments), relevance is defined by a model of biology that is both data- and knowledge-driven.
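
To make the idea of retrieval through measurement data concrete, the following minimal sketch ranks stored experiments by the similarity of their numeric profiles to a query experiment. It uses plain cosine similarity as a stand-in, not the probabilistic model used in REx. The experiment names, the profile values, and the interpretation of each vector entry (for instance, an activity score per gene set) are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero profile carries no directional information
    return dot / (norm_a * norm_b)

def rank_experiments(query, repository):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: cosine_similarity(query, profile)
              for name, profile in repository.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical repository: one profile per experiment.
repository = {
    "experiment_A": [2.1, 0.1, -1.5, 0.0],
    "experiment_B": [-1.9, 0.2, 1.4, 0.1],
    "experiment_C": [0.1, 1.8, 0.0, -2.0],
}
query = [1.8, 0.0, -1.2, 0.1]

ranking = rank_experiments(query, repository)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Unlike keyword search, this kind of ranking needs no annotations: two experiments are related whenever their measured profiles point in a similar direction.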

Representative Publications

  • José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma, and Samuel Kaski. Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25(12):i145-i153, 2009. See also: Software, Poster (best poster award at the 5th ISCB Student Council Symposium).

Integration of heterogeneous biomedical data

A living cell is an extremely complex system. Realizing the potential of modern high-throughput measurements, such as gene expression or micro-RNA data, therefore requires integrating them with information from other sources, including relational information about genes, environmental factors, and disease. Much of the data integration literature focuses either on well-targeted combinations of sources, such as using sequence-based regulators to explain gene expression, or on well-focused prediction tasks, such as predicting molecular interactions from several data sources. We have focused on knowledge discovery problems, where the goal is to find what is relevant in massive data sets by discovering connections between data sources.

  • Dependency modeling. We consider the data fusion problem of combining two or more data sources where each source consists of vector-valued measurements from the same objects or entities, but on different variables. The task is to detect aspects that are shared between different sources.

  • Matching of entities. Measurements are often made on different platforms or from different sources (tissues or species), so the mapping between the variables is not necessarily known. We have introduced methods to learn the matching in a data-driven way.

  • Bayesian biclustering. Biclustering is the computational task of simultaneously clustering objects and inferring which features of the objects contribute to the grouping. Recently, we have developed a hierarchical nonparametric biclustering method that is able to generate a flexible tree structure of biclusters.

  • Searching for functional modules. We have devised a generative model to detect functional gene modules and protein complexes from both protein-protein interaction and gene expression data. It is able to detect overlapping modules where proteins may have different roles.
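
The dependency modeling task above, detecting what is shared between two sources measured on the same objects, can be sketched with classical canonical correlation analysis (CCA). CCA is a standard baseline for this setting and is shown here only as an illustration; the text does not specify which models the group uses. The synthetic data, the loadings, and the small `ridge` term are assumptions.

```python
import numpy as np

def canonical_correlations(X, Y, ridge=1e-8):
    """Canonical correlations between two views of the same samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Within-view covariances (with a small ridge for numerical stability)
    # and the cross-covariance between views.
    Cxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten each view with its Cholesky factor: M = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    # Singular values of the whitened cross-covariance are the correlations.
    return np.linalg.svd(M, compute_uv=False)

# Synthetic example: one latent signal z is shared by both views.
rng = np.random.default_rng(0)
n_samples = 500
z = rng.normal(size=(n_samples, 1))
X = z @ np.array([[1.0, -0.5, 0.8]]) + 0.1 * rng.normal(size=(n_samples, 3))
Y = z @ np.array([[0.7, 1.2, -0.4, 0.9]]) + 0.1 * rng.normal(size=(n_samples, 4))

corrs = canonical_correlations(X, Y)
print(np.round(corrs, 3))
```

In this synthetic setting the first canonical correlation is close to one, reflecting the single shared signal, while the remaining ones stay small because the rest of each view is independent noise.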

Full publication list of the research group