Tuesday, 6 June 2017

Thoughts and questions after 10 years working in computational biology

This is the tenth year that I have worked in the field of computational biology, and it is probably the right time to ask myself: where should my research go next?

Looking Back

I started in 2007 by developing software modules and pipelines for the quantitative analysis of biological systems. Together with Stefan Wiemann, my Ph.D. supervisor, I developed the KEGGgraph software, which translates biological pathways in KEGG, previously used mostly visually, into graph models that can be analyzed formally. That led to my very first peer-reviewed publication in the field.

Following that, I spent three years studying how microRNAs regulate gene expression in human breast cancer and gastrointestinal tumors. There I had the opportunity to work with outstanding colleagues such as Florian Haller, Stefan Uhlmann, Heiko Mannsperger, Özgür Sahin, Agnes Hovrat, Katherina Zweig, and others, and to study microRNAs using the latest technologies such as reverse phase protein arrays and network analysis. That was the time when I became fascinated by systems biology.

In 2011 I joined Roche. I am fortunate to work with Clemens Broger and Martin Ebeling and to spend my time, besides regular project support activities, on large-scale analysis of gene expression data and on the development of novel platforms to support early drug discovery. In 2014, by mining the TG-GATEs database, we characterized an early-induced network of four genes that is predictive of toxicity in vitro and in vivo. Early this year, we published a manuscript describing the BioQC software, which detects tissue heterogeneity in gene expression data using knowledge derived from a compendium of gene expression profiles that we collected. A few weeks ago, together with Faye Drawnel, Martin Ebeling, and Marco Prunotto, we published the proof-of-concept study of molecular phenotyping and its application in early drug discovery. The results suggest that by integrating molecular phenotyping, i.e. digital quantification of pre-selected pathway reporter genes shortly after compound perturbation, we can gain insights both into pathways that are associated with disease-relevant phenotypes and into compounds that induce desired phenotypic changes.

Looking Forward

What comes next? I only have a few vague ideas and am open to new ones:
  1. How can we build software for data integration and interpretation in order to empower both disease understanding and drug discovery? In particular, how can we systematically and formally integrate genomic, transcriptomic, proteomic, and chemoinformatic data to inform the drug discovery process?
  2. How can we formally generate and test hypotheses about genetic and pharmacological perturbations in silico?
  3. How can we utilize single-cell and single-mutation level information for drug discovery?
I sense a tension between the ever-increasing amount of information that is available to us and the limited time we have to digest it and to connect the pieces. In addition, project support activities and research into these questions, which in ideal cases do not conflict with but rather benefit from each other, need constant balancing. As Yuri Lazebnik put it in his legendary essay "Can a biologist fix a radio?—Or, what I learned while studying apoptosis", it's time to make good tools and to keep your mind clear under adverse circumstances.

Just search and ask, until the next 10 years are gone.

Tuesday, 9 May 2017

Dose-response curve in R: the drc package

Thanks to the analysis of an interesting dataset, I discovered the drc package for dose-response analysis in R (Ritz et al., PLOS One 2015, https://doi.org/10.1371/journal.pone.0146021).

It comes with a very powerful optimiser for common models such as the logistic function (or Hill function). Compared with the native R approach using the nls function and self-starting models such as SSfpl, the drc package is far more reliable and robust: both initial parameter estimation and optimisation run without errors due to singularity. At least this was the case in the ~3,000 datasets that I tried: drc reported not a single error, whereas nls failed in as many as 600 cases, even though I set the starting parameters manually based on educated guesses.

I still have to understand how the package achieves such good performance. However, I am already very glad that we finally have a robust and reliable optimiser for curve fitting, which is a common task in computational biology and bioinformatics.
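For readers who have not met the model before, here is a minimal sketch of the four-parameter log-logistic function that such a fit estimates. It is written in plain Python for illustration only and is not how drc implements it. The argument names (slope, lower, upper, ed50) and the example values are my own assumptions, chosen to mirror the usual parametrisation with a slope, two asymptotes, and the dose giving the half-maximal response.

```python
import math


def four_parameter_logistic(dose, slope, lower, upper, ed50):
    """Response of a four-parameter log-logistic model at a positive dose.

    lower and upper are the two asymptotes, ed50 is the dose at which the
    response is halfway between them, and slope controls the steepness.
    """
    if dose <= 0 or ed50 <= 0:
        raise ValueError("dose and ed50 must be positive")
    exponent = slope * (math.log(dose) - math.log(ed50))
    return lower + (upper - lower) / (1.0 + math.exp(exponent))


# Hypothetical example values: response falling from 100 to 0, ED50 at dose 1
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
responses = [
    four_parameter_logistic(x, slope=1.5, lower=0.0, upper=100.0, ed50=1.0)
    for x in doses
]
for dose, response in zip(doses, responses):
    print(f"dose={dose:>6}: response={response:.2f}")
```

Fitting means searching for the four parameters that bring this curve closest to the measured responses, which is the step where the choice of optimiser and starting values matters.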

Figure: a 4-parameter logistic fit done and plotted with drc.

P.S. During the trial-and-error process, I also accidentally found a website for curve fitting: https://www.mycurvefit.com. Though I will not use it, since I need programmatic access to the fitting capacity, the website's fitting function is quite robust in my opinion: at least it handled the few examples on which nls had failed.

Monday, 8 May 2017

t-distributed Stochastic Neighbor Embedding (t-SNE)

Notes taken while following the video lecture on t-SNE given by Laurens van der Maaten as a Google Tech Talk, available on YouTube.

  • Compared with other dimension reduction methods, t-SNE has the advantage that it tries to preserve local structure when mapping high-dimensional spaces to 2D or 3D.
  • In the high-dimensional (hD) space, t-SNE uses Gaussian (normal) distributions centred on the data points to assign pairwise probabilities to data points.
  • In the low-dimensional (lD) space, t-SNE converts the Euclidean distances between data points into pairwise probabilities using a Student's t-distribution with one degree of freedom.
  • In order to make the points in lD reflect their relationship in hD, the Kullback–Leibler divergence between the two distributions is minimised.
  • The reason to use the heavy-tailed Student's t-distribution in lD is that it allows dissimilar points to be placed far apart, so that the points do not crowd together in the map (TODO: I have not yet completely understood why).
  • In real-life applications, the Barnes-Hut approximation and quadtree implementations make the algorithm efficient enough to handle reasonably large data sets with thousands of samples.
  • Relation to physics: t-SNE gradient can be viewed as a simulation of the N-body problem, with a spring term, an exertion/compression term, and a sum. (TODO: The N-body problem is new to me).
  • As an extension, multi-map t-SNE allows representation of correlated features in different contexts.
  • t-SNE can be used to cluster samples and to assist feature selection.
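
To make the notes above more concrete, here is a small sketch in plain Python of the quantity that t-SNE minimises. It is a simplification under stated assumptions and not the reference implementation: it uses one fixed Gaussian bandwidth (sigma) for all points, whereas the real algorithm tunes a bandwidth per point from the perplexity, and it only evaluates the cost of two hand-made embeddings instead of optimising one. All function names and data points are made up for illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_affinities(points, sigma=1.0):
    """Symmetric pairwise probabilities in hD (assumption: fixed sigma)."""
    n = len(points)
    conditional = []
    for i in range(n):
        weights = [
            0.0 if i == j
            else math.exp(-squared_distance(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        conditional.append([w / total for w in weights])
    # Symmetrise the conditional probabilities so that the matrix sums to 1
    return [
        [(conditional[i][j] + conditional[j][i]) / (2 * n) for j in range(n)]
        for i in range(n)
    ]


def student_t_affinities(embedding):
    """Pairwise probabilities in lD from a t-distribution with 1 d.o.f."""
    n = len(embedding)
    weights = [
        [
            0.0 if i == j
            else 1.0 / (1.0 + squared_distance(embedding[i], embedding[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(P || Q), skipping zero entries of P."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0
    )


# Two tight pairs of points in 3D (hypothetical data)
high_dim = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
# One embedding keeps the neighbours together, the other separates them
good_embedding = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
bad_embedding = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

p = gaussian_affinities(high_dim)
kl_good = kl_divergence(p, student_t_affinities(good_embedding))
kl_bad = kl_divergence(p, student_t_affinities(bad_embedding))
print(f"KL for the neighbour-preserving embedding: {kl_good:.3f}")
print(f"KL for the neighbour-breaking embedding:   {kl_bad:.3f}")
```

The embedding that keeps hD neighbours close in lD has the lower divergence, which is the sense in which minimising the Kullback–Leibler divergence preserves local structure.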

Saturday, 23 July 2016

Weekly summary

This week

  • I had an intensive phase of learning Python using the Python Tutorial, The Hitchhiker's Guide to Python, and countless other internet resources. I found the following projects especially interesting: nose for test-driven development, sphinx for generating documentation, and BioPython for bioinformatics tasks.
    • I thought about when to use R, Python, and C/C++ appropriately and most effectively. I think R is very good for prototyping tools that combine statistics and visualization. Python is an excellent generic scripting language that has a large code base. C/C++, being quite complex but efficient and powerful, remains my choice when it comes to optimizing performance.
  • I spent some thought on how to integrate several layers of omics data. The paper sent by my colleague Klas Hatje may be of interest to those who work in this field: Integrating Transcriptomic and Proteomic Data Using Predictive Regulatory Network Models of Host Response to Pathogens by Chasman et al.
  • My colleague Nikolaus Berntenis let me know about Paintomics, developed by another colleague Fernando Garcia-Alcade and his group. The web tool seems to be able to visualize multi-omics datasets using KEGG graphics.

Thursday, 14 July 2016

ANOVA (Doncaster and Davey): ANOVA model structures

Seven principal classes of ANOVA designs, with up to three treatment factors:

  1. One-factor: replicate measures at each level of a single explanatory factor
  2. Nested: one factor nested in one or more other factors
  3. Factorial: fully replicated measures on two or more crossed factors
  4. Randomised blocks: repeated measures on spatial or temporal groups of sampling units
  5. Split plot: treatments applied at multiple spatial or temporal scales
  6. Repeated measures: subjects repeatedly measured or tested in temporal or spatial sequence
  7. Unreplicated factorial: a single measure per combination of two or more factors
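
As a small illustration of the simplest class in the list above, the one-factor design, here is a sketch in plain Python that computes the F statistic from replicate measures at three levels of one factor. The data are made up, and the function name is my own. It only covers the balanced or unbalanced one-factor case and does not compute a p-value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-factor ANOVA.

    groups is a list of lists, one list of replicate measures per level.
    """
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation explained by the factor: level means around the grand mean
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Residual variation: replicates around their own level mean
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Hypothetical data: three levels of one factor, three replicates each
measurements = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_statistic, df_between, df_within = one_way_anova_f(measurements)
print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

The other six designs differ in how the sums of squares are partitioned and which mean square serves as the error term, so this sketch should not be reused for them as it stands.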

Monday, 7 April 2014

Curious "cyclic namespace dependency" error

Dear R users, here I report a curious case of the "cyclic namespace dependency" error and its solution, in case you run into the same problem, which confused me a lot.

In my case, I accidentally created an S4 method and an ordinary function with the same name. The package could be installed. However, when I tried to load the package, R printed the following error message:

Loading required package: cycle

Error in loadNamespace(package, c(which.lib.loc, lib.loc)) :
  cyclic namespace dependency detected when loading 'cycle', already loading 'cycle'

It looks like this package has a loading problem: see the messages for details.

For a minimal piece of code see https://gist.github.com/anonymous/10018579. When the code is put in an R file in the R package, and "export(myMethod)" is specified in NAMESPACE, checking or loading the package will fail due to the "cyclic namespace dependency".

The error message is unfortunately not particularly helpful. Normally it would point to a problem of reciprocal dependency, but in this case it is caused only by a name that is given to both an S4 method and an ordinary function.

It is straightforward to solve the issue: check carefully whether you have duplicated function/method names and fix them if there are any.