Two bar plots

What is the difference between the two bar plots below?

I am sitting at a conference, and this type of plot is relatively frequent in the presentations, complete with a log scale.

The answer is, of course, that there is no difference between these two: the data are exactly the same, and the only thing that differs is the vertical scale. These two plots explain why you should never, ever use a bar plot to represent log-scaled data: the position of the bottom of the y axis is completely arbitrary, yet it greatly influences our perception of which plot shows a larger difference.
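To make the point concrete, here is a small sketch in Python. The two data values and the three axis limits are made up for illustration. It computes how long each bar is drawn on a log10 axis, measured in decades above the bottom of the axis, and shows that the apparent ratio between the bars depends entirely on where the axis starts.

```python
from math import log10


def drawn_bar_lengths(values, axis_bottom):
    """Drawn length of each bar on a log10 axis, in decades above the axis bottom."""
    return [log10(v) - log10(axis_bottom) for v in values]


# Hypothetical data: the second value is always 10 times the first.
values = [100.0, 1000.0]

# The same data drawn with three different (arbitrary) lower axis limits.
for bottom in (10.0, 1.0, 0.001):
    short, tall = drawn_bar_lengths(values, bottom)
    print(f"axis bottom {bottom:g}: tall bar looks {tall / short:.2f}x the short one")
```

The tall bar appears 2.00, 1.50 and 1.20 times as long as the short one, although the underlying tenfold difference never changes.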

(See also “Kick the bar chart habit”)

Sample size / power calculations for Kaplan-Meier survival curves

The problem is simple: we have two groups of animals, treated and controls. Around 20% of the untreated animals will die during the course of the experiment, and we would like to be able to detect an effect such that, instead of 20%, 80% of the animals will die in the treated group, with power 0.8 and alpha=0.05. Group sizes are equal and no other parameters are given.

What is the necessary group size?

I used the ssizeCT.default function from the powerSurvEpi R package. Based on the explanation in the package manual, this calculates (in my simple case) the required sample size in a group as follows:

$n = \frac{m}{p_E + p_C}$

where $p_E$ and $p_C$ are, respectively, the probabilities of failure in the E(xperimental) and C(ontrol) groups. I assume that in my case I should use 0.8 and 0.2, respectively, so $p_E + p_C = 1$ and $n=m$. The formulas here are simplified in comparison with the manual page of ssizeCT.default, simply because the group sizes are identical.

$m$ is calculated as

$m=\big(\frac{RR+1}{RR-1}\big)^2(z_{1-\alpha/2}+z_{1-\beta})^2$

RR is the minimal effect size that we would like to be able to observe with power 0.8 and at alpha 0.05. That means that if the real effect size is RR or greater, we have an 80% chance of getting a p-value smaller than 0.05 if the group sizes are equal to $m$. To calculate RR, I first calculate $\theta$, the log hazard ratio, and for this I use the same approximate, expected mortality rates (20% and 80%, i.e. survival of 0.8 and 0.2):

$\theta = \log(\frac{\log(0.8)}{\log(0.2)}) = -1.98$

Then $RR=\exp(\theta)=0.139$, and thus $m=18.3$. This seems reasonable (based on previous experience).
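A minimal sketch of the calculation above in Python, using only the standard library. It is not a call to ssizeCT.default; it simply evaluates the simplified equal-group formulas as they are written in this post. The function names are made up for illustration.

```python
from math import exp, log
from statistics import NormalDist


def log_hazard_ratio(surv_a, surv_b):
    """theta = log(log(S_a) / log(S_b)), from the two expected survival fractions."""
    return log(log(surv_a) / log(surv_b))


def required_m(rr, alpha=0.05, power=0.8):
    """m = ((RR + 1) / (RR - 1))^2 * (z_{1-alpha/2} + z_{1-beta})^2, equal groups."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return ((rr + 1) / (rr - 1)) ** 2 * (z(1 - alpha / 2) + z(power)) ** 2


def group_size(m, p_e, p_c):
    """n = m / (p_E + p_C): required size of each group."""
    return m / (p_e + p_c)


theta = log_hazard_ratio(0.8, 0.2)  # survival 0.8 (controls) and 0.2 (treated)
rr = exp(theta)
m_80 = required_m(rr, power=0.8)
m_90 = required_m(rr, power=0.9)

print(f"theta = {theta:.3f}, RR = {rr:.3f}")
print(f"power 0.8: m = {m_80:.1f}, n per group = {group_size(m_80, 0.8, 0.2):.1f}")
print(f"power 0.9: m = {m_90:.1f}, n per group = {group_size(m_90, 0.8, 0.2):.1f}")
```

With these inputs the formula evaluates to $m \approx 13.7$ at power 0.8 and $m \approx 18.4$ at power 0.9. The value of 18.3 quoted above is therefore what one gets at power 0.9 (with $\theta$ rounded to -1.98) rather than at 0.8, so it is worth double-checking which power was actually passed to the function.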

Principal component analysis

An easy-to-read, simple paper highlighting the different aspects of PCA and the related diagnostic tools. A good reference for the future.

H. Abdi and L. J. Williams, Principal Component Analysis, Wiley Interdisciplinary Reviews: Computational Statistics, 2, 2010.

A Tutorial on a Practical Bayesian Alternative to Null-Hypothesis Significance Testing

A short, easy-going paper on the Bayesian vs. the Pearson/Neyman framework. Easy to read and easy to follow.

M. E. J. Masson. A Tutorial on a Practical Bayesian Alternative to Null-Hypothesis Significance Testing. Behavior Research Methods, 2010.

Know when your numbers are significant

It is always nice to see a paper in a major journal that deals with statistics. Here is a popular commentary on statistical testing and significance in Nature. It includes a few simple rules that any biologist, with or without statistical training, should be aware of.

David L. Vaux, “Know when your numbers are significant”, Nature 2012