Sloppy Science

Last week, Science has published a paper by Rodriguez and Laio on a density-based clustering algorithm. As a non-expert, I found the results actually quite good compared to the standard tools that I am using in my everyday work. I even implemented the package as an R package (soon to be published on CRAN, look out for “fsf”).

However, there are problems with the paper. More than one.

1. The authors claim that the density for each sample is determined with a simple formula which is actually the number of other samples within a certain diameter. This does not add up, since then the density must be always a whole number. It is obvious from the figures that this is not the case. When you look up the original matlab code in the supplementary material, you see that the authors actually use a Gaussian kernel function for density calculation.

2. If you use the simple density count as described in the paper, the algorithm will not and cannot work. Imagine a relatively simple case with two distinct clusters. Imagine that in one cluster, there is a sample A with density 25, and in the other cluster, there are two samples, B and C, with identical densities 24. This is actually quite likely to happen. The algorithm now determines, for each sample, \delta, that is the distance to the next sample with higher density. The whole idea of the algorithm is that for putative cluster centres, this distance will be very high, because it will point to the center of another cluster.

However, with ties, we have the following problem. If we choose the approach described by the authors, then both of the samples with density B and C (which have identical density 24) will be assigned a large \delta value and will become cluster center candidates. If we choose to use a weak inequality, then B will point to C, and C to B, and both of them will have a small \delta.

Therefore, we either have both B and C as equivalent cluster candidates, or none of them. No wonder that the authors never used this approach!

3. The authors explicitly claim that their algorithm can “automatically find the correct number of clusters.” This does not seem to be true, at least there is nothing in the original paper that warrants this statement. If you study their matlab code, you will find that the selection of cluster centers is done manually by a user selecting a rectangle on the screen. Frankly, I cannot even comment on that, this is outrageous.

I think that Science might have done a great disservice to the authors — everyone will hate them for having a sloppy, half-baked paper that others would get rejected in PLoS ONE published in Science. I know I do 🙂

Check credit card numbers using the Luhn algorithm

You can find a nice infographic explaining how to understand the credit card number here.

Credit card numbers include, as the last digit, a simple checksum calculated using the Luhn algorithm. In essence, we add up the digits of the card number in a certain way, and if the resulting sum divided by 10 is not an integer (i.e., the reminder of dividing by 10 is zero), then the number is invalid (the reverse is not true; a number divisible by 10 can still be invalid).

This can be easily computed in R, although I suspect that there might be an easier way:

checkLuhn <- function( ccnum ) {

  # helper function
  sumdigits <- function( x ) 
    sapply( x, function( xx ) 
          sum( as.numeric( unlist( strsplit( as.character( xx ), "" ))))
  # split the digits
  ccnum <- as.character( ccnum )
  v <- as.numeric( unlist( strsplit( ccnum, "" ) ) )
  v <- rev( v )

  # indices of every second digit
  i2 <- seq( 2, length( v ), by= 2 )
  v[i2] <- v[i2] * 2
  v[ v > 9 ] <- sumdigits( v[ v > 9 ] )

  if( sum( v ) %% 10 == 0 ) return( TRUE )
  else return( FALSE )