Scientific Editor

I have a dream: a scientific editor that would be suitable for editing scientific papers. Currently, Word is king. There is no way around it: everybody has it, everybody uses it, and sooner or later you will have a co-author who can only edit text in an e-mail client or in Word. Chances are, that collaborator will be one of the big fish on your author list, maybe even your boss. Word has some basic collaborative features (like tracked changes and comments), supports bibliographies (via EndNote or similar tools), and is accepted by most journals.

However, Word sucks. Big time. I know you know it.

I envision an open source solution that is based on an updated markdown syntax and the pandoc system. Here is, point by point, an informal specification of the system.

  1. Markdown as the primary text format. A user should be able to edit the markdown directly without losing any information contained in the document, or write the document in R Markdown and pass it directly to the system.
  2. Zip file as the primary format: documents would be exchanged using a zip file containing
    • manuscript file in markdown
    • bibliography file (I would recommend BibTeX)
    • figure files
  3. Bibliography as in pandoc: a bibliography in a format that pandoc accepts, with CSL (Citation Style Language) for reference formatting.
  4. Figures: figures are a major pain in the neck. Publishers usually require a vector graphics format or a high-resolution image, but you want low-res previews in your print files or documents. The manuscript zip file should therefore contain either both, or only the previews, with the original files to be added later.
  5. Markdown extensions:
    • extensions that would allow a “rich” export to Word’s docx: marking reviews, comments etc.
    • (better) figure and label captioning and cross-referencing
    • special bibliography sections (currently, you can only place the references at the end of the file)
  6. A visual UI with editor:
    • Java or similar that allows a painless installation process for even the least computer-savvy users, and allows them to edit the manuscript in a way that they are used to
    • GUI operations for an easy update of the bibliography (I mean really easy, just copy+paste of whatever: PubMed IDs, Google Scholar links etc.)
    • Equation editor, table editor etc. suitable for saving in markdown format
    • Version tracking
  7. Version tracking and managing revisions. Still pondering how to do this best, but this should be one of the major points for the system.
  8. Misc operations. The system should be able to quickly and painlessly accomplish the following tasks:
    • split the manuscript into submission files by using logical definitions in the markdown (e.g. into the main manuscript file, separate figure files, separate supplementary data files)
    • provide detailed statistics on the document (word count)
    • possibly, the visual UI could provide a plug-in to facilitate submission to some of the most common manuscript submission systems (e.g. Manuscript Central).
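
To make points 2 and 8 a bit more concrete, here is a minimal Python sketch of what bundling a manuscript and computing a crude word count could look like. Nothing in it is part of the specification: the function names and the archive layout (manuscript.md, references.bib, a figures/ folder) are assumptions made up for illustration.

```python
import zipfile
from pathlib import Path


def bundle_manuscript(manuscript, bibliography, figures, out_zip):
    """Pack one markdown file, one bibliography file and the figure files into a zip.

    The names used inside the archive are hypothetical; the spec above only
    says which kinds of files the zip should contain.
    """
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(manuscript, arcname="manuscript.md")
        zf.write(bibliography, arcname="references.bib")
        for fig in figures:
            # Keep only the file name, so all figures land in one folder.
            zf.write(fig, arcname=f"figures/{Path(fig).name}")
    return out_zip


def word_count(markdown_text):
    """Crude statistic: whitespace-separated tokens, markup included."""
    return len(markdown_text.split())
```

A real implementation would have to strip markup, citations and figure captions before counting, and would need some convention for telling previews apart from the original figure files.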

Sloppy Science

Last week, Science published a paper by Rodriguez and Laio on a density-based clustering algorithm. As a non-expert, I found the results actually quite good compared to the standard tools that I use in my everyday work. I even implemented the algorithm as an R package (soon to be published on CRAN, look out for “fsf”).

However, there are problems with the paper. More than one.

1. The authors claim that the density for each sample is determined with a simple formula, which is just the number of other samples within a certain cutoff distance. This does not add up, since the density would then always have to be a whole number. It is obvious from the figures that this is not the case. When you look up the original MATLAB code in the supplementary material, you see that the authors actually use a Gaussian kernel function for the density calculation.

2. If you use the simple density count as described in the paper, the algorithm will not and cannot work. Imagine a relatively simple case with two distinct clusters. Imagine that in one cluster, there is a sample A with density 25, and in the other cluster, there are two samples, B and C, with identical densities of 24. This is actually quite likely to happen. The algorithm now determines, for each sample, \delta, that is, the distance to the nearest sample with a higher density. The whole idea of the algorithm is that for putative cluster centers, this distance will be very high, because it will point to the center of another cluster.

However, with ties, we have the following problem. If we use the strict inequality, as described by the authors, then both B and C (which have identical density 24) will be assigned a large \delta value, since the nearest sample with a higher density is A in the other cluster, and both will become cluster center candidates. If we instead use a weak inequality, then B will point to C and C to B, and both of them will have a small \delta.

Therefore, we have either both B and C as equivalent cluster center candidates, or neither of them. No wonder the authors never used this approach!

3. The authors explicitly claim that their algorithm can “automatically find the correct number of clusters.” This does not seem to be true; at least, there is nothing in the original paper that warrants this statement. If you study their MATLAB code, you will find that the selection of cluster centers is done manually, by a user selecting a rectangle on the screen. Frankly, I cannot even comment on that. This is outrageous.
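
Points 1 and 2 can be made concrete with a small Python sketch. This is my own toy illustration, not the authors' MATLAB code: the function names are made up, the Gaussian kernel is assumed to have the form exp(-(d/d_c)^2), and the sample without any denser neighbour is assumed to get the largest distance to any other sample as its \delta. The coordinates of A, B and C are invented, and their densities (25, 24, 24) are plugged in by hand to mirror the example above.

```python
import math


def cutoff_density(points, d_c):
    """Density as described in the paper's text: the number of other
    samples closer than the cutoff d_c. Always a whole number."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) < d_c)
        for i, p in enumerate(points)
    ]


def gaussian_density(points, d_c):
    """Density with a Gaussian kernel (assumed form exp(-(d/d_c)^2)).
    Generally not a whole number."""
    return [
        sum(math.exp(-(math.dist(p, q) / d_c) ** 2)
            for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def delta(points, rho, strict=True):
    """Distance from each sample to the nearest sample of higher density.

    strict=True  uses rho[j] >  rho[i]
    strict=False uses rho[j] >= rho[i], so ties count as "higher"
    A sample with no such neighbour gets the largest distance to any sample.
    """
    result = []
    for i, p in enumerate(points):
        if strict:
            higher = [q for j, q in enumerate(points) if j != i and rho[j] > rho[i]]
        else:
            higher = [q for j, q in enumerate(points) if j != i and rho[j] >= rho[i]]
        if higher:
            result.append(min(math.dist(p, q) for q in higher))
        else:
            result.append(max(math.dist(p, q) for q in points))
    return result


# A sits in one cluster; B and C sit close together in another one.
points = [(0.0, 0.0), (10.0, 0.0), (10.5, 0.0)]
rho = [25, 24, 24]  # densities set by hand, as in the example above

print("strict:", delta(points, rho, strict=True))
print("weak:  ", delta(points, rho, strict=False))
```

With the strict inequality, both B and C get a \delta of about 10, the distance to A, so both look like cluster centers. With the weak inequality, both get 0.5, the distance to each other, so neither does. With the Gaussian kernel the densities are real numbers, and exact ties become very unlikely.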

I think that Science might have done a great disservice to the authors: everyone will hate them for getting a sloppy, half-baked paper published in Science, the kind of paper that would get anyone else rejected at PLoS ONE. I know I do 🙂