Scientific Editor

I have a dream: a scientific editor that would be suitable for editing scientific papers. Currently, Word is king. There is no way around it: everybody has it, everybody uses it, and sooner or later you will have a co-author who only knows how to edit text in an e-mail or in Word. Chances are, that collaborator will be one of the big fish on your author list, maybe even your boss. Word has some basic collaborative options (like tracking changes and comments), bibliography support (via EndNote or similar tools), and is accepted by most journals.

However, Word sucks. Big time. I know you know it.

I envision an open source solution that is based on an updated markdown syntax and the pandoc system. Here is, point by point, an informal specification of the system.

  1. Markdown as the primary text format. A user should be able to edit markdown directly without compromising any information contained in the document or write the document in rmarkdown and directly pass it down to the system.
  2. Zip file as the primary format: documents would be exchanged using a zip file containing
    • manuscript file in markdown
    • bibliography file (I would recommend BibTeX)
    • figure files
  3. Bibliography as in pandoc: a bibliography in a format that pandoc accepts, with CSL for reference formatting.
  4. Figures: figures are a major pain in the neck. Publishers usually require a vector graphic format or a high-resolution image, but you want low-res previews in your print files or documents. The manuscript zip file should therefore contain either both, or only previews, with the original files to be added later.
  5. Markdown extensions:
    • extensions that would allow a “rich” export to Word’s docx: marking reviews, comments etc.
    • (better) figure and label captioning and cross-referencing
    • special bibliography sections (currently, you can only place the references at the end of the file)
  6. A visual UI with editor:
    • Java or similar that allows a painless installation process for even the least computer-savvy users, and allows them to edit the manuscript in a way that they are used to
    • GUI operations for an easy update of the bibliography (I mean like really easy, just copy and paste whatever: PubMed IDs, Google Scholar links etc.)
    • Equation editor, table editor etc. suitable for saving in markdown format
    • Version tracking
  7. Version tracking and managing revisions. Still pondering how to do this best, but this should be one of the major points for the system.
  8. Misc operations. The system should be able to quickly and painlessly accomplish the following tasks:
    • split the manuscript into submission files by using logical definitions in the markdown (e.g. in the main manuscript file, separate figure files, separate supplementary data files)
    • provide detailed statistics on the document (word count)
    • possibly the visual UI could provide a plug-in to facilitate submission specifically in some of the most common manuscript submission systems (e.g. manuscript central).
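
The zip bundle from point 2 could be put together with very little code. The following is only a sketch of the idea: the helper name build_bundle and the assumed layout (markdown and BibTeX files plus a figures/ folder) are hypothetical and not part of any existing tool.

```python
import zipfile
from pathlib import Path


def build_bundle(source_dir, bundle_path):
    """Zip the manuscript, bibliography and figures found in source_dir."""
    source_dir = Path(source_dir)
    members = []
    for path in sorted(source_dir.rglob("*")):
        relative = path.relative_to(source_dir)
        # Assumed layout: *.md and *.bib anywhere, plus everything in figures/.
        if path.is_file() and (
            relative.suffix in {".md", ".bib"} or relative.parts[0] == "figures"
        ):
            members.append(relative.as_posix())
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for name in members:
            # Store each file under its path relative to the manuscript folder.
            bundle.write(source_dir / name, name)
    return members
```

Calling build_bundle("paper", "paper.zip") would return the list of packed file names, which a GUI could show to the user before the bundle is sent to a co-author.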

Presentations in (R)markdown

There are many ways to turn a markdown or Rmarkdown document into a presentation. Way too many, and none of them is perfect. I made my first presentation with knitr / Rmarkdown for the tmod package.

After trying various options in knitr, I decided on an approach in which the Rmarkdown document is oblivious of the presentation system and the job of turning it into a presentation is taken up by pandoc. There were several bumps and problems, so I will now give a step-by-step guide.

1. Input file

Let’s start with an example Rmd. In the following, I assume it has been saved under “test.Rmd”.

---
title: "Example presentation"
author: January Weiner 
date: "`r Sys.Date()`"
---

# First part
## Slide 1
Code:

```{r plot1}
plot(1:10, 1:10)
```

## Slide 2
Some maths: $\sum_{i=1}^{N}$

# Second part
## Slide 3
... contents ...

2. From Rmarkdown to markdown

I use knitr only to create a markdown file.

Rscript -e 'knitr::knit("test.Rmd")'

This produces the file test.md. With that, knitr’s job is finished, we will not need it anymore.

3. Download reveal.js

I decided on reveal.js. It was easy to work with and adapt to my needs, it has elegant default themes, a small footprint and keyboard shortcuts. And it has the “2D” layout, meaning that sections (level one headers) are arranged horizontally, while slides within one section are arranged vertically. Pressing “Esc” in a presentation shows the slide overview:

[Figure: reveal.js slide overview]

Anyway, download reveal.js and unpack it in the same directory as test.md.

Making the presentation

Use pandoc to create the reveal.js presentation. Note that this is not the final command line; in the following points I will discuss the problems which will influence the final version.

pandoc -s -S -t revealjs --mathjax -o test.html test.md

4. MathJax

On slide 2, we have a bit of maths. The maths is written in a LaTeX-like notation, and there are many ways to turn it into an elegant mathematical equation on the final presentation. I have tried many options with pandoc, and found that only MathJax works properly and without a major hassle. This is why on the previous command line I used the option --mathjax.

However, if you run the above command line, you will notice that on “Slide 2” the maths doesn’t work, despite the --mathjax option. It would work, though, if we put the file on a server. The reason is that pandoc writes the URL to MathJax in the scheme-relative form src="//cdn.mathjax…", which inherits the scheme from however we opened the file. If we opened it from a server, using http or https, this would work. If we open the file directly in a browser, the URL becomes “file://cdn.mathjax…”, which is obviously not on our file system. We have two options.

4.1 External MathJax

Use the command line

pandoc -s -S -t revealjs --mathjax="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" -o test.html test.md

This works as long as we have Internet access. If we show our presentation at another institute, where our laptop cannot connect to the Internet, we are screwed.

4.2 Local MathJax

Alternatively, you can download the whole MathJax:

wget https://github.com/mathjax/MathJax/archive/v2.5-latest.zip
unzip v2.5-latest.zip
mv MathJax-2.5-latest/ MathJax

and specify the local installation with the following command line:

pandoc -s -S -t revealjs --mathjax="MathJax/MathJax.js?config=TeX-AMS-MML_HTMLorMML" -o test.html test.md

This works, but our presentation now takes up over 170 megabytes. Which sucks.
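
To see why the default MathJax URL breaks for a file opened from disk, we can resolve a scheme-relative reference against both kinds of page address. This is just an illustration using Python's standard urllib; the two page URLs are made up, and the MathJax path is the one from the command line in 4.1 without its query string.

```python
from urllib.parse import urljoin

# Scheme-relative reference of the kind pandoc writes by default.
MATHJAX_REF = "//cdn.mathjax.org/mathjax/latest/MathJax.js"


def resolve(page_url, ref=MATHJAX_REF):
    """Resolve a scheme-relative reference against the page that contains it."""
    return urljoin(page_url, ref)


# Hypothetical locations of test.html: on a web server and on the local disk.
print(resolve("http://example.org/talks/test.html"))
# http://cdn.mathjax.org/mathjax/latest/MathJax.js
print(resolve("file:///home/me/talks/test.html"))
# file://cdn.mathjax.org/mathjax/latest/MathJax.js
```

In the second case the reference inherits the file scheme, so the browser looks for MathJax on the local machine instead of the CDN.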

5. 2D layout and section headers

I mentioned previously that reveal.js allows a neat 2D layout, in which slides from one section are arranged vertically, and sections are put next to each other. However, sections with only a title and no contents might be a bit boring, so let us modify the .md file changing the second section as follows:

# Second part

This is the second part, even more interesting.

## Slide 3
... contents ...

You run pandoc again, and…

[Figure: screenshot of the resulting presentation]

Huh, where is the 2D layout gone? Why are all slides next to each other? Why are all slides from one section all on one single slide?

Pandoc automatically guesses which level header denotes boundaries between slides. It defines “slide level” as “the highest level followed immediately by non-header contents”. After our modification, the top level header (starting with a single #) became the level at which slides are separated. OK, so maybe we try specifying the slide level manually?

pandoc -s -S -t revealjs --slide-level 2 --mathjax="MathJax/MathJax.js?config=TeX-AMS-MML_HTMLorMML" -o test.html test.md

OK, this works, but… the content under the first level header (“This is the second part…”) is gone! This is because, as the pandoc manual puts it, “Headers above the slide level in the hierarchy create ‘title slides,’ which just contain the section title and help to break the slide show into sections.”

Turns out that there is no way we can have both: 2D with slides divided neatly into sections, and section slides which contain more than just a title. Not if we use pandoc, that is.
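
The slide-level rule quoted above (“the highest level followed immediately by non-header contents”) can be mimicked in a few lines. Mind that this is my own simplified approximation and not pandoc's code: it only knows #-style headers, ignores code blocks and the YAML header, and the fallback value of 1 is an assumption.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+\S")


def guess_slide_level(markdown):
    """Return the highest header level (smallest number) that is followed
    immediately by non-header content; fall back to 1 (an assumption)."""
    lines = [ln for ln in markdown.splitlines() if ln.strip()]  # drop blank lines
    candidates = []
    for current, following in zip(lines, lines[1:]):
        match = HEADER.match(current)
        if match and not HEADER.match(following):
            candidates.append(len(match.group(1)))
    return min(candidates, default=1)


# Header skeleton of test.md, before and after adding text under "Second part".
ORIGINAL = """\
# First part
## Slide 1
Code:
# Second part
## Slide 3
... contents ...
"""

MODIFIED = ORIGINAL.replace(
    "# Second part\n",
    "# Second part\n\nThis is the second part, even more interesting.\n\n",
)

print(guess_slide_level(ORIGINAL))  # 2
print(guess_slide_level(MODIFIED))  # 1
```

A single paragraph directly under a level one header is enough to move the guessed slide level from 2 to 1, which is exactly what flattened the layout.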

6. Modifying the layout

6.1 reveal.js theme

This is the easiest part: pick one of the existing reveal.js themes (I omit the mathjax option for simplicity’s sake, do remember to put it back in):

pandoc -s -S -t revealjs -o test.html test.md -V theme=blood

Note that the themes listed on the reveal.js website start with a capital letter, but you must specify a lowercase letter in the above command line.

6.2 Fine tuning the theme

I did not like the sans-serif, capitalized and decorated fonts of the blood theme (shadows on titles, I beg you). Ugly. However, if you know a little CSS (and you’d better learn it!), you can easily adapt it to your needs.

Look up the file reveal.js/css/theme/blood.css for hints and create your own CSS file (let us call it test.css) in the same directory as test.md. In the file below, I reset all the ugly decorations and set two fonts using the Google Fonts service: Garamond for headers and Quattrocento Sans for the body:

@import url('http://fonts.googleapis.com/css?family=EB+Garamond');
@import url('http://fonts.googleapis.com/css?family=Quattrocento+Sans');

.reveal {
  font-size: 32px;
  font-family: 'Quattrocento Sans', sans-serif; }

.reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6 {
  font-family: 'EB Garamond', serif;
  font-weight: normal;
  text-transform: none;
  text-shadow: none; }

.reveal h1 { font-size: 2em; }
.reveal h2 { font-size: 1.7em; }
.reveal h3 { font-size: 1.4em; }
.reveal h4 { font-size: 1em; }

Also, as you might notice, I prefer smaller fonts here. We integrate our test.css file with the following option

pandoc -s -S -t revealjs -o test.html test.md -V theme=blood --css test.css

6.3 Adding a logo

You can add a logo (or whatever other background for your slides) by modifying the CSS file test.css. If logo.png is the name of your logo, adding this to your CSS will put it on all your slides in the top left corner:

body {
  background-image: url(logo.png);
  background-repeat: no-repeat;
  background-position:20px 20px;
}

6.4 Better syntax highlighting

Pandoc’s syntax highlighting doesn’t look good on a dark background. You can add the following to the “test.css” file to reproduce the Solarized theme.

.reveal pre code { color: #839496; 
  background-color: #2B2B2B; } /* use #FDF6E3 for light background */

.sourceCode .kw { color: #268BD2; }
.sourceCode .dt { color: #268BD2; }
.sourceCode .dv, .sourceCode .bn, .sourceCode .fl { color: #D33682; }
.sourceCode .ch { color: #DC322F; }
.sourceCode .st { color: #2AA198; }
.sourceCode .co { color: #93A1A1; }
.sourceCode .ot { color: #A57800; }
.sourceCode .al { color: #CB4B16; font-weight: bold; }
.sourceCode .fu { color: #268BD2; }
.sourceCode .re { }
.sourceCode .er { color: #D30102; font-weight: bold; }

7. Creating a PDF of your presentation

Of course you need a PDF for printing and as a backup.

There are two ways to produce a PDF from reveal.js. Each one is imperfect.

7.1 Creating PDF using pandoc

Since the test.md file is generic markup, we can turn it into a simple PDF:

pandoc -s -S -o test.pdf test.md

Or even a beamer presentation:

pandoc -s -S -t beamer -o test.pdf test.md

Unfortunately, this is not as nice as our presentation, and it completely ignores whatever we have put in the CSS.

7.2 Using the reveal.js printing facility and Google Chrome

The second way is interactive only (you cannot create the PDF with a command line). Open the file in Google Chrome and add ?print-pdf to the file URL, such that the end of the URL reads test.html?print-pdf.

The output looks garbled: the slides overlap. Don’t worry, it’s OK. Open the print dialog (press Ctrl-P), and you will see that now the output is correct. You can save it as PDF or send it to a printer.

8. The final command line

pandoc -s -S -t revealjs --mathjax="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"  -V theme=blood --css test.css -o test.html test.md
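
If you rebuild the presentation often, the options collected in this post can be assembled by a small helper. This is a sketch only: the function name pandoc_command and its parameters are made up for illustration, and it uses no pandoc options beyond the ones shown above.

```python
import shlex


def pandoc_command(source="test.md", output="test.html", theme="blood",
                   css="test.css", mathjax_url=None, slide_level=None):
    """Assemble the pandoc argument list used in this post."""
    # -S is the "smart" flag of the pandoc version used here;
    # pandoc 2.0 and later removed it.
    args = ["pandoc", "-s", "-S", "-t", "revealjs"]
    # Without a URL, pandoc falls back to its default MathJax location.
    args.append(f"--mathjax={mathjax_url}" if mathjax_url else "--mathjax")
    if slide_level is not None:
        args.append(f"--slide-level={slide_level}")
    args += ["-V", f"theme={theme}", "--css", css, "-o", output, source]
    return args


cmd = pandoc_command(
    mathjax_url="http://cdn.mathjax.org/mathjax/latest/MathJax.js"
                "?config=TeX-AMS-MML_HTMLorMML",
)
# Print a copy-and-paste ready command line.
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) to actually run pandoc, or called with a local MathJax path when there is no Internet access.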

pandoc, markdown and pander

Pandoc + markdown seem to be a great way of documenting my work.

Markdown syntax is very simple and allows adding basic formatting and figures to an otherwise simple text document, without obfuscating the actual text.

Then I simply compile the document using the pandoc command:

pandoc -o document.docx document.md
pandoc -o document.pdf  document.md

There are some more tricks, of course, and plenty of output formats are possible. One thing I was struggling with was that images in docx files were much too large. It turns out that the PNG graphics I generate from PDFs (which, in turn, come from R) lacked the information about density units. I was using the convert program from ImageMagick, and it turns out it is necessary to add the option -units PixelsPerInch:

convert -density 300 -units PixelsPerInch image.pdf image.png
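
The effect of the missing units is easy to see with a bit of arithmetic: the physical size of a bitmap is its pixel count divided by the resolution. In the sketch below, the 6 inch figure width and the 72 dpi fallback are assumptions chosen for illustration; I have not checked which default a given program really uses.

```python
def printed_width_inches(pixels, pixels_per_inch):
    """Physical width of a raster image at a given resolution."""
    return pixels / pixels_per_inch


# A hypothetical 6 inch wide PDF figure rasterised with -density 300:
width_px = 6 * 300  # 1800 pixels

print(printed_width_inches(width_px, 300))  # 6.0, resolution read correctly
print(printed_width_inches(width_px, 72))   # 25.0, assumed 72 dpi fallback
```

With the units stored in the file, the figure keeps its intended width; without them, the same pixels are spread over a much larger area.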

Another thing that I found useful was the pander package. Of course, there is this whole science of generating dynamic documents and reports from R using Sweave or knitr, but at the moment I prefer to produce two files: a commented R pipeline and, separately, a report in markdown format.

(The reason for not using knitr is that I work with some very large data sets that sometimes take ages to compute, so I would have to work out the details of caching and handling code that takes a while to execute. Also, I want to have a document with all commands for me, and a report without any R code for everyone else.)

Pander makes it possible to create nice tables in R that can be directly copied and pasted into a markdown document (of course, pander is so much more, but this is my main use at the moment):

pander(foo, 
       emphasize.strong.cols=1, justify="left", 
       style="simple", digits=2, split.tables=Inf)

I was astonished how nice the resulting Word file is. The PDFs, which I think are produced by TeX/LaTeX, are actually more trouble: LaTeX disregards my order of figures and tables because they are all floating objects, and there is no easy way to change this from within the document.