echo http://$(hostname):8787/
Click on the generated link to open RStudio in your browser, and log in with your CyVerse credentials.
$ PS1='$ '
Create a new text file (File, New File, Text File) and write:

rule fastqc_a_file:
    shell:
        "fastqc dc_workshop/data/SRR2584863_1.fastq.gz"
(I suggest copy/pasting this into RStudio.)
Save it in a file called Snakefile. Then, run:

$ snakemake
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 fastqc_a_file
1
[Wed Apr 10 06:29:17 2019]
rule fastqc_a_file:
jobid: 0
Upon execution there will be two new files in dc_workshop/data/: SRR2584863_1_fastqc.html and SRR2584863_1_fastqc.zip.
Note: the snakemake configuration file is by default called Snakefile.
At the moment this is basically just a shell script with extra syntax… what’s the point?
Well, shell scripts - and this snakefile, too - will rerun the command every time you run the file, even if there's no reason to do so because the input files haven't changed.
Digression: This is particularly important for large or long workflows, where you’re dealing with 10s to 100s of files that may take hours to days to process! It can be hard to figure out which files to rerun, but (spoiler alert) snakemake can really help you do this!
It’s hard to track this kind of thing in a shell script - I usually just comment out the lines I don’t want run, or break my commands up into multiple shell scripts so they don’t take so long - but with snakemake, you can annotate the rule with input and output files!
rule fastqc_a_file:
    input:
        "dc_workshop/data/SRR2584863_1.fastq.gz"
    output:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_1_fastqc.zip"
    shell:
        "fastqc dc_workshop/data/SRR2584863_1.fastq.gz"
Here, we've annotated the rule with the required input file, as well as the expected output files.

Question: how do we know what the output files are? (In this case, we ran fastqc once by hand and looked at what files it produced.)
Now run snakemake again:

$ snakemake

What happened??
snakemake looked at the file, saw that the output files existed, and figured out that it didn’t need to do anything!
You can force snakemake to rerun the rule with -f:

$ snakemake -f
You can also remove an output file and rerun:

$ rm dc_workshop/data/SRR2584863_1_fastqc.html
$ snakemake

Note that you don't need to remove all the output files to rerun a command - just remove one of them.
Updating the timestamps on the input files also makes the outputs stale, so snakemake reruns the rule:

$ touch dc_workshop/data/*.fastq.gz
$ snakemake
This will become important later :)
Now let's add a second rule to run fastqc on the other read file, by editing the Snakefile to look like this:

rule fastqc_a_file:
    input:
        "dc_workshop/data/SRR2584863_1.fastq.gz"
    output:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_1_fastqc.zip"
    shell:
        "fastqc dc_workshop/data/SRR2584863_1.fastq.gz"

rule fastqc_a_file2:
    input:
        "dc_workshop/data/SRR2584863_2.fastq.gz"
    output:
        "dc_workshop/data/SRR2584863_2_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.zip"
    shell:
        "fastqc dc_workshop/data/SRR2584863_2.fastq.gz"
Now, if you run this, the Right Thing won't happen: snakemake will do nothing. Why? By default, snakemake runs only the first rule in the file, and that rule's outputs are already up to date. You can ask for a specific rule by name:

$ snakemake fastqc_a_file2
and now you should see the second fastqc command run, with the appropriate output files!
Note that snakemake only runs the second rule, because it looks at the output files and sees that the first file you wanted, SRR2584863_1_fastqc.html, already exists!
Points to note:

- By default, snakemake runs the first rule in the file; to run a different rule, give its name on the command line.
- snakemake only reruns a rule if its outputs are missing or out of date.
Let’s start refactoring (cleaning up) this Snakefile.
rule all:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html"

rule fastqc_a_file:
    input:
        "dc_workshop/data/SRR2584863_1.fastq.gz"
    output:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_1_fastqc.zip"
    shell:
        "fastqc dc_workshop/data/SRR2584863_1.fastq.gz"

rule fastqc_a_file2:
    input:
        "dc_workshop/data/SRR2584863_2.fastq.gz"
    output:
        "dc_workshop/data/SRR2584863_2_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.zip"
    shell:
        "fastqc dc_workshop/data/SRR2584863_2.fastq.gz"
This rule, by convention called all, is a default rule that produces all the output files. But it's a bit weird! It's all input, and no output!
This is a blank rule that gathers together all of the various files you want produced, and says “hey, snakemake, I depend on all of these files for my input - make them for me!” And then, once those files are all there, it …does nothing.
Yep, this is perfectly legal in snakemake, and it’s one way to make your life easier.
Note that snakemake -f no longer works properly, because -f only forces rerunning a single rule. To rerun everything, you can run touch dc_workshop/data/*.fastq.gz to make all the output files stale, or rm dc_workshop/data/*.html to remove some of the output files.

There's a lot of repetition in each of these rules. Let's collapse it down a little bit by replacing the filename in the fastqc command with a magic variable, {input}.
rule all:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html"

rule fastqc_a_file:
    input:
        "dc_workshop/data/SRR2584863_1.fastq.gz"
    output:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_1_fastqc.zip"
    shell:
        "fastqc {input}"

rule fastqc_a_file2:
    input:
        "dc_workshop/data/SRR2584863_2.fastq.gz"
    output:
        "dc_workshop/data/SRR2584863_2_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.zip"
    shell:
        "fastqc {input}"
This all works as before, but now the rule is a bit more generic and will work with any input file. Sort of.

What do I mean, sort of? Well, the output filenames ALSO depend on the input file names in some way - specifically, fastqc replaces the .fastq.gz suffix of the filename with _fastqc.html and _fastqc.zip to make its two output files.
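To make that naming scheme concrete, here's a tiny Python sketch (not part of the Snakefile; the helper name is made up for illustration):

```python
def fastqc_output_names(input_name):
    """Given a .fastq.gz filename, return the report names fastqc produces.

    fastqc strips the .fastq.gz suffix and appends _fastqc.html / _fastqc.zip.
    """
    stem = input_name[:-len(".fastq.gz")]
    return stem + "_fastqc.html", stem + "_fastqc.zip"

print(fastqc_output_names("dc_workshop/data/SRR2584863_1.fastq.gz"))
# -> ('dc_workshop/data/SRR2584863_1_fastqc.html',
#     'dc_workshop/data/SRR2584863_1_fastqc.zip')
```

These are exactly the two output files listed in the rule above.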
We can capture that pattern with snakemake wildcards - here using a wildcard named {filename} (the name itself is arbitrary). Change both rules like so:

rule all:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html"

rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"

rule fastqc_a_file2:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"
What we've done here is tell snakemake that any time we say we want a file that ends with _fastqc.html, it should look for a file that ends in .fastq.gz and then run fastqc on it.
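Under the hood, snakemake does something like the following (a much-simplified sketch, not snakemake's actual code, and assuming a wildcard named {filename}): it matches the requested file against the output pattern to infer the wildcard's value, then substitutes that value into the input pattern.

```python
import re

def infer_input(requested, output_pattern, input_pattern):
    """Match `requested` against `output_pattern` to recover the wildcard
    value, then fill that value into `input_pattern` (simplified sketch)."""
    # Turn "{filename}_fastqc.html" into a regex with a named capture group.
    regex = re.escape(output_pattern).replace(r"\{filename\}",
                                              r"(?P<filename>.+)")
    m = re.fullmatch(regex, requested)
    if m is None:
        return None  # this rule can't produce the requested file
    return input_pattern.format(**m.groupdict())

print(infer_input("dc_workshop/data/SRR2584863_1_fastqc.html",
                  "{filename}_fastqc.html",
                  "{filename}.fastq.gz"))
# -> dc_workshop/data/SRR2584863_1.fastq.gz
```

So asking for SRR2584863_1_fastqc.html tells snakemake to look for SRR2584863_1.fastq.gz and run the rule on it.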
$ snakemake
Oh no! We get an AmbiguousRuleException! What's going on?

Well, if you look at the rules above, we've given snakemake two different rules to produce the same file(s)! fastqc_a_file and fastqc_a_file2 are now identical rules, and snakemake doesn't like that.
The fix is to delete the second, now-redundant rule, leaving:

rule all:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html"

rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"
and THAT should now work just fine!
Now here's the fun bit – if you look in the data directory, you'll see that there are actually six .fastq.gz files in there. Let's modify the snakefile to run fastqc on all of them!
rule all:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html",
        "dc_workshop/data/SRR2584866_1_fastqc.html",
        "dc_workshop/data/SRR2584866_2_fastqc.html",
        "dc_workshop/data/SRR2589044_1_fastqc.html",
        "dc_workshop/data/SRR2589044_2_fastqc.html"

rule fastqc_a_file:
    input:
        "{bazinga}.fastq.gz"
    output:
        "{bazinga}_fastqc.html",
        "{bazinga}_fastqc.zip"
    shell:
        "fastqc {input}"

(Note that the wildcard name - here {bazinga} - is completely arbitrary; it just has to be used consistently within the rule.)
Note you can just run snakemake whenever you want. It won’t do anything unless something’s changed.
$ snakemake
Let’s add in a new rule - multiqc, to summarize our fastqc results.
$ multiqc dc_workshop/data

You can see that it creates two outputs, multiqc_report.html and the directory multiqc_data/, which contains a bunch of files. Let's create a snakemake rule for this; add:
rule all:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html",
        "dc_workshop/data/SRR2584866_1_fastqc.html",
        "dc_workshop/data/SRR2584866_2_fastqc.html",
        "dc_workshop/data/SRR2589044_1_fastqc.html",
        "dc_workshop/data/SRR2589044_2_fastqc.html"

rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"

rule run_multiqc:
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    shell:
        "multiqc dc_workshop/data/"
to the bottom of the file. (Note, you need to tell snakemake if an output is a directory.)
$ snakemake run_multiqc
This …doesn't really do what we want, for a few reasons:

1. multiqc_report.html already exists, so snakemake doesn't run the rule. How do we actually test the rule? Remove multiqc_report.html and re-run snakemake.
2. Plain snakemake still won't run the rule, because multiqc_report.html isn't a target of the default rule. To fix that, add multiqc_report.html to the inputs for the first rule, all.

With those changes, the Snakefile looks like this:

rule all:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html",
        "dc_workshop/data/SRR2584866_1_fastqc.html",
        "dc_workshop/data/SRR2584866_2_fastqc.html",
        "dc_workshop/data/SRR2589044_1_fastqc.html",
        "dc_workshop/data/SRR2589044_2_fastqc.html",
        "multiqc_report.html"

rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"

rule run_multiqc:
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    shell:
        "multiqc dc_workshop/data/"
Yay, that seems to work!
Points to note:

- Removing multiqc_report.html is enough to make snakemake rerun the multiqc rule.
- Because multiqc_report.html is now in the all rule, a plain snakemake builds everything.
- The run_multiqc rule doesn't declare any input files, so snakemake has no way to know when its output is out of date.
The third problem, that multiqc doesn’t have any input dependencies, is a bit harder to fix.
(Why do we want to fix this? Well, this is how snakemake tracks “out of date” files - if we don’t specify input dependencies, then we may update one of the fastqc results that multiqc uses, but snakemake won’t re-run multiqc on it, and our multiqc results will be out of date.)
Basically we need to tell snakemake all of the files that we want multiqc to depend on. On the face of it, that's easy – change the run_multiqc rule like so:

rule all:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html",
        "dc_workshop/data/SRR2584866_1_fastqc.html",
        "dc_workshop/data/SRR2584866_2_fastqc.html",
        "dc_workshop/data/SRR2589044_1_fastqc.html",
        "dc_workshop/data/SRR2589044_2_fastqc.html",
        "multiqc_report.html"

rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"

rule run_multiqc:
    input:
        "dc_workshop/data/SRR2584863_1_fastqc.html",
        "dc_workshop/data/SRR2584863_2_fastqc.html",
        "dc_workshop/data/SRR2584866_1_fastqc.html",
        "dc_workshop/data/SRR2584866_2_fastqc.html",
        "dc_workshop/data/SRR2589044_1_fastqc.html",
        "dc_workshop/data/SRR2589044_2_fastqc.html"
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    shell:
        "multiqc dc_workshop/data/"
This will work, but there are two reasons this is not great.
First, we’re repeating ourselves. Those input files are in the all
rule above, and also in this rule. If you are as forgetful as I am, this means that you run the risk of updating one rule and not the other, and getting out of sync.
Second, we’re explicitly listing out the files that we want produced. That’s not super great because it makes it annoying to add files.
We can fix the first issue by using “variables”. The second issue requires a bit more advanced programming, and we’ll leave it to the very end.
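As a teaser for that more advanced fix, the list of files doesn't have to be typed out by hand - it can be built in plain Python from the sample accessions (a sketch; the accessions below are the ones used in this lesson):

```python
# The three sequencing samples used in this lesson, each with reads _1 and _2.
samples = ["SRR2584863", "SRR2584866", "SRR2589044"]

# Build the expected fastqc report names with a list comprehension.
fastqc_output = [
    "dc_workshop/data/{}_{}_fastqc.html".format(sample, read)
    for sample in samples
    for read in (1, 2)
]
print(fastqc_output)
```

This produces the same six filenames we've been typing out by hand.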
To use variables, let’s make a Python list at the very top, containing all of our expected output files from fastqc:
fastqc_output = ["dc_workshop/data/SRR2584863_1_fastqc.html",
"dc_workshop/data/SRR2584863_2_fastqc.html",
"dc_workshop/data/SRR2584866_1_fastqc.html",
"dc_workshop/data/SRR2584866_2_fastqc.html",
"dc_workshop/data/SRR2589044_1_fastqc.html",
"dc_workshop/data/SRR2589044_2_fastqc.html"]
and modify the all and run_multiqc rules to contain this list. The final snakefile looks like this:
fastqc_output = ["dc_workshop/data/SRR2584863_1_fastqc.html",
                 "dc_workshop/data/SRR2584863_2_fastqc.html",
                 "dc_workshop/data/SRR2584866_1_fastqc.html",
                 "dc_workshop/data/SRR2584866_2_fastqc.html",
                 "dc_workshop/data/SRR2589044_1_fastqc.html",
                 "dc_workshop/data/SRR2589044_2_fastqc.html"]

rule all:
    input:
        fastqc_output,
        "multiqc_report.html"

rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"

rule run_multiqc:
    input:
        fastqc_output
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    shell:
        "multiqc dc_workshop/data/"
We've got one more redundancy in this file - fastqc_output is listed in the all rule, but you don't need it there! Why?

Well, multiqc_report.html is already in the all rule, and the multiqc rule depends on fastqc_output, so fastqc_output already needs to be created to satisfy the all rule… specifying it in the all rule is redundant! And you can remove it!
(It’s not a big deal and I usually leave it in. But I wanted to talk about dependencies!)
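The reasoning above is just a transitive-dependency walk, which can be sketched as a toy graph traversal (a hypothetical helper, not snakemake code; "fastqc_output" stands in for the six fastqc reports):

```python
# Each target maps to the inputs it depends on.
deps = {
    "all": ["multiqc_report.html"],
    "multiqc_report.html": ["fastqc_output"],
    "fastqc_output": [],
}

def everything_needed(target):
    """Collect all direct and transitive dependencies of a target."""
    needed = []
    for dep in deps.get(target, []):
        needed.append(dep)
        needed.extend(everything_needed(dep))
    return needed

print(everything_needed("all"))
# -> ['multiqc_report.html', 'fastqc_output']
```

Asking for "all" already pulls in fastqc_output through multiqc_report.html, which is why listing it twice is redundant.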
The Snakefile now looks like this:
fastqc_output = ["dc_workshop/data/SRR2584863_1_fastqc.html",
                 "dc_workshop/data/SRR2584863_2_fastqc.html",
                 "dc_workshop/data/SRR2584866_1_fastqc.html",
                 "dc_workshop/data/SRR2584866_2_fastqc.html",
                 "dc_workshop/data/SRR2589044_1_fastqc.html",
                 "dc_workshop/data/SRR2589044_2_fastqc.html"]

rule all:
    input:
        "multiqc_report.html"

rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"

rule run_multiqc:
    input:
        fastqc_output
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    shell:
        "multiqc dc_workshop/data/"
and we can rerun it from scratch by doing:
$ rm -fr multiqc_report.html multiqc_data/
$ snakemake
snakemake compares only the very initial input files and the specific output file(s) you are requesting to decide if it needs to rerun the workflow.
In practical terms, this means that if you just delete the dc_workshop/data/*.html files above but leave multiqc_report.html around, snakemake won't rerun anything. You have to delete both the intermediaries and the end output files (as we do in the previous section), or update the raw input files, in order to force rerunning.
(This is a feature, not a bug - it helps deal with data-intensive pipelines where the intermediate files are really big.)
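The staleness check itself is simple in spirit - here's a sketch of the timestamp logic (much simplified; real snakemake walks the whole rule graph, and the helper name is made up):

```python
def needs_rerun(output_mtime, input_mtimes):
    """Rerun if the output is missing (None) or older than any input."""
    if output_mtime is None:
        return True
    return any(t > output_mtime for t in input_mtimes)

# Output newer than all inputs: nothing to do.
print(needs_rerun(100.0, [10.0, 20.0]))   # -> False
# An input was touched after the output was made: rerun.
print(needs_rerun(100.0, [10.0, 150.0]))  # -> True
```

This is why `touch`-ing the raw .fastq.gz files forces a rerun while deleting only intermediate files may not.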
A clean rule

It's kind of annoying to have to delete things explicitly. Snakemake should take care of that for us. Let's add a new rule, clean, that forces rerunning of things –

rule clean:
    shell:
        "rm -fr {fastqc_output} multiqc_report.html multiqc_data/"
and now try re-running things:
$ snakemake -p clean
$ snakemake
Points to note:

- {fastqc_output} means "replace the thing in the curly braces with the Python values in fastqc_output."
- Use snakemake -p to get a printout of the commands that are run.

What's particularly nice about the clean rule (the name is a convention, not a requirement) is that you only need to keep track of the expected output files in one or two places - the all rule, and the clean rule.
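The way snakemake turns {fastqc_output} into a shell command is essentially Python string formatting, with list values joined by spaces - roughly like this sketch (a simplified two-file example):

```python
fastqc_output = ["dc_workshop/data/SRR2584863_1_fastqc.html",
                 "dc_workshop/data/SRR2584863_2_fastqc.html"]

# snakemake joins list values with spaces before substituting them
# into the shell command string.
cmd = "rm -fr {fastqc_output} multiqc_report.html multiqc_data/".format(
    fastqc_output=" ".join(fastqc_output))
print(cmd)
```

So the clean rule ends up running a single rm with every expected output file on its command line.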
So, we've put all this work into making this snakefile with its input rules and its output rules… and there are a lot of advantages to our current approach already! Let's list a few of them –

- snakemake only reruns the commands whose outputs are missing or out of date;
- the Snakefile itself is a record of exactly how your results were made;
- if a command fails, snakemake stops and tells you.

But there's even more practical value in this, too – because we've given snakemake a recipe, rather than a script, we can now run these commands in parallel, too!
Try:
$ snakemake clean
$ snakemake -j 4
This will run up to four things in parallel!
Let's add some trimming!

rule trim_reads:
    input:
        "{filename}_1.fastq.gz",
        "{filename}_2.fastq.gz"
    output:
        "{filename}_1.pe.qc.fastq.gz",
        "{filename}_1.se.qc.fastq.gz",
        "{filename}_2.pe.qc.fastq.gz",
        "{filename}_2.se.qc.fastq.gz"
    shell:
        "trimmomatic PE {input} {output} \
        SLIDINGWINDOW:4:20 MINLEN:25 \
        ILLUMINACLIP:NexteraPE-PE.fa:2:40:15"
Note that the order of {input} and {output} matters - they're passed in that order to trimmomatic! (See the output of snakemake -p.)

Now add the appropriate files into the fastqc output list, too – change it to:
fastqc_output = ["dc_workshop/data/SRR2584863_1_fastqc.html",
                 "dc_workshop/data/SRR2584863_2_fastqc.html",
                 "dc_workshop/data/SRR2584866_1_fastqc.html",
                 "dc_workshop/data/SRR2584866_2_fastqc.html",
                 "dc_workshop/data/SRR2589044_1_fastqc.html",
                 "dc_workshop/data/SRR2589044_2_fastqc.html",
                 "dc_workshop/data/SRR2584863_1.pe.qc_fastqc.html",
                 "dc_workshop/data/SRR2584863_2.pe.qc_fastqc.html",
                 "dc_workshop/data/SRR2584866_1.pe.qc_fastqc.html",
                 "dc_workshop/data/SRR2584866_2.pe.qc_fastqc.html",
                 "dc_workshop/data/SRR2589044_1.pe.qc_fastqc.html",
                 "dc_workshop/data/SRR2589044_2.pe.qc_fastqc.html"]

rule all:
    input:
        fastqc_output,
        "multiqc_report.html"

rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    shell:
        "fastqc {input}"

rule run_multiqc:
    input:
        fastqc_output
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    shell:
        "multiqc dc_workshop/data/"

rule trim_reads:
    input:
        "{filename}_1.fastq.gz",
        "{filename}_2.fastq.gz"
    output:
        "{filename}_1.pe.qc.fastq.gz",
        "{filename}_1.se.qc.fastq.gz",
        "{filename}_2.pe.qc.fastq.gz",
        "{filename}_2.se.qc.fastq.gz"
    shell:
        "trimmomatic PE {input} {output} \
        SLIDINGWINDOW:4:20 MINLEN:25 \
        ILLUMINACLIP:NexteraPE-PE.fa:2:40:15"

rule clean:
    shell:
        "rm -fr {fastqc_output} multiqc_report.html multiqc_data/"
and re-run everything:
$ snakemake clean
$ snakemake -j 4
and voila - you’ve got a pile of trimmed output files, and a nice multiqc report on pre- and post- quality!
One other point to make: by naming the files “right”, i.e. by making the trimmed files meet the input requirements for the fastqc rule, we could automatically take advantage of that rule to run fastqc on them. A lot of snakemake seems to boil down to file naming, for better or for worse… :)
You can use snakemake -n to see what would be run, without actually running it! This is called a "dry run", and it's useful when you have really big compute jobs.
$ snakemake clean
$ snakemake -n
Above, we saw that you can use snakemake -j 4 to run four jobs in parallel. Here are some other nice features.
You can specify software on a per-rule basis! This is really helpful when you have incompatible software requirements for different rules, or want to run on a cluster, or just want to pin your snakemake workflow to a specific version.
For example, if you create a file env_fastqc.yml
with the following content,
channels:
- bioconda
- defaults
- conda-forge
dependencies:
- fastqc==0.11.8
and then change the fastqc rule to look like this:
rule fastqc_a_file:
    input:
        "{filename}.fastq.gz"
    output:
        "{filename}_fastqc.html",
        "{filename}_fastqc.zip"
    conda:
        "env_fastqc.yml"
    shell:
        "fastqc {input}"
you can now run snakemake like so,
$ snakemake --use-conda
You can visualize your workflow like so!
$ snakemake --dag | dot -Tpng > dag.png
Snakemake can automatically generate detailed self-contained HTML reports that encompass runtime statistics, provenance information, workflow topology and results.
$ snakemake --report report.html
Just like scripting, or writing an R script, writing a snakefile is a kind of programming. So you’ll have to do a lot of debugging :) :(.
What do (snakemake) workflows do for me?

The main reason to use snakemake is that it lets me be sure that my workflow completed properly. snakemake tracks which commands fail, and will stop the workflow in its tracks! This is not something that shell scripts usually do for you.