Learn my Linux Bootcamp…all from within a web browser window

I awoke yesterday to see a lot of Twitter notifications on my phone. Sometimes this happens when I've written a post on this blog, but I hadn't added anything for over a week. It turns out that the activity was triggered by this tweet from Richard Smith-Unna (@blahah404 on Twitter).

As the screenshot below indicates, Richard has worked some amazing black magic to enable a single browser window to contain a fully interactive terminal as well as a file viewer/navigator, all alongside a (slightly modified) version of my original Linux bootcamp material.


This new interactive command-line bootcamp is a wonderful resource and means that the only barrier to learning some simple, but powerful, Linux/Unix commands is the availability of a web browser.

Richard explains a little about how he put all of this together:

The infrastructure, including adventure-time and docker-browser-server, was built by @maxogden and @mafintosh. The setup of this app was based on the get-dat adventure.

Another take on our new Unix & Python book

Following on from my announcement yesterday that I am involved in writing a new book (Unix and Python to the Rescue!), my co-author Michelle has written a few words about it on her blog which I encourage you to read. She tackles the issue of why people in our field (the life-sciences) should learn to code:

It is my belief--based on my own experience--that using a prefabricated tool, such as a spreadsheet or graphing program, inherently limits you to someone else's idea of what analytical questions you should be asking about your data.

In today's scientific world, the amount and type of data we need to understand changes rapidly, and these programs can quickly become limiting. By taking the time to learn a set of basic tools that can be combined in limitless ways, you empower yourself to ask the kinds of analytical questions you want to ask about your data.

Taking steps to write a new book about programming!

The old book

I am very excited to announce that I am involved in writing another book about programming! The 2012 book that I wrote with Ian Korf — Unix and Perl to the Rescue!: A Field Guide for the Life Sciences (and Other Data-rich Pursuits) — was enjoyable to write and seemed to be well received (4.5-star average on Amazon.com), and so we both wanted to do something else.

We wrote about Perl because it is the language that we had both used since the mid-1990s, and for a long while Perl was the language du jour for people working in bioinformatics. This has changed. The TIOBE software index uses search engine queries to track the popularity of all programming languages over time. In 2000, Perl was the 4th most popular language whereas Python ranked 24th. As of July 2015, Python has risen to 5th place, overtaking Perl, which has dropped to 11th place. Not only is Python proving to be an extremely popular language, it is also swiftly overtaking Perl in many areas involving the processing of biological data.

The new book

So we made a proposal to Cambridge University Press to write what we are provisionally calling Unix and Python to the Rescue! (this will no doubt be the start of a successful series which will culminate in Unix and Minecraft to the Rescue!). Happily, they have accepted our proposal and so we have recently started the process of writing the new book (hopefully due to be published in 2016).

We intend for this book to fulfill many of the same goals that we had for our earlier book:

  1. Contain basic material that introduces Unix & Python to someone who has never sat down at a terminal or written a line of code before.
  2. Include many advanced programming concepts in addition to the basics.
  3. Where possible, only introduce one new concept at a time.
  4. Write in a lively, engaging style in order to make the concepts fun!

For item #2, we envisage our book addressing topics such as NumPy, SciPy, IPython Notebooks, and the pandas package, to name but a few!


The new co-author

For this new endeavor we have recruited the many talents of Michelle Gill (@modernscientist) who will bring her Python skills and all-round coding expertise to the project. Michelle is a scientist at the National Cancer Institute and has been using Python to analyze research data — and also using it for fun — for most of the last decade. You can find out more about Michelle, and see examples of her coding expertise on her excellent blog, themodernscientist. I asked her to say a few words about the new book:

"The purpose of this book is to equip scientists with the tools necessary to understand and analyze data in the way that directly suits their needs and can be reproduced in the future. With the ever increasing pace of research and volume of data generated, I am convinced the best way to accomplish this is by learning Python".

Michelle Gill


The new website

We previously had a website (unixandperl.com) and twitter account (@unixandperl) to support the old book and related materials. However, it seemed fitting to 'expand the brand', and so we have an updated website that can now be found at rescuedbycode.com (the old URL should still redirect here). This website continues to host information about the completely free Unix and Perl Primer for Biologists that we previously released, as well as the (also free) Command-line Linux Bootcamp that I recently added.

I expect that we will add some more posts to the new website in the coming months, and we will also continue to publish occasional items of relevance on the Twitter account (which has also been renamed, to @rescuedbycode). Much of Python is new to me, and I hope to share some of my experiences from the point of view of someone who comes from a Perl background.

Okay, time for me to go and do some more research for the book.

Command-line bootcamp: learn the basics of Unix

Here is another contribution that I made to a UC Davis Bioinformatics Workshop that I helped teach last week. Adapting material from our much longer Unix & Perl Primer for Biologists, I made a short bootcamp that aims to teach the basics of the Unix/Linux command-line.

Unlike the Primer material, which was written from the point of view of someone using a Mac, the new bootcamp course is written from the viewpoint of someone using Ubuntu Linux. Also, no example files are needed. The course is entirely self-contained and should take 1–3 hours to work through (depending on your familiarity with Unix).

Download the PDF, view the HTML version, or work with the underlying Markdown file.

Current version: v1.01 — 2015-06-24

Filtering a SAM file generated by TopHat to find uniquely mapped, concordant read pairs: AWK vs SAMtools

Software suites like SAMtools (or should that be [SAMsam]tools?) offer a powerful way to slice and dice files in the SAM, BAM, and CRAM file formats. But sometimes other approaches work just as well.

If you have aligned paired RNA-Seq read data to a genome or transcriptome using TopHat you may be interested in filtering the resulting SAM/BAM file to keep reads that are:

  • a) uniquely aligned (only match one place in target sequence)
  • b) mapped as concordant read pairs (both pairs map to same sequence in correct orientation with suitable distance between them)

TopHat has a --no-discordant command-line option which only reports read alignments if both reads in a pair can be mapped. The name of this option is somewhat misleading though, as you still end up with discordantly mapped read pairs in the final output (it is always good to check what difference a command-line option actually makes to your output!).
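
One quick way to make that check is to tally the FLAG values (column 2) left in the output. Here is a sketch; it assumes the same 'unfiltered.sam' file used below, and skips any header lines:

# Count how many times each FLAG value occurs; values other than
# 83, 99, 147, or 163 indicate reads that are not in concordant pairs
grep -v '^@' unfiltered.sam | cut -f 2 | sort | uniq -c | sort -rn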

So if you have a TopHat SAM file and you want to filter it to only keep uniquely mapped, concordant read pairs, you could use two of the options that the samtools view command provides:

  • -q 50 — This filters on the MAPQ field (5th column of the SAM file), keeping only alignments with MAPQ ≥ 50. TopHat uses a value of 50 in this field to denote unique mappings (this important piece of information is not in the TopHat manual).
  • -f 0x2 — This option filters on the bitwise FLAG score (column 2 of the SAM file), and will extract only those mappings where the 2nd bit (0x2) is set. The SAM documentation describes this as 'each segment properly aligned according to the aligner'. In practice this means looking for FLAG values of 83, 99, 147, or 163 (the short gawk sketch below confirms this; see also this helpful Biobeat blog post for more information).
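
To convince yourself that these four values all have the 0x2 bit set, you can test the bit directly. Here is a quick sketch using the and() function, which is a GNU awk (gawk) extension rather than POSIX awk:

# 83, 99, 147, and 163 all have the 0x2 ('properly paired') bit set;
# 77 and 141 (a completely unmapped pair) are included for contrast
gawk 'BEGIN { n = split("83 99 147 163 77 141", f, " ");
              for (i = 1; i <= n; i++)
                  print f[i], (and(f[i], 2) ? "proper pair" : "not proper") }'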

So if you have a SAM file, in essence you just need to filter that file based on matching certain numbers in two different columns. This is something that the Unix AWK tool excels at, and unlike SAMtools, AWK is installed on just about every Unix/Linux system by default. So do both tools give you the same result? Only one way to find out:

Using SAMtools

The 'unfiltered.sam' file is the result of a TopHat run that used the --no-discordant and --no-mixed options. The SAM file contained 34,340,754 lines of mapping information:

time samtools view -q 50 -f 0x2 unfiltered.sam > filtered_by_samtools.sam

real    1m57.068s
user    1m18.705s
sys     0m13.712s

Using AWK

time awk '$5 == 50 && ($2 == 163 || $2 == 147 || $2 == 83 || $2 == 99) {print}' unfiltered.sam  > filtered_by_awk.sam

real    1m31.734s
user    0m46.855s
sys     0m15.775s

Does it make a difference?

wc -l filtered_by_*.sam
31710476 filtered_by_awk.sam
31710476 filtered_by_samtools.sam

diff filtered_by_samtools.sam filtered_by_awk.sam

The diff produces no output, so the final files are identical, and AWK ran quite a bit quicker than SAMtools. In this situation, filtering throws away about 8% of the mapped reads (34,340,754 lines down to 31,710,476).

Is this acceptable behavior for a bioinformatics program developed in the year 2014?

Last week I installed a relatively new read aligner with the humorous name of ARYANA.

The journal article describing the tool was published on September 10th 2014, and the associated code repository on GitHub first appeared earlier in the same year. So we're not talking about an old program here.

If I have time I'm planning to investigate the use of ARYANA alongside other established mapping tools like BWA and Bowtie 2. Installing ARYANA was straightforward, and I then proceeded to try the first thing that I attempt with all new bioinformatics software (and most Unix command-line software):

Run the program without any parameters to see what happens

I don't think I'm alone in this approach. In the absence of any necessary command-line options, a good Unix program will return helpful information about how it should be used. At the very least it might prompt you with the minimal use scenario and/or point out how you can find out more information by invoking the help mode. So here is what happened with ARYANA:

% aryana
Need more inputs

Not very helpful. So I tried the next obvious thing: let's see if there is a help mode:

% aryana -h
Need more inputs

% aryana --help
Need more inputs

Hmm. This is really not helpful. Out of curiosity, I tried to see if ARYANA would tell me what version it is (a fairly common behavior for a lot of command-line software):

% aryana -v
Need more inputs

% aryana --version
Need more inputs

At this point I sighed. Not figuratively. I literally sighed, because this type of feedback from a program — especially a bioinformatics program developed in the year 2014 — is maddening. I tweeted about this issue and, judging by the feedback, I am not alone in my views.

It might have been less frustrating if the program had returned no output at all rather than just those three words. I feel like the program is taunting me. It may as well have returned any of the following output:

% aryana
Not gonna work

% aryana
No can do

% aryana
Please go away

I could use this blog post to tell you about some of the basic requirements of a bioinformatics command-line program, but I don't need to do this because others have already done so. Specifically, people should look at this great paper by Torsten Seemann (@torstenseemann), published in GigaScience last year:

Ten recommendations for creating usable bioinformatics command line software

This is a fantastic set of recommendations, and coincidentally the first three things on the list relate to the first three things that I tried doing when running the ARYANA program (a minimal sketch of these three behaviors follows the list):

  1. Print something if no parameters are supplied
  2. Always have a “-h” or “--help” switch
  3. Have a “-v” or “--version” switch
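
Even a few lines of shell are enough to cover those first three items. Here is a minimal sketch; the tool name, options, and messages are all illustrative rather than taken from any real program:

#!/usr/bin/env bash
# mytool: a hypothetical command-line tool that prints usage when run
# with no arguments, and supports help and version switches

VERSION="0.1.0"

usage() {
    echo "mytool $VERSION"
    echo "Usage: mytool [options] <reads.fastq> <genome.fa>"
    echo "  -h, --help      show this help and exit"
    echo "  -v, --version   print the version number and exit"
}

case "${1:-}" in
    ""|-h|--help)  usage; exit 0 ;;
    -v|--version)  echo "mytool $VERSION"; exit 0 ;;
esac

# ...the real work of the program would start here...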

This is good advice for developers of bioinformatics software, but equally it is good advice for reviewers of bioinformatics software. If I were a reviewer of the ARYANA paper, I would have made comments regarding the lack of useful output from the program.

Installing Circos on CentOS

Circos is a software tool that lets you make beautiful circular figures, which are an increasingly common feature of genome-related papers (here are some examples). Circos, which is free to use, is built on Perl and requires a lot of Perl modules to be installed in order to work.

This is not always as straightforward as one might imagine, though Circos helps you by providing a tool to quickly see which modules you already have and which still need to be installed. I've previously set up Circos on Mac OS X, but today I had to install it on our Linux box (running CentOS 6). These are the steps I went through in order to get Circos up and running. Hopefully, this might be useful to others in the same situation.

### Installing Circos on CentOS ###

# These instructions assume that you are installing Circos in 
# a '/Bioinformatics/Packages' directory and also that you have root access

# Preparatory steps
su
mkdir /Bioinformatics/Packages/Circos
cd /Bioinformatics/Packages/Circos


# Download the latest *.tgz release and copy to Circos directory
wget http://circos.ca/distribution/circos-0.64.tgz
tar -xvzf circos-0.64.tgz
cd circos-0.64


# To test what modules we will need to install, use Circos' built-in tool
cd bin
./test.modules


# This leads to having to install:
perl -MCPAN -e 'install Clone'
perl -MCPAN -e 'install Config::General'
perl -MCPAN -e 'install Font::TTF::Font'
perl -MCPAN -e 'install GD'

# GD module installation failed and so I had to also install the gd library:
yum install gd gd-devel


# now back to the Perl modules
perl -MCPAN -e 'install GD'
perl -MCPAN -e 'install List::MoreUtils'
perl -MCPAN -e 'install Math::Round'
perl -MCPAN -e 'install Math::VecStat'
perl -MCPAN -e 'install Params::Validate'

# Params::Validate didn’t seem to install properly, so I tried a manual approach
cd
wget http://search.cpan.org/CPAN/authors/id/D/DR/DROLSKY/Params-Validate-1.07.tar.gz
tar -xvzf Params-Validate-1.07.tar.gz
cd Params-Validate-1.07
perl Build.PL

# This failed because I needed a newer version of some other Perl modules
# even though the ./test.modules script reported everything was okay
perl -MCPAN -e 'install Module::Build'
perl -MCPAN -e 'install Module::Implementation'


# Can now hand-install Params::Validate (Build.PL needs re-running
# now that the newer modules are in place)
perl Build.PL
./Build
./Build test
./Build install

# And back to the Perl modules
perl -MCPAN -e 'install Readonly'
perl -MCPAN -e 'install Regexp::Common'
perl -MCPAN -e 'install Text::Format'


# Tidy up and set up symbolic links
rm -rf ~/Params-Validate-1.07
rm ~/Params-Validate-1.07.tar.gz
cd /Bioinformatics/Packages/Circos
ln -s circos-0.64 current
cd /Bioinformatics/bin
ln -s ../Packages/Circos/current/bin/circos

# Try running circos
./circos

# This still wasn’t enough!!!
# Running circos complained that I needed 2 more Perl Modules!
perl -MCPAN -e 'install Math::Bezier'
perl -MCPAN -e 'install Set::IntSpan'

# Try running circos again
./circos

# Success! And by 'success' I mean that you can now get to a point where you are
# seeing errors from the Circos program (i.e. you need to set up your Circos
# configuration file).

Happy to announce that our Unix & Perl book is finally on sale!

Unix and Perl to the RESCUE! A field-guide for life scientists (and other data-rich pursuits)

At least in the UK. It will take a few weeks for US warehouses to receive stock, and we have still not received our own copies, but at least the book is out there (somewhere). Amazon is discounting the book, which puts it into a more affordable price range.

It will be very strange the first time I spot it 'in the wild' by chance. If I spot someone reading it on a bus/train/plane, I wonder whether I will be tempted to say anything.