Bioinformatics and genomics resources on reddit

July 27, 2015 by Keith Bradnam

Although many people in this field turn to sites like SEQanswers and Biostars to get help with bioinformatics problems, there are a number of subreddits that are devoted to discussion of bioinformatics and genomics. Reddit isn't just a forum for asking questions though, and people also share lots of relevant links (to papers, resources, news etc.). As a new subreddit appeared this week, I thought I'd present a quick roundup:

1. bioinformatics

URL: https://www.reddit.com/r/bioinformatics/
Description: news for genome hackers
Frequency of posts: Frequent, maybe 5–10 new items per day
Current readership: 8,641 readers

This is the most popular subreddit in this list (in terms of readers). The posts are split roughly equally between sharing links of interest and asking questions. Many of the questions are requests for career advice from people wanting to get into this field.

2. genomics

URL: https://www.reddit.com/r/genomics/
Description: Genomics, genetics, DNA, health, and personalized medicine
Frequency of posts: Infrequent, ~1–5 new items per week
Current readership: 3,085 readers

This subreddit seems to be declining in popularity. It mostly features shared links rather than people asking questions.

3. genome

URL: https://www.reddit.com/r/genome/
Description: Please submit primary genomics literature, discussions of primary literature (e.g. blog posts and serious news stories), and resources for genomics research.
Frequency of posts: Moderate, ~5–10 new items per week
Current readership: 211 readers

This is a relatively new subreddit and it is focused on people sharing new and interesting papers, and using the subreddit to discuss those papers.

4. learnbioinformatics

URL: https://www.reddit.com/r/learnbioinformatics/
Description: This is a subreddit for providing you with the most relevant academic papers, textbooks, websites, and tutorials in the field of bioinformatics. If you have any recommended resources, please feel free to post away!
Frequency of posts: Too early to reliably predict
Current readership: 299 readers

This is the newest subreddit (just a few days old), and so has only attracted about a dozen posts. The intended role of this subreddit has obvious overlaps with the other subreddits.

Concerning the gender ratio of speakers at the 2015 Genome Science Meeting

July 25, 2015 by Keith Bradnam

The Genome Science 2015 meeting has announced its speaker line-up. At the time of writing, not all of the speaker positions are finalized, but the published agenda currently reveals:

13 men
9 women

So currently 41% of the speakers are women, which is excellent. I hope that the remaining 15 slots keep this conference free from notable gender bias.

Time for a couple of new JABBA awards for bogus bioinformatics acronyms

July 24, 2015 by Keith Bradnam

It's been a while since I handed out any JABBA awards (Just Another Bogus Bioinformatics Acronym). These are awarded to developers of bioinformatics tools who name their software using tenuously derived acronyms or initialisms. Both awardees featured in a recent issue of Nucleic Acids Research. First up, we have:

I-TASSER server: new development for protein structure and function predictions

I'm always slightly worried when the publication doesn't reveal the source of an acronym. I had to go to the I-TASSER website to find out the grim truth:

Iterative Threading ASSEmbly Refinement

It's not an altogether terrible name, but to me it doesn't include any mention of protein structure or function prediction (which is the purpose of the tool). Even if you saw this full name written down somewhere, you might assume that it has something to do with other types of assembly.

Next up is a software tool which has a nice short name:

SCUDO: a tool for signature-based clustering of expression profiles

Just five letters in the name, they probably correspond neatly to five words in the software's full name, right? Er, no.

Signature-based ClUstering for DiagnOstic purposes

This is one of the most tenuous initialisms that I have seen. In these situations, I would urge developers not to try to make the name into an acronym or initialism at all. Just calling the tool 'SCUDO' would, in my opinion, be preferable to using such a clumsy initialism.

Another take on our new Unix & Python book

July 22, 2015 by Keith Bradnam

Following on from my announcement yesterday that I am involved in writing a new book (Unix and Python to the Rescue!), my co-author Michelle has written a few words about it on her blog which I encourage you to read. She tackles the issue of why people in our field (the life-sciences) should learn to code:

It is my belief--based on my own experience--that using a prefabricated tool, such as a spreadsheet or graphing program, inherently limits you to someone else's idea of what analytical questions you should be asking about your data.

In today's scientific world, the amount and type of data we need to understand changes rapidly, and these programs can quickly become limiting. By taking the time to learn a set of basic tools that can be combined in limitless ways, you empower yourself to ask the kinds of analytical questions you want to ask about your data.

Taking steps to write a new book about programming!

July 21, 2015 by Keith Bradnam

The old book

I am very excited to announce that I am involved in writing another book about programming! The 2012 book that I wrote with Ian Korf, Unix and Perl to the Rescue!: A Field Guide for the Life Sciences (and Other Data-rich Pursuits), was enjoyable to write and seemed to be well received (4.5 star average on Amazon.com), and so we both wanted to do something else.

We wrote about Perl because it is the language that we had both used since the mid-1990s, and for a long while Perl was the language du jour for people working in bioinformatics. This has changed. The TIOBE software index uses search engine queries to track the popularity of all programming languages over time. In 2000, Perl was the 4th most popular language whereas Python ranked 24th. As of July 2015, Python has risen to 5th place, overtaking Perl which has dropped to 11th place. Not only is Python proving an extremely popular language, it is swiftly overtaking Perl in many areas involving the processing of biological data.

The new book

So we made a proposal to Cambridge University Press to write what we are provisionally calling Unix and Python to the Rescue! (this will no doubt be the start of a successful series which will culminate in Unix and Minecraft to the Rescue!). Happily, they have accepted our proposal and so we have recently started the process of writing the new book (hopefully due to be published in 2016).

We intend for this book to fulfill many of the same goals that we had for our earlier book:

1. Contain basic material that introduces Unix & Python to someone who has never sat down at a terminal or written a line of code before.
2. Include many advanced programming concepts in addition to the basics.
3. Where possible, only introduce one new concept at a time.
4. Write in a lively, engaging style in order to make the concepts fun!

For item #2, we envisage our book addressing topics such as NumPy, SciPy, IPython Notebooks, and the pandas package, to name but a few!

The new co-author

For this new endeavor we have recruited the many talents of Michelle Gill (@modernscientist) who will bring her Python skills and all-round coding expertise to the project. Michelle is a scientist at the National Cancer Institute and has been using Python to analyze research data — and also using it for fun — for most of the last decade. You can find out more about Michelle, and see examples of her coding expertise on her excellent blog, themodernscientist. I asked her to say a few words about the new book:

"The purpose of this book is to equip scientists with the tools necessary to understand and analyze data in the way that directly suits their needs and can be reproduced in the future. With the ever increasing pace of research and volume of data generated, I am convinced the best way to accomplish this is by learning Python".

Michelle Gill

The new website

We previously had a website (unixandperl.com) and twitter account (@unixandperl) to support the old book and related materials. However, it seems fitting that we need to 'expand the brand', and so we have an updated website that can now be found at rescuedbycode.com (the old URL should still redirect here). This website continues to host information about the completely free Unix and Perl Primer for Biologists that we previously released, as well as the (also free) Command-line Linux Bootcamp that I recently added.

I expect that we will add some more posts to the new website in the coming months, and we will also continue to publish occasional items of relevance to the twitter account (also renamed to @rescuedbycode). Much of Python is new to me, and I hope to share some of my experiences from the point of view of someone who comes from a Perl background.

Okay, time for me to go and do some more research for the book.

101 questions with a bioinformatician #29: Jane Loveland

July 16, 2015 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Jane Loveland is a Senior Computer Biologist at The Wellcome Trust Sanger Institute where she is involved in a number of key projects relating to genome annotation and training.

As a manager in the HAVANA group (Human and Vertebrate Analysis and Annotation), she helps oversee the valuable work in using manual annotation to provide a reference gene set for the human, mouse, and zebrafish genomes. HAVANA's annotation is made publicly available via the Vega genome browser, which is in turn merged with the annotation in Ensembl to produce the reference GENCODE gene set.

Jane also leads a team of instructors for Wellcome Trust Advanced Courses which teach workshops all over the world, in particular the Open Door Workshops:

The Open Door Workshop provides an introduction to bioinformatics tools freely available on the internet, focussing primarily on the Human Genome data. The workshops provide hands-on training in the use of public databases and web-based sequence analysis tools, and are taught by experienced instructors.

And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

The speed of change. From an annotation view point, we are constantly having to find ways to use new data sources which in turn adds value to the annotation that we produce.

When I’m putting together a manual for a workshop I have to update everything, every time. I have come into bioinformatics from wet lab biochemistry/molecular biology and I once spent an entire week hand-crafting a multiple alignment figure for my thesis. I can do this in a few minutes now.

010. What's something that you don't enjoy about current bioinformatics research?

Everyone assumes that all genome sequences are 'finished' (KRB: I don't!). They may be sequenced but the quality is often pretty poor compared to the sequence that we were producing at the Sanger Institute about a decade ago.

You can’t interpret what’s going on in a genome if the underlying reference sequence is of poor quality. I do a lot of teaching and spend a lot of time explaining this to researchers.

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Just go for it. Bit of a cliché I know. I had a crippling lack of confidence when I was younger which I think really held me back.

100. What's your all-time favorite piece of bioinformatics software, and why?

For annotation: Blixem. This is an interactive graphical BLAST viewer — old but essential for gene annotation. Means that I can view alignments to the genome at base pair level really quickly and simply.

For workshops: Ensembl. You have to be able to browse a genome.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

Can I have I for inosine? Reminds me of making degenerate primers for PCR. It's a multi-tasker, which is also how I see myself. It's not on the list though (KRB: everyone keeps breaking the rules!).

The Take-Home Message: a new web comic about biology, genomics, and bioinformatics

July 14, 2015 by Keith Bradnam

I am extremely excited to announce the launch of a new venture that I am involved in:

The Take-Home Message

The Take-Home Message is a new web comic that tries to graphically represent interesting and fun ideas from the world of biology, with a special focus on genomics and bioinformatics. It was partly inspired by the growing trend towards journals that require graphical abstracts, though we hope to be entertaining as well as informative.

The driving force behind THM (as it will surely become known) is the very talented Abby Yu. Until very recently, Abby was working hard as a Graduate Student in our lab at UC Davis. Abby's amazing artistic talents were always in evidence whenever she presented in our lab meetings. But she isn't just a great artist and illustrator; she uses those talents to be a great science communicator too.

After obtaining her PhD and getting a job in the real world™, Abby approached me with the idea of starting a web comic together. We should all be grateful that she didn't suggest that I do any of the drawings! I play more of an editorial role and will help come up with the ideas and write the explanatory text that accompanies each comic. You can see more of Abby's amazing artistic creations on her Tumblr: Oh THAT sketch blog.

The first issue of THM is now online and concerns the Tuxedo Suite of bioinformatics software. We are provisionally aiming to have a new comic online every two weeks. As well as reading the comic online, you can also subscribe to the RSS feed, have the comics emailed to you, or follow us on Twitter (@takehomemessage).

To celebrate the launch of issue #1, Abby has prepared a little celebratory version of our banner image:

When 'verbose' mode is maybe a little too verbose: lessons from the Trinity transcriptome assembler

July 10, 2015 by Keith Bradnam

The transcriptome assembler Trinity, like many other bioinformatics command-line tools, sends its principal output (a transcriptome assembly) to a named output file. It writes other information about the status of the run to standard output.

Another feature in common with other bioinformatics programs is that it provides a --verbose mode. The Trinity command-line help describes this as follows:

verbose: provide additional job status info during the run

I recently helped a colleague use Trinity to generate a primate transcriptome assembly, and when we ran the program we did two runs, one with standard logging and one with the verbose output turned on. In both cases we used file redirection to send the output to a file. So what did we end up with?

transcriptome.fasta - 60.4 MB
stdout.log - 2.1 MB
stdout_verbose.log - 140.7 MB

The verbose log file was nearly 70 times bigger than the standard log file, and over twice the size of the final transcriptome assembly! I tried converting the verbose text file to a PDF, which gave me a 15,385 page document. The Unix word count program tells me that this file contains over 15 million 'words', but the problem is that these are not words that you would necessarily want to read. There are thousands and thousands of pages of output with text that looks like this:

[Screenshot: excerpt from the verbose Trinity log file]

If you run Trinity without redirecting the output to a file, you will just see the percentage completion number overwrite itself on a single line of output. This doesn't work so well though if someone does choose to redirect the output to a file. You could also make an argument that no-one really needs to see such a high level of precision when reporting the state-of-completion of each step (four decimal places!).

I think this is an example where the verbose log file ends up being so big as to be largely unusable. If you wanted to search for a specific string in that file, then maybe it would be helpful. The main problem is that the Trinity developers are trying to be smart by having the program overwrite the percentage completion status of each step in place. However, this is only useful if the user chooses not to redirect the output to a file (and redirection is incredibly common in bioinformatics). I would argue that for 99% of cases, it is more than sufficient for a program to print 10–20 lines of output regarding the state of completion, e.g.

Calculating stage 1 of shamrock.pl…
10% complete
20% complete
30% complete
40% complete
50% complete
60% complete
70% complete
80% complete
90% complete
100% complete
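
To make that suggestion concrete, here is a minimal Python sketch of how a program might adapt its progress reporting to where the output is going. This is not how Trinity works internally; the ProgressReporter class, its parameters, and the demo function are all hypothetical names invented for illustration. The only assumption is that the program can ask its output stream whether it is an interactive terminal (via isatty()).

```python
import io
import sys


class ProgressReporter:
    """Hypothetical helper: report percent complete, adapting to the output stream."""

    def __init__(self, label, total, stream=None, step=10):
        self.total = total
        self.stream = stream if stream is not None else sys.stdout
        self.step = step
        self.last_milestone = 0
        # isatty() is True for an interactive terminal, False for a file or pipe
        self.interactive = self.stream.isatty()
        self.stream.write(f"{label}\n")

    def update(self, done):
        percent = 100 * done // self.total
        if self.interactive:
            # On a terminal, overwrite a single line using a carriage return
            self.stream.write(f"\r{percent}% complete")
            if done == self.total:
                self.stream.write("\n")
        else:
            # When redirected, write one line per milestone so the log stays short
            milestone = percent - percent % self.step
            if milestone > self.last_milestone:
                self.stream.write(f"{milestone}% complete\n")
                self.last_milestone = milestone
        self.stream.flush()


def demo(total=5000):
    """Simulate a redirected run; StringIO stands in for the log file."""
    log = io.StringIO()
    reporter = ProgressReporter("Calculating stage 1", total, stream=log)
    for done in range(1, total + 1):
        reporter.update(done)
    return log.getvalue()


if __name__ == "__main__":
    print(demo(), end="")
```

With this approach, 5,000 progress updates become a label plus ten lines in a redirected log, while someone watching the run in a terminal still sees a single self-updating line.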

About my idea of a 33% target for women speakers at genomics conferences…

July 09, 2015 by Keith Bradnam

Last week I wrote a post on the subject of gender bias at genomics/bioinformatics conferences. I suggested a figure of 33% might make for a minimum target for the proportion of women (and men) who give talks at such conferences. I also went so far as to end that post by saying:

I don't attend many conferences, but from now on I won't be attending any if at least 33% of the talks are not by women.

At the time that I wrote this, I knew that I was going to be speaking at a genomics conference myself later this year. What I didn't know at the time was the gender ratio of speakers at this conference. That information only came to light this week. So what is the proportion of talks by women at this conference?

28.2%

If you're quick on the uptake, you will notice that 28 < 33. So what did I do? Well, I wrote to the conference organizers, explained my position, and told them that I would like to withdraw my speaking role. I also suggested that they find a woman to take my place (and offered a suggestion of a female co-worker who has worked on the very project that I was intending to talk about).

The conference in question is the new Festival of Genomics that will take place in California in November. This is the second Festival of Genomics conference organized by Front Line Genomics and you may have read about the first conference in this series that recently took place in Boston. This conference was very well received (e.g. see this, this, or this) and so I was very much looking forward to speaking in November (especially as this was the first time that I have been asked to speak at a conference).

The current list of speakers shows 66 men and 26 women. It's possible that these numbers might change slightly; adding just 7 more women speakers, or replacing only 5 male speakers with women would be enough to reach my suggested 33% target.
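
For anyone who wants to check this arithmetic against a different speaker list, here is a small Python sketch. The function names are made up for illustration, and it assumes that the target is treated as exactly 33/100; a target of one third would need its own check.

```python
from fractions import Fraction

# Assumption: the 33% target is treated as exactly 33/100
TARGET = Fraction(33, 100)


def women_to_add(men, women, target=TARGET):
    """Smallest number of extra women speakers needed to reach the target."""
    extra = 0
    while Fraction(women + extra, men + women + extra) < target:
        extra += 1
    return extra


def men_to_replace(men, women, target=TARGET):
    """Smallest number of men to swap for women (total unchanged) to reach the target."""
    swapped = 0
    while Fraction(women + swapped, men + women) < target:
        swapped += 1
    return swapped


if __name__ == "__main__":
    print(women_to_add(66, 26))    # 7, because 33/99 is the first ratio at or above 33%
    print(men_to_replace(66, 26))  # 5, because 31/92 is the first ratio at or above 33%
```

Using exact fractions rather than floating-point numbers avoids any rounding surprises when a ratio lands very close to the target.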

I have had several productive exchanges with Front Line Genomics about this issue. They acknowledge the problem and seem to genuinely want to do something about it to reduce gender bias in this field. I'm confident that subsequent conferences that they organize will do an even better job at representing women in speaking roles. It also must be said that they are doing much better than most genomics conferences and 28% is higher than the current proportion of women in senior roles at most genome institutes. Once again, I want to reiterate that I have found Front Line Genomics to be extremely open about this issue, and I genuinely believe that they are receptive to suggestions that might improve the situation in future.

What can be done?

If you are a male scientist who is concerned by the current level of gender bias at genomics conferences, and if you are ever invited to give a talk at such a conference, then you do have the power to help change the situation. If you learn that women speakers are going to be underrepresented, you can withdraw your speaking position and instead make some suggestions of female scientists to take your place. You can also raise this issue when first invited to speak. If conference organizers received responses from all potential speakers saying 'I will only talk if your conference has an unbiased gender ratio of speakers', then this could change the situation dramatically.

Time to conclude this post by saying (once again): I don't attend many conferences, but from now on I won't be attending any if at least 33% of the talks are not by women.

Everything you ever wanted to know about working with RNA-Seq data but were afraid to ask

July 07, 2015 by Keith Bradnam

Do you work with RNA-Seq data?
Do you plan to work with RNA-Seq data?
Have you ever heard of RNA-Seq data?

If the answer to any of these questions is 'yes' (or even 'maybe') then you should definitely check out this fantastic online guide to all things RNA-Seq:

RNA-seqlopedia

The RNA-seqlopedia provides an overview of RNA-seq and of the choices necessary to carry out a successful RNA-seq experiment

Written by Rodger Voelker and Clay Small of the Cresko Lab at the University of Oregon, it is a fantastically detailed, beautifully written resource to walk you through every step of working with RNA-Seq data.

I wish there were more online guides like this! Here's the Table of Contents, with the 'Analysis' section expanded, to give you a feeling for what it covers:

Experimental Design
RNA Preparation
Library Preparation
Sequencing
Analysis
    Overview
    Initial Processing
    Demultiplexing
    Removing adapters
    Trimming
    Kmer Normalization
    de Novo Assembly
    de Bruijn Graph assembly
    Overlap Layout Assembly
    Aligning reads to a reference
    Aligning to a ref. genome
    Aligning to a transcriptome
    microRNA Aligners
    Short Read Aligners Output
    Annotation of transcripts
    Differential gene expression
    Normalization
    Discrete Distribution Models
    Continuous Distribution Models
    Nonparametric Models
    Choice of Analysis Software
References