The Take-Home Message: a new web comic about biology, genomics, and bioinformatics

July 14, 2015 by Keith Bradnam

I am extremely excited to announce the launch of a new venture that I am involved in:

The Take-Home Message

The Take-Home Message is a new web comic that tries to graphically represent interesting and fun ideas from the world of biology, with a special focus on genomics and bioinformatics. It was partly inspired by the growing trend towards journals that require graphical abstracts, though we hope to be entertaining as well as informative.

The driving force behind THM (as it will surely become known) is the very talented Abby Yu. Until very recently, Abby was working hard as a Graduate Student in our lab at UC Davis. Abby's amazing artistic talents were always in evidence whenever she presented in our lab meetings. But she isn't just a great artist and illustrator; she uses those talents to be a great science communicator too.

After obtaining her PhD and getting a job in the real world™, Abby approached me with the idea of starting a web comic together. We should all be grateful that she didn't suggest that I do any of the drawings! I play more of an editorial role and will help come up with the ideas and write the explanatory text that accompanies each comic. You can see more of Abby's amazing artistic creations on her Tumblr: Oh THAT sketch blog.

The first issue of THM is now online and concerns the Tuxedo Suite of bioinformatics software. We are provisionally aiming to have a new comic online every two weeks. As well as reading the comic online, you can also subscribe to the RSS feed, have the comics emailed to you, or follow us on Twitter (@takehomemessage).

To celebrate the launch of issue #1, Abby has prepared a little celebratory version of our banner image:

When 'verbose' mode is maybe a little too verbose: lessons from the Trinity transcriptome assembler

July 10, 2015 by Keith Bradnam

The transcriptome assembler Trinity, like many other bioinformatics command-line tools, sends its principal output (a transcriptome assembly) to a named output file. It writes other information about the status of the run to standard output.

Another feature in common with other bioinformatics programs is that it provides a --verbose mode. The Trinity command-line help describes this as follows:

verbose: provide additional job status info during the run

I recently helped a colleague use Trinity to generate a primate transcriptome assembly, and when we ran the program we did two runs, one with standard logging and one with the verbose output turned on. In both cases we used file redirection to send the output to a file. So what did we end up with?

transcriptome.fasta - 60.4 MB
stdout.log - 2.1 MB
stdout_verbose.log - 140.7 MB

The verbose log file was almost 70 times bigger than the standard log file, and over twice the size of the final transcriptome assembly! I tried converting the verbose text file to a PDF, which gave me a 15,385 page document. The Unix word count program tells me that this file contains over 15 million 'words', but the problem is that these are not words that you would necessarily want to read. There are thousands and thousands of pages of output with text that looks like this:

[Screenshot of the verbose log output]

If you run Trinity without redirecting the output to a file, you will just see the percentage completion number overwrite itself on a single line of output. This doesn't work so well though if someone does choose to redirect the output to a file. You could also make an argument that no-one really needs to see such a high level of precision when reporting the state-of-completion of each step (four decimal places!).

I think this is an example where the verbose log file ends up being so big as to be largely unusable. If you wanted to search for a specific string in that file, then maybe it would be helpful. The main problem is that the Trinity developers are trying to be smart by having the program overwrite output (the percentage completion status of each step) on various lines of output. However, this is only useful if the output goes straight to the terminal, and redirecting output to a file is incredibly common in bioinformatics. I would argue that for 99% of cases, it is more than sufficient for a program to print 10–20 lines of output regarding the state of completion, e.g.

Calculating stage 1 of shamrock.pl…
10% complete
20% complete
30% complete
40% complete
50% complete
60% complete
70% complete
80% complete
90% complete
100% complete
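One way to get both behaviours is to check whether the output stream is a terminal before deciding how to report progress. The following is only a minimal sketch of that idea in Python, not how Trinity (or shamrock.pl) is actually written; the `ProgressReporter` class and `demo` function are hypothetical names made up for this illustration.

```python
import io
import sys


class ProgressReporter:
    """Hypothetical helper: overwrite in a terminal, log milestones in a file."""

    def __init__(self, label, total, stream=None, step=10):
        self.total = total
        self.stream = stream if stream is not None else sys.stdout
        self.step = step                # percentage points between logged lines
        self.next_mark = step           # next milestone to log when redirected
        self.interactive = self.stream.isatty()
        self.stream.write(f"{label}\n")

    def update(self, done):
        if self.interactive:
            # Terminal: rewrite the same line using a carriage return.
            percent = 100 * done // self.total
            self.stream.write(f"\r{percent}% complete")
            if done >= self.total:
                self.stream.write("\n")
        else:
            # Redirected output: write one line per milestone and nothing else.
            # Integer arithmetic avoids floating-point surprises at the marks.
            while self.next_mark <= 100 and done * 100 >= self.next_mark * self.total:
                self.stream.write(f"{self.next_mark}% complete\n")
                self.next_mark += self.step
        self.stream.flush()


def demo(total=2000):
    """Simulate a redirected run; io.StringIO reports isatty() as False."""
    log = io.StringIO()
    progress = ProgressReporter("Calculating stage 1...", total, stream=log)
    for done in range(1, total + 1):
        progress.update(done)
    return log.getvalue()


if __name__ == "__main__":
    print(demo(), end="")
```

Run with a stand-in for a redirected file, as in `demo()`, this writes the label plus ten milestone lines, no matter how many thousands of times `update()` is called.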

About my idea of a 33% target for women speakers at genomics conferences…

July 09, 2015 by Keith Bradnam

Last week I wrote a post on the subject of gender bias at genomics/bioinformatics conferences. I suggested a figure of 33% might make for a minimum target for the proportion of women (and men) who give talks at such conferences. I also went so far as to end that post by saying:

I don't attend many conferences, but from now on I won't be attending any if at least 33% of the talks are not by women.

At the time that I wrote this, I knew that I was going to be speaking at a genomics conference myself later this year. What I didn't know at the time, was the gender ratio of speakers at this conference. That information only came to light this week. So what is the proportion of talks by women at this conference?

28.2%

If you're quick on the uptake, you will notice that 28 < 33. So what did I do? Well, I wrote to the conference organizers and explained my position and told them that I would like to withdraw my speaking role. I also suggested that they find a woman to take my place (and offered a suggestion of a female co-worker who has worked on the very project that I was intending to talk about).

The conference in question is the new Festival of Genomics that will take place in California in November. This is the second Festival of Genomics conference organized by Front Line Genomics and you may have read about the first conference in this series that recently took place in Boston. This conference was very well received (e.g. see this, this, or this) and so I was very much looking forward to speaking in November (especially as this was the first time that I have been asked to speak at a conference).

The current list of speakers shows 66 men and 26 women. It's possible that these numbers might change slightly; adding just 7 more women speakers, or replacing only 5 male speakers with women would be enough to reach my suggested 33% target.
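For anyone who wants to check the arithmetic, here is a small Python sketch that works out those two numbers from the speaker counts quoted above. The function names are my own invention for illustration, and the 33% target is treated as a simple threshold on the percentage of women speakers.

```python
def percent_women(men, women):
    """Percentage of speakers who are women."""
    return 100 * women / (men + women)


def women_to_add(men, women, target_percent=33):
    """Smallest number of extra women speakers needed to reach the target."""
    added = 0
    # Integer comparison of (women + added) / new_total against the target.
    while (women + added) * 100 < target_percent * (men + women + added):
        added += 1
    return added


def men_to_replace(men, women, target_percent=33):
    """Smallest number of men to swap for women (total stays the same)."""
    total = men + women
    swapped = 0
    while (women + swapped) * 100 < target_percent * total:
        swapped += 1
    return swapped


if __name__ == "__main__":
    men, women = 66, 26  # counts quoted from the current speaker list
    print(f"Current share: {percent_women(men, women):.1f}%")
    print(f"Women to add: {women_to_add(men, women)}")
    print(f"Men to replace: {men_to_replace(men, women)}")
```

With 66 men and 26 women, adding speakers gives 33 out of 99, and swapping speakers gives 31 out of 92, which is where the figures of 7 and 5 come from.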

I have had several productive exchanges with Front Line Genomics about this issue. They acknowledge the problem and seem to genuinely want to do something about it to reduce gender bias in this field. I'm confident that subsequent conferences that they organize will do an even better job at representing women in speaking roles. It also must be said that they are doing much better than most genomics conferences and 28% is higher than the current proportion of women in senior roles at most genome institutes. Once again, I want to reiterate that I have found Front Line Genomics to be extremely open about this issue, and I genuinely believe that they are receptive to suggestions that might improve the situation in future.

What can be done?

If you are a male scientist who is concerned by the current level of gender bias at genomics conferences, and if you are ever invited to give a talk at such a conference, then you do have the power to help change the situation. If you learn that women speakers are going to be underrepresented, you can withdraw your speaking position and instead make some suggestions of female scientists to take your place. You can also raise this issue when first invited to speak. If conference organizers received responses from all potential speakers saying 'I will only talk if your conference has an unbiased gender ratio of speakers', then this could change the situation dramatically.

Time to conclude this post by saying (once again): I don't attend many conferences, but from now on I won't be attending any if at least 33% of the talks are not by women.

Everything you ever wanted to know about working with RNA-Seq data but were afraid to ask

July 07, 2015 by Keith Bradnam

Do you work with RNA-Seq data?
Do you plan to work with RNA-Seq data?
Have you ever heard of RNA-Seq data?

If the answer to any of these questions is 'yes' (or even 'maybe') then you should definitely check out this fantastic online guide to all things RNA-Seq:

RNA-seqlopedia

The RNA-seqlopedia provides an overview of RNA-seq and of the choices necessary to carry out a successful RNA-seq experiment

Written by Rodger Voelker and Clay Small of the Cresko Lab at the University of Oregon, it is a fantastically detailed, beautifully written resource to walk you through every step of working with RNA-Seq data.

I wish there were more online guides like this! Here's the Table of Contents, with the 'Analysis' section expanded, to give you a feeling for what it covers:

Experimental Design

RNA Preparation

Library Preparation

Sequencing

Analysis
Overview

Initial Processing

Demultiplexing

Removing adapters

Trimming

Kmer Normalization

de Novo Assembly

de Bruijn Graph assembly

Overlap Layout Assembly

Aligning reads to a reference

Aligning to a ref. genome

Aligning to a transcriptome

microRNA Aligners

Short Read Aligners Output

Annotation of transcripts

Differential gene expression

Normalization

Discrete Distribution Models

Continuous Distribution Models

Nonparametric Models

Choice of Analysis Software

References

NCBI working on SAM output from BLAST+ →

July 04, 2015 by Keith Bradnam

A new post on Peter Cock's excellent Blasted Bioinformatics!? blog points to a surprising new feature in the recently released version 2.2.31 of NCBI BLAST+:

The command line help in BLAST+ 2.2.31 only describes output formats 0 to 14. I discovered by accident that [-outfmt option] 15 offers SAM format output

101 questions with a bioinformatician #28: Jonathan Eisen

July 01, 2015 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Jonathan Eisen is a man who almost needs no introduction. As a Professor at UC Davis, he holds appointments at the Genome Center and in the Departments of Medical Microbiology & Immunology and Evolution and Ecology. He also holds an adjunct scientist position at the Department of Energy Joint Genome Institute. If you can't find Jonathan in any one of these locations, you should also remember to try his 'other' office.

His research interests focus on the diversity of microbes and the role microbes play in the health of ecosystems. One of his many projects relating to this is the microbiology of the Built Environment network (microBEnet).

You might think that studying at Harvard, winning a Benjamin Franklin Award, being an extremely vocal advocate of open access publishing, and making your name as an evolutionary biologist working for the University of California in NorCal would be enough to uniquely identify anyone on this planet. However, these are all feats that have been accomplished by Jonathan and his brother Michael.

This leads me to what I consider to be the real highlight of Jonathan's career. Forget about his many accomplishments as a scientist. Also, forget about his key role in popularizing and legitimizing the use of tools such as twitter as an important component of scientific outreach. And definitely forget about his tireless efforts to expose the horrible gender bias present at so many academic meetings. No, the real zenith of Jonathan's career is that he is leading the fight to rid the world of badomics words.

You can find out more about Jonathan by visiting his lab's website, reading his extremely popular Tree of Life blog, or by following him on twitter (@phylogenomics). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I really like the move to improve visualizations as part of bioinformatics workflows. For example see the work of Holly Bik and her work on the Phinch project.

I also really really like the move for more people to be discussing their work and the work of others on social media.

010. What's something that you don't enjoy about current bioinformatics research?

The challenge of long term funding for open source projects.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

A better understanding of ergonomics.

100. What's your all-time favorite piece of bioinformatics software, and why?

MacClade because it got me into informatics and evolution. The developers (the Maddison brothers) were TAs for a class I took as an undergrad (a course by Stephen Jay Gould)

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

- KRB: this is a dash, one of two IUPAC characters that can be used to represent a gap.

DVD bonus materials

KRB: Because of the relative brevity of this interview, I thought that I would also share some answers Jonathan gave me to the questions I also include when asking people to do these interviews (this info sometimes helps me write my introductions):

0111. What is the correct way of describing your current position or title(s)

Guardian of Microbial Diversity.

1000. How long have you been in this role?

My whole life.

1001. In 1–2 sentences, describe what your role entails

I am a secret superhero trying to protect the microbes of the world.

You wait ages to see a tweet about gender bias in science, and then three come along at once!

June 30, 2015 by Keith Bradnam

I uploaded my last post about gender bias in genomics/bioinformatics to my blog late on Sunday night. When I checked my twitter feed on Monday morning I was pleasantly surprised to see how much traffic the post had already generated. I was also amused by the serendipitous nature of seeing the following tweets appear closely together in my timeline:

@TheCrick @BioMickWatson 10 male, 10 female. Nice work! #scidiverse
— Robert Davey (@froggleston) June 26, 2015

Barcelona institute of science and technology is born. And with zero (!) women... pic.twitter.com/hEEOkqH6w7
— Ben Lehner (@BenLehner) June 26, 2015

Where next on gender bias in science @utafrith new chair of @royalsociety Diversity Committee http://t.co/VaWEHcFpt0
— The Royal Society (@royalsociety) June 29, 2015

So well done to the Crick Institute regarding the news that they will have a perfectly equal mix of male and female group leaders. This figure of 50% females would put them top of my list of research institutes in this field.

In contrast, finding out that the new Barcelona Institute of Science and Technology has an all male roster (11 Professors, 18 staff in total) is kind of depressing. This result would put them bottom of my list.

What would be a suitable value for the absolute minimum proportion of female speakers at genomics/bioinformatics conferences?

June 29, 2015 by Keith Bradnam

Photo by ViktorCap/iStock / Getty Images

Background

Hopefully, many people reading this will be aware of Jonathan Eisen's valiant efforts to highlight the Yet Another Mostly Male Meeting (YAMMM) problem; these are conferences where the gender bias is disproportionately skewed towards male speakers. You can see all of Jonathan's YAMMM posts on his blog, and in his latest post he highlights a particularly egregious case: a CSHL meeting on the Evolution of Sequencing where only 7.8% of speakers are women.

In my last ACGT post I looked at how this figure (7.8%) compared to the male/female ratio of senior researchers at 10 different genomics/bioinformatics institutes. Nine out of the ten places that I looked at had a much higher proportion of female scientists. I tried making the point that this suggests that conference organizers have no excuses for not doing a better job at recruiting more female speakers.

But it struck me that my analysis was a bit too shallow, especially as the numbers of researchers in each place differed quite a bit (from 10 to almost 60). So I went back and looked at many more academic institutions and kept track of the absolute numbers of men and women in senior research roles.

Dataset

In total, my updated dataset comprises details from 40 different academic institutes (or centers/departments) that specialize in genomics and/or bioinformatics. The vast majority (33/40) mention 'genome', 'genomics', or 'bioinformatics' in their title (the exceptions to this include the Broad Institute, Cold Spring Harbor Laboratory, and the Wellcome Trust Sanger Institute).

The 40 different institutes represent locations in North America, Europe, Asia, and Australia. In some cases, the named institute represents an umbrella organization connecting researchers in different locations across that country (e.g. the Swiss Institute of Bioinformatics). There is probably an element of selection bias towards research institutes that provided an English-language version of their staff/personnel page (not all non-English websites have translations of every page available).

I think that this dataset contains most of the widely known research institutes that have a dedicated focus on genomics and bioinformatics. The list could probably be further expanded if I targeted more University departments that have a specialization in these fields.

In total I logged the gender of 1,039 people in various 'senior' research roles (e.g. Faculty, 'Group leaders', 'Project leaders', etc.). In many cases I deduced gender from first names, but I looked for images of researchers where this was not easy to do.

I've uploaded the main table of data to Figshare so that others can look at all of the detailed numbers if they so desire.

Results

The most equitable result for any one academic institute with at least 15 senior research scientists was the Duke Department of Biostatistics and Bioinformatics (40.4% female, N=52).
Only two other institutes had figures of 40% or higher: the National Human Genome Research Institute (40% female, N=40) and the Functional Genomics group at the Russian Academy of Medical Sciences (50% female, N=6).
Only 3 out of 40 institutes had a lower proportion of female scientists than at the aforementioned CSHL meeting with 7.8% female speakers.
Discounting the bottom placed institute due to small sample size (0% female, N=4), the next worst place was NBIC, the Netherlands Bioinformatics Center, with only one female Faculty member (4.8% female, N=21).
The overall ratio of female scientists in senior research roles is 23.6% (N=1,039).

Conclusions

It is somewhat depressing to see such a systematic gender bias in my field, where female scientists only account for approximately a quarter of senior research positions. This figure is in line with UK data for the proportion of female professors across all biological sciences (25.1%). The lack of equal gender representation is presumably due to bias and discrimination (conscious or otherwise). In 2014 I conducted a survey to look at gender bias in bioinformaticians across different career stages. This survey had 370 responses (from undergraduate level right through to Deans of academic schools) and showed that there is essentially no gender bias at all stages prior to the level of Faculty (or equivalent). This suggests that there is no shortage of talented women coming through the system; they are just hitting a barrier when it comes to attaining senior research positions…a situation clearly not helped by further discrimination at conferences.

Based on the figures that I show here, one might argue that the figure of approximately 25% could be seen as a minimum target for female participation at conferences. However, such a target would only be encouraging the current levels of discrimination. Far better would be a target that not only attempts to reduce discrimination, but which also better reflects the equal representation of female scientists in post-doc and graduate student positions.

For these reasons I feel that conference organizers — in the fields of genomics and bioinformatics — should be aiming for at least a third of all speakers to be female. Ideally, we want to be doing better than this which is why I suggest this as an absolute minimum target. Depressingly, even this low target is something which most (all?) of the YAMMM meetings described by Jonathan Eisen fail to meet. Of course, such a target should apply for male speakers too, though I'm doubtful that there has ever been a conference in this field where men accounted for less than a third of all speakers.

I don't attend many conferences, but from now on I won't be attending any if at least 33% of the talks are not by women.

Update 2015-06-30: Added link to data for percentage of female professors in UK biological sciences, and clarified that my suggested target figure should also apply to male speakers. I also added a caveat that my method of choosing institutes is biased towards websites written in English (or with English translations available).

How does gender diversity of speakers at CSHL Evolution of Sequencing meeting compare to research institute gender diversity?

June 27, 2015 by Keith Bradnam

Following Jonathan Eisen's recent blog post about the tremendously poor representation of female speakers at the upcoming CSHL Evolution of Sequencing meeting, I was curious about something. Namely, how well represented are female scientists in senior roles at prominent genome centers/institutes?

So I found 10 places which all have at least 10 people listed in senior roles (e.g. Faculty or Project leaders), including our own Genome Center. In all but one case, the gender ratio exceeds the pitiful 8.5% of female speakers at the Evolution of Sequencing meeting.

While it is still depressing that not a single institute has more than 40% of senior roles filled by women, it is a clear sign that there is a great pool of talented female genome scientists out there. Conferences should therefore have no excuses for single-digit percentages for female speakers.

NCBI BLAST+ v2.2.31 released →

June 25, 2015 by Keith Bradnam

NCBI have recently announced the release of v2.2.31 of BLAST+ (hat tip to Torsten Seemann for alerting me to this). Find out more here:

Release notes
Download the binaries (FTP link)

This release is of interest to me as it fixes a bug introduced in v2.2.30 that broke our CEGMA software (which relies on TBLASTN):

Reenabled support for word size 5 in tblastn.

This bug was something that I reported on back in February.