You wait ages to see a tweet about gender bias in science, and then three come along at once!

I uploaded my last post about gender bias in genomics/bioinformatics to my blog late on Sunday night. When I checked my twitter feed on Monday morning I was pleasantly surprised to see how much traffic the post had already generated. I was also amused by the serendipitous nature of seeing the following tweets appear closely together in my timeline:

 
 

So well done to the Crick Institute regarding the news that they will have a perfectly equal mix of male and female group leaders. This figure of 50% females would put them top of my list of research institutes in this field.

In contrast, finding out that the new Barcelona Institute of Science and Technology has an all male roster (11 Professors, 18 staff in total) is kind of depressing. This result would put them bottom of my list.

What would be a suitable value for the absolute minimum proportion of female speakers at genomics/bioinformatics conferences?

Photo by ViktorCap/iStock / Getty Images

Background

Hopefully, many people reading this will be aware of Jonathan Eisen's valiant efforts to highlight the Yet Another Mostly Male Meeting (YAMMM) problem; these are conferences where the gender bias is disproportionately skewed towards male speakers. You can see all of Jonathan's YAMMM posts on his blog, and in his latest post he highlights a particularly egregious case: a CSHL meeting on the Evolution of Sequencing where only 7.8% of speakers are women.

In my last ACGT post I looked at how this figure (7.8%) compared to the male/female ratio of senior researchers at 10 different genomics/bioinformatics institutes. Nine out of the ten places that I looked at had a much higher proportion of female scientists. I tried making the point that this suggests that conference organizers have no excuses for not doing a better job at recruiting more female speakers.

But it struck me that my analysis was a bit too shallow, especially as the numbers of researchers in each place differed quite a bit (from 10 to almost 60). So I went back and looked at many more academic institutions and kept track of the absolute numbers of men and women in senior research roles.

Dataset

In total, my updated dataset comprises details from 40 different academic institutes (or centers/departments) that specialize in genomics and/or bioinformatics. The vast majority (33/40) mention 'genome', 'genomics', or 'bioinformatics' in their title (the exceptions to this include the Broad Institute, Cold Spring Harbor Laboratory, and the Wellcome Trust Sanger Institute).

The 40 different institutes represent locations in North America, Europe, Asia, and Australia. In some cases, the named institute represents an umbrella organization connecting researchers in different locations across that country (e.g. the Swiss Institute of Bioinformatics). There is probably an element of selection bias towards research institutes that provided an English-language version of their staff/personnel page (not all non-English websites have translations of every page available).

I think that this dataset contains most of the widely known research institutes that have a dedicated focus on genomics and bioinformatics. The list could probably be further expanded if I targeted more University departments that have a specialization in these fields.

In total I logged the gender of 1,039 people in various 'senior' research roles (e.g. Faculty, 'Group leaders', 'Project leaders', etc.). In many cases I deduced gender from first names, but I looked for images of researchers where this was not easy to do.

I've uploaded the main table of data to Figshare so that others can look at all of the detailed numbers if they so desire.

Results

  1. The most equitable result for any one academic institute with at least 15 senior research scientists was the Duke Department of Biostatistics and Bioinformatics (40.4% female, N=52)
  2. Only two other institutes had figures of 40% or higher: the National Human Genome Research Institute (40% female, N=40) and the Functional Genomics group at the Russian Academy of Medical Sciences (50% female, N=6).
  3. Only 3 out of 40 institutes had a lower proportion of female scientists than at the aforementioned CSHL meeting with 7.8% female speakers.
  4. Discounting the bottom placed institute due to small sample size (0% female, N=4), the next worst place was NBIC, the Netherlands Bioinformatics Center, with only one female Faculty member (4.8% female, N=21).
  5. The overall ratio of female scientists in senior research roles is 23.6% (N=1,039).

Conclusions

It is somewhat depressing to see such a systematic gender bias in my field, where female scientists only account for approximately a quarter of senior research positions. This figure is in line with UK data for the proportion of female professors across all biological sciences (25.1%). The lack of equal gender representation is presumably due to bias and discrimination (conscious or otherwise). In 2014 I conducted a survey to look at gender bias in bioinformaticians across different career stages. This survey had 370 responses, from undergraduate level right through to Deans of academic schools, and showed that there is essentially no gender bias at any stage prior to the level of Faculty (or equivalent). This suggests that there is no shortage of talented women coming through the system; they are just hitting a barrier when it comes to attaining senior research positions, a situation clearly not helped by further discrimination at conferences.

Based on the figures that I show here, one might argue that the figure of approximately 25% could be seen as a minimum target for female participation at conferences. However, such a target would only be encouraging the current levels of discrimination. Far better would be a target that not only attempts to reduce discrimination, but which also better reflects the equal representation of female scientists in post-doc and graduate student positions.

For these reasons I feel that conference organizers in the fields of genomics and bioinformatics should be aiming for at least a third of all speakers to be female. Ideally, we want to be doing better than this, which is why I suggest this as an absolute minimum target. Depressingly, even this low target is something which most (all?) of the YAMMM meetings described by Jonathan Eisen fail to meet. Of course, such a target should apply for male speakers too, though I'm doubtful that there has ever been a conference in this field where men accounted for less than a third of all speakers.

I don't attend many conferences, but from now on I won't be attending any unless at least 33% of the talks are by women.


Update 2015-06-30: Added link to data for percentage of female professors in UK biological sciences, and clarified that my suggested target figure should also apply to male speakers. I also added a caveat that my method of choosing institutes is biased towards websites written in English (or with English translations available).

How does gender diversity of speakers at CSHL Evolution of Sequencing meeting compare to research institute gender diversity?

Following Jonathan Eisen's recent blog post about the tremendously poor representation of female speakers at the upcoming CSHL Evolution of Sequencing meeting, I was curious about something. Namely, how well represented are female scientists in senior roles at prominent genome centers/institutes?

So I found 10 places which all have at least 10 people listed in senior roles (e.g. Faculty or Project leaders), including our own Genome Center. In all but one case, the gender ratio exceeds the pitiful 8.5% of female speakers at the Evolution of Sequencing meeting.


While it is still depressing that not a single institute has more than 40% of senior roles filled by women, it is a clear sign that there is a great pool of talented female genome scientists out there. Conferences should therefore have no excuses for single-digit percentages for female speakers.

NCBI BLAST+ v2.2.31 released

NCBI have recently announced the release of v2.2.31 of BLAST+ (hat tip to Torsten Seemann for alerting me to this). Find out more here:

This release is of interest to me as it fixes a bug introduced in v2.2.30 that broke our CEGMA software (which relies on TBLASTN):

Reenabled support for word size 5 in tblastn.

This bug was something that I reported on back in February.

Command-line bootcamp: learn the basics of Unix

Here is another contribution that I made to a UC Davis Bioinformatics Workshop that I helped teach last week. Adapting some of our much longer Unix & Perl Primer for Biologists, I made a short bootcamp that aims to teach the basics of the Unix/Linux command-line.

Unlike the Primer material, which was written from the point of view of someone using a Mac, the new bootcamp course is written from the viewpoint of someone using Ubuntu Linux. Also, no example files are needed. The course is entirely self-contained and should take 1–3 hours to work through (depending on your familiarity with Unix).

Download the PDF, view the HTML version, or work with the underlying Markdown file.

Current version: v1.01 — 2015-06-24

Some short slide decks from a recent Bioinformatics Core workshop

Last week I helped teach at a workshop organized by the Bioinformatics Core facility of the UC Davis Genome Center. The workshop was on:

  • Using the Linux Command Line for Analysis of High Throughput Sequence Data

I like that the Bioinformatics Core makes all of their workshop documentation available for free, even if you didn't attend the workshop. So have a look at the docs if you want to learn about genome assembly, RNA-Seq, or the basics of the Unix command-line (these were just some of the topics covered).

Anyway, I tried making some fun slide decks to kick off some topics. They are included below.

 

This bioinformatics lesson is brought to you by the letter 'D'

'D' is for 'Default parameters', 'Danger', and 'Documentation'

 

This bioinformatics lesson is brought to you by the letter 'T'

'T' is for 'Text editors', 'Time', and 'Tab-completion'

 

This bioinformatics lesson is brought to you by the letter 'W'

'W' is for 'Workflows', 'What?', and 'Why?'

Developments in high throughput sequencing – June 2015 edition

If you're at all interested in the latest developments in sequencing technology, then you should be following Lex Nederbragt's In between lines of code blog. In particular you should always take time to read his annual snapshot overview of how the major players are all faring.

This is the fourth edition of this visualisation…As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale.

The 2015 update looks interesting because of the addition of a certain new player!

L50 vs N50: that's another fine mess that bioinformatics got us into

N50 is a statistic that is widely used to describe genome assemblies. It describes an average length of a set of sequences, but the average is not the mean or median length. Rather, it is the length of the sequence that takes the sum length of all sequences (when summing from longest to shortest) past 50% of the total size of the assembly. The reason for using N50, rather than the mean or median length, is something that I've written about before in detail.

The number of sequences evaluated at the point when the sum length exceeds 50% of the assembly size is sometimes referred to as the L50 number. Admittedly, this is somewhat confusing: N50 describes a sequence length whereas L50 describes a number of sequences. This oddity has led to many people inverting the usage of these terms. This doesn't help anyone and leads to confusion and to debate.
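To make the distinction concrete, here is a minimal Python sketch of both statistics as described above. The function name `n50_and_l50` and the toy contig lengths are made up for illustration, and the code assumes that landing on exactly 50% of the assembly size counts as reaching the threshold.

```python
def n50_and_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50 is a sequence *length*; L50 is a *number* of sequences.
    Assumption: reaching exactly 50% of the total counts as crossing it.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence to the shortest, summing as we go.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length, count


# Hypothetical toy assembly: six contigs, 290 bp in total.
toy_lengths = [80, 70, 50, 40, 30, 20]
n50, l50 = n50_and_l50(toy_lengths)
print(f"N50 = {n50} bp, L50 = {l50} sequences")
```

In this toy case the two longest contigs (80 + 70 = 150 bp) take the running sum past half of 290 bp, so N50 is 70 bp (a length) and L50 is 2 (a count).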

I believe that the aforementioned definition of N50 was first used in the 2001 publication of the human genome sequence:

We used a statistic called the ‘N50 length’, defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

I've since had some independent confirmation of this from Deanna Church (@deannachurch):

I also have a vague memory that other genome sequences — that were made available by the Sanger Institute around this time — also included statistics such as N60, N70, N80 etc. (at least I recall seeing these details in README files on an FTP site). Deanna also pointed out that the Celera Human Genome paper (published in Science, also in 2001) describes something that we might call N25 and N90, even though they didn't use these terms in the paper:

More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger.
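The same idea generalizes to any percentage, which is one way to read statistics such as N25, N60, or N90. The sketch below is illustrative only: the function name `nx` and the toy lengths are hypothetical, and it again assumes that hitting the threshold exactly is enough.

```python
def nx(lengths, x):
    """Return the Nx length: the largest length L such that at least
    x% of all bases are in sequences of length L or more.

    Assumption: 0 < x <= 100, and reaching exactly x% counts.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    # Sum from longest to shortest until x% of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length


# Hypothetical toy assembly: six scaffolds, 290 bp in total.
toy_lengths = [80, 70, 50, 40, 30, 20]
for x in (25, 50, 90):
    print(f"N{x} = {nx(toy_lengths, x)} bp")
```

With these toy lengths, N25 is 80 bp, N50 is 70 bp, and N90 is 30 bp. Read this way, the Celera statement amounts to saying that N90 was at least 100,000 bp and N25 was at least 10 million bp.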

I don't know when L50 first started being used to describe lengths, but I would bet it was after 2001. If I'm wrong, please comment below and maybe we can settle this once and for all. Without evidence for an earlier use of L50 to describe lengths, I think people should stick to the 2001 definition of N50 (which I would also argue is the most common definition in use today).

Updated 2015-06-26 - Article includes new evidence from Deanna Church.