We have a winner!

September 29, 2015 by Keith Bradnam

The randomatic_3000 Perl script has chosen a winner:

I will be reaching out to the winner via Twitter, and once I get Vince to sign the book, I will mail it to them. Congratulations!
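For the curious, a random draw like this takes only a few lines of code. The following is a minimal Python sketch, not the actual randomatic_3000 Perl script (which is not shown here). The `pick_winner` function and the placeholder handles are hypothetical, and the sketch assumes each entrant appears once in the list.

```python
import random


def pick_winner(entries, seed=None):
    """Return one entry chosen uniformly at random.

    Passing a seed makes the draw reproducible; leaving it as None
    gives a different draw on each run.
    """
    if not entries:
        raise ValueError("no entries to choose from")
    rng = random.Random(seed)
    return rng.choice(entries)


# Hypothetical placeholder handles standing in for the 65 real entries
entries = [f"@entrant_{i:02d}" for i in range(1, 66)]

winner = pick_winner(entries)
print(f"The winner is {winner}")
```

If some people tweeted more than one answer, you would probably want to remove duplicate handles before drawing, so that everyone has an equal chance.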

A top 10 list of 'Useful Bioinformatics Skills'

September 28, 2015 by Keith Bradnam

The deadline for my competition to win a signed copy of Vince Buffalo's excellent Bioinformatics Data Skills book has now passed. There were 65 entries and later this week I will randomly choose a winner. For the competition I simply asked people to tweet an answer to the following question:

Name a useful bioinformatics skill

I thought I would share some of the entries that people tweeted. In reverse order, here are my ten favorite answers. It was difficult choosing which ones made the cut, and there were many other excellent answers. Thanks to everyone who took part! I hope to announce the winner later this week.

10

This skill may not be so easy to acquire…

@ACGT_blog Knowing @vsbuffalo. #acgt
— Paul Smaldino (@psmaldino) September 14, 2015

9

Two people came up with this suggestion…

Useful bioinformatics skill: Patience... #acgt
— David Joly (@idjoly) September 24, 2015

@ACGT_blog @kbradnam a useful bioinformatics data skill: patience #ACGT
— Dave Tang (@davetang31) September 14, 2015

8

I think this answer also applies to 'scripts you wrote yourself several years ago'…

Useful bioinformatics skill: The ability to understand the scripts written by others. #ACGT
— goutham atla (@Geek_y) September 15, 2015

7

Clouded by the Dark Side, your code is.

@acgt_blog A useful bioinformatics data skill: anger management #acgt
— Neil Saunders (@neilfws) September 14, 2015

6

If you ever come up with some useful code snippet, the chances are that you will want to reuse it at some point.

Keep your own oneliners in a online notebook #ACGT
— genomepandit (@genomepandit) September 14, 2015

5

This was the most popular answer in the competition…

Critical bioinformatics skill: VERSION CONTROL. #ACGT https://t.co/qCnHzbGDJt
— Aaron Barnes (@MicroTolo) September 17, 2015

My entry for ‘useful bioinformatics data skill’ #ACGT competition: version control
— Lex Nederbragt (@lexnederbragt) September 14, 2015

Useful bioinformatics skill: version control repository for every project #ACGT
— Katrina Kutchko (@kutchko) September 23, 2015

@ACGT_blog @vsbuffalo a useful bioinformatics data skill is version control! #ACGT
— Jasmine Dumas (@jasdumas) September 14, 2015

4

Yes, yes, a thousand times yes!

Proper code documentation. #ACGT
— will shoemaker (@shoemakah) September 14, 2015

3

If you ever run into any sort of bioinformatics problem, you can probably assume that someone has suffered from the same problem as you, and that someone else has posted a useful answer online.

useful bioinformatics data skill: read manuals and find solutions on BioStars, SEQanswers, and Twitter #ACGT
— copypasteusa (@copypasteusa) September 17, 2015

2

Two closely related answers, so they can both share the number two spot…

Useful bioinformatics skill: trust nothing without testing #ACGT
— Gitanshu Munjal (@grmunjal) September 15, 2015

#ACGT A very useful bioinformatics skill is that never test any program or any code with huge dataset and always use a subset of data
— upendra devisetty (@upendra_35) September 16, 2015

1

And my favorite answer was one by Bastien Chevreux (@BaCh_mira)…

#ACGT Useful BioInf skill: Be skeptical. Data isn't wrong just because it contradicts "basic textbook knowledge". Nature doesn't read books.
— Bastien Chevreux (@BaCh_mira) September 15, 2015

In bioinformatics it can be good to have some healthy skepticism about the tools and data that you use. Not all genome assemblies are perfect (many are far from perfect), not all gene annotations are correct, and not all tools use default values that will work well with your data. Be skeptical!

Maybe one of these answers will be lucky enough to be chosen by the magical 'Perl-script-of-destiny' (that I still need to write). The winner will hopefully be announced in a day or two.

3 important digital things all scientists should have nowadays

September 25, 2015 by Keith Bradnam

Good advice from Michael Koontz (@_mikoontz):

(1/n) A smart guy once told me there are 3 important digital things all scientists should have nowadays (cc @noamross @davisegsa):
— Michael Koontz (@_mikoontz) September 24, 2015

(2/n) 1) A profile page. This could be your own custom website, a ResearchGate page, a Google Scholar profile, etc.
— Michael Koontz (@_mikoontz) September 24, 2015

(3/n) 2) an ORCID. All science products (e.g., blogs, code) should count for you. I like @carlystrasser's take on it: http://t.co/I8TYM2jR01
— Michael Koontz (@_mikoontz) September 24, 2015

(4/4) 3) An academic Twitter account. Stay current! Stay involved! People get jobs via Twitter connections!
— Michael Koontz (@_mikoontz) September 24, 2015

The second item on the list is something which I wrote about recently.

101 questions with a bioinformatician #34: Katie Pollard

September 24, 2015 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Katie Pollard is a Senior Investigator at Gladstone Institutes and a Professor in the Department of Epidemiology and Biostatistics at UC San Francisco. She is also a Faculty supervisor of a bioinformatics core that provides collaborative support for high-throughput biology across the UCSF campus.

Katie's work involves the development of statistical and computational methods for the analysis of large genomic datasets, with a particular interest in genome evolution and identifying sequences that differ significantly between or within species. Her work on the chimpanzee genome has led to lots of coverage by mainstream media, and if you want to know more about this topic, you should definitely watch the What makes us human? talk that she gave at the California Academy of Sciences (video is online here).

You can find out more about Katie by visiting her lab's website. And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

Growth in new sources of data, such as from citizen science and electronic medical records, as well as emerging technologies, like single cell imaging and genomics platforms.

010. What's something that you don't enjoy about current bioinformatics research?

Computing in the cloud is promising, but it is still too expensive to store massive data for ongoing active compute and too slow to move data into the cloud and out again for each analysis.

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Keep taking math classes.

100. What's your all-time favorite piece of bioinformatics software, and why?

The UC Santa Cruz Genome Browser: you cannot underestimate the importance of looking at raw data, and the browser provides a way to visualize a lot of data for every position of the genome. It is easy to check if your assumptions are right or not.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

S for strong.

An automated attempt to identify duplicated software names

September 22, 2015 by Keith Bradnam

From time to time I've been pointing out instances of duplicated software names in bioinformatics. I assume that many people reuse the name of an existing tool simply because they haven't first checked — or checked thoroughly — to see if someone else has already published a piece of software with the same name.

I am not alone in my concern over this issue and Neil Saunders (@neilfws on Twitter) has gone one step further and written some code to try to track down instances of duplicated names. He recently wrote a post on his What You're Doing Is Rather Desperate blog to explain more:

Searching for duplicate resource names in PMC article titles

In the article, he describes how he used a Ruby script to parse the titles of articles downloaded from PubMed Central and then fed this information into an R script to identify examples of duplicated software names. The result is a long list of duplicated software names (available on GitHub).
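To give a flavor of the general idea, here is a minimal Python sketch. It is not Neil's Ruby and R code. It assumes that a resource name is whatever short token appears before the first colon in a title, which is only a rough heuristic. The function names and the two 'FAST' titles are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a resource name is a single token at the start of the title,
# followed by a colon and then more text (e.g. "ATGme: Open-source web ...").
NAME_PATTERN = re.compile(r"^\s*([A-Za-z0-9][-\w.+]*)\s*:\s+\S")


def extract_name(title):
    """Return the candidate resource name from a title, or None."""
    match = NAME_PATTERN.match(title)
    return match.group(1) if match else None


def find_duplicates(titles):
    """Group titles by upper-cased name; keep names seen more than once."""
    by_name = defaultdict(list)
    for title in titles:
        name = extract_name(title)
        if name is not None:
            by_name[name.upper()].append(title)
    return {name: hits for name, hits in by_name.items() if len(hits) > 1}


# The two 'FAST' titles below are hypothetical examples, not real papers
titles = [
    "FAST: a made-up alignment tool",
    "Fast: another made-up tool with the same name",
    "Galaxy: A platform for interactive large-scale genome analysis",
    "Searching for duplicate resource names in PMC article titles",
]

for name, hits in find_duplicates(titles).items():
    print(name, len(hits))
```

A heuristic like this will also pick up titles that start with an ordinary word followed by a colon, so any real list of hits would still need checking by eye.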

I'm not surprised to learn that generic names like 'FAST' and 'PAIR' appear on this list. However, I was surprised to see that in the same year (2011), two independent publications both decided to name their software 'COMBREX':

BUSCO — the tool that will hopefully replace CEGMA — now has a plant-specific dataset

September 21, 2015 by Keith Bradnam

With the demise of CEGMA I have previously pointed people towards BUSCO. This tool replicates most of what CEGMA did but seems to be much faster and requires fewer dependencies. Most importantly, it is also based on a much more up-to-date set of orthologous genes (OrthoDB) compared to the aging KOGs database that CEGMA used.

The full publication of BUSCO appeared today in the journal Bioinformatics. I still haven't tried using the tool, but one critique that I have seen by others is that there are no plant-specific datasets of conserved genes to use with BUSCO. This appears to be something that the developers are aware of, because the BUSCO website now indicates that a plant dataset is available (though you have to request it).

Not to be confused with this website… →

September 21, 2015 by Keith Bradnam

I did a double-take when I first saw the title of this paper:

ATGme: Open-source web application for rare codon identification and custom DNA sequence optimization

Anatomy of a mainstream science piece →

September 20, 2015 by Keith Bradnam

A great blog post by Ewan Birney that describes the process of writing a commentary piece for the Guardian newspaper, and which also discusses the need for more involvement of scientists in the public discussion of science. I like the concluding remarks:

As practicing scientists, we need to continue laying the groundwork started long ago by many others…engaging consistently and non-judgmentally with our communities and policymakers about our work. There is a real task ahead of us in providing an accessible way for people to digest this information. We should take every opportunity to communicate on every level, from the most basic to state of the art. Only then can society really use the hard-earned information gleaned from genetics appropriately, and for the greater good.

13 questions you may have about using Galaxy

September 16, 2015 by Keith Bradnam

These slides are adapted from a talk I gave at a Galaxy training workshop this week. This workshop is organized and taught by the UC Davis Bioinformatics Core. More details about this workshop are available online. If you want to know what Galaxy is…well that's one of the 13 questions!

13 questions you might have about galaxy from Keith Bradnam

A long time ago in a Galaxy browser far, far away…

September 16, 2015 by Keith Bradnam

Today marks exactly 10 years since the first publication that describes the popular Galaxy platform:

Galaxy: A platform for interactive large-scale genome analysis

The tremendous success of the Galaxy project can be summarized by the following two graphs (taken from the statistics page on the Galaxy Wiki):

New user registrations on Galaxy main site

Publications that reference or mention Galaxy

I'm sure that the Galaxy team have a more official date to use as their anniversary, but I'll mark the 10th year since their initial publication to say congratulations and 'Happy Birthday'! I hope that Galaxy can emerge unscathed from those difficult 'teenage years' that lie just around the corner!