Today a colleague of mine asked the following question:

" Assuming I need to build from 0, a chromosome of a fish, with short reads but no other reference whatsoever [de novo assembly]:

  • how much work is that?
  • Is there a generic software (like SAMtools) that will align the reads in a scaffold one can use?
  • Basically, given a reasonably clear pipeline in terms of software, is it still blood sweat and tears or is it just a matter of getting it on a cluster?"

Very grateful for any suggestions, sources of information, software etc.

If you want to use only sequencing techniques, you have a problem.

To get a feeling of what kind of results to expect, consider this paper published recently in Nature Genetics. They tried to assemble a whale genome de novo. They had 7 (!) paired-end libraries with different insert lengths ranging from 170bp to 20kb. Read lengths were mostly 100bp and in some cases 49bp. Average genome coverage was 91x.
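To put a figure like 91x in perspective, here is a minimal sketch of the usual coverage arithmetic (coverage = number of reads × read length / genome size). The function names and the 1 Gb genome size are my own assumptions for illustration; they are not taken from the paper.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical 1 Gb genome, chosen only for illustration.
    genome_size = 1_000_000_000
    n = reads_needed(target_coverage=91, read_length=100, genome_size=genome_size)
    print(f"100 bp reads needed for 91x: {n:,}")
    print(f"check: {coverage(n, 100, genome_size):.1f}x")
```

Note that this is only the average depth. It says nothing about how well repeats are resolved, which is what actually limits the assembly.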

Even with this extensive data set, they ended up with over 100,000 contigs.

So you really can't assemble a complex (i.e. large) genome to high quality from short-read sequencing data alone using the standard techniques.
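Besides the raw contig count, a common way to summarise how fragmented an assembly is would be N50: the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal sketch follows; the function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; N50 is the length of the contig at which the running
    total first reaches half of the total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Toy contig lengths, not real data.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

An assembly made of 100,000+ contigs will typically have a small N50 compared with chromosome-scale sequences, which is why it is of limited use as a reference.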

That said, recent approaches such as libraries with much longer read lengths (here) or the use of Hi-C data (here and here) do offer a way of getting high-quality complex genome assemblies using only sequencing data.

You can try looking around, which is like stackexchange, but for bioinformatics.

Velvet is one example of a de novo assembler.

But 30 bp is really short, and animals have big genomes (not as tough as lots of plants and fungi, but still tough).

What you would get is a bazillion short contigs. It would not be pretty.
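To see why short reads fragment the assembly, here is a toy sketch of the de Bruijn graph idea that assemblers like Velvet are built on. Every repeat longer than the k-mer size creates a branching node, and contigs have to stop at branches. The function name and the toy sequence are my own. The sketch ignores reverse complements, sequencing errors and coverage, so treat it as an illustration only, not as what Velvet actually does internally.

```python
from collections import defaultdict


def count_branching_nodes(sequence, k):
    """Count (k-1)-mer nodes with more than one distinct in- or out-edge.

    Nodes are (k-1)-mers, edges are the distinct k-mers of the sequence.
    A node with several successors or predecessors is an ambiguity the
    assembler cannot resolve with k-mers of this size.
    """
    kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for kmer in kmers:
        out_edges[kmer[:-1]].add(kmer)  # edge leaves its (k-1)-mer prefix
        in_edges[kmer[1:]].add(kmer)    # edge enters its (k-1)-mer suffix
    nodes = set(out_edges) | set(in_edges)
    return sum(1 for n in nodes
               if len(out_edges[n]) > 1 or len(in_edges[n]) > 1)


if __name__ == "__main__":
    # Toy sequence in which "ACG" occurs twice in different contexts.
    toy = "TTACGAGACGCC"
    for k in (4, 6):
        print(k, count_branching_nodes(toy, k))
```

With a small k the repeated "ACG" becomes a branch point. With a k longer than the repeat the path is unambiguous. Real genomes contain repeats far longer than 30 bp, so with reads that short the graph is full of branches and you end up with a great many short contigs.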

I really like the Geneious software suite. It is multithreaded and makes good use of your computer's performance. Even complicated things like de novo assembly are very intuitive.