Assembling whole phage genomes

Using sequencing data to assemble complete or near-complete phage genomes.

Slides: Assembling genomes

Slide material from the original manual is not embedded on this public page.

Activity 1: Command line basics

Rationale

We will learn the basics of how to use the terminal to give direct instructions to the computer.

Activity 1

  • Follow along and take notes on your printed cheat sheet:
  1. Open the Terminal Preview app
    • (Make the window half of the size of the screen so that you can see the Desktop)
  2. Write:
    bash 
  3. Write
    ls 
    • What shows up in the Terminal?
    • Take notes on your cheat sheet.
  4. Write:
    mkdir myfolder 
    • What showed up in the Desktop?
    • Open myfolder/ by clicking on it on the Desktop.
    • Take notes on your cheat sheet
  5. Write:
    cd myfolder 
    • What changed in the terminal?
  6. Write:
    ls 
    • What happened? Why?
  7. Write
    mkdir myfolder2 
  8. Write
    ls 
  9. Write
    touch hello.txt 
  10. Open hello.txt on the Sublime Text app.
  11. Inside Sublime Text, write: “Hello world!”
  12. Save the file by going to File > Save. You can close Sublime Text now.
  13. Go back to the Terminal Preview
  14. Write
    less hello.txt 
  15. Exit the less view by typing the letter:
    q 
  16. Write
    rm hello.txt 
  17. Write:
    ls 
    • What happened to hello.txt?
  18. Write:
    cd .. 
    • Where did we go?
  19. Write:
    ls 
  20. You can close the Terminal Preview app now.
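
The commands from Activity 1 can also be replayed as one short script. This is only a sketch under two assumptions that are not part of the activity: it works in a throwaway directory created with mktemp, so nothing on the Desktop is touched, and it writes the text with echo instead of Sublime Text.

```shell
# Work in a scratch directory so the Desktop is left untouched (assumption).
scratch=$(mktemp -d)
cd "$scratch"

mkdir myfolder                     # create a new folder
cd myfolder                        # move into it
mkdir myfolder2                    # create a folder inside myfolder
touch hello.txt                    # create an empty file
echo "Hello world!" > hello.txt    # stands in for typing in Sublime Text
before=$(ls)                       # lists hello.txt and myfolder2
cat hello.txt                      # print the file (less would open a pager)
rm hello.txt                       # delete the file
after=$(ls)                        # only myfolder2 is left
cd ..                              # go back up one level
ls                                 # shows myfolder again
```

Comparing the two listings stored in before and after shows the effect of rm: the file is gone for good, it does not go to a trash folder.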

Activity 2: Exploring your read files from the command line

Rationale

We will use sequencing read files and learn how to inspect them in the terminal.

Activity 2

  1. Put the read files for the exercise in a folder on your desktop.
    • The folder should contain the paired read files for your sample.
    • If the folder is zipped, open it first.
    • Keep the folder name short and easy to type.
  2. Open the Terminal Preview app.
  3. Write
    ls 
  4. Write
    cd <name of your folder> 
    • Replace <name of your folder> with the name of your own folder.
    • Text between angle brackets, < like this >, is a placeholder: type the name of YOUR folder or file instead, without the brackets.
  5. Write
    ls 
    • Do you see your read files?
    • There should be two, one that ends with _R1.sub.fastq.gz and another that ends with _R2.sub.fastq.gz
  6. The read files are compressed. To make the first one human-readable, write:
    gunzip -c <name_of_your_reads>_R1.sub.fastq.gz > reads_R1.fastq 
    • (You don’t need to remember this one.)
  7. To see it, write:
    less reads_R1.fastq 
    • What do you see?
    • How long is the first read?
    • You can exit this view by writing q
  8. To know how many reads are in the file, write:
    echo $(cat reads_R1.fastq|wc -l)/4|bc 
    • This is another “special command”. It counts the lines in the file and divides by 4, because each read takes up four lines in a FASTQ file. (You don’t need to remember this one.)
    • How many reads are in one of your read files?
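
Steps 6 to 8 can be sketched as two small helper functions that read the compressed file directly. A FASTQ file stores each read as four lines (a header starting with @, the sequence, a + separator, and a quality string), which is why the line count is divided by 4. The file name toy_R1.fastq, the two reads, and the function names are made up for illustration.

```shell
# Build a tiny, made-up FASTQ file with two reads (illustration only).
workdir=$(mktemp -d)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nGGCCTTAA\n+\nIIIIIIII\n' > "$workdir/toy_R1.fastq"
gzip "$workdir/toy_R1.fastq"       # produces toy_R1.fastq.gz

# Count reads: total lines divided by 4, without writing an unzipped copy.
count_reads() {
    lines=$(gunzip -c "$1" | wc -l)
    echo $((lines / 4))
}

# Length of the first read: line 2 of the file holds its sequence.
first_read_length() {
    gunzip -c "$1" | awk 'NR == 2 { print length($0); exit }'
}

count_reads "$workdir/toy_R1.fastq.gz"         # prints 2
first_read_length "$workdir/toy_R1.fastq.gz"   # prints 10
```

With real data you would pass your own <name_of_your_reads>_R1.sub.fastq.gz file to the same functions.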

Activity 3: Assembling your genome with Unicycler

Rationale

We will use a program called Unicycler to assemble our reads into a whole genome.

Activity 3

  1. Go to the class computer
  2. Write
    ls 
  3. Locate your folder, and enter the folder that contains your reads
    cd <name of your folder> 
  4. Then write:
    unicycler -h 
    • This is the software’s “help”. It tells us how to use it.
    • Scroll up and down to see the options.
  5. Now we will run the assembly. Write in a single line:
    unicycler -1 <name_of_your_reads>_R1.sub.fastq.gz -2 <name_of_your_reads>_R2.sub.fastq.gz -o assembly 
    • What are the options -1 and -2?
    • What is the -o option?
  6. Let your assembly run.
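
Before starting a long assembly, it can help to confirm that both read files exist and hold the same number of reads, since the two files of a pair should match. The sketch below assumes a hypothetical helper named check_pair and made-up toy files. It prints the unicycler command from step 5 instead of running it, so it also works on a computer where Unicycler is not installed.

```shell
# Hypothetical helper: verify a read pair, then print the assembly command.
check_pair() {
    r1=$1
    r2=$2
    for f in "$r1" "$r2"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
    done
    # Each read takes four lines, so reads = lines / 4.
    n1=$(( $(gunzip -c "$r1" | wc -l) / 4 ))
    n2=$(( $(gunzip -c "$r2" | wc -l) / 4 ))
    if [ "$n1" -ne "$n2" ]; then
        echo "read counts differ: $n1 vs $n2" >&2
        return 1
    fi
    # Same command as in step 5; printed here, not executed.
    echo "unicycler -1 $r1 -2 $r2 -o assembly"
}

# Toy pair with one read in each file (made-up data).
demo=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n' | gzip > "$demo/toy_R1.sub.fastq.gz"
printf '@r1/2\nTTGA\n+\nIIII\n' | gzip > "$demo/toy_R2.sub.fastq.gz"
check_pair "$demo/toy_R1.sub.fastq.gz" "$demo/toy_R2.sub.fastq.gz"
```

If the check passes, the printed line can be copied and run as the actual assembly command.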

Activity 4: Looking at your assembled genome

Rationale

Once the assembler has run, you should have larger pieces of DNA sequence, which hopefully correspond to a whole genome sequence.

Activity 4

  1. Use the assembly file produced for this exercise and keep it with your read files.
  2. Open the Terminal Preview app
  3. Write
    bash 
  4. Go to the folder on the Desktop that contains your reads:
    cd <name of your folder> 
  5. Go to the folder that contains the results from the assembly:
    cd assembly 
  6. Write:
    ls 
    • There, you will find many files, but the one we care about is assembly.fasta
  7. The assembly might contain more than one large piece of DNA. Let’s check how many it has:
    grep -c "^>" assembly.fasta 
    • How many pieces of DNA does your assembly contain?
    • (This is another special command you don’t need to remember.)
  8. To extract only the first piece, write:
    awk '/^>/ {n++} n>1 {exit} 1' assembly.fasta > contig1.fasta 
    • (This is another special command you don’t need to remember.)
  9. Write:
    less contig1.fasta 
    • How long is the assembly you got?
  10. Exit this by typing
    q 
  11. Open the contig1.fasta file in the Sublime Text app.
  12. Open the BLAST website
  13. Copy the sequence inside contig1.fasta and paste it into the Query box of the BLAST website.
  14. Search by clicking “BLAST” and wait for the results.
  15. Look at the 5 best hits and make a note of:
    • Description
    • Query Cover
    • Percent identity (the “Per. Ident” column)
    • In your opinion, how similar is your phage to the previously known phages?
  16. Select one of the hits, and go to its genome record. Just like yesterday, try to find:
    • How big is the genome?
    • Is the genome DNA or RNA, linear or circular?
    • What type of phage is it? (Siphoviridae, Podoviridae, Myoviridae, other?)
    • What is the bacterial host?
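
For steps 7 to 9 of Activity 4, the number of pieces and the length of each one can also be computed directly from a FASTA file instead of reading through it in less. This is a sketch on a made-up two-piece file. The headers here are invented, the function name contig_lengths is hypothetical, and the headers in a real assembly.fasta will look different.

```shell
# Made-up FASTA with two pieces; the first sequence wraps over two lines.
fa=$(mktemp)
printf '>1\nACGTACGT\nACGT\n>2\nGGCC\n' > "$fa"

# Number of pieces: count the header lines, as in step 7.
grep -c '^>' "$fa"                 # prints 2

# Length of each piece: add up the sequence lines under each header.
contig_lengths() {
    awk '/^>/ { if (name != "") print name, len
                name = substr($0, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name, len }' "$1"
}
contig_lengths "$fa"               # prints "1 12" then "2 4"

# First piece only, as in step 8: stop when the second header appears.
awk '/^>/ { n++ } n > 1 { exit } 1' "$fa"
```

Run on your own assembly.fasta, contig_lengths gives one line per piece, which makes it easy to see which piece is large enough to be a whole phage genome.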