Database Search ->

Term Project part 1 - Your Data

Microbiological data

For each of your isolates, put together a summary describing everything you know about them microbiologically, and turn this in with your project:

  • Source - where did the sample the organism was isolated from come from?
  • Media & growth conditions
  • What these growth conditions tell you about the organism
  • Colony morphology
  • Cellular morphology
  • What these tell you about the organism

As always, the more details and information you can provide, the better. You will need all of this information at the end, to see if the phylotype of the organism(s) makes sense.

Data files

Images of the gels showing your PCR products:

Here are your sequence files:

Didn't get a usable sequence? Here are some from last year - choose one and go!

Sequencing data is listed by your PCR reaction numbers. All file names start with a sequencing run number, such as "6A_", followed by a machine sample number, followed by the PCR reaction number. The filenames also include the primer (_515Fshort), an obscure sample code (e.g. _H01), and a file type suffix (.ab1, .pdf, .seq).

In other words, file 6A_005_72_515Fshort_G08.pdf is the PDF file of the data for sample 72.
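If you have many files to sort through, the naming scheme above can be taken apart with a short script. This is only a sketch: it assumes every file follows the underscore-separated layout of the example 6A_005_72_515Fshort_G08.pdf, and the names `SeqFileName` and `parse_seq_filename` are made up for illustration.

```python
import re
from typing import NamedTuple


class SeqFileName(NamedTuple):
    """Parts of a sequencing file name (hypothetical helper type)."""
    run: str             # sequencing run number, e.g. "6A"
    machine_sample: str  # machine sample number, e.g. "005"
    pcr_reaction: str    # your PCR reaction number, e.g. "72"
    primer: str          # sequencing primer, e.g. "515Fshort"
    sample_code: str     # the obscure sample code, e.g. "G08"
    file_type: str       # "ab1", "pdf" or "seq"


# Assumed layout: run_machine_pcr_primer_code.suffix
_PATTERN = re.compile(
    r"^(?P<run>[^_]+)_(?P<machine_sample>\d+)_(?P<pcr_reaction>[^_]+)"
    r"_(?P<primer>[^_]+)_(?P<sample_code>[^_.]+)\.(?P<file_type>ab1|pdf|seq)$"
)


def parse_seq_filename(name: str) -> SeqFileName:
    """Split a file name into its parts, or raise ValueError if it does not fit."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name layout: {name!r}")
    return SeqFileName(**match.groupdict())


info = parse_seq_filename("6A_005_72_515Fshort_G08.pdf")
print(info.pcr_reaction, info.file_type)  # prints: 72 pdf
```

A file name that does not fit the assumed layout raises an error rather than being guessed at, so check any such file by eye.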

All of our samples that contained a visible product of the right size were sent for sequencing, whether they seemed good enough to provide data or not.

Download your data files and save them with their .pdf or .seq suffix. Get both the .pdf and the .seq file for each of your reactions, whether they're good, bad, or weak.

NOTE: If you wish, you can also download the original .ab1 files that contain these tracings in raw form. These can be viewed and manipulated in any of several free programs: 4Peaks (Mac - this is what I use), Chromas (PC), BioEdit (PC - this is also a great alignment editor), or TracerView (Mac, PC, or various Unix flavors).

Where do these sequences come from?

The DNA you purified from your PCR reaction and some oligonucleotide primer (515Fshort - a shorter version of the forward primer used in the PCR reaction) were sent to Eton for sequencing. A few days later they sent back the sequence data by email. The sequences were downloaded and posted on this page for you. The .pdf files were generated by "printing" pdf files from the .ab1 files using 4Peaks. New batches of PCR products are being sent out each week as they are generated by the students.

Your data

You can view your sequencing data by opening the .pdf files you downloaded. Look carefully at your data. How does it look? Here is an example section from the beginning of a good sequence:

good sequence

At the top is the sequence as the machine interprets it, from left to right, numbered just beneath. This example is from the start of the sequence - notice the sequence numbering "10", then "20" below the printed sequence. Below both the interpreted sequence and the numbering is the raw data from the sequencing machine.

Some sequences don't start off this cleanly; the sequence only becomes clear after 20-30 bases:

The sequence reads directly from the printout. With luck, the first 500 bases of sequence (after perhaps a couple of dozen bases if it has a rough start) will be reliable. Somewhere between 500 and 800, the sequence quality will degrade to the point of unreliability:

As you can see, the peaks aren't as distinct this far out. Sometimes it's clear whether there are one or two bases in a peak (e.g. the "AA" at 828/829), sometimes not (e.g. how many "A"s at 835?), and the machine reader can clearly miss some (looks like probably "CC" at 846 rather than "C").

At the end of the sequence, the machine is just guessing - the sequence it spits out is meaningless, and then the reaction runs off the end of the PCR product at about 1000 and flatlines.

If your sequence comes from more than one template, i.e. your culture wasn't pure or the PCR reaction was contaminated, you will have sequences in which some peaks look good (where both sequences have the same base at that position) and some are two peaks in the same place (where the two sequences differ):

mixed sequence data

If one of the sequences is much stronger than the other, this is no problem; the extra peak will be small compared to the main peak, and the machine can correctly read the stronger sequence. If they are close to the same strength, the machine will not correctly read either sequence. If the two sequences are from very closely related organisms, these double peaks may be sporadic, and concentrated in the most variable regions of the rRNA. If they are from distantly related organisms, the double peaks will be more common, and as soon as the two sequences have a difference in length (an insertion/deletion relative to each other), they will be out of sync and most of the peaks will be twinned.
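To see why a single insertion is so much worse than a single base change, here is a toy model of two templates being read on top of each other. It is only a sketch: the template is a made-up repeating sequence, real peak heights are not modeled, and the function name `fraction_double_peaks` is invented for this illustration. A "double peak" is simply a position where the two sequences disagree.

```python
def fraction_double_peaks(seq_a: str, seq_b: str) -> float:
    """Fraction of positions at which two superimposed reads disagree.

    Only the overlapping part (the shorter length) is compared.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / n


# Made-up 32-base template, not real rRNA sequence.
template = "ACGT" * 8

# Second template differing by one base change at position 11 (G -> A).
substituted = template[:10] + "A" + template[11:]

# Second template with one extra base inserted before position 11.
inserted = template[:10] + "T" + template[10:]

print(fraction_double_peaks(template, substituted))  # one double peak in 32
print(fraction_double_peaks(template, inserted))     # out of sync from the insertion onward
```

With the base change, only one position disagrees. With the insertion, every position from the insertion onward is shifted by one, so in this toy example all of the remaining positions disagree.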

Print out a copy of your sequencing data (the .pdf file); you'll need this to turn in with your Term Project. Scrutinize the quality of your sequence carefully, especially noting where the reliable sequence begins and ends. Highlight the region of the sequence you think is reliable.

Now open the .seq file in a text editor (Notepad, Word, TextEdit, whatever). Delete the "bad" sequence from each end of the .seq file, and add the following header as the top line of the file:


... and SAVE this as "unknown.txt". Be SURE it is saved as plain text, not "rich text" (.rtf) or a Word document (.doc).

Your sequence text file should look something like this:


Print out a copy of this file. This is the data you should use for all of your analysis.
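If you would rather not trim by hand, the same steps can be sketched in a few lines of Python. This is an illustration, not a required method. It assumes the .seq file holds only the bases (possibly broken over several lines), that you have read the first and last reliable positions off your printout, and that you supply the header line yourself exactly as given in the instructions. The function names `trim_sequence` and `save_plain_text` are made up here.

```python
from pathlib import Path


def trim_sequence(raw: str, first_good: int, last_good: int) -> str:
    """Keep bases first_good..last_good, counted from 1 as on the printout."""
    bases = "".join(raw.split())  # drop line breaks and spaces
    if not 1 <= first_good <= last_good <= len(bases):
        raise ValueError("need 1 <= first_good <= last_good <= sequence length")
    return bases[first_good - 1:last_good]


def save_plain_text(path: str, header: str, sequence: str, width: int = 60) -> None:
    """Write the header line, then the sequence in lines of `width` bases."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    # ASCII encoding keeps the file plain text, with no rich-text markup.
    Path(path).write_text("\n".join([header, *lines]) + "\n", encoding="ascii")


# Made-up example: positions 5 through 18 were judged reliable.
raw = "NNNNACGTACGTTTGACCNNNN"
trimmed = trim_sequence(raw, 5, 18)
print(trimmed)  # prints: ACGTACGTTTGACC
```

To write the file, call `save_plain_text("unknown.txt", header, trimmed)` with `header` set to the header line from the instructions. Compare the result against your highlighted printout before using it.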

Decision time


If any of your sequences are good, that's great. You may even have multiple good sequences - if so, use them all!

If you have a sequence from a mixed template, or marginal sequence data, use it only if it looks pretty good and if you don't have a clean sequence you can use. Marginal data would include very weak signal...

... sequence that runs out very quickly (200-500)...

... or sequence with a lot of background or "N"s :
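One rough way to put a number on "a lot of Ns" is to count them in your .seq text. The sketch below is only a quick check under simple assumptions: it treats the file contents as a plain string of bases, the function names `n_fraction` and `longest_clean_run` are made up, and it sets no cutoff, since what counts as too many is your judgment call after looking at the tracing.

```python
def n_fraction(sequence: str) -> float:
    """Fraction of base calls that are N (uncalled)."""
    bases = "".join(sequence.split()).upper()
    if not bases:
        return 0.0
    return bases.count("N") / len(bases)


def longest_clean_run(sequence: str) -> int:
    """Length of the longest stretch of sequence containing no N."""
    bases = "".join(sequence.split()).upper()
    return max(len(chunk) for chunk in bases.split("N"))


# Made-up example read: 8 base calls, 2 of them N.
example = "ACGTNNAC"
print(n_fraction(example))         # prints: 0.25
print(longest_clean_run(example))  # prints: 4
```

These numbers only summarize the machine's base calls; they say nothing about weak signal or background, which you can judge only from the .pdf tracing.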

No usable sequence data?

Some of you (only a few) will not get any PCR products from any of your reactions after purification. Others with PCR products will have failed to get any good sequence data - or any data at all (flatlines). If none of your sequences yielded usable data, and if you have a friend in the class who has more than one good sequence, then your best bet is to ask him or her if you can use one of their sequences - this way you get to do one and they do the rest of theirs. Otherwise, use one of the sequences from a previous year (there is a link above with the other data).
