Tuesday, November 9, 2010

some science for ya

A quick word about how this sequencing thing works, then. A strand of DNA is made up of four different nucleotides, A, C, T, and G. They're mixed together pretty evenly, with a slight excess of A's and T's. When there's a stretch of one nucleotide repeated, say 8 A's in a row, that can cause problems when you want to replicate the strand. The polymerase (the enzyme that makes new DNA) can lose count and add only 7 before going on to whatever comes next. Isolated polymerases in the laboratory make lots of mistakes of that kind. The replication machinery in our cells is much better at it: it includes the polymerase plus dozens of other proteins, in an enormous complex that proofreads and makes sure the new strand is faithful to the old one being copied. Even so, these 'homopolymer' sequences are often the source of mutations. A mistake is made, a base lost, and everything that comes after is out of phase.
So when I'm looking for mutations in a gene to tell a family where their cancer risk comes from, I pay special attention to these homopolymers.
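To make "out of phase" concrete, here is a minimal Python sketch. The sequence is invented purely for illustration (it is not from any real gene), and `codons` is a hypothetical helper name. It chops a sequence into three-base codons, first with a run of 8 A's and then with one A lost.

```python
def codons(seq):
    """Split a sequence into consecutive three-base codons, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# Invented sequences: identical except for the length of the A run.
original = "ATGGC" + "A" * 8 + "CTGGATCG"   # homopolymer of 8 A's
mutant = "ATGGC" + "A" * 7 + "CTGGATCG"     # one A lost during copying

print(codons(original))
print(codons(mutant))
```

The codons up to and through the run still match, but every codon after it is read from a shifted position.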
Our new sequencer reads sequence using a trick: each base incorporated into the strand being synthesized emits a certain amount of light. One base, one unit of light. Two bases of the same kind, two light units. The counting is pretty good through 5 bases, but once you get up to 7 or 8, it's hard to tell exactly how many units you have. With ten identical bases in a row, the machine doesn't have time to get through them all in a single cycle, so you just can't read runs of 10 or more.
Our sequencer also reads each molecule of DNA that you give it one molecule at a time. Each one is called a 'read'. The software lines up the reads against the reference sequence and tells you how many of what kind of bases you have from start to finish.
Let me show you its counting problem.
The software knows that you have to have a whole number of bases. 7 or 8. No such thing as 7.6 bases. So it takes all of the reads with intermediate values for the number of bases in a series, and rounds off to the nearest whole base.
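The rounding step can be sketched in a few lines of Python. This is only my guess at the general idea, not the vendor's actual code. The signal values are made up, and `call_lengths` is a hypothetical name.

```python
from collections import Counter

def call_lengths(signals):
    """Round each read's fractional run length to the nearest whole base
    and return the fraction of reads assigned to each length."""
    calls = Counter(int(s + 0.5) for s in signals)  # round half up
    total = len(signals)
    return {length: count / total for length, count in sorted(calls.items())}

# Made-up per-read estimates of the homopolymer length, in "bases".
reads = [7.9, 8.1, 8.0, 7.6, 8.2, 7.4, 8.3, 7.8, 8.0, 7.7]
print(call_lengths(reads))
```

Once the rounding is done, the fractional values are gone, and all that is left is a percentage per whole number of bases.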
With data like this, no problem. The dark bars are what the furnished software says, and the light bars come from our own analysis of the raw data, admitting fractions of bases. 87% of the data rounds off to 8 bases, and that's great.
But most of the samples come out like this. The furnished software says about 45% of the sample has 7 bases, and another 39% has 8. If all I know is these two numbers, I'm thinking there's a deletion of a base on one of my patient's two copies of this gene. But look, there's clearly (well, okay) a single population of values, centered at 7.4 bases. This sample does not have a mutation; it has a bunch of intermediate data that's been incorrectly interpreted. I analyzed this one using a different method, and there is definitely no mutation.
So what about this?
This is just exactly what I expect from a real sample with a deletion of one base on one of her two copies of the gene. Ironically, the commercial software gives me exactly the same values as the preceding example! The second experiment to confirm or reject the proposed mutation is underway, but it looks pretty good. Two populations of values. The difference between the peaks is not exactly one base (it's 0.8), but it's pretty close. If this sample doesn't have a mutation, we'll be having some serious talks about the technique. I have some other examples of real deletion mutations, and they look just like this.
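Here is a small Python sketch of why the rounded numbers alone can't separate these two cases. Both data sets are invented for illustration. The "fraction near the boundary" measure is just one simple way to look at the trough, an assumption of mine rather than the analysis used for the graphs.

```python
from collections import Counter

def rounded_calls(signals):
    """Count reads per whole-base call after rounding half up."""
    return Counter(int(s + 0.5) for s in signals)

def fraction_near_boundary(signals, boundary=7.5, half_width=0.25):
    """Fraction of reads whose value sits close to the 7/8 rounding boundary."""
    near = sum(1 for s in signals if abs(s - boundary) <= half_width)
    return near / len(signals)

# Invented values: one population centered near 7.4 bases.
single_population = [7.0, 7.2, 7.3, 7.4, 7.4, 7.45, 7.55, 7.6, 7.6, 7.8]
# Invented values: two populations, near 7 and near 8 bases.
two_populations = [6.9, 7.0, 7.0, 7.1, 7.1, 7.2, 7.8, 7.9, 8.0, 8.1]

for name, data in [("single", single_population), ("two", two_populations)]:
    print(name, dict(rounded_calls(data)), fraction_near_boundary(data))
```

Both sets round to the same split of 7's and 8's. Only the fractional values show that one set is piled up around the boundary while the other has an empty trough there.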
What really kills me is this sample - and these four graphs come from four samples in the very same experiment, showing the same bit of sequence. This data is ugly. I appear to have two distinct populations, at 7.2 and 8.1 bases. But the trough between the two is not so clear. And it's not so symmetrical. I do have the confirmation experiment done for this one, and guess what:
No mutation.
All the DNA has 8 bases.
Not 7. Not 6.


Niamh B said...

my brain hurts

Argent said...

I found this really interesting and it just goes to show that you can't just make assumptions based on one experiment. I guess with something as potentially life-changing to people as an indication of cancer would be, it pays to be super careful not to give out false positives just as much as to fail to spot genuine mutations.

Titus said...

I like the light blue and the dark blue colours on the graphs.

Totalfeckineejit said...