DNA sequencing
DNA sequencing is the process of determining the exact order of the bases A, T, C and G in a piece of DNA. In essence, the DNA is used as a template to generate a set of fragments that differ in length from each other by a single base. The fragments are then separated by size, and the bases at the end are identified, recreating the original sequence of the DNA.
The most commonly used method of sequencing DNA - the dideoxy or chain termination method - was developed by Fred Sanger in 1977 (for which he won his second Nobel Prize). The key to the method is the use of modified bases called dideoxy bases; when a piece of DNA is being replicated and a dideoxy base is incorporated into the new chain, it stops the replication reaction.
Key principles
A DNA molecule carries information in the form of four chemical groups or bases, represented by the letters A, C, G and T. The order of bases on a DNA strand is the DNA sequence.
Most DNA sequencing is carried out using the chain termination method. This involves the synthesis of new DNA strands on a single stranded template and the random incorporation of chain-terminating nucleotide analogues.
The chain termination method produces a set of DNA molecules differing in length by one nucleotide. The last base in each molecule can be identified by way of a unique label. Separation of these DNA molecules according to size places them in the correct order to read off the sequence.
How does it work?
The DNA to be sequenced is provided in single-stranded form. This acts as a template upon which a new DNA strand is synthesised. DNA synthesis requires a supply of the four nucleotides (the building blocks of DNA), the enzyme DNA polymerase and a primer (a short sequence annealed to the template which initiates the new DNA strand). The nucleotides added to the growing DNA strand are complementary to those in the template strand.
Sequencing is achieved by including in each reaction a nucleotide analogue that cannot be extended and thus acts as a chain terminator. Four reactions are set up, each containing the same template and primer but a chain terminator specific for A, C, G or T. Because only a small amount of the chain terminator is included, incorporation into the new DNA strand is a random event. Each reaction therefore generates a collection of fragments, but every DNA strand will end at the same type of base (A, C, G or T).
The primers or nucleotides included in each of the four reactions contain different fluorescent labels allowing DNA strands terminating at each of the four bases to be identified. The reaction products are then mixed and separated by gel electrophoresis, which separates DNA molecules according to size even if they differ in length by only a single nucleotide. As the DNA strands pass a specific point, the fluorescent signal is detected and the base identified. The whole process can be extensively automated.
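The logic of the four reactions can be sketched in a few lines of Python. This is an idealised illustration, not a model of real chemistry: it assumes every position of the new strand yields exactly one terminated fragment, it ignores the primer (fragment lengths are counted from the first added nucleotide), and the function names are made up for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesized_strand(template):
    """New strand built on the template: its reverse complement, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


def termination_fragments(template):
    """Return {terminator base: [fragment lengths]} for the four reactions.

    Idealised assumption: each position of the new strand produces one
    fragment that ends with the chain terminator for that base.
    """
    reactions = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(synthesized_strand(template), start=1):
        reactions[base].append(length)
    return reactions


def read_sequence(reactions):
    """Pool the reactions, sort fragments by size, and read each terminal label."""
    labelled = [(length, base)
                for base, lengths in reactions.items()
                for length in lengths]
    return "".join(base for _, base in sorted(labelled))


template = "GATTACAGC"  # hypothetical template, written 5'->3'
reactions = termination_fragments(template)
read = read_sequence(reactions)
print(reactions)
print(read)
```

Sorting the pooled fragments by length plays the role of electrophoresis, and the dictionary key plays the role of the fluorescent label. The sequence read is that of the newly synthesised strand, which is complementary to the template.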
How is it used?
The most obvious application of DNA sequencing technology is the accurate sequencing of genes and genomes. Only about 500-800 bases can be sequenced in one experiment so larger DNA molecules, including whole genomes, must be broken into smaller fragments before sequencing and then reassembled by searching for overlaps. Accuracy is achieved by sequencing each template several times.
Lower-fidelity single-pass sequencing is useful for the rapid accumulation of sequence data at the expense of some accuracy. Another application of DNA sequencing technology is resequencing the same DNA molecule over and over. This is necessary, for example, in the typing of single nucleotide polymorphisms.
Maxam-Gilbert sequencing
In 1976-1977, Allan Maxam and Walter Gilbert developed a DNA sequencing method based on chemical modification of DNA and subsequent cleavage at specific bases. Although Maxam and Gilbert published their chemical sequencing method two years after the ground-breaking paper of Sanger and Coulson on plus-minus sequencing, Maxam-Gilbert sequencing rapidly became more popular, since purified DNA could be used directly, while the initial Sanger method required that each read start be cloned for production of single-stranded DNA. However, with the development and improvement of the chain-termination method (see below), Maxam-Gilbert sequencing has fallen out of favour due to its technical complexity, extensive use of hazardous chemicals, and difficulties with scale-up. In addition, unlike the chain-termination method, the chemicals used in the Maxam-Gilbert method cannot easily be customized for use in a standard molecular biology kit.
In brief, the method requires radioactive labelling at one end and purification of the DNA fragment to be sequenced. Chemical treatment generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). Thus a series of labelled fragments is generated, from the radiolabelled end to the first 'cut' site in each molecule. The fragments are then size-separated by gel electrophoresis, with the four reactions arranged side by side. To visualize the fragments generated in each reaction, the gel is exposed to X-ray film for autoradiography, yielding an image of a series of dark 'bands' corresponding to the radiolabelled DNA fragments, from which the sequence may be inferred.
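Reading the four lanes is a small exercise in set logic: a band in both the G and A+G lanes indicates G, a band in A+G alone indicates A, and likewise for C and C+T. The sketch below assumes the band positions have already been converted to base positions counted from the labelled end; the function name and the input layout are illustrative only.

```python
def read_maxam_gilbert(lanes):
    """Infer a sequence from band positions in the G, A+G, C and C+T lanes.

    `lanes` maps a lane name to the set of base positions showing a band.
    """
    positions = sorted(set().union(*lanes.values()))
    sequence = []
    for position in positions:
        if position in lanes["G"]:
            sequence.append("G")      # band in G (and normally in A+G too)
        elif position in lanes["A+G"]:
            sequence.append("A")      # band in A+G only
        elif position in lanes["C"]:
            sequence.append("C")      # band in C (and normally in C+T too)
        elif position in lanes["C+T"]:
            sequence.append("T")      # band in C+T only
    return "".join(sequence)


# Hypothetical gel for a four-base fragment.
gel = {"G": {1}, "A+G": {1, 2}, "C": {4}, "C+T": {3, 4}}
print(read_maxam_gilbert(gel))
```

The sketch does not check that the lanes are consistent with each other (for example, that every G band also appears in the A+G lane), which a careful reader of a real gel would do.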
Also sometimes known as 'chemical sequencing', this method originated in the study of DNA-protein interactions (footprinting), nucleic acid structure and epigenetic modifications to DNA, and within these it still has important applications.
Chain-termination methods
Part of a radioactively labelled sequencing gel
While the chemical sequencing method of Maxam and Gilbert and the plus-minus method of Sanger and Coulson were orders of magnitude faster than previous methods, the chain-terminator method developed by Sanger was even more efficient, and rapidly became the method of choice. The Maxam-Gilbert technique requires the use of highly toxic chemicals and large amounts of radiolabeled DNA, whereas the chain-terminator method uses fewer toxic chemicals and lower amounts of radioactivity. The key principle of the Sanger method was the use of dideoxynucleotide triphosphates (ddNTPs) as DNA chain terminators.
The classical chain-termination or Sanger method requires a single-stranded DNA template, a DNA primer, a DNA polymerase, radioactively or fluorescently labeled nucleotides, and modified nucleotides that terminate DNA strand elongation. The DNA sample is divided into four separate sequencing reactions, each containing the four standard deoxynucleotides (dATP, dGTP, dCTP and dTTP) and the DNA polymerase. To each reaction is added only one of the four dideoxynucleotides (ddATP, ddGTP, ddCTP, or ddTTP). These dideoxynucleotides are the chain-terminating nucleotides, lacking the 3'-OH group required for the formation of a phosphodiester bond between two nucleotides during DNA strand elongation. Incorporation of a dideoxynucleotide into the nascent (elongating) DNA strand therefore terminates DNA strand extension, resulting in DNA fragments of varying length. The dideoxynucleotides are added at lower concentration than the standard deoxynucleotides to allow strand elongation sufficient for sequence analysis.
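Why the terminator must be kept scarce can be seen with a little arithmetic. The sketch below assumes, purely for illustration, that each candidate position has the same independent chance `p_terminate` of receiving a dideoxynucleotide; in practice this chance depends on the polymerase as well as on the concentration ratio, so the numbers are not predictions for any real reaction.

```python
def fraction_still_extending(n_sites, p_terminate):
    """Fraction of growing strands that pass n_sites candidate positions unterminated."""
    return (1.0 - p_terminate) ** n_sites


def fraction_ending_at(site, p_terminate):
    """Fraction of strands terminated exactly at the site-th candidate position."""
    return fraction_still_extending(site - 1, p_terminate) * p_terminate


# Assumed per-site termination chances, chosen only to contrast the two regimes.
low = fraction_still_extending(100, 0.01)   # scarce terminator
high = fraction_still_extending(100, 0.5)   # terminator as common as the normal base
print(low, high)
```

With a 1% chance per site, roughly 37% of strands are still growing after 100 candidate positions, so long fragments remain well represented. With a 50% chance, essentially no strand gets that far and only the first few positions could be read.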
The newly synthesized and labeled DNA fragments are heat denatured, and separated by size (with a resolution of just one nucleotide) by gel electrophoresis on a denaturing polyacrylamide-urea gel. Each of the four DNA synthesis reactions is run in one of four individual lanes (lanes A, T, G, C); the DNA bands are then visualized by autoradiography or UV light, and the DNA sequence can be directly read off the X-ray film or gel image. In the image on the right, X-ray film was exposed to the gel, and the dark bands correspond to DNA fragments of different lengths. A dark band in a lane indicates a DNA fragment that is the result of chain termination after incorporation of a dideoxynucleotide (ddATP, ddGTP, ddCTP, or ddTTP). The terminal nucleotide base can be identified according to which dideoxynucleotide was added in the reaction giving that band. The relative positions of the different bands among the four lanes are then used to read (from bottom to top) the DNA sequence as indicated.
DNA fragments can be labeled by using a radioactive or fluorescent tag on the primer (1), in the new DNA strand with a labeled dNTP, or with a labeled ddNTP.
There are some technical variations of chain-termination sequencing. In one method, the DNA fragments are tagged with nucleotides containing radioactive phosphorus for radiolabelling. Alternatively, a primer labeled at the 5' end with a fluorescent dye is used for the tagging. Four separate reactions are still required, but DNA fragments with dye labels can be read using an optical system, facilitating faster and more economical analysis and automation. This approach is known as 'dye-primer sequencing'. The later development by Leroy Hood and coworkers of fluorescently labeled ddNTPs and primers set the stage for automated, high-throughput DNA sequencing.
Sequence ladder by radioactive sequencing compared to fluorescent peaks
The different chain-termination methods have greatly reduced the amount of work and planning needed for DNA sequencing. For example, the chain-termination-based "Sequenase" kit from USB Biochemicals contains most of the reagents needed for sequencing, prealiquoted and ready to use. Some sequencing problems can occur with the Sanger method, such as non-specific binding of the primer to the DNA, which affects accurate read-out of the DNA sequence. In addition, secondary structures within the DNA template, or contaminating RNA randomly priming at the DNA template, can also affect the fidelity of the obtained sequence. Other contaminants affecting the reaction may consist of extraneous DNA or inhibitors of the DNA polymerase.
Dye-terminator sequencing
Capillary electrophoresis
An alternative to primer labelling is labelling of the chain terminators, a method commonly called 'dye-terminator sequencing'. The major advantage of this method is that the sequencing can be performed in a single reaction, rather than four reactions as in the labelled-primer method. In dye-terminator sequencing, each of the four dideoxynucleotide chain terminators is labelled with a different fluorescent dye, each fluorescing at a different wavelength. This method is attractive because of its greater expediency and speed and is now the mainstay in automated sequencing with computer-controlled sequence analyzers (see below). Its potential limitations include dye effects due to differences in the incorporation of the dye-labelled chain terminators into the DNA fragment, resulting in unequal peak heights and shapes in the electronic DNA sequence trace chromatogram after capillary electrophoresis (see figure to the right). This problem has largely been overcome with the introduction of new DNA polymerase enzyme systems and dyes that minimize incorporation variability, as well as methods for eliminating "dye blobs", caused by certain chemical characteristics of the dyes that can result in artifacts in DNA sequence traces. The dye-terminator sequencing method, along with automated high-throughput DNA sequence analyzers, is now being used for the vast majority of sequencing projects, as it is both easier to perform and lower in cost than most previous sequencing methods.
Automation and sample preparation
View of the start of an example dye-terminator read
Modern automated DNA sequencing instruments (DNA sequencers) can sequence up to 384 fluorescently labelled samples in a single batch (run) and perform as many as 24 runs a day. However, automated DNA sequencers carry out only DNA size separation by capillary electrophoresis, detection and recording of dye fluorescence, and data output as fluorescent peak trace chromatograms. Sequencing reactions by thermocycling, cleanup and re-suspension in a buffer solution before loading onto the sequencer are performed separately. In the past, an operator had to trim the low-quality ends (see image on the right) of every sequence manually in order to remove the sequencing errors. Today, however, software such as Fast Chromatogram Viewer can trim the ends automatically, in batch.
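Automatic end trimming can be illustrated with a minimal sketch. It assumes each base call comes with a numeric quality score where higher means more reliable (Phred-style scores are one common convention), and it simply strips bases from both ends until it meets one at or above a threshold. The threshold of 20 and the function name are assumptions for this example, and this is not a description of how any particular program works.

```python
def trim_low_quality_ends(bases, qualities, min_quality=20):
    """Drop leading and trailing bases whose quality is below min_quality."""
    start, end = 0, len(bases)
    while start < end and qualities[start] < min_quality:
        start += 1                      # advance past poor bases at the start
    while end > start and qualities[end - 1] < min_quality:
        end -= 1                        # retreat past poor bases at the end
    return bases[start:end], qualities[start:end]


# Hypothetical read: noisy at both ends, clean in the middle.
read = "NNACGTACGTNN"
scores = [5, 8, 30, 35, 40, 40, 38, 36, 33, 30, 9, 4]
trimmed, trimmed_scores = trim_low_quality_ends(read, scores)
print(trimmed)
```

Low-quality bases in the interior of the read are left alone; only the ends are removed, which matches the manual practice described above.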
Large-scale sequencing strategies
Current methods can directly sequence only relatively short (300-1000 nucleotides long) DNA fragments in a single reaction. The main obstacle to sequencing DNA fragments above this size limit is insufficient power of separation for resolving large DNA fragments that differ in length by only one nucleotide. Limitations on ddNTP incorporation were largely solved by Tabor at Harvard Medical School, Carl Fuller at USB Biochemicals, and their coworkers.
Genomic DNA is fragmented into random pieces and cloned as a bacterial library. DNA from individual bacterial clones is sequenced and the sequence is assembled by using overlapping regions.
Large-scale sequencing aims at sequencing very long DNA fragments. Even relatively small bacterial genomes contain millions of nucleotides, and human chromosome 1 alone contains about 246 million bases. Therefore, some approaches consist of cutting (with restriction enzymes) or shearing (with mechanical forces) large DNA fragments into shorter DNA fragments. The fragmented DNA is cloned into a DNA vector, usually a bacterial plasmid, and amplified in Escherichia coli. The amplified DNA can then be purified from the bacterial cells. A disadvantage of bacterial clones for sequencing is that some DNA sequences may be inherently un-clonable in some or all available bacterial strains, because of a deleterious effect of the cloned sequence on the host bacterium or other effects. The short DNA fragments purified from individual bacterial colonies are then individually and completely sequenced and assembled electronically into one long, contiguous sequence by identifying 100%-identical overlapping sequences between them (shotgun sequencing). This method does not require any pre-existing information about the sequence of the DNA and is often referred to as de novo sequencing. Gaps in the assembled sequence may be filled by primer walking, often with sub-cloning steps, or by transposon-based sequencing, depending on the size of the remaining region to be sequenced. These strategies all involve taking many small reads of the DNA by one of the above methods and subsequently assembling them into a contiguous sequence. The different strategies have different tradeoffs in speed and accuracy; the shotgun method is the most practical for sequencing large genomes, but its assembly process is complex and potentially error-prone, particularly in the presence of sequence repeats. Because of this, the assembly of the human genome is not literally complete: the repetitive sequences of the centromeres, telomeres, and some other parts of chromosomes result in gaps in the genome assembly.
Despite having only 93% of the full genome assembled, the Human Genome Project was declared complete because its definition of human genome sequencing was limited to euchromatic sequence (99% complete at the time), excluding these intractable repetitive regions.
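Assembly by overlap can be illustrated with a toy greedy procedure: repeatedly merge the two reads that share the longest exact suffix-to-prefix overlap. This is a deliberately simple sketch under stated assumptions (error-free reads, all from the same strand, no repeats longer than the minimum overlap); it is not the algorithm used by any production assembler, and the function names and the example reads are hypothetical.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge reads pairwise by best exact overlap until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps: the remaining contigs are separated by gaps
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


fragments = ["AGCTTAGG", "TAGGCATC", "GCATCGA"]
contigs = greedy_assemble(fragments)
print(contigs)
```

If the reads contained a repeated stretch longer than the minimum overlap, the greedy choice could join the wrong pair, which is the kind of assembly error in repetitive regions that the text describes.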
Resequencing steps. Sample prep: Extraction of nucleic acid. Template prep: Amplification and preparation of a small region of the target region. Sequencing steps.
The human genome is about 3 billion (3,000,000,000) bp long; if the average fragment length is 500 bases, it would take a minimum of six million fragments (3 billion/500) to sequence the human genome (not allowing for overlap, i.e. 1-fold coverage). Keeping track of such a large number of sequences presents significant challenges, which can only be managed by developing and coordinating several procedural and computational approaches, such as efficient database development and management.
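The arithmetic generalises to any genome size, read length and coverage. The helper below is a small illustration; the function name is made up, and the 8-fold figure is only an example value, not a recommendation.

```python
import math


def reads_needed(genome_length, read_length, coverage=1):
    """Minimum number of reads so that total bases read = coverage x genome length."""
    return math.ceil(genome_length * coverage / read_length)


one_fold = reads_needed(3_000_000_000, 500)        # the case worked in the text
eight_fold = reads_needed(3_000_000_000, 500, 8)   # an example of deeper coverage
print(one_fold, eight_fold)
```

At 1-fold coverage this gives the six million reads quoted above; every additional fold of coverage adds another six million.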
Resequencing or targeted sequencing is used to determine a change in DNA sequence from a "reference" sequence. It is often performed using PCR to amplify the region of interest (pre-existing DNA sequence is required to design the PCR primers). Resequencing involves three steps: extraction of DNA or RNA from biological tissue; amplification of the RNA or DNA (often by PCR); and sequencing. The resultant sequence is compared to a reference or a normal sample to detect mutations.
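The final comparison step can be sketched as follows. The example assumes the sample read has already been aligned to the reference and has the same length, so it reports only single-base substitutions; insertions and deletions would need a proper alignment step that is outside this sketch. The function name and the sequences are hypothetical.

```python
def find_substitutions(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch (1-based)."""
    if len(reference) != len(sample):
        raise ValueError("this sketch assumes pre-aligned sequences of equal length")
    return [(position, ref_base, sample_base)
            for position, (ref_base, sample_base)
            in enumerate(zip(reference, sample), start=1)
            if ref_base != sample_base]


reference = "ACGTTGCA"
sample = "ACGTCGCA"
variants = find_substitutions(reference, sample)
print(variants)
```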
New sequencing methods
High-throughput sequencing
The high demand for low-cost sequencing has given rise to a number of high-throughput sequencing technologies. These efforts have been funded by public and private institutions as well as privately researched and commercialized by biotechnology companies. High-throughput sequencing technologies are intended to lower the cost of sequencing DNA libraries beyond what is possible with the current dye-terminator method based on DNA separation by capillary electrophoresis. Many of the new high-throughput methods parallelize the sequencing process, producing thousands or millions of sequences at once.
In vitro clonal amplification
As molecular detection methods are often not sensitive enough for single molecule sequencing, most approaches use an in vitro cloning step to generate many copies of each individual molecule. Emulsion PCR is one method, isolating individual DNA molecules along with primer-coated beads in aqueous bubbles within an oil phase. A polymerase chain reaction (PCR) then coats each bead with clonal copies of the isolated library molecule and these beads are subsequently immobilized for later sequencing. Emulsion PCR is used in the methods published by Margulies et al. (commercialized by 454 Life Sciences, acquired by Roche), Shendure and Porreca et al. (also known as "polony sequencing") and SOLiD sequencing (developed by Agencourt and acquired by Applied Biosystems). Another method for in vitro clonal amplification is "bridge PCR", where fragments are amplified upon primers attached to a solid surface, developed and used by Solexa (now owned by Illumina). These methods both produce many physically isolated locations which each contain many copies of a single fragment. The single-molecule method developed by Stephen Quake's laboratory (later commercialized by Helicos) skips this amplification step, directly fixing DNA molecules to a surface.
Parallelized sequencing
Once clonal DNA sequences are physically localized to separate positions on a surface, various sequencing approaches may be used to determine the DNA sequences of all locations, in parallel. "Sequencing by synthesis", like the popular dye-termination electrophoretic sequencing, uses the process of DNA synthesis by DNA polymerase to identify the bases present in the complementary DNA molecule. Reversible terminator methods (used by Illumina and Helicos) use reversible versions of dye-terminators, adding one nucleotide at a time, detecting fluorescence corresponding to that position, then removing the blocking group to allow the polymerization of another nucleotide. Pyrosequencing (used by 454) also uses DNA polymerization to add nucleotides, adding one type of nucleotide at a time, then detecting and quantifying the number of nucleotides added to a given location through the light emitted by the release of attached pyrophosphates.
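The pyrosequencing read-out lends itself to a short illustration. Nucleotides are offered one type at a time in a repeating order, and the light signal at each step is taken to be proportional to the number of nucleotides incorporated. The sketch below assumes signals already normalised so that one incorporation gives a value near 1.0; the flow order, the signal values and the function name are hypothetical.

```python
def decode_flowgram(flow_order, signals):
    """Turn per-flow light signals into bases of the newly synthesised strand."""
    sequence = []
    for index, signal in enumerate(signals):
        base = flow_order[index % len(flow_order)]  # nucleotide offered in this flow
        count = round(signal)                       # estimated number incorporated
        sequence.append(base * count)
    return "".join(sequence)


# Two cycles of the assumed flow order T, A, C, G.
signals = [1.02, 0.03, 1.95, 0.0, 0.1, 1.1, 0.05, 2.9]
called = decode_flowgram("TACG", signals)
print(called)
```

Rounding works well for short runs of the same base, but the relative difference between, say, seven and eight incorporations is small, so this simple rule would become unreliable for long homopolymer stretches.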
"Sequencing by ligation" is another enzymatic method of sequencing, using a DNA ligase enzyme rather than polymerase to identify the target sequence. Used in the polony method and in the SOLiD technology offered by Applied Biosystems, this method uses a pool of all possible oligonucleotides of a fixed length, labeled according to the sequenced position. Oligonucleotides are annealed and ligated; the preferential ligation by DNA ligase for matching sequences results in a signal corresponding to the complementary sequence at that position.
Other sequencing technologies
Other methods of DNA sequencing may have advantages in terms of efficiency or accuracy. Like traditional dye-terminator sequencing, they are limited to sequencing single isolated DNA fragments. "Sequencing by hybridization" is a non-enzymatic method that uses a DNA microarray. In this method, a single pool of unknown DNA is fluorescently labeled and hybridized to an array of known sequences. If the unknown DNA hybridizes strongly to a given spot on the array, causing it to "light up", then that sequence is inferred to exist within the unknown DNA being sequenced. Mass spectrometry can also be used to sequence DNA molecules; conventional chain-termination reactions produce DNA molecules of different lengths and the length of these fragments is then determined by the mass differences between them (rather than using gel separation).
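The inference step in sequencing by hybridization can be sketched as a puzzle over short subsequences. The example below lists which fixed-length probes would "light up" for a given target and then chains them back together by their overlaps. It makes strong simplifying assumptions: probes are recorded as the target-strand sequence they report (the physical probes would be complementary), hybridization is perfect, and no overlap of length k-1 occurs twice in the target. All names are illustrative.

```python
def hybridization_spectrum(target, probe_length):
    """Set of probe-length subsequences present in `target` (the spots that light up)."""
    return {target[i:i + probe_length]
            for i in range(len(target) - probe_length + 1)}


def reconstruct(spectrum):
    """Chain probes by (k-1)-base overlaps; fails if the chain is ambiguous."""
    k = len(next(iter(spectrum)))
    suffixes = {probe[1:] for probe in spectrum}
    starts = [probe for probe in spectrum if probe[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot identify a unique starting probe")
    sequence = starts[0]
    remaining = set(spectrum) - {sequence}
    while remaining:
        tail = sequence[-(k - 1):]
        candidates = [probe for probe in remaining if probe[:-1] == tail]
        if len(candidates) != 1:
            raise ValueError("ambiguous or broken chain of probes")
        sequence += candidates[0][-1]   # extend by the one new base
        remaining.remove(candidates[0])
    return sequence


spots = hybridization_spectrum("ATGCGTAC", 3)
rebuilt = reconstruct(spots)
print(sorted(spots), rebuilt)
```

Repeated subsequences make the chain ambiguous, which is one reason longer probes or additional information would be needed for longer targets.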
There are new proposals for DNA sequencing, which are in development, but remain to be proven. These include labeling the DNA polymerase, reading the sequence as a DNA strand transits through nanopores, and microscopy-based techniques, such as AFM or electron microscopy that are used to identify the positions of individual nucleotides within long DNA fragments by nucleotide labeling with heavier elements (e.g., halogens) for visual detection and recording. In October 2006 the NIH issued a news release describing novel sequencing techniques and announcing several grant awards.
In October 2006, the X Prize Foundation established the Archon X Prize, intending to award $10 million to "the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome."