Introns and Exons

In complex [eukaryotic] organisms such as humans, cats and cacti, genes are not single unbroken sequences of nucleotides running from start codon to stop codon. Instead, the genes of complex organisms are broken up into sections that are translated, called exons, separated by stretches of untranslated sequence called introns (Figure 1).

Figure 1: In eukaryotic organisms, such as plants, animals and fungi, genes are not single, unbroken sequences of DNA that encode a protein. They are instead punctuated by introns. Introns are segments that lie between the coding sequences (called exons) and are not translated by the organism. When a eukaryotic gene is transcribed, the introns are removed from the transcript before the mRNA is transported to the ribosomes for translation into protein.
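The removal of introns described above can be sketched in a few lines of Python. This is only a toy illustration: the sequence is made up, the exon coordinates are assumed to be already known, and the DNA alphabet (T rather than U) is used for simplicity. The function name `splice` is hypothetical.

```python
def splice(transcript, exons):
    """Join the exon segments of a transcript.

    `exons` is a list of (start, end) positions, 0-based and end-exclusive,
    in the order in which the exons appear.
    """
    return "".join(transcript[start:end] for start, end in exons)


# Made-up transcript: exons in upper case, the intron in lower case.
pre_mrna = "ATGGCC" + "gtaagtttttcag" + "GATTAA"

# Assumed exon coordinates for this toy sequence.
exons = [(0, 6), (19, 25)]

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)  # ATGGCCGATTAA
```

Only the upper-case exon sequence survives, which is what the ribosome would then read codon by codon.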

Introns and exons allow flexibility in whether particular parts of the gene's sequence are included in the final protein sequence (Figure 2). This means that one gene can encode many different proteins. Being able to make slightly different proteins has advantages for the organism: the proteins produced can be varied according to what the cell or organism needs at a given moment, and this is achieved within a smaller genomic space than would be needed if there were a separate gene for each version of the protein.

Figure 2: The use of introns and exons can give an organism considerable flexibility to produce slightly different versions of the same protein. (A) An exon can be optionally included or skipped, allowing an organism to include or exclude an internal part of the protein. (B) Two neighbouring exons can be mutually exclusive, so that only one of them is included in the mRNA. (C) and (D) Alternative splice sites can lie at different positions within the same exon, allowing the inclusion or exclusion of the region of the exon immediately before or after an intron. (E) Removing an intron can be optional, so that part of a sequence is included, or not, depending on which version of the gene is being expressed. (F) Alternative splicing at the 3' end of the gene can change the location of the stop codon, which in turn can lead to variation at the C-terminus of the protein. This flexibility breaks the idea of a one-to-one relationship between genes and proteins, as a single gene can encode many slightly different versions of the same protein.
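Case (A), the optional exon, can be illustrated with a short sketch. The exon names and sequences here are invented, and the function `isoforms` is a hypothetical helper rather than part of any library; it simply lists every transcript that results from keeping or skipping the exons marked as optional.

```python
from itertools import product


def isoforms(exons, optional):
    """Yield (exon names, sequence) for every combination of optional exons."""
    names = list(exons)
    # Optional exons may be kept or skipped; all other exons are always kept.
    choices = [(True, False) if name in optional else (True,) for name in names]
    for keep in product(*choices):
        included = [name for name, kept in zip(names, keep) if kept]
        yield included, "".join(exons[name] for name in included)


# Made-up exons; E2 is six bases long, so skipping it keeps the reading frame.
exons = {"E1": "ATGGCC", "E2": "GGATCC", "E3": "GATTAA"}

result = list(isoforms(exons, optional={"E2"}))
for included, sequence in result:
    print("+".join(included), sequence)
```

With one optional exon the gene yields two transcripts; each additional optional exon would double the number of possible combinations.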

The positions in the gene sequence where an exon meets an intron are called splice sites. A splice site that ends an exon and begins an intron is a splice donor, while one that ends an intron and begins an exon is a splice acceptor. The full mechanism of how these sites work, and of how alternative splicing is controlled, is not fully understood. Nevertheless, a number of splice site sequence patterns have been identified (Figure 3), and a number of algorithms have been developed to predict splice sites and changes in them.

Figure 3: Although the precise mechanisms behind all splice sites have not been fully elucidated, sequence comparison has identified a number of consensus sequences for splice sites. These are distinct sequence patterns at the splice site at the start of the intron (donor), at the splice site at the end of the intron (acceptor), and at a sequence that occurs shortly before the splice acceptor (branching site). Firstly, there is the standard canonical splice site (A), a widespread splice site pattern. Next, there is a slightly different version of the canonical splice site (B), which still has a high degree of similarity to it. Next, there are the U12 introns (C), which bear little resemblance to the canonical splice site, but whose members have a high degree of similarity to one another. Lastly, there are known splice sites which do not resemble any of these consensus sequences.
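As a rough illustration of pattern-based splice site detection, the sketch below uses only the best-known feature of the canonical splice site: most introns begin with GT and end with AG in the DNA sequence. The detailed consensus sequences of Figure 3 are not reproduced here, and real prediction algorithms use much richer models. The function name `candidate_introns`, its `min_len` parameter and the test sequence are all assumptions made for the example.

```python
def candidate_introns(seq, min_len=6):
    """Return (start, end) pairs that begin with GT and end with AG.

    Positions are 0-based and end-exclusive. `min_len` is an arbitrary
    lower limit on intron length chosen for this toy example.
    """
    seq = seq.upper()
    pairs = range(len(seq) - 1)
    donors = [i for i in pairs if seq[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in pairs if seq[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


# Made-up gene fragment; the true intron occupies positions 6 to 19.
gene = "ATGGCCgtaagtttttcagGATTAA"

for start, end in candidate_introns(gene):
    print(start, end, gene[start:end])
```

The scan reports the true intron but also a second, spurious candidate, which shows why two-letter motifs alone are not enough and why longer consensus sequences and the branching site matter.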

Changes to splice sites can cause catastrophic damage to a gene. A famous example of this is haemophilia in European royalty in the 19th and early 20th centuries. Thanks to the discovery of the bodies of the Romanov family (Figure 4), the cause of this disease was determined to be a mutation that changed a splice site in the gene for Factor IX. With the Factor IX gene no longer functional, the result was haemophilia B. Because the gene lies on the X chromosome, males who inherited the mutated copy from their mothers had this clotting disorder.

Figure 4: The Romanovs were the last Russian royal family. They were deposed during the February Revolution of 1917 and subsequently murdered by the Bolshevik regime. Alexandra Feodorovna, the Empress consort, was a descendant of Queen Victoria and, like many European royals of the era, carried a mutation for haemophilia. Her son, the Tsesarevich Alexei Nikolaevich, suffered from the disease, which contributed to the crisis of confidence in the Russian monarchy. The disease was caused by a mutation in the Factor IX gene. This mutation created an extra splice site in the gene, which led to an mRNA sequence with a frame shift that changed the codons read afterwards and introduced a premature stop codon, making the gene non-functional.
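The effect of an extra splice site can be shown with a toy example. The sequences below are invented and are not the Factor IX sequence; the two extra bases stand in for intron sequence that is wrongly retained when a new acceptor site is used. The code builds the standard genetic code table and translates both versions of the mRNA.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up exons (hypothetical, for illustration only).
exon1 = "ATGGCCAAA"
exon2 = "CTGATCTGGCTTTAA"

normal_mrna = exon1 + exon2
# A new acceptor two bases upstream keeps two intron bases in the mRNA.
mutant_mrna = exon1 + "AG" + exon2

print(translate(normal_mrna))  # MAKLIWL
print(translate(mutant_mrna))  # MAKS
```

Two extra bases are enough to shift every downstream codon, and in this example the shifted frame reaches a stop codon almost immediately, giving a truncated protein.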

It should be noted that bacteria (which are not eukaryotes) do not have introns or exons; rather, their genes are translated whole. This has important implications, as bacteria are common expression systems for proteins, which often come from eukaryotic organisms. The genes for these proteins therefore have to be modified to remove the introns before they can be expressed properly in bacterial cells. This can be achieved by identifying the protein or mRNA sequence and producing a DNA sequence based on it (Figure 5).
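Producing an intron-free DNA sequence from a protein sequence can be sketched as follows. This is a deliberately naive version: it picks one arbitrary codon for each amino acid, whereas practical gene design would normally choose codons to suit the host organism. The function name `back_translate` and the example protein are hypothetical.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Arbitrary choice: use the first codon in the table for each amino acid.
CHOSEN_CODON = {}
for codon, amino_acid in CODON_TABLE.items():
    CHOSEN_CODON.setdefault(amino_acid, codon)


def back_translate(protein):
    """Return an intron-free coding sequence for `protein`, with a stop codon."""
    return "".join(CHOSEN_CODON[aa] for aa in protein) + CHOSEN_CODON["*"]


coding_dna = back_translate("MAKLIWL")
print(coding_dna)  # ATGGCTAAATTAATTTGGTTATAA
```

Because several codons encode most amino acids, many different DNA sequences would give the same protein; this sketch returns just one of them.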