Bioinformatics 101 : Solving RNA Splicing
Introduction
Protein synthesis is a central mechanism by which a cell carries out its functions, from metabolism (anabolic or catabolic) to mitosis to various cell-specific abilities. Since DNA stores the information necessary for building proteins, there must be a mechanism (or several intermediary ones) through which the cell uses that information to make the desired proteins. The first step is called transcription: the synthesis of a messenger RNA (mRNA) by an enzyme called RNA polymerase. Although the process is delicate and full of details, for the purposes of this article I’ll only state that the mRNA is actually the final product of a series of biochemical modifications to a precursor mRNA (pre-mRNA). The modification we care about here is called “RNA splicing”.
If we have a DNA strand s, we could easily infer its corresponding precursor mRNA strand by using the rules of complementarity and substituting thymine with uracil. But cells in eukaryotes don’t use this pre-mRNA directly. Instead, the spliceosome cuts it into multiple pieces called introns and exons. Introns are discarded, while exons are stitched together to form the final mature mRNA molecule, which then exits the nucleus and is used by ribosomes to make the final protein.
As one might imagine, there are many ways to cut and stitch a pre-mRNA strand. This alternative splicing is sometimes used by cells to synthesize similar yet different proteins of the same family (hemoglobin, myoglobin, etc.). In terms of DNA, the exons deriving from a gene are collectively known as the gene’s coding region.
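The transcription rule mentioned above (complementarity plus swapping thymine for uracil) can be sketched in a few lines. This is only an illustration: the helper names `transcribe_template` and `transcribe_coding` are made up for this sketch, and it assumes the input strands are written 5'→3' using only the letters A, C, G and T.

```python
# Hypothetical helpers for illustration; they are not part of the solution further down.

# Base-pairing rule for DNA -> RNA: A pairs with U, C with G, G with C, T with A.
DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def transcribe_template(template_dna):
    """Return the pre-mRNA (5'->3') for a template strand written 5'->3'."""
    # Complement every base, then reverse so the RNA also reads 5'->3'.
    return template_dna.upper().translate(DNA_TO_RNA_COMPLEMENT)[::-1]


def transcribe_coding(coding_dna):
    """Return the pre-mRNA for a coding strand: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


# Both strands of the same double helix give the same RNA.
print(transcribe_template("GACCAT"))  # AUGGUC
print(transcribe_coding("ATGGTC"))    # AUGGUC
```

When the given string is already the coding strand, as the problem below assumes, the shortcut in `transcribe_coding` is all that is needed.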
The Computational Problem
After identifying the exons and introns of an RNA string, we only need to delete the introns and concatenate the exons to form a new string ready for translation.
Given: A DNA string s (of length at most 1 kbp) and a collection of substrings of s acting as introns. All strings are given in FASTA format.
Return: A protein string resulting from transcribing and translating the exons of s. (Note: Only one solution will exist for the dataset provided.)
Approach To Solving It
1. Extracting the DNA Sequence and Removing Introns :
After extracting the main DNA string $\nabla$ from the FASTA input (see: Parsing FASTA Files), we iterate through the list of introns $\chi$, which contains $\Nu$ strings:
$$ \chi = \lbrace \phi_1, \phi_2, \phi_3, \dots, \phi_\Nu \rbrace $$
We remove each intron $\phi_n$ ($1 \le n \le \Nu$) from $\nabla$ wherever it matches a substring of $\nabla$.
Once all introns are removed, we replace every thymine (T) with uracil (U).
We’ll call our final mRNA strand $\lambda$.
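The removal step can be sketched on its own as follows. This is a minimal sketch under two assumptions of mine, not something stated in the problem: only the first occurrence of each intron is removed, and introns that never match are reported back instead of being silently ignored. The function name `splice` is hypothetical.

```python
def splice(dna, introns):
    """Remove the first occurrence of each intron from dna.

    Returns a tuple (spliced_dna, missing), where missing lists the
    introns that were not found in the string at the time of removal.
    """
    missing = []
    for intron in introns:
        if intron in dna:
            # count=1: delete only the first match (an assumption of this sketch)
            dna = dna.replace(intron, "", 1)
        else:
            missing.append(intron)
    return dna, missing


spliced, missing = splice("AAGGCCTT", ["GGC", "TTT"])
print(spliced)  # AACTT
print(missing)  # ['TTT']
```

Reporting unmatched introns is a cheap sanity check: in a well-formed dataset the `missing` list should come back empty.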
2. Translating mRNA to Protein Sequence :
We divide $\lambda$ (the mature mRNA string) into $\lfloor |\lambda| / 3 \rfloor$ codons $\Upsilon_i$ (triplets of bases; an incomplete trailing triplet is ignored), and we use the standard codon table to translate each codon into its corresponding amino acid until a stop codon is encountered.
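To make the codon split concrete, here is a small sketch that only does the chunking and the stop-codon cut-off, without the amino-acid lookup. The names `codons`, `coding_codons` and `STOP_CODONS` are illustrative and do not appear in the solution code.

```python
# The three stop codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}


def codons(mrna):
    """Yield successive complete triplets of mrna, dropping a trailing partial one."""
    usable_length = len(mrna) - len(mrna) % 3
    for i in range(0, usable_length, 3):
        yield mrna[i:i + 3]


def coding_codons(mrna):
    """Return the codons up to, but not including, the first stop codon."""
    result = []
    for codon in codons(mrna):
        if codon in STOP_CODONS:
            break
        result.append(codon)
    return result


print(list(codons("AUGGUCUAGAA")))   # ['AUG', 'GUC', 'UAG']
print(coding_codons("AUGGUCUAGAA"))  # ['AUG', 'GUC']
```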
Practical Python Code :
# Define the standard RNA codon table (64 codons, three of which are stop codons)
CODON_TABLE = {
    'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L',
    'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',
    'UAU': 'Y', 'UAC': 'Y', 'UAA': 'Stop', 'UAG': 'Stop',
    'UGU': 'C', 'UGC': 'C', 'UGA': 'Stop', 'UGG': 'W',
    'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',
    'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M',
    'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
    'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
    'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
    'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',
    'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'
}
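def parse_fasta(fasta_string):
    """Parse FASTA format input and return a dictionary mapping labels to sequences."""
    sequences = {}
    label = None
    for line in fasta_string.strip().split('\n'):
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith('>'):
            # A header line starts a new record
            label = line[1:]
            sequences[label] = ''
        elif label is not None:
            # A sequence may span several lines; append to the current record
            sequences[label] += line
    return sequences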
def translate_rna_to_protein(rna):
    """Translate an RNA string into a protein string using the codon table."""
    protein = []
    for i in range(0, len(rna), 3):
        codon = rna[i:i+3]
        if len(codon) < 3:
            break  # Ignore an incomplete trailing codon
        # Codons missing from the table are treated like a stop codon
        amino_acid = CODON_TABLE.get(codon, 'Stop')
        if amino_acid == 'Stop':
            break
        protein.append(amino_acid)
    return ''.join(protein)
def remove_introns_and_translate(fasta_string):
    """Process FASTA input, remove introns, and return the translated protein string."""
    sequences = parse_fasta(fasta_string)
    # Get the main DNA string (first sequence in the input)
    dna_string = list(sequences.values())[0]
    # Get the introns (remaining sequences)
    introns = list(sequences.values())[1:]
    # Remove each intron from the DNA string
    for intron in introns:
        dna_string = dna_string.replace(intron, '')
    # Transcribe DNA to RNA
    rna_string = dna_string.replace('T', 'U')
    # Translate RNA to protein
    protein = translate_rna_to_protein(rna_string)
    return protein
# Example FASTA input
fasta_input = """>Sequence1
ATGGTCTACATAGCTGACAAACAGCACGTAGTGGTGATGTAGCTAGCTCAGTGTAG
>Intron1
ATCGTAGCTCAG
>Intron2
GGTGATGTAGCTAGCTCAG
"""
# Compute the result
protein_string = remove_introns_and_translate(fasta_input)
print(protein_string)