042 Cristea

7/30/2019 042 Cristea


GENETIC SIGNALS: AN EMERGING CONCEPT

Paul Dan Cristea

"Politehnica" University of Bucharest

Bio-Medical Engineering [email protected]

Abstract: Converting DNA sequences into digital signals using a base-four representation of the nucleotides converts the codons into numbers in the range 0-63 and the amino acids, together with the terminator, into numbers in the range 0-20. Correspondingly, this transforms the symbolic DNA sequences into digital genetic signals of nucleotides, codons or amino acids and makes it possible to apply a whole range of powerful signal processing methods to their analysis. The paper proposes a new representation of the Genetic Code that better reflects its structure and degeneracy. Optimal symbolic-to-digital mappings for nucleotides and amino acids are proposed on this basis, and some preliminary investigation of the resulting genetic signals is described. The use of Independent Component Analysis (ICA) for identifying control sequences in the DNA that does not directly code for proteins is proposed.

Keywords: Genetic Code, Genetic Signals, Genome Analysis, Projection Pursuit, ICA

1. INTRODUCTION

The almost complete sequencing of the human genome, as well as the public access to most of its content [1, 2], offers tremendous opportunities to explore it in depth and to data mine this unique information depository. The classic approach of representing DNA as symbolic sequences of nitrogenous bases, or of symbolic codons encoding polypeptide chains, essentially limits the handling of the information to pattern matching procedures. Converting the DNA sequences into digital signals using a base-four representation of the nucleotides converts the codons into numbers in the range 0-63 and the amino acids, together with the terminator, into numbers in the range 0-20. Correspondingly, this converts DNA sequences into digital genetic signals and opens the possibility of applying a whole range of powerful signal processing methods to their analysis.

Currently, only about 32,000 genes containing the instructions to make proteins, but representing less than 5 percent of the human genome, are considered of interest. The vast majority of the genome is considered junk [3], as it has been discovered that it contains a large amount of mobile (transposable) elements that bear a close resemblance to the DNA of independent entities like viruses and bacteria. Like the mitochondria, which also started their ancestral life as independent entities and became the main energy suppliers of the eukaryotic cells, a significant part of the extra-genic chromosomal DNA very probably has an important role in the control of protein synthesis.

Special attention is given to justifying the choice of a "natural" correspondence between the nucleotides and the digits in base four (Thymine = 0, Cytosine = 1, Adenine = 2, Guanine = 3).

Independent Component Analysis (ICA) is a special case of the Blind Source Separation (BSS) method. Its goal is to recover statistically independent source signals from some available linearly mixed signals produced by an unknown medium. Many applications are actively being developed, including speech recognition, telecommunications, and bio-medical signal and image processing. For complex real-life problems, the computational load becomes excessive, especially because unknown relative shifts of the independent components have to be considered. The paper proposes an ICA approach to the search for independent components in the extra-genic DNA [4].
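The linear mixing model that ICA assumes can be written compactly. The symbols below ($\mathbf{s}$, $\mathbf{x}$, $\mathbf{A}$, $\mathbf{W}$, $\tau_{ij}$) are notation introduced here for illustration and do not come from the paper; the second line is only one possible way of writing the "unknown relative shifts" mentioned above.

```latex
% Instantaneous mixing: n observed signals are unknown linear
% combinations of n statistically independent sources.
\mathbf{x}(t) = \mathbf{A}\,\mathbf{s}(t), \qquad
\mathbf{y}(t) = \mathbf{W}\,\mathbf{x}(t) \approx \mathbf{s}(t)
\quad \text{(up to permutation and scaling of the components)}

% Mixing with unknown relative shifts of the sources:
x_i(t) = \sum_{j=1}^{n} a_{ij}\, s_j\!\left(t - \tau_{ij}\right)
```

ICA estimates the unmixing matrix $\mathbf{W}$ by making the components of $\mathbf{y}$ as statistically independent as possible. In the shifted form each $\tau_{ij}$ is an additional unknown, which is why the search space, and with it the computational load, grows quickly.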

2. GENETIC CODE

The Genetic Code is universal, as it is used by all known organisms, with only small variations in mitochondria and certain microbes. In any case, the Genetic Code applies to all known nuclear genetic material, DNA, mRNA and tRNA, and encompasses animals (including humans), plants, fungi, archaea, bacteria, and viruses.

The main genetic material is represented by the DNA molecules, which have a basically simple linear unbranched structure formed by nucleotide chains. The repetitive unit, the nucleotide, has three components: phosphate, deoxyribose and a nitrogenous base. Only four kinds of nitrogenous base are found in DNA: thymine (T) and cytosine (C), which are pyrimidines, and adenine (A) and guanine (G), which are purines. The DNA molecule has two complementary chains that form a double helix in which a pyrimidine in one chain faces a purine in the other; only the base pairs T-A and C-G exist.

Proteins are the main contributors to cell structure and, as enzymes, catalyze the chemical reactions specific to the functioning of the cells. The primary structure of a protein is given by polypeptide chains formed of amino acid sequences. The coiling (secondary), folding (tertiary) and aggregation (quaternary) of the polypeptides generate the complex spatial structure of a protein, essential for its functioning. There are only twenty different amino acids in the proteins. A sequence of three nitrogenous bases encodes an amino acid according to the Genetic Code in two steps: transcription, in which one strand of DNA is copied into a complementary mRNA (messenger) molecule, and translation, in which the language of nitrogenous bases is transformed by ribosomes into the language of amino acids.

Only certain limited regions of the genome -- the genes -- give information to make proteins. Human genes are few and far apart. There are about 12 genes per million bases of human DNA. Genes are divided into exons -- sections of the coding sequence -- interrupted by introns -- non-coding spacers. Human genes have many small exons, some just 19 bases long, separated by introns of an average length of about 3,300 bases, but with a large dispersion. Most introns are only 87 bases long, while some are over 10,000. Genes are rich in C and G, while non-coding DNA sequences are rich in T and A.

The Genetic Code is given in Table 1 [5], while Table 2 lists the amino acids. There are 64 codons, of which 61 encode the 20 amino acids, while 3 correspond to terminators -- "end" sequences. Consequently, the genetic code is degenerate: most amino acids are inserted into a growing polypeptide chain in response to two or more different triplets in the mRNA. Fig. 1 gives the classic 3D Cartesian representation of the codons and their translation to amino acids.
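The codon counts quoted above can be checked with a short script. This is a minimal sketch that assumes the standard genetic code written as a 64-character string of one-letter amino acid symbols (with `*` for the terminator); the one-letter symbols and the names `BASES`, `AMINO` and `CODE` are conventions chosen here, not taken from the paper, which uses three-letter names.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a terminator codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each three-letter codon to the symbol it encodes.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons assigned to each amino acid (or to the terminator).
codons_per_symbol = Counter(CODE.values())

n_terminators = codons_per_symbol["*"]
n_coding = len(CODE) - n_terminators
n_amino_acids = len(codons_per_symbol) - 1  # exclude the terminator symbol

print(len(CODE), n_coding, n_terminators, n_amino_acids)
for symbol, count in sorted(codons_per_symbol.items(), key=lambda kv: kv[1]):
    print(symbol, count)  # degeneracy of each symbol
```

The per-symbol counts printed at the end give the degeneracy of each amino acid under this table.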

Table 1. Genetic Code

Table 2. Amino acid short names

Ala  Alanine
Arg  Arginine
Asn  Asparagine
Asp  Aspartic acid
Cys  Cysteine
Gln  Glutamine
Glu  Glutamic acid
Gly  Glycine
Ile  Isoleucine
His  Histidine
Leu  Leucine
Lys  Lysine
Met  Methionine
Phe  Phenylalanine
Pro  Proline
Ser  Serine
Thr  Threonine
Trp  Tryptophan
Tyr  Tyrosine
Val  Valine
Ter  Terminator

This representation does not capture the characteristic symmetries and degeneracy of the Genetic Code. We propose the tetrahedron representation in Fig. 2. Each nitrogenous base defines a direction in the representation space, towards one of the corners of a tetrahedron.

Figure 1. Classic Cartesian representation of the Genetic Code


Figure 2. Tetrahedral representation of the Genetic Code

The first base in a codon selects one of the four first order 16-codon tetrahedrons that compose the zero order tetrahedron of the overall Genetic Code. The second base in the codon selects one of the second order 4-codon tetrahedrons that compose the first order tetrahedron. Finally, the third base identifies one of the vertices.

Degeneracy is basically restricted to the second order tetrahedrons, and most pairs of interchangeable bases are distributed on the edges along the pyrimidine and purine directions. This construction also has the advantage of naturally suggesting the putative ancestral coding sequences by the simple passage to a lower level tetrahedron.
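The hierarchy described above can be illustrated by grouping the 64 codons by their first two bases, which yields the 16 second order 4-codon groups. The sketch below assumes the standard genetic code in one-letter notation (`*` for the terminator); the function name `second_order_groups` is hypothetical, and the geometry of the tetrahedrons is not modelled, only the grouping.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def second_order_groups():
    """Group the 64 codons by their first two bases (16 groups of 4)."""
    groups = {}
    for i, codon in enumerate(product(BASES, repeat=3)):
        prefix = codon[0] + codon[1]
        groups.setdefault(prefix, []).append(("".join(codon), AMINO[i]))
    return groups

groups = second_order_groups()

# Groups in which the third base does not matter: all four vertices
# encode the same amino acid.
uniform = [prefix for prefix, members in groups.items()
           if len({aa for _, aa in members}) == 1]

print(len(groups), len(uniform), uniform)
```

In half of the second order groups the third base is irrelevant; in the others the amino acid changes only within the group. The six-codon amino acids and the terminator are the exceptions, since their codons are spread over two groups.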

3. SYMBOLIC-TO-DIGITAL MAPPING

The existence of the four different nitrogenous bases strongly suggests mapping the nucleotides to the digits {0, 1, 2, 3}, interpreting the three-base codons as three-digit numbers written in base four, and thus mapping the codons along the linear DNA strands to the numbers {0, 1, 2, ..., 63}. Actually, a whole DNA sequence can be seen as a huge number written in base four. Nevertheless, it is more natural to interpret each codon as a distinct sample of a digital genetic signal distributed along the DNA strands. There are 4! = 24 distinct choices for attaching the digits 0-3 to the bases A, C, G, T. The optimal choice given in Table 3 results from the condition of a minimally non-monotonic correspondence between the codons 0-63 and the amino acids plus the terminator 0-20 (see Fig. 7), which leads to the best auto-correlated extra-genic genetic signals.

Table 3. Mapping of Nucleotides to Digits in Base Four

Pyrimidines
Thymine = T = 0
Cytosine = C = 1

Purines
Adenine = A = 2
Guanine = G = 3
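As a minimal sketch of how the mapping in Table 3 turns a symbolic sequence into digital signals, the following converts a DNA string into a nucleotide signal (one digit per base) and a codon signal (one number in 0-63 per triplet). The function names and the `frame` parameter are illustrative choices made here, not part of the paper.

```python
# Mapping of nucleotides to base-four digits, as given in Table 3.
DIGIT = {"T": 0, "C": 1, "A": 2, "G": 3}

def nucleotide_signal(seq):
    """One sample in 0..3 per nucleotide."""
    return [DIGIT[base] for base in seq.upper()]

def codon_number(codon):
    """Read a three-base codon as a three-digit number in base four."""
    d1, d2, d3 = (DIGIT[base] for base in codon.upper())
    return 16 * d1 + 4 * d2 + d3

def codon_signal(seq, frame=0):
    """One sample in 0..63 per complete codon, starting at offset `frame`."""
    seq = seq.upper()
    return [codon_number(seq[i:i + 3]) for i in range(frame, len(seq) - 2, 3)]

print(nucleotide_signal("TCAG"))   # [0, 1, 2, 3]
print(codon_signal("ATGGCCTAA"))   # [35, 53, 10]
```

Trailing bases that do not form a complete codon are ignored. Shifting `frame` by one or two positions yields the other two reading frames of the same strand.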

Figures 3 to 6 show the four first order 16-codon tetrahedrons, the numerical codes attached to the vertices, i.e., to the codons, and the encoded amino acids. It can be noticed that there are one-to-one correspondences. Correspondingly, Tables 3 to 6 also give the numerical codes which result for the amino acids from the order of their first reference. There are only two one codon - one amino acid (non-degenerate) mappings, for Tryptophan and Methionine, but ten double, three triple, six quadruple, and two sextuple degeneracies. From the frequency of the amino acids in the proteins, it follows that the Genetic Code has the features of an entropic coding. On the other hand, a higher degeneracy suggests an ancestral amino acid. This allows building models of ancestral proteins.

Figure 3. Symbolic to Digital Mapping of Codons -- Thymine Tetrahedron

Table 3. Codes of the Amino Acids in the Thymine Tetrahedron

Table 7 summarizes the proposed optimal correspondence of numerical codons to amino acids, while Fig. 7 represents the dependence of the numeric codes attached to the amino acids on the numeric codons.

Figure 4. Symbolic to Digital Mapping of Codons -- Cytosine Tetrahedron

Table 4. Codes of the Amino Acids in the Cytosine Tetrahedron

Figure 5. Symbolic to Digital Mapping of Codons -- Adenine Tetrahedron

Table 5. Codes of the Amino Acids in the Adenine Tetrahedron

The minimally non-monotonic dependence has only three reversals of the normal order: for a terminator sequence and for the two sextuple degeneracies, serine and arginine.


Figure 6. Symbolic to Digital Mapping of Codons -- Guanine Tetrahedron

Table 6. Codes of the Amino Acids in the Guanine Tetrahedron

An exhaustive search over all 24 possible correspondences between the nitrogenous bases and the digits 0-3 has shown that no more monotonic dependence exists. The proposed coding gives a piecewise constant function, with only the three mentioned reversals of the order.
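A search of this kind can be sketched as follows. The paper does not spell out its monotonicity criterion, so this sketch makes two assumptions: amino acid codes are assigned in order of first reference along codons 0-63 (recomputed for every candidate mapping), and a "reversal" is counted as a strict decrease between consecutive codons. The function name `count_decreases` and the one-letter code table are choices made here; under a different criterion the ranking of the 24 mappings may differ from the paper's.

```python
from itertools import permutations, product

# Standard genetic code listed with bases in T, C, A, G order; '*' = terminator.
REF_BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(REF_BASES, repeat=3), AMINO)}

def count_decreases(order):
    """Count strict decreases of the amino acid code along codons 0..63.

    `order` is a 4-letter string; the base at position d is mapped to digit d.
    Amino acid codes are assigned by order of first reference (an assumption).
    """
    first_seen = {}   # symbol -> code in order of first appearance
    previous = None
    decreases = 0
    for codon in product(order, repeat=3):   # codons 0..63 under this mapping
        symbol = CODE["".join(codon)]
        code = first_seen.setdefault(symbol, len(first_seen))
        if previous is not None and code < previous:
            decreases += 1
        previous = code
    return decreases

# Evaluate all 4! = 24 candidate mappings, lowest count first.
results = sorted((count_decreases("".join(p)), "".join(p))
                 for p in permutations(REF_BASES))

print("Table 3 mapping (T, C, A, G):", count_decreases("TCAG"))
for n, order in results:
    print(order, n)
```

Sorting by the count puts the most nearly monotonic mappings first, so the Table 3 mapping can be compared directly against the other 23 candidates under this particular criterion.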

Figure 7. Proposed Optimal (Minimally Non-Monotonic) Correspondence of Numerical Codons to Amino Acid Codes

Table 7. Proposed Optimal Correspondence of Numerical Codons to Amino Acids

4. GENETIC SIGNALS

Figure 8. An excerpt from a Codon Digital Genetic Signal


Fig. 8 represents an excerpt of 150 samples from a Codon Digital Genetic Signal, while Fig. 9 shows the corresponding Amino Acid Digital Genetic Signal.
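The step from a codon signal to the corresponding amino acid signal can be sketched as a table lookup. The sketch assumes the standard genetic code indexed by codon number under the T = 0, C = 1, A = 2, G = 3 mapping, with amino acid codes assigned strictly in order of first reference; the paper's Table 7 is not reproduced in this text, so the actual code values used there may differ. The names `AA_CODE` and `amino_acid_signal` are hypothetical.

```python
# Standard genetic code in one-letter notation ('*' = terminator); with the
# mapping T=0, C=1, A=2, G=3 the string index equals the codon number 0..63.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Numerical codes 0..20 in order of first reference along codons 0..63
# (assumed reading of "order of their first reference").
AA_CODE = {}
for symbol in AMINO:
    AA_CODE.setdefault(symbol, len(AA_CODE))

def amino_acid_signal(codon_samples):
    """Map a codon signal (values 0..63) to an amino acid signal (0..20)."""
    return [AA_CODE[AMINO[n]] for n in codon_samples]

# Codon signal of ATG GCC TAA under the mapping above.
print(amino_acid_signal([35, 53, 10]))   # [12, 17, 4]
```

Because several codons share one code, the amino acid signal is a many-to-one, piecewise constant function of the codon signal.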

It is significant that the genetic signals built from genes show low auto-correlation, even for neighboring samples. This is a feature usually associated with noise and is consistent with the fact that the functionality of a protein is not given directly by its first order structure, i.e., the sequence of amino acids, but by its higher order spatial structure. On the other hand, the extra-genic genetic signals obtained from non-coding DNA sequences have many features typical of piecewise smooth "natural" signals, such as a good correlation between close neighbors that decreases abruptly with distance.
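The auto-correlation referred to above can be estimated as follows. The paper does not state which estimator was used; this sketch assumes the common normalized sample autocorrelation (mean removed, divided by the lag-0 value), and the function name is illustrative.

```python
def autocorrelation(signal, lag):
    """Normalized sample autocorrelation of `signal` at a non-negative lag.

    Returns a value in [-1, 1]; values near zero at lag 1 are typical of
    noise-like signals, values near one of smooth signals.
    """
    n = len(signal)
    if n == 0 or not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(signal)")
    mean = sum(signal) / n
    dev = [x - mean for x in signal]
    energy = sum(d * d for d in dev)
    if energy == 0:
        raise ValueError("autocorrelation is undefined for a constant signal")
    return sum(dev[i] * dev[i + lag] for i in range(n - lag)) / energy

smooth = [0, 1, 2, 3, 4, 5, 6, 7]     # slowly varying toy signal
alternating = [0, 1, 0, 1, 0, 1]      # rapidly varying toy signal
print(autocorrelation(smooth, 1))       # 0.625
print(autocorrelation(alternating, 1))  # negative
```

Evaluating this function at lags 1, 2, 3, ... on a codon or amino acid signal would show how quickly the correlation between neighboring samples falls off with distance.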

Figure 9. The Amino Acid Digital Genetic Signal corresponding to Codons in Fig. 8

5. CONCLUSIONS

The paper proposes the Tetrahedron Representation of the Genetic Code, which better reflects its structure and degeneracy. Optimal symbolic-to-digital mappings for nucleotides and amino acids are proposed on this basis. Some features of the resulting genetic signals are described.

It is suggested that using the Projection Pursuit approach, specifically Independent Component Analysis (ICA), on the genetic signals derived from extra-genic DNA sequences, which do not encode proteins, could reveal signals that control the functioning of the genes, i.e., the synthesis of the proteins.

REFERENCES

[1] Venter, J.C. et al., A New Strategy for Genome Sequencing, NATURE, 381, (May 30, 1996), pp. 364-366.

[2] Venter, J.C. et al., Shotgun Sequencing of the Human Genome, SCIENCE, 280, (June 5, 1998), pp. 1540-1542.

[3] Gee, H., Junk Science, Draft of A Journey into the Genome: What's There, NATURE, www.nature.com.

[4] Cristea, P., Independent Component Analysis for Genetic Signals, SPIE Conference BiOS 2001 International Biomedical Optics Symposium, SC316, Short Course, San Jose, USA, 20-26 January 2001.

[5] Venter, J.C. et al., Draft Analysis of the Human Genome by Celera Genomics, SCIENCE, 291, (16 February 2001), pp. 1304-1351, www.sciencemag.org.

[6] Myers, E.W. et al., A Whole-Genome Assembly of Drosophila, SCIENCE, 287, (March 24, 2000), pp. 2196-2204.

[7] Doolittle, W.F., Phylogenetic classification and the universal tree, SCIENCE, 284, (June 25, 1999), pp. 2124-2128.

[8] Andersson, J.O. & Nesbø, C.L., Are there bugs in our genome?, SCIENCE EXPRESS, (May 17, 2001).

[9] Davis, R.H., Weller, S.G., The Gist of Genetics, Jones & Bartlett Publishers, 1996, 1998.
