Perl Program To Calculate Gc Content

The perl programming language would be perfect for this task. Install active perl on your pc and write a perl script to extract numbers from the text file and perform your calculation. Title: GC Calculators. Short Description: This GC Calculator package includes a Pressure Flow Calculator, Vapor Volume Calculator and Solvent Vent Calculator to help you optimize your GC method parameters. After launching the installation wizard, you will have the ability to select the calculator (s) that you want to have installed.

Description

Calculates the fraction of G+C bases of the input nucleic acid sequence(s). It reads in nucleic acid sequences, sums the number of 'g' and 'c' bases and writes out the result as the fraction (in the interval 0.0 to 1.0) to the total number of 'a', 'c', 'g' and 't' bases. Global G+C content GC, G+C in the first position of the codon bases GC1, G+C in the second position of the codon bases GC2, and G+C in the third position of the codon bases GC3 can be computed. All functions can take ambiguous bases into account when requested.

Usage

Perl

Arguments

a nucleic acid sequence as a vector of single characters

for coding sequences, an integer (0, 1, 2) giving the frame

Perl Program To Calculate Gc Content Inventory

logical. if TRUE force sequence characters in lower-case. Turn this to FALSE to save time if your sequence is already in lower-case (cpu time is approximately divided by 3 when turned off)

logical: if TRUE ambiguous bases are taken into account when computing the G+C content (see details). Turn this to FALSE to save time if your you can neglect ambiguous bases in your sequence (cpu time is approximately divided by 3 when turned off)

what should be returned when the GC is impossible to compute from data, for instance with NNNNNNN. This behaviour could be different when argument exact is TRUE, for instance the G+C content of WWSS is NA by default, but is 0.5 when exact is set to TRUE

arguments passed to the function GC

for coding sequences, the codon position (1, 2, 3) that should be taken into account to compute the G+C content

logical defaulting to FALSE: should the GC content computed as in seqinR <= 1.0-6, that is as the sum of 'g' and 'c' bases divided by the length of the sequence. As from seqinR >= 1.1-3, this argument is deprecated and a warning is issued.

alphabet used. This allows you to choose ambiguous bases used during GC calculation.

PerlContent

Value

GC returns the fraction of G+C (in [0,1]) as a numeric vector of length one. GCpos returns GC at position pos. GC1, GC2, GC3 are wrappers for GCpos with the argument pos set to 1, 2, and 3, respectively. NA is returned when seq is NA. NA.GC defaulting to NA is returned when the G+C content can not be computed from data.

Details

When exact is set to TRUE the G+C content is estimated with ambiguous bases taken into account. Note that this is time expensive. A first pass is made on non-ambiguous bases to estimate the probabilities of the four bases in the sequence. They are then used to weight the contributions of ambiguous bases to the G+C content. Let note nx the total number of base 'x' in the sequence. For instance suppose that there are nb bases 'b'. 'b' stands for 'not a', that is for 'c', 'g' or 't'. The contribution of 'b' bases to the GC base count will be:

nb*(nc + ng)/(nc + ng + nt)

The contribution of 'b' bases to the AT base count will be:

nb*nt/(nc + ng + nt)

All ambiguous bases contributions to the AT and GC counts are weighted is similar way and then the G+C content is computed as ngc/(nat + ngc).

References

citation('seqinr').

The program codonW used here for comparison is available at http://codonw.sourceforge.net/.

See Also

You can use s2c to convert a string into a vetor of singlecharacter and tolower to convert upper-case characters intolower-case characters. Do not confuse with gc for garbage collection.

Examples

Intermediate Perl

GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.

The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:

This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.

I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time. However, according to a colleague, Andy Jenkinson, it contains some bugs:

The reasons for being wrong, Andy argues, are that “it ignores the first character of the sequence because the substr function is zero-index based. The rounding at the end using S{6} also only works where there are >=6 characters in the resulting fraction – so a string like “ATCG” has a GC content of 0.5, but will appear to your application as zero. If you need to do this, you should use S{0,6}.”

I addition to this, he adds that whilst it solves the memory issue, [one] might also consider a much more CPU-friendly and simpler implementation:

He carried out a test simulation of #METHOD 3 for human chromosome 1 (247 million characters), which took 12 seconds with the same memory footprint as #METHOD 2, which took 111 seconds. Here is the source code for Andy’s simulation:

Perl Program To Calculate Gc Content Analysis

I have not had time to test #METHOD 3 yet, but I hope this last addition helps people.

Perl Program To Calculate Gc Content Level

Happy coding!

Comments are closed.