- Perl Program To Calculate Gc Content Inventory
- Perl Program To Calculate Gc Content Analysis
- Perl Program To Calculate Gc Content Level
Perl is well suited to this task. Install ActivePerl on your PC and write a Perl script that extracts the numbers from the text file and performs your calculation. Title: GC Calculators. Short description: this GC Calculator package includes a Pressure Flow Calculator, a Vapor Volume Calculator and a Solvent Vent Calculator to help you optimize your GC method parameters. After launching the installation wizard, you can select the calculator(s) that you want to install.
Description
Calculates the fraction of G+C bases in the input nucleic acid sequence(s). It reads in nucleic acid sequences, sums the number of 'g' and 'c' bases, and returns the result as a fraction (in the interval 0.0 to 1.0) of the total number of 'a', 'c', 'g' and 't' bases. The global G+C content (GC), the G+C content in the first codon position (GC1), in the second codon position (GC2), and in the third codon position (GC3) can be computed. All functions can take ambiguous bases into account when requested.
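As an illustration of the four quantities described above, here is a minimal Perl sketch. The helper names `gc_fraction` and `gc_at_codon_position` are hypothetical and are not part of seqinr. The sketch assumes that ambiguous bases are simply ignored and that a trailing incomplete codon is still sampled, which may differ from what seqinr does.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of G+C among the unambiguous bases a, c, g, t.
# Returns undef when the sequence contains no a, c, g or t at all.
sub gc_fraction {
    my ($seq) = @_;
    $seq = lc $seq;
    my $gc   = ($seq =~ tr/gc//);     # tr with an empty replacement only counts
    my $acgt = ($seq =~ tr/acgt//);
    return $acgt ? $gc / $acgt : undef;
}

# Hypothetical helper: G+C fraction at codon position $pos (1, 2 or 3),
# with the reading frame given as an offset of 0, 1 or 2.
sub gc_at_codon_position {
    my ($seq, $pos, $frame) = @_;
    $frame //= 0;
    my $picked = '';
    for (my $i = $frame + $pos - 1; $i < length $seq; $i += 3) {
        $picked .= substr($seq, $i, 1);   # keep every third base
    }
    return gc_fraction($picked);
}

my $cds = 'atggcgtaa';
printf "GC  = %.3f\n", gc_fraction($cds);
printf "GC%d = %.3f\n", $_, gc_at_codon_position($cds, $_) for 1 .. 3;
```

For the toy sequence `atggcgtaa`, the third codon position samples `g`, `g` and `a`, so GC3 is 2/3 while the global value is 4/9.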
Usage

Arguments
seq: a nucleic acid sequence as a vector of single characters
frame: for coding sequences, an integer (0, 1, 2) giving the frame
forceToLower: logical. If TRUE, force sequence characters to lower-case. Set this to FALSE to save time if your sequence is already in lower-case (CPU time is approximately divided by 3 when turned off).
exact: logical. If TRUE, ambiguous bases are taken into account when computing the G+C content (see Details). Set this to FALSE to save time if you can neglect ambiguous bases in your sequence (CPU time is approximately divided by 3 when turned off).
NA.GC: what should be returned when the G+C content is impossible to compute from the data, for instance with NNNNNNN. This behaviour can differ when the argument exact is TRUE: for instance, the G+C content of WWSS is NA by default, but is 0.5 when exact is set to TRUE.
...: arguments passed to the function GC
pos: for coding sequences, the codon position (1, 2, 3) that should be taken into account to compute the G+C content
oldGC: logical, defaulting to FALSE: should the G+C content be computed as in seqinR <= 1.0-6, that is, as the sum of 'g' and 'c' bases divided by the length of the sequence? As of seqinR >= 1.1-3, this argument is deprecated and a warning is issued.
alphabet: alphabet used. This allows you to choose the ambiguous bases used during the GC calculation.


Value
GC returns the fraction of G+C (in [0,1]) as a numeric vector of length one. GCpos returns GC at position pos. GC1, GC2 and GC3 are wrappers for GCpos with the argument pos set to 1, 2, and 3, respectively. NA is returned when seq is NA. NA.GC, defaulting to NA, is returned when the G+C content cannot be computed from the data.
Details
When exact is set to TRUE, the G+C content is estimated with ambiguous bases taken into account. Note that this is time-consuming. A first pass is made over the non-ambiguous bases to estimate the probabilities of the four bases in the sequence. These are then used to weight the contributions of ambiguous bases to the G+C content. Let nx denote the total number of bases 'x' in the sequence. For instance, suppose that there are nb bases 'b'. 'b' stands for 'not a', that is, for 'c', 'g' or 't'. The contribution of 'b' bases to the GC base count will be:
nb*(nc + ng)/(nc + ng + nt)
The contribution of 'b' bases to the AT base count will be:
nb*nt/(nc + ng + nt)
The contributions of all ambiguous bases to the AT and GC counts are weighted in a similar way, and the G+C content is then computed as ngc/(nat + ngc).
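The weighting scheme above can be sketched in Perl as follows. This is an illustrative sketch, not seqinr's implementation. The helper name `gc_exact` and the `%AMBIGUOUS` table are assumptions, the table uses the standard IUPAC codes, and 's' and 'w' are counted directly as GC and AT so that WWSS gives 0.5, as stated in the Arguments section.

```perl
use strict;
use warnings;

# IUPAC codes that mix GC and AT bases, with the bases each may stand for.
my %AMBIGUOUS = (
    m => 'ac',  k => 'gt',  r => 'ag',  y => 'ct',
    v => 'acg', h => 'act', d => 'agt', b => 'cgt',
);

# Hypothetical helper: G+C fraction with ambiguous bases weighted by the
# counts of the unambiguous bases they may represent.
sub gc_exact {
    my ($seq) = @_;
    $seq = lc $seq;

    # First pass: count every symbol in the sequence.
    my %n;
    $n{ substr($seq, $_, 1) }++ for 0 .. length($seq) - 1;
    $n{$_} //= 0 for qw(a c g t s w);

    # 's' is always c or g and 'w' is always a or t, so no weighting is needed.
    my $ngc = $n{c} + $n{g} + $n{s};
    my $nat = $n{a} + $n{t} + $n{w};

    for my $code (sort keys %AMBIGUOUS) {
        my $count = $n{$code} or next;
        my @bases = split //, $AMBIGUOUS{$code};
        my $total = 0;
        $total += $n{$_} for @bases;
        next unless $total;               # nothing to estimate the weights from
        my $gc = 0;
        $gc += $n{$_} for grep { $_ eq 'c' || $_ eq 'g' } @bases;
        $ngc += $count * $gc / $total;              # e.g. nb*(nc + ng)/(nc + ng + nt)
        $nat += $count * ($total - $gc) / $total;   # e.g. nb*nt/(nc + ng + nt)
    }

    my $all = $ngc + $nat;
    return $all ? $ngc / $all : undef;
}

printf "%.4f\n", gc_exact('aaccggttbb');
```

In `aaccggttbb`, the two 'b' bases contribute 2*4/6 to the GC count and 2*2/6 to the AT count, so the result is (4 + 4/3)/10.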
References
citation('seqinr').
The program codonW used here for comparison is available at http://codonw.sourceforge.net/.
See Also
You can use s2c to convert a string into a vector of single characters and tolower to convert upper-case characters into lower-case characters. Do not confuse GC with gc, which is used for garbage collection.
Examples
Intermediate Perl
GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.
The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:
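A minimal sketch of what such an array-based version might look like. This is a hypothetical reconstruction, not the original listing, and the sub name `gc_content_array` is an assumption.

```perl
use strict;
use warnings;

# Hypothetical reconstruction of the array-based approach.
sub gc_content_array {
    my ($sequence) = @_;
    my @bases = split //, $sequence;       # one Perl scalar per base
    return undef unless @bases;            # empty sequence: nothing to compute
    my $gc = grep { /[GCgc]/ } @bases;     # grep in scalar context returns a count
    return $gc / @bases;                   # @bases in numeric context is its length
}

my $sequence = 'ATGCGC';
printf "%.3f\n", gc_content_array($sequence);
```

The `split` call creates a separate scalar for every nucleotide, which is where the memory goes on chromosome-sized input.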
This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. As a result, I ran out of memory quite quickly.
I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time. However, according to a colleague, Andy Jenkinson, it contains some bugs:
The reasons for being wrong, Andy argues, are that “it ignores the first character of the sequence because the substr function is zero-index based. The rounding at the end using S{6} also only works where there are >=6 characters in the resulting fraction – so a string like “ATCG” has a GC content of 0.5, but will appear to your application as zero. If you need to do this, you should use S{0,6}.”
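A substr-based loop that avoids both problems Andy describes could look like the sketch below. It is not the original #METHOD 2 listing. The sub name `gc_content_substr` is hypothetical, and `sprintf` is swapped in for the regex truncation so that short fractions such as 0.5 survive.

```perl
use strict;
use warnings;

# Hypothetical corrected version of the substr-based approach.
sub gc_content_substr {
    my ($sequence) = @_;
    my $length = length $sequence;
    return undef unless $length;

    my $gc = 0;
    for (my $i = 0; $i < $length; $i++) {      # start at 0: substr is zero-based
        my $base = uc substr($sequence, $i, 1);
        $gc++ if $base eq 'G' || $base eq 'C';
    }

    # Round to six decimal places without a regex, so 0.5 stays 0.5.
    return sprintf('%.6f', $gc / $length);
}

print gc_content_substr('ATCG'), "\n";
```

Only one base is held at a time, so memory use stays flat, at the cost of one `substr` call per nucleotide.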
In addition to this, he adds that whilst it solves the memory issue, [one] might also consider a much more CPU-friendly and simpler implementation:
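One common CPU-friendly idiom in Perl is to count characters with the `tr///` operator. The sketch below assumes that this is the kind of approach meant. The original #METHOD 3 listing is not reproduced here and may differ, and the sub name `gc_content_tr` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch: count G and C in a single pass with tr///.
sub gc_content_tr {
    my ($sequence) = @_;
    my $length = length $sequence;
    return undef unless $length;

    # With an empty replacement list and no /d flag, tr/// leaves the
    # string untouched and simply returns how many characters matched.
    my $gc = ($sequence =~ tr/GCgc//);
    return $gc / $length;
}

printf "%.2f\n", gc_content_tr('ATCG');
```

Because the counting happens inside the Perl interpreter's own loop, no per-base Perl code runs and no extra copy of the sequence is made.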
He timed #METHOD 3 on human chromosome 1 (247 million characters): it took 12 seconds, compared with 111 seconds for #METHOD 2, with the same memory footprint. Here is the source code for Andy’s simulation:
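A comparison of this kind can be approximated with the sketch below. It is not Andy's code: the sub names are hypothetical, and it assumes a random one-million-base test sequence instead of a real chromosome, so absolute timings will differ from the figures quoted above.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical stand-in for the substr-based method.
sub gc_substr {
    my ($sequence) = @_;
    my ($gc, $length) = (0, length $sequence);
    for (my $i = 0; $i < $length; $i++) {
        my $base = substr($sequence, $i, 1);
        $gc++ if $base eq 'G' || $base eq 'C';
    }
    return $length ? $gc / $length : undef;
}

# Hypothetical stand-in for a tr///-based method.
sub gc_tr {
    my ($sequence) = @_;
    my $length = length $sequence;
    return $length ? ($sequence =~ tr/GC//) / $length : undef;
}

# Random upper-case test sequence (an assumption, not real genomic data).
my @alphabet = qw(A C G T);
my $sequence = join '', map { $alphabet[ int(rand(4)) ] } 1 .. 1_000_000;

for my $method (['substr', \&gc_substr], ['tr', \&gc_tr]) {
    my ($name, $code) = @$method;
    my $start = time;
    my $gc    = $code->($sequence);
    printf "%-6s GC = %.4f in %.3f s\n", $name, $gc, time - $start;
}
```

Both methods should report the same GC fraction, and only the elapsed time should differ.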
I have not had time to test #METHOD 3 yet, but I hope this last addition helps people.
Happy coding!