Jun 23, 2009, 10:34 PM
Post #1 of 2
Extracting DNA sequences from GenBank files using Perl
Using Perl, I need to extract DNA bases from a GenBank file for a given plant species. A sample GenBank file is here...
This is saved on my computer as NC_001666.gb. I also have a text file, saved as NC_001666.txt, that lists all the genes and their positions for the species corresponding to NC_001666 (corn). Here is a sample of how the text file is formatted...
For example, at the command prompt I would give the program name, the number of the species I want, and the gene from that species whose DNA sequence I want:
perl nucleotide_bases.pl NC_001666 trnM
The program would open NC_001666.txt, find trnM, and see that it spans positions 54020 to 54092 and is on the positive strand (no negative sign). The program would then open NC_001666.gb, go to the long list of DNA bases at the bottom, start at position 54020, and return every base through position 54092 (inclusive). So for this specific trnM, the output would be:
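The lookup step described above might be sketched roughly as follows. This assumes the bases sit between the ORIGIN line and the closing // line of the GenBank record, with position numbers and spaces mixed in, which is the standard flat-file layout. The subroutine names read_origin_sequence and extract_range are made up for illustration, and reading the gene's range out of the text file is left out.

```perl
use strict;
use warnings;

# Collect the bases listed after the ORIGIN line of a GenBank record.
# Takes an open filehandle and returns one lowercase string of letters only.
sub read_origin_sequence {
    my ($fh) = @_;
    my ($in_origin, $seq) = (0, '');
    while (my $line = <$fh>) {
        if ($line =~ /^ORIGIN/) { $in_origin = 1; next; }
        next unless $in_origin;
        last if $line =~ m{^//};      # end-of-record marker
        $line =~ s/[^A-Za-z]//g;      # drop position numbers and whitespace
        $seq .= lc $line;
    }
    return $seq;
}

# Return bases $start..$end, counted from 1 and inclusive at both ends.
sub extract_range {
    my ($seq, $start, $end) = @_;
    die "Bad range $start..$end\n"
        if $start < 1 || $end < $start || $end > length $seq;
    return substr($seq, $start - 1, $end - $start + 1);
}
```

With the real file, this would be used as something like open my $fh, '<', 'NC_001666.gb', then extract_range(read_origin_sequence($fh), 54020, 54092). The "- 1" is there because substr counts from 0 while GenBank positions count from 1.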
If a gene has a negative sign next to its position range (meaning it is on the negative strand of DNA), the output should be reversed, running from the higher position down to the lower. In that case the bases should also be complemented: every A swapped with T and every G with C, and vice versa. In other words, the output should be the reverse complement of the range.
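A minimal sketch of the negative-strand handling, assuming the range has already been pulled out as a plain string of bases. The name reverse_complement is only illustrative.

```perl
use strict;
use warnings;

# Reverse the string, then swap A<->T and G<->C in either case.
# Any other letters (for example n) are left unchanged by tr.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;     # scalar context reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

print reverse_complement('aacg'), "\n";   # prints "cgtt"
```

Reversing first and complementing second gives the same result as the other order, so either would work here.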
Also, if a gene appears more than once in the text file, the program should print an error message saying that it appears more than once, and then exit.
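The duplicate check could look something like the sketch below. It is only a guess at the details: it assumes the gene name appears as its own whitespace-separated field on each line of the text file, which may not match the real layout, and the names matching_lines and unique_gene_line are hypothetical. Treating a gene that is not found at all as an error is an extra assumption as well.

```perl
use strict;
use warnings;

# Return every line that has $gene as one of its whitespace-separated fields.
# ASSUMPTION: exact field matching; adapt this to the real file layout.
sub matching_lines {
    my ($fh, $gene) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, $line if grep { $_ eq $gene } split ' ', $line;
    }
    return @hits;
}

# Die unless the gene is listed exactly once; otherwise return its line.
sub unique_gene_line {
    my ($fh, $gene) = @_;
    my @hits = matching_lines($fh, $gene);
    die "Error: $gene not found\n" if @hits == 0;
    die "Error: $gene appears more than once (" . scalar(@hits) . " times)\n"
        if @hits > 1;
    return $hits[0];
}
```

Because die prints its message to STDERR and exits with a non-zero status when nothing catches it, calling unique_gene_line at the top level of the script would give the "error message, then end the program" behaviour.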
If I could get a Perl script that returns this information for any species (NC_number) and any gene from that species, it would be a great help in the research I am conducting. Thank you all for your time; any help on how to write this script would be appreciated.