Need a program that prints out letters from another file

 



kylle345
New User

May 9, 2009, 7:14 PM

Post #1 of 3 (2298 views)
Need a program that prints out letters from another file

So basically what I want to do is pull out DNA sequences for a particular gene name.

I have 2 files (FILE1 and FILE2) and I want an output into a separate file (FILE3).

FILE1 and 2 are MASSIVE so I am only posting examples from each file.

So FILE1 looks like this (tab delimited, 4 columns):

##gff-version 1

1154 10 + AAD6
418 7429 + AAH1
702 759 + AAT1
584 10 - ABF2
642 4894 - ACC1
651 7213 - ACN9
1055 3454 - ADE1

The next file, FILE2, looks like this:


>1154
ATCTCACTCGTAATTCTACATAATTTTGTTTATGCTTTTATTGTCATTTTATATATTGTCAGTCATTATCCTATTACATTATCAATCCTTGCATTTCAGC TTCCACTTATTTCGATGACCGCTTCTCATAACTTATGTCATCTTCTAACACCGTATATGATAATGTACCAGTAGTATGAC
>584
GCAAGCTTTATAGTGACAACAATAAGGTATCACTCGGTTACAATTACCCCCACTTCCCCT


What I want to do is match column 1 of FILE1 with the >number header in FILE2. So for example, 1154 from FILE1 will match up with >1154 from FILE2. Next, I want it to find the position given in column 2 (so for 1154, it will find the 10th letter, which happens to be G). If column 3 of FILE1 is +, then it will print the 8 letters in front of that position (i.e. the 8 letters in front of G would be TCTCACTC). But if column 3 is -, then it will do the reverse. So for ABF2 on "584" it will take the 8 letters counting from the reverse end. So instead of starting at "G" at the beginning of >584, it will start at "T" (the end). So the position of ABF2 will be 25 letters away from "T", so the letter will be "C". Then it will take the values behind it, so CCACTTCC.

The output file will print out column 4 of FILE1, the top 8 letters from FILE2 and column 3 from FILE1.

The final file (FILE3) will look like this:

AAD6 TCTCACTC +
ABF2 CCACTTCC -
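
The + rule can be checked by hand against the sample data. Below is a minimal sketch in Python (not Perl, purely as an illustration). The function name `letters_before` is made up for this example, and it assumes column 2 is a 1-based position.

```python
def letters_before(seq, pos, n=8):
    """Return up to n letters immediately before the 1-based position pos."""
    end = pos - 1            # 0-based index of the letter at position pos
    start = max(0, end - n)  # clamp so positions near the start do not wrap around
    return seq[start:end]

# First 20 letters of the >1154 record from FILE2.
seq_1154 = "ATCTCACTCGTAATTCTACA"

print(seq_1154[10 - 1])              # G
print(letters_before(seq_1154, 10))  # TCTCACTC
```

The slice stops just before the letter at column 2, so G itself is not included.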


Could someone give me some help on this? I am new to Perl and I have been put in a situation where I have to program at a very high level.

Thanks


vikas.deep
User

May 12, 2009, 5:39 AM

Post #2 of 3 (2253 views)
Re: [kylle345] Need a program that prints out letters from another file

Dear friend,
It can be done, but your requirement is not clear. Always use the "Preview Post" button before posting. The forward strand requirement is clear; I guess you want to design some primers for PCR amplification or a similar experiment.
But the reverse strand requirement is not clear: "So ... "584" ... "G" ... "T" ... "T" ... "C" ...".
So, dear friend, what is "T"? If the end base is taken as the starting point, then in the case of "584 10 - ABF2" you should be looking for "ACTTCCCC" (if the last thymine is taken as the start/0-point), but how you arrive at "CCACTTCC" is not clear.
- For all my suggestions: "I am sure someone else can do it in a better or more elegant manner!"
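
The two candidate answers for the - strand can be compared with a short sketch, written in Python purely for illustration. It covers one possible reading only: the final T of >584 is counted as position 1, and the 8 letters between the target letter and that T are reported in their original left-to-right order. The function name `letters_after_from_end` is hypothetical.

```python
def letters_after_from_end(seq, pos, n=8):
    """Return up to n letters following the pos-th letter counted from the end."""
    target = len(seq) - pos  # 0-based index of the pos-th letter from the end
    return seq[target + 1 : target + 1 + n]

# The >584 record from FILE2 (60 letters).
seq_584 = "GCAAGCTTTATAGTGACAACAATAAGGTATCACTCGGTTACAATTACCCCCACTTCCCCT"

print(seq_584[len(seq_584) - 10])           # C
print(letters_after_from_end(seq_584, 10))  # ACTTCCCC
print(seq_584[49:57])                       # CCACTTCC
```

Under this reading, the 10th letter from the end is C and the result is ACTTCCCC. The string CCACTTCC from the original post is letters 50 to 57 counted from the left, a window that includes the target letter, so it seems to follow a different rule that would need to be spelled out.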


JenniC
Novice

Jul 8, 2009, 10:57 AM

Post #3 of 3 (2132 views)
Re: [kylle345] Need a program that prints out letters from another file

Hi Kylle:

Interesting problem. Which university do you work at? (I know someone who does similar parsing and sequencing work.)

I will show you a quick script for the first part of your requirement - how to handle the + sign.

The second part of your requirement is unreadable. Feel free to repost that part.

Below is the script. I will show samples in comments.


Code
   

set $wsep= " "

# Read file 1.

var str data1 ; cat "FILE1" > $data1

# Read lines in file1 one by one.

while ($data1 <> "")

do

# Get the next line.

var str line1 ; lex "1" $data1 > $line1

# $line1 is "1154 10 + AAD6"

# Get columns 1, 2 and 3 in $line1

var str col1, col2, col3 ; wex -p "1" $line1 > $col1 ; wex -p "2" $line1 > $col2 ; wex -p "3" $line1 > $col3

# $col1 is "1154", $col2 is "10", $col3 is "+".

# Read file 2.

var str data2 ; cat "FILE2" > $data2

# Strip off everything up to $col1 (which has 1154) and the newline.

stex ("^"+col1+"\n^]") $data2 > null # We don't want to see the stripped off portion, thus > null.

# $data2 now begins with "ATCTCACTCGTAATTCTACATAATTTT...".

# Strip off everything beginning with the 10th ($col2) chars.

stex ("["+col2) $data2 > null # Again, we don't want to see the removed portion.

# $data2 now is "ATCTCACTC".

# We want to collect the last 8 chars. That means we want to remove the first (10 - 8 - 1) = 1 chars.

# Remember $col2 is "10".

var int chars ; set $chars = makeint(str($col2))- 8 - 1

# $chars is now 1.

# Extract starting at 1st char. This time, print the output to screen.

stex ("["+makestr(int($chars))) $data2

done



The script is in biterscripting. To try,

1. Download biterscripting from http://www.biterscripting.com .

2. Save the above script as "C:\Scripts\sequence.txt".

3. Start biterscripting. Run the sequence script by entering the following command.


Code
   script sequence.txt




You should see the correct output. Make sure you use correct paths for FILE1 and FILE2 and enclose them in double quotes.



Once you are done testing, you can call the script from your overall Perl program as follows.


Code
 C:\biterScripting\biterScripting.exe sequence.txt > output.txt
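
For comparison, here is a rough sketch of the same + strand step over both files, written in Python purely as an illustration. It makes several assumptions: column 2 is a 1-based position, lines in FILE1 that do not have exactly 4 fields (such as the ##gff-version header) are skipped, spaces inside a sequence are ignored, and rows with - are left out until that rule is clarified. The function names `read_sequences` and `forward_rows` are made up for this example.

```python
def read_sequences(lines):
    """Map each '>id' header to its sequence, joining lines and dropping spaces."""
    parts, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].strip()
            parts[current] = []
        elif line and current is not None:
            parts[current].append("".join(line.split()))
    return {seq_id: "".join(chunks) for seq_id, chunks in parts.items()}

def forward_rows(file1_lines, seqs, n=8):
    """Build 'gene letters strand' output lines for the + rows of FILE1."""
    rows = []
    for line in file1_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skips the header line and blank lines
        seq_id, pos, strand, gene = fields
        if strand != "+" or seq_id not in seqs:
            continue  # - rows are not handled in this sketch
        end = int(pos) - 1  # 0-based index of the letter at column 2
        letters = seqs[seq_id][max(0, end - n):end]
        rows.append(f"{gene} {letters} {strand}")
    return rows

# Small in-memory samples shaped like FILE1 and FILE2.
file1 = ["##gff-version 1", "", "1154\t10\t+\tAAD6", "584\t10\t-\tABF2"]
file2 = [">1154", "ATCTCACTCGTAATTCTACA", ">584", "GCAAGCTTTATAG"]

for row in forward_rows(file1, read_sequences(file2)):
    print(row)  # AAD6 TCTCACTC +
```

Both functions accept any iterable of lines, so open file handles for FILE1 and FILE2 could be passed in place of the sample lists. FILE2 is read once into a lookup table instead of being re-read for every line of FILE1, which should matter when both files are large.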









Vikas:

Good suggestions for a new user. What is PCR amplification?




Jenni


(This post was edited by JenniC on Jul 8, 2009, 11:30 AM)

 
 

