[SOLVED] Regex on multiple lines for Science

 



puchu
New User

Apr 5, 2013, 2:28 PM

Post #1 of 6 (620 views)

I am a scientist, and I have a text file like this:

Code
 
IDENTIFIERS

dbEST Id: 76635231
EST name: YG14511
GenBank Acc: JK818318
GenBank gi: 384372042

CLONE INFO
DNA type: cDNA

PRIMERS
PolyA Tail: Unknown

SEQUENCE
TGACTGCGTACGAATTTGATAGCTGCTTCTGCTTTCATGGAACCATTAAAGGAATGGATG
TTTGTTGACAAGGAAGGTGTACAAGTTGGACCACTGGAGAAGGATGCTATCAGAAGATTC
TGGTCAAAGAAAGGTATTGATTGGACAACAAGGTGCTGGGCTTCTGGGATGTCAGATTGG
AAGAGATTACGTGATATCCGTGAACTTCGGTGGGCACTAGCTGTTCGGGTTCCTGTTCTT
ACTTCAACGCAGGTACTGTTGTATTATGCATGTTTACTTGAGTCTATATTACATTGTGAA
TGCTACATTGCACTTTGAGTTGATTTTGACTAAGTTTACTTCACCACAAACTGCAGCTAC
TGGTGGACTGCATTTGTAGTTGTATCCGGTTTGGACTCA

Entry Created: Apr 16 2012
Last Updated: Apr 17 2012

I would like to extract only the "TGACT..." sequence and store it in a single-line Perl variable called $sequence, with no \n characters. Could you tell me the correct regular expression, or another good way to extract it? The problem is that the TGACT... sequence is spread across several lines. Any help would be appreciated, thanks!


(This post was edited by puchu on Apr 6, 2013, 9:26 AM)


FishMonger
Veteran / Moderator

Apr 5, 2013, 2:44 PM

Post #2 of 6 (616 views)
Re: [puchu] Regex on multiple lines for Science


Code
#!/usr/bin/perl 

use 5.10.1;
use strict;
use warnings;

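my $sequence = '';
while (my $line = <DATA>) {
    # the range operator is true from the SEQUENCE header through the next blank line
    if ($line =~ /^SEQUENCE/ .. $line =~ /^\s*$/) {
        next if $line =~ /^SEQUENCE/;    # skip the header line itself
        $sequence .= $line;
    }
}
$sequence =~ s/\s+//g;    # remove newlines so the sequence is one line
print "$sequence\n";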


# the following simulates your text file
__DATA__
IDENTIFIERS

dbEST Id: 76635231
EST name: YG14511
GenBank Acc: JK818318
GenBank gi: 384372042

CLONE INFO
DNA type: cDNA

PRIMERS
PolyA Tail: Unknown

SEQUENCE
TGACTGCGTACGAATTTGATAGCTGCTTCTGCTTTCATGGAACCATTAAAGGAATGGATG
TTTGTTGACAAGGAAGGTGTACAAGTTGGACCACTGGAGAAGGATGCTATCAGAAGATTC
TGGTCAAAGAAAGGTATTGATTGGACAACAAGGTGCTGGGCTTCTGGGATGTCAGATTGG
AAGAGATTACGTGATATCCGTGAACTTCGGTGGGCACTAGCTGTTCGGGTTCCTGTTCTT
ACTTCAACGCAGGTACTGTTGTATTATGCATGTTTACTTGAGTCTATATTACATTGTGAA
TGCTACATTGCACTTTGAGTTGATTTTGACTAAGTTTACTTCACCACAAACTGCAGCTAC
TGGTGGACTGCATTTGTAGTTGTATCCGGTTTGGACTCA

Entry Created: Apr 16 2012
Last Updated: Apr 17 2012


outputs

Quote
TGACTGCGTACGAATTTGATAGCTGCTTCTGCTTTCATGGAACCATTAAAGGAATGGATGTTTGTTGACAAGGAAGGTGTACAAGTTGGACCACTGGAGAAGGATGCTATCAGAAGATTCTGGTCAAAGAAAGGTATTGATTGGACAACAAGGTGCTGGGCTTCTGGGATGTCAGATTGGAAGAGATTACGTGATATCCGTGAACTTCGGTGGGCACTAGCTGTTCGGGTTCCTGTTCTTACTTCAACGCAGGTACTGTTGTATTATGCATGTTTACTTGAGTCTATATTACATTGTGAATGCTACATTGCACTTTGAGTTGATTTTGACTAAGTTTACTTCACCACAAACTGCAGCTACTGGTGGACTGCATTTGTAGTTGTATCCGGTTTGGACTCA



FishMonger
Veteran / Moderator

Apr 5, 2013, 2:50 PM

Post #3 of 6 (614 views)
Re: [FishMonger] Regex on multiple lines for Science

We could simplify it a little.


Code
while (my $line = <DATA>) {
    # \s* allows optional leading whitespace; the sample lines have none
    if ($line =~ /^\s*[ACGT]+$/) {
        $sequence .= $line;
    }
}
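
For anyone who wants to try the character-class idea in isolation, here is a minimal self-contained sketch. The helper name collect_bases and the shortened sample lines are made up for illustration and are not from the posts above. It assumes the sequence uses only uppercase A, C, G, and T, and that no other line in the file consists solely of those letters; a header such as "TAG" would be picked up by mistake.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): joins every line that consists
# solely of A, C, G, or T, ignoring surrounding whitespace.
sub collect_bases {
    my (@lines) = @_;
    my $sequence = '';
    for my $line (@lines) {
        # \s* on both sides tolerates indentation and the trailing newline;
        # the capture group keeps only the bases themselves
        if ($line =~ /^\s*([ACGT]+)\s*$/) {
            $sequence .= $1;
        }
    }
    return $sequence;
}

# Shortened stand-in for the real file contents (assumed data)
my @lines = (
    "PRIMERS\n",
    "PolyA Tail: Unknown\n",
    "\n",
    "SEQUENCE\n",
    "TGACTGCG\n",
    "  TACGAATT\n",
    "\n",
    "Entry Created: Apr 16 2012\n",
);

print collect_bases(@lines), "\n";
```

Because the capture group excludes whitespace, no separate s/\s+//g pass is needed in this variant.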



Kenosis
User

Apr 5, 2013, 7:27 PM

Post #4 of 6 (610 views)
Re: [puchu] Regex on multiple lines for Science

Here's another option:


Code
use strict; 
use warnings;

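# slurp the whole file into one string
my $data = do { local $/; <> };
# capture everything between SEQUENCE and "Entry Created", then strip whitespace
$data = $1 and $data =~ s/\s+//g if $data =~ /(?<=SEQUENCE)(.+)(?=Entry Created)/s;
print $data;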


Usage: perl script.pl inFile [>outFile]

The last, optional parameter will direct output to a file.

The above first slurps the file into a single variable. It then uses a regex to capture everything between SEQUENCE and Entry Created, and a substitution to remove all whitespace from the captured text.
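
The same slurp-then-capture idea can be wrapped in a small sub that reports a failed match instead of printing the whole file. This is only a sketch: the name extract_sequence and the shortened sample record are hypothetical, and it assumes every record has a SEQUENCE header on its own line followed later by an "Entry Created" line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the sequence as one
# string, or undef when the SEQUENCE ... Entry Created block is missing.
sub extract_sequence {
    my ($text) = @_;

    # /m anchors ^ at line starts; /s lets . cross newlines.
    # The non-greedy .+? stops at the first "Entry Created" line.
    return unless $text =~ /^SEQUENCE\s*\n(.+?)(?=^Entry Created)/ms;

    my $sequence = $1;
    $sequence =~ s/\s+//g;    # drop newlines and any stray spaces
    return $sequence;
}

# Shortened stand-in for a slurped file (assumed data); in a real script
# this string would come from: my $record = do { local $/; <> };
my $record = <<'END';
PRIMERS
PolyA Tail: Unknown

SEQUENCE
TGACTGCG
TACGAATT

Entry Created: Apr 16 2012
END

my $sequence = extract_sequence($record);
print defined $sequence ? "$sequence\n" : "no SEQUENCE block found\n";
```

Returning undef on a failed match lets the caller tell "no sequence in this file" apart from a real result.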

Hope this helps!


(This post was edited by Kenosis on Apr 5, 2013, 8:07 PM)


puchu
New User

Apr 6, 2013, 9:25 AM

Post #5 of 6 (589 views)
Re: [Kenosis] Regex on multiple lines for Science

Thank you, it's working perfectly! You are great!


Kenosis
User

Apr 6, 2013, 11:02 AM

Post #6 of 6 (582 views)
Re: [puchu] Regex on multiple lines for Science

You're most welcome, puchu!

 
 

