Can't get the $1 and $2, but regex works fine with RegexBuddy

 



Alvaro
Novice

Apr 29, 2011, 1:50 PM

Post #1 of 13 (1771 views)

Hello,

I can't make this script work, although the regex is fine when I check it in RegexBuddy. It seems $1 and $2 are not being set. If I use only one of the patterns and print only $1, it works fine, but I can't make the two work together.

Either (?<=')\W*[0-9]*(?=') or (?<==)[a-z0-9]*(?=') alone works just fine.

================

#!C:\Perl\bin\perl.exe
use English;
use strict;
use warnings;
use locale;

open (FILE, '<', 'aname.txt') or die "Cannot open aname.txt: $!";
open (FILE1, '>>', 'aname_result.txt') or die "Cannot open aname_result.txt: $!";

while (<FILE>)
{

my($line) = $_;
chomp($line);

if (my($word) = $_ =~ m/((?<=')\W*[0-9]*(?='))((?<==)[a-z0-9]*(?='))/)
{
print "$1;$2;\n";

}
# else
# {print "Nothing\n";}
}

close (FILE);
close (FILE1);

==============
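
A likely reason the combined pattern never matches: putting the two groups back to back asks Perl to find the second value starting at the exact position where the first one ends. At that position the next character must be ' (from the first look-ahead) and the previous character must be = (from the second look-behind), which the number in a name='...' attribute cannot satisfy. An alternation with |, by contrast, reports each piece as a separate match rather than as $1 and $2 of a single match. A minimal sketch of both behaviours, assuming a shortened line in the shape of the markup being parsed, and using + instead of * so that empty matches are not reported:

```perl
use strict;
use warnings;

# Shortened sample line (an assumption; the real file is not shown here).
my $line = "<a name='05678'><a href='test.php?user=alex1' class='un0' >alex1</a>";

# Concatenated groups: group 2 has to start exactly where group 1 ends.
my $concat = $line =~ m/((?<=')\W*[0-9]*(?='))((?<==)[a-z0-9]*(?='))/ ? 1 : 0;
print "concatenated pattern matched: $concat\n";

# Alternation with the g modifier: every hit is its own match.
my @hits = $line =~ m/(?<=')\W*[0-9]+(?=')|(?<==)[a-z0-9]+(?=')/g;
print join(';', @hits), ";\n";
```

On this line the concatenated pattern should not match at all, while the alternation returns the number and the user name as two separate hits.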


miller
User

Apr 29, 2011, 3:22 PM

Post #2 of 13 (1764 views)
Re: [Alvaro] Can't get the $1 and $2, but regex works fine with RegexBuddy

Your regex looks flawed. What text are you trying to match?

Give us a sample of aname.txt and tell us what you're trying to capture.


(This post was edited by miller on Apr 29, 2011, 3:23 PM)


Alvaro
Novice

Apr 29, 2011, 4:17 PM

Post #3 of 13 (1756 views)
Re: [miller] Can't get the $1 and $2, but regex works fine with RegexBuddy

Hi Miller,

Here is the regex as used in RegexBuddy:
(?<=')\W*[0-9]*(?=')|(?<==)[a-z0-9]*(?=')

Here's a sample:
<a name='05678'><a href='test.php?user=alex1' class='un0' >alex1</a><b class='un'> - 09 Apr'11 - 13:10 - 05678 of 05679&nbsp;<a target='_blank' class='ttpost' href='http://test.com'><img style='margin:0 0 -2px 0' src='http://images.test.com/test1.gif' border='0'></a></b><br><div class='bbbody'><br>This is a test</div></SPAN><br>
</td></tr>
<tr><td align='left'>

If I use only one of the regexes, it works fine, as below:

if (my($word) = $_ =~ m/((?<=')\W*[0-9]*(?='))/)
{
print "$1;\n";
}

Thanks for any help,

BTW, I use Padre.


miller
User

Apr 29, 2011, 4:43 PM

Post #4 of 13 (1753 views)
Re: [Alvaro] Can't get the $1 and $2, but regex works fine with RegexBuddy

Ok, the sample text helps. But what exactly are you trying to capture?


Alvaro
Novice

Apr 29, 2011, 4:47 PM

Post #5 of 13 (1751 views)
Re: [miller] Can't get the $1 and $2, but regex works fine with RegexBuddy

Sorry Miller,

The number at <a name='05678'>
and the user at <a href='test.php?user=alex1' ...>

So as an output I want something like:

05678;alex1;
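
For that output, one pattern with two ordinary capture groups and a non-greedy gap between them may be enough, with no look-arounds needed. The following is a sketch only: `extract_post` is a made-up helper name, and the line is a shortened copy of the sample above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (number, user) for one line, or an empty list.
sub extract_post {
    my ($line) = @_;
    return $line =~ m/
        <a \s+ name='(\d+)'         # capture 1: post number in the name attribute
        .*?                         # skip ahead as little as possible
        [?&] user=([A-Za-z0-9]+)'   # capture 2: user name in the href query string
    /x;
}

my $line = "<a name='05678'><a href='test.php?user=alex1' class='un0' >alex1</a>";
if (my ($number, $user) = extract_post($line)) {
    print "$number;$user;\n";
}
```

Call the helper in list context, as above; in scalar context a match only returns true or false.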


Alvaro
Novice

Apr 29, 2011, 6:37 PM

Post #6 of 13 (1746 views)
Re: [Alvaro] Can't get the $1 and $2, but regex works fine with RegexBuddy

Hi Miller,

It's strange that the second pattern works well in RegexBuddy, but when I put it in the program, it doesn't. The first one still works fine, though.

Not sure what I'm missing... :(


Alvaro
Novice

Apr 29, 2011, 7:00 PM

Post #7 of 13 (1744 views)
Re: [Alvaro] Can't get the $1 and $2, but regex works fine with RegexBuddy

Miller,

I changed the second regex to (?<==)[a-z0-9]+(?=').
Now it matches when used alone, but when I put the two patterns together, nothing happens... :(

Not sure why the * didn't work...
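
The difference between * and + here probably comes down to empty matches. With *, [a-z0-9]* is allowed to match zero characters, and the first place in the line that has = behind it and ' ahead of it is the gap between = and ' in name='05678', so the match succeeds there with an empty string. With +, at least one character is required, so the engine moves on to user=alex1'. A small check, assuming a shortened copy of the sample line:

```perl
use strict;
use warnings;

my $line = "<a name='05678'><a href='test.php?user=alex1' class='un0' >alex1</a>";

# With *, the match succeeds right between = and ' in name='...',
# so the captured text is the empty string.
my ($star) = $line =~ m/((?<==)[a-z0-9]*(?='))/;

# With +, at least one character is needed, so the first hit is the user name.
my ($plus) = $line =~ m/((?<==)[a-z0-9]+(?='))/;

printf "star: '%s' (length %d)\n", $star, length $star;
printf "plus: '%s' (length %d)\n", $plus, length $plus;
```

So the * version does match in the program; it just captures nothing visible, which prints as an empty field.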


Alvaro
Novice

Apr 29, 2011, 8:33 PM

Post #8 of 13 (1737 views)
Re: [Alvaro] Can't get the $1 and $2, but regex works fine with RegexBuddy

Miller,

I made it work with the following:

if (my($word) = $_ =~ m/\W([0-9]*)\W*\w\s[a-z]*\W*[a-z]*.[a-z]*\W[a-z]*\W([a-z0-9]*)\W\s[a-z]*\W\W[a-z0-9]*\W\s[a-z]*\W\W[a-zA-Z]*\W\W[a-z]*/)
{
print "$1;$2;\n";

}

But I think it's a bit overkill to do it word by word...


miller
User

Apr 30, 2011, 12:25 AM

Post #9 of 13 (1733 views)
Re: [Alvaro] Can't get the $1 and $2, but regex works fine with RegexBuddy

Hey Mate,

It's late, so I don't really have time to help you in detail.

However, I will say this. You're working with HTML, so use an actual HTML parser and forgo trying to hack a regex.

Take a look at CPAN's HTML::Parser or HTML::TreeBuilder. Either of those will be a much more maintainable way of extracting the information you want. I could probably even write it for you if I had all of the HTML and knew what you wanted specifically, but that's my suggestion.
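
As a rough idea of what the parser route could look like, here is a sketch with HTML::TreeBuilder. It assumes the module is installed from CPAN, and the fragment is made up to resemble the sample earlier in the thread; the real page may need different selection rules.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not part of core Perl

# Made-up fragment modelled on the sample posted earlier in the thread.
my $html = <<'HTML';
<table><tr><td align='left'>
<a name='05678'><a href='test.php?user=alex1' class='un0'>alex1</a>
</td></tr></table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;
my $number;    # post number from the most recent <a name='...'>

# look_down in list context returns every matching element in document order.
for my $anchor ( $tree->look_down( _tag => 'a' ) ) {
    my $name = $anchor->attr('name');
    my $href = $anchor->attr('href');

    if ( defined $name ) {
        $number = $name;
    }
    elsif ( defined $number and defined $href and $href =~ /[?&]user=([^&]+)/ ) {
        push @rows, "$number;$1;";
        undef $number;
    }
}
$tree->delete;    # free the tree once done

print "$_\n" for @rows;
```

The loop pairs each name anchor with the next user link, so it does not depend on how the parser decides to nest the unclosed <a> tag.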

- Miller


Alvaro
Novice

Apr 30, 2011, 11:07 AM

Post #10 of 13 (1714 views)
Re: [miller] Can't get the $1 and $2, but regex works fine with RegexBuddy

Hey,

I made it work with the following, but surely there's a better way using a regex. I haven't got into the HTML parser yet, but I would love to have an easier way (a shorter regex).

#!C:\Perl\bin\perl.exe
use English;
use strict;
use warnings;
use locale;

open (FILE, '<', 'aname.txt') or die "Cannot open aname.txt: $!";
open (FILE1, '>>', 'aname_result.txt') or die "Cannot open aname_result.txt: $!";

while (<FILE>)
{

my($line) = $_;
chomp($line);

if (my($word) = $_ =~ m/\W([0-9]*)\W*\w\s[a-z]*\W*[a-z]*.[a-z]*\W[a-z]*\W[a-z0-9]*\W\s[a-z]*\W\W[a-z0-9]*\W\s[a-z]*\W\W[A-Za-z]*\W\W([a-zA-Z0-9]*)\W\W\w\W\W\w\s[a-z]*\W\W\w\w\W\W\s\W\s(\d\d)\s([a-zA-Z]*)\W(\d\d)\s\W\s(\d\d\W\d\d)/)
{
print "$1;$2;$3;$4;$5;$6;\n";

}
# else
# {print "Nothing\n";}
}

close (FILE);
close (FILE1);


miller
User

Apr 30, 2011, 11:10 AM

Post #11 of 13 (1712 views)
Re: [Alvaro] Can't get the $1 and $2, but regex works fine with RegexBuddy

Good morning.

Attach your full html file to a reply, and state exactly what you're trying to capture. I'll take a look at it as I could use some practice parsing html anyway.

- Miller


(This post was edited by miller on Apr 30, 2011, 11:10 AM)


Alvaro
Novice

Apr 30, 2011, 3:27 PM

Post #12 of 13 (1704 views)
Re: [miller] Can't get the $1 and $2, but regex works fine with RegexBuddy

Hi Miller,

Here's a sample of the text I want to parse. The real html is huge, but basically the posts keep repeating.

I want to get the data and parse it to a CSV file.

<tr><td align='left'>
<a name='05678'><a href='pby.php?user=alex1' class='un0' title='Unvalidated'>Alex1</a><b class='un'> - 01 Jan'11 - 13:20 - 05678 of 10000&nbsp;<a

target='_blank' class='twitpost'

href='http://twitter.com/share?url=http%3A%2F%2Fwww.test.com'><img

style='margin:0 0 -2px 0' src='http://images.test.com/images/twitter/tweet1.gif' border='0'></a></b><br><SPAN id="IntelliTXT"><div class='bbbody'><br>User imput data<br>
<br>
</div></SPAN><br>
</td></tr>
<tr><td align='left'>
<a name='05679'><a href='pby.php?user=alex2' class='un0' title='Unvalidated'>Alex2</a><b class='un'> - 01 Jan'11 - 13:21 - 05679 of 10000&nbsp;<a

target='_blank' class='twitpost'

href='http://twitter.com/share?url=http%3A%2F%2Fwww.test.com'><img

style='margin:0 0 -2px 0' src='http://images.test.com/images/twitter/tweet1.gif' border='0'></a></b><br><SPAN id="IntelliTXT"><div class='bbbody'><br>User imput data<br>
<br>
</div></SPAN><br>
</td></tr>
<tr><td align='left'>
<a name='05680'><a href='pby.php?user=alex3' class='un0' title='Unvalidated'>Alex2</a><b class='un'> - 01 Jan'11 - 13:23 - 05680 of 10000&nbsp;<a

target='_blank' class='twitpost'

href='http://twitter.com/share?url=http%3A%2F%2Fwww.test.com'><img

style='margin:0 0 -2px 0' src='http://images.test.com/images/twitter/tweet1.gif' border='0'></a></b><br><SPAN id="IntelliTXT"><div class='bbbody'><br>User imput data<br>
<br>
</div></SPAN><br>
</td></tr>
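
For markup shaped like this, where the interesting fields all sit on the line that starts with <a name=...>, a single commented pattern may be enough to build the CSV rows. This is a sketch under assumptions: `posts_to_rows` is a made-up helper, the six fields (number, user, day, month, year, time) are a guess at what is wanted, the user is taken from the href rather than from the link text, and the embedded sample keeps only the relevant lines.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the repeating post markup into semicolon-separated rows.
sub posts_to_rows {
    my ($text) = @_;
    my @rows;
    for my $line ( split /\n/, $text ) {
        next unless $line =~ m{
            <a \s+ name='(\d+)'                 # post number
            .*? [?&] user=([A-Za-z0-9]+)'       # user name from the link
            .*? </a> <b [^>]*> \s* - \s*        # end of link, start of the date part
            (\d\d) \s+ ([A-Za-z]{3}) '(\d\d)    # day, month, two-digit year
            \s* - \s* (\d\d:\d\d)               # time
        }x;
        push @rows, join( ';', $1, $2, $3, $4, $5, $6 );
    }
    return @rows;
}

# Header lines copied from the sample above (trailing markup omitted).
my $sample = <<'HTML';
<tr><td align='left'>
<a name='05678'><a href='pby.php?user=alex1' class='un0' title='Unvalidated'>Alex1</a><b class='un'> - 01 Jan'11 - 13:20 - 05678 of 10000&nbsp;<a
</td></tr>
<tr><td align='left'>
<a name='05679'><a href='pby.php?user=alex2' class='un0' title='Unvalidated'>Alex2</a><b class='un'> - 01 Jan'11 - 13:21 - 05679 of 10000&nbsp;<a
HTML

print "$_\n" for posts_to_rows($sample);
```

To append the rows to aname_result.txt instead of the screen, print to the output filehandle; note that the script posted earlier opens FILE1 but prints to standard output, so the result file stays empty.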


miller
User

Apr 30, 2011, 3:53 PM

Post #13 of 13 (1701 views)
Re: [Alvaro] Can't get the $1 and $2, but regex works fine with RegexBuddy


In Reply To
Here's a sample of the text I want to parse. The real html is huge, but basically the posts keep repeating.


That's why I said attach it, not post it.

If it's still too big, then cut out some of the repeated entries in a text editor before you attach it. The point is to have REAL data to process, including fully valid HTML and the tags outside the relevant sections that help identify them.

Either way, I don't have time to address this any more as I have a wedding to go to. Be back tomorrow evening.

- Miller

 
 

