Can someone explain this...


Peter Van Hoecke

May 17, 2000, 4:40 AM

Post #1 of 4 (13195 views)

I had to make a quick fix when dealing with directory structures and so on.

I get this very weird output:

$yearNumber = "year2000/Month4/Week3/Test.html";

$yearNumber =~ s/(^\/([^(\d)\/])*)((\d)+)(([^(\d)\/])*\/.*)/$1 - $2 - $3 - $4 - $5/;

Gives me this result
Year - r - 2000 - 0 - /Month4/Week3/Test.html

Why does it repeat the last letter it found? I would have guessed that $2 would contain the year number, but apparently it somehow treats the last iteration of the * as a separate match?


PS: I changed the input a bit for privacy reasons, and I don't know in advance whether the term will be year, month, and so on. That is the reason for the strange construction of my regex (=> don't tell me it could be simpler... :) I know that!)

Enthusiast / Moderator

May 18, 2000, 9:28 AM

Post #2 of 4 (13195 views)
Re: Can someone explain this...

Ok, you're doing something wrong with this regex, and I don't know what exactly you were trying to accomplish.

code:

$string = "year2000/Month4/Week3/Test.html";
$string =~ s/(^\/([^(\d)\/])*)((\d)+)(([^(\d)\/])*\/.*)/$1 - $2 - $3 - $4 - $5/;

Let me dissect that regular expression for you, and perhaps that will help all of us figure out what's wrong.

code:

$string =~ s{
  (                # this starts $1
    ^/             # match a / at start-of-string
    (              # this starts $2
      [^(\d)/]     # any one character EXCEPT (, ), /, and 0-9
    )*             # end $2, match 0 or more times
  )                # end $1
  (                # this starts $3
    (              # this starts $4
      \d           # a digit
    )+             # end $4, match 1 or more
  )                # end $3
  (                # this starts $5
    (              # this starts $6
      [^(\d)/]     # any one character EXCEPT (, ), /, and 0-9
    )*             # end $6, match 0 or more
    /              # a /
    .*             # and anything else
  )                # this ends $5
}{$1 - $2 - $3 - $4 - $5}x;

Well, I still don't know totally what you were trying to do. It would help if you gave us your sample input string, "year2000/Month1/Week2/foo.html", and told us what you would like done to it. Are you trying to extract 2000, 1, 2, and "foo.html" from it?
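To see what each group actually holds, the pattern can be run against an input it matches. This is only a minimal sketch: the pattern anchors on `^/`, so it assumes the real input began with a slash, and the capitalised `Year` is a guess based on the reported output. Neither detail is confirmed in the thread.

```perl
use strict;
use warnings;

# Assumed input: a leading slash is added so that ^/ can match.
my $path = "/Year2000/Month4/Week3/Test.html";

# Same pattern as in the question, used as a plain match in list context
# so that all six groups come back as a list.
my @groups = $path =~ m{(^/([^(\d)/])*)((\d)+)(([^(\d)/])*/.*)};

for my $i (0 .. $#groups) {
    # A group that never took part in the match comes back undefined.
    my $shown = defined $groups[$i] ? $groups[$i] : "(unset)";
    printf "\$%d = %s\n", $i + 1, $shown;
}
```

Under those assumptions $2 holds "r" and $4 holds "0", the same repeated characters as in the reported output, while $6 stays unset because its group matched zero times.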

As for why 'r' and '0' are repeated, this is a particularly interesting regular expression anomaly:

code:

"ABCDEF" =~ /((.)+)/;
print "$1 $2"; # prints "ABCDEF F"

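$2 is "F", and not "A". When a capturing group is repeated by a quantifier, it keeps only the text from its last iteration. This is just how it works.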
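The behaviour, and two common ways around it, can be sketched as follows. The variable names are made up for illustration, and the last pattern is just one possible way to pull out a run of digits, not the one from the question.

```perl
use strict;
use warnings;

my $text = "ABCDEF";

# Quantifier outside the inner group: the group is overwritten on every
# pass, so only the final character survives in $2.
my ($all, $last) = $text =~ /((.)+)/;
print "$all $last\n";                 # ABCDEF F

# To keep every repetition, match globally in list context instead.
my @chars = $text =~ /(.)/g;
print "@chars\n";                     # A B C D E F

# To capture a whole run, put the quantifier inside the parentheses.
my ($year) = "year2000" =~ /^[^\d\/]*(\d+)/;
print "$year\n";                      # 2000
```

Writing `(\d+)` rather than `(\d)+` is what makes the capture hold the whole number.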

If you want to extract the year, month, week, and filename, then I suggest you try:

code:

($yr,$mon,$wk,$fn) = split m!/!, $string, 4;
for ($yr, $mon, $wk) { tr/0-9//cd } # remove non-digits from these

Or, if you want to use a regular expression:

code:

($yr,$mon,$wk,$fn) = $string =~ m{
  year (\d+) / month (\d+) / week (\d+) / (.*)
}xi;
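Both suggestions can be compared side by side on the string from the question. This is a sketch only: the `my` declarations, the second set of variable names, and the `xi` modifiers on the match are assumptions added to make it run under `strict` and match the capitalised "Month" and "Week".

```perl
use strict;
use warnings;

my $string = "year2000/Month4/Week3/Test.html";

# Approach 1: split into at most four fields, then delete every
# character that is not a digit from the first three.
my ($yr, $mon, $wk, $fn) = split m!/!, $string, 4;
tr/0-9//cd for ($yr, $mon, $wk);

# Approach 2: a single match. /x allows the spaced-out layout and
# /i lets "month" match "Month".
my ($yr2, $mon2, $wk2, $fn2) = $string =~ m{
    year (\d+) / month (\d+) / week (\d+) / (.*)
}xi;

print "split: $yr $mon $wk $fn\n";
print "regex: $yr2 $mon2 $wk2 $fn2\n";
```

Both lines print 2000 4 3 Test.html for this input. The split version depends only on the position of each field, while the regex version also requires the literal labels year, month and week.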

Peter Van Hoecke

May 18, 2000, 11:42 PM

Post #3 of 4 (13195 views)
Re: Can someone explain this...

The reason for this strange regex is that the script gets a directory tree, divided according to time. We allow the users to write any combination they like, but they have to use a time index in the directory name.
Ex: I am from Belgium, so I want to write "Jaar2000" instead of "year2000". If my tree should start with the month index, and this tree is specifically for one of my sites called foo, I could very well choose to write my directory for May as Month05foo,... We have no idea how many subdirectories there will be, nor do I know if they will all have time indexes in them. We could be getting "logfiles/year2000/firstSite/month12/foo.log" or "2000/05/16/foofoo.log" or even "semester1/year2000/mon05foo/week01/day3/daily/logfile/foo.log". If they enter numbers at two places in a directory name, then we give up, because we have no way of knowing which one is the time index.

We have to create a webpage showing the links sorted by date, and therefore I have to cut out the numbers, which can start anywhere in the directory name I am currently investigating. I first tried the regex with $2, but I got "r". So then I wrote $1 - $2 -... to look at the contents of these variables.

This is not a serious problem, but I was fascinated by the $1 - $2 - ... contents.

Thanks for the explanation!

PS: By the way, why does [^(\d)\/] exclude whitespace?
PPS: Again, this is not a serious problem, but I invite everybody to write a better regex for it: how to sort this way, not knowing which kinds of directories you will get, and with the numbers somewhere in the subdirectory name. All this in a regex! Have fun!

[This message has been edited by Peter Van Hoecke (edited 05-19-2000).]
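One way to read the requirements above is: take the single run of digits from each directory name, skip names without digits, and give up when one name contains two runs. A minimal sketch of that reading follows. The helper `time_key` is hypothetical, treating the last path element as the file name is an assumption, and the sorting itself is left out.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: returns a reference to the list
# of time indexes found in the directory part of $path, or undef when one
# directory name contains two separate runs of digits.
sub time_key {
    my ($path) = @_;
    my @dirs = split m{/}, $path;
    pop @dirs;                          # drop the file name
    my @key;
    for my $dir (@dirs) {
        my @runs = $dir =~ /(\d+)/g;    # every run of digits in this name
        return undef if @runs > 1;      # ambiguous, so give up
        push @key, $runs[0] if @runs;   # names without digits are skipped
    }
    return \@key;
}

for my $path (
    "logfiles/year2000/firstSite/month12/foo.log",
    "2000/05/16/foofoo.log",
    "semester1/year2000/mon05foo/week01/day3/daily/logfile/foo.log",
) {
    my $key = time_key($path);
    print "$path --> ", ($key ? join("-", @$key) : "(ambiguous)"), "\n";
}
```

For the three example paths this prints the keys 2000-12, 2000-05-16 and 1-2000-05-01-3. Deciding which of those numbers is the year, month or week is a separate problem that the sketch does not attempt.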


May 19, 2000, 3:06 PM

Post #4 of 4 (13195 views)
Re: Can someone explain this...

Try this :
code:

@tries = (
  # test paths go here
);
foreach $try (@tries) {
  print "$try --> ";
  $try =~ s#[^\d/]*(\d+)[^\d/]*/#$1/#g;
  print "$try\n";
}

Not sure if it helps you along the way or not...
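The substitution can be tried on the example paths from post #3. A minimal sketch: using those three paths as test data is an assumption, since the list in the post is not shown, and the `@results` array is added only to keep the originals intact.

```perl
use strict;
use warnings;

# Assumed test data, borrowed from the examples in post #3.
my @tries = (
    "logfiles/year2000/firstSite/month12/foo.log",
    "2000/05/16/foofoo.log",
    "semester1/year2000/mon05foo/week01/day3/daily/logfile/foo.log",
);

my @results;
foreach my $try (@tries) {
    # Work on a copy: reduce every directory name that contains a run of
    # digits (followed later by a slash) to just those digits.
    (my $stripped = $try) =~ s#[^\d/]*(\d+)[^\d/]*/#$1/#g;
    push @results, $stripped;
    print "$try --> $stripped\n";
}
```

Directory names without digits, and the file name at the end, are left untouched, so the first path becomes logfiles/2000/firstSite/12/foo.log.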

