stop after first match



monocle
User

Jul 20, 2000, 7:17 AM


Views: 8500
stop after first match

I'm having a bit of trouble. I am reading an HTML file line by line, looking for any line that contains an href, and printing out that href marked with its line number. I am using this code to look for the href:

    if ($file_line =~ /<a href="(.*)">/i){
        $linecode = "<a href=\"$1\">";
    }

$1 will contain the actual link destination. My problem is that if the link wraps an image, $1 also picks up the image tag, because that tag also ends with ">.

So how can I tell this to grab only what is between the <a href=" and the first ">?

Or if anyone has a better way to do this, please let me know.

Thanks.


------------------
Monocle
Hear great techno music by Monocle at http://www.mp3.com/monocle. CD now on sale!



monocle
User

Jul 19, 2000, 10:02 PM


Re: stop after first match

Thanks, that seemed to do the trick. I don't need this to be too robust; it's just a little script to index our entire site and check for orphaned files, broken links, and so on. It's kind of a hack at the moment, and maybe I can improve it later. I don't really have time right now to figure out how to get HTML::Parser set up. I've never added modules before.

But another question: how can I accommodate multiple <a href tags on the same line?





Kanji
User / Moderator

Jul 19, 2000, 10:45 PM


Re: stop after first match

See Randal Schwartz's Web Techniques columns on this very subject:

* http://www.stonehenge.com/merlyn/WebTechniques/col35.html
* http://www.stonehenge.com/merlyn/WebTechniques/col27.html
* http://www.stonehenge.com/merlyn/WebTechniques/col14.html
* http://www.stonehenge.com/merlyn/WebTechniques/col07.html

You should also check out the ultra-groovy HTML::LinkExtor (included with HTML::Parser, and has a great example of usage).

[This message has been edited by Kanji (edited 07-20-2000).]


TheGame+
Deleted

Jul 20, 2000, 8:29 AM


Re: stop after first match

Basically, you want to grab everything between <a href=" and the first ">.
So you could use (at least) two basic methods here to replace your (.*) :

1) ([^"]*) : matches any run of characters that are not a ", so it stops at the first closing quote
2) (.*?) : makes the quantifier "non-greedy", so it matches as little as possible

You'll find more details by typing 'perldoc perlre'.
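
To make the difference concrete, here is a small sketch that runs the original pattern and both replacements against the same line. The sample line and the variable names $greedy, $negated and $lazy are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line: a link wrapped around an image.
my $file_line = '<a href="page.html"><img src="pic.gif" alt="pic"></a>';

# Original greedy pattern: (.*) runs on to the LAST "> on the line.
my ($greedy) = $file_line =~ /<a href="(.*)">/i;

# Method 1: negated character class, stops at the first double quote.
my ($negated) = $file_line =~ /<a href="([^"]*)">/i;

# Method 2: non-greedy quantifier, stops at the first "> it can reach.
my ($lazy) = $file_line =~ /<a href="(.*?)">/i;

print "greedy:  $greedy\n";
print "negated: $negated\n";
print "lazy:    $lazy\n";
```

On this input the greedy capture swallows the image tag as well, while both alternatives capture only page.html.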

Of course, if you intend to do some serious parsing of more complex HTML documents, you would be better off using an existing module like HTML::LinkExtor. For instance, this code won't match links that are spread over different lines...
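
Short of a real parser, one partial workaround for the multi-line limitation might be to match against the whole file as a single string with the /g modifier, which also picks up several links on one line. This is only a sketch: the sample HTML is made up, and the pattern still assumes double-quoted href values with href as the first attribute of the tag.

```perl
use strict;
use warnings;

# Hypothetical sample input: two links on one line,
# then one tag split over two lines.
my $html = <<'END_HTML';
<p><a href="one.html">One</a> and <a href="two.html"><img src="2.gif"></a></p>
<a
   href="three.html">Three</a>
END_HTML

# In list context, /g returns the captured group from every match.
# \s+ also matches a newline between "<a" and "href", and [^"]* is
# free to cross line breaks, so the split tag is found too.
my @links = $html =~ /<a\s+href="([^"]*)"/gi;

print "$_\n" for @links;
```

With a real file you would first read it into one string, for example with my $html = do { local $/; <$fh> }; on an already opened filehandle $fh. The trade-off is that line numbers are no longer available directly and would have to be worked out separately.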


monocle
User

Jul 20, 2000, 12:15 PM


Re: stop after first match

My problem with modules is this: I don't know how to install them. I have downloaded some that I would like to use, but I can't figure out what to do.

Part of the problem is that up until last week, all I've ever done is write scripts for use on my host's Apache/Unix setup. Now I am trying to develop some stuff on my own NT machine with Sambar Server, and I have to admit I am very confused. Sambar installs perl5.something, but I can't make heads or tails of the documentation on how to add modules. I am very behind on my Perl. Until now I have known how to do the things I needed to do, and whenever I needed a module I just sent an email and the sysadmin added it. :)

This is surely not the correct forum to get help on that. Where would I ask such questions?

