Extract <img> from html


Jun 21, 2001, 7:14 AM


Does anyone know of (or use) a script that parses an HTML file and extracts all of the links that point to images? I also need to be able to run it against remote web pages, not just local files.


Jun 21, 2001, 2:01 PM

Re: Extract <img> from html

I would point you to the "HTML::LinkExtor" module, part of the HTML-Parser distribution, which is what I use. It does the job just fine for me, and writing the supporting code around the module was pretty straightforward. I have not tried it against web files, but the module's SYNOPSIS clearly uses an "http://" example.



Jun 21, 2001, 2:23 PM

Re: Extract <img> from html

From the docs for HTML::LinkExtor (part of the HTML-Parser distribution):

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

$url = "http://www.perl.org/"; # for instance
$ua = LWP::UserAgent->new;

# Set up a callback that collects image links
my @imgs = ();
sub callback {
    my($tag, %attr) = @_;
    return if $tag ne 'img'; # we only look closer at <img ...>
    push(@imgs, values %attr); # %attr holds only the tag's link attributes
}

# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
                    sub {$p->parse($_[0])});

# Expand all image URLs to absolute ones
my $base = $res->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;

# Print them out
print join("\n", @imgs), "\n";
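
The original question also asked about local files. As a rough sketch only (the helper name image_links and the example.com base URL are made up for illustration and are not from the module's docs), the same module can be used without a callback: links are then stored inside the parser and read back with the links method. This version keeps only the src attribute of each <img> tag.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper (not from the module docs): return the src URLs of
# all <img> tags in an HTML string, made absolute against $base if given.
sub image_links {
    my ($html, $base) = @_;
    my $p = HTML::LinkExtor->new(undef, $base);  # no callback: links are stored
    $p->parse($html);
    $p->eof;                                     # signal end of document
    my @imgs;
    for my $link ($p->links) {                   # each is [$tag, attr => url, ...]
        my ($tag, %attr) = @$link;
        next unless $tag eq 'img' && defined $attr{src};
        push @imgs, "$attr{src}";                # stringify URI objects
    }
    return @imgs;
}

# Example input; the base URL is an assumption chosen for illustration.
my $html = '<a href="page.html">x</a><img src="pics/logo.png" alt="logo">';
print "$_\n" for image_links($html, 'http://www.example.com/dir/');
```

For a file on disk, calling $p->parse_file($path) in place of the parse/eof pair should do the same job. With the sample input above, the only line printed should be http://www.example.com/dir/pics/logo.png, since the <a> link is skipped and alt is not a link attribute.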