Zhris
Enthusiast

Oct 23, 2014, 7:30 PM


Re: [ax390] Remove all HTML code, except for certain tags

Hi Alex,

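How about /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig? Note the lookahead at the end makes no difference for well-formed tags, but it does stop the pattern from matching only the front part of a name that has a trailing underscore or more than five segments.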
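
A quick way to see what the trailing lookahead changes is to run the pattern with and without it on a few edge cases. This is only a sketch: the sample strings and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Same pattern with and without the trailing lookahead.
my $with    = qr/(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/i;
my $without = qr/(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})/i;

# Edge cases: a normal tag, a trailing underscore, and a sixth segment.
my @samples = ('tag_one', 'tag_one_', 'tag_one_two_three_four_five_six');

my %result;
for my $s (@samples) {
    my ($m_with)    = $s =~ $with;      # undef when there is no match
    my ($m_without) = $s =~ $without;
    $result{$s} = [ $m_with, $m_without ];
    printf "%-32s with: %-8s without: %s\n",
        $s, $m_with // '(none)', $m_without // '(none)';
}
```

With the lookahead the last two samples are rejected outright; without it the pattern settles for the longest prefix that fits (tag_one and tag_one_two_three_four_five).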

Explanation:

Code
The regular expression: 

(?-imsx:(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?:                    group, but do not capture:
----------------------------------------------------------------------
    [^\w]                any character except: word characters
                         (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
   |                     OR
----------------------------------------------------------------------
    ^                    the beginning of the string
----------------------------------------------------------------------
  )                      end of grouping
----------------------------------------------------------------------
  \K                     "keep" everything matched prior to \K but do
                         not include it in $& (effectively a
                         variable-length look-behind)
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    tag                  'tag'
----------------------------------------------------------------------
    (?:                  group, but do not capture (between 1 and
                         5 times (matching the most amount
                         possible)):
----------------------------------------------------------------------
      _                  '_'
----------------------------------------------------------------------
      [a-z0-9]+          any character of: 'a' to 'z', '0' to
                         '9' (1 or more times (matching the
                         most amount possible))
----------------------------------------------------------------------
    ){1,5}               end of grouping
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
  (?=                    look ahead to see if there is:
----------------------------------------------------------------------
    [^\w]                any character except: word characters
                         (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
   |                     OR
----------------------------------------------------------------------
    $                    before an optional \n, and the end of
                         the string
----------------------------------------------------------------------
  )                      end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
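
The \K entry is the least obvious one, so here is a small sketch of what it does in practice. The sample string and the [..] replacement are invented for the demo; only the pattern pieces come from the regex above.

```perl
use strict;
use warnings;

my $text = 'width: tag_width;';

# The text before \K has to match, but it is left out of $&.
my $matched;
if ($text =~ /(?:[^\w]|^)\Ktag_[a-z0-9]+/i) {
    $matched = $&;
    print "matched: '$matched'\n";   # the leading space is not included
}

# In a substitution only the part after \K is replaced, so the
# preceding character is preserved without having to capture it.
(my $copy = $text) =~ s/(?:[^\w]|^)\Ktag_([a-z0-9]+)/[$1]/ig;
print "$copy\n";
```

This prints the match as 'tag_width' and the rewritten string as 'width: [width];'. \K needs Perl 5.10 or later.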


Implementation:

Code
#!/usr/bin/perl  
use strict;
use warnings;
use Data::Dumper;

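# Slurp the whole DATA section into one string.
my $html = do { local $/ = undef; <DATA> };

# Collect every tag_* name that has 1 to 5 underscore-separated segments.
my @tags = $html =~ /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig;

print Dumper \@tags;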

__DATA__
<div id="tag_id" style="width: tag_width; height: tag_height;">
<font face="tag_font_face" size="tag_font_size">
available sizes: tag_s1, tag_s2, tag_s3;
available colors: tag_c1/tag_c2/tag_c3;
other options: tag_option1 tag_option2 tag_option3;
</font>
</div>
tag
tag_
tag_one
tag_one_two
tag_one_two_three
tag_one_two_three_four
tag_one_two_three_four_five
tag_one_two_three_four_five_six
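
If the one-liner is hard to read, the same pattern can be compiled once with qr// and spelled out under /x. This is just a readability variant of the regex above; $tag_re and the sample string are names I picked for the sketch.

```perl
use strict;
use warnings;

# Same pattern as above, compiled once and commented with the x flag.
my $tag_re = qr/
    (?: [^\w] | ^ )             # a non-word character, or the start of the string
    \K                          # drop what was matched so far from the match
    (                           # capture the tag name:
        tag                     #   literal 'tag'
        (?: _ [a-z0-9]+ ){1,5}  #   1 to 5 "_segment" parts
    )
    (?= [^\w] | $ )             # followed by a non-word character or the end
/xi;

my $sample = 'sizes: tag_s1, tag_s2; tag_ and tag are skipped';
my @found  = $sample =~ /$tag_re/g;
print join(', ', @found), "\n";
```

Here only tag_s1 and tag_s2 are returned, since a bare tag or tag_ has no segment after the underscore.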



Quote
I did manage to find a solution to get this done, but because it's using a 'foreach' loop, it's quite slow and can't really be used. This will be integrated in a Perl script that serves webpages on the fly, and speed is extremely important. Getting this done using regex (and/or any other Perl code that's fast) should work without any delays.


I'd be interested to see your current solution. Regexps are often slower than other approaches, so don't necessarily expect a performance boost. Caching might be an option, either at the client level or at the point where the HTML being parsed is generated.
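
To give an idea of what caching could look like on the Perl side, here is a minimal sketch that remembers the tag list per HTML string inside the running process. It is one possible reading of the suggestion, not a drop-in solution: %cache, $scans and tags_for are hypothetical names, and keying on the full HTML string assumes the pages are small and repeat often.

```perl
use strict;
use warnings;

my $tag_re = qr/(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/i;

my %cache;          # hypothetical in-process cache: HTML string => tag list
my $scans = 0;      # counts how often the regex actually runs

sub tags_for {
    my ($html) = @_;
    # Scan only on a cache miss; later calls reuse the stored list.
    $cache{$html} //= do {
        $scans++;
        [ $html =~ /$tag_re/g ];
    };
    return @{ $cache{$html} };
}

my $page   = '<font face="tag_font_face" size="tag_font_size">';
my @first  = tags_for($page);
my @second = tags_for($page);   # served from %cache
print "tags: @first (regex ran $scans time(s))\n";
```

Whether this helps depends on how often the same HTML comes back, so it is worth timing it against your foreach version (the core Benchmark module is handy for that) before committing to either.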

Regards,

Chris


(This post was edited by Zhris on Oct 24, 2014, 6:55 PM)

