
Zhris
Enthusiast
Oct 23, 2014, 7:30 PM
Post #2 of 6
Re: [ax390] Remove all HTML code, except for certain tags
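Hi Alex,

How about /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig? Note that the lookahead at the end is not needed to find ordinary tags, but it does change what matches: with it, a name that is followed by another word character (a trailing underscore, or a sixth segment) is rejected rather than matched in part. Explanation: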
The regular expression:

(?-imsx:(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    [^\w]                    any character except: word characters
                             (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \K                       'K' "keep" everything matched prior to \K
                           but do not include it in $& (effectively
                           variable-length look-behind).
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    tag                      'tag'
----------------------------------------------------------------------
    (?:                      group, but do not capture (between 1 and
                             5 times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      _                        '_'
----------------------------------------------------------------------
      [a-z0-9]+                any character of: 'a' to 'z', '0' to
                               '9' (1 or more times (matching the
                               most amount possible))
----------------------------------------------------------------------
    ){1,5}                   end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    [^\w]                    any character except: word characters
                             (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    $                        before an optional \n, and the end of
                             the string
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Implementation:
#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

# Slurp the whole __DATA__ section into one string.
my $html = do { local $/ = undef; <DATA> };

# In list context with /g, the match returns every capture of group 1.
my @tags = $html =~ /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig;

print Dumper \@tags;

__DATA__
<div id="tag_id" style="width: tag_width; height: tag_height;">
<font face="tag_font_face" size="tag_font_size">
available sizes: tag_s1, tag_s2, tag_s3;
available colors: tag_c1/tag_c2/tag_c3;
other options: tag_option1 tag_option2 tag_option3;
</font>
</div>
tag
tag_
tag_one
tag_one_two
tag_one_two_three
tag_one_two_three_four
tag_one_two_three_four_five
tag_one_two_three_four_five_six
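
To see what the trailing lookahead contributes, here is a small sketch that runs the pattern with and without it on the same input. The helper name extract_tags and the sample string are made up for illustration; only the pattern itself comes from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern from the post, with the trailing lookahead.
my $with_lookahead    = qr/(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/i;
# Same pattern with the lookahead dropped (for comparison only).
my $without_lookahead = qr/(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})/i;

# extract_tags is a hypothetical helper: it returns every capture of
# group 1 that $re finds in $text.
sub extract_tags {
    my ($re, $text) = @_;
    my @tags = $text =~ /$re/g;
    return @tags;
}

# Sample input (an assumption): one plain tag, one with a trailing
# underscore, and one with six segments.
my $text = 'tag_a tag_b_ tag_one_two_three_four_five_six';

print "with:    @{[ extract_tags($with_lookahead, $text) ]}\n";
# prints "with:    tag_a"
print "without: @{[ extract_tags($without_lookahead, $text) ]}\n";
# prints "without: tag_a tag_b tag_one_two_three_four_five"
```

Without the lookahead the engine is happy to stop in the middle of a longer name, so whether to keep it depends on whether such partial matches should count as tags.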
Quote:
I did manage to find a solution to get this done, but because it's using a 'foreach' loop, it's quite slow and can't really be used. This will be integrated in a Perl script that serves webpages on the fly, and speed is extremely important. Getting this done using regex (and/or any other Perl code that's fast) should work without any delays.

I'm interested to see your current solution. Regexps are usually slower than other approaches, so don't necessarily expect a performance boost. Perhaps caching would be an option, either at the client level or at the point where the HTML being parsed is generated.

Regards,
Chris
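
As a rough sketch of the caching idea, the extracted tags could be kept in a hash keyed by a digest of the HTML, so the regex only runs the first time a given page is seen. This assumes the script runs in a persistent process (for example under mod_perl or FastCGI); with plain CGI the hash is lost after every request. The names tags_for and %tag_cache are hypothetical, and the cache is never expired here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Pattern from the post, stored as a qr// object.
my $TAG_RE = qr/(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/i;

# Hypothetical in-process cache: digest of the HTML => array ref of tags.
my %tag_cache;

# tags_for is a hypothetical helper name, not part of any module.
sub tags_for {
    my ($html) = @_;

    # md5_hex needs bytes, so encode first in case $html holds wide characters.
    my $key = md5_hex(encode_utf8($html));

    # Run the regex only on a cache miss.
    $tag_cache{$key} //= [ $html =~ /$TAG_RE/g ];

    return @{ $tag_cache{$key} };
}

# Made-up sample input for illustration.
my $html = '<div id="tag_id" style="width: tag_width;">tag_one_two</div>';

print join(', ', tags_for($html)), "\n";   # first call runs the regex
print join(', ', tags_for($html)), "\n";   # second call is served from %tag_cache
```

Whether this helps depends on how often the same HTML comes through; if every page is different, the digest is extra work for nothing, so it is worth measuring with the core Benchmark module before relying on it.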
(This post was edited by Zhris on Oct 24, 2014, 6:55 PM)