Remove all HTML code, except for certain tags

 



ax390
New User

Oct 23, 2014, 12:30 PM

Post #1 of 6 (19066 views)

Hello,

I'm looking for a Perl regex that removes everything from a piece of HTML code except for certain tags. Here's an example of the HTML code that needs to be processed:


Code
<div id="tag_id" style="width: tag_width; height: tag_height;"> 
<font face="tag_font_face" size="tag_font_size">
available sizes: tag_s1, tag_s2, tag_s3;
available colors: tag_c1/tag_c2/tag_c3;
other options: tag_option1 tag_option2 tag_option3;
</font>
</div>


I'd like to keep all the tags that start with "tag_", and remove everything else. The regex should output one tag after another, like this:


Code
tag_id tag_width tag_height tag_font_face tag_font_size tag_s1 tag_s2 
tag_s3 tag_c1 tag_c2 tag_c3 tag_option1 tag_option2 tag_option3


The tags will always start with "tag_", and they will never contain any other characters, except for these:

1) letters;
2) numbers;
3) the "_" (underscore) character.

In addition, they always have at least one keyword (tag_keyword1) and at most five keywords (tag_k1_k2_k3_k4_k5).

I did manage to find a solution to get this done, but because it's using a 'foreach' loop, it's quite slow and can't really be used. This will be integrated in a Perl script that serves webpages on the fly, and speed is extremely important. Getting this done using regex (and/or any other Perl code that's fast) should work without any delays.

I'm not very good with regex, and even though I did spend time trying to find a way, I couldn't do it.

If you could help me, that would be really great, and I'd really appreciate it!

Thank you!

Alex


Zhris
Enthusiast

Oct 23, 2014, 7:30 PM

Post #2 of 6 (19054 views)
Re: [ax390] Remove all HTML code, except for certain tags

Hi Alex,

How about /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig? Note that the lookahead at the end matters: without it, a tag with more than five keywords would still match on its first five.

Explanation:

Code
The regular expression:

(?i-msx:(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?i-msx:                 group, but do not capture (case-insensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    [^\w]                    any character except: word characters
                             (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \K                       keep everything matched before \K, but do
                           not include it in $& (effectively a
                           variable-length look-behind)
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    tag                      'tag'
----------------------------------------------------------------------
    (?:                      group, but do not capture (between 1 and
                             5 times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      _                        '_'
----------------------------------------------------------------------
      [a-z0-9]+                any character of: 'a' to 'z', '0' to
                               '9' (1 or more times (matching the
                               most amount possible))
----------------------------------------------------------------------
    ){1,5}                   end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    [^\w]                    any character except: word characters
                             (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    $                        before an optional \n, and the end of
                             the string
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
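
The two boundary checks in the explanation can also be written as plain lookarounds. The following is only a sketch under that assumption, not the pattern from this post: `(?<!\w)` stands in for `(?:[^\w]|^)\K` and `(?!\w)` for `(?=[^\w]|$)`, and the sub name `extract_tags` is hypothetical. It also shows which edge cases the trailing check rejects.

```perl
use strict;
use warnings;

# Hypothetical helper: return every tag_* token that has 1 to 5 keywords.
# (?<!\w) and (?!\w) are assumed stand-ins for the original
# (?:[^\w]|^)\K prefix and (?=[^\w]|$) suffix.
sub extract_tags {
    my ($text) = @_;
    # In list context, a /g match returns all captured groups.
    return $text =~ /(?<!\w)(tag(?:_[a-z0-9]+){1,5})(?!\w)/ig;
}

# "tag" and "tag_" have no keyword; the last token has six keywords.
my @tags = extract_tags("tag tag_ tag_one tag_a_b_c_d_e tag_a_b_c_d_e_f");
print "$_\n" for @tags;   # prints tag_one and tag_a_b_c_d_e, one per line
```

The six-keyword token is skipped entirely, because after the fifth keyword the next character is an underscore, which is a word character.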


Implementation:

Code
#!/usr/bin/perl  
use strict;
use warnings;
use Data::Dumper;

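# Slurp the whole DATA section into a single string.
my $html = do { local $/ = undef; <DATA> };

# In list context, a /g match returns every captured tag.
my @tags = $html =~ /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig;

print Dumper \@tags;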

__DATA__
<div id="tag_id" style="width: tag_width; height: tag_height;">
<font face="tag_font_face" size="tag_font_size">
available sizes: tag_s1, tag_s2, tag_s3;
available colors: tag_c1/tag_c2/tag_c3;
other options: tag_option1 tag_option2 tag_option3;
</font>
</div>
tag
tag_
tag_one
tag_one_two
tag_one_two_three
tag_one_two_three_four
tag_one_two_three_four_five
tag_one_two_three_four_five_six
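
The script above dumps the matches with Data::Dumper. To get the single space-separated line asked for in the first post, the output step could look like the sketch below. The heredoc and the `$line` variable are assumptions for illustration; the pattern is the one from the script.

```perl
use strict;
use warnings;

# Sample HTML from the first post, held in a heredoc for this sketch.
my $html = <<'HTML';
<div id="tag_id" style="width: tag_width; height: tag_height;">
<font face="tag_font_face" size="tag_font_size">
available sizes: tag_s1, tag_s2, tag_s3;
available colors: tag_c1/tag_c2/tag_c3;
other options: tag_option1 tag_option2 tag_option3;
</font>
</div>
HTML

# Same pattern as in the script; list context returns the captures.
my @tags = $html =~ /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig;

# Hypothetical output step: one space-separated line instead of Dumper.
my $line = join ' ', @tags;
print "$line\n";
```

For the sample HTML this prints the fourteen tags in document order on one line.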



Quote
I did manage to find a solution to get this done, but because it's using a 'foreach' loop, it's quite slow and can't really be used. This will be integrated in a Perl script that serves webpages on the fly, and speed is extremely important. Getting this done using regex (and/or any other Perl code that's fast) should work without any delays.


I'm interested to see your current solution. Regexps are often slower than other approaches, so don't necessarily expect a performance boost. Caching might also be an option, either at the client level or when the HTML being parsed is generated.
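
As a rough illustration of the caching idea, here is a minimal sketch that remembers the extracted tags per HTML string in a hash. The names `%tag_cache` and `cached_tags` are hypothetical, and keying on the full HTML string is an assumption made for brevity; a template name or file path might be a more practical key. A cache like this only pays off in a persistent process, since a plain CGI script starts with an empty hash on every request.

```perl
use strict;
use warnings;

# Hypothetical in-memory cache: HTML string => array reference of tags.
my %tag_cache;

sub cached_tags {
    my ($html) = @_;
    # Scan the HTML only on a cache miss; //= stores the fresh result.
    $tag_cache{$html} //= [
        $html =~ /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig
    ];
    return @{ $tag_cache{$html} };
}

# The second call with the same string is served from the cache.
my @first  = cached_tags('<p class="tag_x">tag_y_z</p>');
my @second = cached_tags('<p class="tag_x">tag_y_z</p>');
print "@second\n";
```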

Regards,

Chris


(This post was edited by Zhris on Oct 24, 2014, 6:55 PM)


Laurent_R
Veteran / Moderator

Oct 24, 2014, 1:02 AM

Post #3 of 6 (19042 views)
Re: [ax390] Remove all HTML code, except for certain tags

Hmm, using regexes for parsing HTML is usually a rather bad idea, except possibly for the simplest cases. (Please note that this is not a religious belief on my part: I have done it a couple of times for very simple stuff, but I know it can very quickly get really hairy.) The main reason is that HTML is not a regular language. There are several HTML parsing modules on the CPAN; try them.


Zhris
Enthusiast

Oct 24, 2014, 4:01 AM

Post #4 of 6 (19037 views)
Re: [Laurent_R] Remove all HTML code, except for certain tags

Based on the sample HTML provided, the tag_* strings can occur almost anywhere: in attribute values and in content (not in tag or attribute names in the sample, but perhaps they might). Although it is possible to use an HTML parser for at least part of the task, the tag_* strings are very regularly formatted, i.e. we can easily match them directly without having to consider the surrounding HTML syntax. Therefore, in my opinion, an HTML parser would be overkill.

Regards,

Chris


(This post was edited by Zhris on Oct 24, 2014, 4:11 AM)


ax390
New User

Oct 24, 2014, 7:10 AM

Post #5 of 6 (19020 views)
Re: [ax390] Remove all HTML code, except for certain tags

Thank you very much, Chris!

Your code works flawlessly, and it's exactly what I needed! It's also much faster than what I had written.

My solution took a different approach from regex: I would read all the tags from a flat database, and then run a foreach loop that checked whether the current tag was found in the HTML code or not (and marked it accordingly). Yes, I do know it's a very basic (and slow) approach, but I'm just a beginner in Perl, so I hope you understand. ;)
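
For comparison, the two approaches could be sketched roughly as follows. This is a reconstruction from the description above, not the actual code: the list `@known_tags` stands in for the flat database, and the sub names `found_by_loop` and `found_by_scan` are hypothetical. The loop scans the HTML once per known tag, while the second version scans it once in total and then uses hash lookups.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the tag list read from the flat database.
my @known_tags = qw(tag_id tag_width tag_height tag_missing);

# Loop approach as described: one scan of the HTML per known tag.
sub found_by_loop {
    my ($html, @known) = @_;
    my %found;
    for my $tag (@known) {
        # \Q...\E quotes the tag; the lookarounds require a whole token.
        $found{$tag} = $html =~ /(?<!\w)\Q$tag\E(?!\w)/ ? 1 : 0;
    }
    return \%found;
}

# Single-pass approach: extract every tag_* token once, then look up.
sub found_by_scan {
    my ($html, @known) = @_;
    my %seen;
    $seen{$_} = 1
        for $html =~ /(?:[^\w]|^)\K(tag(?:_[a-z0-9]+){1,5})(?=[^\w]|$)/ig;
    my %found;
    $found{$_} = $seen{$_} ? 1 : 0 for @known;
    return \%found;
}

my $sample = '<div id="tag_id" style="width: tag_width;">tag_id_extra</div>';
my $marks  = found_by_scan($sample, @known_tags);
print "$_: $marks->{$_}\n" for @known_tags;
```

With many known tags, the single pass avoids rescanning the page for each one, which is probably why the regex version feels faster here.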

On a more positive note, I hope that other people on this forum will find something helpful in your post (especially in the step by step explanation you provided). This way, others will benefit too, not just me!

Thank you too, Laurent! I'm sure your idea would help me too, but I just needed a quick solution, and since I already have it and it fits my needs perfectly, I'll use it and move on. As Leonardo da Vinci once said...

"Simplicity is the ultimate sophistication."

Alex


Laurent_R
Veteran / Moderator

Oct 24, 2014, 10:21 AM

Post #6 of 6 (19003 views)
Re: [ax390] Remove all HTML code, except for certain tags


In Reply To

Thank you too, Laurent! I'm sure your idea would help me too, but I just needed a quick solution, and since I already have it and it fits my needs perfectly, I'll use it and move on.


I was just giving a general opinion on using regexes for parsing HTML, not really reacting to this specific case. As I noted above, I am not a religious fanatic on this subject, and yes, it can be done for simple cases such as this one and I have done it for such simple cases. I only wanted to warn you more broadly that in most cases, it becomes really hairy.

 
 

