Regex search and replace

 



G_Ranger
Novice

Jul 2, 2016, 8:53 AM

Post #1 of 10 (2982 views)
Regex search and replace

Hello,
I'm working on files in which French words have been tagged for part-of-speech categories. The tagging is done automatically, and the software often makes predictable mistakes that can be corrected in bulk. For example, the word "en" can be a pronoun or a preposition.
I'm trying to put together a script to search for and replace this (tab-separated):

Code
en	PRO:PER	en

with this

Code
en	PRP	en

This should happen only when the first word on the next line is a present participle (in French, these end in -ant). For example:

Code
en	PRO:PER	en
vieillissant	VER:ppre	vieillir

needs to become

Code
en	PRP	en
vieillissant	VER:ppre	vieillir


Here's the bit of script I've been trying to use:


Code
# get arguments off the command line 
($pattern1, $pattern2, $input_file, $output_file) = @ARGV;
# open input file
open $IN, "<:encoding(utf-8)", $input_file or die "unable to open $input_file for reading!\n";
# open output file
open $OUT, ">:encoding(utf-8)", $output_file or die "unable to open $output_file for writing!\n";
# loop over lines
while ($line = <$IN>) {
# test to see whether the line matches the pattern,
# also setting up a backreference
# replace pattern1 with pattern2
if ($line =~ /$pattern1/x) {
$line =~ s/($pattern1)/$pattern2/gx;
# write the current line number and the line with the match highlighted to the output file
print $OUT $.,": ",$line;
}
}
# close all filehandles
close $IN; close $OUT;


This works fine with simple terms, but when I enter my search text "pattern1" as a regex, it does not recognise it.
Here's the syntax I've been using:

Code
pattern1 en\tPRO:PER\ten\n(?!.*ant\b) 
pattern2 en\tPRP\ten\n


I'm fairly new to this, so I'm probably doing something very obviously wrong. Any help would be much appreciated.
G. R.


Laurent_R
Veteran / Moderator

Jul 2, 2016, 11:03 AM

Post #2 of 10 (2975 views)
Re: [G_Ranger] Regex search and replace

Hi,

a couple of things.

You should start your script with the following pragmas:

Code
use strict; 
use warnings;

and declare your variables with the my keyword:


Code
my ($pattern1, $pattern2, $input_file, $output_file) = @ARGV;

and

Code
open my $IN, "<:encoding(utf-8)", $input_file or die "unable to open $input_file for reading!\n";


Then your main problem is that you are reading the input file line by line. Each string you read ends at its own newline, so the part of your pattern that follows the newline in "en\tPRO:PER\ten\n(?!.*ant\b)" never has a chance to see the next line.
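A minimal sketch of this point, assuming a two-line sample held in a string and read through an in-memory filehandle (the sample data and variable names are illustrative, not taken from the real files). It counts how many lines match the pattern with the negative look-ahead from the question and with a positive look-ahead:

```perl
use strict;
use warnings;

# Two tab-separated records, as they might appear in the tagged file.
my $text = "en\tPRO:PER\ten\nvieillissant\tVER:ppre\tvieillir\n";

# Read the string line by line through an in-memory filehandle.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;

# Each element ends at its own newline, so nothing follows "\n" in any of them.
# A positive look-ahead after "\n" therefore has no text to find "ant" in ...
my $positive_hits = grep { /en\tPRO:PER\ten\n(?=.*ant\b)/ } @lines;

# ... and a negative look-ahead after "\n" succeeds trivially.
my $negative_hits = grep { /en\tPRO:PER\ten\n(?!.*ant\b)/ } @lines;

print "lines read: ", scalar(@lines), "\n";
print "positive look-ahead matches: $positive_hits\n";
print "negative look-ahead matches: $negative_hits\n";
```

In both cases the look-ahead is evaluated at the very end of the string, so it tells you nothing about the following line.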


G_Ranger
Novice

Jul 3, 2016, 12:20 AM

Post #3 of 10 (2964 views)
Re: [Laurent_R] Regex search and replace

Many thanks for this Laurent_R. Back to the drawing board, for me, to see if I can resolve in particular this line-by-line problem, then!
Best, G. R.


Laurent_R
Veteran / Moderator

Jul 3, 2016, 3:29 AM

Post #4 of 10 (2958 views)
Re: [G_Ranger] Regex search and replace

Some possible clues. If your file is not too large, you may slurp it into a variable and then use your regex in multiline mode.

Another possibility is to read line by line and always keep a sliding window or buffer of two lines in memory and to have two separate regexes, one for the first line and one for the second.

In either case, I would recommend separating the problems: start testing with actual regexes hard-coded in your program and, only once this works properly, start passing regexes as command-line arguments, which will probably create other problems that you can solve as a second step.
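The sliding-window idea could be sketched roughly as follows. The helper name retag_lines is hypothetical, and the two regexes are assumptions about the data: the first expects the "en" record alone on its line, the second treats any first field ending in -ant as a present participle.

```perl
use strict;
use warnings;

# Hypothetical helper: retag "en" as PRP when the first field of the NEXT
# line ends in -ant. Takes and returns a list of lines (newlines included).
sub retag_lines {
    my @lines = @_;
    my @out;
    my $prev;    # one-line buffer holding the line read just before

    for my $line (@lines) {
        if (defined $prev) {
            # Regex 1 tests the buffered line, regex 2 tests the current line.
            if ($prev =~ /^en\tPRO:PER\ten\s*$/ && $line =~ /^\S*ant\s/) {
                $prev =~ s/PRO:PER/PRP/;
            }
            push @out, $prev;
        }
        $prev = $line;
    }
    push @out, $prev if defined $prev;    # flush the last buffered line
    return @out;
}

my @sample = (
    "en\tPRO:PER\ten\n", "vieillissant\tVER:ppre\tvieillir\n",
    "en\tPRO:PER\ten\n", "mange\tVER:pres\tmanger\n",
);
print retag_lines(@sample);
```

In a real script the loop would read from the input filehandle instead of a list. Only two lines are held in memory at a time, so this also works for files too large to slurp. Checking the tag column (VER:ppre) rather than the -ant ending may be more reliable, depending on the data.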


G_Ranger
Novice

Jul 3, 2016, 7:04 AM

Post #5 of 10 (2949 views)
Re: [Laurent_R] Regex search and replace

Thanks again. A quick update. This will now do what I want:


Code
#!/usr/bin/perl 
# file perl-findandreplace.pl

# use strict;
use warnings;

my ($in_file, $out_file) = @ARGV;
print "opening input file $in_file for reading ...\n";
open $in_file, "<:encoding(utf-8)", $in_file or die "unable to open input file $in_file!\n";
print "opening output file $out_file for writing ...\n";
print "starting replacement operation ...\n";
open $out_file, ">:encoding(utf-8)", $out_file or die "unable to open $out_file for writing!";
while (<$in_file>) {
$_ =~ s/en\tPRO:PER\ten\n(?!.*ant\b)/en\tPRP\ten\n/;
print $out_file $_;
}
close $in_file;close $out_file;
print "finished replacement from $in_file to $out_file.";


However, if I uncomment "use strict", I get the error:


Code
Can't use string ("test-sentence", i.e. the name of the input file) as a symbol ref while "strict refs" in use at perl-replace-1 line 12.


I'm trying to find an answer for this before moving on to using command line variables for the find and replace strings...


(This post was edited by G_Ranger on Jul 3, 2016, 7:27 AM)


FishMonger
Veteran / Moderator

Jul 3, 2016, 8:27 AM

Post #6 of 10 (2946 views)
Re: [G_Ranger] Regex search and replace

You can't use the same variable for the filehandle that you used to store the filename.

Change that line to this:

Code
open my $out_fh, ">:encoding(utf-8)", $out_file or die "unable to open $out_file for writing! <$!>";


You'll also need to update the other lines that use that filehandle.


(This post was edited by FishMonger on Jul 3, 2016, 8:28 AM)


G_Ranger
Novice

Jul 3, 2016, 9:03 AM

Post #7 of 10 (2941 views)
Re: [FishMonger] Regex search and replace

Thanks for this. A lot of trial and error involved, for me, but this now appears to work without warnings. I suspect I need to read up on file handles and filenames.


Code
#!/usr/bin/perl 

use strict;
use warnings;

my ($in_file, $out_file) = @ARGV;
print "opening input file $in_file for reading ...\n";
open my $in_fh, "<:encoding(utf-8)", $in_file or die "unable to open input file $in_file!\n";
print "opening output file $out_file for writing ...\n";
print "starting replacement operation ...\n";
open my $out_fh, ">:encoding(utf-8)", $out_file or die "unable to open $out_file for writing! <$!>";
while (<$in_fh>) {
$_ =~ s/en\tPRO:PER\ten\n(?!.*ant\b)/en\tPRP\ten\n/;
print $out_fh $_;
}
close $in_fh; close $out_fh;
print "finished replacement from $in_file to $out_file.";



BillKSmith
Veteran

Jul 3, 2016, 4:32 PM

Post #8 of 10 (2931 views)
Re: [G_Ranger] Regex search and replace

You have improved the style of your code, but you still have not addressed the main issue that Laurent pointed out in his earlier replies. The pattern inside your look-ahead assertion can never match, so the negative assertion always succeeds. The readline operator (<$in_fh>), by default, reads up to and including the next newline, so that newline is the last character of the string in $_ and there is nothing after it for the assertion to inspect. Your substitution will work everywhere that you intend, but it will also be made in places where you expect the assertion to prevent it.

Laurent suggested that you use slurp mode (refer to $INPUT_RECORD_SEPARATOR in perldoc perlvar). Using a regex on a multi-line string raises two new questions: do you want a dot to match a newline character, and do you want the anchors (^ and $) to refer to the start/end of each line or of the whole string? Use the /s and /m switches on the substitution to specify your requirements (refer to s/PATTERN/REPLACEMENT/ in perldoc perlop).
Good Luck,
Bill
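A small sketch of what the two switches change, assuming a made-up three-line sample in which the -ant word is not on the line directly after "en" (the data and variable names are illustrative only):

```perl
use strict;
use warnings;

# Slurped sample: the participle is two lines after "en", not on the next line.
my $data = "en\tPRO:PER\ten\n"
         . "mange\tVER:pres\tmanger\n"
         . "chantant\tVER:ppre\tchanter\n";

# Without /s, "." stops at a newline, so ".*" can only scan the next line.
my $without_s = ($data =~ /en\tPRO:PER\ten\n(?=.*ant\b)/)  ? 1 : 0;

# With /s, "." also matches "\n", so ".*" can reach any later line.
my $with_s    = ($data =~ /en\tPRO:PER\ten\n(?=.*ant\b)/s) ? 1 : 0;

# /m only affects ^ and $: with it, ^ matches at the start of every line.
my @starts_default = $data =~ /^(\w+)/g;
my @starts_m       = $data =~ /^(\w+)/mg;

print "look-ahead without /s: $without_s\n";
print "look-ahead with /s:    $with_s\n";
print "line starts without /m: @starts_default\n";
print "line starts with /m:    @starts_m\n";
```

With this sample the look-ahead fails without /s and succeeds with /s, because only /s lets ".*" run past the end of the next line. Which behaviour is wanted depends on whether the -ant word must be on the immediately following line.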


G_Ranger
Novice

Jul 5, 2016, 11:36 PM

Post #9 of 10 (2902 views)
Re: [BillKSmith] Regex search and replace

Thanks for your answer, and for pointing out the problems in my code.
With help from the web I've managed to cobble together two options which appear to work as intended. (I tried the /s and /m switches, but this seems to do the job.)
The first uses Path::Tiny and writes directly into the file, the second uses a subroutine to read in the whole file, then writes output into another file. I'm not sure which is advisable... and I still ultimately need to set up the different terms as command-line arguments.

First option:

Code
use strict; 
use warnings;
use Path::Tiny qw(path);

my $filename = 'sentence';
my $file = path($filename);
my $data = $file->slurp_utf8;
$data =~ s/en\tPRO:PER\ten\n(?=.*ant\b)/en\tPRP\ten\n/g;
$file->spew_utf8( $data );


Second option:

Code
use strict; 
use warnings;
my $filename = 'sentence';
my $out_file = 'output';
my $data = read_file($filename);
$data =~ s/en\tPRO:PER\ten\n(?=.*ant\b)/en\tPRP\ten\n/g;
open my $out, '>:encoding(UTF-8)', $out_file or die "Could not open '$out_file' for writing $!";
print $out $data;
close $out or die "Could not close '$out_file': $!";
exit;

sub read_file {
my ($filename) = @_;
open my $in, '<:encoding(UTF-8)', $filename or die "Could not open '$filename' for reading $!";
local $/ = undef;
my $all = <$in>;
close $in;
return $all;
}



(This post was edited by G_Ranger on Jul 6, 2016, 2:28 AM)


BillKSmith
Veteran

Jul 6, 2016, 5:26 AM

Post #10 of 10 (2883 views)
Re: [G_Ranger] Regex search and replace

Both slurping options are correct. Each has a place. I use option 2 when I need a quick answer. Option 1 is probably better for production work.

You still do not seem to understand how your regex works on a multi-line string. In your case, the /m option does not make any difference because you do not use the anchors that it affects. The /s does make a difference, especially with the greedy match (.*) in your assertion. My best guess is that you need a non-greedy (.*?) match and the /s, but you know your data better than I.

It is probably worth the effort to prepare a sample of fake test data which contains only special cases. Remember that a successful test is the one that finds an error.
Good Luck,
Bill
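One way such special-case test data might look, as a sketch only. The cases are invented, and $strict_re is a hypothetical alternative (it inspects only the first tab-separated field of the following line), not a regex proposed in the thread. $thread_re is the look-ahead from post #9.

```perl
use strict;
use warnings;

# The look-ahead used in the thread, and a stricter hypothetical alternative.
my $thread_re = qr/en\tPRO:PER\ten\n(?=.*ant\b)/;
my $strict_re = qr/^en\tPRO:PER\ten\n(?=\S*ant\t)/m;

my $en    = "en\tPRO:PER\ten\n";    # record that may need retagging
my $fixed = "en\tPRP\ten\n";        # corrected record

my $part  = "vieillissant\tVER:ppre\tvieillir\n";
my $verb  = "mange\tVER:pres\tmanger\n";
my $later = "chantant\tVER:ppre\tchanter\n";
my $lemma = "enfants\tNOM\tenfant\n";    # -ant only in the lemma column

# Each case: [description, input, expected output].
my @cases = (
    ['participle on next line',   $en . $part,          $fixed . $part],
    ['no participle follows',     $en . $verb,          $en . $verb],
    ['-ant only two lines later', $en . $verb . $later, $en . $verb . $later],
    ['"en" is the last line',     $en,                  $en],
    ['-ant only in lemma column', $en . $lemma,         $en . $lemma],
);

# Apply a candidate regex to every case and count the mismatches.
sub count_failures {
    my ($re) = @_;
    my $failed = 0;
    for my $case (@cases) {
        my ($name, $input, $expected) = @$case;
        (my $got = $input) =~ s/$re/en\tPRP\ten\n/g;
        if ($got ne $expected) {
            $failed++;
            print "  FAILED: $name\n";
        }
    }
    return $failed;
}

printf "thread regex: %d failing case(s)\n", count_failures($thread_re);
printf "strict regex: %d failing case(s)\n", count_failures($strict_re);
```

On these cases the thread's look-ahead fails the lemma-column case, because ".*ant\b" accepts a word ending in -ant anywhere on the next line. The stricter variant passes all five. Real data may need further cases, such as trailing spaces or other line endings.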

 
 

