
Laurent_R
Veteran
/ Moderator
Nov 17, 2012, 2:28 AM
Post #4 of 5
(5714 views)
Re: [wyndcrosser] Two questions for an old project
Hi, you don't give much information, but it is not too difficult to reduce your data to pure words, if this is what you want. If you split your line on spaces:
    $_ = "To be, or not to be: that is the question.";
    my @words = split;

you get something like this in your @words array:

    0  'To'
    1  'be,'
    2  'or'
    3  'not'
    4  'to'
    5  'be:'
    6  'that'
    7  'is'
    8  'the'
    9  'question.'

Now you can get rid of the punctuation marks (in fact, of every non-word character, trailing or not) with something like this:
    s/\W//g foreach @words;

Now @words contains:

    0  'To'
    1  'be'
    2  'or'
    3  'not'
    4  'to'
    5  'be'
    6  'that'
    7  'is'
    8  'the'
    9  'question'

As you can see, the trailing ',', ':' and '.' are no longer there in elements 1, 5 and 9 of the array; you have "pure" words.

You may also try to split on word boundaries ("\b") rather than on spaces, and then use grep to get rid of the spaces and punctuation marks. Something like this:
    my $sentence = "To be, or not to be: that is the question.";
    my @words = split /\b/, $sentence;

This gives you the following @words array:

    0  'To'
    1  ' '
    2  'be'
    3  ', '
    4  'or'
    5  ' '
    6  'not'
    7  ' '
    8  'to'
    9  ' '
    10  'be'
    11  ': '
    12  'that'
    13  ' '
    14  'is'
    15  ' '
    16  'the'
    17  ' '
    18  'question'
    19  '.'

Now you can get rid of the array elements that contain no word character (\w matches letters, digits and the underscore) with the grep function:
    @words = grep { /\w+/ } @words;

which gives you the following @words array:

    0  'To'
    1  'be'
    2  'or'
    3  'not'
    4  'to'
    5  'be'
    6  'that'
    7  'is'
    8  'the'
    9  'question'

Again, you have pure words, which you can store in a hash for further processing.

There might be a couple of issues, though, with "words" containing apostrophes ("you're doing this") or hyphens ("post-increment"): the first approach turns "you're" into "youre", and the second one splits it into "you" and "re". So you might have to refine your regular expressions to handle these specific cases.
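
One possible refinement, as a rough sketch only: instead of splitting and then cleaning up, match the words directly with a global regex that accepts an apostrophe or a hyphen between letters. The sample sentence and the %count hash are made up for illustration, and the pattern assumes plain ASCII letters; you would have to adapt it for accented characters or digits.

```perl
use strict;
use warnings;

# Hypothetical sample text mixing the earlier sentence with the
# two problem cases (apostrophe and hyphen).
my $sentence = "To be, or not to be: you're doing a post-increment.";

# A word is a run of ASCII letters, optionally followed by groups made of
# one apostrophe or hyphen plus more letters ("you're", "post-increment").
# In list context, m//g without capture groups returns every full match.
my @words = $sentence =~ /[A-Za-z]+(?:['-][A-Za-z]+)*/g;

# Store the words in a hash, counting them case-insensitively.
my %count;
$count{ lc $_ }++ for @words;

# Print each distinct word with its number of occurrences.
print "$_: $count{$_}\n" for sort keys %count;
```

With this sample, "you're" and "post-increment" each stay in one piece, and "to" and "be" are each counted twice. A leading or trailing apostrophe or hyphen (as in quoted text) is simply left out of the match.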