CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Fun With Perl: Perl Quizzes - Learn Perl the Fun Way:
Merging Data Files

 



sleuth
Enthusiast

Dec 9, 2000, 2:51 PM

Post #1 of 8 (31902 views)
Merging Data Files Can't Post

 
Say you have a list of country codes in one file, and they look like this.

US|United States|
GB|Great Britan|

And you have another file full of citys, and there counties in this format,

London|Great Britan|
Los Angeles|United States|
Hong Kong|China|

And asuming that the country names in the city/country data base, are exactly the same as the ones listed in the country code data base, how can you loop through the city/country data base, and add one field to the end of each line, containing the country code for that country, in a way, that's also case insensitive?

So you'll end up with

London|Great Britan|GB|
Los Angeles|United States|US|

And So on .....

Sleuth


sleuth
Enthusiast

Dec 11, 2000, 11:20 PM

Post #2 of 8 (31898 views)
Re: Merging Data Files [In reply to] Can't Post

Ok, even though no one participated in the quiz, since about 11 people at least looked at it, I'll give an answer.

So, first step you must take is open the data base that is going to merge with the other data base you have, usually the smaller of the two. In this data base, I'll call "db-1A", contains all the countries in the world with there ISO country code in the left field, in this format,

US|United States|
GB|Great Britan|

That's pretty simple.

Now the data base (db-2) you want to merge this data with only contains various cities around the world with there country at the left, like so,

London|Great Britan|
Los Angeles|United States|
Hong Kong|China|

Now, You are sure that the country at the right in db-2 is spelled correctly so that it will successfully match the second field in db-1, even if there isn't a match for it, it WILL NOT effect the overall process, the only thing that will happen is the third field in db-2 that you are looking to place the ISO country code from db-1 will be blank. So no big catastrophe there.

Your end result is the following,

London|Great Britan|GB|
Los Angeles|United States|US|

You can either write the merged information back to db-2 which is the data that you added the third field to, or you could create a new data base with the information to protect the original from being lost, which is the wiser choice in case you code doesn't produce the results you expected.

Ok, now for the code,


Code
# First you open the data base of Iso codes to read the first field, which is the field desired to be carried over to db-2. 

open(ISO_Codes, "<db-1.db");
while(<ISO_Codes>)
{
# read the file line by line, and split the line into fields, the pipe "|" is our delimiter.
($CODE, $COUNTRY)=split(/\|/, $_);
# read the first and second field into a hash.
$code{$COUNTRY} = "$CODE";
}
close(ISO_Codes); #close the data base


Now that's the first part of your program, what it does is it prepares the rest of the program for the data it is going to shortly need.

The hash is the important part of this program, and the real lesson here, you have field2 (the country name) in db-1, and also in field 2 of db-2, and you want field one (the country code) in db-1 to carry over to a third field in db-2, knowing that, you want to use the country as an anchor to identify the appropriate iso code for that country. That's the purpose of the hash. $code{$COUNTRY} = "$CODE"; is how you can call on that anchor to get the country code for any country in db-1. Like so,

If you print "$code{'United States'}"; the out put will be:

US

Since you assigned the data "US" (which is field one in db-1) to the name "United States", you can get the "value" of the hash "key" which in this case, was United States. Since we went through db-1 line by line and assigned the information in field1 to the name in field2, this applies of course, to all the records in that data base. So Now that your clear on how that works, we'll continue on,

Now you want to create a third field in db-2 with it's country code, so lets,


Code
open(cities, "<db-2.db"); 
while(<cities>){
chomp;
# I'll use the same scalar as above, but the casing is different, so $COUNTRY won't effect this loop.
($city, $country)=split(/\|/, $_);
push(@new,"$city|$country|$code{$country}|\n");
}close(cities);
open(finished, ">db-3.db");
print finished @new;
close(finished);


That just merged the information,

I started out by using chomp; , why? because the last field in each line is the field $country, so $country has a \n (newline) assigned to the end of it, unless I use chomp;. chomp by default uses $_ when you don't specify anything, so in a while loop, using chomp; is like writing chomp($_); And this way $country doesn't contain any newlines and won't mess with our new format we are trying to make, since we needed to use push, to create the array @new, with our new data if a newline had been assigned to $country by default, you would get

London|Great Britan
|GB|
Los Angeles|United States
|US|

Instead of

London|Great Britan|GB|
Los Angeles|United States|US|

It's one of those gotchas that sometimes is forgotten, and suddenly you get these strange files your script has created where it looks like the data is all over the place.

Now that we pushed our new data into the array @new, we can now either write it to a new file, like I chose, or write it back to db-2, I choose to create a new file, well, I said why already.

One thing however, this method is case sensitive, and I specifically asked for case insensitive. To do that is quite simple however if your familiar with lc(); the built in perl function for converting scalars, hashes, etc into lower case format. So if we make sure that the name if the hash keys are all lower case when reading db-1, and make sure that $country is lower case in the while loop of db-2, you will have a case-insensitive way of matching.

The final code with the case insensitive matching is as follows:



Code
# First you open the data base of Iso codes to read the first field, which is the field desired to be carried over to db-2. 

open(ISO_Codes, "<db-1.db");
while(<ISO_Codes>)
{
# read the file line by line, and split the line into fields, the pipe "|" is our delimiter.
($CODE, $COUNTRY)=split(/\|/, $_);
# read the first and second field into a hash.
$COUNTRY = lc($COUNTRY);
$code{$COUNTRY} = "$CODE";
}
close(ISO_Codes); #close the data base
open(cities, "<db-2.db");
while(<cities>){
chomp;
# I'll use the same scalar as above, but the casing is different, so $COUNTRY won't effect this loop.
($city, $country)=split(/\|/, $_);
$lcCountry = lc($country);
push(@new,"$city|$country|$code{$lcCountry}|\n");
}close(cities);
open(finished, ">db-3.db");
print finished @new;
close(finished);


I hope you have learned something, if you have found your self a bit lost due to any miss understandings about hashes (associative arrays), then please read:

http://tlc.perlarchive.com/0012/02.shtml

To learn all about hashes, they are very useful, the Administrator (Jasmine) of this site herself has written that.

if you have any questions, just post or e-mail me.

Sleuth



rGeoffrey
User

Dec 12, 2000, 7:30 PM

Post #3 of 8 (31888 views)
Re: Merging Data Files [In reply to] Can't Post

On Saturday I put these together, but I did not want to be first. Because I did not want to play with files while testing, all of the following will work in the middle of this block of code...


Code
#!/usr/local/bin/perl 

use strict;

my @nations = ('US|United States|',
'GB|Great Britan|');

my @cities = ('London|Great Britan|',
'Los Angeles|United States|',
'Hong Kong|China|');

my @newCities;

#------------------------------------------------------------

#Place the interesting stuff here

#------------------------------------------------------------

print join ("\n", @newCities), "\n";

I suppose I could go back and rewrite them to deal with the files directly, but there is not much interesting there. All versions will lowercase the country and abbreviation for consistency (and to meet your rules) and will generate a new array @newCities that would be returned to the file of your choice.

The most straight forward of the solutions...


Code
my %nations = map { (split ('\|', lc ($_)))[1,0] } @nations; 

foreach (@cities) {
my ($city, $nation) = (split ('\|', $_));
push (@newCities, join ('|', $city, lc($nation), $nations{lc($nation)}) . '|');
}

But foreach can give way to a pair of maps...


Code
my %nations = map { (split ('\|', lc ($_)))[1,0] } @nations; 

@newCities = map { join ('|', $_->[0], lc ($_->[1]), $nations{lc ($_->[1])}) . '|' }
map { [split ('\|', $_)] } @cities;

But $_ is a default for things like lc and split...


Code
my %nations = map { (split ('\|', lc))[1,0] } @nations; 

@newCities = map { join ('|', $_->[0], lc ($_->[1]), $nations{lc ($_->[1])}) . '|' }
map { [split ('\|')] } @cities;

The requirement to lc everything but the city name makes life a bit hard, but if I am allowed to skip lc we can reduce to this...


Code
my %nations = map{(split '\|')[1,0]}@nations; 

@newCities = map{join('|',@{$_},$nations{$_->[1]}).'|'}map{[split '\|']}@cities;



sleuth
Enthusiast

Dec 12, 2000, 10:45 PM

Post #4 of 8 (31883 views)
Re: Merging Data Files [In reply to] Can't Post

 
Wow rGeoffrey, all those answers are astonishing, and I tested them all and they work very well, You get an A+ plus 20% extra credit!!!

The last code you gave, I see what you were getting at, what you could reduce it to, but your other examples would match Great Britan with great britan, which was good, allthough all of you code is very very good. I'll be breaking it down for a while now (learning from it), he he.

Thanks for particapating mate,

Sleuth

PS. Actually, I'm not as far along as you, would you mind going through your use of map, and the whole code in any one of the blocks, preferably the first block of code, also, I know you always always use strict; why? Thanks



rGeoffrey
User

Dec 13, 2000, 8:39 AM

Post #5 of 8 (31880 views)
Re: Merging Data Files [In reply to] Can't Post

Here is a tour through the second version (the first one to use 3 maps). After each operatation I show what the data looks like...


Code
my %nations = map { (split ('\|', lc ($_)))[1,0] } @nations; 

The input is:
@nations = ('US|United States|',
'GB|Great Britan|');

On each pass we will do 'lc' so the anonymous array would look like:
@anon = ('us|united states|',
'gb|great britan|');

Then the entry is split so we have an array of arrays:
@anon = (['us','united states'],
['gb','great britan']);

Then using (@array)[1,0] we do an array slice to get them in the order we want:
@anon = (['united states','us'],
['great britan','gb']);

The map will make an array out of the answers from each element:
@anon = ('united states','us',
'great britan','gb');

And finally because we are doing %hash = @array, the hash sees it as:
%nations = ('united states' => 'us',
'great britan' => 'gb');

#-------------------------------------------

@newCities = map { join ('|', $_->[0], lc ($_->[1]), $nations{lc ($_->[1])}) . '|' }
map { [split ('\|', $_)] } @cities;

The input is:
@cities = ('London|Great Britan|',
'Los Angeles|United States|',
'Hong Kong|China|');

Inside the first map we do a split to get an anonymous array of arrays:
@anon = (['London','Great Britan'],
['Los Angeles','United States'],
['Hong Kong','China']);

That array is passed as input the the second map where join plays with these arrays:
@anon = (['London','great britan','gb'],
['Los Angeles','united states','us'],
['Hong Kong','china','']);

join then builds strings for each line:
@newCities = ('London|great britan|gb|',
'Los Angeles|united states|us|',
'Hong Kong|china||');

As for strict, I do it it keep me out of trouble. Perl has two built in safety features to keep us from doing something stupid, 'strict' and 'perl -w'.

Using strict will force us to declare all variables which helps find missspelled variable names and other things.

For any program longer than a page if you don't use strict it means one of two things. Either you have a very good reason, or you are a fool. There are very good reasons though. Some of the strange things, like type globs, have trouble working under strict. So you must have it turned off while doing these wizardly things. But then you are doing something dangerous, you are aware that you are doing something dangerous, and if bad things happen, you get what you deserve.

The other feature is used on the command line (or the #!/.../perl line at the top). It turns on warnings and will warn you about all kinds of things. It is good to use it while debugging, but you should probably turn it off in the production version because it can generate a lot of stuff.

I don't use -w because I often do things like "if ($hash{'key'})" when the key may or may not actually be there. But this generates a warning when it is not there. I know it is not there, but I don't want to write a more complicated if that first does an exists and then looks at the value. But if something is not working I occasionally turn it on to help find the problem.



Jasmine
Administrator

Dec 13, 2000, 11:01 AM

Post #6 of 8 (31876 views)
Re: Merging Data Files [In reply to] Can't Post

You may also want to check out Simon Cozens' map article.

Administrator
The Perl Archive


sleuth
Enthusiast

Dec 13, 2000, 12:11 PM

Post #7 of 8 (31872 views)
Re: Merging Data Files [In reply to] Can't Post

 Thanks rGeoffrey, That helped out a lot, Good explanation for using strict. Thanks for dedicatating your time :)

Thanks Mate,

Sleuth

PS. Jasmine, his article is confusing.


japhy
Enthusiast

Dec 15, 2000, 8:40 AM

Post #8 of 8 (31863 views)
Re: Merging Data Files [In reply to] Can't Post

Ok, here's my solution -- sorry if this is repetition of rGeoffrey's, but I wanted to tackle this as well. Notice I removed the trailing | from each line of data, as it serves no use.


Code
@codes = ( 
'US|United States', 'GB|Great Britain',
'CA|Canada', 'JP|Japan',
);

@cities = (
'Troy|United States', 'Waldwick|United States',
'Nagasaki|Japan', 'Montreal|Canada',
'Hiroshima|Japan', 'London|Great Britain',
);

We assume we've pulled these from two files. First, we make the table of country to country code:


Code
%code = map reverse(split /\|/, uc), @codes;

Since we want to go from country to code, and the @codes array was given code to country, we reverse the fields when we split.

Now, we go through the @cities array, appending the country code as needed:


Code
for (@cities) { 
my $country = (split /\|/)[1]; # the second field
$_ .= "|" . $code{uc $country};
}

The way the for loop works is that it determines the code, and appends it to $_, which is an alias to the actual element in the array.

Here's the final condensed code, assuming the two arrays already exist:


Code
my %codes = map reverse(split /\|/, uc), @codes; 
for (@cities) { $_ .= "|$codes{uc +(split /\|/)[1]}" }

I hope this is understandable. I see that sleuth was confused a bit by map(), and I've limited its use here. That block can actually be rewritten as:


Code
my %codes; 
for (@codes) {
my ($code,$country) = split /\|/, uc $_; # convert to uppercase
$codes{$country} = $code;
}

I hope that betters your understanding.

Jeff "japhy" Pinyan -- accomplished hacker, teacher, lecturer, and author

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives