CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Need a Custom or Prewritten Perl Program?: I need a program that...:
sort 10 million 7-digit phone numbers with ~1MB RAM

 



maneshmotts
Novice

Sep 18, 2012, 1:16 PM

Post #1 of 9 (8610 views)
sort 10 million 7-digit phone numbers with ~1MB RAM Can't Post

HI all

I am trying to solve this challenging problem using the perl language. I am trying to sort 10 million 7-digit phone numbers with 1MB RAM and by using one pass and no intermediate files Crazy. I need to know how to approach it . Or where to begin with .


(This post was edited by maneshmotts on Sep 18, 2012, 2:26 PM)


BillKSmith
Veteran

Sep 18, 2012, 4:42 PM

Post #2 of 9 (8604 views)
Re: [maneshmotts] sort 10 million 7-digit phone numbers with ~1MB RAM [In reply to] Can't Post

There are exactly 10 million seven-digit numbers. Not all of these are valid phone numbers. Perhaps you could create the desired output by generating all the numbers from 0 to 9999999 and eliminating invalid phone numbers.

Your constraint to one pass eliminates every sorting algorithm I ever heard of. It really does not make sense to constrain the design of your program. There may be very real constraints on resources such as time, memory, and open file handles. You should not care how the program uses the available resources.

If this constraint comes from a teacher (or an unreasonable customer), you need help in algorithm design long before you even think of perl.
Good Luck,
Bill


maneshmotts
Novice

Sep 18, 2012, 5:00 PM

Post #3 of 9 (8602 views)
Re: [BillKSmith] sort 10 million 7-digit phone numbers with ~1MB RAM [In reply to] Can't Post

HI Bill,

Thanks for the reply. if remove the constraint of single passing. how can be the problem approached

thanks


BillKSmith
Veteran

Sep 18, 2012, 7:22 PM

Post #4 of 9 (8597 views)
Re: [maneshmotts] sort 10 million 7-digit phone numbers with ~1MB RAM [In reply to] Can't Post

A file of seven-digit phone numbers should have exactly the same number of bytes in every record. This makes it possible to access and overwrite any desired record. Any algorithm which can sort an array "in place" can be modified to sort the file "in place" (Use seek to position to a desired record and read or print to read or overwrite it.) Bubble sort would work, but would be unacceptably slow for such a large file. I no longer have either the skills nor references to find and choose a faster algorithm which meets the "in-place" constraint.

I would write a bubble sort that correctly sorts a small array. Then modify it to sort a file. (Test it with a small file) Use profile to measure the time to swap two numbers. This is the time you need to estimate the running time of a proposed algorithm.
Good Luck,
Bill


BillKSmith
Veteran

Sep 19, 2012, 2:24 PM

Post #5 of 9 (8582 views)
Re: [BillKSmith] sort 10 million 7-digit phone numbers with ~1MB RAM [In reply to] Can't Post

I have written a bubblesort which sorts a file in-place. It works fine for small files. Crude extrapolation estimates it would take over three years to process your file. I doubt that a better algorithm of the same type could reduce the run time to less than several weeks. My suggestion is probably of no use at all. You will have to discuss your problem with sort experts to find out if it seems possible at all.
Good Luck,
Bill


Laurent_R
Veteran / Moderator

Sep 30, 2012, 4:52 AM

Post #6 of 9 (7805 views)
Re: [BillKSmith] sort 10 million 7-digit phone numbers with ~1MB RAM [In reply to] Can't Post

I have just tried sorting an array of one million random numbers with the built-in sort function. It took slightly less than 2 seconds to do the job (and about 4 seconds for 2 million numbers, 10 seconds for 5 million numbers, so it would seem to scale relatively well). But I can't do it with 10 million numbers, because I run out of memory (actually I can't even allocate an array of 10 million numbers).


Code
$ time perl -e '$c[$_] = int rand 10000000 foreach (1..1000000);  @d = sort {$a<=>$b} @c;' 

real 0m1.859s
user 0m1.718s
sys 0m0.093s

$ time perl -e '$c[$_] = int rand 10000000 foreach (1..2000000); @d = sort {$a<=>$b} @c;'

real 0m3.875s
user 0m3.656s
sys 0m0.108s

$ time perl -e '$c[$_] = int rand 10000000 foreach (1..5000000); @d = sort {$a<=>$b} @c;'

real 0m9.766s
user 0m0.015s
sys 0m0.078s


Code



      
    


BillKSmith
Veteran

Oct 1, 2012, 1:48 PM

Post #7 of 9 (7751 views)
Re: [Laurent_R] sort 10 million 7-digit phone numbers with ~1MB RAM [In reply to] Can't Post

Clearly, your system can allocate far more memory than the 1MB specified in the original problem. You have dramatically shown that the practical answer is get more memory!
Good Luck,
Bill


Laurent_R
Veteran / Moderator

Oct 1, 2012, 2:58 PM

Post #8 of 9 (7748 views)
Re: [BillKSmith] sort 10 million 7-digit phone numbers with ~1MB RAM [In reply to] Can't Post

Actually, thinking about the original post problem reminded me about one of the very first chapters of Jon Bentley's "Programming Pearls" (nothing to do with the Perl language, but a very good book). He had a problem with very similar constraints (sort on disk, very limited memory, etc.).

The solution turned out to be the use of a bit vector (bitmap).

Basically, the N th bit is set if integer N is in the file to be sorted.

The program can then be written in three easy phases;
- first create the bitmap and set each bit to 0
- second, read the file and for each integer N in the file, set the N th bit to 1
- just print, from bit 1 to Maxbit each integer N for which the N th bit is equal to 1.

This might not be enough to solve the whole problem (especially, 1 megabyte will not store 10 million bits, so some extra work might be needed, perhaps using a temporary file), but I think it gives a good direction as to where to look for a possible solution.

And this should be very fast, as you do not have to do an actual sort, the number's represented by the bitmap are actually already sorted.

Note that the solution given by Bentley in the book worked only because it was guaranteed that there was no duplicate in the file, but the same can probably be assumed in the OP problem, since phone numbers ought to have no duplicates.


Laurent_R
Veteran / Moderator

Oct 5, 2012, 9:02 AM

Post #9 of 9 (7222 views)
Re: [Laurent_R] sort 10 million 7-digit phone numbers with ~1MB RAM [In reply to] Can't Post

Using the idea described in the above post of not storing the numbers in the array, but storing 1 in the N th element of the array if N exists in the input, I can now allocate ten million random numbers and "sort" them, using the same command line type of test as before:



Code
$ time perl -e '$c[int rand 10000000]=1 foreach  (1..10000000); foreach  (1..10000000) {print "$_\n" if $c[$_] ==1}' > ten_million.txt 

real 0m16.359s
user 0m15.015s
sys 0m0.593s

~
$ head ten_million.txt
1
2
3
4
5
6
7
9
11
12

~
$ tail ten_million.txt
9999986
9999987
9999988
9999989
9999990
9999991
9999992
9999993
9999994
9999996


Of course, I am using here far more than 1 megabyte of memory because I am not using a bit vector, but a simple array. But at least, it shows that the idea works pretty well.

Basically, what is needed here is to build a routine to properly manage a bit vector of the adequate size.

Second, since one megabyte will not be quite enough to hold 10 million bits (1 MB = 8,388,608 bits), though by a relatively small margin, the process should be built in two passes, for example one with numbers between 1 et 5,000,000 and a second one between 5,000,001 et 10,000,000 (you read twice the entire input file; in the first pass, you check only numbers between 0 and five million; in the second pass, you check numbers between five million and ten million).

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives