Aug 26, 2012, 8:29 AM
Not quite right, Rushadrena. From the data examples you gave, each substrate has 1 or 2 products associated with it, not more (and more often 1 than 2). So you don't get a Cartesian product of 740 substrates by 600 products (444,000 elements) at all, but only the combinations that actually exist, i.e., assuming two products per substrate, at most 1,480 elements. This is very far from the 444,000 elements of a full matrix, less than 0.4%. It means that more than 99.6% of the full matrix would be unused, or, I would rather say, totally useless and worthless.
Re: [rushadrena] A file parsing and 2D array/matrix problem.
The other point is that, from the way you described your problem, all you really need to know is whether a specific substrate/product combination exists (where to assign the value 1) or not. For that, the solution I suggested is entirely sufficient. The sparse matrix approach I suggested contains exactly as much useful information about your data as a full matrix taking about 300 times more space in memory (and far longer to load).
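To make the lookup idea concrete, here is a minimal sketch. It is in Python rather than Perl (a nested dict plays the same role as a Perl hash of hashes), and the input lines, separator, and names are invented for illustration:

```python
# Made-up tab-separated input: one substrate/product pair per line.
input_lines = [
    "substrateA\tproduct1",
    "substrateA\tproduct2",
    "substrateB\tproduct1",
]

# Build the sparse "matrix": only combinations that actually occur are stored.
pairs = {}
for line in input_lines:
    substrate, product = line.rstrip("\n").split("\t")
    pairs.setdefault(substrate, {})[product] = 1

def combination_exists(substrate, product):
    # One fast hash lookup instead of scanning a 740 x 600 full matrix.
    return product in pairs.get(substrate, {})
```

With the sample data above, `combination_exists("substrateA", "product2")` is true and `combination_exists("substrateB", "product2")` is false; memory use grows with the number of observed pairs, not with the size of the full matrix.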
I'm ready to make one concession, though. You may want to have available a full list of all the possible substrates and a full list of all the possible products, not just those in the input list, so that you can say: although this specific substrate/product combination does not exist in the input list, it would still be a possible candidate, since both the substrate and the product exist. If you want that, then all you need is two other simple hashes, one with all 740 possible substrates and one with all 600 possible products. You would end up with two simple hashes and one hash of hashes, keeping about 3,000 elements in memory, still very far from 444,000 elements. These two hashes give you a virtual Cartesian product of all possibilities, but you never have to compute the actual Cartesian product.
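That three-structure setup could be sketched as follows (Python again, with sets standing in for the two simple hashes; in a real program the full substrate and product lists would come from wherever you keep them, but here, for the sake of a self-contained example, they are simply built from the same made-up input):

```python
input_lines = [
    "substrateA\tproduct1",
    "substrateB\tproduct2",
]

all_substrates = set()  # stands in for the hash of all 740 substrates
all_products = set()    # stands in for the hash of all 600 products
pairs = {}              # the sparse matrix (hash of hashes)

for line in input_lines:
    substrate, product = line.rstrip("\n").split("\t")
    all_substrates.add(substrate)
    all_products.add(product)
    pairs.setdefault(substrate, {})[product] = 1

def classify(substrate, product):
    """Distinguish observed, possible-but-unobserved, and impossible pairs."""
    if substrate in pairs and product in pairs[substrate]:
        return "observed"
    if substrate in all_substrates and product in all_products:
        # The "virtual" Cartesian product: never materialized, just implied.
        return "possible candidate"
    return "impossible"
```

Here `classify("substrateA", "product2")` reports a possible candidate even though that pair never appears in the input, which is exactly the concession described above.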
But your description of the problem is an extremely strong indication that a sparse matrix is really exactly what you need, and a hash of hashes is the ideal data structure to store it, because you need just one (pretty fast) line of code to retrieve the information you need (i.e., whether a given substrate/product combination exists in the input data).
I hope I am being clear in my explanations. I work a lot on quite similar problems: the one thing you want to avoid, especially as the volume of data grows, is the quadratic burden of a full Cartesian product (or, even worse, an exponential or factorial explosion of possibilities). Some of the problems I work on at my job can be solved within a few hours of computation using approaches similar to the sparse matrix described above, but would probably not finish before the final explosion of the sun and the end of the solar system if we tried to compute all the possibilities in a super-matrix approach.
One last example. My company has a database with about 35 million customers and about a million possible products and services. What is stored in the database is the list of services (usually 5 to 20) actually subscribed to by each customer, not a "super-matrix" of all possible customer/service combinations with 0s and 1s recording whether each service has been subscribed to or not. That "super-matrix" would have 35,000 billion elements, would take ages to query, and would require more disk space than I can even imagine. What a standard business-oriented database (e.g., Oracle) does, in effect, is implement a slightly more complicated version of the sparse matrix approach I have described.
(This post was edited by Laurent_R on Aug 26, 2012, 8:32 AM)