Lately I've been coming across situations where I need to join one or more fields from a large dataset to another large dataset. The obvious answer is a vlookup. However, the speed of the vlookup drops dramatically the larger the datasets, which makes sense, a 100,000 row dataset being matched against another 100,000 row dataset = 10,000,000,000 comparisons that have to be made. A 10 minute calc time is not so bad in that context. So the challenge is to reduce the number of comparisons to be made.
My solution is below. It works much, much faster than the vlookups, but it is not particularly elegant and not especially memorable. It is so hard to remember the structure and syntax that I ended up creating a simple userform to gather the proper inputs and write the formula for me. What I'm really interested here is alternate takes on this issue, or maybe a simpler way to implement my solution. My ideal solution would be a UDF that implements the vlookup interface, i.e. flookup(c1,a1:b100000,2,0).
Here are the assumptions for my solution:
- G2 is the cell to be looked up
- H2 is the cell containing the formula
- Step2 is the worksheet to be matched against
- G2:G33270 is the range to be matched against
- G2:G33270 is sorted by ascending order
- H2:H33270 is the range containing the item to be returned
If I were using a vlookup, it would be:
=VLOOKUP(G2,STEP2!$G$2:$H$33270,2,0). Simple, yet slow.
My formula is:
=IFERROR(IF(INDEX(STEP2!$G$2:$G$33270,MATCH($G2,STEP2!$G$2:$G$33270))=$G2,INDEX(STEP2!$H$2:$H$33270, MATCH($G2,STEP2!$G$2:$G$33270)),"No Match"),"No Match")
Basically, this works by using the default 1 operator of the MATCH function, which returns the row number corresponding to the first value less than or equal to the item to be found, which may or may not be an exact match. It then uses the INDEX function against that result to return the value corresponding to that row number against the same range, which may or may not be the same as the item looked up. If it is not, then we know there is not an exact match and it returns the text "No Match". If it is, then it is an exact match, so it proceeds to use the INDEX function against that result against the range containing the item to be returned. Unfortunately we have to repeat the MATCH function. Fortunately, it does not have much of a speed impact. The gltich is that we may have an item to be looked up that is greater than the greatest value in the lookup list. In that case the formula would return an #N/A error, so our results may be either the the correct value, "No Match", or #N/A, where the last two are basically the same thing. I find that confusing so I wrapped the whole thing in the IFERROR function and return "No Match" in place of the #N/A. But that only works for Excel 2007. If you're using an earlier version you'd have to use the IF-ISERROR pair and then repeat the whole thing for a hideous formula that looks like this:
=IF(ISERROR(IF(INDEX(STEP2!$G$2:$G$33270,MATCH($G3,STEP2!$G$2:$G$33270))=$G3, INDEX(STEP2!$F$2:$F$33270,MATCH($G3,STEP2!$G$2:$G$33270)),"No Match")),"No Match",IF(INDEX(STEP2!$G$2:$G$33270, MATCH($G3,STEP2!$G$2:$G$33270))=$G3,INDEX(STEP2!$F$2:$F$33270,MATCH($G3,STEP2!$G$2:$G$33270)),"No Match"))
So who here can do this better?
Or, for a real challenge, any thoughts on faster lookups on an unsorted list? I was thinking something along the lines of reading the dataset into an array, sorting the array, then returning the results from there.
Oh, and "use access" is not a valid answer. It's just not reasonable to export, process, then reimportant from a secondary program. It may be fast, but I am looking for fast and convenient.