Now, since Excel 2007 was announced to be having over 1 million rows, I made the statement, "Be careful what you wish for". Excel is not a database, there are programs far more suited to that task such as Access, SQL Server, SAP, Oracle, etc. Excel is a great reporting tool for those databases and can do fast and quite complex calculations much better than a database can.
What I am conceding more and more now though, is that as my familiarity with Excel is far greater than with SQL Server. I find myself taking large data sets out of SQL to do stuff that either the database is not good at, or more often where my T-SQL language skills let me down.
One such occurrence yesterday was trying to get a twelve month running average over weekly data for 30,000 items (per week) a data set which was currently around 850,000 records. The weekly update each Saturday night, being added to the total was not the problem, the setting up of the historical data was. Either the server choked on the physical task, or the data was not quite what I wanted. (Back to my T-SQL inadequacies).
What would have been impossible before 2007, well not impossible, but messy. Was to take out the data into Excel, work with it there and then, after saving it back as a .txt, tab delimited file. (SQL Server drivers seem to not yet cope with the new Excel formats), upload it back into SQL Server, either directly updating the original table, or more often to a temporary table which can then simply be linked to the original table and the data updated or added, very simple and painless.
A couple of caveats though. Much of my work is on 'one-offs' to get me out of a corner and I use a lot of VBA snippets work the data. On this occasion I was having to drop in different worksheet functions based on the data. I wrote my 10-12 lines of code to run and set it off. Now, when I am running code over large data sets, although it has an overhead, I like to see where the code has got to, so, for example, in a for each...next loop, I always write the row number to the status bar. I then get an idea how long I need to walk away from my machine. (Coffee break, lunch or vacation?), like so.
For Each myCell in Range("D1:D856456")
'Do stuff here...
Application.Statusbar = myCell.Row
Next myCell
In the data I was working the other day it was a simple AVERAGE w/s function, but was taking a very long time. Of course, I had forgotten to turn calculation to manual at the start of my code so stopped the code (vacation otherwise) and added the line
Application.Calculation=xlCalculationManual
Remembering to switch it back on afterwards, although Excel will take the setting from the first workbook opened in a session, so we should be OK.
I ran it again and still noticed it was dragging a little and a quick sum showed I should take a half day off ;-)
What I had forgotten, is that, as I was inserting random ranges in the AVERAGE function based on other data, Excel was trying to set the 'error' flags, (The little green triangle in the top right corner of a cell when the formula does not match those of the cells around) on each cell as the code ran. Turning those off and we were away. The code now taking less than 20 minutes (a long coffee break) to run. The line to turn them off
application.ErrorCheckingOptions.BackgroundChecking=False
Again remember to turn them back on. (If that's what you want)
So, are 1,000,000+ rows in Excel a good thing? Not really, but it doesn't half come in handy. (With caveats)
Posted
Feb 24 2009, 08:38 AM
by
Nick Hodge