One topic I have not seen much discussion on is how to transform business logic into an automated user-maintainable workbook. I've had to deal with similar situations in various settings, so I am sure it is a common problem, so I'm curious to see what other approaches exist.
Two examples:
- A data set consisting of banking transactions -- based on combinations of various data fields (such as payee name, debit / credit type) meeting various criteria, they are coded to specific general ledger account numbers
- A data set consisting of hospital visits -- based on combinations of various data fields (such as insurance company, date of service, service type) meeting various criteria, the expected reimbursement is calculated using various formulas that rely on other fields in the data.
Both of these situations were manual processes where an analyst would apply various filters to identify records meeting certain critieria, apply the proper code or formula, then continue. It is basically a processes of refining the data set until all items have been accounted for. There may or may not be a final catch-all for records that do not meet predetermined criteria.
Here is a short excerpt from the rules to be applied for example 1:
- If (PAYEE contains AMERICAN EXP or PAYEE contains AMEX) and TYPE = CREDIT then GLCODE = 2001
- If (PAYEE contains AMERICAN EXP or PAYEE contains AMEX) then GLCODE = 1001
- If TYPE = CREDIT then GLCODE = 2002
- GLCODE =1001
Here is a short excerpt from the rules to be applied for example 2:
- If INSURANCE contains BLUECROSS and DISCHARGE >=1/1/2007 then PAYMENT = DAYS * 1100
- If INSURANCE contains BLUECROSS and DISCHARGE >=1/1/2006 then PAYMENT = DAYS * 1050
- If INSURANCE contains BLUECROSS then PAYMENT = DAYS * 1000
- PAYMENT = $4000
One thing to note is that the rules as defined have to be applied sequentially, in the order presented. In the first example, you can see that Rule 1 is a subset of Rule 2. If Rule 2 were applied first, it would pick up all records to be captured by Rule 1 so Rule 1 would have no results.
Besides the manual approach, the only other method I've seen applied has been to hardcode these rules into VBA code. I spent quite a long while trying to untangle such a set of code in order to modify the logic and was not happy with that approach at all. I like to consider myself a somewhat competent VBA guy, so clearly that approach was not maintainable by anyone other than the original author. That is bad.
My current approach has been to put the logic into a worksheet and use the code to parse and apply the logic on the backend. I don't know if that is the best approach. On the one hand, it makes the VBA code all the more complicated and less maintainable by anyone other than me. On the other, it makes the part that is most likely to be changed more readable and updatable.
The end result is that the logic worksheet is a kind of pseudo-code in two columns. Column one define the operator, either IF, AND, or THEN. Column two defines the formula to be applied. In case of an IF or AND operator, it is the logical test. In case of the THEN operator, it is the value or formula to be applied to records that have met the prior critieria. Of course you can't have a pseudo-code without pseudo-variables. The pseudo-variables are defined as all of the field headings in the record set, with a set of [ brackets surrounding them. Here's an example:
IF [INSURANCE] = "BLUECROSS"
AND MIN( [ADMITDATE], [DISCHARGEDATE] ) >= DATEVALUE( "10/1/07" )
AND OR([DRG]=767,[DRG]=768,[DRG]=774,[DRG]=775)
THEN [DAYS] * 500
IF [INSURANCE] = "BLUECROSS"
AND MIN( [ADMITDATE], [DISCHARGEDATE] ) < DATEVALUE( "10/1/07" )
AND OR([DRG]=372,[DRG]=373,[DRG]=374,[DRG]=375)
THEN [DAYS] * 500
IF [INSURANCE] = "BLUECROSS"
THEN 2000
IF [INSURANCE] = "MEDICARE"
THEN VLOOKUP([DRG],DRGTABLE,3,0)
In reality, that logic would go down for several hundred rows. A few things to notice:
- The IF and AND operators must evaluate to a boolean value
- The THEN operator can be a formula or a a value
- The THEN operator must be followed by an IF operator
- Standard Excel worksheet functions can be used as expected. For example, a fuzzy match could be accomplished with a statement such as NOT(ISERROR(FIND("UE CR","BLUE CROSS",1)))
- Outside references can be called as expected. In the last line above, DRGTABLE is a defined name in the workbook that refers to a 3 column table.
I hesitate to post the VBA code because, technically, I do not own that code. As much as I don't ascribe much value to an Excel file, my employer might disagree. However, I think for discussion purposes, the actually code is not relevant so much as the mechanics of it, which are pretty simple. Here's how it works.
- Main loop is through all rows in the dataset
- Nested loop is through all lines in the rules set
- If the rule operator is IF or AND, apply the rule code against the current row in the main loop, substituting the [ bracked field variables with the actual cell reference to the current row for that field in the dataset
- If the rule code evaluates to TRUE, continue nested loop
- If the rule code evaluates to FALSE, exit nested loop
- If the rule operator is THEN, apply the rule code as the result for the current row in the main loop, substituting the [ bracked field variables with the actual cell
reference to the current row for that field in the dataset
When I say "apply" in the above, I mean that there is a working column added immediately after the last used column in the dataset that is used for rule testing and ultimately stores the final result.
I hope the above makes sense! I'll be glad to share a working sample privately if you'd like.
Of course all of the above is surrounded by lots of error checking, wrapped up into arrays, etc..
The main drawback I see to this approach is speed. In a scenario where no record in the dataset matches any of the rules in the rule set, the result is that the total number of iterations is going to be equal to dataset row count * rule set row count. Imagine a dataset of 40,000 rows and 1,000 rules applied and you can see the potential problem. Of course, most records are going to match a rule (assuming the logic makes any sense at all) so you won't loop through all 1000 rows. Putting the most common rules at the top will make things faster, as well. If Medicare is 50% of your data population, you would want to check for that first. Also, you would want to put the most restrictive criteria at the top of each IF - THEN rule block. In this case, it makes more sense to test for insurance company name first and then for discharge date as you are more likely to eliminate a record based on insurance company name and therefore spare yourself the loop through the AND criteria for discharge date.
However, I can't help but feel asking the user to think of such things is contrary to the point of the whole process, which is to put a user-friendly front on an inherently complicated problem. I want the user to be able to drop in some common sense rules without having to think about that. More to the point, I want the user to be able to use the template without having to ask for help. I'm not sure this accomplishes that goal. Certainly it won't work with a half million line Excel 2007 xlsx monster.
If you've read this far, congratulations! If the above is useful to anyone, that's great, but I'd be more excited to get some thoughts / ideas on how to fix or tweak or completely replace it.
-Zach