Excel User Group
Microsoft Excel blogs, forums, files. Read, ask questions, provide answers.

most frequent string in a column report

Latest post Fri, Feb 26 2010 11:56 AM by Faisal. 83 replies.
  • Wed, Feb 13 2008 10:47 PM In reply to

    • hrlngrv
    • Top 25 Contributor
    • Joined on Thu, Jan 17 2008
    • somewhere in the western US
    • Posts 68
    • Points 1,092

    Re: most frequent string in a column report

    dermot:
    Anonymous:
    Apparently simplicity and clarity aren't sufficient to detect errors
    If you're going to be rude, at least have the courage to give your name.

    You have a thin skin. So you get to dish out platitudes to make yourself appear virtuous, but others can't counter with truisms of their own?

    dermot:
    You are criticising what was a quick hack, not a major project.

    But it was offered to the unsuspecting as a solution, and much of what you've written since could reasonably be interpreted as implying its correctness.

    If it was an untested quick hack, when are you going to fix it or remove it?

    dermot:
    Provable correctness is nevertheless very important to business, and especially in spreadsheets, given their inherent risks and the fact that all of us make mistakes.

    If you mean proof of correctness, we agree. Provable correctness is an abstract concept. Like documentation in all too many businesses.

    As for spreadsheet risks, there's research that shows error rates are comparable to other forms of programming.

    dermot:
    And even black boxes need to be built properly. What happens when they need changing, and the original developer has moved on? Doesn't it make sense to build them clearly and simply and document them, to make it easier to understand and modify them? If it's incomprehensibly clever, how does that help?

    Any model written by a professional developer should only be maintained by a professional developer. Some people operate in that environment. You don't. Consider the possibility that your experience just might not be universally applicable and that the simplicity of the pieces doesn't guarantee that the whole built from those pieces is correct.

    I've already stated my top() udf would be most efficient. It's also allows for the simplest formulas. It may even be the approach easiest to prove correct. Shame users who don't know ISEVEN from ISODD would be completely mystified by it.

    My turn to quote. 'Everything should be made as simple as possible, but not one bit simpler.' - Albert Einstein.

    • Post Points: 5
  • Thu, Feb 14 2008 2:34 AM In reply to

    • lecxe
    • Top 50 Contributor
    • Joined on Wed, Jan 9 2008
    • Gävle, Sweden
    • Posts 15
    • Points 267

    Re: most frequent string in a column report

    hrlngrv:

    lecxe:
    On the subject of long complex formulas; here's a way to solve the original problem without the need for two different formulas: . . .

    Clever, but I think the following would be just a bit shorter and more efficient.

    =INDEX(D,MATCH(LARGE(IF(MATCH(D,D,0)=ROW(D)-MIN(ROW(D))+1,COUNTIF(D,D)-(ROW(D)-MIN(ROW(D)))/ROWS(D)),
    ROWS(J$1:J1)),IF(MATCH(D,D,0)=ROW(D)-MIN(ROW(D))+1,COUNTIF(D,D)-(ROW(D)-MIN(ROW(D)))/ROWS(D)),0))

     Well, if I combine our efforts in one formula, I think it can still be shortened a bit:

     =INDEX($A$1:$A$10,MATCH(LARGE(IF(FREQUENCY((MATCH($A$1:$A$10,$A$1:$A$10,0)),ROW($A$1:$A$10)),COUNTIF($A$1:$A$10,$A$1:$A$10)+ROW($A$1:$A$10)/(1+ROW($A$1:$A$10))),ROW(A1)),IF(FREQUENCY((MATCH($A$1:$A$10,$A$1:$A$10,0)),ROW($A$1:$A$10)),COUNTIF($A$1:$A$10,$A$1:$A$10)+ROW($A$1:$A$10)/(1+ROW($A$1:$A$10))),0))

     Efficiency however, I cannot meassure. Oh, and it also lists the strings of equal frequency in a different order than our previous formulas. And I haven't tested the rather dubious COUNTIF($A$1:$A$10,$A$1:$A$10)+ROW($A$1:$A$10)/(1+ROW($A$1:$A$10)) on large lists, so it might be flawed in that aspect. So to sum it up, it might be better to stick with the previous ones..Confused

    • Post Points: 21
  • Thu, Feb 14 2008 1:35 PM In reply to

    • hrlngrv
    • Top 25 Contributor
    • Joined on Thu, Jan 17 2008
    • somewhere in the western US
    • Posts 68
    • Points 1,092

    Re: most frequent string in a column report

    lecxe:
    hrlngrv:
    ...
    =INDEX(D,MATCH(LARGE(IF(MATCH(D,D,0)=ROW(D)-MIN(ROW(D))+1,
    COUNTIF(D,D)-(ROW(D)-MIN(ROW(D)))/ROWS(D)),ROWS(J$1:J1)),
    IF(MATCH(D,D,0)=ROW(D)-MIN(ROW(D))+1,
    COUNTIF(D,D)-(ROW(D)-MIN(ROW(D)))/ROWS(D)),0))

    Well, if I combine our efforts in one formula, I think it can still be shortened a bit:

    range reference changed to D for comparability
    =INDEX(D,MATCH(LARGE(IF(FREQUENCY((MATCH(D,D,0)),ROW(D)),
    COUNTIF(D,D)+ROW(D)/(1+ROW(D))),ROW(A1)),
    IF(FREQUENCY((MATCH(D,D,0)),ROW(D)),
    COUNTIF(D,D)+ROW(D)/(1+ROW(D))),0))


    On the new subtopic of proof of correctness, you've got a problem in your FREQUENCY calls, namely, their dependence on thier second argument array starting at 1. Using A1:A10 provides this, but it makes your formula overly dependent on using such ranges. Generalizing it, fixing the positional term so it returns values in original order, and eliminating unnecessary parentheses, your formula becomes

    =INDEX(D,MATCH(LARGE(IF(FREQUENCY(MATCH(D,D,0),ROW(D)-MIN(ROW(D))+1),
    COUNTIF(D,D)-(ROW(D)-MIN(ROW(D)))/ROWS(D)),ROW(A1)),
    IF(FREQUENCY(MATCH(D,D,0),ROW(D)-MIN(ROW(D))+1),
    COUNTIF(D,D)-(ROW(D)-MIN(ROW(D)))/ROWS(D)),0))

    Still longer.

    Once you adapt your formula to refer to a generic range, there's no benefit to the FREQUENCY call.

    In different terms, if I were to assume D began in row 1, I could rewrite my formula as

    =INDEX(D,MATCH(LARGE(IF(MATCH(D,D,0)=ROW(D),
    COUNTIF(D,D)-ROW(D)/ROWS(D)),ROWS(J$1:J1)),
    IF(MATCH(D,D,0)=ROW(D),COUNTIF(D,D)-ROW(D)/ROWS(D)),0))

    Note on the positional term: if you want to display tied items in original order, and the base ordering is descending, then either add larger subordinate terms for items that appear earlier in the data or subtract smaller subordinate terms for items that appear earlier in the data. The latter is shorter.

    Note on ROW(A1) vs ROWS(J$1:J1): ROW always returns an array, even single item arrays. ROWS never returns an array. There are obscure cases in which Excel can't handle ROW calls returning single items in longer formulas, at least not without wrapping those ROW calls inside SUM or INDEX(ROW(.),1) calls to force them to scalars.

    lecxe:
    Efficiency however, I cannot meassure. . . .

    It's measurable. You could measure it if you knew how and had the inclination to do so. FREQUENCY(a,b) clearly does more work than a=b.

    • Post Points: 37
  • Fri, Feb 15 2008 1:47 AM In reply to

    Re: most frequent string in a column report

    >Efficiency however, I cannot meassure.

     There is code here to test calculation speed (efficiency):

     http://msdn2.microsoft.com/en-us/library/aa730921.aspx

    Biff

    • Post Points: 5
  • Fri, Feb 15 2008 6:10 AM In reply to

    • lecxe
    • Top 50 Contributor
    • Joined on Wed, Jan 9 2008
    • Gävle, Sweden
    • Posts 15
    • Points 267

    Re: most frequent string in a column report

    hrlngrv:

    On the new subtopic of proof of correctness, you've got a problem in your FREQUENCY calls, namely, their dependence on thier second argument array starting at 1. Using A1:A10 provides this, but it makes your formula overly dependent on using such ranges. Generalizing it, fixing the positional term so it returns values in original order, and eliminating unnecessary parentheses, your formula becomes

    =INDEX(D,MATCH(LARGE(IF(FREQUENCY(MATCH(D,D,0),ROW(D)-MIN(ROW(D))+1),
    COUNTIF(D,D)-(ROW(D)-MIN(ROW(D)))/ROWS(D)),ROW(A1)),
    IF(FREQUENCY(MATCH(D,D,0),ROW(D)-MIN(ROW(D))+1),
    COUNTIF(D,D)-(ROW(D)-MIN(ROW(D)))/ROWS(D)),0))

    Still longer.

    Once you adapt your formula to refer to a generic range, there's no benefit to the FREQUENCY call.

     Thank you for your suggestions. I'm learning a lot here! How bout if I write the 2nd FREQUENCY argument like this to handle generic range? (Formula also corrected for original order and parentheses elimination):

    =INDEX(D,MATCH(LARGE(IF(FREQUENCY(MATCH(D,D,0),ROW(INDIRECT("1:"&COUNTA(D)))),COUNTIF(D,D)-ROW(D)/(1+ROW(D))),ROW(A1)),IF(FREQUENCY(MATCH(D,D,0),ROW(INDIRECT("1:"&COUNT(D)))),COUNTA(D,D)-ROW(D)/(1+ROW(D))),0))

    When you write this formula and your formula next to each other, this one still uses more characters. However, this formula calls 22 functions and refers to D 15 times. Your formula calls 24 functions and refers to D 19 times. So I guess the term longer depends on how you define it. Thanks for the input on  ROW(A1) vs ROWS(J$1:J1), I didn't know that. Can you find an example? It would be interesting to see under which circumstances this appears.

     /lecxe

     

    • Post Points: 21
  • Fri, Feb 15 2008 11:19 AM In reply to

    • hrlngrv
    • Top 25 Contributor
    • Joined on Thu, Jan 17 2008
    • somewhere in the western US
    • Posts 68
    • Points 1,092

    Re: most frequent string in a column report

    lecxe:
    How bout if I write the 2nd FREQUENCY argument like this to handle generic range?
    ...

    Where'd you miss my statement that your FREQUENCY calls do nothing useful. I'll repeat, FREQUENCY(MATCH(D,D,0),{1;2;3;...;ROWS(D)}) does more work than MATCH(D,D,0)={1;2;3;...;ROWS(D)}; it returns and array with ROWS(D)+1 items, but the last one doesn't cause a problem; but for the first ROWS(D) items, your FREQUENCY call returns positive and zero values where the = comparison returns TRUE and FALSE, respectively. Since both are being used as 1st arguments to IF, your positive and zero values are just being used as TRUE and FALSE, respectively. QED.

    lecxe:

    =INDEX(D,MATCH(LARGE(IF(FREQUENCY(MATCH(D,D,0),ROW(INDIRECT("1:"&COUNTA(D)))),
    COUNTIF(D,D)-ROW(D)/(1+ROW(D))),ROW(A1)),IF(FREQUENCY(MATCH(D,D,0),ROW(INDIRECT("1:"&COUNT(D)))),
    COUNTA(D,D)-ROW(D)/(1+ROW(D))),0))

    I wasn't clear before. Unless it's your intention to return distinct values in D that appear the same number of times (ties) in a different order than the order in which they appear in D, you should replace -ROW(D)/(1+ROW(D)) with -ROW(D)/ROWS(D).

    lecxe:
    When you write this formula and your formula next to each other, this one still uses more characters. However, this formula calls 22 functions and refers to D 15 times. Your formula calls 24 functions and refers to D 19 times. So I guess the term longer depends on how you define it.

    Meaning you choose to define longer as having fewer characters? Or are you confusing formula length with efficiency? As for efficiency, if your formula only had to recalculate once your latest formula might recalculate faster. However, you're now using INDIRECT, which is a volatile function, which means it recalculates every time anything changes anywhere in any open workbook. My formula, on the other hand, recalculate only when cells in D change. Many formulas like your latest in a workbook will procude a user experience similar to that of 300 baud modems 25 years ago. Or the ever popular watching paint dry.

    lecxe:
    Thanks for the input on  ROW(A1) vs ROWS(J$1:J1), I didn't know that. Can you find an example? It would be interesting to see under which circumstances this appears.

    http://groups.google.com/group/microsoft.public.excel.worksheet.functions/browse_frm/thread/55f132b1c3761502/c69d2c9a4830d076?hl=en&lnk=st&q=#c69d2c9a4830d076
    (or http://tinyurl.com/2xwhog )

    Google Groups advanced search is a very handy tool to learn how to use.

    • Post Points: 21
  • Fri, Feb 15 2008 1:27 PM In reply to

    • lecxe
    • Top 50 Contributor
    • Joined on Wed, Jan 9 2008
    • Gävle, Sweden
    • Posts 15
    • Points 267

    Re: most frequent string in a column report

    hrlngrv:


    Where'd you miss my statement that your FREQUENCY calls do nothing useful. I'll repeat, FREQUENCY(MATCH(D,D,0),{1;2;3;...;ROWS(D)}) does more work than MATCH(D,D,0)={1;2;3;...;ROWS(D)}; it returns and array with ROWS(D)+1 items, but the last one doesn't cause a problem; but for the first ROWS(D) items, your FREQUENCY call returns positive and zero values where the = comparison returns TRUE and FALSE, respectively. Since both are being used as 1st arguments to IF, your positive and zero values are just being used as TRUE and FALSE, respectively. QED.

     Ok, I see what you mean now.

    hrlngrv:

    I wasn't clear before. Unless it's your intention to return distinct values in D that appear the same number of times (ties) in a different order than the order in which they appear in D, you should replace -ROW(D)/(1+ROW(D)) with -ROW(D)/ROWS(D).

     I'm confused. -ROW(D)/(1+ROW(D)) does return values in the order in which they appear.

    hrlngrv:

    Meaning you choose to define longer as having fewer characters?

    No, My point was that i f you use D in stead of the real references, your formula has fewer charcters. If you replace D with the real references, mine has fewer characters. And if you define longer as the number of calls to functions or references, yours is longer. Since you brought up the issue of shorter/longer, how do you define it? Short/long is of course of little importance in this case. What matters is how efficient the formula is, and in that aspect I would certainly go with your formula .

    hrlngrv:

    Google Groups advanced search is a very handy tool to learn how to use.

    Thanks for the tip. I'll be sure to bookmark it for future use. I'll admit though, I didn't do a search myself, just hoped you had an example close at hand. Next time, I'll search first and then ask.Smile

    • Post Points: 21
  • Fri, Feb 15 2008 1:49 PM In reply to

    • hrlngrv
    • Top 25 Contributor
    • Joined on Thu, Jan 17 2008
    • somewhere in the western US
    • Posts 68
    • Points 1,092

    Re: most frequent string in a column report

    lecxe:
    I'm confused. -ROW(D)/(1+ROW(D)) does return values in the order in which they appear.

    You're right. Sorry about that.

    lecxe:
    No, My point was that i f you use D in stead of the real references, your formula has fewer charcters. If you replace D with the real references, mine has fewer characters.

    OK, fair point, but that's why I almost always use short worksheet-level defined names for such references since that almost always shortens formulas and improves clarity. It's easier to spot the difference between ABC and ACB than between $A$11:$A$10000 and $A$111:$A$1000.

    • Post Points: 5
  • Fri, Feb 26 2010 11:56 AM In reply to

    • Faisal
    • Not Ranked
    • Joined on Fri, Feb 26 2010
    • Posts 1
    • Points 5

    Re: most frequent string in a column report

    Excellent!  Registered on this forum just to salute you!

    Thanks

    Faisal

     

    Yes

    • Post Points: 5
Page 6 of 6 (84 items) « First ... < Previous 2 3 4 5 6 | RSS
Copyright Excel User Group and the relevant contributors, 2009. All Trade Marks acknowledged. This site is a peer-to-peer site and NOT affiliated in any way to Microsoft Corp. All rights reserved.