Distinct Count in PowerPivot v2 – MUCH Faster

July 28, 2011

A quick update, as tonight I managed to sit down and experiment a bit with the V2 Beta for the first time in many days.

(OK, actually, I am standing up…  because I have a fantastic new treadmill desk and have walked nearly 8 miles today while working, but that’s a topic for another post).

I was working with one of our HostedPowerPivot clients today who was observing slower-than-expected slicer click performance in one of their reports.  And after some sleuthing, I found that they had a “distinct count” measure in their model, something like this:

[Measure] = COUNTROWS(DISTINCT(Table[Column]))

Now, that’s a pretty useful measure in many situations, like “does product X sell every day?”  Simply taking a count of the date column doesn’t cut it, because if it sells twice on one day and not at all the next day, the count is still 2, and you want it to be 1.

Trouble is, in PowerPivot v1, that operation can be slow.  I won’t bore you with all the details, because frankly, I don’t know all of them, heh heh.  But I do know that PowerPivot v1 does end up creating all kinds of temporary new tables in memory behind the scenes when it is evaluating the measure.

Anyway, I’d heard a long time ago that PowerPivot V2 was going to make distinct counts a lot faster, so I thought I’d try it out.

The Test Case

First I needed to create a test case that was challenging.  I don’t want to compare “fast query vs. fast query” because that often gets skewed by fixed overhead.  Much better to test a “worst case scenario.”

So, I took a 15 million row table, and added a calculated column to it.  Calculated columns in PowerPivot are much less compressed than imported columns, and are therefore more challenging for the measure engine to scan through, looking for duplicates etc. which is required for a distinct count.

And then, to make matters worse, I just used the RAND() function so that there are many unique values:

image

Cutting to the Chase:  The Results

A distinct count measure in PowerPivot v1 took 35 seconds to complete in my test pivot.  This was a big part of why our client was seeing slow perf today.

That same measure, in that same pivot, took less than a second in the V2 beta.

Wow.  This is gonna be nice.

One Last Note:  Two Ways to Write the Formula

PowerPivot v1 did not have a dedicated function for distinct count, you had to do the countrows of distinct thing like what I showed above.

But in PowerPivot v2, there IS a DISTINCTCOUNT() function.  So you can write your distinct count measures in two ways:

[Measure] = COUNTROWS(DISTINCT(Table[Column]))

[Measure] = DISTINCTCOUNT(Table[Column])

That new function is nice, makes for a more readable formula.  BUT…  the performance is still the same.  Either way I wrote it, the measure was sub-second fast.  Very nice.


Guest Post: TOPN in PowerPivot V2

July 20, 2011

Well folks I haven’t had much time to play with the V2 Beta (aka CTP3) yet.  A lot going on over the past week:

  1. Authoring a brand new suite of models and reports for our Duane Reade DR-Direct solution
  2. A nasty crashing bug in PowerPivot for SharePoint that we’ve now learned to catch and repair
  3. I submitted a PowerPivot article to CIMA Insight magazine (for August)
  4. Running refresh tests with PowerPivot servers in Chicago pulling data from SQL in Dallas (over Internet VPN – more on this later) for the HostedPowerPivot offering

So in the meantime, David Hager has a short article to share on the new TOPN function.  I haven’t had a chance to try it yet, and I’m a little surprised that it doesn’t have special handling for ties (as he’s discovered).

David Hager on TOPN

Using the New PowerPivot Version 2 TOPN Function: Up, Down and Between

One of the new DAX functions included in the Denali PowerPivot CTP3 release is the TOPN function. This function returns a table which can be aggregated for the top n values. Prior to the release of this function, workarounds were available to achieve the same results, mainly through DAX ranking formulas (and Denali PowerPivot now has native functions for ranking too!). So, a typical DAX formula which uses the TOPN function is:

=SUMX(TOPN(N,SalesTable,SalesTable[SalesTotal]),SalesTable[SalesTotal])

where SalesTable is the table and SalesTotal is the column in that table containing the top n results to be analyzed.

Well, that was easy enough. However, there is a systemic problem here and it’s mainly data-related. In order to understand why the TOPN function don’t work the way you might expect, read the following that comes from the Denali PowerPivot CTP3 help documentation on TOPN.

“If there is a tie, in order_by values, at the Nth row of the table, then all tied rows are returned. Then, when there are ties at the Nth row the function might return more than n rows.”

In English, that means if there are duplicate values in your data, the TOPN function will return the WRONG answer if the Nth value is adjacent to duplicates. In order to get a meaningful result then, the data must be converted so that each data point has a unique value. The question is, how do you do that and retain the integrity of the data? An Excel trick can be used here as a workaround. This method requires that each value be incremented by a very small but unique value. The best implementation of this comes in the form of a calculated column with the following formula

SalesTable[UniqueIncremValue] = 
IF(RANDBETWEEN(-1,1)=-1,-RAND()*2,RAND())*0.000001

The goal of this formula is to create a column of unique numbers add up basically to zero. There is a balance that is required in this formula between affecting data and insuring that all values in the dataset are unique. The RAND function returns a 15-place decimal, which is the limit in Excel. However, if all of the decimal places were used, the value of the number could be as high as 1 or as low as -2. By multiplying the random numbers by a very small number, the numbers will not interfere with the individual values if they themselves are not small numbers. However, if the multiplier is made too small, not all of the random numbers may be unique depending on the size of the dataset. So, if you are going to use this workaround, you need to test [UniqueIncremValue] to see if all of the values are unique. The best way to do this is to compare its distinct count with the distinct count of a primary key column in the table. If they are equal, TOPN will work for any value of N.

Now, the original formula will work fine if the new calculated column is used in place of SalesTable[SalesTotal].

SalesTable[SalesTotalMod]=
SalesTable[SalesTotal]+ SalesTable[UniqueIncremValue]

It should also be noted that the TOPN function can used in a DAX like the one shown below to afford a BOTTONM result. Note that this formula takes advantage of the new DISTICTCOUNT function.

=SUMX(SalesTable,[SalesTotalMod])
-SUMX(
     TOPN(
         DISTINCTCOUNT(SalesTable[SalesTotalMod])- N,
         SalesTable,
         SalesTable[SalesTotalMod]
         ),
     SalesTable[SalesTotalMod]
    
)

I think you can see that the TOPN function is a valuable new item for your DAX toolbox. Have fun!


Follow

Get every new post delivered to your Inbox.

Join 164 other followers