Distinct Count in PowerPivot v2 – MUCH Faster

July 28, 2011

A quick update, as tonight I managed to sit down and experiment a bit with the V2 Beta for the first time in many days.

(OK, actually, I am standing up…  because I have a fantastic new treadmill desk and have walked nearly 8 miles today while working, but that’s a topic for another post).

I was working with one of our HostedPowerPivot clients today who was observing slower-than-expected slicer click performance in one of their reports.  And after some sleuthing, I found that they had a “distinct count” measure in their model, something like this:

[Measure] = COUNTROWS(DISTINCT(Table[Column]))

Now, that’s a pretty useful measure in many situations, like “does product X sell every day?”  Simply taking a count of the date column doesn’t cut it, because if it sells twice on one day and not at all the next day, the count is still 2, and you want it to be 1.

Trouble is, in PowerPivot v1, that operation can be slow.  I won’t bore you with all the details, because frankly, I don’t know all of them, heh heh.  But I do know that PowerPivot v1 does end up creating all kinds of temporary new tables in memory behind the scenes when it is evaluating the measure.

Anyway, I’d heard a long time ago that PowerPivot V2 was going to make distinct counts a lot faster, so I thought I’d try it out.

The Test Case

First I needed to create a test case that was challenging.  I don’t want to compare “fast query vs. fast query” because that often gets skewed by fixed overhead.  Much better to test a “worst case scenario.”

So, I took a 15 million row table, and added a calculated column to it.  Calculated columns in PowerPivot are much less compressed than imported columns, and are therefore more challenging for the measure engine to scan through, looking for duplicates etc. which is required for a distinct count.

And then, to make matters worse, I just used the RAND() function so that there are many unique values:

image

Cutting to the Chase:  The Results

A distinct count measure in PowerPivot v1 took 35 seconds to complete in my test pivot.  This was a big part of why our client was seeing slow perf today.

That same measure, in that same pivot, took less than a second in the V2 beta.

Wow.  This is gonna be nice.

One Last Note:  Two Ways to Write the Formula

PowerPivot v1 did not have a dedicated function for distinct count, you had to do the countrows of distinct thing like what I showed above.

But in PowerPivot v2, there IS a DISTINCTCOUNT() function.  So you can write your distinct count measures in two ways:

[Measure] = COUNTROWS(DISTINCT(Table[Column]))

[Measure] = DISTINCTCOUNT(Table[Column])

That new function is nice, makes for a more readable formula.  BUT…  the performance is still the same.  Either way I wrote it, the measure was sub-second fast.  Very nice.


Guest Post: TOPN in PowerPivot V2

July 20, 2011

Well folks I haven’t had much time to play with the V2 Beta (aka CTP3) yet.  A lot going on over the past week:

  1. Authoring a brand new suite of models and reports for our Duane Reade DR-Direct solution
  2. A nasty crashing bug in PowerPivot for SharePoint that we’ve now learned to catch and repair
  3. I submitted a PowerPivot article to CIMA Insight magazine (for August)
  4. Running refresh tests with PowerPivot servers in Chicago pulling data from SQL in Dallas (over Internet VPN – more on this later) for the HostedPowerPivot offering

So in the meantime, David Hager has a short article to share on the new TOPN function.  I haven’t had a chance to try it yet, and I’m a little surprised that it doesn’t have special handling for ties (as he’s discovered).

David Hager on TOPN

Using the New PowerPivot Version 2 TOPN Function: Up, Down and Between

One of the new DAX functions included in the Denali PowerPivot CTP3 release is the TOPN function. This function returns a table which can be aggregated for the top n values. Prior to the release of this function, workarounds were available to achieve the same results, mainly through DAX ranking formulas (and Denali PowerPivot now has native functions for ranking too!). So, a typical DAX formula which uses the TOPN function is:

=SUMX(TOPN(N,SalesTable,SalesTable[SalesTotal]),SalesTable[SalesTotal])

where SalesTable is the table and SalesTotal is the column in that table containing the top n results to be analyzed.

Well, that was easy enough. However, there is a systemic problem here and it’s mainly data-related. In order to understand why the TOPN function don’t work the way you might expect, read the following that comes from the Denali PowerPivot CTP3 help documentation on TOPN.

“If there is a tie, in order_by values, at the Nth row of the table, then all tied rows are returned. Then, when there are ties at the Nth row the function might return more than n rows.”

In English, that means if there are duplicate values in your data, the TOPN function will return the WRONG answer if the Nth value is adjacent to duplicates. In order to get a meaningful result then, the data must be converted so that each data point has a unique value. The question is, how do you do that and retain the integrity of the data? An Excel trick can be used here as a workaround. This method requires that each value be incremented by a very small but unique value. The best implementation of this comes in the form of a calculated column with the following formula

SalesTable[UniqueIncremValue] = 
IF(RANDBETWEEN(-1,1)=-1,-RAND()*2,RAND())*0.000001

The goal of this formula is to create a column of unique numbers add up basically to zero. There is a balance that is required in this formula between affecting data and insuring that all values in the dataset are unique. The RAND function returns a 15-place decimal, which is the limit in Excel. However, if all of the decimal places were used, the value of the number could be as high as 1 or as low as -2. By multiplying the random numbers by a very small number, the numbers will not interfere with the individual values if they themselves are not small numbers. However, if the multiplier is made too small, not all of the random numbers may be unique depending on the size of the dataset. So, if you are going to use this workaround, you need to test [UniqueIncremValue] to see if all of the values are unique. The best way to do this is to compare its distinct count with the distinct count of a primary key column in the table. If they are equal, TOPN will work for any value of N.

Now, the original formula will work fine if the new calculated column is used in place of SalesTable[SalesTotal].

SalesTable[SalesTotalMod]=
SalesTable[SalesTotal]+ SalesTable[UniqueIncremValue]

It should also be noted that the TOPN function can used in a DAX like the one shown below to afford a BOTTONM result. Note that this formula takes advantage of the new DISTICTCOUNT function.

=SUMX(SalesTable,[SalesTotalMod])
-SUMX(
     TOPN(
         DISTINCTCOUNT(SalesTable[SalesTotalMod])- N,
         SalesTable,
         SalesTable[SalesTotalMod]
         ),
     SalesTable[SalesTotalMod]
    
)

I think you can see that the TOPN function is a valuable new item for your DAX toolbox. Have fun!


Running Totals Without a Traditional Calendar Table

June 30, 2011

PowerPivot provides a host of great functions like DATESYTD, DATESMTD, DATESBETWEEN, etc. that are useful for calculating many things, including a running total.

But with the exception of DATESBETWEEN, I seldom get to use any of those “time intelligence” functions, for the simple reason that our clients almost never operate on a traditional calendar.

I very often find myself working with a “calendar” that looks like this:

Non Traditional Calendar

Can’t Use Time Intelligence Functions With a Calendar Like This

And sometimes I even find myself with “calendars” that don’t have any date columns in them at all:

Another Non Traditional Calendar

This Table DOES Truly Represent the Business Calendar But Has No Dates in It!

So what do you do when you still need a “Year to Date” Total?

Silly humans.  They don’t care that the data is structured one way or another under the hood.  They just keep insisting on seeing useful things, like Year to Date totals.  They don’t want to hear how the blender is constructed, they just want their daiquiri.

So, what’s a report designer to do?  Give up?  No way.  We make daiquiris anyway.

I’m going to use that second calendar above, the one that has no dates in it.  In fact this is the same data set I have been using for the Precedence Project.  (Really, this is Part Three in disguise).  So let’s return to a familiar pivot:

image

Familiar Starting Point

It has Year and MonthNum on rows, and the simple measure Total Sales.  I want to end up with this:

image

Desired Result

Getting to that desired result, in my experience, is something you either stumble upon quickly or flail around forever and never find.  In fact, finding myself in a situation like this is what triggered me to start the Precedence Project in the first place.

Cutting to the chase:  this formula works

To keep this post short and sweet, I’m just going to share a working formula.  I hesitate to call it “the” formula, because there are multiple variations that work, and some fit certain biz requirements better than others.

So here is one that works.  It has some quirks that I will iron out in the next post.

[YTD Sales]=
[Total Sales](
                FILTER(
                  ALL(Periods),
                  Periods[PeriodNum]<=MAX(Periods[PeriodNum])
                ),
                VALUES(Periods[Year])
             )

In short, this  measure clears all filters on the entire Periods table, then adds back two filters – restores the current context for Year (using VALUES), and limits the PeriodNum to be less than or equal to the current context (I will explain later why I used MAX and not just VALUES).   Note that I am using PeriodNum (which keeps increasing across years and is unique to a given Month/Year combo), not MonthNum (which is on the pivot and “resets” to 1 for January of each year).

Next Post:  Explaining the Formula, and Showing How NOT to do it

Explaining that relatively simple formula, if I do a thorough job, will consume its own post.  And I also want to cover some other approaches – ones that seem like they should work but do not.  Because the things that DON’T work are even more educational than the things that do.


Precedence Part 3: ALLEXCEPT vs. ALL w/ VALUES

June 22, 2011

 
Snatch - A Damn Fine Movie

 

“However…  you do have ALL() the characteristics of a dog, Gary.  ALLEXCEPT() loyalty…”

 

 

 

This will be a quick one.  I think.

In part two, we left off with the observation that VALUES() can “trump” ALL() even when VALUES() is applied to a column that is not on the pivot:

[ALL then VALUES of field not on pivot] =
     [Total Sales](ALL(Periods),VALUES(Periods[Period Num]))

image_thumb5

One thing I have often puzzled over is this:  how does a combination of ALL(Table), VALUES(Table1[ColumnA]) compare to using ALLEXCEPT(Table1, Table1[ColumnA])?

In essence, those should do the same things right?  In one case, you set the whole table to ALL(), but then “restore” ColumnA to its pivot context using VALUES().  In the other case, you set every field in the table except ColumnA to ALL(), leaving ColumnA in the context set by the pivot.

They should result in the same thing, right?  Let’s check.

ALLEXCEPT() using columns that are on the pivot

[ALLEXCEPT fields on Pivot]=
[Total Sales](ALLEXCEPT(Periods,Periods[Year],
              Periods[FiscalQuarter]))

There are two fields on the pivot from the Periods table – Year and FiscalQuarter, and both of those are excluded from being “all’d” in the measure.  Results:

image

OK, good news.  Those are indeed the same results as the unfiltered base measure, and that’s what we got from ALL() plus VALUES() as well.  So from this, we can say that ALL() plus VALUES() is the same as ALLEXCEPT().

Not so fast though…

ALLEXCEPT() using columns that are NOT on the pivot

Let’s try ALLEXCEPT() using a field that is NOT on the pivot.  Remember, VALUES() trumped ALL() in this case, too, and the results were the same as the unfiltered base measure.  Will ALLEXCEPT follow suit?

[ALLEXCEPT field NOT on Pivot]=
[Total Sales](ALLEXCEPT(Periods,Periods[Period Num]))

image

OK, that was a bit unexpected.  ALLEXCEPT() returned the same results as pure ALL().  So ALLEXCEPT() does *not* preserve the pivot context of columns NOT on the pivot, whereas VALUES() *does* pick those up.

I’m not sure if that’s intentional on the part of the PowerPivot team, or if it’s just a quirk.

Summary

1) ALLEXCEPT() does behave precisely like ALL(), VALUES() as long as the “exempted” columns are columns on the pivot.

2) ALLEXCEPT() does NOT preserve pivot context, however, on columns that are not on the pivot.

Oh, and one more…

No need to use ALLEXCEPT() instead of ALL() when using VALUES()

In the past, whenever I wanted to do an ALL() with a VALUES(), I typically used an ALLEXCEPT() instead of the ALL().

For instance, I would do something like this:

[Total Sales](ALLEXCEPT(Periods,Periods[ColumnA]),
              VALUES(Periods[ColumnA]))

That turns out to have been an overly-careful habit.  As we’ve seen here, and in part two, VALUES() trumps the ALL(), and does not need the “help” of ALLEXCEPT() instead.

Yes, I’m sure many of you will say that’s obvious from everything covered so far, but just in case you find yourself tempted to use ALLEXCEPT() to exempt the column you are preserving with VALUES(), there’s no need.


Precedence Project Part Two: VALUES()

June 16, 2011

 
OK, picking up from part one

Let’s start with a simplified version of last post’s pivot – remove one of the row fields, and all of the measures but the base Total Sales measure:

image

Now let’s add a measure that sets Year to ALL:

[ALL Year Sales] = [Total Sales](ALL(Periods[Year]))

PowerPivot ALL Year Measure

Note that all of the orange cells have the same value – they are all Q1 cells and differ only on Year, so the all Year measure means each orange cell will display the total sales for Q1 across all years.  So the green cells all add up to the value in each orange cell.

Rule #3:  VALUES() Selectively “Shuts Off” ALL()

OK, we’ve played with VALUES() a few times on the blog, most notably here and here.  But previously, we’ve never used it like we are going to use it this time.  We are going to use it to preserve the original filter context and “fight back” against ALL().  Exciting huh?

Let’s add another of our previous measures back to the pivot – the one that applied ALL() to the entire Periods table:

[ALL Period Sales] = [Total Sales](ALL(Periods))

image

OK, now let’s add VALUES() to the mix in a new measure that has ALL applied to the whole Periods table, and VALUES applied to just one column in Periods:

[ALL with VALUES] =
     [Total Sales](ALL(Periods), VALUES(Periods[FiscalQuarter]))

ALL Against Whole Table, VALUES Against One Field

This time, again the Q1 cells are all the sum of every Q1 in the base Sales measure.  So the VALUES function, applied to the FiscalQuarter column in this measure, overrode the ALL which was applied to every column in the Periods table (including FiscalQuarter!)

So this new measure differs from the pure [ALL Period Sales] measure at the Quarter level.  But note the blue cells.  It’s still identical to [ALL Period Sales] at the Year level.

So is ALL() still “winning” at the Year level?  No.  It’s just that, at the Year level, VALUES(Periods[FiscalQuarter]) returns  the list of all four quarter values, because all four Quarter values truly due correspond to that cell’s coordinates in the pivot (the subtotal for the Year does indeed imply all quarters).

Neat huh?

Clarification/Correction

Derek pointed out in a comment that the two values in blue actually are NOT equal.  They are off by $281.  This is because there are blank values for year – some noise in the data – and I have filtered those out using the little Row Labels dropdown:

image

If I clear that filter and allow blanks to show up in the pivot, I get:

image

Note that the two values in blue are still off by $281.  But the mysterious missing $281 is indeed accounted for by the blank year.

So…  the pure ALL(Periods) measure catches that $281.  That’s expected.  What’s unexpected, at least initially, is that the ALL(Periods), VALUES(Periods[Fiscal Quarter]) measure MISSES that $281. 

Here’s why:  note that there’s only a blank Quarter under the blank year.  There are no blank quarters associated with the blue cells.  So that $281 is omitted.

Great catch Derek, this was awesome.  I probably should spend a post just explaining where those blanks Years and Quarters came from.

Back to the original post…

We can even confirm that VALUES() beats ALL() in a heads-up fight.  Rather than use ALL() against a whole table and VALUES() against one column, let’s use them both against the same single column:

[All vs VALUES Heads Up] =
      [Total Sales](ALL(Periods[Year]),VALUES(Periods[Year]))

VALUES Beats ALL Heads Up

Hey look!  Our new measure returns the EXACT same results as the base sales measure.  VALUES completely shut off ALL() in a head to head fight.

Rule #4:  VALUES() Wins Even if the “Targeted” Field Is Not on the Pivot!

Maybe this one isn’t so much a precedence rule as it is surprisingly cool.  Let’s try using VALUES() against a column from the Periods table that we don’t have on the pivot…

Periods Table

Let’s try that [Period Num] column.  Notice how it uniquely identifies each period, regardless of Year?  It’s the unique key for this table, so if VALUES() works, that *should* beat ALL() even if ALL() is applied to the entire table.

Let’s try it:

[ALL then VALUES of field not on pivot] =
     [Total Sales](ALL(Periods),VALUES(Periods[Period Num]))

image

…and the result is IDENTICAL to the base sales measure.  The ALL() is completely overridden.  Cool huh?  VALUES operates against the underlying tables, and merely takes into account the filters from the current pivot cell.

Good thing to know.  I’ve been taking advantage of that for a long time, it is very helpful.

OK, this post has run long so I will save ALLEXCEPT for the next one.


The Great Precedence Project (Filter Rules for DAX)

June 14, 2011

 
Richard Pryor wrote most of Blazing Saddles, explaining why it's the best movie Mel Brroks ever made

“Wait a minute. There might be legal precedent. Of course. Land snatching. Let’s see, land, la-land…see snatch. snatch…snatch…ah! Haley vs. United states. Haley 7, United States nothing. You see, it can be done!”

 

 

Keeping filters straight as they pile up

Some example questions to ponder:

  1. Within a FILTER or a CALCULATE, does the order of the filter arguments ever make a difference?
  2. Does using an ALL inside of a FILTER yield different results than using an ALL inside of a CALCULATE?
  3. Measures that are used inside of a FILTER or DATESBETWEEN are never impacted by external filter-setting forces are they?  Or are they?

I went through several months without learning anything new about DAX, and that seemed to me like a Good Sign.  A sign that perhaps I now knew everything I needed to know – not everything I could know, but everything I would ever really need to know.

But then a series of new scenarios presented by clients opened up my eyes to some very “squishy” places in my understanding.  I got everything working for the clients, but I took a note to dig in later and see if I could capture everything into a series of new “rules.”

In the meantime I fired off a few emails – specifically to the product team and “The Italians” – and got some initial advice (the Italians were particularly helpful).  But both of those parties understand this stuff almost TOO well – and what Excel people like me need, I think, is to learn from someone who struggled initially.

So I propose to be that person – who struggles with questions like the above, then triumphs, then explains.  I am going to post my findings as I go, but rest assured that it’s not an entirely selfless act – forcing myself to explain things is the way I best learn things myself Smile

It’s gonna take a series of posts to cover this.  And frankly, I’m not done discovering all the rules yet.  Let’s dig in.

Setting the stage:  two simple measures

We need a base to start from.  So, here’s a simple pivot:

Starting Point for PowerPivot Filter Precedence

The measure there is nothing fancy – just a simple SUM.  All of the fields on rows are from the Periods table:

Periods Table

Now let’s add an ALL(Periods) version of the measure:

[ALL Period Sales] = [Total Sales](ALL(Periods))

Starting Point for PowerPivot Filter Precedence with ALL

And if the syntax I am using, [Measure](<filters>), makes you scratch your head, remember that it is just shorthand syntax for CALCULATE([Measure], <filters>) – they are 100% the same, I just prefer the shorthand syntax for readability.  (This is a controversial topic within the DAX community however, and others disagree with me, but I’m right, damnit! Smile)

Rule #1:  “ALL()” and “Table[Col]=Val” Work Together

Let’s add a third measure, one that has the ALL() from above, but also sets Year=2009:

[ALL Period Sales Set to Year 2009] =

     [Total Sales](ALL(Periods),Periods[Year]=2009)

And the results:

ALL and Simple Filters Work Together

Neither clause “overruled” the other.  They were both applied.  We can think of this as “all Periods filters were completely removed, but then Year was set to 2009.”

Of course, thinking of it that way makes you wonder…  what if I reverse the clauses and have the ALL second?  Will that then override the simple filter for 2009?  I mean, I hope that doesn’t matter.  Marco and Alberto say it doesn’t matter.  But I need to see it myself…

Rule #2:  Filter order does NOT make a difference within a CALCULATE

OK, let’s reverse the order:

[ALL Period Sales Set to Year 2009] =

     [Total Sales](Periods[Year]=2009, ALL(Periods))

And yes, same result:

Order does NOT matter in filter clauses

OK, that’s reassuring.  But it gets more interesting from here.  In the next post, I’m going to show you that when ALL() meets VALUES(), there arguably IS a winner, and that ALLEXCEPT() and VALUES() yield outcomes that I did not expect.


Portable Formulas: An Underestimated Benefit

May 5, 2011

 
image

Neo: Are you saying that I can copy/paste formulas?
Morpheus: No Neo, I’m saying that you won’t have to.

 

 

Hi folks.  Sorry about the lack of posts lately.  I’ve been on the road.  A lot.  This week I’m doing a doubleheader – consulting/training in New Mexico and then California, so I’m writing this on a plane between Denver and San Diego.  (I recently joked that I’m the PowerPivot version of Tyler Durden – never sleeping, crisscrossing the country setting up franchises of Pivot Club).  PowerPivot adoption is rapidly picking up and that has the ironic side effect of taking me away from the blog.

But the other reason I’ve been so quiet is that I submitted a two-part post to Microsoft’s official Excel blog rather than posting it here.  Some posts just “speak” to the Excel audience better than others, and this is one that I really want as many Excel folks to see as possible.  The majority of Excel pros have yet to discover PowerPivot, and I love being the bearer of good news.

The post is actually about an inherent weakness in all traditional spreadsheets.  One that I believe eats more than half of the average Excel pro’s time, if not more.  A weakness that PowerPivot obliterates… with a concept that I am calling Portable Formulas.

Part one of two is live today: Click here to view it on the Excel blog.


The Friendly Neighborhood Operators && and ||

April 22, 2011

David Hager’s guest post below reminded me of something I’ve been meaning to share for awhile.  Here’s a feature of PowerPivot so tiny, so sensible, that I’ve had a very hard time internalizing that Excel lacks it.  In fact, I think I’ve had to re-check Excel 3-4 times to confirm for myself that Excel doesn’t have it.

The feature is the pair of logical operators && and ||.  They are alternatives to the Excel functions AND() and OR(), respectively.  Check out these calculated column formulas:

IF(Product[Color]=”Blue” && Product[Weight]>6, “Heavy and Blue”, “Everything Else”)

IF(Product[Color]=”Blue” || Product[Color]=”Red” || Product[Color]=”Yellow”, “Primary Color”, “Blend”)

The first example is looking for Products that are both blue AND weigh over 6 pounds.  The second example will tag a product as “Primary Color” if it is Blue, Red, OR Yellow. 

Notice how you can also use more than two clauses – neat huh?

Note that using && and || is often a great alternative to the dreaded “nested IF” formula.

Also, I mentioned above that Excel DOES have the functions AND() and OR() that do the same things.  But these operators make for much easier formulas to read, and I think they are a lot easier to write as well.

Using || in a CALCULATE measure

I don’t think I’ve had occasion to use && in a CALCULATE measure, primarily because adding another filter clause to calculate achieves the same effect as an AND.  But || sure is helpful, here’s an example:

[Return Dollars] = CALCULATE([Sales],
     Sales[TransactionType]=”Return” ||
     Sales[TransactionType]=”Credit”)

That measure recalculates the [Sales] measure and only includes transactions that were either Returns OR Credits.  Neat huh?  It’s a great shortcut alternative to adding a calculated column with a nested IF, or even a calc column that uses || – just do it right there in the measure.

And now for another alternative to nested IF’s:  David’s post on using a lookup table.


Guest Post: Calculating a Sum Based on a List Criteria

April 22, 2011

Guest Post By David Hager

It is a fairly easy task in PowerPivot to calculate a sum based on N criteria (after a learning curve). This can be accomplished by filtering your data before importing it in the PowerPivot window, selections made in the Pivot Table, or through some (relatively) simple DAX calculations.

However, if N gets too big, the task get much bigger. Now, imagine a table with thousands of customers and the task is to calculate a sum based on only 1000 of those customers. Conventional filtering does not provide a way to do this. Then, say that this list of customers changes from day to day. It would be nice to have a method to perform this calculation. There is!

The first step is to create a linked table from your Excel workbook to its PowerPivot window. The table in PowerPivot is named LookupList. Then, a relationship is created between that table and an existing Table1 (as shown below).

image

The list shown above does not have 1000 entries, and is just for demonstration purposes. However, I have tested it with >1500 entries and it works :)

Solution One:  Calc Column as Basis for Measure

One solution is to create a calculated column named AggregateList that returns only the amounts that are associated with the customers in the lookup table. The correct formula to do this is:

=IF(ISBLANK(RELATED(LookupList[Customer])),BLANK(),Table1[AMOUNT])

Now, a pivot table can generate the desired result as shown below.

image

By changing the aggregation of the measure, an average or other value can also be obtained.

Solution Two:  Yes/No Column plus CALCULATE Measure

(From Rob):  Rather than produce the numerical column as David did above, my first instinct was to use the LookupTable to generate a simple yes/no column, and then use that as the filter in a CALCULATE measure.

Calc column:

=IF(RELATED(LookupList[Customer])<>"",1,0)

image

and then the measure:

=CALCULATE(SUM(Table1[AMOUNT]),Table1[IncludeCustomer]=1)

This is mostly a matter of style.  Both require a calc column – this option’s calc column is simpler but has a more complex measure.  (Neither approach is all that complicated however.)  This option might make for a smaller file size, but it’s hard to be certain of that, and even if true, it won’t be much.

Back to David.

Refreshing the customer list

I mentioned at the beginning of this article that the customer list provided by the linked table is dynamic by nature. If the list is changed manually, recalculation will occur if the PowerPivot window is opened. However, if the list is generated from Excel formulas some calculation errors can occur. Instead, the use of other methods to update this list is preferable. The details of these issues will be discussed in a future article.


A/B Campaign Analysis with Start & End Date Slicers

April 13, 2011

 
“Because every good story has a beginning, a middle, and an end.”

Awhile back I posted about promotional campaign analysis factoring in seasonal trends.  Now let’s look at another flavor of campaign analysis:  comparing results when a campaign is active versus when it was not.  Sometimes this is referred to as “A/B Testing.”

Let’s start by showing what the results can look like:  a report that has two date slicers – one where you select the Start Date of a campaign, and another where you select the End date:

image Pick Start and End Date, See How Sales Performed On vs. Off Program

The report then shows Sales results “On Program,” which are the sales that occurred between the start date and end date (inclusive) versus the Sales results “Off Program” (sales on all other dates).

Specifically, it shows Sales per Day for On vs. Off Program (because programs run for short periods of time, Sales per Day is a much more “apples to apples” comparison than Total Sales), and then the % change in Sales per Day when On Program versus Off (labeled “Program Delta” above).

(From a quick glance at Program Delta, it’s obvious that this was one awful program that ran from 5/20/03 to 7/8/03, but hey, that’s what we get when we use AdventureWorks as our data set).

Oh, and guess what?  Writing this blog post consumed a lot more time than building the report :)

How This Was Built, Step One:  Date Slicers and Date Measures

The first things you need are two single-column tables of dates – these are used to populate the Start and End date slicers:

 image   image

These tables are NOT related to ANY other tables in the model.  They stand alone, intentionally.  Also, MAKE SURE THEY ARE OF DATA TYPE DATE!  Otherwise the following steps will give you strange results.  Also, make sure none of the date columns in your model have time components lurking in them.

OK, now you need to define a measure on each of those two tables.  It’s the same formula, but I name one [Start Date] and the other [End Date], and assign each one to a different table:

image
image

By the way, the technique I’m showing here is a variation of a technique covered in prior blog posts:  one by Kasper and one by me.  Read Kasper’s in particular if I’m moving a little too fast for your taste in this post.

For grins (and to test this out), you can now slap both slicers and both measures on a pivot and inspect what each of those measures returns:

image

We’ll never place those measures on an actual report, but it’s good to see that they serve their purpose, which is to capture the dates that the user selects on the two slicers.  Note that I changed the captions on each slicer to reflect what their intended use is clear (originally they both just had “Date” as a caption since that was the column name in each table).

Step Two:  Sales Measures that are filtered by those Date Measures

Assuming we already have our base [Sales] measure defined, let’s start with [Sales on Program]:

[Sales on Program] = [Sales](DATESBETWEEN(Dates[FullDate],
      [Start Date],[End Date]))

We’re using the DATESBETWEEN function as a means of filtering our base [Sales] measure, and the [Start Date] and [End Date] measures that we defined previously are the end dates.

Note the use of the Dates table in the formula.  That is a third Date table, separate from the two slicer date tables, and that table IS related to the Sales table.

So…  we are picking up the user’s date selections from two tables that are NOT related to anything, and using the dates they selected to filter the “real” Dates table.  That filter then gets applied to the Sales table because it is related to the Sales table.

[Sales Off Program] is a bit trickier than “On” Program.  I suppose we COULD just subtract “On Program” sales from total [Sales], but just in case someday we need to build some date sensitivity into the base [Sales] measure, let’s defined [Sales Off Program] to be [Sales] filtered to dates OUTSIDE of the selected date range:

[Sales Off Program] = [Sales](FILTER(Dates,
   Dates[FullDate]<[Start Date] || 
   Dates[FullDate]>[End Date]))

First, notice that we are using the FILTER function this time rather than DATESBETWEEN.  There is no DATESNOTBETWEEN or DATESOUTSIDE function, so we have to express the logic more directly ourselves.  But that’s not a big deal – I’m pretty sure that DATESBETWEEN is really just a “skin” over the FILTER function anyway – I’ve used FILTER and DATESBETWEEN somewhat interchangeably over the past year and I always get the same results (and the same performance).

Also note the use of the “||” operator, also known as OR.  Rows from the Dates table are included if they are before the [Start Date] OR after the [End Date].

One more note:  you always want to use FILTER and DATESBETWEEN against the smallest tables you can.  Use them against your Dates table, for instance, rather than against the Date column in your Sales table, because they are MUCH slower against larger tables.

Here’s what we get with those new measures on our pivot, and with some Product hierarchy on rows:

image

Now you see why “per day” versions of these measures are required – the programs are so short that they are dwarfed by the “off” dates.

Note:  These two date slicers are great candidates for disabling cross-filtering and thereby improving the performance (update time) of this report.  See this blog post for an explanation.

Step Three:  Creating the Sales-Per-Day Measures

OK, first we need a [Day Count] measure that we can use as a denominator:

[Day Count] = COUNTROWS(DISTINCT(Sales[OrderDate]))

(Note that I am explicitly counting the Date column from the Sales table rather than the Dates table to account for product lines that did not exist for the duration of the entire Dates table, and therefore did not sell at all – there are tradeoffs here that I won’t go into, but I think this is the right thing to do in most cases when calculating a per-day measure).

[Sales per Day], then, is straightforward:

[Sales per Day] = [Sales] / [DayCount]

Now I need to create “On Program” and “Off Program” versions of that measure.  These formulas parallel the On/Off Sales measures from above:

[Sales per Day on Program] = [Sales per Day]
   (DATESBETWEEN(Dates[FullDate],
    [Start Date],[End Date]))

[Sales per Day Off Program] = [Sales per Day]
   (FILTER(Dates,
    Dates[FullDate]<[Start Date] || 
    Dates[FullDate]>[End Date]))

And lastly, let’s add the [Program Delta] measure, which is really Pct Change in Sales Per Day:

[Program Delta] = IF([Sales per Day on Program]=0,BLANK(),
   ([Sales per Day On Program] – [Sales per Day Off Program])/
   [Sales per Day off Program])

The IF in there is just to catch the case where a product did not sell during the selected dates.  If I leave that IF out, the measure will return –100% for those cases.  Sometimes that is what you want and sometimes it is not – it’s a case by case judgment.

Add some conditional formatting, and here’s the resulting report, repeated again from above:

image

Alternate Approach:  Using a Promotions Table

Rather than use two date slicers like we have here, you COULD have a single Promotions table that lists each promotion and its Start and End date.  Something like this:

image

Now, I just have to go back to my [Start Date] and [End Date] measures and change their definitions to reference the Start and End columns in this table, and “attach” them to this table instead of those slicer tables:

image

image

I don’t have to make ANY other changes.  Now I can remove the Start and End date slicers, and replace them with a single Promo Name slicer:

  image

Or, move Promo Name to Rows and some of the product stuff to slicers:

image

Other Fun Stuff

There are many, MANY other things you can do here too.  For instance, a measure that calculates sales per day for the month leading up to a promotion starting.  Or the month after it ends.  Or the exact same period 1 year ago.  PowerPivot truly does open doors that you’d never consider in traditional Excel (and probably would never get to in traditional BI).


Follow

Get every new post delivered to your Inbox.

Join 164 other followers