Just a reminder, if you’re a long-time RSS or email subscriber to this blog, you need to re-subscribe. You’ve missed a bunch of new posts already :)
Go to http://powerpivotpro.com and click one of the subscription options to re-subscribe.
Just a reminder, if you’re a long-time RSS or email subscriber to this blog, you need to re-subscribe. You’ve missed a bunch of new posts already :)
Go to http://powerpivotpro.com and click one of the subscription options to re-subscribe.
Hi folks. We’ve moved the blog to a new host. If you visit the blog via powerpivotpro.com in your browser, then you have no changes to make.
In fact, you should NOT be seeing this post unless you are subscribed via Email or RSS.
So, if you ARE receiving this post via Email or RSS, please visit http://powerpivotpro.com and re-subscribe. I promise I won’t do this to you again, and thanks for reading :)
A few people have asked me if the SharePoint Saturday presentation will be recorded this weekend. I am 99% sure the answer is no.
But if you’d like to see something similar, the webinar I did with Rackspace back in June covers many of the same points:
http://www.rackspace.com/knowledge_center/pivotstream_webinar
I’ll warn you though that we hit two distinct technical issues in the presentation – one small SharePoint glitch which was not a big deal, and a problem with the webinar software itself that basically wiped out 4-6 slides and frustrated me to no end.
Up until that point though, it was going great ![]()
Quick note everyone: I will be presenting at SharePoint Saturday in Columbus Ohio, THIS Saturday, August 20th.
Presentation titled: “PowerPivot: SharePoint 2010’s Secret Weapon” at 1:55 PM.
Event Directions Et Cetera: http://sharepointsaturday.org/columbus/Pages/about.aspx
And yes, I have some real posts coming. It’s been VERY busy lately, in a good way. One of my biggest problems is “paralysis” on what to write about first ![]()
A few days ago I mentioned I was writing a series of posts for CIMA Insight, which is the monthly web magazine of the Chartered Institute of Management Accountants – an audience that knows Excel quite well but probably has yet to discover PowerPivot.
Given that this series is starting from scratch, with an introductory post, it is mostly “old news” for readers here. But there are perhaps some high level explanatory points that are new. For instance, this diagram I sketched to explain the traditional tradeoffs between spreadsheets and formal BI: – a decision that results in impractically high costs sooner or later:
PowerPivot, of course, provides a curve that shares the low startup costs of spreadsheets AND the long-term maintainability and robustness of formal BI. Literally, PowerPivot is the end of that “damned if you do, damned if you don’t” situation.
To read the whole (brief) part one of the series, please click here.
As always, I’m interested in your comments.

“Stay close! Come together! Staggered columns! Staggered columns!
I mean… less columns! More rows!”
-General Maximus Speedicus Queryus
For a long time now I have been advocating that everyone minimize the number of columns in their large tables. For certain, you want to not import columns unless they are going to be used.
But I also recommend that if you have a lot of numerical columns in a table, it’s often better to replace them all with a SINGLE numerical column, and add a second column for “type.” Even though that results in many more rows, the theory is that PowerPivot’s VertiPaq engine would return quicker pivot results against the “tall, narrow” table than it would against the “short, wide” table.
I’ve simply “known” that to be true, in my bones, for a long time. I was so certain of it, in fact, that I have never bothered to test a “before and after” case until now.
We had an opportunity this weekend to quantify the “before” and “after” impact of making such a change, so I thought this time I’d record the results and share them.
With that data shape, the [Sales] measure needs to be the sum of a specific column:
And then, a very large pivot reporting against this data ran in 5.7 seconds.
OK, so there’s our baseline.
Whereas the original table used a single row to store all 9 numerical values, this new table stores each numerical value in its own row, and tags each row using the Value Type column to identify which numerical value it is.
(Note that in this new shape, zero values can be completely omitted if you’d like. And we did, which is why there are not 9 * 700k rows in the resulting narrow table – we just dropped rows that were going to be zeroes.)
I then added a very small table that “decodes” the ValueTypes into something I can understand, and related it to the tall table:
Lookup Table – Helps Write More Readable Measures
Each row in this lookup table corresponds to one of the original numerical columns in the Wide table, of course.
I then wrote a simple measure for [Amount]:
Base Measure – Just a Raw Sum of the Value Column,
Serves as a Basis for Other Measures
And then the Sales measure can simply be a filtered version of Amount:
[$ Sales] = The [Amount] Measure Filtered to “Ttl_Dollars” Type Rows
I have to admit, I was a bit nervous, because I had colleagues watching me, they knew I was making a precise comparison and was going to share the results. Would this turn out as expected? Would it be faster? I experienced a moment of doubt as I was about to click the mouse and time the query. I mean, we were looking at nearly 7x as many rows in the new set. I was starting to sandbag my own expectations, telling myself it would likely be the same. I should not have doubted.
It ran in 2.4 seconds, more than twice as fast as the original 5.7 seconds.
Boom. It was worth it, big time. Even more impressive since there is definitely some fixed-cost overhead involved in Excel sending the query, populating the grid with thousands of rows of results, etc. There may even be a full second of overhead involved, in which case raw query time went from 4.7 to 1.4 seconds.
Remember that VertiPaq is a COLUMN storage engine, and it compresses the data VERY efficiently. (I’ve covered this topic briefly here as well).
With that in mind, here are two quick answers. The long answers would be too long for this post ![]()
1) By removing a lot of the “noise” presented by the 9 numerical columns, this change enabled VertiPaq to do a much better job compressing the columns that remained. And that compression translates into better query speed as well.
2) No, I do NOT expect that this trick will always speed things up. It all comes down to “cardinality,” or how many unique values per column. And I’m pretty sure that the three columns dedicated to “Units” in the original data set had lower cardinality than the SKU column for instance. If we hadn’t had those Units columns, I suspect this change would NOT have been faster.
As a side note, the file size of tall and narrow was NOT smaller than short and wide. I’ve mentioned many times that file size is a ROUGH guide to performance but not 100%. Maybe we’ll discuss this in the comments section, because this post has run too long already.
No, I don’t think this will always work.
Yes, this is a real product you can buy.
And yes, it is VERY disturbing. It also dances. Yeah.
A bit off topic today. Fridays are like that sometimes. Let’s call this Friday Brain Food.
SQLRockstar, aka Tom LaRock, has written something awesome on his blog. It deals with interview questions and hiring and is one of the most insightful things I’ve read in a long time.
It’s a short post – go read it now and then come back for my commentary.
That post seems so out of place on a SQL blog. It needs to be a NYTimes editorial or something. Bravo Tom, but if you decide to make a habit of being this insightful all the time, I will frankly find it hard to be around you ![]()
The part that lit me up the most, by far, was that last part:
If a person says that they are typically thought to be red but they don’t really know what color they are because they have never met Santa Claus then you have someone that is not only open to new ideas but will likely want to test those ideas thoroughly. My guess is that maybe 1 in 20 people would respond in this manner. Most folks don’t like to be the pain in the ass that points out the obvious facts to others around them.
(And not to belabor the point, but here’s the link, you really should read his post. Oh, and Tom has grown sensitive to plagiarism over the years so I figure, why not over-link to him to be safe? He’s been plagiarized a million times this year alone. Go copy/paste some text off his blog into Word or anywhere else and see what happens. Seriously, something funny does happen.)
OK, why did that part crack me up? Because I have met these people. And they are much more rare than 1 in 20.
In my 13 years at Microsoft I really only met two people like that. And they were both amazing software testers.
Thanks to a college recruiter with a particular quota to fulfill, I was miscast as a software tester at Microsoft for my first year in Redmond. Actually, I realized in my very first WEEK that I had been routed into the wrong job, but it took a year (and some serious luck) to escape.
You see, to be a truly great test engineer, you have to come to work every single day with a belief that the product is broken. That your coworkers who built it failed to do a good job. You have to be contrarian to the core… but still be able to work with others.
If you know me personally, you might be thinking “well hey, Rob is QUITE contrarian,” and you have a point, but that’s a different flavor. I have no problem with ideas and thoughts that are unpopular, and I’m less afraid to share those thoughts at times than perhaps I should be (I sent some unsolicited investment advice yesterday that comes to mind), but fundamentally I am a VERY optimistic person. I believe things work. I believe in the people around me.
I found it VERY difficult to be a software tester. My biggest problem, every single day, was motivation. Telling myself that my job actually made a difference. Because hey, the product worked! And more importantly, there was serious CREATION going on! The thrill of creating things was everywhere, and I wanted to be on THAT side of the equation. I was constantly trying to create, by disguising new product ideas as “bugs” and entering them into the system.
I still did pretty well at it, but honestly, I never sunk more than a solid hour of real work into actual software testing each day, like the guy in Office Space. And I suspect that was true of many of the test engineers on my team.
But then there was Lawrence.
Lawrence Landauer was head and shoulders above everyone else on that team. When Office 97 was done, he had found four times as many bugs as the person in second place on the team. Lawrence worked LONG hours, but he never seemed to be straining himself at any particular point. The bug tracking system (called RAID at the time, before lifesucking corporatism took hold and gave us Product Studio, and then later Team Foundation Services) was like a word processor for him. And I am absolutely POSITIVE that if we asked Lawrence what color Santa’s pants are, he would laugh and say he’d never met Santa.
Honorable mention in this category goes to James Rivera. I give James a 50% chance of answering that way, depending on the day. Oddly, James came up this week when I was talking to Bill Jelen (Mr. Excel). And naturally, he came up because he had been the bearer of bad news. Still got it after all these years James! ![]()
Tom closes his article with the following thought:
Most folks don’t like to be the pain in the ass that points out the obvious facts to others around them.
Then again, most folks aren’t DBAs, either.
I watch DBA’s talk to each other in Twitter a lot. And mostly what they do, as their primary means of social interaction, is complain. Complain complain complain. Mostly about their coworkers, or clients. Things like “no you can’t have admin privs on the database mr. developer” but usually things that go over my head like “look at this query the previous schmuck wrote that I have to clean up.”
But in person, these are all VERY nice people. I meet them all the time at SQL Saturdays in particular.
The software test teams I worked on in my first year at Microsoft were just like that. So are DBA’s really just software test engineers who have moved a bit toward the “create” side of the spectrum? Are test engineers really just DBA’s who haven’t realized it yet?
Building, growing, and maintaining a good quality assurance organization is a huge ongoing challenge for most software teams, and I think that is mostly due to the psychological challenges involved. There are simply not enough Lawrence and James types running around. Understanding and embracing that may be critical to success.
A quick update, as tonight I managed to sit down and experiment a bit with the V2 Beta for the first time in many days.
(OK, actually, I am standing up… because I have a fantastic new treadmill desk and have walked nearly 8 miles today while working, but that’s a topic for another post).
I was working with one of our HostedPowerPivot clients today who was observing slower-than-expected slicer click performance in one of their reports. And after some sleuthing, I found that they had a “distinct count” measure in their model, something like this:
[Measure] = COUNTROWS(DISTINCT(Table[Column]))
Now, that’s a pretty useful measure in many situations, like “does product X sell every day?” Simply taking a count of the date column doesn’t cut it, because if it sells twice on one day and not at all the next day, the count is still 2, and you want it to be 1.
Trouble is, in PowerPivot v1, that operation can be slow. I won’t bore you with all the details, because frankly, I don’t know all of them, heh heh. But I do know that PowerPivot v1 does end up creating all kinds of temporary new tables in memory behind the scenes when it is evaluating the measure.
Anyway, I’d heard a long time ago that PowerPivot V2 was going to make distinct counts a lot faster, so I thought I’d try it out.
First I needed to create a test case that was challenging. I don’t want to compare “fast query vs. fast query” because that often gets skewed by fixed overhead. Much better to test a “worst case scenario.”
So, I took a 15 million row table, and added a calculated column to it. Calculated columns in PowerPivot are much less compressed than imported columns, and are therefore more challenging for the measure engine to scan through, looking for duplicates etc. which is required for a distinct count.
And then, to make matters worse, I just used the RAND() function so that there are many unique values:
A distinct count measure in PowerPivot v1 took 35 seconds to complete in my test pivot. This was a big part of why our client was seeing slow perf today.
That same measure, in that same pivot, took less than a second in the V2 beta.
Wow. This is gonna be nice.
PowerPivot v1 did not have a dedicated function for distinct count, you had to do the countrows of distinct thing like what I showed above.
But in PowerPivot v2, there IS a DISTINCTCOUNT() function. So you can write your distinct count measures in two ways:
[Measure] = COUNTROWS(DISTINCT(Table[Column]))
[Measure] = DISTINCTCOUNT(Table[Column])
That new function is nice, makes for a more readable formula. BUT… the performance is still the same. Either way I wrote it, the measure was sub-second fast. Very nice.
In the intro to David’s guest post below I mentioned some new applications we are building. Check out a PowerPivot report where, in the browser, you can type in a SKU number and get a full dashboard for just that product!
Click for Larger Version
Much respect to our resident Data Junkie Monkey, aka DJ Monkey, for pulling this together.
Well folks I haven’t had much time to play with the V2 Beta (aka CTP3) yet. A lot going on over the past week:
So in the meantime, David Hager has a short article to share on the new TOPN function. I haven’t had a chance to try it yet, and I’m a little surprised that it doesn’t have special handling for ties (as he’s discovered).
Using the New PowerPivot Version 2 TOPN Function: Up, Down and Between
One of the new DAX functions included in the Denali PowerPivot CTP3 release is the TOPN function. This function returns a table which can be aggregated for the top n values. Prior to the release of this function, workarounds were available to achieve the same results, mainly through DAX ranking formulas (and Denali PowerPivot now has native functions for ranking too!). So, a typical DAX formula which uses the TOPN function is:
=SUMX(TOPN(N,SalesTable,SalesTable[SalesTotal]),SalesTable[SalesTotal])
where SalesTable is the table and SalesTotal is the column in that table containing the top n results to be analyzed.
Well, that was easy enough. However, there is a systemic problem here and it’s mainly data-related. In order to understand why the TOPN function don’t work the way you might expect, read the following that comes from the Denali PowerPivot CTP3 help documentation on TOPN.
“If there is a tie, in order_by values, at the Nth row of the table, then all tied rows are returned. Then, when there are ties at the Nth row the function might return more than n rows.”
In English, that means if there are duplicate values in your data, the TOPN function will return the WRONG answer if the Nth value is adjacent to duplicates. In order to get a meaningful result then, the data must be converted so that each data point has a unique value. The question is, how do you do that and retain the integrity of the data? An Excel trick can be used here as a workaround. This method requires that each value be incremented by a very small but unique value. The best implementation of this comes in the form of a calculated column with the following formula
SalesTable[UniqueIncremValue] =
IF(RANDBETWEEN(-1,1)=-1,-RAND()*2,RAND())*0.000001
The goal of this formula is to create a column of unique numbers add up basically to zero. There is a balance that is required in this formula between affecting data and insuring that all values in the dataset are unique. The RAND function returns a 15-place decimal, which is the limit in Excel. However, if all of the decimal places were used, the value of the number could be as high as 1 or as low as -2. By multiplying the random numbers by a very small number, the numbers will not interfere with the individual values if they themselves are not small numbers. However, if the multiplier is made too small, not all of the random numbers may be unique depending on the size of the dataset. So, if you are going to use this workaround, you need to test [UniqueIncremValue] to see if all of the values are unique. The best way to do this is to compare its distinct count with the distinct count of a primary key column in the table. If they are equal, TOPN will work for any value of N.
Now, the original formula will work fine if the new calculated column is used in place of SalesTable[SalesTotal].
SalesTable[SalesTotalMod]=
SalesTable[SalesTotal]+ SalesTable[UniqueIncremValue]
It should also be noted that the TOPN function can used in a DAX like the one shown below to afford a BOTTONM result. Note that this formula takes advantage of the new DISTICTCOUNT function.
=SUMX(SalesTable,[SalesTotalMod])
-SUMX(
TOPN(
DISTINCTCOUNT(SalesTable[SalesTotalMod])- N,
SalesTable,
SalesTable[SalesTotalMod]
),
SalesTable[SalesTotalMod]
)
I think you can see that the TOPN function is a valuable new item for your DAX toolbox. Have fun!