Less Columns, More Rows = More Speed!

August 1, 2011

 

“Stay close!  Come together!  Staggered columns!  Staggered columns!

I mean…  less columns!  More rows!”

-General Maximus Speedicus Queryus

 
 

A Long-Held Belief, Quantified

For a long time now I have been advocating that everyone minimize the number of columns in their large tables.  For certain, you want to not import columns unless they are going to be used.

But I also recommend that if you have a lot of numerical columns in a table, it’s often better to replace them all with a SINGLE numerical column, and add a second column for “type.”  Even though that results in many more rows, the theory is that PowerPivot’s VertiPaq engine would  return quicker pivot results against the “tall, narrow” table than it would against the “short, wide” table.

I’ve simply “known” that to be true, in my bones, for a long time.  I was so certain of it, in fact, that I have never bothered to test a “before and after” case until now.

We had an opportunity this weekend to quantify the “before” and “after” impact of making such a change, so I thought this time I’d record the results and share them.

The Original Data

PowerPivot Wide Table Lots of Columns

With that data shape, the [Sales] measure needs to be the sum of a specific column:

PowerPivot Measure Wide Table

And then, a very large pivot reporting against this data ran in 5.7 seconds.

OK, so there’s our baseline.

The Taller, Narrower Table

PowerPivot Tall Table Less Columns

Whereas the original table used a single row to store all 9 numerical values, this new table stores each numerical value in its own row, and tags each row using the Value Type column to identify which numerical value it is.

(Note that in this new shape, zero values can be completely omitted if you’d like.  And we did, which is why there are not 9 * 700k rows in the resulting narrow table – we just dropped rows that were going to be zeroes.)

I then added a very small table that “decodes” the ValueTypes into something I can understand, and related it to the tall table:

PowerPivot Tall Table Lookup Table

Lookup Table – Helps Write More Readable Measures

Each row in this lookup table corresponds to one of the original numerical columns in the Wide table, of course.

I then wrote a simple measure for [Amount]:

PowerPivot Amount Measure Tall Table

Base Measure – Just a Raw Sum of the Value Column,
Serves as a Basis for Other Measures

And then the Sales measure can simply be a filtered version of Amount:

PowerPivot Measure Tall Table

[$ Sales] = The [Amount] Measure Filtered to “Ttl_Dollars” Type Rows

The Results

I have to admit, I was a bit nervous, because I had colleagues watching me, they knew I was making a precise comparison and was going to share the results.  Would this turn out as expected?  Would it be faster?  I experienced a moment of doubt as I was about to click the mouse and time the query.  I mean, we were looking at nearly 7x as many rows in the new set.  I was starting to sandbag my own expectations, telling myself it would likely be the same.  I should not have doubted.

It ran in 2.4 seconds, more than twice as fast as the original 5.7 seconds.

Boom.  It was worth it, big time.  Even more impressive since there is definitely some fixed-cost overhead involved in Excel sending the query, populating the grid with thousands of rows of results, etc.  There may even be a full second of overhead involved, in which case raw query time went from 4.7 to 1.4 seconds.

Why Faster?  Will This ALWAYS Work?

Remember that VertiPaq is a COLUMN storage engine, and it compresses the data VERY efficiently.  (I’ve covered this topic briefly here as well).

With that in mind, here are two quick answers.  The long answers would be too long for this post Smile

1) By removing a lot of the “noise” presented by the 9 numerical columns, this change enabled VertiPaq to do a much better job compressing the columns that remained.  And that compression translates into better query speed as well.

2) No, I do NOT expect that this trick will always speed things up.  It all comes down to “cardinality,” or how many unique values per column.  And I’m pretty sure that the three columns dedicated to “Units” in the original data set had lower cardinality than the SKU column for instance.  If we hadn’t had those Units columns, I suspect this change would NOT have been faster.

As a side note, the file size of tall and narrow was NOT smaller than short and wide.  I’ve mentioned many times that file size is a ROUGH guide to performance but not 100%.  Maybe we’ll discuss this in the comments section, because this post has run too long already.

No, I don’t think this will always work.


Two tips from “the Old Country”

June 10, 2011

   

“Don Corleone, we have some information for you.”

For awhile now I’ve been meaning to share a few tips from who I call “the Italians” – Alberto Ferrari and Marco Russo.  I reviewed their book awhile back, if you recall.  Over the past several months they have discovered a few things that I think you will appreciate.

 

 

Tip 1 – <CTRL>-Scroll Wheel to zoom the measure window

OK, we all know that the formula measure editor window uses a small font:

PowerPivot Measure Editing Font is Small

PowerPivot Measure Editing Font is Small

Alberto discovered that if you hold down the <CTRL> key on your keyboard, and scroll the mouse wheel, you can increase the font size!

Bigger Font in the PowerPivot Measure Editor

Bigger Font in the PowerPivot Measure Editor

Neat huh?  Funny thing is, this does not work on all of my computers.  I use different mice, different versions of Windows, etc. – I scramble all of the variables.  So it’s hard to know why it doesn’t work everywhere.  (In fact, the two screenshots above were taken on different machines).

While we’re on the topic, you may also want to consider writing your measures *outside* of this dialog, so take a look at this post on using Notepad++ written by Colin Banfield.

Tip 2– Sort data before it’s imported to improve perf

Another Alberto discovery:  Believe it or not, when PowerPivot imports data, the sort order of the incoming data actually has an impact.

The original post is here.  Bottom line:  you can shave 25% or more off your workbook size if the data coming in is sorted ahead of time, and Alberto’s results suggested that it doesn’t matter much which column you sorted by.

NOTE:  We are NOT talking about sorting data AFTER it’s been imported.  We mean sorting it during the import process.  Sorting after import has ZERO effect.

Tip 2a – Smaller workbooks are faster workbooks

He also noted that pivottables built on those smaller workbooks are faster than on the workbooks created from unsorted data.

That 100% matches my expectations, because PowerPivot’s compression is not just an on-disk compression – that compression is also maintained when the data is loaded into memory, AND is used to speed up queries!

So I have been telling people for a long time:  smaller workbooks are generally faster workbooks.  You can use workbook size as a rough indicator of how fast it will be.

Tip 2b – The column you choose DOES matter, if…

At Pivotstream we recently had a SQL table that just refused to import.  PowerPivot basically just locked up every time we tried to import it.

The table was sorted by a column with millions of distinct values (the dollars column).  When we sorted it by basically ANY other column, it imported just fine.

And this was only a 1.5M row table.  It seems that certain data sets can just chew up massive amounts of RAM during import if they are not sorted properly.

This discovery was what triggered me to go back and share all of the tips above, so it was a fortunate find.


Buying a PowerPivot Server? Talk to Me First!

March 2, 2011

PowerPivot Servers - A Joint Collaboration Between Pivotstream and um... a leading technology partner :) PowerPivot *IS* Fast… but takes some tuning

Most of us have seen the demos of 100 million rows in PowerPivot, and slicer clicks taking well under a second against those sources.  Super slick.

And whenever I first load a large data set like that and start working with it, I do indeed experience sub-second response times.  But once I go and build sophisticated measures, and put, say, 5-6 slicers on the report, things start slowing down, sometimes by a lot.

When you think about everything PowerPivot is DOING in those cases, it’s staggering.  But the report consumers don’t care – they want speed, not explanations.

So at Pivotstream, we spend as much effort tuning our models and reports for speed as we do creating them in the first place.  We’ve gotten much better at that over time, so it’s almost second nature to us now.

Multiple Times Faster.  AND More Cost Effective.

The quest for speed didn’t end with our modeling and reporting techniques though.  Remember, we’ve been at this for over a year now, and eventually we started to brainstorm and experiment on hardware platforms.  The results have been surprising – in a very good way.

Being a relatively small company, we needed help in this quest, so we approached a leading technology provider for assistance.  We weren’t sure what sort of response to expect – would they understand / take it seriously / appreciate the opportunity?  And even if so…  would our collaborative investigation yield results?

Yes, yes, yes, and BIG YES.

As a technology professional, this has been one of the most stimulating experiences of my entire career.  Real science fiction, cutting edge experimentation…  which is why for now, I’m going to refer to our partner as Cyberdyne.  (This is all so fresh that there are still some legal details to tidy up before I can disclose their true identity).

Anyway, bottom line:  what we’ve developed is both multiple times faster than off-the-shelf server hardware…  AND at the same time, almost certainly more cost effective.

Take the PowerPivot Pepsi ChallengeOur Version of the Pepsi Challenge :)

I’m quite confident that our Black Box (as we are calling it today) will blow the doors off of literally ANYTHING you can get your hands on through any other channel.

So, before you sink big money into PowerPivot server hardware, talk to us.  There’s a good chance that we can save you a lot of time and money.

If you are somewhere in that process today, from early evaluation to the final stages, drop us a note – info@pivotstream.com


PowerPivot Scheduled Refresh Pt 3: Thin Workbooks

February 1, 2011

 
PowerPivot Thin and Core Workbooks

What is a “Thin” Workbook?

At Pivotstream, we came up with the term “thin workbook” to describe an Excel workbook that does not contain a PowerPivot model, but that DOES connect to another workbook that does, which we call a “core” workbook.  In other words, thin workbooks use core workbooks as their data source.  Others have called this “hub and spoke,” and I may even like that phrasing better, but hey, we’ve been using “thin and core” at Pivotstream for nearly a year and it’s hard to change.

So, a thin workbook contains pivots, charts, cube formulas, etc. – it’s a report-only workbook.  And whereas our core workbooks may be 200 MB or more, our thins are often 5 MB or less.

Why create thin workbooks?

Here are a few reasons why you may want to use a thin workbook approach.

1) To avoid duplicating data.  If you want to create multiple workbooks that offer different views of the same data, which we do all the time, you probably don’t want to duplicate the PowerPivot data across all of those workbooks.  That duplication needlessly burns RAM on the server, takes extra time to run scheduled refreshes, and also increases your maintenance work (ex:  if you want to modify a measure, you have to do it multiple places instead of one).  I *highly* recommend you never duplicate PowerPivot models like that, and thin workbooks are a good alternative.

2) To lock down data.  In many ways this is another flavor of #1.  At Pivotstream we often create one workbook that contains a full set of measures, and then another that omits certain measures that are particularly sensitive.  Both can use the same core workbook as a data source, but if you properly lock down your permissions on SharePoint, you can give one set of users a different level of data than another set.

3) To avoid the refresh on open performance problem.  In the last post about refresh, I pointed out that the PowerPivot refresh service on SharePoint kinda “cheats” a little bit – it refreshes the data in the PowerPivot model, but the data in the pivots (the data in the Excel sheets themselves) is untouched.  To work around that, PowerPivot for SharePoint sets the “refresh on open” flag, forcing Excel to refresh all pivots whenever the workbook is opened.

That can be slow.  In many cases in our work at Pivotstream, it is VERY slow.  Since the PowerPivot refresh service does NOT touch the thin workbooks, “refresh on open” never gets set.

Of course, that creates another problem doesn’t it?  The data in the Excel sheets then remains stale, with no automatic way to refresh it.  That’s no good.

But we are busy little beavers at Pivotstream and we don’t give up easy :)  We started working on this problem in March 2010.  More on this later.

How do you create thin workbooks?

The first step is to publish your “Core” workbook to SharePoint.  You CANNOT connect a thin to a core sitting on your desktop – the desktop version of the PowerPivot engine does not support cross-workbook connections like that.

Once you have your Core published, there are two primary ways to create a thin workbook:  Start from Scratch, or Hack like Mad.

Boring:  Start from Scratch

In this method, all you do is go to the PowerPivot Gallery and click the highlighted button:

Creating a Thin PowerPivot Workbook from the Gallery 
…and that gets you started in Excel, in a new workbook, with a blank pivot connected to the Core model.

More Fun:  Hack Like Mad!

For those of you who prefer to color outside the lines, here’s another technique that I often find helpful.  You’ve created a single Core workbook, and it has lots of report sheets in it.  You don’t want to start from scratch and throw all of those report sheets away, right?

So here’s what you do:

  1. Create a copy of the workbook.  Just copy/paste the file in Windows Explorer.  Name one file something like Core.xlsx and the other Thin.xlsx
  2. Delete pivots from the Core workbook.  Two reasons for this.  One, it reduces confusion – when you open the Core, you know it’s the core because you don’t see any report sheets.  But deleting pivots also makes the scheduled refresh process on SharePoint run faster – after PowerPivot refreshes the model on the server, it then needs to refresh the thumbnails in the gallery.  To do that, it asks Excel Services to open the workbook, which triggers “refresh on open.”  Why pay extra time and CPU penalties on reports no one is going to see?
  3. Publish the Core workbook to SharePoint.
  4. Change the connection string in the Thin workbook.  Open it up in Excel, and in Connection Properties change the “$Embedded$” string to be the full URL to the core workbook on SharePoint.  Example:

Changing the Connection String to Point to a Published PowerPivot Workbook

When you click OK, Excel performs a full refresh against the new source.  Be patient.  Once that is done, your thin no longer connects to the PowerPivot data embedded in its own file, it now queries the server copy instead. 

Of course, now you probably want to make the file smaller and get rid of all that unused PowerPivot embedded data.  Remember where that lives?  This post explains in more detail, but here’s the image of the thin workbook renamed from XLSX to ZIP:

Embedded Data - One Reason Why PowerPivot Will Not Work With Excel 2007

Item1.data is the file you want to get rid of.  But if you delete it, you will corrupt the file.  So I just create a 0-byte Item1.data file on my desktop and copy it over the top of the file inside the ZIP.

Rename back to XLSX and you have a true “Thin” workbook with all of the original report sheets intact and functional!

Summary, and a hint at the next post

Thin workbooks are a VERY useful tool in a PowerPivot system.  For Pivotstream, they are absolutely essential.

The only trick, as mentioned above, has been getting those thin workbooks to always be up to date.  I’m happy to say we have that problem solved.

I’m pretty sure others are going to need this sort of thing, too, as well as some other internal tools and components we have built.  If you have interest in beta testing our PowerPivot Accelerators, as we are calling them, watch for the next post :)


Got a high end CPU? Please help w/ an experiment!

January 23, 2011

If anyone out there has a really high end CPU and wants to participate in a VERY quick experiment, please drop me a note – rob at pivotstream (dot) com.

Will take you five minutes.  And as a bonus, I will explain to you why it is relevant to PowerPivot – advanced glimpse behind the scenes, weeks in advance of the eventual blog post :)

May even have a gift or two to line up for you as well.

To be clear, a high end processor is one that ranks near the top of these rankings:

http://www.cpubenchmark.net/high_end_cpus.html

I want to conduct this experiment in the next 24-48 hours.

UPDATE:  results so far are a bit surprising.  One of the top-end procs is one of the worst performers so far.  The current champ is a mid-range, affordable proc that was just released in the past two weeks, based on the new “Sandy Bridge” architecture from Intel.

UPDATE #2:  More results trickling in, updated the table below (click for larger version).  I am going to leave this open for another day or two, so please get in touch if you would like to participate.  You will need 64-bit Office 2010 and 64-bit PowerPivot.

UPDATE #3:  New results and information as of Tuesday morning, Jan 25.

image

(Click for larger version)

If you have a CPU not on that list, or one on that list and in a different RAM config (or overclocked), please leave me a comment or drop me an email – rob at pivotstream (dot) com.

Thanks!


PowerPivot Scheduled Refresh – Twists & Turns, Pt 1

January 3, 2011

 
Skipper Plans A PowerPivot Refresh Strategy  
“Rico, we’ll need special tactical equipment. We’re going to face extreme peril. The private probably won’t survive.”

OK, with a new year it’s time to get serious on the blog.  (Yeah, nothing says “serious” like cartoon penguins, but really, I *do* mean it).  This is gonna kick off a series of inter-related posts that I’ve had in my head for a long time now.  So buckle up, Rico.

 

 

Let’s say you’ve built a bunch of nifty PowerPivot workbooks.  You’ve published them to SharePoint.  And naturally, now you want to set things up so those workbooks automatically refresh, with the latest data, every night (or perhaps once a week).

As a recap, the simplest way to get started is to go to the PowerPivot Gallery view and click the highlighted button:

PowerPivot Gallery Schedule Refresh Button

Which leads to this page:

PowerPivot Gallery Schedule Refresh Page

Refresh ASAP vs. Refresh Once

Most of the options in that scheduling page are quite straightforward.  The only one that’s ever tripped me up is “Once.”  “Once” does NOT mean “Now.”  Instead, “Once” typically means refresh will occur tonight – either after business hours or after your specified earliest time.

If you want a refresh to occur now, you either need to specify a time (from today) that has already passed, OR don’t use “Once” at all.

Instead, you can schedule a refresh to occur Daily/Monthly/Weekly, and then check the “Also refresh as soon as possible” checkbox.

Personally, I think the lack of a “One time, right now” option is an oversight.  As a workaround, I often found myself scheduling something to run once every 12 months just so I can check the “ASAP” checkbox.

Another Way to Turn “Once” into ASAP

At Pivotstream, we have found another way to schedule ASAP refreshes:  we’ve changed the centralized definition of “business hours” to be so narrow that there are essentially no business hours at all.  (Right now we have biz hours set to be 7-7:15 AM – I haven’t tried setting Start Time = End Time, but a 15 minute window is narrow enough for our purposes).

The downside of essentially disabling Business Hours, of course, is that it impacts everyone.  That may not be a good option for your organization, as a refresh scheduled for “after business hours” will end up running sooner than expected.  It works well for us at Pivotstream however.

For details on how to modify Business Hours, see Vidas’s item in the FAQ:  http://powerpivotfaq.com/Lists/TGPPF/DispForm.aspx?ID=51

Behind the Scenes – the Basics

When a scheduled refresh does occur, a lot more happens than meets the eye.  This rough diagram illustrates the steps behind the scenes:

What Happens During PowerPivot Refresh - The Basics

(Click for Larger Version)

Notes on the diagram

I won’t repeat the descriptions on the diagram above, so spend a couple of minutes going over it ok? :)

With that done, I do want to call out a few points:

  1. The timer service only checks about once per minute – so if you schedule an ASAP refresh but nothing happens immediately, just be patient, it will start shortly.
  2. You can set different schedules per connection – I just discovered this recently, thanks to the watchful eyes of a training/consulting client of mine.  George and company, you rock :)
  3. The whole process will fail if the original file is checked out, or otherwise open for editing, during Step 6.  So it’s extra important that you not leave your published PowerPivot workbooks checked out.  I recommend editing locally on your desktop for this reason (and also because SharePoint is not terribly speedy for open/save of large files…  and you should be saving often.)

Refresh is Actually TWO Refreshes

OK, one last piece of background in this post.  When PowerPivot runs a refresh, it is refreshing the PowerPivot model – which corresponds to the contents of the PowerPivot window in Excel client.

In order for report consumers to see the latest and greatest data when they first navigate to it, however, the pivots in the Excel sheets themselves ALSO need to be refreshed.  So there are TWO refreshes that need to happen – we’ll call the first one “PowerPivot Refresh” and the second one “Excel Refresh.”

PowerPivot Refresh vs Excel Refresh
PowerPivot Refresh and Excel Refresh

OK, that’s enough for part one.  Next time I’ll explain how this two-stage refresh can complicate things in some situations.

Click Here for Part Two >>

In the meantime, if you are interested in further reading on scheduled refresh, you can check out the following articles:

http://msdn.microsoft.com/en-us/library/ff976569.aspx (Mariano’s excellent whitepaper)

http://powerpivotgeek.com/2010/09/08/a-peek-inside-getting-the-most-from-data-refresh/

http://blogs.msdn.com/b/powerpivot/archive/2009/11/20/powerpivot-data-refresh.aspx


PowerPivot Refresh: CPU/RAM Spike at Completion

December 6, 2010

 
I mentioned this in a recent post, but I figure a picture illustrates it better.  This is a screenshot of task manager as a PowerPivot refresh completes.  This one ran for about 30 minutes, so this represents just the tail end of the process:

PowerPivot Data Refresh CPU and RAM Spike 

Note the two highlighted points – one of the 4 CPU’s pegged at 100% for awhile – and while the other 3 did go “quiet” at the very end, they WERE active for part of the time that the fourth CPU was at 100.  So – CPU gets precious near the end of refreshes (after hovering around 50% for much of the process).

Even more notably, RAM usage spiked by nearly 2 GB!  The PowerPivot file in question was 1.45 GB on disk when it was complete.  And that workbook was still in RAM after refresh completed, so the 2 GB spike was pure overhead during the final compression process.

(Side effect:  Even if you have enough RAM to load a workbook, that does not mean you have enough RAM to refresh it.)

Allocate your server RAM and CPU around the refresh process, folks.


Hardware/Capacity Planning: From the Trenches

November 30, 2010

So How Many Servers Do I Need in My PowerPivot SharePoint Farm?

As people increasingly move up from just dabbling with the addin, and decide to start leveraging the publish/schedule/share/secure benefits of the PowerPivot for SharePoint infrastructure, I am getting this question more frequently.

How many servers?  How much RAM?  And less often but just as important…  how many CPU’s?

Time for an old joke:  “Ever hear the one about the statistician who drowned in a river that was, on average, only 3 inches deep?”

That’s the whole joke.  And as time goes on, I only find it funnier.  And funnier.  But first, some basics.

The Three Primary Server Roles

For the vast majority of PowerPivot deployments, you will mostly need to concern yourself with three server types:  Excel Services, PowerPivot Engine, and Data Source DB Servers:

PowerPivot Server Roles Summary For RAM and CPU Planning 
(Click for Larger Version)

A few notes on the diagram up front:

  1. I recommend viewing the larger version – there is information on the diagram (in the notes) that I will not repeat below.
  2. Don’t read too much into the 3/4/2 ratio of Excel Services/PowerPivot/SQL – I included different numbers of servers in each tier intentionally, to illustrate that you can scale each tier out at independently.  But, for instance, I do NOT expect you will need 3 Excel Services boxes for every PowerPivot box (you will likely need less).
  3. Although separated above, server roles CAN be combined onto single servers – for instance, many folks can get by with an “all in one” server where everything above lives on a single box.  Also, even in a multi-server farm, one of the MS-recommended configs is to combine Excel Services and PowerPivot onto a single box, and then deploy as many of those combined boxes as needed.
  4. There are some elements missing from the diagram – for instance, there is a Web Front End (WFE) role, and a SharePoint Configuration Database role, but in my experience, PowerPivot does not put unique strain on those elements (except for config db disk space).  Your mileage may vary of course.

Early Planning Efforts

Back in the Spring, I set down to the very serious task of “how much hardware will we need on our SharePoint servers in order to handle user loads?”  And naturally, I made a very serious spreadsheet to model it out:

PowerPivot SharePoint Hardware Planning Spreadsheet v1

Wow, what a spreadsheet.  I mean, I even used the =POISSON.DIST() & =BINOM.DIST() functions – clearly, such spreadsheet horsepower indicates accurate results!

Turns out, that spreadsheet was nothing more than an amusing theoretical exercise with little bearing on the real world.

There were multiple problems with that spreadsheet:

  1. It relied on outright guesses as to how often/how intensely consumers would utilize the published reports.
  2. I was not yet aware of the CPU-gobbling power of slicers, a power that makes real-world queries many times more CPU-intensive.
  3. Most importantly, it assumed peak usage would occur in the morning, when consumers are most in need of fresh information.

Solution for 1) and 2)

These are in some sense the simplest to address.  Quite simply, put up a pilot solution and observe the usage characteristics.  You will learn a lot about usage patterns as well as what that does to CPU and RAM.

But even better:  you may choose to simply ignore these factors for now and focus on problem 3 instead.  Here is why:

Peak Load is Probably Experienced During Scheduled Refresh

If you’ve looked at the larger version of the diagram above, you have seen a hint as to what I’m about to say here: 

At Pivotstream, our nightly refresh process puts far more strain on our servers than our users do, and is what we now plan our hardware around.

I will explain further, so you can evaluate whether your situation will be similar.  We receive new data on a nightly basis (typically late at night).  That data gets ingested into SQL server, and then our PowerPivot refresh process begins.

Once PowerPivot refresh begins, we have about 6 hours, tops, to get all models and reports refreshed, so that when business opens in the morning, everyone has access to fresh insights.  6 hours sounds like a lot…  until you discover that it isn’t.

A lot happens during scheduled refresh!

Keep in mind that the PowerPivot model (the embedded database living inside your PowerPivot workbooks) must first be refreshed – let’s call this phase “Model Refresh.”  This can put a lot of strain on your SQL servers just in terms of supplying the data, since PowerPivot v1 pulls a fresh copy of all tables (no incremental refresh).

As the data flows into the PowerPivot servers, a LOT of CPU power goes into compressing that data into the storage format.  During the refresh process, RAM usage steadily climbs as well.

As model refresh nears completion, RAM usage spikes upward quite a bit, sometimes by as much as 50-100% the size of the resulting workbook.  CPU usage also spikes during this phase.

Once the model is done refreshing, PowerPivot for SharePoint then triggers a refresh (via Excel Services) of the Pivots and Cube Formulas in the workbook, so that when thumbnail screenshots are “snapshotted” for the report gallery, those reflect the latest data (let’s call this phase “Pivot Refresh”). 

Yes, that process initiates via Excel Services, but that refresh immediately results in a meaty set of queries sent back to the PowerPivot servers for processing.  So, CPU usage spikes again.  And the model in question is held in RAM while that happens, so that RAM can’t be recycled into the pool to be used for model refresh.

Remember, this all happens for a single workbook.  And if you hang a bunch of report-only workbooks off of a single “hub” model workbook (as we do all the time), the amount of time it takes to finish Pivot Refresh can actually exceed the time for Model Refresh.

If you have to cram all of this into a narrow nightly window, chances are that you will need more hardware to pull it off than you ever will need for normal daily usage!

Rough Guidelines for Your Own Situation

This post is running a bit long, so I will try to be succinct in closing:

  1. If a significant percentage of your PowerPivot models and reports will be refreshed nightly, refresh is likely going to be a larger peak strain than normal interactive usage
  2. Even if you only refresh weekly (or less often), if the execution window between “data is ready” and “reports must be ready” is basically still a single night, then refresh is still likely going to drive your peak hardware need.
  3. If you think “Peak Load = Refresh Process” is likely the case for your organization, I suggest ignoring interactive usage projections during your hardware planning process, and instead developing some prototype workbooks (with rich data sets and mutliple report sheets with 100% realistic slicer sets, measure complexity, etc.), and then putting those workbooks through the refresh process to get a baseline.

Contest: Many Charts, Shared Slicers, Fast Queries

September 20, 2010

 
Excel Pros Searching for a Worthy Challenge

“Shooting space garbage is no test of an Excel Warrior’s mettle!

I think it is appropriate to start this post with the following disclaimer: 

You do NOT have to be able to win this contest in order to be a monster at PowerPivot.

Think of this as brain candy.  Something that opens your eyes to some of the high-end power available when you merge PowerPivot calcs with native Excel calcs.  A 300-level technique.

***CONTEST UPDATE***  Check here for 3 small rules clarifications.

The Goal:  One Slicer Sheet Impacting Many Report Sheets

This is a pretty common desire.  If the report consumer wants to see ALL of the data broken out by the same set of filters, why make them repeat those filter selections on every single sheet of a workbook?

Isn’t it better, in those cases, to give them a single set of “master” slicers on a single sheet like this?

Slicer Home Page 
Master Slicers Sheet
(Sensitive Data Redacted as Always)

There are 14 slicers on that sheet.  The user can select the department, class, etc. that they care about up front.

Once they have made their selections, in this particular workbook they then have dashboard sheets that look like this:

Dashboard Sheet
Multi-Chart Dashboard Sheet Fed from PowerPivot 

We have several sheets like that in this one workbook, with 20-30 charts on each.  All charts react to the slicers on the Master Slicers sheet.

But you’ll also notice that 4 slicers appear on this sheet, too.  Those are duplicates of 4 of the slicers on the master sheet.  They are the same exact fields as the corresponding 4 on the master sheet, so they are always in sync – set them on one sheet and all other sheets reflect the same selections.

That is merely a convenience – we figured that most slicer selections would be made once, but that users would appreciate being able to bounce around the calendar a bit with the time slicers, without having to switch sheets.

There are also about 20 sheets in the workbook that look like this:

Full Chart Sheet
1 of 20 Single-Chart Sheets Fed from PowerPivot

Again, with the four time slicers repeated for convenience.

The Snag:  Slicer Cross-Filtering Performance

80 unique charts and 14 unique slicers.  What could go wrong?

Well, as I documented in the post on slicer cross-filtering, this is roughly the equivalent of having 80 * 14 = 1,120 pivots all updating at once in response to a single click.  (Maybe it’s not quite that bad, but it DOES get VERY complicated).

And that is very slow, even if the performance of any single pivot is super-fast.

But our report that we built at Pivotstream is fast.  It does NOT bog down on slicer cross filtering performance.  So, how did we do it?

The Contest:  Find Efficient Techniques!

After we built this workbook at Pivotstream I realized that our technique could be refined quite a bit, made simpler.  And while we are already doing that, I realized:

THIS IS AN EXCELLENT CONTEST FOR EXCEL PROS!

I still have two MSDN Subscriptions to hand out.  These basically are unlimited, free MS software for your own personal use, so they are a pretty damn good prize.

The top two submissions will win those.  Runners-up will win PowerPivot architecture posters from Denny Lee, assuming he has not run out :)

Rules Schmules!
 

  1. Use lots of charts, and have them all respond to a shared set of slicers
  2. Entries must include your PowerPivot workbook, plus an explanation of what you did
  3. Use any data source you want, but please use non-sensitive data and enough rows that we can spot performance differences
  4. Use at least 6 slicers, with cross-filtering enabled
  5. You do NOT have to use PivotCharts, but you can
  6. Macros can be used to design the report if you find that helpful, but should NOT be part of the update process when I click a slicer.  (Authoring time is ok.  Run time is not.)
  7. Entries will be judged on, in roughly descending order:
    1. Performance – I will compare query speed with slicer cross-filtering turned on vs. turned off.  Smaller differences are better for this contest.
    2. Polish – as I slice the report, do the charts all still look nice?  Ex:  Blank space in charts = not good.
    3. Ease – how hard is it to execute your technique?  We will turn winning entries into blog posts, and this will become a new reference technique here at PowerPivotPro.
    4. Originality – this never hurts but is not crucial.  Fun is good.  Unexpected benefits and features are even better.

Have fun :)


Slicers and pivot update performance

July 2, 2010

-Posted by Rob Collie

 
CHI0000402_P

OK, I was sitting at the 2nd-day keynote down in New Orleans last month, watching Amir Netz do a demo on some PowerPivot goodies we can expect in the next release.

He was showing off a 2 Billion row model, and its amazing performance, and while that was eye-opening, something he said in passing caught my attention in a big way.

 

In an effort to explain that what we were seeing (split-second pivot performance against said 2 Billion rows), he said the following:

“Now look at this report!  It has 4 pivotcharts and 6 slicers, each of which issues 2 queries whenever I click a slicer, meaning this report is actually querying the 2 Billion rows twenty times!”

Each slicer issues 2 queries against the data source????  It was revelation time.  I kinda missed the next couple minutes of what Amir was saying, as I digested the implications for our work at Pivotstream.

And, I was kicking myself for forgetting this, because I had once known most of this in Redmond: 

Slicers can, if used improperly in a report, end up slowing a report down by a factor of 5, 10, or more.

Tip #0:  Don’t overthink this!

Hey, if your report is fast, don’t obsess.  Move on to something else.  It is not worth monkeying around with your set of slicers to trim half a second.  Remember, slicers are the difference between report consumers loving your work and dismissing it as just more nerdy junk.

When a client recently told us “Reports are dead, now we have Pivotstream!” there was no way we would have received that reaction without slicers.

But every now and then you will find yourself with a report that doesn’t operate quickly enough for your purposes, and slow response times can drain the value and utilization out of a report quickly (well, slowly I guess).

In those cases, one of the first places you should look is your usage of slicers.

Why do slicers issues 2 queries each?

Well, I won’t go into great detail here, because  1) I don’t know every detail   and 2) Vidas covered the tech details quite well here.

What I will give you, though, is an intuitive sense of what a slicer has to do.

First of all, it has to populate itself.  I don’t know how often it checks to see if something has changed in the underlying model – I’ll have to run some tests.  But it’s something to think about:

Tip #1:  Are you using a field from your measure table as a slicer?

Say you have a 2 million row table, and one of the columns is Date.  When you park that Date column on a slicer, you are demanding that the slicer populate itself from all of the distinct values of the Date column.  That doesn’t happen for free – the PowerPivot engine is gonna have to work pretty hard just to generate that distinct list, and it might have to do that every time you interact with the report (click a slicer, change a page filter, etc.)

So, this is yet another reason to consider using a separate, smaller table that just contains all unique dates, and then relating that back to your large table.  Then you can use the field from the smaller table in your slicer, and PowerPivot won’t have to burn so much time doing something unnecessary.

Cross-filtering:  the incredibly useful but sometimes crippling feature

You know that cool feature of slicers where, as you make selections, other slicers update to show which tiles are valid (clickable) in those conditions, and which tiles are not?

For instance, consider this example:

PowerPivot Slicers and Cross Filtering

OK, in week 7 (of all NFL seasons combined in this data set), there were 8 receiving touchdowns scored by players weighing 179 pounds or less.

Now check out the FullName slicer – I have not clicked anything in it myself, but all player names have been disabled except four.  That’s because those four players are the only ones under 179 pounds who caught TD’s in week 7.  Cool!

Of course, the identification of those four players doesn’t come for free.  To drive that point home, I’m just gonna add a second measure to the pivot:

PowerPivot Slicers and Cross Filtering Depends on Measures

Hey look!  Now there are eight players enabled.  That’s because there were four players who weighed less than 179 pounds and recorded receiving yards in Week 7.

Stated more generally, any player who has data for ANY of the measures in the pivot, in the conditions dictated by the other slicers, will be enabled.

What does this mean to us?  It means quite a bit, actually, once you understand how it works.

Under the hood:  what Excel does w/ slicers during a pivot update

I’m sure I will miss a detail or two but that’s not the point.  The point is to give you an intuitive understanding.

  1. Excel takes all of your slicer selections, plus the layout of your pivot (rows, columns, measures), bundles all of that up into a query, and sends it off to the PowerPivot engine
  2. The data comes back and Excel can populate the pivot with numbers at this point
  3. But now Excel still needs to determine, for each slicer, which tiles to enable.
  4. So for every single slicer, Excel does the following:
    1. Takes all of the selections from every other slicer, bundles those up into a query with the measures from the pivot…
    2. …And then adds the field of the current slicer to the query, unfiltered, as if it were on the Row Labels axis of a pivot
    3. Sends that whole query off to PowerPivot, waits for a reply
    4. When the data comes back, only the values of that slicer field for which at least one measure has data, will be in the result
    5. Excel ignores the measure values returned and uses the list of returned slicer field values to enable/disable tiles
    6. Excel then moves onto the next slicer and repeats steps 1-5

That’s a lot huh?

Turning that knowledge into action

There are many tips we can derive from that understanding.

Tip #2:  Reduce the number of slicers you use on a pivot

From the above, you can tell that the query being sent for a given slicer is absolutely on par with the query that is used to populate the body of the pivot with data.  In fact in some cases, that slicer query can be MORE time-consuming that the pivot itself!

So, if you remove a slicer from a report that your consumers rarely use, you can take a HUGE chunk off of the time it takes for the report to respond to user interaction.

Furthermore, every time you add a slicer to a report, you also make the queries issued for the other slicers more complex, so removing a slicer might make the other slicers’ queries speed up as well.

Tip #3:  Watch out for slicers with lots of tiles

I have not confirmed this but feel confident enough to share it anyway:  I’m pretty sure that slicers with long lists of tiles are more expensive to update than those with a small number of tiles.  Customer Email Address, for instance, is probably much more impactful on performance than Gender.

Tip #4:  Reduce number and complexity of measures

Just something to think about.  A measure that uses the FILTER function, for instance, is going to be a lot more time-consuming to crunch than a straight SUM.  And you might already be 100% cognizant of that.

The point here is that when you add one such complex measure to a pivot, you are not running that measure once per report interaction.  You are running it once for the pivot AND once for each slicer, so it adds up faster than you think.

Tip #5:  Use a report filter instead of a slicer

Traditional report filters don’t have this cross-filtering behavior, so they don’t impact performance in the ways outlined above.  You can use report filters without fear of slowing down your report.

But report filters are ugly, clumsy, and they scream Windows 3.1 – if report filters were so great, we never would have built slicers in the first place.

So I have a better suggestion…

Tip #6:  Disable cross-filtering for slicers that don’t need it

Here ya go.  Right click a slicer and click Slicer Settings.

In the dialog, you can uncheck the highlighted checkbox:

Slicer Settings Disable Cross Filtering Items with no Data

…and now you won’t get cross-filtering anymore.  All players will be enabled for example:

PowerPivot Slicers and Cross Filtering Disabled

Well, for a list like Players, that may not be the best idea.  There are thousands of them, and cross-filtering is REALLY useful for helping me narrow down the list and find the players I want.

Then again, having a field with thousands of values in it as a slicer runs contrary to Tip #3, so maybe it’s not such a good idea to have such a slicer in the first place.  Can you live without it?  It’s worth asking yourself.

Tip #7 Consider using other, shorter fields for slicers

Sometimes I like to use the same field in a slicer and in the pivot itself:

Same Field on Slicer and on Row Labels PowerPivot

This is just a lot more convenient for limiting the set of players (sometimes) than using the Row Labels filters.

But again, this can incur performance cost, especially for long lists, which is precisely where you may be tempted to “double park” a field like above.

The alternative that’s worth thinking about:  maybe a field that’s just the first letter of the player’s last name would be sufficient.  Or the team they play for.  Or whatever.  The idea is to reduce the number of tiles, per tip #3, and sometimes you can get what you want by using a different field.


Follow

Get every new post delivered to your Inbox.

Join 164 other followers