In the surgical intensive care unit (SICU), one issue is nutrition: in particular, the amount of nutrition each patient receives relative to the amount that the caregivers desire for that patient. This "amount of nutrition" is typically expressed as a "percent of goal". Ideally, the physicians would like all the patients at some minimal amount of nutrition (I guess one might want all of them at 100% of goal), but other factors, such as swelling or infection or whatnot, can hamper this effort.

I was presented with a collection of data that gave, for each of the first 14 days since the start of a stay in the SICU, the "percent of goal" for each patient. Several dozen patients have data in this collection. Some have data on just one day; some have data on almost all the days; there is not a consistent number of patients on any given day. So be it. I'm not going to enter into issues of whether this is a good or bad measure, or what to do about the fact that some patients come and go, or how to deal with the fact that most of these readings are serial measures within a patient; the goal here is to talk about the presentation of data and the impact of summary measures.

This first plot shows the median percent of goal for each of the first 14 days. One can see rather variable median percents of goal over the first 4 or 5 days, followed by a general, but not monotonic, rise in the percent of goal over the last week. On day 14, the median percent of goal is 90%, perhaps an indication that all is well. After all, the median percent of goal is nearly 100%, so we must be doing well, right?
The second plot adds to the medians the 10th and 90th percentile points. Now we see evidence of the distribution of the percents of goal. None of us would admit now that we didn't realize that there just had to be spread in the data; we probably just didn't realize how big it could be. But recall that the median is the point where 50% of the data lie on each side; as such, having, even at day 14, a median of 90% implies that about half of the patients who have data that day have a percent of goal at or below 90%. How much less than 90%? Well, the 10th percentile is about 20% of goal. Yikes. Even though the median is 90%, 1 in 10 patients is getting less than 20% of what he or she should receive on day 14. And look at days 1 and 2: really high, then really low. That seems like odd behavior. Perhaps more digging is needed.
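A minimal sketch of how a comfortable-looking median can sit on top of a very uncomfortable lower tail. The values in `day14` are invented for illustration (they are not the SICU data), the function name `summarize` is hypothetical, and the percentile rule used here (Python's "inclusive" interpolation) is only one of several common conventions.

```python
import statistics


def summarize(values):
    """Return the sample size, 10th percentile, median, and 90th percentile."""
    # quantiles(..., n=10) returns the 9 cut points between deciles:
    # index 0 is the 10th percentile and index 8 is the 90th.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "n": len(values),
        "p10": deciles[0],
        "median": statistics.median(values),
        "p90": deciles[8],
    }


# Hypothetical percent-of-goal readings for a single day (not real data).
day14 = [10, 20, 35, 60, 85, 90, 95, 100, 100, 100, 100]

summary = summarize(day14)
print(summary)
```

With these made-up numbers the median is 90 and the 90th percentile is 100, yet the 10th percentile is only 20. The median alone would never have told us that.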
The third plot adds all the actual data points to the summaries that have just been discussed. The points have been 'jittered' slightly so that identical values don't obscure each other. Look! Day 1 has only 2 data points. Perhaps the median and the 10th and 90th percentiles (however they may be estimated from a sample of size 2) aren't very good summary measures. In fact, on day 1 we have 3 summary measures to show only 2 data points. Goodness. That's just silly. Look! Day 2 has a whole bundle of data from 10% to 40% of goal. Look! The minimum on day 14 is a lowly 10% while more than 50% of the points are above 90% of goal! Look! On day 10, two patients actually have values above 100%! Look! There is a growing proportion of patients who have 100% of goal achieved as the study goes on. Look! Look! Look at the data atoms!
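Here is a rough sketch of the two ideas in that plot: jittering the points so that ties stay visible, and counting how many points each day really has before trusting any summary. I am assuming the jitter is applied horizontally, to the day; the jitter width, the function names, and the `records` list are all hypothetical and are not taken from the original plots.

```python
import random
from collections import Counter


def jitter_days(records, width=0.15, seed=0):
    """Nudge each point's day left or right by a small random amount.

    The percent-of-goal value itself is left untouched, so the plotted
    heights are still the real data.
    """
    rng = random.Random(seed)  # fixed seed so the picture is reproducible
    return [(day + rng.uniform(-width, width), pct) for day, pct in records]


def counts_by_day(records):
    """Count how many data points each day has."""
    return Counter(day for day, _ in records)


# Hypothetical (day, percent of goal) pairs, not real data.
records = [
    (1, 95), (1, 80),
    (2, 10), (2, 10), (2, 25), (2, 40),
    (14, 10), (14, 100), (14, 100), (14, 100),
]

points = jitter_days(records)
sizes = counts_by_day(records)
print(sizes)
```

The jittered `points` can be handed to whatever plotting tool is at hand. The counts in `sizes` are the warning label: a day with 2 points does not deserve 3 summary measures.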
Analysis, from the Greek, implies breaking things down into component parts so as to understand the whole. Its opposite is synthesis, bringing together the parts to construct the whole. If we are going to do data analysis, then we must make attempts to break the data down to their component parts, their atoms. Computing summary measures like means and medians and percentiles and standard deviations and even F and chi-square and t statistics and P values is not analysis; it is synthesis! And, worse than playing games with word meaning, data synthesis often obscures understanding the data.

Why, then, do we ever compute summary measures? Well, at the heart of exploring data is the concept of a distribution, a collection of things that are somehow the same, but yet are different. In our current example, we have, on each day, a distribution of percents of goal. The theory of statistics has a concept called sufficient statistics. The general idea is that if you know that the data come from a distribution with a known form, then there are certain summaries of the data that tell you all you can possibly know about the distribution. In many of the nice theoretical distributions, the sum (and thus, by way of simply scaling by the sample size, the mean) of the data values is a sufficient statistic. And medians are close to means when you have well-behaved data, so people use medians too.

But often the data are not well-behaved, or they don't come from pretty distributions. So, then what? Can we find sufficient statistics? The answer is yes; there is always a set of sufficient statistics. And that set of sufficient statistics (formally called the order statistics) is essentially the data themselves!

So, when doing data analysis, plot the raw data. Show the atoms. Search for the fun stuff, like outliers. The excitement is always in the tails, or outliers, of the data. Seek to understand their source.
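To make the point about order statistics concrete, here is a small sketch with two invented samples (the names `steady` and `spread` and their values are hypothetical). Both have the same mean and the same median, so those summaries cannot tell them apart; the sorted values, which are the order statistics, can.

```python
import statistics


def order_statistics(values):
    """The sample sorted from smallest to largest: the data minus their order."""
    return sorted(values)


# Two hypothetical samples of percent of goal (not real data).
steady = [50, 50, 50, 50, 50]
spread = [90, 10, 50, 70, 30]

same_mean = statistics.mean(steady) == statistics.mean(spread)
same_median = statistics.median(steady) == statistics.median(spread)
same_order_stats = order_statistics(steady) == order_statistics(spread)

print(same_mean, same_median, same_order_stats)
print(order_statistics(spread))
```

The mean and median agree, but the order statistics do not: one sample has everyone at 50, while the other runs from 10 to 90. Only the full set of values shows that.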
Remember that the goal is to understand the distribution of the data; therefore, if some simple summary measures tell you all you need to know about the distribution, fine. But if not, try to show all the data!
Last update: 31 January 2007