Requirements for Interactive Graphics Software for Exploratory Data Analysis

Antony Unwin

Dept of Computer-Oriented Statistics and Data Analysis, Mathematics Institute, University of Augsburg, 86135 Augsburg, Germany

Summary

Interactive graphics tools are an essential component of EDA. This paper proposes a basic set of requirements for interactive graphics software. To be fully effective it needs fast, flexible and usable implementations of the following interactive tools: direct querying, zooming, rescaling, selection with linking, and the use of multiple views. In addition, it needs to support all these operations for selected subgroups. The interface must be consistent throughout and each tool or method must be fully integrated into the system. These points are discussed and illustrated with examples.

 

 

Introduction

Quite a number of data displays have familiar, easily recognisable patterns. The graphs in Figure 1, for instance, show a skew distribution of car prices, a possibly multimodal distribution of horsepower and a scatterplot of the two variables showing a roughly linear relationship, but with a couple of possible outliers (the two points marked x).

Figure 1 Histograms of Price and Horsepower and a scatterplot of Price against Horsepower

For the variable, price, in the first plot, we might consider another display to put more emphasis on the extreme values, either a boxplot or a dotplot or we might zoom in on that part of the display. We could also consider a transformation. For the variable, horsepower, in the second plot, we would need to check whether the multimodality is an artefact of the particular histogram parameters (anchor-point and bin-width) chosen. We might suspect that there are different groups of cars and look at other variables (such as the number of cylinders or the country of origin) for explanation using linking. In the third plot we would probably query the two outliers to see which cars they were, if there was anything special about them and, possibly, treat them separately in later analyses.

Displays of other data may not show such potentially interpretable patterns at first glance, but the same kinds of basic tools are necessary to explore the inherent information: querying, zooming, variation of displays, multiple views, grouping and, last but by no means least, linking. This list is easy enough to write and many packages claim to offer most, if not all, of the tools. However, exploratory analyses have to be fast and flexible. A first pattern may prove to be superficial or only very weakly supported and then it should be discarded to investigate other possibilities. It may be necessary to check a pattern in many different ways to become confident that it is revealing structure in the data set. Exploratory analysis is about generating many ideas, but also about discarding most (and maybe all) of them.

 

Software for EDA

Statisticians have always analysed data and so has almost everyone else at some time or another, but the term data analysis was not common and what data analysis might be in contrast to statistics was not clear. Tukey’s 1962 article in the Annals of Mathematical Statistics on the Future of Data Analysis was an important watershed and drew attention to the differences between the two. Since then there has been considerably more attention paid to data analysis, but the books and articles on it are few and far between, compared to the large numbers on various areas of statistical theory. Tukey’s own book, Exploratory Data Analysis (Tukey 1977), and the books with Mosteller and others (Mosteller and Tukey 1977, Hoaglin et al 1983) were important but are very much framed in the pre-PC period. The main reason for the lack of publications is that data analysis is about methods more than mathematics and methodology is hard to write about. A paper on the asymptotic properties of a new variant of an old estimator may well produce a new mathematical result that is of no immediate consequence, but the paper is worth considering publishing. A paper discussing how statistical tools might be applied in an informative fashion is not so clear-cut, and however much editors claim to want such papers, they do not get enough of sufficient quality to enable them to publish many.

Developments in computer hardware and software over the past fifteen years have led to much more attention being paid to graphics and much wider use of statistical displays. But the concentration has been on presentation graphics not on graphics for exploratory work. There is much good advice in the books of Cleveland (1993 and 1994) and Tufte (1983, 1990, 1997) that is relevant for exploratory analyses but the emphasis is on presentation and there is little on the powerful exploratory tools of interactive graphics. Tufte’s dictum that "graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space" (Tufte 1983 p51) presupposes a fixed set of results and a fixed display. Cleveland’s criticism that "It is easy to be dazzled by a display of data, especially if it is rendered with color or depth" (Cleveland 1993 p1) is clearly directed at elaborate static images.

Exploratory graphics have quite different requirements. Buja et al (1996) helpfully distinguish between rendering and manipulation. Both types of graphic need rendering, but only exploratory graphics can be manipulated and manipulation can often solve rendering difficulties. The problems for presentation graphics arise because results must be displayed in a limited space, usually using only one graphic, for viewing by a potentially wide range of other people. Exploratory graphics have in effect unlimited space, unlimited numbers of graphics and are primarily only for the person who draws them. Presentation graphics are for conveying information, while exploratory graphics are for discovering information. Presentation graphics should concentrate on the essentials and exclude less important features. A set of exploratory graphics should highlight all features that might be important. Presentation graphics are closed, while exploratory graphics are open-ended. In particular, an optimal choice of display for a presentation graphic may not be included in an optimal range of displays for exploratory graphics.

There have been articles on software for dynamic graphics, starting with the first example of rotating plots in PRIM-9 (Fisherkeller et al 1974) and going on to much more modern applications, particularly XGobi (Swayne et al 1991), though these emphasise dynamic rather than interactive graphics. Eick and Wills (1995) outline four principles of interactive graphics: simple, easily interpretable views; information hiding; direct manipulation; and linked views. Wilhelm et al (1995) discuss requirements for software. The thesis of Theus (1996), unfortunately only available in German, emphasises the four concepts of selection, linking, cues and warnings. Buja et al (1996) refer to focusing, linking and arranging views. All of these are valuable contributions, but as yet no agreed standards have emerged, as the variation in key concept lists shows. In many ways the best sources for information on interactive graphics are the software packages themselves: Data Desk (Velleman 1997), Visual Insights being developed at Lucent Technologies (Wills 1995) and MANET (Unwin et al 1996). (Web page addresses for all of these are given at the end of the paper.)

The following sections outline important basic tools for interactive graphics for data analysis. These are fundamental requirements for any exploratory software, but it must be borne in mind that they can only achieve their full value if they are integrated in a consistent and flexible system.

 

Minimum Requirements – Querying

Even the most carefully prepared data displays can show unexpected features which have to be checked, and that applies even more to routine displays. Querying outliers in a boxplot can identify errors, querying points in a scatterplot can suggest why clustering might be visible, querying a histogram bar can highlight threshold problems and querying rectangles in a mosaic plot is essential for getting any information from the display. Querying refers here to the obtaining of specific information (data or labels) and can be regarded as a special version of linking. It is nevertheless useful to distinguish querying from linking between graphics, which is discussed later on.

Many years ago it was quite impressive to be able to find out the ID number of a point in a scatterplot. You then had to look up the point in the data set. It was not easy to look at several points as a group. This made querying the data slow and uninformative. Nowadays we should expect to be able to identify points by name and by value, to be able to get other variable values for the point, and to do all this interactively and instantaneously with the information appearing where we make the query and not on some different and distant part of the screen. This is still surprisingly rare. JMP can identify points by a previously defined label variable, Data Desk can report case values for a number of selected variables (but unfortunately you have to know the order of selection of the variables to know what the values mean), while MANET offers the fullest interactive querying facility. MANET can not only query points as outlined above; the same querying command also provides information in all graphic displays according to where in the display the request is made (see, for instance, Figure 2, for the querying of a histogram bar). However, MANET can only query one screen object at a time (a point, a group of overlapping points or a histogram bar). It cannot query, for instance, several histogram bars, and it does not provide any summaries from its queries, just direct values. The next stage in the development of querying tools must be to provide more sophisticated levels of querying. We may regard point identification and values for the display as a sensible default, providing values for other variables as a deeper level of information and providing summary information from querying a selected group as the next stage.
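The three levels of querying suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `query`, its `level` argument and the toy car records are hypothetical and are not taken from MANET, Data Desk or JMP.

```python
from statistics import mean

# Hypothetical toy data; names and values are invented for illustration.
cars = [
    {"name": "Car A", "price": 12000, "horsepower": 90, "country": "Japan"},
    {"name": "Car B", "price": 31000, "horsepower": 170, "country": "Germany"},
    {"name": "Car C", "price": 15500, "horsepower": 110, "country": "USA"},
]

def query(data, selected, displayed, level=1):
    """Return query information for the selected case indices.

    level 1: label plus the values of the variables shown in the display
    level 2: all variable values for each selected case
    level 3: summary (count and means of the displayed variables) for the group
    """
    cases = [data[i] for i in selected]
    if level == 1:
        return [{k: c[k] for k in ("name", *displayed)} for c in cases]
    if level == 2:
        return [dict(c) for c in cases]
    if level == 3:
        summary = {"n": len(cases)}
        for var in displayed:
            summary["mean_" + var] = mean(c[var] for c in cases)
        return summary
    raise ValueError("level must be 1, 2 or 3")

# Default query on one point of a price/horsepower scatterplot.
print(query(cars, [0], ("price", "horsepower")))
# Summary query on a selected group of two points.
print(query(cars, [0, 2], ("price", "horsepower"), level=3))
```

In an interactive system the level might be tied to a modifier key or a repeated click, so that the default stays cheap and the deeper levels are available on demand.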

Figure 2 Querying a histogram in MANET. This plot shows the Maths marks of 124 Irish schoolchildren. There are 6 pupils with marks in the bin 40 ≤ x < 42 and none are currently selected.

Zooming

Zooming is useful for checking on details, on possible clusterings or on the granularity of data. If it is intuitive, fast and reversible, then it will be used more often. Operations which are difficult to carry out or slow to display can hinder rather than help if they distract us from our train of thought.

It has always been possible to redraw graphs with a change of scale by re-entering the necessary commands with the required changes (or by working through the appropriate sequence of menu options). If you guessed wrongly about the new scale the new display would not show the area you wanted and if the original plot was no longer visible you might lose the context of the details being inspected. You could also not easily zoom in on several different sections simultaneously. This feature is still not really available in any statistical software though directly zooming in on individual sections is (for instance, in Data Desk and in the Visual Insights software written by Graham Wills and others at Lucent Technologies). Ideally, the software should retain an overview of the original display in miniature, to show where the zoom is located. It should be possible to change the level of the zoom, gradually moving in and out, like using a telescopic lens. It should be possible to move the magnified view around in the original display without requiring a redraw. Aspects of these features are available in some computer games, but not generally in professional computer software.

Zooming is becoming ever more interesting as data sets become larger. Boxplots, which are very effective for identifying a few outliers in several hundred points, would identify thousands in a million for the same distribution. Even for small, but skew distributions, zooming can be very useful as can be seen in Figure 3.
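To put a rough number on the boxplot remark, the following sketch computes how many points would be flagged as outliers if the data were normally distributed and the usual rule of 1.5 interquartile ranges beyond the quartiles were used. Both the normal distribution and the 1.5 rule are assumptions made here for illustration; the text does not specify either.

```python
from statistics import NormalDist

def expected_boxplot_outliers(n, k=1.5):
    """Expected number of points beyond the boxplot fences for n normal values.

    Assumes a standard normal distribution and fences at k * IQR beyond the
    quartiles (k = 1.5 is the common convention, assumed here).
    """
    std_normal = NormalDist()
    q3 = std_normal.inv_cdf(0.75)        # upper quartile, about 0.674
    iqr = 2 * q3                         # symmetric distribution
    upper_fence = q3 + k * iqr           # about 2.70
    tail = 1 - std_normal.cdf(upper_fence)
    return 2 * tail * n                  # both tails

for n in (500, 1_000_000):
    print(n, round(expected_boxplot_outliers(n), 1))
```

Under these assumptions roughly 0.7% of the points lie outside the fences: three or four points in 500, but several thousand in a million.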

If we assume that the screen is 600x400 pixels and that we use about a quarter of the available real estate for a window, then assuming further that only 80% of the window is actually for data (the rest being taken up by legends and scales), we have 48000 pixels for the data. In most plots the data are far more dense in one part than elsewhere (often scatterplots are mainly empty). Assuming that the bulk of the data is in 20% of the available area and that each point needs at least 5 pixels to be clearly visible, we can show about 2000 points. This is far too few for today’s data sets.
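The pixel budget above is simple arithmetic and can be checked directly. The function name and parameter names below are made up for this sketch; the default values are the ones quoted in the text.

```python
def displayable_points(width=600, height=400, window_percent=25,
                       data_percent=80, dense_percent=20, pixels_per_point=5):
    """Return (pixels available for data, points that can be shown clearly)."""
    screen_pixels = width * height                           # 240000
    window_pixels = screen_pixels * window_percent // 100    # a quarter of the screen
    data_pixels = window_pixels * data_percent // 100        # minus legends and scales
    dense_pixels = data_pixels * dense_percent // 100        # where the bulk of the data lies
    return data_pixels, dense_pixels // pixels_per_point

data_pixels, max_points = displayable_points()
print(data_pixels, max_points)  # 48000 1920
```

The result, 1920, is the "about 2000 points" of the text. Changing the defaults shows how slowly the limit grows with screen size compared with the growth in data set sizes.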

The above description of zooming implies a direct visual magnification. In practice it is useful to do more than that and to amend the display according to the level of magnification: logical zooming. This is well-known from maps. Augsburg is a dot on the map of the world, a bigger circle on a map of Germany and is displayed in great detail on its street map. Anyone would be disappointed if they zoomed in on Augsburg on a map and just saw the black dot get bigger and bigger. Logical zooming for statistical displays would be valuable for histograms: it would be good to be able to zoom in on one or more adjacent bars and see them in more detail (Unwin 1995). Logical zooming would be interesting for boxplots, displaying the dotplot for the selected section. And logical zooming would be useful for mosaic plots in investigating how a large cell could be split up by introducing other variables.

Figure 3 (a) Boxplot of 79 companies’ sales figures (b) Zoom of 10000 to 20000

 

Variation of displays

Different scales, different aspect ratios, different window sizes can all give different impressions of the same data. Cleveland and McGill (1987) have given several interesting examples for time series and also suggested some theory to determine optimal default displays. As with zooming, displays will only be varied if it is likely we get some return for our efforts. The more effort is required for a possible unknown gain, the less likely it is to be considered. In the second plot of Figure 1 it was suggested that alternative versions of the histogram should be checked to see if the multimodality was an artefact. This is hard work if each new histogram has to be laboriously constructed (and many may have to be checked to be sure). Some packages now offer ways round this. Data Desk allows an interactive changing of the number of bins, but only with a change in window size, which makes comparisons difficult. JMP (dragging a hand vertically for binwidth and horizontally for anchor point changes) and MANET (adjusting binwidth and anchor point directly on screen) both allow instantaneous generation of new histograms. Static histograms have a justifiably poor theoretical reputation as univariate displays, particularly for density estimation (Simonoff 1996), but interactive histograms are very effective for data analysis. The three histograms in the following example have been drawn with specified bin-widths for display purposes. In practice they would be a subset of many which would be explored very quickly looking for possible features of interest. In this case there is a suspicious gap just below the exam pass mark of 40.

Figure 4 Three histograms of the exam marks in Maths for 124 Irish pupils. In each case the anchor point is 10 but the bin widths are 10, 5 and 2 respectively. Only the last histogram shows the typical gap pattern under the pass mark of 40.
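The effect of varying binwidth and anchor point can be sketched as follows. The marks are invented for illustration and are not the Irish exam data; the function `histogram` is a hypothetical helper, not the algorithm used by JMP or MANET.

```python
from collections import Counter
from math import floor

def histogram(values, anchor, binwidth):
    """Count values in bins [anchor + k*binwidth, anchor + (k+1)*binwidth).

    Returns {bin_start: count}, including empty bins between the lowest
    and highest occupied bin so that gaps remain visible.
    """
    index = [floor((v - anchor) / binwidth) for v in values]
    counts = Counter(index)
    return {anchor + k * binwidth: counts[k]
            for k in range(min(index), max(index) + 1)}

# Invented marks with nobody scoring 38 or 39.
marks = [31, 35, 36, 37, 40, 40, 41, 41, 42, 43, 45, 48, 52, 55, 61]

for width in (10, 5, 2):
    print(width, histogram(marks, anchor=10, binwidth=width))
```

With these made-up marks the bins covering 38 and 39 are well filled at binwidths 10 and 5, and only at binwidth 2 does the bin starting at 38 come out empty. Because recomputing the counts is so cheap, many such variants can be scanned in the time it takes to specify one static histogram.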

Every display has different variations. Scatterplots may be drawn with axes switched (plotting horsepower against price in Figure 1, what power do you get for your money rather than what does power cost) and with a different aspect ratio. In MANET axis switching is merely a mouse-click. The aspect ratio can be swiftly varied in many packages by resizing the window with the mouse. Full interactive rescaling would require a way of stretching and shrinking the axes and no package offers that yet. As always with the development of tools for interactive graphics, we have to acknowledge that at the moment few people would see the need for interactive rescaling of this kind, because it is awkward to rescale and so they don’t do it. Until you have the tools available to you in a usable form you cannot be sure how effective they will be in practice.

One area in which progress has been made is in the drawing of scatterplots for large data sets. Statisticians have always been aware of the problems of displaying overlapping points in scatterplots and there have been some novel, if impractical, suggestions for dealing with it. In the revised edition of Cleveland’s generally excellent book "The Elements of Graphing Data" (Cleveland 1994) his many suggestions include transforming the data, moving some data, jittering the data, displaying multiple points with sunflower symbols and using open circles. None of these is appropriate for large data sets and several are not even good for small ones. Carr et al (1987) have proposed a scheme involving hexagonal binning and density estimation. Wills implemented a simplified density estimation scheme in REGARD which proved very effective and has been implemented by Hofmann in MANET. Although the kernel used is uniform, the ability to vary the kernel range interactively allows great flexibility in exploring the two dimensional distribution.

This is the classical trade-off in interactive graphics: the flexibility of the tools in practice makes up for possible theoretical deficiencies in the displays due to the need for fast redrawing. In fact, it would be feasible to implement more sophisticated kernels now given the rapidly increasing computer power at our disposal and it will be interesting to discover whether they contribute substantially more to the information conveyed by the displays.
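A density display with a uniform kernel of adjustable range can be sketched as below. This is not the REGARD or MANET implementation, whose details are not given here; the square window, the grid and the function name are assumptions chosen to keep the example short.

```python
def uniform_kernel_density(points, grid_x, grid_y, radius):
    """For each grid location, count the points inside a square window.

    The square window of half-width `radius` plays the role of a uniform
    kernel (an assumption of this sketch). Counts are left unnormalised,
    since only relative shading matters for the display.
    """
    return [
        [sum(1 for (x, y) in points
             if abs(x - gx) <= radius and abs(y - gy) <= radius)
         for gx in grid_x]
        for gy in grid_y
    ]

# Invented points, three of them nearly on top of each other.
points = [(1, 1), (1, 1), (1.2, 0.9), (3, 3)]
grid = [0, 1, 2, 3]

narrow = uniform_kernel_density(points, grid, grid, radius=0.5)
wide = uniform_kernel_density(points, grid, grid, radius=1.5)
for row in narrow:
    print(row)
```

A narrow range reveals the overplotting at (1, 1), while a wide range smooths the points into a single broad region. Varying `radius` interactively corresponds to the adjustable kernel range described in the text.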

 

Multiple views

Varying displays is one way to get another view of data but a further one is to consider plots of different kinds. The initial example of the skew distribution for price in Figure 1 led to suggesting a boxplot or dotplot. Figure 5 displays boxplots and dotplots of price by country of origin. The boxplots show that German cars tend to be more expensive. The dotplots reveal surprising gaps between the most expensive cars in the price ranges both for the Japanese and the American cars (there is a similar gap for the German cars, but little data to make anything of it).

Of course, the new information has come through including another variable in the analysis, the categorical variable country of origin. We would not have thought of drawing histograms of price by country because there would not be much data per group and because multiple histograms make inefficient use of screen real estate. Having drawn a single boxplot or dotplot it is natural to think of looking at the distribution by subgroups in parallel. (In this case, not surprisingly, the boxplots do not show the gaps in the data. On the other hand they do show the median and the shape of the distribution better in general than the dotplots.)

 

 

Figure 5 Parallel boxplots and dotplots of price by country of origin.

 

Data in groups

In Figure 5 more information was obtained by splitting up the data by groups. German cars might be combined with "Others" as they are both much smaller than the Japanese and American groups. It would be valuable for comparison purposes to be able to reorder the groups easily, for instance to compare the Japanese and Americans side by side. Neither of these obviously interactive requirements is generally available, though MANET does offer interactive reordering of bar charts (Figure 6). The column for Japan, which is second from the left in the first display, has been grabbed by the mouse and moved to the right of the display with the result shown in the third chart. The cars with 6 and 8 cylinders have been highlighted.

 

Figure 6 Numbers of car models by country of origin (a) alphabetic order of countries (b) moving the Japanese column to the far right and (c) the final chart.

Prespecified groups can be dealt with more or less easily in most systems and it is clear what kind of comparative analyses are needed. Working with groups which are identified during analysis is more difficult. Both Data Desk and MANET permit the definition of selections as new groups, but keeping track of multiple, possibly overlapping and not exhaustive groups can be hard. In the following display of the cars data set from Figures 1, 5 and 6, Japanese cars with high reliability have been selected and marked with a circle while cars with low horsepower (<100) have been marked blue (the grey-shaded points in Figure 7). These two groups have 18 and 20 members respectively with 4 being in both. 57 of the 91 cars are in neither group. The performance spread in terms of MPG against weight is large and it is interesting that two of the reliable Japanese cars of low horsepower are poor in this respect compared to other low horsepower cars.

Defining new groupings (perhaps high reliability would be better defined as a reliability score of 4 or 5 and not just 5), keeping track of group definitions (there are many ways low horsepower could be defined, <100 was an arbitrary rounded choice) and making appropriate comparisons (it might be relevant to compare the Japanese high reliability cars either with the high reliability cars of other countries or with the other Japanese cars) are all essential data analysis tasks which could be made more intuitive and easier with good software. There has to be much more work done in this area to enable subgroup analyses to play their full role in exploratory studies.
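Keeping track of overlapping, non-exhaustive groups is essentially set bookkeeping. The sketch below uses synthetic case IDs chosen only to reproduce the counts quoted for Figure 7 (18 and 20 members, 4 in both, 91 cars in total); the IDs and the group labels are otherwise hypothetical.

```python
# Synthetic case IDs; only the group sizes match the text.
all_cars = set(range(91))
groups = {
    "reliable Japanese": set(range(0, 18)),        # 18 cars
    "low horsepower (<100)": set(range(14, 34)),   # 20 cars, 4 shared with the first group
}

both = groups["reliable Japanese"] & groups["low horsepower (<100)"]
either = groups["reliable Japanese"] | groups["low horsepower (<100)"]
neither = all_cars - either

def memberships(case):
    """List the names of all groups a case belongs to (possibly none)."""
    return [name for name, members in groups.items() if case in members]

print(len(both), len(neither))   # 4 57
print(memberships(15))           # a car that is in both groups
```

Storing each group together with the rule that defined it (for example the threshold of 100) would also help with the problem of remembering how a group was constructed.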

 

Figure 7 MPG against Weight for the cars data set with two overlapping subgroups selected.

 

Linking

The example in Figure 6 illustrates the use of linking: selecting a point or groups of points in one plot and having them highlighted in all other displays. In fact, linking could have been used to good effect in all the other examples. Linking between the histograms in Figure 1 would have shown the general structure that was then shown in the scatterplot and, more importantly, linking the outliers in the scatterplot to displays of further variables could have explained their occurrence. Linking to displays of profit, assets, cash flow and business would enable an understanding of the group of outliers in the zoomed section of the boxplot in Figure 3. It would be interesting to link the Maths exam results in Figure 4 to results of the same children in other subjects, in particular to see if those who failed could maybe compensate elsewhere, but also to see if there were common patterns of performance across the subjects. Figure 8 shows histograms of the marks for Maths, English, Irish, Geography and History with those who failed in Maths selected. Although a scatterplot is excellent for examining the relationship between two variables as in Figure 1, linking multiple one dimensional displays is often better for a group of variables. This is a matter of taste and goals. A scatterplot matrix for the variables or a set of parallel boxplots or dotplots could be just as suitable; they each emphasise different aspects of the data. For 5 or more variables only boxplots and dotplots can really be used.

Figure 8 Histograms of the exam marks of 126 Irish pupils. Missing values (that is students who were absent for the corresponding exam) are displayed in the white column to the left. All scales are from 10 to 100 but the bin-widths are 2 for Maths and 5 for the others for reasons of space. The vertical scales in the four others are also equal.

The linking shows that, in general, students who were weak in Maths were also weak in other subjects, but that there were some exceptionally high marks (for instance in History and Geography; further use of linking shows these marks to have been obtained by the same pupil). It is noticeable that some of the worst marks in the other subjects must have been obtained by pupils who did not fail Maths. One explanation might be that they had missed the Maths exam. Interactive exploration shows that this is not the case but reveals something more interesting: two pupils missed all 5 exams and four additional ones were missing only for Irish. However, these four pupils did well in all the other four subjects. It could be that these were foreign pupils who did not have to study Irish (which is compulsory for Irish schoolchildren) and therefore had less of a study-load.

This is the essence of exploratory analysis, observing potentially interesting features and checking them via other variables and other displays quickly and efficiently. We started out from the failing Maths students, noticed the possibility that some who were very bad in other subjects might have passed Maths, checked the missings and discovered a different subgroup which could be interesting. In parallel we might have noted the differing distributional patterns of the marks (e.g. the bunching of English in the middle range) and the common distributional features (they all have a local mode at 40 and above as we found with Maths). These features should be checked in more detail as well.
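The mechanism behind linked histograms of this kind can be sketched with a shared set of selected case identifiers. The marks below are invented, and `linked_histogram` is a hypothetical helper rather than the method of any of the packages discussed; missing values are collected in a separate bin, in the spirit of the white column in Figure 8.

```python
from collections import Counter

# Invented marks keyed by pupil id; None marks a missed exam.
marks = {
    "maths":   {1: 35, 2: 62, 3: 38, 4: 71, 5: None},
    "english": {1: 42, 2: 58, 3: 33, 4: 66, 5: 50},
}

def linked_histogram(values_by_case, selected, anchor, binwidth):
    """Return {bin_start: (total, highlighted)} for one view.

    Cases with missing values are collected under the key None.
    """
    total, highlighted = Counter(), Counter()
    for case, value in values_by_case.items():
        if value is None:
            key = None
        else:
            key = anchor + ((value - anchor) // binwidth) * binwidth
        total[key] += 1
        if case in selected:
            highlighted[key] += 1
    return {key: (total[key], highlighted[key]) for key in total}

# The selection is made in one view (Maths marks below the pass mark of 40) ...
selected = {case for case, m in marks["maths"].items()
            if m is not None and m < 40}
# ... and every other view highlights the same cases.
english_view = linked_histogram(marks["english"], selected, anchor=10, binwidth=5)
maths_view = linked_histogram(marks["maths"], selected, anchor=10, binwidth=2)
print(english_view)
```

Because every view reads the same `selected` set, a selection made in any display is immediately reflected in all the others, whatever their type or scaling.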

A more traditional approach to a multivariate data set like this would be to calculate correlation coefficients, but they only cover linear relationships and may be discredited by problems with missing values (in the full exams data set casewise deletion leads to an almost empty set, because one pair of subjects is supposed to be mutually exclusive).
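The missing-value problem with casewise deletion can be illustrated with a toy example. The pupils, marks and the two mutually exclusive subjects ("art" and "music") are invented for this sketch; in this extreme version casewise deletion leaves no cases at all, whereas the text reports an almost empty set for the real data.

```python
from math import sqrt

# Invented marks; None = exam not taken. "art" and "music" are hypothetical
# mutually exclusive subjects, so no pupil has marks in both.
pupils = [
    {"maths": 35, "english": 42, "art": 50,   "music": None},
    {"maths": 62, "english": 58, "art": None, "music": 70},
    {"maths": 48, "english": 51, "art": 61,   "music": None},
    {"maths": 71, "english": 69, "art": None, "music": 64},
]

def complete_cases(rows, variables):
    """Keep only the rows with no missing value in any of the given variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

casewise = complete_cases(pupils, ["maths", "english", "art", "music"])
pairwise = complete_cases(pupils, ["maths", "english"])
r = pearson([p["maths"] for p in pairwise], [p["english"] for p in pairwise])
print(len(casewise), len(pairwise), round(r, 3))
```

Deleting pairwise keeps all four pupils for the Maths and English correlation, but the resulting coefficients are then based on different subsets of cases for different pairs of variables.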

 

Integrating tools in software

Working with interactive graphics is quite different from working with static, presentation graphics. The aim is not to improve and polish a particular display till it conveys its message in an effective manner, but to use sets of displays to explore data sets and discover the information in them. This demands that both the displays and the tools used to manipulate them are tightly interlinked. It must be possible to use the tools in any particular order on any graphic displayed. It will not be clear in advance whether it is better to link from one display or from another, whether you should query and then zoom or first link and then query, whether a scatterplot or two linked histograms is more informative initially.

If software is slow and inflexible, then clear, default, recommended procedures will be necessary, but these will miss many of the features in the data. A good example is given by histograms, as in Figure 4. An automatic, "optimal" binwidth (see, e.g. Scott (1992)) might not pick out the gap and will almost certainly choose a non-integer binwidth which does not reflect the granularity of the marks. It will also be based only on the data from the variable itself and will not use any multivariate information which might be relevant (in this case, as in Figure 8, that it can be sensible for the marks from different exams to have related scales).
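As an illustration of the point about automatic binwidths, the sketch below applies the normal reference rule commonly attributed to Scott, h = 3.49 s n^(-1/3), to a set of invented integer marks. Whether this is exactly the rule the text has in mind is an assumption, and the marks are not the Irish exam data.

```python
from statistics import stdev

def scott_binwidth(values):
    """Normal reference binwidth h = 3.49 * s * n**(-1/3).

    s is the sample standard deviation and n the number of values.
    """
    n = len(values)
    return 3.49 * stdev(values) * n ** (-1 / 3)

# Invented integer marks.
marks = [31, 35, 36, 37, 40, 40, 41, 41, 42, 43, 45, 48, 52, 55, 61]
h = scott_binwidth(marks)
print(round(h, 2))
```

For these marks the rule gives a binwidth a little over 11, which is neither an integer nor a divisor of the pass mark, so bin boundaries would not line up with 40. An interactive histogram lets the analyst move from such a default to binwidths such as 2 or 5 that respect the granularity of the data.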

Fast and flexible software gives users freedom to explore many different lines of thought, enables them to review and check insights, permits them to recover swiftly from unproductive or misleading paths, and encourages them to incorporate their domain knowledge of data sets’ subject matter to gain further information. In the case of the Irish exam marks, you needed to know that the pass mark was 40 and that it is common, though not obligatory, to avoid borderline results below the pass mark.

 

Conclusions

This paper has been about the tools needed for exploring data. Direct querying, zooming, rescaling, selection with linking, and the use of multiple views do not involve complex concepts; it is their efficient implementation and their effective use which are complex. All of them are valuable in discerning the information in data sets. They enable exploratory data analysis to be carried out in a flexible and insightful manner. Sophisticated tools such as animation, rotation, functional linking, spatial linking and relational linking would be welcome, but should ideally be consistently integrated with a fully equipped basic system.

At the moment what we have are a number of packages with a variety of more or less integrated tools. What we do not have and what we very much need are integrated packages with all these tools and more. When we reach that stage we can finally begin to realise the promise of that picture of data analysis which Tukey first put firmly in the public arena over 35 years ago.

Software

Data Desk and JMP are both commercial software packages which run on PCs and Macs. Data Desk is a trade mark of Data Description. JMP is a trade mark of SAS. MANET is in development at the department of Computer-Oriented Statistics and Data Analysis at the University of Augsburg. It is based on ideas from several members of the group but the implementation is primarily due to Heike Hofmann. MANET currently only runs on Macintosh computers. Contact the author for details of availability.

Web addresses:

Data Desk http://www.datadesk.com

JMP http://www.sas.com/otherprods/jmp/

MANET http://www1.math.uni-augsburg.de/Manet/

Visual Insights http://www.lucent.com/visualinsights

Acknowledgements

The department of Computer-Oriented Statistics and Data Analysis is supported by the Volkswagen Foundation. Thanks are due to the members of the Augsburg group for many helpful discussions and advice and to a particularly helpful referee.

 

References

Buja, A., Cook, D., Swayne, D.F. (1996). Interactive High-Dimensional Data Visualisation. Journal of Computational and Graphical Statistics, 5(1), 78-99.

Carr, D. B., Littlefield, R.J., Nicholson, W.L., Littlefield, J.S. (1987). Scatterplot Matrix Techniques for large N. Journal of the American Statistical Association, 82(398), 424-436.

Cleveland, W. S., McGill, R. (1987). Graphical Perception: The Visual Decoding of Quantitative Information on Graphical Displays of Data. Journal of the Royal Statistical Society, Series A, 150, 192-229.

Cleveland, W. S. (1993). Visualizing Data. Summit, NJ, USA: Hobart Press

Cleveland, W. S. (1994). The Elements of Graphing Data (Revised ed.). Summit, New Jersey, USA: Hobart Press

Eick, S. G., Wills, G. J. (1995). High Interaction Graphics. European Journal of Operational Research, 445-459

Fisherkeller, M. A., Friedman, J.H., and Tukey, J.W. (1974). PRIM-9: An Interactive Multidimensional Data Display and Analysis System. SLAC-PUB-1408, SLAC Publications Office, Stanford, California.

Hoaglin, D., Mosteller, F. and Tukey, J. (Ed.). (1983). Understanding Robust and Exploratory Data Analysis. New York: Wiley.

Mosteller, F., Tukey, J. (1977). Data Analysis and Regression. Reading, MA: Addison-Wesley

SAS (1995). JMP 3.1. SAS Institute

Scott, D. W. (1992). Multivariate Density Estimation. New York: Wiley

Simonoff, J. (1996). Smoothing Methods in Statistics. New York: Springer.

Swayne, D. F., Cook, D., and Buja, A. (1991). XGobi: Interactive Dynamic Graphics in the X Window System with a link to S. In Proceedings of the 1991 ASA Statistical Graphics Section (pp. 1-8).

Theus, M. (1996) Theorie und Anwendung Interaktiver Statistischer Graphik. Augsburg.

Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, Connecticut: Graphics Press

Tufte, E. R. (1990). Envisioning Information. Cheshire, Connecticut: Graphics Press

Tufte, E. R. (1997). Visual Explanations. Cheshire, Connecticut: Graphics Press

Tukey, J. (1962). The future of data analysis. Annals of Mathematical Statistics, 33, 1-67.

Tukey, J. (1977). Exploratory Data Analysis. London: Addison-Wesley.

Unwin, A. R. (1995). Interaktive Statistische Grafik – eine Übersicht? In J. Frohn et al. (Eds.), Applied Statistics – Recent Developments (pp. 177-183). Göttingen: Vandenhoeck & Ruprecht.

Unwin, A. R., Hawkins, G., Hofmann, H., and Siegl, B. (1996). Interactive Graphics for Data Sets with Missing Values - MANET. Journal of Computational and Graphical Statistics, 5(2), 113-122.

Velleman, P. F. (1997). Data Desk. Ithaca New York: Data Description

Wilhelm, A., Unwin, A. R., Theus, M. (1995). Software for Interactive Statistical Graphics - a Review. In F. Faulbaum (Ed.), SoftStat, (pp. 3-12). Heidelberg: Lucius & Lucius