What is Gauguin?
GAUGUIN ( Grouping And Using Glyphs Uncovering Individual Nuances ) is a project for the interactive visual exploration of multivariate data sets, developed for use on all major platforms (Windows, Linux, Mac). It supports a variety of methods for displaying flat-form data and hierarchically clustered data.
Glyphs are geometric shapes scaled by the values of dataset variables. They may be drawn for individual cases or for averages of groups or clusters of cases. GAUGUIN offers four different glyph shapes (but more could be added). The number of data elements which can be displayed simultaneously is limited, because each glyph requires a minimum amount of screen space to be viewed, but hierarchical glyphs can be drawn for groups of cases. Hierarchical glyphs are composed of a highlighted case representing the group and a band around it showing the variability of all the members of the cluster.
GAUGUIN also provides scatterplots and tableplots, and via Rserve is able to use R to calculate MDS views and clusters for the data. All GAUGUIN displays are linked interactively and can be directly queried.
The variable window is the most important control tool: here the user can enable and disable variables, delete them completely, or select and replace them. The user can decide which variables are included in (or excluded from) the multivariate distance calculations used for highlighting neighbours in the grid plot.
Gauguin implements the following interactive graphics for categorical and continuous data:
Grid: The grid plot gives an overview of the entire data set. The glyphs are arranged on a grid and can be sorted by any variable or displayed in their original order in the data set. Each glyph represents one case (row) of the data set. The grid plot allows for logical zooming; pressing the key “z” creates a window of the selected glyphs (at least two glyphs need to be selected) locally scaled. Groups or clusters obtained from grouping or clustering plots (see below) are accordingly highlighted in the grid plot. Distance calculations for highlighting neighboring (similar) glyphs to an individual selected glyph or to the prototype (average) of a collection of selected glyphs can be performed.
The scatterplot matrix shows all pairwise scatterplots of the variables selected from the list in the main variable window. The scatterplot display allows for logical zooming; clicking one of the pairwise scatterplots (only one plot can be selected at a time) zooms in, giving a detailed view with points represented by their glyphs. Groups or clusters obtained from grouping or clustering plots are accordingly highlighted in the pairwise glyph scatterplots. Distance calculations can be performed.
MDS and CENTER
Multidimensional scaling (MDS) is performed in 2-dimensional space on the scaled data of the variables selected in the main window, using the R functions cmdscale (metric MDS; in the stats package) and isoMDS and sammon (non-metric MDS; in the MASS package), and the resulting configuration is displayed. The configuration points are represented by the corresponding glyphs of the original data. A variation of the MDS plot is the center plot (Center). This is a special MDS plot in which the selected glyphs are displayed in the center of the plot (central view). Groups or clusters obtained from grouping or clustering plots are accordingly highlighted in these MDS plots. The plots allow for distance calculations.
The data can be grouped by a given variable. There are two options, grouping by radius and grouping by count; in case of a textual variable, there is also grouping by category.
Grouping by radius means that the cases are ordered by increasing value of the selected variable and are then assigned to groups successively, such that the range (max−min) of the selected variable in each group does not exceed the prespecified radius. The number of resulting groups depends on the chosen radius.
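The radius rule can be illustrated with a small Python sketch. It is not GAUGUIN's own code: the function name group_by_radius is hypothetical, and the greedy choice of opening a new group as soon as the range would exceed the radius is an assumption about how "successively" is meant.

```python
def group_by_radius(values, radius):
    """Sort the values, then fill groups greedily so that
    max - min within each group never exceeds `radius`."""
    groups = []
    for v in sorted(values):
        # groups[-1][0] is the minimum of the current group (values are sorted)
        if groups and v - groups[-1][0] <= radius:
            groups[-1].append(v)
        else:
            groups.append([v])  # range would exceed the radius: open a new group
    return groups


example = group_by_radius([5, 1, 2, 9, 6, 10], radius=1)
print(example)
```

A smaller radius yields more groups, a larger one fewer, which matches the remark that the number of groups depends on the chosen radius.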
Grouping by count means that the cases are ordered by increasing value of the selected variable and are then assigned to ‘count’ groups of, as far as possible, equal size; the groups are filled successively in ascending order, starting with the smallest value of the selected variable.
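A minimal Python sketch of grouping by count follows. The function name group_by_count is hypothetical, and the decision to give the extra cases to the first groups when the data do not divide evenly is an assumption; the description only says the sizes are equal "as far as possible".

```python
def group_by_count(values, count):
    """Sort the values and split them into `count` consecutive groups
    whose sizes differ by at most one."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), count)
    groups, start = [], 0
    for i in range(count):
        size = base + (1 if i < extra else 0)  # assumption: first groups take the remainder
        groups.append(ordered[start:start + size])
        start += size
    return groups


print(group_by_count([7, 3, 9, 1, 5, 8, 2], count=3))
```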
In the case of grouping by category, the categories of the textual variable represent the groups, and the cases are put into groups depending on their values of the selected variable. The grouping plot consists of hierarchical glyphs. They are composed of a highlighted prototype representing the group and a band around it showing the candidate glyphs of the group, as a visual measure of the variability of the members of that group.
The grouping plot allows for logical zooming; pressing the key “z” creates a window of the selected glyphs (at least two glyphs need to be selected) locally scaled, and a combination of the key “ctrl” and a mouse click creates a window approximately reproducing all the glyphs of one group. The groups are colored and distance calculations can be performed in the grouping plot.
Clustering: Clustering is performed on the data of the variables selected in the main window, using the R functions hcluster (hierarchical clustering) and Kmeans (k-means clustering), both in the amap package, and kmeans (k-means clustering; in the stats package). The calculated clusters are displayed as hierarchical glyphs. The user can specify the number of clusters (groups) to be computed.
The agglomeration method used for hierarchical clustering can be one of ward, single, complete, average, mcquitty, median, or centroid. Possible distance functions that can be specified are euclidean, maximum, manhattan, canberra, binary, or minkowski. The clustering plot is generated based on the variables selected from the list of variables. The clustering plot allows for logical zooming; pressing the key “z” creates a window of the selected glyphs (at least two glyphs need to be selected) locally scaled, and a combination of the key “ctrl” and a mouse click creates a window approximately reproducing all the glyphs of one cluster. The clusters are colored and distance calculations can be performed in the clustering plot.
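For orientation, several of the distance functions named above can be written out in a few lines of Python. This is a sketch using the textbook definitions, not the code of the R packages; the handling of edge cases (for example zero denominators in the Canberra distance, which are skipped here) is an assumption and may differ from R.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def maximum(x, y):
    # largest absolute coordinate difference (Chebyshev distance)
    return max(abs(a - b) for a, b in zip(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p=2):
    # p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def canberra(x, y):
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # assumption: terms with a zero denominator are skipped
            total += abs(a - b) / denom
    return total


x, y = (0, 0), (3, 4)
print(euclidean(x, y), maximum(x, y), manhattan(x, y), canberra(x, y))
```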
In a table plot each cell of the data set is mapped to a bar in the plot, with the length and color of the bar encoding the value of the cell. For continuous variables the length of the bar is proportional to the value of the variable. Cells containing categorical information are mapped to fixed-length bars whose color encodes the attribute value. Each column of the table plot represents a variable, and each row represents one or several adjacent cases of the data set. When the data set is large and it is not possible to draw all cases on screen, adjacent cases are aggregated into one bar of the plot. The aggregation count or group size can be set by the user. For an aggregated continuous variable, a horizontal bar of corresponding size shows the mean value. For an aggregated categorical variable, a horizontal bar is shown, which is subdivided according to the proportion of cases in each category within the particular aggregation group and colored accordingly. By clicking a column header, the data are sorted by the variable in that column; more generally, nested sorting using more than one variable is possible. The table plot also supports such interactivity features as selection, drag and drop for column reordering, randomization of data, and zooming.
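The aggregation step described above can be sketched in Python as follows. The function names are hypothetical and the code only illustrates the idea (mean for continuous columns, category proportions for categorical ones); it makes no claim about how GAUGUIN implements it.

```python
from collections import Counter


def aggregate_continuous(values, group_size):
    """Mean of each block of `group_size` adjacent cases (the last block may be shorter)."""
    means = []
    for i in range(0, len(values), group_size):
        block = values[i:i + group_size]
        means.append(sum(block) / len(block))
    return means


def aggregate_categorical(values, group_size):
    """Proportion of each category within each block of adjacent cases."""
    shares = []
    for i in range(0, len(values), group_size):
        block = values[i:i + group_size]
        counts = Counter(block)
        shares.append({cat: n / len(block) for cat, n in counts.items()})
    return shares


print(aggregate_continuous([1, 3, 5, 7, 10], group_size=2))
print(aggregate_categorical(["a", "b", "b", "b"], group_size=2))
```

Each mean would set the length of one aggregated bar, and each dictionary of proportions would determine how one aggregated categorical bar is subdivided.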
In the groups plot there is one row for each group or cluster as determined by a grouping plot or clustering plot, respectively, and one column for every variable selected from the list of variables. Continuous variables are represented by histograms and categorical variables by barcharts, conditional on the groups or clusters. All plots in the same column share a common scale. The anchorpoint and bin width for each histogram column can be set individually. This is a kind of trellis-like display in which the column variables are conditioned on a row ‘variable’ that has the groups or clusters as its categories.
Barcharts and Histograms
Gauguin also implements interactive histograms and barcharts. In histograms the anchorpoint and bin width can be set by the user; in barcharts the numerical coding of the categories can be changed. Histograms and barcharts can also be produced by double-clicking the variables in the main variable window.
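How the anchorpoint and bin width together determine a histogram can be shown with a short Python sketch. The function name histogram_counts is hypothetical, and the convention that bins are closed on the left and open on the right is an assumption.

```python
import math


def histogram_counts(values, anchor, bin_width):
    """Count values per bin, where bin k covers
    [anchor + k * bin_width, anchor + (k + 1) * bin_width)."""
    counts = {}
    for v in values:
        k = math.floor((v - anchor) / bin_width)
        counts[k] = counts.get(k, 0) + 1
    # report each non-empty bin by its left edge
    return {anchor + k * bin_width: n for k, n in sorted(counts.items())}


data = [0.5, 1.0, 1.2, 2.9, 3.0]
print(histogram_counts(data, anchor=0.0, bin_width=1.0))
print(histogram_counts(data, anchor=0.5, bin_width=1.0))
```

Shifting the anchorpoint from 0.0 to 0.5 regroups the same data into different bins, which is why being able to change both settings interactively is useful.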
All graphic displays benefit from being made interactive. Gauguin replicates glyphs and such common plot types as barcharts, histograms, scatterplots and scatterplot matrices, which are readily available in most statistical software packages. However, Gauguin is highly interactive and offers a wide range of query and data exploration options. All Gauguin displays are fully linked and can be queried directly. The prime aim is to add interactive capabilities to glyph representations. Important factors influencing the interpretation of glyph visualizations can be controlled flexibly and smoothly, such as the form of glyph chosen, the variables included, the axis ordering within each glyph, glyph size, and the ordering of glyphs in the display.
Special features include shading of glyphs falling within a specified radius or distance from an individual selected glyph or from the prototype of a collection of selected glyphs. Distances are distinguished through intensity of color. These features make it possible to find glyphs of similar shape, which could otherwise be difficult to identify. The user can specify the variables and the distance function (Euclidean, Maximum, Manhattan) to be used for such distance calculations. Moreover, highlighting of groups or clusters obtained from grouping or clustering plots in other plots, or highlighting of selections including selections obtained from distance calculations, can be switched on or off. The color shadow option can be activated or deactivated in grouping and clustering plots, for a crude, shadow-like visual representation of hierarchical unfilled star glyphs, primarily giving information about the density of candidate glyphs of a particular group or cluster. In grouping and clustering plots the user can also specify the transparency of the group and cluster colors, respectively, through the “cursor-left” and “cursor-right” keys. This is called α-channel transparency or α-blending.
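The distance-based shading can be sketched in Python under a few stated assumptions: the prototype is taken as the variable-wise average of the selected cases, Euclidean distance is used, and the color intensity falls off linearly from 1 at distance 0 to 0 at the radius. The function names and the linear mapping are illustrative choices, not GAUGUIN's documented behaviour.

```python
import math


def prototype(cases):
    """Variable-wise average of a collection of cases."""
    n = len(cases)
    return [sum(column) / n for column in zip(*cases)]


def shade_neighbours(cases, selected, radius):
    """Return {case index: intensity} for all cases within `radius`
    of the prototype of the selected cases (radius must be positive)."""
    center = prototype([cases[i] for i in selected])
    shades = {}
    for i, case in enumerate(cases):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(case, center)))
        if d <= radius:
            shades[i] = 1.0 - d / radius  # assumption: linear fall-off of intensity
    return shades


cases = [(0, 0), (0, 1), (0, 2), (3, 4)]
print(shade_neighbours(cases, selected=[0], radius=4))
```

Swapping the Euclidean formula for the Maximum or Manhattan distance changes which glyphs count as neighbours, in line with the choice of distance function offered to the user.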
Linking of plots is easily accomplished by selection and highlighting. For example, a single data point or case selected in one plot is highlighted in all other plots. However, Gauguin lacks one useful interactivity feature, that of selection sequences, which may be incorporated in future versions of the software. Selection sequences allow users to combine current selections with new selections via simple Boolean operators. By storing the sequence, it is possible to modify any individual selection in the sequence at will during data analysis.
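The idea of a selection sequence can be illustrated with sets of case numbers in Python. This is a conceptual sketch only: the feature is not part of Gauguin, and the operator names ("replace", "and", "or", "xor", "not") and the function name evaluate are hypothetical.

```python
def evaluate(sequence):
    """Fold a stored sequence of (operator, selected cases) steps into one selection."""
    result = set()
    for op, cases in sequence:
        if op == "replace":
            result = set(cases)
        elif op == "and":
            result &= cases
        elif op == "or":
            result |= cases
        elif op == "xor":
            result ^= cases
        elif op == "not":
            result -= cases
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


sequence = [("replace", {1, 2, 3, 4}), ("and", {2, 3, 4, 5}), ("or", {9})]
print(evaluate(sequence))

# Because the steps are stored, an earlier step can be changed and the result recomputed.
sequence[1] = ("and", {1, 2})
print(evaluate(sequence))
```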
Creating multiple (simultaneous) views in single or different plot windows, manually and automatically sorting and reordering categories and variables, and varying the size of points in a scatterplot are further interactive options available in Gauguin. Gauguin also allows for standard and logical zooming. The current version, however, does not support censored zooming and color brushing (persistent assignment of colors; for instance, to mark outliers more permanently). Future versions of Gauguin may provide for these additional interactivity features.
Some general remarks:
Download and installation
GAUGUIN (application for Mac)
GAUGUIN.jar (jar file for Windows and Unix)
GAUGUIN uses R to carry out its MDS and clustering calculations. To access R, you must have R installed on your computer, together with the package Rserve. (You also need the package amap for clustering.)
Currently GAUGUIN will start Rserve automatically only under MacOSX. For the time being, Windows and Linux users need to start Rserve manually. Just type
Detailed instructions on Rserve can be found here.
Gauguin supports the standard ASCII data format, which consists of a header of variable names, and tab-delimited columns.
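A file in this format can be inspected with a few lines of Python before loading it. The sketch below uses only the standard csv module; the function name read_tab_delimited and the sample data are made up for illustration.

```python
import csv
import io


def read_tab_delimited(text):
    """Split tab-delimited text into a header of variable names and a list of rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)                 # first line: variable names
    rows = [row for row in reader if row]  # skip empty lines
    return header, rows


# Illustrative data only: one header line, then one line per case.
sample = "name\theight\tweight\nA\t1.5\t60\nB\t1.7\t72\n"
header, rows = read_tab_delimited(sample)
print(header)
print(rows)
```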
Grouping and Clustering Plots:
Gauguin, developed by Alexander Gribov and Antony Unwin, is a project of the Department of Computer Oriented Statistics and Data Analysis (COSADA) at the University of Augsburg, Germany.