ggplot2: Elegant Graphics for Data Analysis
Table 2. The variables depth, table, x, y and z refer to the dimensions of the diamond as shown in Figure 2. The dataset has not been well cleaned, so as well as demonstrating interesting relationships about diamonds, it also demonstrates some data quality problems. There is also an optional data argument. Here is a simple example of the use of qplot(). It produces a scatterplot showing the relationship between the price and carats (weight) of a diamond.
The plot shows a strong correlation with notable outliers and some interesting vertical striation. Because qplot() accepts functions of variables as arguments, we plot log(price) vs. log(carat). The relationship now looks linear. The majority of diamonds do seem to fall along a line, but there are some large outliers. This makes it easy to include additional data on the plot. In the next example, we augment the plot of carat and price with information about diamond colour and cut.
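A minimal sketch of such a call, assuming the ggplot2 package (which ships the diamonds data) is installed. Note that qplot() is deprecated in recent ggplot2 releases; it still runs but emits a warning. The object names p_raw and p_log are illustrative only.

```r
library(ggplot2)

# Scatterplot of price against carat for the diamonds data.
p_raw <- qplot(carat, price, data = diamonds)

# qplot() accepts functions of variables, so a log-log version
# needs no separate data preparation.
p_log <- qplot(log(carat), log(price), data = diamonds)

# print(p_raw); print(p_log)  # uncomment to draw the plots
```

The plots are stored rather than printed so that the sketch stays quick to run on the full dataset of roughly 54,000 rows.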
The results are shown in Figure 2. It is this scale that controls the appearance of the points and associated legend.
For example, in the above plots, the colour scale maps J to purple and F to green. Note that while I use British spelling throughout this book, the software also accepts American spellings. You can also manually set the aesthetics using I().
This is not the same as mapping and is explained in more detail in Section 4. For large datasets, like the diamonds data, semi-transparent points are often useful to alleviate some of the overplotting. To make a semi-transparent colour you can use the alpha aesthetic, which takes a value between 0 (completely transparent) and 1 (completely opaque).
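As a hedged illustration of setting alpha to a constant, the sketch below wraps the value in I() for qplot() and passes it as a plain layer parameter for the layered form. The specific alpha values (1/10 and 1/100) are illustrative choices, not values taken from the text.

```r
library(ggplot2)

# Setting (not mapping) alpha: I() marks the value as a constant.
# With alpha = 1/10, ten overlapping points give a fully opaque spot.
p_tenth     <- qplot(carat, price, data = diamonds, alpha = I(1/10))
p_hundredth <- qplot(carat, price, data = diamonds, alpha = I(1/100))

# The same idea written with explicit layers: the constant goes
# outside aes(), so no scale or legend is created for it.
p_layers <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 1/10)
```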
For example, colour and shape work well with categorical variables, while size works better with continuous variables. An alternative solution is to use faceting, which will be introduced in Section 2. Some geoms have an associated statistical transformation, for example, a histogram is a binning statistic plus a bar geom. The following geoms enable you to investigate two-dimensional relationships. The point geom is the default when you supply both x and y arguments to qplot(). Line and path geoms are traditionally used to explore relationships between time and another variable, but lines may be used to join observations connected in some other way.
The histogram geom is the default when you only supply an x value to qplot. If you have a scatterplot with many data points, it can be hard to see exactly what trend is shown by the data. In this case you may want to add a smoothed line to the plot.
This is easily done using the smooth geom as shown in Figure 2. Notice that we have combined multiple geoms by supplying a vector of geom names created with c(). The geoms will be overlaid in the order in which they appear. The dsmall dataset (left) and the full dataset (right). More details about the algorithm used can be found in ?loess. The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly), as shown in Figure 2.
This is similar to using a spline with lm, but the degree of smoothness is estimated from the data. This is used by default when there are more than 1,000 points. The second parameter is the degrees of freedom: a higher number will create a wigglier curve.
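The effect of span can be sketched as follows. This version uses explicit layers rather than a vector of geom names, and it assumes dsmall is a random sample of 100 diamonds; the seed and the two span values are illustrative assumptions.

```r
library(ggplot2)

# Assumption: dsmall is a random sample of 100 rows of diamonds.
set.seed(1410)
dsmall <- diamonds[sample(nrow(diamonds), 100), ]

# Points first, smoother second: layers are drawn in the order added.
base <- ggplot(dsmall, aes(carat, price)) + geom_point()

# A small span follows the data closely; a span of 1 is much smoother.
p_wiggly <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.2)
p_smooth <- base + geom_smooth(method = "loess", formula = y ~ x, span = 1)
```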
You are free to specify any formula involving x and y. Figure 2. When a set of data includes a categorical variable and one or more continuous variables, you will probably be interested to know how the values of the continuous variables vary with the levels of the categorical variable.
As the colour improves from left to right the spread of values decreases, but there is little change in the centre of the distribution. Each method has its strengths and weaknesses. In the example here, both plots show the dependency of the spread of price per carat on diamond colour, but the boxplots are more informative, indicating that there is very little change in the median and adjacent quartiles. The overplotting seen in the plot of jittered values can be alleviated somewhat by using semi-transparent points using the alpha argument.
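A minimal sketch of the two displays compared here, written with explicit layers. The alpha value of 1/50 is an illustrative assumption; color is the column name for diamond colour in the diamonds data.

```r
library(ggplot2)

# Price per carat conditional on diamond colour.
base <- ggplot(diamonds, aes(color, price / carat))

# Jittered points, made semi-transparent to reduce overplotting.
p_jitter <- base + geom_jitter(alpha = 1/50)

# Box-and-whisker summary of the same conditional distributions.
p_box <- base + geom_boxplot()
```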
The plots are produced with the following code. As the opacity decreases we begin to see where the bulk of the data lies. However, the boxplot still does much better. For boxplots you can control the outline colour, the internal fill colour and the size of the lines. Another way to look at conditional distributions is to use faceting to plot a separate histogram or density plot for each value of the categorical variable. This is demonstrated in Section 2. They provide more information about the distribution of a single group than boxplots do, but it is harder to compare many groups although we will look at one way to do so.
For the histogram, the binwidth argument controls the amount of smoothing by setting the bin size. It is very important to experiment with the level of smoothing. In Figure 2. Binwidths from left to right: 1, 0. Only diamonds between 0 and 3 carats shown. Mapping a categorical variable to an aesthetic will automatically split up the geom by that variable, so these commands instruct qplot to draw a density plot and histogram for each level of diamond colour.
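The sketch below illustrates both ideas with explicit layers: a histogram whose smoothing is controlled by binwidth, and the automatic split that follows from mapping a categorical variable to an aesthetic. The binwidth of 0.1 is an assumed illustrative value.

```r
library(ggplot2)

# binwidth sets the bin size; try several values when exploring.
p_hist <- ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.1)

# Mapping colour to a categorical variable draws one density per level.
p_dens <- ggplot(diamonds, aes(carat, colour = color)) +
  geom_density()

# Mapping fill does the same for histograms, which are stacked by default.
p_stack <- ggplot(diamonds, aes(carat, fill = color)) +
  geom_histogram(binwidth = 0.1)
```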
(Left) Density plots are overlaid and (right) histograms are stacked. In addition, the density plot makes some assumptions that may not be true for our data.
This is illustrated in Figure 2. Line and path plots are typically used for time series data. Line plots join the points from left to right, while path plots join them in the order that they appear in the dataset (a line plot is just a path plot of the data sorted by x value).
Line plots usually have time on the x-axis, showing how a single variable has changed over time. Because there is no time variable in the diamonds data, we use the economics dataset, which contains economic data on the US measured over the last 40 years. (Left) Percent of population that is unemployed and (right) median number of weeks unemployed. To examine this relationship in greater detail, we would like to draw both time series on the same plot. We could draw a scatterplot of unemployment rate vs. length of unemployment, but then we could no longer see the evolution over time.
The solution is to join points adjacent in time with line segments, forming a path plot. Below we plot unemployment rate vs. length of unemployment. In the second plot, we apply the colour aesthetic to the line to make it easier to see the direction of time.
(Left) Scatterplot with overlaid path. (Right) Pure path plot coloured by year. We can see that percent unemployed and length of unemployment are highly correlated, although in recent years the length of unemployment has been increasing relative to the unemployment rate.
With longitudinal data, you often want to display multiple time series on each plot, each series representing one individual. To do this with qplot(), you need to map the group aesthetic to a variable encoding the group membership of each observation. This is explained in more depth in Section 4. Faceting takes an alternative approach: It creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset in an arrangement that facilitates comparison.
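A hedged sketch of the two path plots, using the economics dataset shipped with ggplot2 (columns date, pop, unemploy and uempmed). The helper year() is hypothetical and defined here only to extract the calendar year from the date column.

```r
library(ggplot2)

# Hypothetical helper: calendar year of a Date vector.
year <- function(x) as.POSIXlt(x)$year + 1900

# Unemployment rate vs. median weeks unemployed, joined in time order.
p_path <- ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_point() +
  geom_path()

# Colouring the path by year makes the direction of time visible.
p_year <- ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path(aes(colour = year(date)))
```

geom_path() joins observations in the order of the rows, which is chronological for this dataset; geom_line() would instead sort by the x value.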
Section 7. To facet on only one of columns or rows, use . as a placeholder. The second set of histograms shows proportions, making it easier to compare distributions regardless of the relative abundance of diamonds of each colour. The y-axis of the histogram does not come from the original data, but from the statistical transformation that counts the number of observations in each bin. This can be a string. As with the plot title, these can be character strings or mathematical expressions.
(Left) Bars show counts and (right) bars show densities (proportions of the whole). The density plot makes it easier to compare distributions ignoring the relative abundance of diamonds within each colour. Small diamonds. Note, however, that ggplot is generic, and may provide a starting point for producing visualisations of arbitrary R objects. See Chapter 9 for more details. This is then scaled and displayed with a legend.
If you want to set the value, use I(). This is explained in more detail in Section 4. With ggplot2, you need to add additional layers to the existing plot, described in the next chapter. Chapter 3. This chapter describes the theoretical basis of ggplot2: the layered grammar of graphics. The next chapters discuss the components in more detail, and provide more examples of how you can use them in practice.
The grammar is useful for you both as a user and as a potential developer of statistical graphics. As a user, it makes it easier for you to iteratively update a plot, changing a single feature at a time. The grammar is also useful because it suggests the high-level aspects of a plot that can be changed, giving you a framework to think about graphics, and hopefully shortening the distance from mind to paper.
It also encourages the use of graphics customised to a particular problem, rather than relying on generic named graphics. As a developer, the grammar makes it much easier to add new capabilities to ggplot2. You only need to add the one component that you need, and you can continue to use all of the other existing components. For example, you can add a new statistical transformation, and continue to use the existing scales and geoms.
This chapter begins by describing in detail the process of drawing a simple plot. Section 3. The chapter concludes with Section 3. Consider the fuel economy dataset, mpg, a sample of which is illustrated in Table 3. It records make, model, class, engine size, transmission and fuel economy for a selection of US cars in 1999 and 2008. It contains the 38 models that were updated every year, an indicator that the car was a popular model.
Table 3. This dataset suggests many interesting questions. How are engine size and fuel economy related? Do certain manufacturers care more about economy than others? Has fuel economy improved in the last ten years? It is a scatterplot of two continuous variables (engine displacement and highway mpg), with points coloured by a third variable (number of cylinders).
From your experience in the previous chapter, you should have a pretty good feel for how to create this plot with qplot. But what is going on underneath the surface? How does ggplot2 draw this plot? Points are coloured according to number of cylinders.
This plot summarises the most important factor governing fuel economy: engine size. What precisely is a scatterplot? You have seen many before and have probably even drawn some by hand. As well as a horizontal and vertical position, each point also has a size, a colour and a shape.
These attributes are called aesthetics, and are the properties that can be perceived on the graphic. Each aesthetic can be mapped to a variable, or set to a constant value. In Figure 3. Size and shape are not mapped to variables, but remain at their constant default values. Once we have these mappings we can create a new dataset that records this information. This new dataset is a result of applying the aesthetic mappings to the original data.
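The idea of a "new dataset" produced by the aesthetic mappings can be sketched in base R. This is only an illustration of the concept, not how ggplot2 builds its internal data; it assumes the mpg columns displ (engine displacement), hwy (highway mpg) and cyl (number of cylinders).

```r
library(ggplot2)  # only needed for the mpg dataset

# Apply the mappings x = displ, y = hwy, colour = cyl by hand.
mapped <- with(mpg, data.frame(x = displ, y = hwy, colour = cyl))

# One row per observation, with columns named after aesthetics.
head(mapped)
```

The values are still in data units at this point; turning them into positions and colours is the job of the scales discussed next.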
The scatterplot uses points, but were we instead to draw lines we would get a line plot. Neither of those examples makes sense for this data, but we could still draw them, as in Figure 3. This data frame contains all the data to be displayed on the plot. Neither of these geoms makes sense for this data, but they are still grammatically valid.
Points, lines and bars are all examples of geometric objects, or geoms. Plots that use a single geom are often given a special name, a few of which are listed in Table 3. For example, Figure 3. What would you call this plot?

Named plot             Geom      Other features
scatterplot            point
bubblechart            point     size mapped to a variable
barchart               bar
box-and-whisker plot   boxplot
line chart             line

This plot takes Figure 3. The values in Table 3. We need to convert them from data units to units that the computer can display (such as pixels and colours). This conversion process is called scaling and performed by scales. Now that these values are meaningful to the computer, they may not be meaningful to us: colours are represented by a six-letter hexadecimal string, sizes by a number and shapes by an integer.
In this example, we have three aesthetics that need to be scaled: horizontal position (x), vertical position (y) and colour. Scaling position is easy in this example because we are using the default linear scales. We need only a linear mapping from the range of the data to [0, 1]. This is done by the coordinate system, or coord. In most cases this will be Cartesian coordinates, but it might be polar coordinates, or a spherical projection used for a map.
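The linear mapping from the data range to [0, 1] can be written out in a few lines of base R. The function name rescale01 and the sample values are hypothetical, chosen only to illustrate the arithmetic.

```r
# Map a numeric vector linearly so its minimum becomes 0 and maximum 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical engine displacements in litres.
displ <- c(1.8, 2.0, 2.8, 5.3, 5.7)
scaled <- rescale01(displ)
scaled
```

Only the smallest and largest values are pinned (to 0 and 1); everything else keeps its relative spacing, which is what makes the scale linear.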
The process for mapping the colour is a little more complicated, as we have a non-numeric result: colours. However, colours can be thought of as having three components, corresponding to the three types of colour-detecting cells in the human eye.
These three cell types give rise to a three-dimensional colour space. Scaling then involves mapping the data values to points in this space. There are many ways to do this, but here since cyl is a categorical variable we map values to evenly spaced hues on the colour wheel, as shown in Figure 3.
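A rough sketch of mapping the levels of a categorical variable to evenly spaced hues, using base R's hcl() colour function. The starting hue (15), chroma (100) and luminance (65) are assumptions chosen to resemble a typical default; the helper name hue_colours is hypothetical.

```r
# n colours evenly spaced around the colour wheel at fixed chroma and
# luminance. The wheel is split into n equal steps starting at hue 15.
hue_colours <- function(n, l = 65, c = 100) {
  hues <- seq(15, 375, length.out = n + 1)[seq_len(n)]
  grDevices::hcl(h = hues, c = c, l = l)
}

# One colour per level of cyl (4, 5, 6 and 8 cylinders).
cyl_levels <- c("4", "5", "6", "8")
cyl_colours <- setNames(hue_colours(length(cyl_levels)), cyl_levels)
cyl_colours
```

The result is a named vector of hexadecimal colour strings, which is the form the text describes as intimidating but machine-readable.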
The result of these conversions is Table 3. As well as aesthetics that have been mapped to variables, we also include aesthetics that are constant. The description of colours is intimidating, but this is the form that R uses internally. This is the default scale for discrete variables. Finally, we need to render this data to create the graphical objects that are displayed on the screen.
To create a complete plot we need to combine graphical objects from three sources: the data, represented by the point geom; the scales and coordinate system, which generate axes and legends so that we can read values from the graph; and plot annotations, such as the background and plot title.
Figure 3. Contributions from the data, the point geom, have been removed. This plot adds three new components to the mix: facets, multiple layers and statistics. The facets and layers expand the data structure described above: each facet panel in each layer has its own dataset.
You can think of this as a 3d array: the panels of the facets form a 2d grid, and the layers extend upwards in the 3rd dimension. This requires an additional step in the process described above: after mapping the data to aesthetics, the data is passed to a statistical transformation, or stat, which manipulates the data in some useful way.
Other useful stats include 1 and 2d binning, group means, quantile regression and contouring. As well as adding an additional step to summarise the data, we also need some extra steps when we get to the scales. Scaling actually occurs in three parts: transforming, training and mapping. In general, this structure also has a third dimension for layers, but in this example the data for each layer is the same. This ensures that a plot of log(x) vs. log(y) on linear scales looks the same as x vs. y on log scales.
See Section 6. The training operation combines the ranges of the individual datasets to get the range of the complete data. Sometimes we do want to vary position scales across facets (but never across layers), and this is described more fully in Section 7.
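Training can be illustrated with a small base R sketch: the ranges of the per-panel datasets are pooled into one range, which is then used to map every panel. The panel values below are hypothetical.

```r
# Hypothetical x values for three facet panels.
panels <- list(
  a = c(1.6, 2.4, 3.1),
  b = c(2.0, 5.7),
  c = c(4.2, 7.0)
)

# Training: combine the individual ranges into the range of all data.
trained <- range(unlist(panels))

# Mapping: each panel is scaled with the shared, trained range, so the
# same data value lands at the same position in every panel.
mapped <- lapply(panels, function(x) (x - trained[1]) / diff(trained))
mapped
```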
This is a local operation: the variables in each dataset are mapped to their aesthetic values producing a new dataset that can then be rendered by the geoms. We have also touched on the coordinate system. Each square represents a layer, and this schematic represents a plot with three layers and three panels. Together, the data, mappings, stat, geom and position adjustment form a layer. A plot may have multiple layers, as in the example where we overlaid a smoothed line on a scatterplot.
The following sections describe each of the higher level components more precisely, and point you to the parts of the book where they are documented.
Layers are responsible for creating the objects that we perceive on the plot. The properties of a layer are described in Chapter 4 and how they can be used to visualise data in Chapter 5. A scale controls the mapping from data to aesthetic attributes, and we need a scale for every aesthetic used on a plot. Each scale operates across all the data in the plot, ensuring a consistent mapping from data to aesthetics.
Some scales are illustrated in Figure 3. A scale is a function, and its inverse, along with a set of parameters. For example, the colour gradient scale maps a segment of the real line to a path through a colour space.
The inverse function is used to draw a guide so that you can read values from the graph. Guides are either axes (for position scales) or legends (for everything else). Most mappings have a unique inverse (i.e., the mapping function is one-to-one), but many do not. A unique inverse makes it possible to recover the original data, but this is not always desirable if we want to focus attention on a single aspect. Chapter 6 describes scales in detail. From left to right: continuous variable mapped to size, and to colour, discrete variable mapped to shape, and to colour.
The ordering of scales seems upside-down, but this matches the labelling of the y-axis: small values occur at the bottom. A coordinate system, or coord for short, maps the position of objects onto the plane of the plot.
The Cartesian coordinate system is the most common coordinate system for two dimensions, while polar coordinates and various map projections are used less frequently. For example, in polar coordinates, bar geoms look like segments of a circle.
Additionally, scaling is performed before statistical transformation, while coordinate transformations occur afterward. The consequences of this are shown in Section 7. Coordinate systems control how the axes and grid lines are drawn.
Very little advice is available for drawing these for non-Cartesian coordinate systems, so a lot of work needs to be done to produce polished output. Coordinate systems are described in Section 7. This is a powerful tool when investigating whether patterns hold across all conditions.
Faceting is described in Chapter 7. This grammar is encoded into R data structures in a fairly straightforward way.
A plot object is a list with components data, mapping (the default aesthetic mappings), layers, scales, coordinates and facet. Plots can be created in two ways: all at once with qplot(), as shown in the previous chapter, or piece-by-piece with ggplot() and layer functions, as described in the next chapter. This saves a complete copy of the plot object, so you can easily re-create that exact plot with load().
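A minimal sketch of working with a plot object, assuming the mpg data from ggplot2. It inspects the object with summary() and round-trips it through save() and load(); the temporary file path is illustrative.

```r
library(ggplot2)

# Build a plot object without drawing it.
p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Describe the plot's data, mappings and layers without rendering.
summary(p)

# Save a complete copy of the plot object to disk ...
path <- tempfile(fileext = ".rdata")
save(p, file = path)

# ... and restore it later under the same name.
rm(p)
load(path)
# print(p)  # uncomment to draw the restored plot
```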
Note that data is stored inside the plot, so that if you change the data outside of the plot, and then redraw a saved plot, it will not be updated. The following code illustrates some of these tools. Layering is the mechanism by which additional data elements are added to a plot.
This chapter is mainly a technical description of how layers, geoms, statistics and position adjustments work: how you call and customise them. These two chapters are companions, with this chapter explaining the theory and the next chapter explaining the practical aspects of using layers to achieve your graphical goals. Section 4. The plot is not ready to be displayed until at least one layer is added, as described in Section 4.
The stat returns a data frame with new variables that can also be mapped to aesthetics with a special syntax. To conclude, Section 4. When we used qplot(), it did a lot of things for us: it created a plot object, added layers, and displayed the result, using many default values along the way.
To create the plot object ourselves, we use ggplot(). This has two arguments: data and aesthetic mapping. These arguments set up defaults for the plot and can be omitted if you specify data and aesthetics when adding each layer.
You are already familiar with aesthetic mappings from qplot(), and the syntax here is quite similar, although you need to wrap the pairs of aesthetic attribute and variable name in the aes() function. A minimal layer may do nothing more than specify a geom, a way of visually representing the data.
If we add a point geom to the plot we just created, we create a scatterplot, which can then be rendered. This layer uses the plot defaults for data and aesthetic mapping and it uses default values for two optional arguments: the statistical transformation (the stat) and the position adjustment.
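The steps above can be sketched as follows, assuming the diamonds data. The first form spells out the geom, stat and position in a call to layer(); the second relies on the shortcut function geom_point(), which supplies the same defaults.

```r
library(ggplot2)

# A plot object with default data and mappings but no layers yet;
# it cannot be displayed until a layer is added.
p <- ggplot(diamonds, aes(carat, price, colour = cut))

# Fully specified layer: point geom, no statistical transformation,
# no position adjustment.
p_verbose <- p + layer(geom = "point", stat = "identity", position = "identity")

# Equivalent shortcut using the geom's default stat and position.
p_short <- p + geom_point()
```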
We can simplify it by using shortcuts that rely on the fact that every geom is associated with a default statistic and position, and every statistic with a default geom. It is most commonly omitted, in which case the layer will use the default plot data. See Section 4. You can also use aesthetic properties as parameters.
This is a text string containing the name of the geom to use. Using the default will give you a standard plot; overriding the default allows you to achieve something more exotic, as shown in Section 4.
Note that the order of data and mapping arguments is switched between ggplot and the layer functions. We suggest explicitly naming all other arguments rather than relying on positional matching. This makes the code more readable and is the style followed in this book. Layers can be added to plots created with ggplot or qplot. Remember, behind the scenes, qplot is doing exactly the same thing: it creates a plot object and then adds layers.
The following example shows the equivalence between these two ways of making plots. The summary function can be helpful for inspecting the structure of a plot without plotting it, as seen in the following example. The summary shows information about the plot defaults, and then each layer. You will learn about scales and faceting in Chapters 6 and 7. Layers are regular R objects and so can be stored as variables, making it easy to write clean code that reduces duplication.
If you later decide to change that layer, you only need to do so in one place. This is restrictive, and unlike other graphics packages in R. Lattice functions can take an optional data frame or use vectors directly from the global environment.
Base methods often work with vectors, data frames or other R objects. However, there are good reasons for this restriction. However, if a variable changes from discrete to continuous (or vice versa), you will need to change the default scales, as described in Section 6. It is not necessary to specify a default dataset except when using faceting; faceting is a global operation. See Section 7. If the default dataset is omitted, every layer must supply its own data.
The data is stored in the plot object as a copy, not a reference. This has two important consequences: if your data changes, the plot will not; and ggplot2 objects are entirely self-contained so that they can be save()d to disk and later load()ed and plotted without needing anything else from that session. This matches the way that qplot() is normally used. You should never refer to variables outside of the dataset.
This is one of the ways in which ggplot2 objects are guaranteed to be entirely self-contained, so that they can be stored and re-used. The default mappings in the plot p can be extended or overridden in the layers, as with the following code. The results are shown in Figure 4. For that reason, unless you modify the default scales, axis labels and legend titles will be based on the plot defaults.
The way to change these is described in Section 6. Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the layer parameters. (Left) Overriding colour with factor(cyl) and (right) overriding y-position with disp.
Table 4. Layer aesthetics can add to, override, and remove the default mappings. We map an aesthetic to a variable or set it to a constant. This sets the point colour to be dark blue instead of black. Because this value is discrete, the default colour scale uses evenly spaced colours on the colour wheel, and since there is only one value this colour is pinkish. With qplot(), you can do the same thing by putting the value inside of I().
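The difference between setting and mapping can be sketched with two nearly identical layers. The mtcars dataset and its mpg and wt columns are used here as an assumed example; the text does not say which data the original figure used.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(mpg, wt))

# Setting: colour is a layer parameter, so points are drawn dark blue
# and no legend is produced.
p_set <- p + geom_point(colour = "darkblue")

# Mapping: the string "darkblue" becomes a one-level variable that is
# scaled by the default discrete colour scale, with a legend.
p_map <- p + geom_point(aes(colour = "darkblue"))
```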
When "darkblue" is mapped to colour, it is treated as a regular value and scaled with the default colour scale. This results in pinkish points and a legend. In ggplot2, geoms can be roughly divided into individual and collective geoms. An individual geom has a distinctive graphical object for each row in the data frame. For example, the point geom has a single point for each observation.
On the other hand, collective geoms represent multiple observations. This may be a result of a statistical summary, or may be fundamental to the display of the geom, as with polygons. Lines and paths fall somewhere in between: each overall line is composed of a set of straight segments, but each segment represents two points. How do we control which observations go in which individual graphical element? This is the job of the group aesthetic.
By default, the group is set to the interaction of all discrete variables in the plot. There are three common cases where the default is not enough, and we will consider each one below. It records the heights (height) and centered ages (age) of 26 boys (Subject), measured on nine occasions (Occasion). In many situations, you want to separate your data into groups, but render them in the same way. When looking at the data in aggregate you want to be able to distinguish individual subjects, but not identify them.
This is common in longitudinal studies with many subjects, where the plots are often descriptively called spaghetti plots. You can see the separate growth trajectories for each boy, but there is no way to see which boy belongs to which trajectory. This is not very useful! Right A single line connects all observations. Building on the previous example, suppose we want to add a single smooth line to the plot just created, based on the ages and heights of all the boys.
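A hedged sketch of these grouping cases, assuming the data described are the Oxboys dataset from the nlme package (26 boys, nine occasions each), which matches the variables named in the text.

```r
library(ggplot2)
data(Oxboys, package = "nlme")

# One line per boy: group by Subject.
p_boys <- ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line()

# Incorrect grouping: a single line zig-zags through all observations.
p_wrong <- ggplot(Oxboys, aes(age, height, group = 1)) +
  geom_line()

# Different groups on different layers: per-boy lines plus one overall
# linear fit, obtained by overriding the grouping in the smooth layer.
p_fit <- p_boys +
  geom_smooth(aes(group = 1), method = "lm", formula = y ~ x, se = FALSE)
```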
This is a useful time-saving technique, and is expanded upon in Chapter The plot has a discrete scale but you want to draw lines that connect across groups.
The colour is a rendering attribute, which has no corresponding variable in the data. Another important issue with collective geoms is how the aesthetics of the individual observations are mapped to the aesthetics of the complete entity.
This means that the aesthetic for the last observation is not used, as shown in Figure 4. An additional limitation for paths and lines is that line type must be constant over each individual line; in R there is no way to draw a joined up line which has varying line type. If colour is categorical (left) there is no meaningful way to interpolate between adjacent colours. If colour is continuous (right), there is, but this is not done by default.
You could imagine a more complicated system where segments smoothly blend from one aesthetic to another. This would work for continuous variables like size or colour, but not for line type, and is not used in ggplot2. If this is the behaviour you want, you can perform the linear interpolation yourself, as shown below. For all other collective geoms, like polygons, the aesthetics from the individual components are only used if they are all the same, otherwise the default value is used.
These issues are most relevant when mapping aesthetics to a continuous variable, because, as described above, when you introduce a mapping to a discrete variable, it will by default split apart collective geoms into smaller pieces. This works particularly well for bar and area plots, because stacking the individual pieces produces the same shape as the original ungrouped data.
This is illustrated in Figure 4. For example, using a point geom will create a scatterplot, while using a line geom will create a line plot. Each geom has a set of aesthetics that it understands, and a set that are required for drawing. For example, a point requires x and y position, and understands colour, size and shape aesthetics. These are listed for all geoms in Table 4. Internally, the rect geom is described as a polygon, and its parameters are the locations of the four corners.
This is useful for non-Cartesian coordinate systems, as you will learn in Chapter 7. Every geom has a default statistic, and every statistic a default geom. For example, the bin statistic defaults to using the bar geom to produce a histogram.
These defaults are listed in Table 4. Overriding these defaults will still produce valid plots, but they may violate graphical conventions. See examples in Section 4. For example, a useful stat is the smoother, which calculates the mean of y, conditional on x, subject to some restriction that ensures smoothness. All currently available stats are listed in Table 4. This ensures that the transformation stays the same when you change the scales of the plot.
A stat takes a dataset as input and returns a dataset as output, and so a stat can add new variables to the original dataset. It is possible to map aesthetics to these new variables. Emboldened aesthetics are required. Useful for overplotting on scatterplots. summary: Summarise y values at every unique x. unique: Remove duplicates.
The following example shows a density histogram of carat from the diamonds dataset. The names of generated variables must be surrounded with .. when used.
Each statistic lists the variables that it creates in its documentation. The syntax to produce this plot with qplot() is very similar, with ..density.. supplied as the y variable.
Position adjustments are normally used with discrete data. Figure 4. Dodging is rather similar to faceting, and the advantages and disadvantages of each method are described in Section 7. For these operations to work, each bar must have the same width and not overlap with any others. The identity adjustment leaves positions unchanged.
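A sketch of the density histogram with explicit layers. Recent ggplot2 releases prefer after_stat(density) over the ..density.. notation described in the text, so the newer form is used here; the binwidth of 0.1 is an assumed illustrative value.

```r
library(ggplot2)

# Map y to the density variable computed by the binning stat, rather
# than to the default count.
p_density <- ggplot(diamonds, aes(carat)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1)

# Inspect the computed variables the stat produced for this layer.
computed <- layer_data(p_density)
head(computed[, c("xmin", "xmax", "count", "density")])
```

Because density is count divided by the total and by the bin width, the bar areas sum to one, which is what makes distributions of different sizes comparable.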
The following examples demonstrate some of the ways to use the capabilities of layers that have been introduced in this chapter. These are just to get you started. You are limited only by your imagination! (Right) It is useful for lines, however, because lines do not have the same problem.
(Left) A frequency polygon; (middle) a scatterplot with both size and height mapped to frequency; (right) a heatmap representing frequency with colour. A number of the geoms available in ggplot2 were derived from other geoms in a process like the one just described, starting with an existing geom and making a few changes in the default aesthetics or stat. For example, the jitter geom is simply the point geom with the default position adjustment set to jitter.
In practice, you often have related datasets that should be shown together. A very common example is supplementing the data with predictions from a model. In Figure 4. In practice we might use a mixed model to do better. This section explores how we can combine the output from this more sophisticated model with the original data to gain more insight into both the data and the model. We do this by building up a grid that contains all combinations of ages and subjects.
This is overkill for this simple linear case, where we only need two values of age to draw the predicted straight line, but we show it here because it is necessary when the model is more complex. Next we add the predictions from the model back into this dataset, as a variable called height. Once we have the predictions we can display them along with the original data. We also set two aesthetic parameters to make it a bit easier to compare the predictions to the actual values.
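The workflow described here can be sketched as below, assuming the Oxboys data from nlme. The model formula (random intercept and slope per boy), the grid of ten ages and the line colour are illustrative assumptions; oplot is the plot object name used later in the text.

```r
library(ggplot2)
library(nlme)

# Assumed mixed model: linear growth with per-boy intercept and slope.
model <- lme(height ~ age, data = Oxboys, random = ~ 1 + age | Subject)

# Original data: one growth trajectory per boy.
oplot <- ggplot(Oxboys, aes(age, height, group = Subject)) + geom_line()

# Grid containing every combination of age and subject.
age_grid <- seq(-1, 1, length.out = 10)
subjects <- unique(Oxboys$Subject)
preds <- expand.grid(age = age_grid, Subject = subjects)

# Add the model predictions back in, as a variable called height so the
# plot's default mappings apply unchanged.
preds$height <- predict(model, preds)

# Overlay the predictions, setting colour and alpha as constants to
# distinguish them from the observed lines.
p_pred <- oplot + geom_line(data = preds, colour = "#3366FF", alpha = 0.4)
```

Naming the prediction column height is the design choice that lets the new layer reuse the plot's existing aesthetics; only the data argument changes.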
There is now less evidence of model inadequacy. Notice how easily we were able to modify the plot object. We updated the data and replotted twice without needing to reinitialise oplot. Chapter 5. The layered structure of ggplot2 encourages you to design and construct graphics in a structured manner. You have learned what a layer is and how to add one to your graphic, but not what geoms and statistics are available to help you build revealing plots.
This chapter lists some of the many geoms and stats included in ggplot2, broken down by their purpose. This chapter will provide a good overview of the available options, but it does not describe each geom and stat in detail.
For more information about individual geoms, along with many more examples illustrating their use, see the online and electronic documentation. You may also want to consult the documentation to learn more about the datasets used in this chapter.
This chapter is broken up into the following sections, each of which deals with a particular graphical challenge. However, this breakdown should cover many common tasks and help you learn about some of the possibilities. If you need a reminder on how to translate between the two, see Appendix A. It is useful to think about the purpose of each layer before it is added. We plot the raw data for many reasons, relying on our skills at pattern detection to spot gross structure, local structure, and outliers.
This layer appears on virtually every graphic. In the earliest stages of data exploration, it is often the only layer. As we develop and explore models of the data, it is useful to display model predictions in the context of the data. We learn from the data summaries and we evaluate the model. Showing the data helps us improve the model, and showing the model helps reveal subtleties of the data that we might otherwise miss.
Summaries are usually drawn on top of the data. A metadata layer displays background context or annotations that help to give meaning to the raw data.
Metadata can be useful in the background and foreground. A map is often used as a background layer with spatial data. Other metadata is used to highlight important features of the data.
In that case, you want this to be the very last layer drawn. These geoms are the fundamental building blocks of ggplot2. They are useful in their own right, but also to construct more complex geoms. Each of these geoms is two dimensional and requires both x and y aesthetics.
The point geom uses shape, and the line and path geoms understand linetype. The geoms are used for displaying data, summaries computed elsewhere, and metadata. Multiple groups will be stacked on top of each other. The identity stat leaves the data unchanged. By default, multiple bars in the same location will be stacked on top of one another. The group aesthetic determines which observations are connected; see Section 4.
Each vertex of the polygon requires a separate row in the data. It is often useful to merge a data frame of polygon coordinates with the data just prior to plotting.
Section 5. This is the only geom in this group that requires another aesthetic: label. It also has optional aesthetics hjust and vjust, which control the horizontal and vertical position of the text, and angle, which controls the rotation of the text. See Appendix B for more details. The tiles form a regular tessellation of the plane and typically have the fill aesthetic mapped to another variable. Each of these geoms is illustrated in Figure 5. There are a number of geoms that can be used to display distributions, depending on the dimensionality of the distribution, whether it is continuous or discrete, and whether you are interested in the conditional or joint distribution.
For 1d continuous distributions the most important geom is the histogram. Figure 5. You can change the binwidth, or specify the exact location of the breaks. We can see that the distribution is slightly skew-right. These options are illustrated in Figure 5. This statistic produces two output variables count and density. The count is the default as it is most interpretable.
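As an illustration of the binwidth setting and of the count and density outputs, here is a minimal sketch. It assumes the depth variable of the diamonds data, bin widths of 1 and 0.1 chosen purely for illustration, and a ggplot2 version recent enough to provide after_stat() (older releases wrote ..density.. instead).

```r
library(ggplot2)

# Two histograms of the same variable with different bin widths.
p_wide   <- ggplot(diamonds, aes(depth)) + geom_histogram(binwidth = 1)
p_narrow <- ggplot(diamonds, aes(depth)) + geom_histogram(binwidth = 0.1)

# The same narrow bins, with density instead of count on the y axis.
p_density <- ggplot(diamonds, aes(depth)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1)

# Inspect what the binning statistic computed for each bar.
bins <- layer_data(p_density)
head(bins[, c("xmin", "xmax", "count", "density")])

# Bar heights times bar widths: the total area under a density histogram.
area <- sum(bins$density * (bins$xmax - bins$xmin))
area
```

Trying several bin widths is usually worthwhile, since wide bins can hide structure that narrow bins reveal.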
The density is basically the count divided by the total count (and by the bin width, so that the bars have a total area of one), and is useful when you want to compare the shape of the distributions, not the overall size. Most of these geoms are aliases: a basic geom is combined with a stat to produce the desired plot.
[Figure: panels for Fair, Good, Very Good, Premium and Ideal; y axis: density.]
From top to bottom: faceted histogram, a conditional density plot, and frequency polygons. All show an interesting pattern: as quality increases, the distribution shifts to the left and becomes more symmetric. This is a useful display when the categorical variable has many distinct values.
When there are few values, the techniques described above give a better view of the shape of the distribution. For continuous variables, the group aesthetic must be set to get multiple boxplots. An example is shown in Figure 5. Also described in Section 2. Use a density plot when you know that the underlying density is smooth, continuous and unbounded.
Generally this works better for smaller datasets. Visualising a joint 2d continuous distribution is described in the next section. However, when the data is large, often points will be plotted on top of each other, obscuring the true relationship. (Left) A density plot of depth; (right) coloured by cut. The data is points sampled from two independent normal distributions, and the code to produce the graphic is shown below.
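A minimal sketch of such an overplotted scatterplot, and of semi-transparent points as one remedy, might look like this. The sample size, the seed and the alpha value of 1/10 are assumptions made for illustration only.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 2000    # assumed sample size, not taken from the text

# Points sampled from two independent normal distributions.
df <- data.frame(x = rnorm(n), y = rnorm(n))

base <- ggplot(df, aes(x, y))

# Fully opaque points: the dense centre turns into a solid blob.
p_opaque <- base + geom_point()

# alpha given as a ratio: with 1/10, ten overlapping points
# are needed to produce a solid colour.
p_alpha <- base + geom_point(alpha = 1 / 10)

p_opaque
p_alpha
```

Writing alpha as a fraction keeps the intent readable: the denominator can be raised for denser data.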
If you specify alpha as a ratio, the denominator gives the number of points that must be overplotted to give a solid colour. In Figure 5. The complete code is shown below. From left to right: geom_point, geom_jitter with default jitter, geom_jitter with horizontal jitter of 0. Breaking the plot into many small squares can produce distracting visual artefacts.
Carr et al. Legends have been omitted to save space. Another approach to dealing with overplotting is to add data summaries to help guide the eye to the true shape of the pattern within the data.
(Top) Image displays of the density; (bottom) point- and contour-based displays. However, it does support the common tools for representing 3d surfaces in 2d: contours, coloured tiles and bubble plots. These were used to illustrate the 2d density surfaces in the previous section. Table 5. Adding map borders is performed by the borders function.
The following code uses borders to display the spatial data shown in Figure 5. The results are shown in Figure 5. (Left) All cities with population as of January of greater than half a million; (right) cities in Texas. In the following example we compute the approximate centre of each county in Iowa and then use those centres to label the map.
If you have information about the uncertainty present in your data, whether it be from a model or from distributional assumptions, it is often important to display it. There are four basic families of geoms that can be used for this job, depending on whether the x values are discrete or continuous, and whether or not you want to display the middle of the interval, or just the extent. These geoms are listed in Table 5. These geoms assume that you are interested in the distribution of y conditional on x and use the aesthetics ymin and ymax to determine the range of the y values.
For very simple cases, ggplot2 provides some tools in the form of summary functions described in Section 5. The effects package (Fox) is particularly useful for extracting these values from linear models. The packages multcomp and multcompView are useful for calculating and displaying these errors while correctly adjusting for multiple comparisons.
(Left) Both x and y axes are log10 transformed to remove non-linearity. (Right) The major linear trend is removed. These alternatives are described below. Note that ggplot2 displays the full range of the data, not just the range of the summary statistics. The arguments fun. You can use any summary function that takes a vector of numbers and returns a single numeric value: mean, median, min, max.
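Single-valued summary functions of this kind can be passed to stat_summary, as in the sketch below. It assumes a recent ggplot2 in which the arguments are named fun, fun.min and fun.max (older releases used different names), and it uses the mpg data with class and hwy purely as an illustrative choice.

```r
library(ggplot2)

# Raw data in grey, with the median per class drawn on top and
# the minimum and maximum giving the extent of the interval.
p <- ggplot(mpg, aes(class, hwy)) +
  geom_point(colour = "grey60") +
  stat_summary(fun = median, fun.min = min, fun.max = max, colour = "red")
p

# The same three summaries computed by hand, for comparison.
by_hand <- aggregate(
  hwy ~ class,
  data = mpg,
  FUN = function(v) c(ymin = min(v), y = median(v), ymax = max(v))
)
by_hand
```

Note that the y axis still covers the full range of the raw points, because the grey point layer is part of the plot.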
You can also write your own summary function. This summary function should return a named vector as output, as shown in the following example. When annotating your plot with additional labels, the important thing to remember is that these annotations are just extra data.
There are two basic ways to add annotations: one at a time, or many at once. Adding one at a time works best for small numbers of annotations with varying aesthetics.
You just set all the values to give the desired properties. If you have multiple annotations with similar properties, it may make sense to put them all in a data frame and add them at once. The example below demonstrates both approaches by adding information about presidents to economic data. However, pulling out just a few observations using subset can be very useful. Typically you will want to label outliers or other important points.
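Both approaches can be sketched with the economics and presidential datasets that ship with ggplot2. This is a hedged illustration rather than the original listing: the choice of the unemploy variable, the label text, and the sizes and transparency values are all assumptions.

```r
library(ggplot2)

# Keep only the presidencies that start within the span of the economic data.
pres <- subset(presidential, start > min(economics$date))

p <- ggplot(economics, aes(date, unemploy)) +
  # Many annotations at once: one shaded rectangle and one label per
  # president, all drawn from the same data frame.
  geom_rect(
    data = pres, aes(xmin = start, xmax = end, fill = party),
    ymin = -Inf, ymax = Inf, alpha = 0.2, inherit.aes = FALSE
  ) +
  geom_text(
    data = pres, aes(x = start, label = name),
    y = min(economics$unemploy), hjust = 0, vjust = 0, size = 3,
    inherit.aes = FALSE
  ) +
  geom_line() +
  # One annotation at a time: every value is set by hand.
  annotate(
    "text", x = min(economics$date), y = max(economics$unemploy),
    label = "Unemployment", hjust = 0, vjust = 1
  )
p
```

The rectangles are added before the line so that the data stays visible on top of the background annotation.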
All these geoms have an arrow parameter, which allows you to place an arrowhead on the line. You create arrowheads with the arrow function, which has arguments angle, length, ends and type. When you have aggregated data where each row in the dataset represents multiple observations, you need some way to take into account the weighting variable. We will use some data collected on Midwest states in the US census.
The data consists mainly of percentages e. There are two aesthetic attributes that can be used to adjust for weights. Firstly, for simple geoms like lines and points, you can make the size of the grob proportional to the number of points, using the size aesthetic, as with the following code, whose results are shown in Figure 5. These weights will be passed on to the statistical summary function.
Weights are supported for every case where it makes sense: smoothers, quantile regressions, boxplots, histograms, and density plots. No weighting (left), weighting by population (centre) and by area (right). When we weight a histogram or density plot by total population, we change from looking at the distribution of the number of counties, to the distribution of the number of people. The unweighted histogram shows number of counties, while the weighted histogram shows population.
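The size and weight aesthetics can be sketched with the midwest data that ships with ggplot2. The variables plotted here (percwhite, percbelowpoverty, poptotal), the linear smoother and the bin width are assumptions chosen for illustration.

```r
library(ggplot2)

# Assumed variables: percentage white against percentage below poverty.
base <- ggplot(midwest, aes(percwhite, percbelowpoverty))

# Size aesthetic: point size proportional to county population (millions).
p_size <- base + geom_point(aes(size = poptotal / 1e6))

# Weight aesthetic: a linear smoother without and with population weights.
p_lm  <- base + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p_wlm <- base + geom_point() +
  geom_smooth(aes(weight = poptotal), method = "lm", formula = y ~ x)

# Histograms: unweighted (counties) and weighted by population (people).
h_counties <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1)
h_people <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal), binwidth = 1)

# Bar totals: number of counties versus total population.
n_counties <- sum(layer_data(h_counties)$count)
n_people   <- sum(layer_data(h_people)$count)
c(counties = n_counties, people = n_people)
```

Comparing the two totals makes the change of unit explicit: the bars of the weighted histogram add up to people, not counties.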
The weighting considerably changes the interpretation!
Chapter 6.
Scales control the mapping from data to aesthetics. They take your data and turn it into something that you can perceive visually: e. Scales also provide the tools you use to read the plot: the axes and legends (collectively known as guides). Formally, each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale).
The domain of each scale corresponds to the range of the variable supplied to the scale, and can be continuous or discrete, ordered or unordered. The range consists of the concrete aesthetics that you can perceive and that R can understand: position, colour, shape, size and line type. If you blinked when you read that scales map data both to position and colour, you are not alone.
The notion that the same kind of object is used to map data to positions and symbols strikes some people as unintuitive. However, you will see the logic and power of this notion as you read further in the chapter. The process of scaling takes place in three steps, transformation, training and mapping, and is described in Section 6. Without a scale, there is no way to go from the data to aesthetics, so a scale is required for every aesthetic used on the plot. It would be tedious to manually add a scale every time you used a new aesthetic, so whenever a scale is needed ggplot2 will add a default.
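To make the default scales visible, the sketch below writes them out by hand. It assumes the mpg data with a continuous x and y and a discrete colour variable, and it assumes standard settings, under which scale_colour_hue is the default for discrete colours. The legend title is an illustrative choice.

```r
library(ggplot2)

# ggplot2 adds a scale for x, y and colour automatically.
p_default <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

# The same plot with the scales added explicitly. Under the assumptions
# above this should look the same as p_default.
p_explicit <- p_default +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_hue()

# Overriding a default: same kind of scale, different legend title.
p_titled <- p_default + scale_colour_hue("Vehicle class")

p_default
p_titled
```

Adding a scale by hand replaces the default for that aesthetic, which is the usual entry point for changing titles, breaks or colours.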
You can generate many plots without knowing how scales work, but understanding scales and learning how to manipulate them will give you much more control. Default scales and how to override them are described in Section 6. Scales can be roughly divided into four categories: position scales, colour scales, the manual discrete scale and the identity scale. The common options and most important uses are described in Section 6.
The section focusses on giving you a high-level overview of the options available, rather than expanding on every detail in depth. Details about individual parameters are included in the online documentation. The other important role of each scale is to produce a guide that allows the viewer to perform the inverse mapping, from aesthetic space to data space.
For position aesthetics, the axes are the guides; for all other aesthetics, legends do the job.
We will just focus on ggplot2 here, as it is vastly superior to base graphics. Let's talk about some basic nuts and bolts that are useful to know.
The ggplot2 package [6] stands out because of its conceptually different approach to visualization. The package is based on the theory of the grammar of graphics [7], from which it derives its name. It provides a powerful model of graphics that makes it easy to produce complex multilayered graphics.
The underlying plan of the creator of ggplot2, Hadley Wickham, was based on the
When data visualization requires more advanced or elaborate graphics, other packages can be used. Sometimes the use of more proficient images comes from the need of sharing a beautiful display in order to
This book contains 6 parts providing step-by-step guides to easily create beautiful graphics using the R package ggplot2.
This book presents the essentials of ggplot2 to easily create beautiful graphics in R. Key features:
- Covers the most important graphic functions
- Short, self-contained chapters with practical examples
When data is presented to you in a graphical or pictorial format, you can analyze it more effectively.
This book begins by introducing you to basic concepts, such as grammar of graphics and geometric objects. We were confident in doing so,