# Visualizing Data {#sec-ggplot2}
```{r}
#| echo: false
#| output: false
library(tidyverse)
library(patchwork)
library(kableExtra)
```
We continue the development of your data analysis toolbox with data visualization. By visualizing data, we gain valuable insights we couldn't easily obtain from just looking at the raw data values or even the summaries we generated in @sec-dplyr. To visualize our data, we'll be using the `ggplot2` package, as it provides an easy way to customize your plots. `ggplot2` is rooted in the data visualization theory known as _the grammar of graphics_ [@wilkinson2005], developed by Leland Wilkinson.
At their most basic, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way to explore the patterns in data, such as the *distribution* of individual variables and *relationships* between groups of variables. Graphics are designed to emphasize the findings and insights you want your audience to understand. This does, however, require a balancing act. On the one hand, you want to highlight as many interesting findings as possible. On the other hand, you don't want to include so much information that it overwhelms your audience.
As we will see, plots also help us to identify patterns in our data. We'll see that a common extension of these ideas is to compare the *distribution* of one numerical variable (for example, the center and spread of its values) across the levels of a different categorical variable.
## The grammar of graphics {#grammarofgraphics}
We start with a discussion of a theoretical framework for data visualization known as "the grammar of graphics." This framework serves as the foundation for the `ggplot2` package which we'll use extensively in this chapter. In @sec-dplyr, we saw how dplyr provides a "grammar" of data manipulation, a grammar which is made up of several "verbs" (functions like `filter` and `mutate`). Similar to dplyr's grammar of data manipulation, ggplot2 provides a grammar of graphics that defines a set of rules for constructing *statistical graphics* by combining different types of *layers*. This grammar was created by Leland Wilkinson [@wilkinson2005] and has been implemented in a variety of data visualization software platforms like R, but also [Plotly](https://plot.ly/) and [Tableau](https://www.tableau.com/).
### Components of the grammar
In short, the grammar tells us that:
> **A statistical graphic is a `mapping` of `data` variables to `aes`thetic attributes of `geom`etric objects.**
Specifically, we can break a graphic into the following three essential components:
1. `data`: the dataset containing the variables of interest.
1. `geom`: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.
1. `aes`: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are *mapped* to variables in the data set.
You might be wondering why we wrote the terms `data`, `geom`, and `aes` in a computer code type font. We'll see very shortly that we'll specify the elements of the grammar in R using these terms. However, let's first break down the grammar with an example.
### An initial example
Let's take another look at our `nba` data set, this time via the grammar of graphics. Let's specifically take a look at how things have changed over the years. As always, we need to load the data we downloaded back in @sec-dataactivities. Run the following:
```{r}
#| output: false
nba <- read_csv("./data/nba_all_seasons.csv", na = c("Undrafted"))
```
We need to do a bit of work before we can use `season` as a measure of time. This is because the `season` column is currently stored as a character vector, with values such as "2000-01" and "2011-12". So we need to do two things: first, trim each value so that we retain only its first 4 characters; second, convert the trimmed character values to their corresponding integer values. We'll do the trimming using the `stringr` package, yet another package that is part of the tidyverse.
```{r}
nba <- nba %>%
  # keep the first 4 characters of `season`, e.g. "2000-01" becomes "2000"
  mutate(season_int = str_sub(season, start = 1, end = 4)) %>%
  # convert to integer
  mutate(season_int = as.integer(season_int))
```
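To see what these two steps do in isolation, here is a minimal sketch on a small made-up vector. The names `season_chr`, `season_start`, and `season_start_int` are ours and are used only for illustration. The sketch assumes the `stringr` function `str_sub()` for the trimming step; base R's `substr()` would do the same job.

```{r}
library(stringr)

# Hypothetical season labels in the same "YYYY-YY" format as `season`
season_chr <- c("2000-01", "2011-12")

# Step 1: keep only the first 4 characters of each label
season_start <- str_sub(season_chr, start = 1, end = 4)

# Step 2: convert the resulting character values to integers
season_start_int <- as.integer(season_start)
season_start_int
```

Note that the intermediate result of step 1 is still a character vector; it is only after `as.integer()` that the values can be treated as numbers on an axis.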
Now that we have time represented as a numeric column in our data set, we can plot how things have changed across seasons.
```{r}
#| echo: false
#| warning: false
ggplot(
  data =
    nba %>%
      group_by(team_abbreviation, season_int) %>%
      summarize(
        m_pts = mean(pts),
        m_ast = mean(ast),
        team = first(team_abbreviation),
        season_int = first(season_int)
      ) %>%
      filter(team %in% c("LAL", "NYK", "PHX")) %>%
      arrange(season_int),
  aes(x = season_int, y = m_pts, color = team)
) +
  geom_line() +
  geom_point(aes(size = m_ast))
```
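Let's view this plot through the grammar of graphics. Each point summarizes one team in one season: the plotted `pts` and `ast` values are that team's means for the season. We have actually used two types of `geom`etric object here: a line object and a point object. The point object provides the small circular data points. The line object provides the line segments connecting the points. The variables are mapped as follows: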
1. The `data` variable **season_int** gets mapped to the `x`-position `aes`thetic of the lines and the points.
1. The `data` variable **pts** gets mapped to the `y`-position `aes`thetic of the lines and the points.
1. The `data` variable **team** gets mapped to the `color` `aes`thetic of the lines and the points.
1. The `data` variable **ast** gets mapped to the `size` `aes`thetic of the points.
That being said, this is just an example. Plots can specify points, lines, bars, and a variety of other geometric objects.
Let's summarize the three essential components of the grammar.
| geom | aes | data variable |
|-------|-------|---------------|
| line | x | season_int |
| line | y | pts |
| line | color | team |
| point | x | season_int |
| point | y | pts |
| point | color | team |
| point | size | ast |
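Although we introduce the relevant functions only later in this chapter, the rows of this table translate quite directly into `ggplot2` code. The following is only a sketch on a tiny made-up data frame (`toy_teams` and all of its values are hypothetical, not taken from `nba`). Aesthetics shared by both geoms are given once in `ggplot()`, while `size`, which applies only to the points, is given in `geom_point()`.

```{r}
library(ggplot2)

# Hypothetical team-by-season summaries, for illustration only
toy_teams <- data.frame(
  season_int = c(2000, 2001, 2002, 2000, 2001, 2002),
  pts = c(8.1, 8.4, 7.9, 7.2, 7.8, 8.0),
  ast = c(1.9, 2.1, 1.8, 1.5, 1.7, 2.2),
  team = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB")
)

grammar_sketch <- ggplot(
  data = toy_teams,
  # x, y, and color apply to both the line and the point geoms
  mapping = aes(x = season_int, y = pts, color = team)
) +
  geom_line() +
  # size applies to the point geom only
  geom_point(mapping = aes(size = ast))

grammar_sketch
```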
### Other components
There are other components of the grammar of graphics we can control as well. As you start to delve deeper into the grammar of graphics, you'll start to encounter these topics more frequently. In this book, we'll keep things simple and only work with these two additional components:
- `facet`ing breaks up a plot into several plots split by the values of another variable
- `position` adjustments for barplots
Other more complex components like `scales` and `coord`inate systems are left for a more advanced text such as [*R for Data Science*](https://r4ds.hadley.nz/layers.html#aesthetic-mappings). Generally speaking, the grammar of graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifying them.
### ggplot2 package
In this book, we will use the `ggplot2` package for data visualization, which is an implementation of the `g`rammar of `g`raphics for R. As we noted earlier, a lot of the previous section was written in a computer code type font. This is because the various components of the grammar of graphics are specified in the `ggplot()` function included in the `ggplot2` package. For the purposes of this book, we'll always provide the `ggplot()` function with the following arguments (i.e., inputs) at a minimum:
* The data frame where the variables exist: the `data` argument.
* The mapping of the variables to aesthetic attributes: the `mapping` argument which specifies the `aes`thetic attributes involved.
After we've specified these components, we then add *layers* to the plot using the `+` sign. The most essential layer to add to a plot is the layer that specifies which type of `geom`etric object we want the plot to involve: points, lines, bars, and others. Other layers we can add to a plot include the plot title, axes labels, visual themes for the plots, and facets (which we'll see in @sec-facets).
To stress the importance of adding the layer specifying the `geom`etric object, consider @fig-nolayers where no layers are added. Because the `geom`etric object was not specified, we have a blank plot which is not very useful!
```{r}
#| label: fig-nolayers
#| fig-cap: Empty plot of assists versus points
ggplot(data = nba, mapping = aes(x = pts, y = ast))
```
Let's next look at the three common ways of calling `ggplot()`:
1. `ggplot(data = df, mapping = aes(...))`
1. `ggplot(data = df)`
1. `ggplot()`
The first of these is likely to be the most common usage pattern. We pass in a `data`frame and `aes`thetics. The data and the aesthetics will both be used for all layers that you add to the plot. In the second case, we pass in a dataframe, but omit the aesthetics. This pattern can be useful if each of your layers relies on the same data, but the aesthetics vary from one layer to another. Finally, we have a relatively less common method of calling `ggplot()` in which we provide neither data nor aesthetics. This can be useful if you wish each of your layers to both a) refer to different data and b) use different aesthetics. As we will see, passing aesthetics to `ggplot()` is not an all-or-nothing choice. You can pass some aesthetics to `ggplot()` and they will be applied to all layers. You can then pass additional aesthetics to be applied to individual layers.
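As a rough illustration of these three calling patterns, the sketch below builds the same scatterplot three times. The data frame `toy` and its values are made up for this example; `layer_data()` is a `ggplot2` helper that returns the data a layer ends up drawing, which we use here only to check that the three plots agree.

```{r}
library(ggplot2)

# Hypothetical toy data used only for illustration
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))

# Pattern 1: data and aesthetics supplied once and inherited by every layer
p1 <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point()

# Pattern 2: data supplied once; the layer supplies its own aesthetics
p2 <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y))

# Pattern 3: the layer supplies both its data and its aesthetics
p3 <- ggplot() +
  geom_point(data = toy, mapping = aes(x = x, y = y))

# All three layers draw the same x and y positions
identical(layer_data(p1)$x, layer_data(p3)$x)
identical(layer_data(p2)$y, layer_data(p3)$y)
```

With a single layer the three patterns are interchangeable; the choice starts to matter once several layers need different data or different aesthetics.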
Let's put the theory of the grammar of graphics into practice.
## Five named graphs - the 5NG {#sec-FiveNG}
In order to keep things simple in this book, we will only focus on five different types of graphics, each with a commonly given name. We term these "five named graphs" or in abbreviated form, the **5NG**:
1. scatterplots
1. linegraphs
1. histograms
1. boxplots
1. barplots
We'll also present some variations of these plots, but with this basic repertoire of five graphics in your toolbox, you can visualize a wide array of different variable types. Note that certain plots are only appropriate for categorical variables, while others are only appropriate for numerical variables.
## 5NG#1: Scatterplots {#sec-scatterplots}
The simplest of the 5NG are *scatterplots*. They allow you to visualize the *relationship* between two numerical variables. While you may already be familiar with scatterplots, let's view them through the lens of the grammar of graphics. Specifically, we will visualize the relationship between the following two numerical variables in the `nba` data frame:
1. `pts`: average points per game each player scored
1. `ast`: average number of assists per game each player made
### Scatterplots via `geom_point` {#sec-geompoint}
Let's now go over the code that will create the desired scatterplot, while keeping in mind the grammar of graphics framework we introduced above. Let's take a look at the code and break it down piece-by-piece.
```{r}
#| eval: false
ggplot(data = nba, mapping = aes(x = pts, y = ast)) +
  geom_point()
```
Within the `ggplot()` function, we specify two of the components of the grammar of graphics as arguments (i.e., inputs):
1. The `data` as the `nba` data frame via `data = nba`.
1. The `aes`thetic `mapping` by setting `mapping = aes(x = pts, y = ast)`. Specifically, the variable `pts` maps to the `x` position aesthetic, whereas the variable `ast` maps to the `y` position.
We then add a layer to the `ggplot()` function call using the `+` sign. The added layer in question specifies the third component of the grammar: the `geom`etric object. In this case, the geometric object is set to be points by specifying `geom_point()`. After running these two lines of code in your console, you'll notice two outputs: a warning message and this graphic.
```{r}
#| label: fig-scatter
#| echo: false
#| fig-cap: Assists versus points
ggplot(data = nba, mapping = aes(x = pts, y = ast)) +
  geom_point()
```
Let's first unpack the graphic in @fig-scatter. Observe that a *positive relationship* exists between `pts` and `ast`: as the number of points increases, the number of assists also tends to increase. Observe also the large mass of points clustered near (0, 0), the point indicating players with no points and no assists (e.g., what would be expected from a player who doesn't play very much).
Before we continue, let's make a few more observations about this code that created the scatterplot. Note that the `+` sign comes at the end of lines, and not at the beginning. You'll get an error in R if you put it at the beginning of a line. When adding layers to a plot, you are encouraged to start a new line after the `+` (by pressing the Return/Enter button on your keyboard) so that the code for each layer is on a new line. As we add more and more layers to plots, you'll see this will greatly improve the legibility of your code.
### Overplotting {#overplotting}
The large mass of points near (0, 0) in @fig-scatter can cause some confusion since it is hard to tell the true number of points that are actually in this lower corner. This is the result of a phenomenon called *overplotting*. As one may guess, this corresponds to points being plotted on top of each other over and over again. When overplotting occurs, it is difficult to know the number of points being plotted. There are two methods to address the issue of overplotting:
1. Adjusting the transparency of the points or
1. Adding a little random "jitter", or random "nudges", to each of the points.
**Method 1: Changing the transparency**
The first way of addressing overplotting is to change the transparency/opacity of the points by setting the `alpha` argument in `geom_point()`. We can change the `alpha` argument to be any value between `0` and `1`, where `0` sets the points to be 100% transparent and `1` sets the points to be 100% opaque. By default, `alpha` is set to `1`. In other words, if we don't explicitly set an `alpha` value, R will use `alpha = 1`.
Note how the following code is identical to the code in @sec-scatterplots that created the scatterplot with overplotting, but with `alpha = 0.05` added to the `geom_point()` function:
```{r}
#| label: fig-scatter-alpha
#| fig-cap: Plot of assists versus points with transparency
ggplot(data = nba, mapping = aes(x = pts, y = ast)) +
  geom_point(alpha = 0.05)
```
The key feature to note in @fig-scatter-alpha is that the transparency of the points is cumulative: areas with a high degree of overplotting are darker, whereas areas with a lower degree are less dark. Note furthermore that there is no `aes()` surrounding `alpha = 0.05`. This is because we are not mapping a variable to an aesthetic attribute, but rather merely changing the default setting of `alpha`. In fact, if you change the second line to read `geom_point(aes(alpha = 0.05))`, `ggplot2` will treat `0.05` as a data value to be mapped to transparency through a scale (and will add a legend for it) rather than as the transparency level itself, so the plot will not look as intended.
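The difference between *setting* an aesthetic to a fixed value and *mapping* a variable to it can be seen in a small sketch. The data frame `toy` and its column `w` are made up for this example, and we use the `ggplot2` helper `layer_data()` to inspect the transparency each point ends up with.

```{r}
library(ggplot2)

# Hypothetical toy data: `w` is a made-up variable used only for the mapping case
toy <- data.frame(x = c(1, 1, 2, 3), y = c(1, 1, 2, 3), w = c(1, 2, 3, 4))

# Setting: one fixed transparency for all points, written outside aes()
p_set <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(alpha = 0.5)

# Mapping: transparency varies with the data variable `w`, written inside aes()
p_map <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(mapping = aes(alpha = w))

# The alpha value each point ends up with in each plot
layer_data(p_set)$alpha
layer_data(p_map)$alpha
```

In the first plot every point shares the value we set; in the second, each distinct value of `w` receives its own transparency, chosen by a scale.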
**Method 2: Jittering the points**
The second way of addressing overplotting is by *jittering* all the points. This means giving each point a small "nudge" in a random direction. You can think of "jittering" as shaking the points around a bit on the plot. Let's illustrate using a simple example first. Say we have a data frame with 4 identical rows of x and y values: (0,0), (0,0), (0,0), and (0,0). In @fig-jitter-example-plot-1, we present both the regular scatterplot of these 4 points (on the left) and its jittered counterpart (on the right).
```{r}
#| label: fig-jitter-example-plot-1
#| echo: false
#| fig-cap: Regular and jittered scatterplots
jitter_example <- tibble(x = rep(0, 4),
                         y = rep(0, 4))
jittered_plot_1 <-
  ggplot(data = jitter_example, mapping = aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(xlim = c(-0.025, 0.025),
                  ylim = c(-0.025, 0.025)) +
  labs(title = "Regular scatterplot")
jittered_plot_2 <-
  ggplot(data = jitter_example, mapping = aes(x = x, y = y)) +
  geom_jitter(width = 0.01, height = 0.01) +
  coord_cartesian(xlim = c(-0.025, 0.025),
                  ylim = c(-0.025, 0.025)) +
  labs(title = "Jittered scatterplot")
jittered_plot_1 + jittered_plot_2
```
In the left scatterplot, observe that the 4 points are superimposed on top of each other. While we know there are 4 values being plotted, this fact might not be apparent to others. In the right scatterplot, the points are jittered and it is now plainly evident that this plot involves four points since each point is given a random "nudge."
Keep in mind, however, that jittering is strictly a visualization tool; even after creating a jittered scatterplot, the original values saved in the data frame remain unchanged.
To create a jittered scatterplot, instead of using `geom_point()`, we use `geom_jitter()`. Observe how the following code is very similar to the code that created @fig-scatter, but with `geom_point()` replaced with `geom_jitter()`.
```{r}
#| label: fig-jitter-example-plot-2
#| fig-cap: Assists versus points jittered scatterplot
ggplot(data = nba, mapping = aes(x = pts, y = ast)) +
  geom_jitter(width = 5, height = 5)
```
In order to specify how much jitter to add, we adjusted the `width` and `height` arguments to `geom_jitter()`. This corresponds to how hard you'd like to shake the plot in horizontal x-axis units and vertical y-axis units, respectively. In this case, both axes are in per-game averages (points per game, assists per game). How much jitter should we add using the `width` and `height` arguments? On the one hand, it is important to add just enough jitter to break any overlap in points, but on the other hand, not so much that we completely alter the original pattern in points.
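To make the roles of `width` and `height` concrete, here is a small sketch using four identical made-up observations (the data frame `stacked` is hypothetical). We again use `layer_data()` to look at the positions that are actually drawn, and then confirm that the data frame itself has not changed.

```{r}
library(ggplot2)

# Four identical observations (hypothetical)
stacked <- data.frame(x = rep(0, 4), y = rep(0, 4))

jittered <- ggplot(data = stacked, mapping = aes(x = x, y = y)) +
  geom_jitter(width = 0.01, height = 0.01)

# Positions actually drawn: each is nudged by at most `width` horizontally
# and at most `height` vertically, in either direction
drawn <- layer_data(jittered)
range(drawn$x)
range(drawn$y)

# The data frame itself is untouched
stacked
```

Because the nudges are random, the drawn positions will differ from one run to the next unless you fix the random seed.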
As can be seen in the resulting @fig-jitter-example-plot-2, in this case jittering doesn't really provide much new insight. In this particular case, it can be argued that changing the transparency of the points by setting `alpha` proved more effective. When would it be better to use a jittered scatterplot? When would it be better to alter the points' transparency? There is no single right answer that applies to all situations. You need to make a subjective choice and own that choice. At the very least when confronted with overplotting, however, we suggest you make both types of plots and see which one better emphasizes the point you are trying to make.
### Summary
Scatterplots display the relationship between two numerical variables. They are among the most commonly used plots because they can provide an immediate way to see the trend in one numerical variable versus another. However, if you try to create a scatterplot where either one of the two variables is not numerical, you might get strange results. Be careful!
With medium to large datasets, you may need to play around with the different modifications to scatterplots we saw such as changing the transparency/opacity of the points or by jittering the points. This tweaking is often a fun part of data visualization, since you'll have the chance to see different relationships emerge as you tinker with your plots.
## 5NG#2: Linegraphs {#sec-linegraphs}
The next of the five named graphs are linegraphs. Linegraphs show the relationship between two numerical variables when the variable on the x-axis is ordinal, that is, when there is an inherent ordering to its values.
The most common examples of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is naturally ordinal, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called *time series* plots.
### Linegraphs via `geom_line` {#sec-geomline}
Let's create a linegraph to visualize a single NBA player's number of points scored across seasons. To do so, we'll use `geom_line()`, instead of using `geom_point()` as we did for the scatterplots above:
```{r}
#| fig-cap: Stephen Curry points over time
ggplot(
  data = nba %>%
    filter(player_name == "Stephen Curry"),
  mapping = aes(x = season_int, y = pts)
) +
  geom_line()
```
Let's break down this code piece-by-piece in terms of the grammar of graphics. Within the `ggplot()` function call, we specify two of the components of the grammar of graphics as arguments:
1. The `data`. Here we have provided a filtered version of our `nba` data set, selecting only those rows where `player_name == "Stephen Curry"`.
1. The `aes`thetic `mapping` by setting `mapping = aes(x = season_int, y = pts)`. Specifically, the variable `season_int` maps to the `x` position aesthetic, whereas the variable `pts` maps to the `y` position aesthetic.
We add a layer to the `ggplot()` function call using the `+` sign. The layer in question specifies the third component of the grammar: the `geom`etric object in question. In this case, the geometric object is a `line` set by specifying `geom_line()`.
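One detail worth seeing in action is that `geom_line()` connects observations in order of the variable on the x-axis, not in the order of the rows in the data frame. The sketch below uses a made-up player (`toy_player` and its values are hypothetical) whose seasons are deliberately listed out of order, and adds a `geom_point()` layer so the individual observations remain visible.

```{r}
library(ggplot2)

# Hypothetical per-season averages for one made-up player, rows out of order
toy_player <- data.frame(
  season_int = c(2012, 2010, 2013, 2011),
  pts = c(22.9, 18.6, 24.0, 14.7)
)

toy_line <- ggplot(data = toy_player, mapping = aes(x = season_int, y = pts)) +
  geom_line() +
  geom_point()

toy_line

# The x values of the line layer, in the order in which they are connected
layer_data(toy_line, 1)$x
```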
### Summary
Linegraphs, just like scatterplots, display the relationship between two numerical variables. However, it is preferred to use linegraphs over scatterplots when the variable on the x-axis (i.e., the explanatory variable) has an inherent ordering, such as some notion of time.
## 5NG#3: Histograms {#sec-histograms}
Let's consider the `pts` variable in the `nba` data frame once again, but unlike with the linegraphs in @sec-linegraphs, let's say we don't care about its relationship with time, but rather we only care about how the values of `pts` *distribute*. In other words:
1. What are the smallest and largest values?
1. What is the "center" or "most typical" value?
1. How do the values spread out?
1. What are frequent and infrequent values?
One way to visualize the *distribution* of the single variable `pts` is to plot its values on a horizontal line as we do in @fig-pts-on-line:
```{r}
#| label: fig-pts-on-line
#| echo: false
#| fig-cap: Plot of players' points-per-game averages.
ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts, y = factor("A"))
) +
  geom_point(alpha = .01) +
  theme(
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank()
  )
hist_title <- "Plot of players' points per-game point averages."
```
This gives us a bit of an idea of how the values of `pts` are distributed: note that values range from zero to approximately 20. In addition, there appear to be more values falling between approximately 3 and 10 than there are values falling above this range. However, because of the high degree of overplotting in the points, it's hard to get a sense of exactly how many values are between 5 and 10.
What is commonly produced instead of @fig-pts-on-line is known as a *histogram*. A histogram is a plot that visualizes the *distribution* of a numerical variable as follows:
1. We first cut up the x-axis into a series of *bins*, where each bin represents a range of values.
1. For each bin, we count the number of observations that fall in the range corresponding to that bin.
1. Then for each bin, we draw a bar whose height marks the corresponding count.
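The first two of these steps can be sketched by hand in base R. The values in `toy_pts` are made up and chosen so that none falls exactly on a bin edge; `cut()` assigns each value to a bin and `table()` counts the values per bin. The resulting counts are the bar heights that a histogram would draw.

```{r}
# Hypothetical points-per-game averages, for illustration only
toy_pts <- c(1.5, 3.2, 3.9, 4.4, 6.1, 7.7, 11.0)

# Step 1: cut the axis into bins of width 2
bin_edges <- seq(from = 0, to = 12, by = 2)
toy_bins <- cut(toy_pts, breaks = bin_edges)

# Step 2: count the observations falling in each bin (empty bins count as 0)
bin_counts <- table(toy_bins)
bin_counts
```

Here the bins from 0 to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10, and 10 to 12 receive 1, 2, 1, 2, 0, and 1 observations, respectively.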
Let's drill down on an example of a histogram, shown in @fig-histogramexample.
```{r}
#| label: fig-histogramexample
#| echo: false
#| fig-cap: Example histogram.
ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts)
) +
  geom_histogram(binwidth = 2,
                 boundary = 70,
                 color = "white")
```
Let's focus only on values between 10 points and 20 points for now. Observe that there are five bins of equal width in this range, each 2 points wide: one bin for the 10-12 range, another bin for the 12-14 range, and so on.
1. The bin for the 10-12 range has a height of around 150. In other words, around 150 players averaged between 10 and 12 points per game.
1. The bin for the 12-14 range has a height of around 100. In other words, around 100 players averaged between 12 and 14 points per game.
1. The bin for the 14-16 range has a height of around 50. In other words, around 50 players averaged between 14 and 16 points per game.
1. And so on...
### Histograms via `geom_histogram` {#sec-geomhistogram}
Let's now present the `ggplot()` code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in `aes()`: the single numerical variable `pts`. The y-aesthetic of a histogram, the count of the observations in each bin, gets computed for you automatically. Furthermore, the geometric object layer is now a `geom_histogram()`. After running the following code, you'll see the histogram in @fig-pts-histogram as well as a warning message. We'll discuss the warning message first.
```{r}
#| label: fig-pts-histogram
#| warning: true
#| fig-cap: "Histogram of average pts per game."
ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts)
) +
  geom_histogram()
```
The warning is telling us that the histogram was constructed using `bins = 30` for 30 equally spaced bins. This is the default value for this argument (see @sec-functionarguments). We'll see in the next section how to change the number of bins to a value other than the default.
Now let's unpack the resulting histogram in @fig-pts-histogram. Observe that values greater than 20 are rather rare. However, because of the large number of bins, it's hard to get a sense for which range of points is spanned by each bin; everything is one giant amorphous blob. So let's add white vertical borders demarcating the bins by adding a `color = "white"` argument to `geom_histogram()` and ignore the warning about setting the number of bins to a better value:
```{r}
#| echo: false
#| label: fig-pts-histogram2
#| fig-cap: "Histogram of average pts per game."
ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts)
) +
  geom_histogram(color = "white")
```
Now it's slightly easier to associate each of the bins in @fig-pts-histogram2 with a specific range of points. We can also control the color of the bars by setting the `fill` argument. For example, you can set the bin colors to be "steel blue" by setting `fill = "steelblue"`:
```{r}
#| label: fig-pts-histogram-color
#| fig-cap: "Histogram of average pts per game."
ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts)
) +
  geom_histogram(color = "white", fill = "steelblue")
```
If you're curious, run `colors()` to see all `r colors() %>% length()` possible choices of colors in R!
### Adjusting the bins {#adjustbins}
Observe in @fig-pts-histogram-color that in the 10-20 range there appear to be roughly 11 bins. Thus, each bin has width 20-10 divided by 11, or 0.91 points, which is not a very easily interpretable range to work with. Let's improve this by adjusting the number of bins in our histogram in one of two ways:
1. By adjusting the number of bins via the `bins` argument to `geom_histogram()`.
1. By adjusting the width of the bins via the `binwidth` argument to `geom_histogram()`.
Using the first method, we control how many bins we would like to chop the x-axis up into and let ggplot figure out what the bin widths should be. As mentioned in the previous section, the default number of bins is 30. We can override this default. Let's set it to 25 bins instead:
```{r}
#| eval: false
ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts)
) +
  geom_histogram(bins = 25, color = "white")
```
Using the second method, we control the width of the bins and let ggplot2 figure out how many bins should be used. For example, let's set the width of each bin to be 5 points.
```{r}
#| eval: false
ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts)
) +
  geom_histogram(binwidth = 5, color = "white")
```
We compare both resulting histograms side-by-side in @fig-hist-bins.
```{r}
#| label: fig-hist-bins
#| fig-cap: "Setting histogram bins in two ways."
#| echo: false
hist_1 <- ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts)
) +
  geom_histogram(bins = 25, color = "white")
hist_2 <- ggplot(
  data = nba %>%
    group_by(player_name) %>%
    summarize(pts = mean(pts)),
  mapping = aes(x = pts)
) +
  geom_histogram(binwidth = 5, color = "white")
hist_1 + hist_2
```
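If you'd like to check what either argument actually did, one option is to inspect the bars that were computed. The sketch below is an illustration on made-up data (`toy` holds 200 simulated values, not the `nba` averages). Each row returned by `layer_data()` describes one bar, with `xmin` and `xmax` giving the edges of its bin and `count` giving its height.

```{r}
library(ggplot2)

# Hypothetical values standing in for players' points-per-game averages
set.seed(1)
toy <- data.frame(pts = runif(200, min = 0, max = 30))

by_width <- ggplot(data = toy, mapping = aes(x = pts)) +
  geom_histogram(binwidth = 5, color = "white")

by_count <- ggplot(data = toy, mapping = aes(x = pts)) +
  geom_histogram(bins = 25, color = "white")

# With binwidth = 5, every bar spans 5 units of the x-axis
bars <- layer_data(by_width)
bars$xmax - bars$xmin

# The bar heights add up to the number of observations
sum(bars$count)

# With bins = 25, the number of rows is the number of bars that were drawn
nrow(layer_data(by_count))
```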
### Summary
Histograms, unlike scatterplots and linegraphs, present information on only a single numerical variable. Specifically, they are visualizations of the distribution of the numerical variable in question.
## Facets {#sec-facets}
Before continuing with the next of the 5NG, let's briefly introduce a new concept called *faceting*. Faceting is used when we'd like to split a particular visualization by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.
For example, suppose we were interested in looking at how the histogram of players' points per game averages changed across seasons. We could "split" this histogram so that we had a separate histogram of `pts` for each of several values of `season_int`. We do this by adding a `facet_wrap(vars(season_int))` layer.
```{r}
#| warning: false
#| label: fig-facethistogram
#| fig-cap: "Faceted histogram of points per game."
ggplot(
data = nba %>%
group_by(player_name, season_int) %>%
summarize(pts = mean(pts), season_int = first(season_int)),
mapping = aes(x = pts)
) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(vars(season_int))
```
We can also specify the number of rows and columns in the grid by using the `nrow` and `ncol` arguments inside of \index{ggplot2!facet\_wrap()} `facet_wrap()`. For example, say we would like our faceted histogram to have 4 rows instead of 3. We simply add an `nrow = 4` argument to `facet_wrap(vars(season_int))`.
```{r}
#| warning: false
#| label: fig-facethistogram2
#| fig-cap: "Faceted histogram of points per game, arranged in 4 rows."
ggplot(
data = nba %>%
group_by(player_name, season_int) %>%
summarize(pts = mean(pts), season_int = first(season_int)),
mapping = aes(x = pts)
) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(vars(season_int), nrow = 4)
```
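If you would like to check how `nrow` arranges the panels without squinting at the figure, the following sketch may help. It uses a made-up data frame (`toy`) with a hypothetical grouping variable `group`, and it peeks at the layout table stored inside the object returned by `ggplot_build()`. That table is part of the internal structure of a built plot, so treat this as an exploratory trick rather than a stable interface.

```{r}
#| eval: false
library(ggplot2)

# Made-up data for illustration only: 4 groups with 25 values each
set.seed(1)
toy <- data.frame(
  value = rnorm(100),
  group = rep(c("a", "b", "c", "d"), each = 25)
)

p <- ggplot(data = toy, mapping = aes(x = value)) +
  geom_histogram(bins = 10, color = "white") +
  facet_wrap(vars(group), nrow = 2)

# One row per panel, along with the grid position of that panel
panel_layout <- ggplot_build(p)$layout$layout
panel_layout[, c("PANEL", "ROW", "COL", "group")]
```

With four groups and `nrow = 2`, the panels are arranged in a grid with two rows and two columns.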
## 5NG#4: Boxplots {#sec-boxplots}
Though faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves a similar goal is a *side-by-side boxplot*.
Let's again consider the distribution of points. For now, let's confine ourselves to the 1996 season to keep things simple.
```{r}
#| label: fig-1996
#| fig-cap: Points from 1996 represented as jittered points.
base_plot <- nba %>%
filter(season_int %in% c(1996)) %>%
ggplot(mapping = aes(x = factor(season_int), y = pts))
base_plot + geom_jitter(width = 0.1,
height = 0,
alpha = 0.3)
```
A boxplot is constructed from the information provided in the *five-number summary*:
```{r}
#| echo: false
min_pts <- min(filter(nba, season_int %in% c(1996))$pts)
max_pts <- max(filter(nba, season_int %in% c(1996))$pts)
quartiles <- filter(nba, season_int %in% c(1996)) %>%
pull(pts) %>%
quantile(probs = c(0.25, 0.5, 0.75)) %>%
round(0)
five_number_summary <- tibble(values = c(min_pts, quartiles, max_pts))
```
1. Minimum: `r five_number_summary$values[1]` points
1. First quartile (25th percentile): `r five_number_summary$values[2]` points
1. Median (second quartile, 50th percentile): `r five_number_summary$values[3]` points
1. Third quartile (75th percentile): `r five_number_summary$values[4]` points
1. Maximum: `r five_number_summary$values[5]` points
In the leftmost plot of @fig-1996-2, let's mark these 5 values with dashed horizontal lines on top of the actual data points. In the middle plot of @fig-1996-2, let's add the *boxplot*. In the rightmost plot of @fig-1996-2, let's remove the points and the dashed horizontal lines for clarity's sake.
```{r}
#| label: fig-1996-2
#| echo: false
#| fig-cap: "Building up a boxplot from individual data points"
boxplot_1 <- base_plot +
geom_hline(data = five_number_summary, aes(yintercept = values), linetype = "dashed") +
geom_jitter(width = 0.075,
height = 0.5,
alpha = 0.1)
boxplot_2 <- base_plot +
geom_boxplot() +
geom_hline(data = five_number_summary, aes(yintercept = values), linetype = "dashed") +
geom_jitter(width = 0.075,
height = 0.5,
alpha = 0.1)
boxplot_3 <- base_plot +
geom_boxplot()
boxplot_1 + boxplot_2 + boxplot_3
```
What the boxplot does is visually summarize the points by cutting them into *quartiles* at the dashed lines, so that the observations are split into four equally sized groups. Thus:
1. 25% of points fall below the bottom edge of the box, which is the first quartile of `r five_number_summary$values[2]` points. In other words, 25% of observations were below `r five_number_summary$values[2]` points.
1. 25% of points fall between the bottom edge of the box and the solid middle line, which is the median of `r five_number_summary$values[3]` points. Thus, 25% of observations were between `r five_number_summary$values[2]` points and `r five_number_summary$values[3]` points and 50% of observations were below `r five_number_summary$values[3]` points.
1. 25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of `r five_number_summary$values[4]` points. It follows that 25% of observations were between `r five_number_summary$values[3]` points and `r five_number_summary$values[4]` points and 75% of observations were below `r five_number_summary$values[4]` points.
1. 25% of points fall above the top edge of the box. In other words, 25% of observations were above `r five_number_summary$values[4]` points.
1. The middle 50% of points lie within the *interquartile range (IQR)* between the first and third quartile. Thus, the IQR for this example is `r five_number_summary$values[4]` - `r five_number_summary$values[2]` = `r (five_number_summary$values[4] - five_number_summary$values[2]) %>% round(3)` points. The interquartile range is one measure of a numerical variable's *spread*.
Furthermore, in the rightmost plot of @fig-1996-2, we see the *whiskers* of the boxplot. The whiskers stick out from either end of the box toward the minimum and maximum observations of `r five_number_summary$values[1]` and `r five_number_summary$values[5]` points, respectively. However, the whiskers don't always extend all the way to the smallest and largest observed values. Instead, they extend no more than 1.5 $\times$ the interquartile range from either end of the box. In this case, we see a small number of observations that lie more than 1.5 $\times$ `r (five_number_summary$values[4] - five_number_summary$values[2]) %>% round(3)` points = `r (1.5*(five_number_summary$values[4] - five_number_summary$values[2])) %>% round(3)` points above the top of the box. These observations are called *outliers*.
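The arithmetic behind the box, the whiskers, and the outliers can be sketched in a few lines of base R. The vector `x` below is made up for illustration (it is not taken from the `nba` data), and the names `lower_fence` and `upper_fence` are our own labels for the limits that the whiskers cannot cross.

```{r}
#| eval: false
# Made-up values for illustration only (not from the nba data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# First quartile, median, and third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])  # third quartile minus first quartile

# The whiskers may reach at most 1.5 x IQR beyond either end of the box
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Observations beyond the fences are drawn as outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Each whisker ends at the most extreme observation that is not an outlier
whisker_ends <- range(x[x >= lower_fence & x <= upper_fence])
```

Here the value 30 lies beyond the upper fence, so a boxplot of `x` would draw it as a separate point, and the upper whisker would stop at 9, the largest value that is not an outlier.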
### Boxplots via `geom_boxplot` {#sec-geomboxplot}
Let's now create a side-by-side boxplot of players' average points per game split by the different seasons as we did previously with the faceted histograms. We do this by mapping the `season_int` variable to the x-position aesthetic, the `pts` variable to the y-position aesthetic, and by adding a `geom_boxplot()` layer:
```{r}
#| warning: true
#| label: fig-badbox
#| fig-cap: "Invalid boxplot specification."
ggplot(data = nba, mapping = aes(x = season_int, y = pts)) +
geom_boxplot()
```
Observe in @fig-badbox that this plot does not provide information about points separated by season. The warning message clues us in as to why. It is telling us that we asked for the x-position aesthetic to be mapped to a "continuous", or numerical variable (i.e., `season_int`). Boxplots, however, require the x-position aesthetic to be mapped to a _categorical_ variable.
We can convert the numerical variable `season_int` into a `factor` categorical variable by using the `factor()` function. So after applying `factor(season_int)`, `season_int` is converted from numerical values (e.g., 1995, 1996, etc.) to categories. With these categories, `ggplot()` now knows how to work with this variable to produce the needed plot.
```{r}
#| warning: true
#| label: fig-seasonptsbox
#| fig-cap: "Side-by-side boxplot of average points per game split by season."
ggplot(data = nba, mapping = aes(x = factor(season_int), y = pts)) +
geom_boxplot()
```
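To see what `factor()` does on its own, outside of any plot, here is a small sketch using a made-up vector of seasons (the name `seasons` and its values are for illustration only).

```{r}
#| eval: false
# Made-up season values for illustration only
seasons <- c(1996, 1997, 1996, 1998)

is.numeric(seasons)      # TRUE: ggplot would treat this variable as continuous

seasons_fct <- factor(seasons)
is.factor(seasons_fct)   # TRUE: the variable is now categorical
levels(seasons_fct)      # the distinct categories: "1996" "1997" "1998"
```

Each distinct number becomes one level of the factor, and it is these levels that `geom_boxplot()` uses to decide how many boxes to draw.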
The resulting @fig-seasonptsbox shows 26 separate "box and whiskers" plots, each similar to the rightmost plot of @fig-1996-2, which showed only the data from 1996. Thus the different boxplots are shown "side-by-side". To reiterate:
* The "box" portions of the visualization represent the 1st quartile, the median (the 2nd quartile), and the 3rd quartile.
* The height of each box (the value of the 3rd quartile minus the value of the 1st quartile) is the interquartile range (IQR). It is a measure of the spread of the middle 50% of values, with longer boxes indicating more variability.
* The "whisker" portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25th percentile and greater than the 75th percentile, respectively. They're set to extend out no more than $1.5 \times IQR$ units away from either end of the boxes. We say "no more than" because the ends of the whiskers have to correspond to observed points per game averages. The length of these whiskers shows how the data outside the middle 50% of values vary, with longer whiskers indicating more variability.
* The dots representing values falling outside the whiskers are called *outliers*. These can be thought of as potentially anomalous ("out-of-the-ordinary") values.
It is important to keep in mind that the definition of an outlier is somewhat arbitrary and not absolute. In this case, they are defined by the length of the whiskers, which are no more than $1.5 \times IQR$ units long for each boxplot. Looking at this side-by-side plot we can easily compare scoring distributions across seasons by drawing imaginary horizontal lines across the plot. Furthermore, the heights of the boxes as quantified by the interquartile ranges are informative too; they tell us about variability, or spread, of the different players' averages recorded in a given season.
### Summary
Side-by-side boxplots provide us with a way to compare the distribution of a numerical variable across multiple values of another variable. One can see where the median falls across the different groups by comparing the solid lines in the center of the boxes.
To study the spread of a numerical variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points.
## 5NG#5: Barplots {#sec-geombar}
Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another commonly desired task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories within a categorical variable, also known as the *levels* of the categorical variable. Often the best way to visualize these different counts, also known as *frequencies*, is with barplots (also called barcharts).
One complication, however, is how your data is represented. Is the categorical variable of interest "pre-counted" or not? For example, the following code manually creates a data frame representing a collection of fruit: 3 apples and 2 oranges. Here, we have this data in a relatively "raw" form, with the "identity" (i.e., orange or apple) of each individual observation occupying a position (row) in the data frame.
```{r}
fruits <- tibble(fruit = c("apple", "apple", "orange", "apple", "orange"))
fruits
```
Here is a similar set of data, but we can see that this is a bit more "processed"; we have the frequencies reflecting how many times each type of fruit was observed:
```{r}
fruits_counted <- tibble(fruit = c("apple", "orange"),
number = c(3, 2))
fruits_counted
```
Both `fruits` and `fruits_counted` represent the same collection of fruit and both types of representation are common in all types of data analyses. Depending on how your categorical data is represented, you'll need to add a different type of `geom`etric layer to your `ggplot()` to create a barplot. Let's see how.
### Barplots via `geom_bar` or `geom_col`
Let's generate barplots using these two different representations of our fruit basket: 3 apples and 2 oranges. Using the `fruits` data frame, where the type of each observation is listed individually, we map the `fruit` variable to the x-position aesthetic and add a `geom_bar()` layer:
```{r}
#| label: fig-geombar
#| fig-cap: "Barplot when counts are not pre-counted."
ggplot(data = fruits, mapping = aes(x = fruit)) +
geom_bar()
```
When using the `fruits_counted` data frame, in which the fruit has been "pre-counted", we map the `fruit` variable to the x-position aesthetic (as above), but here we also map the `number` variable to the y-position aesthetic, and add a `geom_col()` layer instead.
```{r}
#| label: fig-geomcol
#| fig-cap: "Barplot when counts are pre-counted."
ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
geom_col()
```
Compare the barplots in @fig-geombar and @fig-geomcol. They show identical bars because they reflect counts of the same five fruits. However, depending on how our categorical data is represented, either "pre-counted" or not, we must add a different `geom` layer. When the categorical variable whose distribution you want to visualize:
* Is *not* pre-counted in your data frame, we use `geom_bar()`.
* Is pre-counted in your data frame, we use `geom_col()` with the y-position aesthetic mapped to the variable that has the counts.
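As a quick check that the two routes lead to the same bars, the following sketch rebuilds the fruit data, derives the pre-counted version with `count()` from `dplyr` (rather than typing the counts by hand), and compares the bar heights that each plot computes. Using `layer_data()` to pull out the heights is our own choice for this illustration.

```{r}
#| eval: false
library(ggplot2)
library(dplyr)

# The raw representation: one row per piece of fruit
fruits <- tibble(fruit = c("apple", "apple", "orange", "apple", "orange"))

# The pre-counted representation, derived from the raw data
fruits_counted <- count(fruits, fruit, name = "number")

# geom_bar() counts the rows for us
bar_heights <- layer_data(
  ggplot(data = fruits, mapping = aes(x = fruit)) + geom_bar()
)$count

# geom_col() uses the counts we supply via the y-position aesthetic
col_heights <- layer_data(
  ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) + geom_col()
)$y

bar_heights
col_heights
```

Both vectors hold the same two heights, 3 for apples and 2 for oranges, which is why the two barplots look alike.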
Let's now go back to the `nba` data frame and visualize the distribution of the categorical variable `college`. Specifically, we'll visualize the number of players who attended different colleges. We'll focus on the New York Knicks (`team_abbreviation == "NYK"`) and data from the 2006-2009 seasons.
Recall from @sec-dplyr that each row in the `nba` data set corresponds to a player in a given season. In other words, the `college` column is more like the `fruits` data frame than the `fruits_counted` data frame because `college` contains the "identity" of the college, not a "pre-counted" frequency. Thus, we should use `geom_bar()` to create a barplot. Much like a `geom_histogram()`, there is only one variable in the `aes()` aesthetic mapping: the variable `college` gets mapped to the `x`-position. Unlike histograms, where the adjacent bars are often plotted so that they touch, bar graphs typically leave a bit of space between the bars (to help readers see that each bar represents a separate category).
```{r}
#| label: fig-collegesbar
#| fig-cap: Number of players by college using geom_bar().
nba %>%
group_by(player_name, college) %>%
filter(team_abbreviation %in% c("NYK"),
season_int < 2010,
season_int > 2005) %>%
ggplot(mapping = aes(x = college)) +
geom_bar() +
guides(x = guide_axis(angle = 90))
```
Observe in @fig-collegesbar that many Knicks players did not attend any college ("None") and that many attended Arizona State or Florida. Alternatively, say you had a data frame where the number of players attending each `college` was pre-counted as in @fig-colleges-counted.
```{r}
#| echo: false
#| label: fig-colleges-counted
#| fig-cap: Number of players, pre-counted for each college
colleges_counted <- nba %>%
group_by(player_name, college) %>%
filter(team_abbreviation %in% c("NYK"),
season_int < 2010,
season_int > 2005) %>%
group_by(college) %>%
summarize(number = n())
kable(colleges_counted,
digits = 3,
booktabs = TRUE,
longtable = TRUE,
linesep = ""
)
```
In order to create a barplot visualizing the distribution of the categorical variable `college` in this case, we would now use `geom_col()`, mapping the `y` aesthetic to `number` in addition to the `x = college` we used previously. The resulting barplot would be identical to @fig-collegesbar (much as we saw from our two identical fruit plots earlier).
### Must avoid pie charts!
One of the most common plots used to visualize the distribution of categorical data is the pie chart. Though they may seem harmless, pie charts actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book, *Creating More Effective Graphs* [@robbins2013creating], people tend to overestimate angles greater than 90 degrees and underestimate angles less than 90 degrees. In other words, it is difficult to determine the relative size of one piece of the pie compared to another. So stay away!
### Two categorical variables {#sec-two-categ-barplot}
Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the *joint* distribution of two categorical variables at the same time. Let's examine the *joint* distribution of players by `college` as well as `season_int`. In other words, the number of players for each combination of `college` and `season_int`.
For example, the number of Knicks players in 2006 who attended `Arizona`, the number of Knicks players in 2007 who attended `Arizona`, the number of Knicks players in 2006 who attended `Florida`, the number of Knicks players in 2007 who attended `Florida`, and so on. Recall the `ggplot()` code that created the barplot of `college` frequency in @fig-collegesbar:
```{r}
#| eval: false
nba %>%
group_by(player_name, college) %>%
filter(team_abbreviation %in% c("NYK"),
season_int < 2010,
season_int > 2005) %>%
ggplot(mapping = aes(x = college)) +
geom_bar() +
guides(x = guide_axis(angle = 90))
```
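We can now map the additional variable `season_int` by adding `fill = factor(season_int)` inside the `aes()` aesthetic mapping. As with the boxplots, we wrap `season_int` in `factor()` so that each season is treated as a separate category.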
```{r}
#| label: fig-colleges-stacked-bar
#| fig-cap: Frequencies of players playing in each season having attended each college
nba %>%
group_by(player_name, college) %>%
filter(team_abbreviation %in% c("NYK"),
season_int < 2010,
season_int > 2005) %>%
ggplot(mapping = aes(x = college, fill = factor(season_int))) +
geom_bar() +
guides(x = guide_axis(angle = 90))
```
@fig-colleges-stacked-bar is an example of a *stacked barplot*. Though simple to make, in certain aspects it is not ideal. For example, it is not particularly easy to compare the heights of the different colors between the bars, corresponding to comparing the number of players from each `season_int` between the different colleges.
Before we continue, let's address some common points of confusion among new R users. First, the `fill` aesthetic corresponds to the color used to _fill_ the bars, whereas the `color` aesthetic corresponds to the color of the _outline_ of the bars. This is identical to how we added color to our histogram in @sec-geomhistogram: we set the outline of the bars to white by setting `color = "white"` and the color of the bars to steel blue by setting `fill = "steelblue"`. Observe in @fig-colleges-stacked-bar-color that mapping `season_int` to `color` and not `fill` yields grey bars with different colored outlines.
```{r}
#| label: fig-colleges-stacked-bar-color
#| fig-cap: Stacked barplot with color aesthetic used instead of fill.
nba %>%
group_by(player_name, college) %>%
filter(team_abbreviation %in% c("NYK"),
season_int < 2010,
season_int > 2005) %>%
ggplot(mapping = aes(x = college, color = factor(season_int))) +
geom_bar() +
guides(x = guide_axis(angle = 90))
```
Second, note that `fill` is another aesthetic mapping much like `x`-position; thus we were careful to include it within the parentheses of the `aes()` mapping. The following code, where the `fill` aesthetic is specified outside the `aes()` mapping (but inside the call to `ggplot()`) will yield an error. This is a fairly common error that new `ggplot` users make:
```{r}
#| eval: false
ggplot(mapping = aes(x = college), fill = factor(season_int)) +
geom_bar()
```
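The distinction between *mapping* an aesthetic to a variable and *setting* it to a fixed value can be sketched with a small made-up data frame. The `toy` data frame and its `college` and `season` columns below are invented for illustration, and we again use `layer_data()` to look at the fill colors that ggplot ends up using.

```{r}
#| eval: false
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  college = c("A", "A", "B", "B", "B"),
  season = factor(c(2006, 2007, 2006, 2006, 2007))
)

# Mapping: the fill depends on a variable, so it goes inside aes()
p_mapped <- ggplot(data = toy, mapping = aes(x = college, fill = season)) +
  geom_bar()

# Setting: the fill is one fixed color, so it goes outside aes(), inside the geom
p_set <- ggplot(data = toy, mapping = aes(x = college)) +
  geom_bar(fill = "steelblue")

unique(layer_data(p_mapped)$fill)  # one color per season
unique(layer_data(p_set)$fill)     # a single color for every bar
```

A useful rule of thumb: if the value on the right-hand side is the name of a column in your data frame, it belongs inside `aes()`.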
An alternative to the stacked barplot is the *side-by-side barplot*, also known as the *dodged barplot*, as seen in @fig-colleges-dodged-bar-color. The code to create a side-by-side barplot is identical to the code to create a stacked barplot, but with a `position = "dodge"` argument added to `geom_bar()`. In other words, we are overriding the default barplot type, which is a *stacked* barplot, and requesting a side-by-side barplot instead.
```{r}
#| label: fig-colleges-dodged-bar-color
#| fig-cap: Dodged barplot.
nba %>%
group_by(player_name, college) %>%
filter(team_abbreviation %in% c("NYK"),
season_int < 2010,
season_int > 2005) %>%
ggplot(mapping = aes(x = college, fill = factor(season_int))) +
geom_bar(position = "dodge") +
guides(x = guide_axis(angle = 90))
```
Here, the **width** of the bars for DePaul and "None" is different than the width of the bars for Arizona and Iowa State. We can make one tweak to the `position` argument to get them to be the same size in terms of width as the other bars by using the more robust `position_dodge()` function.
```{r}
nba %>%
group_by(player_name, college) %>%
filter(team_abbreviation %in% c("NYK"),
season_int < 2010,
season_int > 2005) %>%
ggplot(mapping = aes(x = college, fill = factor(season_int))) +
geom_bar(position = position_dodge(preserve = "single")) +
guides(x = guide_axis(angle = 90))
```
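To see what the `position` argument changes under the hood, here is a sketch on a made-up data frame (`toy`, invented for illustration) that compares the rectangles ggplot computes for a stacked barplot with those for a dodged one.

```{r}
#| eval: false
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  college = c("A", "A", "A", "B", "B"),
  season = factor(c(2006, 2007, 2007, 2006, 2007))
)

base <- ggplot(data = toy, mapping = aes(x = college, fill = season))

stacked <- layer_data(base + geom_bar())
dodged <- layer_data(base + geom_bar(position = "dodge"))

# Stacked: within a college, one bar starts where another one ends
stacked[, c("x", "ymin", "ymax")]

# Dodged: every bar starts at 0, and the bars are shifted sideways instead
dodged[, c("xmin", "xmax", "ymin", "ymax")]
```

Because every dodged bar starts at zero, the heights of the bars can be compared directly, which is the main advantage over the stacked version.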
Lastly, another type of barplot is a *faceted barplot*. Recall in @sec-facets we visualized the distribution of players' points *split* by season using facets. We can apply the same principle to our barplot visualizing the frequency of `college` split by `season_int`: instead of mapping `season_int` to `fill`, we include it as the faceting variable to create small multiples of the plot across the levels of `season_int`.
```{r}
#| label: fig-facet-bar-vert
#| fig-cap: "Faceted barplot comparing the number of players by college and season."
nba %>%
group_by(player_name, college) %>%
filter(team_abbreviation %in% c("NYK"),
season_int < 2010,
season_int > 2005) %>%
ggplot(mapping = aes(x = college)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90)) +
facet_wrap(vars(season_int), ncol = 1)
```
### Summary
Barplots are a common way of displaying the distribution of a categorical variable, or in other words the frequency with which the different categories (or *levels*) occur. They are easy to understand and make it easy to make comparisons across levels. Furthermore, when trying to visualize the relationship of two categorical variables, you have many options: stacked barplots, side-by-side barplots, and faceted barplots (among others). Depending on what aspect of the relationship you are trying to emphasize, you will need to choose among these three types of barplots and own that choice.
## Conclusion {#sec-data-vis-conclusion}
### Summary table
Let's recap all five of the five named graphs (5NG) in @tbl-viz-summary-table summarizing their differences. Using these 5NG, you'll be able to visualize the distributions and relationships of variables contained in a wide array of datasets. This will be even more the case as we start to map more variables to more of each `geom`etric object's `aes`thetic attribute options, further unlocking the awesome power of the `ggplot2` package.
| Named graph | Shows | Geometric object | Notes |
|------|------|------|------|
| Scatterplot | Relationship between 2 numerical variables | `geom_point()` | |
| Linegraph | Relationship between 2 numerical variables | `geom_line()` | Used when there is a sequential order to x-variable, e.g., time |
| Histogram | Distribution of 1 numerical variable | `geom_histogram()` | Faceted histograms show the distribution of 1 numerical variable split by the values of another variable |
| Boxplot | Distribution of 1 numerical variable split by the values of another variable | `geom_boxplot()` | |
| Barplot | Distribution of 1 categorical variable | `geom_bar()` when counts are not pre-counted, `geom_col()` when counts are pre-counted | Stacked, side-by-side, and faceted barplots show the joint distribution of 2 categorical variables |
: Summary of the 5 Named Graphs {#tbl-viz-summary-table}
## A Few Last Points
### Themes
As mentioned earlier, the mapping provided by `aes()` can be applied to the entire plot (by passing it as the `mapping` argument to `ggplot()`) or you can pass different mappings to each of the layers you add to your plot. It is also possible to tweak an entire set of visual features (fonts, ticks, panel strips, and backgrounds) all at once and for the entire plot by using ggplot *themes*. Themes can be added using the `+` just like we added `geom` layers. Here is a sketch to give you the basic idea:
```{r}
#| eval: false
ggplot(data = df,
mapping = aes(x = x, y = y, color = z)) +
geom_line() +
theme_minimal()
```
The topic of themes is a deep one, but to briefly illustrate the effect of themes, let's take a look at the plot we opened the chapter with. The theme that ggplot uses by default is `theme_grey()`:
```{r}
#| echo: false
#| warning: false
ggplot(
data =
nba %>%
group_by(team_abbreviation, season_int) %>%
summarize(
m_pts = mean(pts),
m_ast = mean(ast),
team = first(team_abbreviation),
season_int = first(season_int)
) %>%
filter(team %in% c("LAL", "NYK", "PHX")) %>%
arrange(season_int)
,
aes(x = season_int, y = m_pts, color = team)
) +
geom_line() +
geom_point(aes(size = m_ast))
```
Here we use `theme_minimal()`:
```{r}
#| echo: false
#| warning: false
ggplot(
data =
nba %>%
group_by(team_abbreviation, season_int) %>%
summarize(
m_pts = mean(pts),
m_ast = mean(ast),
team = first(team_abbreviation),
season_int = first(season_int)
) %>%
filter(team %in% c("LAL", "NYK", "PHX")) %>%
arrange(season_int)
,
aes(x = season_int, y = m_pts, color = team)
) +
geom_line() +
geom_point(aes(size = m_ast)) +
theme_minimal()
```
Here we use `theme_classic()`:
```{r}
#| echo: false
#| warning: false
ggplot(
data =
nba %>%
group_by(team_abbreviation, season_int) %>%
summarize(
m_pts = mean(pts),
m_ast = mean(ast),
team = first(team_abbreviation),
season_int = first(season_int)
) %>%
filter(team %in% c("LAL", "NYK", "PHX")) %>%
arrange(season_int)
,
aes(x = season_int, y = m_pts, color = team)
) +
geom_line() +
geom_point(aes(size = m_ast)) +
theme_classic()
```
Here we use `theme_dark()`:
```{r}
#| echo: false
#| warning: false
ggplot(
data =
nba %>%
group_by(team_abbreviation, season_int) %>%
summarize(
m_pts = mean(pts),
m_ast = mean(ast),
team = first(team_abbreviation),
season_int = first(season_int)
) %>%
filter(team %in% c("LAL", "NYK", "PHX")) %>%
arrange(season_int)
,
aes(x = season_int, y = m_pts, color = team)
) +
geom_line() +
geom_point(aes(size = m_ast)) +
theme_dark()
```
Using one of these "complete" themes is a great way to get the overall look of your plot to be close to what you want. This can be helpful when incorporating plots into slides or a poster. However, these themes don't give you fine-grained control over individual aspects of your plots. To modify individual elements, you need to use `theme()` to override the default setting for individual elements. For more information, I strongly encourage readers to check out the book [ggplot2: Elegant Graphics for Data Analysis](https://ggplot2-book.org/).
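As a small illustration of overriding individual elements, the sketch below starts from a complete theme and then adjusts two settings with `theme()`. The data frame `toy` and its columns `x`, `y`, and `z` are made up, and the two elements we chose to change (the legend position and the angle of the x-axis labels) are just examples.

```{r}
#| eval: false
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 1, 2, 3),
  y = c(2, 4, 3, 1, 2, 5),
  z = rep(c("a", "b"), each = 3)
)

p <- ggplot(data = toy, mapping = aes(x = x, y = y, color = z)) +
  geom_line() +
  theme_minimal() +                          # start from a complete theme
  theme(
    legend.position = "bottom",              # move the legend below the plot
    axis.text.x = element_text(angle = 90)   # rotate the x-axis labels
  )
p
```

Note that the call to `theme()` comes after `theme_minimal()`. A complete theme replaces all theme settings, so adding it last would undo the individual tweaks.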
### ggplot Argument Names
Let's go over some important points about specifying the arguments (i.e., inputs) to functions. Run the following two segments of code:
```{r}
#| eval: false
# Option 1:
ggplot(data = nba, mapping = aes(x = team_abbreviation)) +
geom_bar()
# Option 2:
ggplot(nba, aes(x = team_abbreviation)) +
geom_bar()
```
You'll notice that both code segments create the same barplot, even though in the second segment we omitted the `data = ` and `mapping = ` argument names. This is because the `ggplot()` function by default assumes that the `data` argument comes first and the `mapping` argument comes second. As long as you specify the data frame in question first and the `aes()` mapping second, you can omit the explicit statement of the argument names `data = ` and `mapping = `. That being said, explicit is better than implicit. Given the uniformity of the tidyverse packages, you will often see the `data = ` argument name omitted (because the first argument of most tidyverse functions is a data frame), but it is good practice to include the names of other arguments for readability and clarity purposes.
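If you would rather verify this than take it on faith, the following sketch builds the same barplot both ways on a made-up data frame (`toy`, invented for illustration) and compares the values ggplot computes for each.

```{r}
#| eval: false
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(team = c("LAL", "NYK", "NYK", "PHX", "PHX", "PHX"))

# Arguments matched by name
named <- ggplot(data = toy, mapping = aes(x = team)) +
  geom_bar()

# Arguments matched by position: data first, mapping second
positional <- ggplot(toy, aes(x = team)) +
  geom_bar()

# The two plots compute exactly the same bars
all.equal(layer_data(named), layer_data(positional))
```

Matching by position only works when the arguments are supplied in the order the function expects, which is one more reason to spell out the names whenever there is any doubt.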