
Data Visualization Workshop

DataFEWSion Graduate Traineeship

Anabelle Laurent

February 20, 2023

1 / 85

My background

  • 2012: Received my Master's degree in Agronomy/ Agroecology (France)

  • 2013-2016: Conducted research projects about energy crops at INRAE (France)

  • 2017-2020: PhD at ISU in Crop Production & Physiology - Department of Agronomy

  • 2021-2022: Postdoc at ISU - Department of Agronomy

  • 2023 to current: Research Scientist at Corteva (Johnston, IA) in the Biostatistics Team

2 / 85

Let's talk: why is Data Visualization important? 🤔

3 / 85

Why is Data Visualization important?

  • Universal way to communicate information

  • Provides a clear and effective message

  • Helps find patterns and trends, and spot extreme values

  • Make data memorable

  • Maintain the audience's interest

4 / 85

Let's talk: Who is your audience? Which medium are you using? 🤔

5 / 85
  • Who is your audience?

    • scientists 🥼
    • students 👨‍🎓
    • industry
    • general audience
  • Which medium?

    • peer-reviewed paper 🗞
    • oral presentations 💬
    • website, blog, etc.
6 / 85

What makes a good visualization? 😄

7 / 85

What makes a good visualization?

  • Reveals a trend or relationship between variables

  • Always has, at minimum, a caption, axes, scales, and symbols

  • Distinct and legible symbols (i.e., use contrast)

  • Caption should convey as much information as possible

  • No noise: keep information to a minimum

  • Uses the correct graph type for the kind of data presented

8 / 85

Disclaimer

This workshop does not provide code, but all the plots were made using RStudio (see the last slides for more details)

Artwork by @allison_horst

9 / 85

Visualizing quantity

10 / 85

Visualizing quantity : bar plot

11 / 85

Visualizing quantity : bar plot

What's wrong with this plot?

11 / 85

Visualizing quantity : bar plot

  • Avoid abbreviations
  • Precise axis title + unit
  • Make it more attractive
12 / 85

Visualizing quantity : bar plot

13 / 85

Visualizing quantity : bar plot

For long x-axis labels, flip the axes

13 / 85

Visualizing quantity: bar plot

  • Order the categories by ascending or descending values

  • Keep categories naturally ordered like age group

  • For long labels: flip the axis
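
The workshop plots are not accompanied by code, so the following is only a minimal ggplot2 sketch of the two tips above (ordering the bars and flipping the axes). It assumes the `UScereal` dataset from the MASS package, used later in the exercises, and summarizes mean calories per manufacturer purely for illustration.

```r
library(ggplot2)

# Mean calories per portion for each manufacturer (MASS::UScereal).
cal <- aggregate(calories ~ mfr, data = MASS::UScereal, FUN = mean)

p_bar <- ggplot(cal, aes(x = reorder(mfr, calories), y = calories)) +
  geom_col() +    # bars are drawn from zero
  coord_flip() +  # horizontal bars leave room for long category labels
  labs(x = "Manufacturer", y = "Mean calories per portion")

p_bar
```

`reorder()` sorts the categories by value. For naturally ordered categories such as age groups, keep the factor's own order instead.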

14 / 85

Visualizing quantity : grouped bar plot

Useful to draw bars within each group according to another categorical variable

15 / 85

What's wrong with this plot?

16 / 85

What's wrong with this plot?

  • Bars are too long
  • Bars can be impractical sometimes
16 / 85

Don't do that! 🙅

17 / 85

Don't do that! 🙅

  • Bar charts must start at zero because the bar length is proportional to the amount displayed.
  • A dot plot is a better option
17 / 85

Visualizing quantity: dot plot

18 / 85

Visualizing quantity: dot plot

19 / 85

Visualizing quantity: dot plot

  • Bar charts or dot plots: the order matters

  • Here, you don't deliver a clear message

19 / 85

Visualizing quantity : lollipop plot

  • Dataset: On-time data for all flights that departed NYC

  • Lollipop plots are an alternative to simple bar charts
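
A lollipop plot can be sketched in ggplot2 as a thin segment plus a point. The example below is an assumption-laden illustration: it uses the `mpg` dataset that ships with ggplot2 rather than the NYC flights data shown on the slide.

```r
library(ggplot2)

# Mean highway mileage per vehicle class (ggplot2::mpg).
hwy_by_class <- aggregate(hwy ~ class, data = mpg, FUN = mean)

p_lollipop <- ggplot(hwy_by_class, aes(x = hwy, y = reorder(class, hwy))) +
  # the "stick": a segment from zero to the value
  geom_segment(aes(x = 0, xend = hwy, yend = reorder(class, hwy))) +
  # the "candy": a point at the value
  geom_point(size = 3) +
  labs(x = "Mean highway mileage (miles per gallon)", y = "Vehicle class")

p_lollipop
```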

20 / 85

Visualizing distribution

Artwork by @allison_horst

21 / 85

Visualizing distribution : histograms

Histograms are useful for plotting the distribution of a single quantitative variable

22 / 85

Visualizing distribution : histograms

Try different bin widths for best visual appearance.

  • Small bin width -> peaky and busy histogram

  • Large bin width -> features might disappear
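
As a rough illustration of the bin-width trade-off, the sketch below draws the same variable twice with ggplot2's `geom_histogram()`. The `diamonds` dataset and the two bin widths are choices made for this example, not the values used on the slide.

```r
library(ggplot2)

# Small bin width: many narrow bars, peaky and busy.
p_narrow <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.01) +
  labs(x = "Carat", y = "Number of diamonds")

# Large bin width: few wide bars, fine features disappear.
p_wide <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Carat", y = "Number of diamonds")

p_narrow
p_wide
```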

23 / 85

Visualizing distribution : density plot

24 / 85

Visualizing distribution : density plot

Try different bandwidths for best visual appearance

  • Small bandwidth -> peaky and busy density

  • Large bandwidth -> features are smoothed out and the density might look Gaussian
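
One way to experiment with bandwidth in ggplot2 is the `adjust` argument of `geom_density()`, which multiplies the default bandwidth. The sketch below assumes the `diamonds` dataset, and the multipliers are arbitrary values picked for illustration.

```r
library(ggplot2)

# adjust < 1 shrinks the default bandwidth: peaky and busy density.
p_peaky <- ggplot(diamonds, aes(x = carat)) +
  geom_density(adjust = 0.2) +
  labs(x = "Carat", y = "Density")

# adjust > 1 widens the default bandwidth: very smooth density.
p_smooth <- ggplot(diamonds, aes(x = carat)) +
  geom_density(adjust = 5) +
  labs(x = "Carat", y = "Density")

p_peaky
p_smooth
```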

25 / 85

Visualizing multiple distributions

26 / 85

Visualizing multiple distributions

  • The peaks of the density plot are where there is the highest concentration of points

  • For several distributions, density plots work better than histograms.

27 / 85

Visualizing multiple distributions

28 / 85

Visualizing multiple distributions: ridgeline plot

29 / 85

Visualizing multiple distributions: ridgeline plot

A ridgeline plot shows the distribution of a numeric variable for several groups. It is useful when there are many groups (at least 5-6) or when the distributions overlap each other.

30 / 85

Visualizing distributions: boxplot

A boxplot can summarize the distribution of a numeric variable for several groups

31 / 85

Visualizing distributions: boxplot

A boxplot does not show the number of observations.

32 / 85

Visualizing distributions: boxplot with jitter

Boxplots with jitter show:

  • the distribution of the data

  • whether the groups are balanced or unbalanced in terms of number of observations.
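
A boxplot with jitter can be sketched in ggplot2 by layering `geom_jitter()` over `geom_boxplot()`. This example assumes the `UScereal` dataset from MASS. The jitter width and point opacity are illustrative values.

```r
library(ggplot2)

p_box_jitter <- ggplot(MASS::UScereal, aes(x = mfr, y = calories)) +
  # hide the boxplot's own outlier symbols so no observation is drawn twice
  geom_boxplot(outlier.shape = NA) +
  # jitter horizontally only, so the y values stay exact
  geom_jitter(width = 0.15, height = 0, alpha = 0.6) +
  labs(x = "Manufacturer", y = "Calories per portion")

p_box_jitter
```

With one point per cereal, manufacturers with very few products become visible immediately, which the boxes alone would hide.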

33 / 85

Visualizing distributions: boxplot with jitter

Avoiding overlapping points makes the plot easier to read

34 / 85

Visualizing distributions: violin plot

  • Violins are equivalent to density estimates

  • They are useful to represent bimodal data.

35 / 85

Your turn 👨‍💻

Create one visual using one of these types of graphics:

  • bar chart

  • histograms

  • density plot

  • boxplot

  • violin plot

36 / 85

Your turn 👩‍💻

Choose the dataset of your choice:

  • Nutritional and marketing information on US Cereals
  • Diamonds
37 / 85

Your turn 👩‍💻

Choose the dataset of your choice:

  • Nutritional and marketing information on US Cereals
  • Diamonds

                          mfr calories protein fat sugars shelf
100% Bran                   N    212.1    12.1   3   18.2     3
All-Bran                    K    212.1    12.1   3   15.2     3
All-Bran with Extra Fiber   K    100.0     8.0   0    0.0     3
## 'data.frame': 65 obs. of 11 variables:
## $ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
## $ calories : num 212 212 100 147 110 ...
## $ protein : num 12.12 12.12 8 2.67 2 ...
## $ fat : num 3.03 3.03 0 2.67 0 ...
## $ sodium : num 394 788 280 240 125 ...
## $ fibre : num 30.3 27.3 28 2 1 ...
## $ carbo : num 15.2 21.2 16 14 11 ...
## $ sugars : num 18.2 15.2 0 13.3 14 ...
## $ shelf : int 3 3 3 1 2 3 1 3 2 1 ...
## $ potassium: num 848.5 969.7 660 93.3 30 ...
## $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
37 / 85

Your turn 👨‍💻

Choose the dataset of your choice

carat cut color clarity depth table price x y z
0.2 Ideal E SI2 61.5 55 326 4.0 4.0 2.4
0.2 Premium E SI1 59.8 61 326 3.9 3.8 2.3
0.2 Good E VS1 56.9 65 327 4.0 4.1 2.3
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
38 / 85

Your turn 👩‍💻

install.packages("esquisse")  # one-time installation
library(esquisse)
library(MASS)     # provides the UScereal dataset
library(ggplot2)  # provides the diamonds dataset
?UScereal  # more details about the dataset
esquisse::esquisser(UScereal, viewer = "browser")
?diamonds  # more details about the dataset
esquisse::esquisser(diamonds, viewer = "browser")
39 / 85

Visualizing associations among quantitative variables

40 / 85

Relationship between 2 numeric variables: scatterplot

41 / 85

Relationship between 2 numeric variables: scatterplot + linear fit

42 / 85
Relationship between 2 numeric variables: scatterplot + quadratic fit

⚠️ A linear fit is widely used, but it is not always the best fit. Try a quadratic fit too.
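
To compare the two fits, one option is to overlay both with `geom_smooth()` and also fit the models with `lm()`. The sketch below assumes the `mpg` dataset from ggplot2 (engine displacement vs. highway mileage), which is not the data shown on the slides.

```r
library(ggplot2)

p_fits <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  # straight-line fit
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "grey40") +
  # quadratic fit: second-degree polynomial in x
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, colour = "red") +
  labs(x = "Engine displacement (litres)", y = "Highway mileage (miles per gallon)")

p_fits

# The same two models fitted directly, for a numerical comparison.
fit_linear    <- lm(hwy ~ displ, data = mpg)
fit_quadratic <- lm(hwy ~ poly(displ, 2), data = mpg)
AIC(fit_linear, fit_quadratic)  # lower AIC suggests a better trade-off
```

The quadratic model always has a residual sum of squares no larger than the linear one because the models are nested, so a criterion that penalizes extra parameters (such as AIC) is a fairer comparison.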

43 / 85

Relationship between 2 numeric variables: scatterplot

44 / 85

Multi-panel plots

Split a single plot using one variable with many levels

45 / 85

Multi-panel plots

Split a single plot using the combinations of two discrete variables.

46 / 85

Multi-panel plots

⚠️ Different scales across panels can lead to misinterpretation
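
In ggplot2, multi-panel plots are typically built with `facet_wrap()` (one variable) or `facet_grid()` (combinations of two variables). The sketch below assumes the `mpg` dataset. The `scales = "free_y"` variant shows how panels can end up on different y-axis scales.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (litres)", y = "Highway mileage (miles per gallon)")

# One variable with many levels; all panels share the same axes (the default).
p_wrap_fixed <- base + facet_wrap(~ class)

# Same panels, but each gets its own y-axis range: compare with care.
p_wrap_free <- base + facet_wrap(~ class, scales = "free_y")

# Combinations of two discrete variables: rows by drive train, columns by cylinders.
p_grid <- base + facet_grid(drv ~ cyl)

p_wrap_fixed
p_wrap_free
p_grid
```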

47 / 85

Bubble plot

A bubble plot is a scatterplot in which a third numerical variable is mapped to the size of the points

48 / 85

Hexagonal heatmap

It counts the number of cases in each hexagon. Useful for large datasets or to avoid overplotting.

49 / 85

Computationally more efficient than plotting individual data points for very large datasets.

50 / 85

Your turn 👨‍💻

  • Create one visual using scatter plot or bubble plot

  • Use a data set from TidyTuesday

51 / 85

Your turn 👨‍💻

species island bill_length_mm bill_depth_mm flipper_length_mm
Adelie Torgersen 39.1 18.7 181
Adelie Torgersen 39.5 17.4 186
Adelie Torgersen 40.3 18.0 195
Adelie Torgersen NA NA NA
Adelie Torgersen 36.7 19.3 193
52 / 85

Your turn 👩‍💻

species island body_mass_g sex year
Adelie Torgersen 3750 male 2007
Adelie Torgersen 3800 female 2007
Adelie Torgersen 3250 female 2007
Adelie Torgersen NA NA 2007
Adelie Torgersen 3450 female 2007
53 / 85

Your turn 👩‍💻

penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
esquisse::esquisser(penguins,viewer="browser")

🐧See this link for more details about the penguins dataset

54 / 85

Visualizing time series

55 / 85

Visualizing time series

Example with a NASA dataset: atmospheric measurements across a grid of locations in Central America (Murrell, 2010)

56 / 85

Visualizing time series

Overview of the data

## time y x lat long date cloudhigh cloudlow cloudmid ozone
## 1 1 1 1 -21.2 -113.8000 1995-01-01 0.5 31.0 2.0 260
## 2 1 1 2 -21.2 -111.2957 1995-01-01 1.5 31.5 2.5 260
## 3 1 1 3 -21.2 -108.7913 1995-01-01 1.5 32.5 3.5 260
## 4 1 1 4 -21.2 -106.2870 1995-01-01 1.0 39.0 4.0 258
## 5 1 1 5 -21.2 -103.7826 1995-01-01 0.5 48.0 4.5 258
## 6 1 1 6 -21.2 -101.2783 1995-01-01 0.0 50.0 2.5 258
## pressure surftemp temperature id day month year
## 1 1000 297.4 296.9 1-1 0 1 1995
## 2 1000 297.4 296.5 2-1 0 1 1995
## 3 1000 297.4 296.0 3-1 0 1 1995
## 4 1000 296.9 296.5 4-1 0 1 1995
## 5 1000 296.5 295.5 5-1 0 1 1995
## 6 1000 296.5 295.0 6-1 0 1 1995
57 / 85

Visualizing time series

Let's pick one location (x=1 & y=1) and focus on surface temperature (Kelvin)

## time x y surftemp day month year
## 1 1 1 1 297.4 0 1 1995
## 2 2 1 1 298.7 31 2 1995
## 3 3 1 1 298.3 59 3 1995
## 4 4 1 1 298.7 90 4 1995
## 5 5 1 1 298.3 120 5 1995
## 6 6 1 1 295.0 151 6 1995
58 / 85

Visualizing time series

  • Without the dots, you emphasize the general trend rather than the individual observations
  • A plot with a line plus dots is called a line graph
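
The difference between the two styles comes down to one extra layer in ggplot2. The sketch below assumes the `economics` dataset that ships with ggplot2 (monthly US unemployment) rather than the NASA data used on the slides.

```r
library(ggplot2)

# Keep a few years so individual monthly observations remain visible.
recent <- subset(economics, date >= as.Date("2010-01-01"))

# Line only: emphasizes the general trend.
p_line <- ggplot(recent, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = "Date", y = "Unemployed (thousands)")

# Line + dots: also shows each individual observation.
p_line_dots <- p_line + geom_point()

p_line
p_line_dots
```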
59 / 85

Visualizing time series

60 / 85

Display change between two time periods: dumbbell chart

Source: Rob Kabacoff (2020)

61 / 85

Aesthetics: Color, Shape, Opacity

62 / 85

Color to distinguish

🐧

63 / 85

Color to highlight

Something wrong?🤷

64 / 85

Color & shape to highlight

Alternative 1

65 / 85

Color & shape to highlight

Alternative 2

66 / 85

Opacity

Spatial distribution of drug-related crimes in Charlottesville

Your audience can misinterpret the graphic if the opacity is not set correctly

67 / 85

Opacity

Spatial distribution of drug-related crimes in Charlottesville

Control the opacity to reveal overlapping points and provide shading
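
In ggplot2, opacity is controlled with the `alpha` argument. The sketch below assumes the `diamonds` dataset instead of the Charlottesville crime data, and the value 0.05 is an arbitrary starting point to tune by eye.

```r
library(ggplot2)

# Fully opaque points: dense regions collapse into a solid blob.
p_opaque <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  labs(x = "Carat", y = "Price (US$)")

# Semi-transparent points: darker shading marks where observations pile up.
p_alpha <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05) +
  labs(x = "Carat", y = "Price (US$)")

p_opaque
p_alpha
```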

68 / 85

Tell a story with your data 📖

69 / 85

Tell a story with your data

Before visualizing your data, you must:

  • Know your audience

  • Know the level of data detail expected

  • Give enough context

  • Ask yourself: What do I want my audience to know/remember from the data I am presenting?

70 / 85

Tell a story with your data

Don't be repetitive but be consistent (theme, color scheme, font size etc.)

71 / 85

Tell a story with your data

Guide your audience by pointing out specific values

72 / 85

Tell a story with your data

Guide your audience by pointing out specific values

73 / 85

Tell a story with your data

Customize your plot using highlighting

74 / 85

Tell a story with your data

Customize your plot using highlighting + text

75 / 85

Interactive graphics with ggplotly

[Interactive plot: Life expectancy vs GDP per capita in 2007. x-axis: GDP per capita (US$); y-axis: Life expectancy (years); point size: Population; color: continent (Africa, Americas, Asia, Europe, Oceania)]
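
A plot of this kind can be approximated by building a static ggplot and passing it to `plotly::ggplotly()`. The sketch below is a guess at the structure based on the axis and legend labels of this slide, using the gapminder and plotly packages listed at the end of the deck. It is not the author's original code.

```r
library(ggplot2)
library(gapminder)
library(plotly)

# Keep only the 2007 observations (one row per country).
gap07 <- subset(gapminder, year == 2007)

p_static <- ggplot(gap07, aes(x = gdpPercap, y = lifeExp,
                              size = pop, colour = continent)) +
  geom_point(alpha = 0.7) +
  labs(title = "Life expectancy vs GDP per capita in 2007",
       x = "GDP per capita (US$)", y = "Life expectancy (years)",
       size = "Population", colour = "continent")

# Convert the static plot into an interactive widget with hover tooltips.
p_interactive <- ggplotly(p_static)
p_interactive
```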
76 / 85

Interactive time-series with dygraphs

Lung deaths in UK

[Interactive time series: monthly deaths for the series mdeaths and fdeaths; x-axis from Jan 1974 to Jan 1979, y-axis from 500 to 2500]
77 / 85

Your turn 👩‍💻

library(dygraphs)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)
# Ex 1: highlighting persists even after the mouse leaves the graph area.
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightCircleSize = 5,
              highlightSeriesBackgroundAlpha = 0.2,
              hideOnMouseOut = FALSE)
# Ex 2: stroke width of highlighted series
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightSeriesOpts = list(strokeWidth = 3))
78 / 85

Your turn 👨‍💻

library(dygraphs)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)
# Ex 3: fill in the area underneath the series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(fillGraph = TRUE, fillAlpha = 0.4)
# Ex 4: display of the individual points in a series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(drawPoints = TRUE, pointSize = 2)
79 / 85

Data visualization using interactive web apps

First example: ISOFAST web-app

Second example: ONFANT web-app

80 / 85

Your turn! 👩‍💻

Create a prototype user interface for a Shiny app.

Goal: get familiar with R Shiny, then level up with the RStudio tutorials.

install.packages("designer")
library(designer)
designer::designApp()

Check this website for more information.

81 / 85

R libraries used for this presentation

library(dygraphs)
library(gapminder)
library(gghighlight)
library(ggplot2)
library(ggrepel)
library(dplyr)
library(plotly)
library(tidyr)
82 / 85

Resources to go deeper into Data Viz

83 / 85

Thank you for your attention

✉️ my email: alaurent@iastate.edu

Slides created via the R package xaringan.

85 / 85
