2012: Received my Master's degree in Agronomy/Agroecology (France)
2013-2016: Conducted research projects about energy crops at INRAE (France)
2017-2020: PhD at ISU in Crop Production & Physiology - Department of Agronomy
2021-2022: Postdoc at ISU - Department of Agronomy
2023 to current: Research Scientist at Corteva (Johnston, IA) in the Biostatistics Team
Universal way to communicate information
Provides a clear and effective message
Find patterns, trends, spot extreme values
Make data memorable
Maintain the audience's interest
Who is your audience?
Which medium?
Reveals a trend or relationship between variables
Always have at minimum a caption, axes, scales and symbols
Distinct and legible symbols (i.e., use contrast)
Caption should convey as much information as possible
No noise: keep information to a minimum
Choose the correct graph type based on the kind of data to be presented
This workshop does not provide code, but all the plots were made using RStudio (see last slides for more details)
What's wrong with this plot?
For long x-axis labels, flip the axis
Order the categories by ascending or descending values
Keep categories naturally ordered like age group
For long labels: flip the axis
Useful to draw bars within each group according to another categorical variable
Bar charts or dot plots: the order matters
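A minimal sketch of the two tips above (order the categories, flip the axis for long labels). It uses the `mpg` data shipped with ggplot2, which is an illustrative choice and not the dataset shown on the slides.

```r
library(ggplot2)

# Count cars per manufacturer in the built-in mpg data (illustrative dataset)
counts <- as.data.frame(table(manufacturer = mpg$manufacturer))

# reorder() sorts the bars by their count; coord_flip() keeps long labels readable
p_bar <- ggplot(counts, aes(x = reorder(manufacturer, Freq), y = Freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of cars")
p_bar
```

For naturally ordered categories such as age groups, skip `reorder()` and keep the factor levels in their natural order.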
Here, you don't deliver a clear message
Database: On-time data for all flights that departed NYC
Lollipop plots are an alternative to simple bar charts
Histograms are useful for plotting the distribution of a single quantitative variable
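One possible way to draw a lollipop plot with ggplot2 is a segment plus a point. The `mpg` data and the object names below are assumptions made for illustration only.

```r
library(ggplot2)

# Counts per manufacturer (illustrative data), ordered once in the data itself
counts <- as.data.frame(table(manufacturer = mpg$manufacturer))
counts$manufacturer <- reorder(counts$manufacturer, counts$Freq)

# The stick is a segment from 0 to the value; the candy is a point at the value
p_lollipop <- ggplot(counts, aes(x = manufacturer, y = Freq)) +
  geom_segment(aes(xend = manufacturer, y = 0, yend = Freq), colour = "grey50") +
  geom_point(size = 3) +
  coord_flip() +
  labs(x = NULL, y = "Number of cars")
p_lollipop
```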
Try different bin widths for best visual appearance.
Small bin width -> peaky and busy histogram
Large bin width -> features might disappear
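A short sketch of the bin width effect, assuming the base R `faithful` data (waiting time between eruptions) as an example; the two bin widths are arbitrary values chosen to exaggerate the contrast.

```r
library(ggplot2)

# Small bin width: many narrow bars, peaky and busy
p_narrow <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 1)

# Large bin width: few wide bars, the two modes can disappear
p_wide <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 20)

p_narrow
p_wide
```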
Try different bandwidths for best visual appearance
Small bandwidth -> peaky and busy density
Large bandwidth -> smooth features; the density might look like a Gaussian
The peaks of the density plot are where there is the highest concentration of points
For several distributions, density plots work better than histograms.
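A hedged sketch of overlapping density plots with two bandwidths. In `geom_density()`, `adjust` multiplies the default bandwidth; the `mpg` data and the values 0.3 and 3 are assumptions picked for illustration.

```r
library(ggplot2)

# adjust < 1 shrinks the default bandwidth: peaky and busy curves
p_peaky <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5, adjust = 0.3)

# adjust > 1 widens the default bandwidth: smooth, close to bell-shaped curves
p_smooth <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5, adjust = 3)

p_peaky
p_smooth
```

The semi-transparent fill (`alpha = 0.5`) is what lets several distributions share one panel.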
A ridgeline plot shows the distribution of a numeric variable for several groups. It works best with many groups (at least 5-6) or when the distributions overlap each other.
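A possible ridgeline sketch, assuming the ggridges package is installed. The `mpg` data (7 vehicle classes) is an illustrative choice.

```r
library(ggplot2)
library(ggridges)  # assumption: install.packages("ggridges") was run beforehand

# One density per vehicle class, stacked along the y-axis
p_ridge <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_density_ridges(alpha = 0.7) +
  labs(x = "Highway miles per gallon", y = NULL)
p_ridge
```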
A boxplot can summarize the distribution of a numeric variable for several groups
A boxplot does not show the number of observations.
Boxplots with jitter show:
the distribution of the data
whether the groups are balanced or unbalanced in terms of observations.
Avoiding overlap improves the visual appearance of the plot
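A minimal sketch of a boxplot with jittered points, using `mpg` as an assumed example dataset. The jitter width and transparency are arbitrary starting values.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  # Hide the boxplot outliers: the jittered layer already draws every point
  geom_boxplot(outlier.shape = NA) +
  # Horizontal jitter only (height = 0), so the y values stay exact
  geom_jitter(width = 0.2, height = 0, alpha = 0.5)
p_box
```

Groups with few points (unbalanced data) become visible at a glance.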
Violin plots are equivalent to density estimates
They are useful to represent bimodal data.
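A short violin sketch on bimodal data, assuming the base R `faithful` dataset, whose waiting times have two modes.

```r
library(ggplot2)

# A single violin: the two bulges reveal the two modes that a boxplot would hide
p_violin <- ggplot(faithful, aes(x = "Old Faithful", y = waiting)) +
  geom_violin(fill = "grey80") +
  labs(x = NULL, y = "Waiting time (min)")
p_violin
```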
Create one visual using one of these types of graphics:
bar chart
density plot
violin plot
Choose the dataset of your choice:
(cereal) | mfr | calories | protein | fat | sugars | shelf |
100% Bran | N | 212.1 | 12.1 | 3 | 18.2 | 3 |
All-Bran | K | 212.1 | 12.1 | 3 | 15.2 | 3 |
All-Bran with Extra Fiber | K | 100.0 | 8.0 | 0 | 0.0 | 3 |
## 'data.frame': 65 obs. of 11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num 212 212 100 147 110 ...
##  $ protein  : num 12.12 12.12 8 2.67 2 ...
##  $ fat      : num 3.03 3.03 0 2.67 0 ...
##  $ sodium   : num 394 788 280 240 125 ...
##  $ fibre    : num 30.3 27.3 28 2 1 ...
##  $ carbo    : num 15.2 21.2 16 14 11 ...
##  $ sugars   : num 18.2 15.2 0 13.3 14 ...
##  $ shelf    : int 3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num 848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
Choose the dataset of your choice:
carat | cut | color | clarity | depth | table | price | x | y | z |
0.2 | Ideal | E | SI2 | 61.5 | 55 | 326 | 4.0 | 4.0 | 2.4 |
0.2 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.9 | 3.8 | 2.3 |
0.2 | Good | E | VS1 | 56.9 | 65 | 327 | 4.0 | 4.1 | 2.3 |
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
install.packages("esquisse")
library(esquisse)
library(MASS)
library(ggplot2)
?UScereal # more details about the dataset
esquisse::esquisser(UScereal, viewer = "browser")
?diamonds # more details about the dataset
esquisse::esquisser(diamonds, viewer = "browser")
⚠️ A linear fit is widely used, but it is not always the best fit; try a quadratic fit too.
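A hedged sketch comparing a linear and a quadratic fit on the same scatterplot. The base R `mtcars` data and the variable pair (`hp`, `mpg`) are assumptions chosen for illustration.

```r
library(ggplot2)

p_fit <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # Straight line
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "grey40") +
  # Second-degree polynomial fitted with the same lm machinery
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, colour = "red")
p_fit

# The same two models outside the plot, to compare them numerically
fit_linear    <- lm(mpg ~ hp, data = mtcars)
fit_quadratic <- lm(mpg ~ poly(hp, 2), data = mtcars)
AIC(fit_linear, fit_quadratic)
```

A lower AIC suggests a better trade-off between fit and complexity, but looking at the plot remains the first check.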
Split a single plot using one variable with many levels
Split a single plot using the combinations of two discrete variables.
⚠️ Different scales across panels can lead to misinterpretation
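The two splitting strategies and the scale warning can be sketched with ggplot2 facets. The `mpg` data and the faceting variables are illustrative assumptions.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# One variable with many levels: panels wrap into a grid
p_wrap <- base + facet_wrap(~ class)

# Combinations of two discrete variables: rows by drv, columns by year
p_grid <- base + facet_grid(drv ~ year)

# Free y scales: each panel gets its own axis, so heights are no longer comparable
p_free <- base + facet_wrap(~ class, scales = "free_y")

p_wrap
p_grid
p_free
```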
A bubble plot is a scatterplot of 3 numerical variables: the third one is mapped to the point size
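A minimal bubble plot sketch, assuming the base R `mtcars` data; the third variable (`hp`) is mapped to the bubble area.

```r
library(ggplot2)

p_bubble <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  # Transparency helps when bubbles overlap
  geom_point(alpha = 0.6) +
  # scale_size_area() makes the bubble area proportional to the value
  scale_size_area(max_size = 10) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", size = "Horsepower")
p_bubble
```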
Hexagonal binning counts the number of cases in each hexagon. Useful for large datasets or to avoid overplotting.
Computationally more efficient than plotting individual data points for very large datasets.
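A hexagonal binning sketch with `geom_hex()`, which needs the hexbin package to be installed. The `diamonds` data and the number of bins are assumptions for illustration.

```r
library(ggplot2)
# assumption: install.packages("hexbin") was run beforehand (required by geom_hex)

# About 54,000 diamonds summarised as counts per hexagon
p_hex <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex(bins = 40) +
  labs(fill = "Number of diamonds")
p_hex
```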
Create one visual using scatter plot or bubble plot
Use a data set from TidyTuesday
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm |
Adelie | Torgersen | 39.1 | 18.7 | 181 |
Adelie | Torgersen | 39.5 | 17.4 | 186 |
Adelie | Torgersen | 40.3 | 18.0 | 195 |
Adelie | Torgersen | NA | NA | NA |
Adelie | Torgersen | 36.7 | 19.3 | 193 |
species | island | body_mass_g | sex | year |
Adelie | Torgersen | 3750 | male | 2007 |
Adelie | Torgersen | 3800 | female | 2007 |
Adelie | Torgersen | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | 2007 |
Adelie | Torgersen | 3450 | female | 2007 |
penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
esquisse::esquisser(penguins, viewer = "browser")
🐧See this link for more details about the penguins dataset
Example with a NASA dataset: atmospheric measurements across a grid of locations in Central America (Murrell, 2010)
Overview of the data
##   time y x   lat      long       date cloudhigh cloudlow cloudmid ozone
## 1    1 1 1 -21.2 -113.8000 1995-01-01       0.5     31.0      2.0   260
## 2    1 1 2 -21.2 -111.2957 1995-01-01       1.5     31.5      2.5   260
## 3    1 1 3 -21.2 -108.7913 1995-01-01       1.5     32.5      3.5   260
## 4    1 1 4 -21.2 -106.2870 1995-01-01       1.0     39.0      4.0   258
## 5    1 1 5 -21.2 -103.7826 1995-01-01       0.5     48.0      4.5   258
## 6    1 1 6 -21.2 -101.2783 1995-01-01       0.0     50.0      2.5   258
##   pressure surftemp temperature  id day month year
## 1     1000    297.4       296.9 1-1   0     1 1995
## 2     1000    297.4       296.5 2-1   0     1 1995
## 3     1000    297.4       296.0 3-1   0     1 1995
## 4     1000    296.9       296.5 4-1   0     1 1995
## 5     1000    296.5       295.5 5-1   0     1 1995
## 6     1000    296.5       295.0 6-1   0     1 1995
Let's pick one location (x=1 & y=1) and focus on surface temperature (Kelvin)
##   time x y surftemp day month year
## 1    1 1 1    297.4   0     1 1995
## 2    2 1 1    298.7  31     2 1995
## 3    3 1 1    298.3  59     3 1995
## 4    4 1 1    298.7  90     4 1995
## 5    5 1 1    298.3 120     5 1995
## 6    6 1 1    295.0 151     6 1995
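A possible way to reproduce this subset and plot it as a time series. It assumes the data come from the `nasa` dataset of the GGally package, whose columns match the overview above; the slides do not state the source package, so treat this as a guess.

```r
library(ggplot2)

# Assumption: the NASA data are the `nasa` dataset shipped with GGally
data(nasa, package = "GGally")

# Keep one grid cell (x = 1, y = 1) and the columns shown above
one_site <- subset(nasa, x == 1 & y == 1,
                   select = c(time, x, y, surftemp, day, month, year))

# Surface temperature (Kelvin) over the monthly time index
p_ts <- ggplot(one_site, aes(x = time, y = surftemp)) +
  geom_line() +
  labs(x = "Time (month index)", y = "Surface temperature (K)")
p_ts
```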
Source: Rob Kabacoff (2020)
Something wrong?🤷
Alternative 1
Alternative 2
Spatial distribution of drug-related crimes in Charlottesville
Your audience can misinterpret your graphic if the opacity is not set correctly
Spatial distribution of drug-related crimes in Charlottesville
Control the opacity to avoid overlapping and provide shading
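The effect of opacity can be sketched on any dense scatterplot. The example below uses `diamonds` rather than the Charlottesville crime data, and `alpha = 0.05` is an arbitrary starting value to tune.

```r
library(ggplot2)

# Fully opaque points: dense regions collapse into a solid blob
p_opaque <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# Low alpha: overlapping points add up, so darker areas mean more observations
p_alpha <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05)

p_opaque
p_alpha
```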
Before data visualization, you must:
Know your audience
Know the level of data detail expected
Give enough context
Ask yourself: What do I want my audience to know/remember from the data I am presenting?
Don't be repetitive but be consistent (theme, color scheme, font size etc.)
Guide your audience by pointing out specific values
Customize your plot using highlighting
Customize your plot using highlighting + text
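A sketch of highlighting plus text in ggplot2: every bar is grey except the one to emphasise, which also gets a label. The `mpg` data and the highlighted category ("toyota") are arbitrary assumptions.

```r
library(ggplot2)

# Counts per manufacturer (illustrative data), ordered once in the data itself
counts <- as.data.frame(table(manufacturer = mpg$manufacturer))
counts$manufacturer <- reorder(counts$manufacturer, counts$Freq)

# Flag the category to emphasise (hypothetical choice)
counts$highlight <- counts$manufacturer == "toyota"

p_highlight <- ggplot(counts, aes(x = manufacturer, y = Freq, fill = highlight)) +
  geom_col(show.legend = FALSE) +
  # Grey for context, one strong colour for the message
  scale_fill_manual(values = c(`FALSE` = "grey75", `TRUE` = "firebrick")) +
  # Label only the highlighted bar
  geom_text(data = subset(counts, highlight), aes(label = Freq), hjust = -0.3) +
  coord_flip() +
  labs(x = NULL, y = "Number of cars")
p_highlight
```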
Lung deaths in UK
library(dygraphs)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)
# Ex 1: highlighting persists even after the mouse leaves the graph area
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightCircleSize = 5,
              highlightSeriesBackgroundAlpha = 0.2,
              hideOnMouseOut = FALSE)
# Ex 2: stroke width of the highlighted series
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightSeriesOpts = list(strokeWidth = 3))
library(dygraphs)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)
# Ex 3: fill in the area underneath the series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(fillGraph = TRUE, fillAlpha = 0.4)
# Ex 4: display of the individual points in a series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(drawPoints = TRUE, pointSize = 2)
First example: ISOFAST web-app
Second example: ONFANT web-app
Create a prototype user interface for a Shiny app
Goal: get familiar with R Shiny, then level up with the RStudio tutorials
Check this website for more information.
Website for R colors and palettes
Marie Döbler & Tim Großmann's book: available online through the ISU Library
Check the hashtag #tidytuesday on Twitter if you are looking for inspiration & R code.
✉️ my email: alaurent@iastate.edu
