class: center, middle, inverse, title-slide

.title[
# Data Visualization Workshop
]
.subtitle[
## DataFEWSion Graduate Traineeship
]
.author[
### Anabelle Laurent
]
.date[
### February 20, 2023
]

---

### My background

- 2012: Received my Master's degree in Agronomy/Agroecology (France)
- 2013-2016: Conducted research projects on energy crops at INRAE (France)
- 2017-2020: PhD at ISU in Crop Production & Physiology - Department of Agronomy
- 2021-2022: Postdoc at ISU - Department of Agronomy
- 2023 to present: Research Scientist at Corteva (Johnston, IA) in the Biostatistics Team

---

### Let's talk: why is Data Visualization important? 🤔

---

### Why is Data Visualization important?

- Universal way to communicate information
- Provides a clear and effective message
- Helps find patterns and trends, and spot extreme values
- Makes data memorable
- Maintains the audience's interest

---

### Let's talk: Who is your audience? Which medium are you using? 🤔

---

- Who is your audience?
  + scientists 🥼
  + students 👨‍🎓
  + industry
  + general audience

- Which medium?
  + peer-reviewed paper
  + oral presentations 💬
  + website, blog, etc.

---

### What makes a good visualization?

---

### What makes a good visualization?

- Reveals a **trend** or **relationship** between variables
- Always has, at minimum, a **caption**, **axes**, **scales** and **symbols**
- Distinct and legible symbols (i.e., use contrast)
- Caption should convey as much information as possible
- No noise: keep information to a minimum
- The **correct graph type** for the kind of data to be presented

---

### Disclaimer

This workshop does not provide code, but all the plots were made using RStudio (see the last slides for more details).

<center><img src="images/ggplot2_masterpiece.png" style="width: 70%" /> </center>

[Artwork by @allison_horst](https://github.com/allisonhorst/stats-illustrations)

---

# Visualizing quantity

---

### Visualizing quantity: bar plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-1-1.png)<!-- -->
]

--

.left-column[
What's wrong with this plot?
]

---

### Visualizing quantity: bar plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

.left-column[
- Avoid abbreviations
- Precise axis title + unit
- Make it more attractive
]

---

### Visualizing quantity: bar plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-3-1.png)<!-- -->
]

--

.left-column[
For long x-axis labels, flip the axis
]

---

### Visualizing quantity: bar plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-4-1.png)<!-- -->
]

.left-column[
- Order the categories by ascending or descending values
- Keep naturally ordered categories, like age groups, in their natural order
- For long labels: flip the axis
]

---

### Visualizing quantity: grouped bar plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-5-1.png)<!-- -->
]

.left-column[
Useful to draw bars within each group according to another categorical variable
]

---

### What's wrong with this plot?

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

--

.left-column[
- Bars are too long
- Can be impractical sometimes
]

---

### Don't do that!

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

--

.left-column[
- Bar charts must start at zero because the bar length is proportional to the amount displayed.
- A **dot plot** is a better option
]

---

### Visualizing quantity: dot plot

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

### Visualizing quantity: dot plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]

--

.left-column[
- Bar charts or dot plots: **the order matters**
- Here, the plot does not deliver a clear message
]

---

### Visualizing quantity: lollipop plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]

.left-column[
- Dataset: on-time data for all flights that departed NYC
- Lollipop plots are an alternative to a simple bar chart
]

---

# Visualizing distribution

<center><img src="images/histogram.png" style="width: 70%" /> </center>

[Artwork by @allison_horst](https://github.com/allisonhorst/stats-illustrations)

---

### Visualizing distribution: histograms

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-11-1.png)<!-- -->
]

.left-column[
Histograms are useful for plotting the distribution of a single quantitative variable
]

---

### Visualizing distribution: histograms

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-12-1.png)<!-- -->
]

.left-column[
Try different bin widths for best visual appearance.
- Small bin width -> peaky and busy histogram
- Large bin width -> features might disappear
]

---

### Visualizing distribution: density plot

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

### Visualizing distribution: density plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-14-1.png)<!-- -->
]

.left-column[
Try different bandwidths for best visual appearance

- Small bandwidth -> peaky and busy density
- Large bandwidth -> smooth density that can hide features and might look like a Gaussian
]

---

### Visualizing multiple distributions

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-15-1.png)<!-- -->

---

### Visualizing multiple distributions

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-16-1.png)<!-- -->
]

.left-column[
- The peaks of a density plot show where the points are most concentrated
- For several distributions, density plots work better than histograms.
]

---

### Visualizing multiple distributions

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-17-1.png)<!-- -->

---

### Visualizing multiple distributions: ridgeline plot

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-18-1.png)<!-- -->

---

### Visualizing multiple distributions: ridgeline plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-19-1.png)<!-- -->
]

.left-column[
A ridgeline plot shows the distribution of a numeric variable for several groups. It is useful when there are many groups (at least 5-6) or when the distributions overlap each other.
]

---

### Visualizing distributions: boxplot

<center><img src="images/read_boxplot.jpeg" style="width: 100%" /> </center>

A boxplot can summarize the distribution of a numeric variable for several groups

---

### Visualizing distributions: boxplot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-20-1.png)<!-- -->
]

.left-column[
A boxplot does not show the number of observations.
]

---

### Visualizing distributions: boxplot with jitter

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-21-1.png)<!-- -->
]

.left-column[
Boxplots with jitter show:

- the distribution of the data
- whether the groups are balanced or unbalanced in terms of number of observations.
]

---

### Visualizing distributions: boxplot with jitter

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-22-1.png)<!-- -->
]

.left-column[
Avoiding overlapping points makes the plot easier to read
]

---

### Visualizing distributions: violin plot

.right-column[
![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-23-1.png)<!-- -->
]

.left-column[
- Violins are equivalent to density estimates
- They are useful to represent bimodal data.
]

---

### Your turn 👨‍💻

Create one visual using one of these types of graphics:

- bar chart
- histogram
- density plot
- boxplot
- violin plot

---

### Your turn 👩‍💻

Choose the dataset of your choice:

- Nutritional and marketing information on US Cereals
- Diamonds

--

Choose the dataset of your choice

|                          |mfr | calories| protein| fat| sugars| shelf|
|:-------------------------|:---|--------:|-------:|---:|------:|-----:|
|100% Bran                 |N   |    212.1|    12.1|   3|   18.2|     3|
|All-Bran                  |K   |    212.1|    12.1|   3|   15.2|     3|
|All-Bran with Extra Fiber |K   |    100.0|     8.0|   0|    0.0|     3|

```
## 'data.frame': 65 obs. of 11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num 212 212 100 147 110 ...
##  $ protein  : num 12.12 12.12 8 2.67 2 ...
##  $ fat      : num 3.03 3.03 0 2.67 0 ...
##  $ sodium   : num 394 788 280 240 125 ...
##  $ fibre    : num 30.3 27.3 28 2 1 ...
##  $ carbo    : num 15.2 21.2 16 14 11 ...
##  $ sugars   : num 18.2 15.2 0 13.3 14 ...
##  $ shelf    : int 3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num 848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
```

---

### Your turn 👨‍💻

Choose the dataset of your choice

| carat|cut     |color |clarity | depth| table| price|   x|   y|   z|
|-----:|:-------|:-----|:-------|-----:|-----:|-----:|---:|---:|---:|
|   0.2|Ideal   |E     |SI2     |  61.5|    55|   326| 4.0| 4.0| 2.4|
|   0.2|Premium |E     |SI1     |  59.8|    61|   326| 3.9| 3.8| 2.3|
|   0.2|Good    |E     |VS1     |  56.9|    65|   327| 4.0| 4.1| 2.3|

```
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
```

---

### Your turn 👩‍💻

```r
install.packages("esquisse") # only needed once
library(esquisse)
library(MASS)    # provides the UScereal dataset
library(ggplot2) # provides the diamonds dataset

?UScereal # more details about the dataset
esquisse::esquisser(UScereal, viewer = "browser")

?diamonds # more details about the dataset
esquisse::esquisser(diamonds, viewer = "browser")
```

---

# Visualizing associations among quantitative variables

---

### Relationship between 2 numeric variables: scatterplot

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

---

### Relationship between 2 numeric variables: scatterplot + linear fit

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-28-1.png)<!-- -->

---

##### Relationship between 2 numeric variables: scatterplot + quadratic fit

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-29-1.png)<!-- -->

⚠️ A linear fit is widely used, but it is not always the best fit; try a quadratic fit too.

---

### Relationship between 2 numeric variables: scatterplot

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-30-1.png)<!-- -->

---

### Multi-panel plots

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-31-1.png)<!-- -->

Split a single plot using one variable with many levels

---

### Multi-panel plots

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-32-1.png)<!-- -->

Split a single plot using the combinations of two discrete variables.

---

### Multi-panel plots

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-33-1.png)<!-- -->

⚠️ Different scales across panels can lead to misinterpretation

---

### Bubble plot

A bubble plot is a scatterplot in which a third numerical variable sets the size of each point

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-34-1.png)<!-- -->

---

### Hexagonal heatmap

It counts the number of cases in each hexagon. Useful for large datasets or to avoid overplotting.

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-35-1.png)<!-- -->

---

Computationally more efficient than plotting individual data points for very large datasets.
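
A hexagonal heatmap of this kind could be sketched in ggplot2 as follows. This is only an illustration: the data are simulated (an assumption, not the nitrate data shown on this slide), the number of bins is an arbitrary choice, and `geom_hex()` needs the hexbin package to be installed.

```r
library(ggplot2)

# Simulated data for illustration only (hypothetical, not the workshop's data)
set.seed(1)
n <- 50000
sim <- data.frame(x = rnorm(n))
sim$y <- 0.6 * sim$x + rnorm(n, sd = 0.8)

# geom_hex() bins the points into hexagons and maps the count in each
# hexagon to the fill colour, so 50,000 points are drawn as a few hundred tiles
p_hex <- ggplot(sim, aes(x = x, y = y)) +
  geom_hex(bins = 40) +
  scale_fill_viridis_c(name = "Count") +
  labs(x = "x (simulated)", y = "y (simulated)")

p_hex
```

Increasing `bins` gives smaller hexagons and more detail, at the cost of a noisier picture.
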
<center><img src="images/nitrate_geomhex.png" style="width: 60%" /> </center>

---

### Your turn 👨‍💻

- Create one visual using a scatter plot or a bubble plot
- Use a dataset from TidyTuesday

---

### Your turn 👨‍💻

|species |island    | bill_length_mm| bill_depth_mm| flipper_length_mm|
|:-------|:---------|--------------:|-------------:|-----------------:|
|Adelie  |Torgersen |           39.1|          18.7|               181|
|Adelie  |Torgersen |           39.5|          17.4|               186|
|Adelie  |Torgersen |           40.3|          18.0|               195|
|Adelie  |Torgersen |             NA|            NA|                NA|
|Adelie  |Torgersen |           36.7|          19.3|               193|

<center><img src="images/bill_penguin.png" style="width: 50%" /> </center>

---

### Your turn 👩‍💻

|species |island    | body_mass_g|sex    | year|
|:-------|:---------|-----------:|:------|----:|
|Adelie  |Torgersen |        3750|male   | 2007|
|Adelie  |Torgersen |        3800|female | 2007|
|Adelie  |Torgersen |        3250|female | 2007|
|Adelie  |Torgersen |          NA|NA     | 2007|
|Adelie  |Torgersen |        3450|female | 2007|

<center><img src="images/penguins_drawing.png" style="width: 50%" /> </center>

---

### Your turn 👩‍💻

```r
# Read the penguins data from the TidyTuesday repository
penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
# Open the drag-and-drop plot builder in the browser
esquisse::esquisser(penguins, viewer = "browser")
```

🐧 See this [link for more details about the penguins dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-28/readme.md)

---

# Visualizing time series

---

### Visualizing time series

Example with a NASA dataset: atmospheric measurements across a grid of locations in Central America (Murrell, 2010)

<center><img src="images/grid_nasa.png" style="width: 50%" /> </center>

---

### Visualizing time series

Overview of the data

```
##   time y x   lat      long       date cloudhigh cloudlow cloudmid ozone
## 1    1 1 1 -21.2 -113.8000 1995-01-01       0.5     31.0      2.0   260
## 2    1 1 2 -21.2 -111.2957 1995-01-01       1.5     31.5      2.5   260
## 3    1 1 3 -21.2 -108.7913 1995-01-01       1.5     32.5      3.5   260
## 4    1 1 4 -21.2 -106.2870 1995-01-01       1.0     39.0      4.0   258
## 5    1 1 5 -21.2 -103.7826 1995-01-01       0.5     48.0      4.5   258
## 6    1 1 6 -21.2 -101.2783 1995-01-01       0.0     50.0      2.5   258
##   pressure surftemp temperature  id day month year
## 1     1000    297.4       296.9 1-1   0     1 1995
## 2     1000    297.4       296.5 2-1   0     1 1995
## 3     1000    297.4       296.0 3-1   0     1 1995
## 4     1000    296.9       296.5 4-1   0     1 1995
## 5     1000    296.5       295.5 5-1   0     1 1995
## 6     1000    296.5       295.0 6-1   0     1 1995
```

---

### Visualizing time series

Let's pick one location (x=1 & y=1) and focus on surface temperature (Kelvin)

```
##   time x y surftemp day month year
## 1    1 1 1    297.4   0     1 1995
## 2    2 1 1    298.7  31     2 1995
## 3    3 1 1    298.3  59     3 1995
## 4    4 1 1    298.7  90     4 1995
## 5    5 1 1    298.3 120     5 1995
## 6    6 1 1    295.0 151     6 1995
```

---

### Visualizing time series

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-41-1.png)<!-- -->

- Without the dots, you emphasize the general trend rather than the individual observations
- A plot with line + dots is called a line graph

---

### Visualizing time series

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-42-1.png)<!-- -->

---

### Display change between two time periods: dumbbell chart

<center><img src="images/dummbbell_chart.png" style="width: 80%" /> </center>

Source: Rob Kabacoff (2020)

---

# Aesthetics: Color, Shape, Opacity

---

### Color to distinguish

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-43-1.png)<!-- -->

🐧

---

### Color to highlight

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-44-1.png)<!-- -->

Something wrong? 🤷

---

### Color & shape to highlight

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-45-1.png)<!-- -->

Alternative 1

---

### Color & shape to highlight

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-46-1.png)<!-- -->

Alternative 2

---

### Opacity

Spatial distribution of drug-related crimes in Charlottesville

.right-column[
<center><img src="images/crimedata_noalpha.png" style="width: 60%" /> </center>
]

.left-column[
Your audience can misinterpret the graphic if the opacity is not set correctly
]

---

### Opacity

Spatial distribution of drug-related crimes in Charlottesville

.right-column[
<center><img src="images/crimedata_alpha.png" style="width: 80%" /> </center>
]

.left-column[
Control the opacity to handle overlapping points and provide shading
]

---

# Tell a story with your data

---

### Tell a story with your data

Before visualizing data, you must:

- Know your audience
- Know the level of data detail expected
- Give enough context
- Ask yourself: What do I want my audience to know/remember from the data I am presenting?

---

### Tell a story with your data

.right-column[
<center><img src="images/2020_08_14_penguins.png" style="width: 80%" /> </center>
]

.left-column[
Don't be repetitive but be consistent (theme, color scheme, font size, etc.)
]

---

### Tell a story with your data

Guide your audience by pointing out specific values

<center><img src="images/2019_10_08_powerlifting.png" style="width: 80%" /> </center>

---

### Tell a story with your data

Guide your audience by pointing out specific values

<center><img src="images/2020_foodconsumption.png" style="width: 70%" /> </center>

---

### Tell a story with your data

Customize your plot using highlighting

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-47-1.png)<!-- -->

---

### Tell a story with your data

Customize your plot using highlighting + text

![](DataFEWSion_DataViz_files/figure-html/unnamed-chunk-48-1.png)<!-- -->

---

### Interactive graphics with ggplotly
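
As a rough sketch of what this slide refers to, a static ggplot2 figure can be turned into an interactive one with `plotly::ggplotly()`. The example below is an assumption for illustration, not the plot originally shown here: it reuses the `diamonds` dataset from earlier in the workshop, and the subset size and aesthetics are arbitrary choices.

```r
library(ggplot2)
library(plotly)

# Random subset of diamonds to keep the interactive widget light
# (the subset size of 1000 is an arbitrary choice for this sketch)
set.seed(1)
d <- diamonds[sample(nrow(diamonds), 1000), ]

# Step 1: build a regular, static ggplot2 scatterplot
p <- ggplot(d, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.6) +
  labs(x = "Carat", y = "Price (US dollars)", color = "Cut")

# Step 2: ggplotly() converts the ggplot object into an interactive
# plotly widget with hover tooltips, zoom and pan
p_interactive <- ggplotly(p)
p_interactive
```
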

---

### Interactive time-series with dygraphs

Lung deaths in the UK
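
A minimal sketch of such a chart, assuming the three monthly lung-death series that ship with R (`ldeaths`, `mdeaths`, `fdeaths`); the range selector is an optional extra chosen for this illustration, not necessarily part of the original figure.

```r
library(dygraphs)

# Combine the total, male and female monthly series into one multi-series object
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)

# Basic interactive time-series chart: hover to read values, drag to zoom
dg <- dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)")

# Add a range selector below the chart to pick the time window
dg <- dyRangeSelector(dg)
dg
```
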

---

### Your turn 👩‍💻

```r
library(dygraphs)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)

# Ex 1: highlighting persists even after the mouse leaves the graph area.
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightCircleSize = 5,
              highlightSeriesBackgroundAlpha = 0.2,
              hideOnMouseOut = FALSE)

# Ex 2: stroke width of highlighted series
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightSeriesOpts = list(strokeWidth = 3))
```

---

### Your turn 👨‍💻

```r
library(dygraphs)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)

# Ex 3: fill in the area underneath the series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(fillGraph = TRUE, fillAlpha = 0.4)

# Ex 4: display of the individual points in a series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(drawPoints = TRUE, pointSize = 2)
```

---

### Data visualization using interactive web apps

First example: [ISOFAST web-app](https://analytics.iasoybeans.com/cool-apps/ISOFAST/)

Second example: [ONFANT web-app](https://onfant.agron.iastate.edu/)

---

### Your turn! 👩‍💻

Create a prototype user interface for a Shiny app

Goal: get familiar with R Shiny, then level up with the [RStudio tutorials](https://shiny.rstudio.com/tutorial/)

```r
install.packages("designer") # only needed once
library(designer)
designer::designApp()
```

Check this [website](https://github.com/ashbaldry/designer) for more information.

---

### R libraries used for this presentation

```r
library(dygraphs)
library(gapminder)
library(gghighlight)
library(ggplot2)
library(ggrepel)
library(dplyr)
library(plotly)
library(tidyr)
```

---

### Resources to go deeper into Data Viz

- Website for [R colors and palettes](https://www.color-hex.com/)
- [Claus Wilke's book](https://clauswilke.com/dataviz/index.html)
- [Rob Kabacoff's book](https://rkabacoff.github.io/datavis/)
- [Mario Döbler & Tim Großmann's book](https://www.barnesandnoble.com/w/the-data-visualization-workshop-second-edition-mario-d-bler/1136609407) (available online through the ISU Library)
- [Cédric Scherer's blog](https://www.cedricscherer.com/top/dataviz/)
- [From Data to Viz's website](https://www.data-to-viz.com/)
- [dygraphs R package](https://rstudio.github.io/dygraphs/)
- [Plotly R package](https://plotly.com/r/)
- [Shiny tutorial](https://shiny.rstudio.com/tutorial/)
- Check the hashtag **#tidytuesday** on Twitter if you are looking for inspiration & R code.
- [Shiny app about Tidy Tuesday tweets](https://nsgrantham.shinyapps.io/tidytuesdayrocks/)

---

### Accurate

<center><img src="images/r_rollercoaster.png" style="width: 85%" /> </center>

[Artwork by @allison_horst](https://github.com/allisonhorst/stats-illustrations)

---

### Thank you for your attention

<center><img src="images/lastslide.jpg" style="width: 60%" /> </center>

✉️ my email: **alaurent@iastate.edu**

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).