OxShef: Charts

Charts and Plots

Charts are the most common type of data visualisation, and certainly the most varied. However, many readers will be confused by chart types they have not seen before, so charts should be as easy to consume as possible. In general, a chart is used to present or compare data, and a plot is used in the analysis of variables.

The most widely used charts, which are also easy to create and read, are column charts, barcharts, piecharts, scatter plots and histograms.

Chart Purpose
Types of Data

When designing a data visualisation you must think carefully about the story the dataviz tells your reader or, equivalently, which questions the dataviz answers. For simplicity, let’s use the excellent Financial Times Visual Vocabulary to define four types of story we can tell:

Comparison/Ranking/Magnitude Stories:
- How much data is there?
- How many categories are there in a dataset?
- What percentage of the data belongs to each category?
- Which category has the most items in it?

Deviation Stories:
- What lead did candidate X have in the election?
- In which regions did candidate X lead over candidate Y?

Correlation Stories:
- Are variables X and Y correlated?
- How well does your model fit your data?

Distribution Stories:
- What’s the age distribution in country X?
- How does the age distribution vary between country X and Y?
- What are the min and max values for each category in your data?

Example Stories: Brexit Referendum 2016

To demonstrate the different types of stories we can tell with charts, let’s consider a public dataset: the results of the United Kingdom’s Brexit Referendum. In the referendum, voters were asked to choose one of two options: Remain a member of the European Union or Leave the European Union.

Who won? This is a magnitude story: we care about whether Remain or Leave got more of the vote. A horizontal barchart is an excellent choice for this story, but it’s important to order the bars from largest to smallest. See the Financial Times Visual Vocabulary for more example dataviz.
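The ordering step can be sketched in a few lines of Python using only the standard library. The vote totals below are the widely reported UK-wide results; they are not given on this page, so treat them as an assumption to check against the official figures. `text_barchart` is a hypothetical helper that draws the bars as text rather than with a charting tool.

```python
# UK-wide totals as widely reported; an assumption, not taken from this page.
votes = {"Remain": 16_141_241, "Leave": 17_410_742}


def text_barchart(data, width=40):
    """Return one text bar per category, ordered from largest to smallest."""
    total = sum(data.values())
    largest = max(data.values())
    # Sort by value, descending, so the biggest bar is drawn first.
    ordered = sorted(data.items(), key=lambda item: item[1], reverse=True)
    lines = []
    for label, value in ordered:
        bar = "#" * round(width * value / largest)
        share = 100 * value / total
        lines.append(f"{label:<8}{bar} {share:.1f}%")
    return lines


for line in text_barchart(votes):
    print(line)
```

Sorting before drawing is the only design decision here: the reader sees the winner first, whatever order the categories had in the raw data.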

How did voter turnout compare between the regions of the UK? This is a comparison story. If there’s a sensible way to rank the values in your data, doing so will make the dataviz easier to read. While a horizontal barchart could work for this data, we’ve chosen to use a horizontal lollipop chart, as we care about the voter turnout values and not ‘which region had the biggest impact on the referendum’. See the Financial Times Visual Vocabulary for more example dataviz.

How did the results vary between regions? This is a deviation question, as we’re asking how far each region deviated from a 50:50 split. Stacked barcharts are very good options for this purpose; as always with barcharts, the order of the categories matters. The first chart is a very good choice if you have the specific question, “How did Leave’s margin vary between regions?”. See the Financial Times Visual Vocabulary for more example dataviz.
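As a rough sketch of the calculation behind a deviation chart, the Python below measures each region’s distance from the 50% line and sorts the regions by it. The region names and vote shares are invented for illustration and are not real referendum results; `margins_from_even` is a hypothetical helper.

```python
# Hypothetical Leave vote shares (%) for made-up regions; not real results.
leave_share = {
    "Region A": 58.5,
    "Region B": 44.0,
    "Region C": 51.0,
    "Region D": 38.0,
}


def margins_from_even(shares):
    """Return (region, deviation from 50%) pairs, largest Leave lead first."""
    deviations = {region: share - 50.0 for region, share in shares.items()}
    return sorted(deviations.items(), key=lambda item: item[1], reverse=True)


for region, deviation in margins_from_even(leave_share):
    leader = "Leave" if deviation > 0 else "Remain"
    print(f"{region}: {leader} ahead, {abs(deviation):.1f} points from 50%")
```

Positive values fall on one side of the 50% line and negative values on the other, which is what lets the reader see at a glance which regions went which way.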

Does the vote margin depend on region population? This is a (fairly silly) correlation question, as we’re asking whether larger regions tend to have smaller or larger margins. Scatter plots are the most useful type of chart for these questions. (Recall that a plot is used in the analysis of variables, whereas a chart is used to present or compare data.) See the Financial Times Visual Vocabulary for more example dataviz.
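A scatter plot is often paired with a single number summarising the relationship. The sketch below computes the Pearson correlation coefficient by hand in Python; this is one common choice of measure, not something this page prescribes. The populations and margins are invented for illustration, and `pearson_r` is a hypothetical helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either variable is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical region populations (millions) and vote margins (points).
population = [2.6, 4.6, 5.3, 7.2, 8.5]
margin = [16.0, 18.0, -24.0, 7.6, -20.0]
print(f"r = {pearson_r(population, margin):.2f}")
```

A value near +1 or -1 suggests a strong linear relationship and a value near 0 suggests none, but the scatter plot should still be inspected, since very different point patterns can share the same coefficient.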

How do the populations of the different constituencies vary? This is a distribution story, as we want readers to understand the minimum, maximum and “average” constituency population sizes. See the Financial Times Visual Vocabulary for more example dataviz.
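The summary values that a distribution chart needs to convey can be sketched with Python’s standard `statistics` module. The populations below are invented for illustration and are not real constituency data.

```python
import statistics

# Hypothetical constituency populations; not real data.
populations = [56_000, 61_500, 64_200, 70_100, 72_300, 75_800, 81_000, 110_000]

summary = {
    "min": min(populations),
    "median": statistics.median(populations),
    "mean": statistics.mean(populations),
    "max": max(populations),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:,.0f}")
```

The mean and median differ here because one unusually large constituency pulls the mean upwards, which is one reason “average” needs care in a distribution story.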

Chart selection is heavily dependent on what types of data you have. Many dataviz tools automatically recommend charts to you based on the data type definitions below. Let’s consider an example dataset:

You’ve collected exam results from an exam with 100 questions, taken by 200 students. The data has three columns: grade, number of correct responses, and percentage of correct responses.

Grades: This is a categorical and discrete variable, as there is a limited set of available values. Because the order of these values is important (i.e. “Fail” should always be displayed before “Pass”), this is also an ordinal variable.
Number of correct responses: This is a discrete variable, as students can only answer an integer number of questions correctly.
Percentage of correct responses: In principle this is a continuous variable, as results can vary between 0% and 100%. In practice it is a discrete variable, because with 100 questions only 101 distinct percentages are possible.

A common issue with discrete variables masquerading as continuous variables is weird-looking histograms that are often not fit for purpose; the charts below (histogram, violin chart, column chart) all display the distribution of exam result percentages differently. The “best” chart of the three is dependent on the story the chart is telling.
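One way to see where the odd-looking histograms come from is to count how many achievable scores fall into each bin. The Python sketch below does this for the 100-question exam; `possible_values_per_bin` is a hypothetical helper, and the bin widths are chosen only for illustration.

```python
from collections import Counter


def possible_values_per_bin(n_questions, bin_width):
    """Count how many achievable percentage scores land in each histogram bin."""
    # With n_questions questions, only these percentages can ever occur.
    achievable = [100 * k / n_questions for k in range(n_questions + 1)]
    # Bin i covers the half-open interval [i * bin_width, (i + 1) * bin_width).
    counts = Counter(int(score // bin_width) for score in achievable)
    return [counts[i] for i in range(max(counts) + 1)]


# Bins 2.5 points wide hold alternately three and two achievable scores.
print(possible_values_per_bin(100, 2.5)[:6])

# Bins 5 points wide each hold five achievable scores.
print(possible_values_per_bin(100, 5)[:6])
```

With bins 2.5 points wide, neighbouring bars would differ in height even if every score were equally likely, because some bins can simply hold more distinct scores than others. Choosing a bin width that is a whole number of percentage points avoids this. Note that in this sketch a score of exactly 100% falls into a final bin of its own.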

In general, this is sufficient knowledge about variables to get the most from other, more detailed resources. Be careful with ordinal variables, as their intrinsic order must be clearly presented in the dataviz.
