1.5. Harvest a dataset

Duration: 15 min

[Figure: overview of tutorial 1.5]

Goals

  • Harvest a dataset with a tool from an online source
  • Know that the tool calls an API to get data

Harvest a dataset

We will use Toolforge PageViews to harvest the same dataset we used to build the timeline of page views in Tutorial 1.1.

  • Go to PageViews
  • Visualize the data for the two following Wikipedia articles:
    • Space-based solar power
    • Thorium-based nuclear power
  • Use the right settings:
    • The Dates should be from 01/07/2015 to today (we used 06/02/2022)
    • The Date type should be Daily. Tableau can easily aggregate days into months or years, but it cannot recover daily detail from monthly data, so the finer granularity is the better choice.
  • Download the dataset by clicking on the Download drop-down menu and selecting CSV.
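Why daily data is the safer choice can be shown with a minimal sketch: daily counts can always be summed into months, while the reverse is impossible. The column layout (a Date column followed by one column per article) is an assumption about the downloaded CSV, the values are invented for illustration, and monthly_totals is a hypothetical helper, not part of any tool.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the downloaded file. The layout is an assumption and
# the numbers are invented; replace with the real CSV content.
SAMPLE = """Date,Space-based solar power,Thorium-based nuclear power
2015-07-01,100,200
2015-07-02,110,190
2015-08-01,120,180
"""


def monthly_totals(csv_text):
    """Sum daily views per (month, article) pair (hypothetical helper)."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = row.pop("Date")[:7]  # "2015-07-01" -> "2015-07"
        # Every remaining column is assumed to hold one article's views.
        for article, views in row.items():
            totals[(month, article)] += int(views)
    return dict(totals)


totals = monthly_totals(SAMPLE)
for (month, article), views in sorted(totals.items()):
    print(month, article, views)
```

Tableau does the same aggregation for you when you switch the date field from day to month, so nothing is lost by downloading the daily series.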

The tool works by calling the Application Programming Interface (API) of Wikipedia. The page views endpoint offers more options than the tool exposes, but the tool makes it easy to interact with the API. We will see in the next tutorial how to use a script to call the API directly.
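To make the idea of an endpoint concrete, here is a minimal sketch that only builds the address of a request to the Wikimedia REST API per-article page views endpoint, without sending it. The helper name pageviews_url and the default values are assumptions chosen for illustration, not necessarily the settings the tool uses, and the dates assume the tutorial writes them as day/month/year.

```python
from urllib.parse import quote

# Base path of the Wikimedia REST API per-article page views endpoint.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia",
                  access="all-access", agent="all-agents",
                  granularity="daily"):
    """Build the request URL for one article (hypothetical helper)."""
    # Titles use underscores instead of spaces and must be URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, title,
                     granularity, start, end])


# Dates are written YYYYMMDD: 1 July 2015 to 6 February 2022 (assumed).
url = pageviews_url("Space-based solar power", "20150701", "20220206")
print(url)
```

The access and agent segments are examples of the extra options the endpoint accepts. Opening the printed address in a browser should return the daily counts as JSON rather than CSV.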

Visualize

Open the dataset in Tableau and check that it works. Can you visualize it at the day level using bars? It may look like this:

[Figure: timeline of daily page views shown as bars]

TIP: if the bars look grey, that is because they are so thin that we only see their grey border. If you want to remove that border, click on the Color button in the Marks panel and set Borders to None.

No need to annotate this time (you’ve done it already in the first tutorial).

Documents produced

This time, none!

Next tutorial

 1.6. Harvest data with a notebook (30 min)


Tools for getting similar data (CSV format with timestamps) from other sources:

Relation to the course readings

  • The process of getting data through scraping, crawling and calling APIs is covered in Chapter 6: Collecting and curating digital records of Venturini, T. & Munk, A.K. (2021). Controversy Mapping: A Field Guide.