Data preprocessing

Data preprocessing #

So, you have your Newslogging data set and you’re rushing to visualize it? Not so fast. Before data scientists conduct any study, they usually spend up to 80% of their time cleaning and transforming data – otherwise, they might be useless, resulting in misleading visualisations.

◕ Make common data cleaning tasks #

A common job of a data analyst is to remove trailing spaces or split values from one to multiple columns. Most of these predefined actions can be performed using Google Sheets, Excel or even a text editor.

  • In Google Sheets, go to [ ▾ Data ] » Cleanup suggestions – this will help you fix the basic issues like trailing spaces
  • Alternatively, you can handle them manually with =TRIM()
  • Check that the column order is logical, e.g. time should be next to date
  • Make sure the column names are consistent
  • The same goes for the cell values. If necessary, find and replace/remove multiple phrases at once using Ctrl+h shortcut, e.g.: yes/no to 1/0

PRO TIP: In complex cases you may need to use more sophisticated software like OpenRefine which uses natural language algorithms to help you tackle this otherwise tedious task.

◕ Split strings to multiple columns #

Very often at this stage we also need to adjust our data for analysis by transforming individual values or adding new variables based on other columns.

Let’s assume that you want to extract the names of the news sources from urls. We could simply split the url into several columns with the . separator. However, some of the urls are not preceded by the www prefix.

https://www.theguardian.com/uk-news/2021/jan/09/the-queen-and-prince-philip-receive-first-dose-of-covid-vaccine
https://twitter.com/teslaownersSV/status/1347201514930991105

Fortunately, we can automatically recognise more than one phrase and remove all of them from the cells before performing the column split.

  • Use Ctrl+h and tick Search using regular expressions
find replace
https://www.|https:// blank field
  • Go to [ ▾ Data ] » Split text to columns and remember to select . as a separator.

Yes, this is something you’re not yet familiar with – we have just used a regular expression. Don’t worry, we’ll get into details later on throughout the course!

◕ Are we done yet? #

You might think that you are done with data cleaning after this stage – not at all.

Data workflow is based on continuous iteration between various stages of data processing. Even during the final visualisation, it may turn out that our data needs to be cleaned again, reformatted or even supplemented with new sources.