So, you have your data set and you’re rushing to analyse it? Not so fast. Before data scientists conduct any study, they usually spend 80% of their time cleaning and transforming data – otherwise, they may be useless, resulting in misleading visualisations.
Let’s first look at the way our data is stored. This is a good time to get to know the CSV file, which is one of the basic data formats.
You may wonder why to keep data in a plain text file when we can use more sophisticated options such as spreadsheets or databases? The problem is that they store data in a way that is not always understood by other programs. For this reason, small data sets are often kept in ordinary text files, such as CSV, so they can be read quickly by any computer without installing additional applications.
In a nutshell, a CSV is no different than a typical data table inside a text file with values separated (most often) by commas – as opposed to the column boundaries in the excel table.
col_1,col_2,col_3
val,val,val
val,val,val
col_1 | col_2 | col_3 |
---|---|---|
val | val | val |
val | val | val |
A quick file inspection before you start working with the data is a good habit for any analyst. It will help you get a sense of what’s inside the data, discover missing records or spot other errors that may have occurred while writing data to the file. If you don’t have more advanced software at hand, you can quickly preview the CSV file content by using online tools.
In this exercise, you will explore Open Refine, a tool used by researchers and data journalists, first created by Google, currently developed as an open-source project. It will help you automatically clean your data, correct language inaccuracies, and perform table transformation from wide to long format.
You can import data to OpenRefine directly from any online source (url) like Google Sheets or upload files in different formats (e.g. csv, JSON, xlsx).
true
valuesA common job of a data analyst is to remove trailing spaces or split values from one to multiple columns. Many of those predefined actions can be performed in OpenRefine much faster than by typing formulas into a spreadsheet.
value.replace("UŚ", "Uniwersytet Śląski").trim()
asp, Politechnika
Explore some other options as well.
If you work a lot with survey data, you may have a hard time cleaning up the records – e.g. street names or education background – entered manually by survey participants. Losing hours to make the names written in dozens of ways consistent? (e.g. Środmieście, Środmieście, Śródmeiście, Srodmiescie, środmieście
). This is where OpenRefine shines – using algorithms that analyse natural language to help you tackle this otherwise tedious task.
address_district
columnIf you’ve ever dealt with a spreadsheet, you’re probably used to storing data in a wide table with multiple columns. However, many other tools, e.g. for data visualisation, require the data in a long form (also called tidy or stacked) with only one value in a single row (where each variable is a column). While completing this transformation using spreadsheets is not a trivial task, with OpenRefine it will only take a couple of minutes.
PRO TIP: Read more about tidy data or see this video.
UNTIDY DATA
student | maths | physics | pe |
---|---|---|---|
Anna | 4 | 5 | 5 |
Robert | 2 | 4 | 3 |
Stefan | 5 | 5 | 3 |
TIDY DATA
student | subject | mark |
---|---|---|
Anna | maths | 4 |
Anna | physics | 5 |
Anna | pe | 5 |
Robert | maths | 2 |
Robert | physics | 4 |
Robert | pe | 3 |
Stefan | maths | 5 |
Stefan | physics | 5 |
Stefan | pe | 3 |
We have just learned how to tidy up (stack) our data, and what if we would like to make an opposite transposition – from long to wide table? Very often this will be necessary when counting values in the consecutive rows. While this can also be done in OpenRefine, let’s move for a moment to Google Sheets to discover a very popular and extremely useful feature of creating pivot tables.
Let’s assume that we are interested in the number of participants from different cities and then districts of Katowice.
address_city
& address_district
id
and COUNTA
Suppose you performed several operations on data, then your data set has been updated with new values. Fortunately, you don’t have to do all the work from scratch. You can save a list of executed steps and apply them to another similar file.
Ctrl + a
)If you work with larger files comprising hundreds of thousands lines, your spreadsheet may not be ready for such challenges. However, OpenRefine should be able to handle them, and unlike a spreadsheet, they can come from XML or JSON as well (please note that you may need to allocate more RAM memory to OpenRefine). This makes OpenRefine a powerful companion to popular commercial data processing applications.
Load at most 1200000 rows of data
)See OpenRefine documentation.