Text processing

Even if you are not a data researcher, a simple text editor and its find-and-replace command can make your daily work much more efficient if you deal with a lot of text. In this exercise, you will learn several useful tricks for cleaning up text data. We’ll start with Ctrl + H and simple commands and work our way up to regular expressions, which come in handy especially for those of you who make a living from text editing.

1. Routine cleaning tasks

Let’s start with a few simple patterns that will prove useful for your routine text editing work.

◕ Remove white spaces
Modern  writers,  painters,  photographers,  filmmakers  and  digital  
artists  have created many fascinating representations of city life. 
Paintings of Parisian   boulevards and cafés by Pissarro and Renoir, 
photomontages by Berlin Dada  artists,      Spider-Man comics  by  Stan  
Lee  and  Steve  Ditko,  Broadway Boogie-Woogie by Piet Mondrian and 
Playtime by Jacques Tati are some of the classic examples of artists 
encountering the city. 

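expression meaning
\s\s two whitespace characters in a row
\s{2} same as above
\s{2,} two or more whitespace characters in a row (note that \s also matches tabs and line breaks)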
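
If you prefer a script to an editor, the same clean-up can be sketched in Python. This is only a minimal illustration: the function name collapse_spaces is made up for this example, and the pattern is narrowed to spaces and tabs (an assumption, not part of the exercise) so that line breaks are left alone.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces or tabs with one space."""
    # [ \t]{2,} is used instead of \s{2,} so that line breaks survive.
    return re.sub(r"[ \t]{2,}", " ", text)


sample = "Modern  writers,  painters,      photographers"
print(collapse_spaces(sample))
```
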
◕ Replace words to improve consistency
Mercer's 2012 Quality of Living survey ranked Dusseldorf the sixth 
most livable city in the world. DUS Airport is Germany's third-busiest 
airport after those of Frankfurt and Munich, serving as the most 
important international airport for the inhabitants of the densely 
populated Ruhr, Germany's largest urban area. Duesseldorf is an inter-
national business and financial centre, renowned for its fashion 
and trade fairs, and is headquarters to one Fortune Global 500 
and two DAX companies. (Wikipedia)

find replace
(Duesseldorf|Dusseldorf|DUS) Düsseldorf
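
The same replacement could look roughly like this in Python. The word boundaries (\b) are an addition of this sketch, not of the exercise; they keep the pattern from firing inside longer words. The function name unify_city_name is hypothetical.

```python
import re

# The three spelling variants from the exercise, wrapped in \b so that
# a longer word such as "DUST" is not touched (an assumption of this sketch).
VARIANTS = re.compile(r"\b(Duesseldorf|Dusseldorf|DUS)\b")


def unify_city_name(text: str) -> str:
    """Replace every spelling variant with the canonical 'Düsseldorf'."""
    return VARIANTS.sub("Düsseldorf", text)


print(unify_city_name("DUS Airport is Germany's third-busiest airport"))
```

Note that a blanket replacement also rewrites the airport code, which may or may not be what you want in your own text.
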
◕ Deal with line breaks
The result of their explorations is On Broadway: a visually-rich, image-
centric interface without maps and where numbers play only a secondary role.
Like a spine in a human body, Broadway runs through the middle of
Manhattan Island curving along its way. In order to capture the activi-
ties nearby, a slightly wider area than the street itself was included. To define
this area, points were selected at 30-metre intervals going through the
centre of Broadway, and 100-metre-wide rectangles centred on every point
were defined (see Figure 4.2). The result is a spin-like shape that is 21,390
metres (13.5 miles) long and 100 metres wide.

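expression meaning
\n line break
-\n line break after a hyphen (word split at the end of a line)
(?<!-)\n line break not preceded by a hyphen
(?<![\?\.\!])\n line break not preceded by . ? or ! (i.e. not at the end of a sentence)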
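
As a rough sketch, two of these expressions can be chained in Python to turn hard-wrapped text back into flowing sentences. The function name unwrap_lines and the order of the two steps are choices made for this example only.

```python
import re


def unwrap_lines(text: str) -> str:
    """Rejoin split words, then unwrap lines that do not end a sentence."""
    # Step 1: "activi-\nties" -> "activities"
    text = re.sub(r"-\n", "", text)
    # Step 2: replace a line break with a space unless it follows . ? or !
    text = re.sub(r"(?<![.?!])\n", " ", text)
    return text


wrapped = "In order to capture the activi-\nties nearby, a wider\narea was included.\nTo define"
print(unwrap_lines(wrapped))
```

Step 1 cannot tell a word split by the line end from a genuine compound: "image-" followed by "centric" on the next line would become "imagecentric", so such cases still need a manual check.
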
◕ Handle digits and line beginning
11:01
John: You may want to copy a conversation from Slack
11:01
John: or another tool used to communicate with your colleagues.
11:01
Emilly: that's interesting indeed.
11:02
John: Think about how you'll get rid of unnecessary information – the names of your partners and the time of the conversation.
11:03
John: Can you formulate your expression so that you only catch the numbers
11:03
John: and names at the beginning of the line? Could you get it done before 20:00?

You may want to copy a conversation from Slack or another tool used to communicate with your colleagues. Think about how you'll get rid of unnecessary information – the names of your partners and the time of the conversation. Can you formulate your expression so that you only catch the numbers and names at the beginning of the line? Could you get it done before 20:00?

expression meaning
^\d\d:\d\d hh:mm at the start of a line
^\n blank line
^Emilly.*$ line beginning with ‘Emilly’
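
One possible way to script this clean-up in Python is shown below. It is a sketch under a few assumptions: the timestamp always sits on its own line, Emilly's lines are dropped entirely, and only John's prefix is stripped. The function name clean_chat is hypothetical. The re.MULTILINE flag makes ^ match at the start of every line rather than only at the start of the text.

```python
import re


def clean_chat(chat: str) -> str:
    """Strip timestamps, Emilly's lines and the 'John: ' prefix."""
    # Lines consisting of a timestamp such as "11:01"
    chat = re.sub(r"^\d\d:\d\d\n", "", chat, flags=re.MULTILINE)
    # Whole lines written by Emilly, including their line break
    chat = re.sub(r"^Emilly.*\n?", "", chat, flags=re.MULTILINE)
    # The speaker prefix at the start of the remaining lines
    chat = re.sub(r"^John: ", "", chat, flags=re.MULTILINE)
    return chat


chat = (
    "11:01\nJohn: You may want to copy\n"
    "11:01\nEmilly: that's interesting indeed.\n"
    "11:03\nJohn: done before 20:00?\n"
)
print(clean_chat(chat))
```

Because the timestamp pattern is anchored with ^, the "20:00" in the middle of the last message is left untouched.
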
◕ Insert characters at the beginning and/or end of line
one
two
three

{  one  }
{  two  }
{  three  }

find replace
(^.+$) { $1 }
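
The same idea can be tried in Python, with one difference worth knowing: Python's re module refers to a captured group as \1 (or \g<1>) in the replacement, not $1. The function name wrap_in_braces is made up for this sketch.

```python
import re


def wrap_in_braces(text: str) -> str:
    """Put '{ ' before and ' }' after every non-empty line."""
    # re.MULTILINE lets ^ and $ match at the start and end of each line.
    return re.sub(r"^(.+)$", r"{ \1 }", text, flags=re.MULTILINE)


print(wrap_in_braces("one\ntwo\nthree"))
```
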
◕ Solve common text editors’ problems
The presidency of Donald Trump began at noon EST (17:00 UTC) 
on 20-01-2017 ,when Donald Trump was inaugurated 
as the 45th president of the United States , succeeding 
Barack Obama. (...) The nonpartisan Tax Policy Center estimated 
that the richest 0,1% and 1% would benefit the most in raw 
dollar amounts and percentage terms from the [Trump's] tax plan , 
earning 10.2% and 8,5% more income after taxes respectively.

The presidency of Donald Trump began at noon EST (17:00 UTC) 
on 2017.01.20, when Donald Trump was inaugurated 
as the 45th president of the United States, succeeding 
Barack Obama. (...) The nonpartisan Tax Policy Center estimated 
that the richest 0.1% and 1% would benefit the most in raw 
dollar amounts and percentage terms from the [Trump's] tax plan, 
earning 10.2% and 8.5% more income after taxes respectively. 

(Wikipedia)

find replace
\s,\s ,
(\d+),(\d%) $1.$2
([^\s])\s,([^\s]) $1, $2
(\d\d).(\d\d).(\d{4}) $3.$2.$1
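
For comparison, here is a hedged Python sketch of the four fixes. It writes group references as \1, \2, \3 (Python's syntax) and narrows the date separator to a hyphen or a dot, which is an assumption of this example rather than part of the exercise. The function name fix_punctuation is hypothetical.

```python
import re


def fix_punctuation(text: str) -> str:
    """Apply the comma, decimal and date fixes from the table above."""
    # Decimal commas in percentages: "0,1%" -> "0.1%"
    text = re.sub(r"(\d+),(\d%)", r"\1.\2", text)
    # Space on both sides of a comma: "States , succeeding" -> "States, succeeding"
    text = re.sub(r"\s,\s", ", ", text)
    # Space before a comma but none after it: "2017 ,when" -> "2017, when"
    text = re.sub(r"(\S)\s,(\S)", r"\1, \2", text)
    # Dates: dd-mm-yyyy (or dd.mm.yyyy) -> yyyy.mm.dd
    text = re.sub(r"(\d\d)[-.](\d\d)[-.](\d{4})", r"\3.\2.\1", text)
    return text


print(fix_punctuation("on 20-01-2017 ,when Donald Trump was inaugurated"))
```
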

2. PDF parsing

Suppose you want to quote extensive excerpts from an online article or republish it in a non-commercial magazine that you edit (you can do so without asking for permission, as the book is published under Creative Commons CC BY NC). You don’t have time to email the publisher, so you decide to copy the article manually from the PDF file. Of course, the result is not satisfactory. How do you get rid of the unwanted formatting without losing half a day to manual work?

  • Open your favourite text editor which can handle regular expressions
  • Copy-paste by hand the content of the following academic article: Case Study: On Broadway by Daniel Goddemeyer, Moritz Stefaner, Dominikus Baur & Lev Manovich
  • To perform the task, use the patterns from previous exercises
Here are the expressions that we’ve already used:
expression meaning
\n newline
\s whitespace
. any single character
\. (literal) dot
a* zero or more of a
a+ one or more of a
\d any digit
^ start of string/line
$ end of string/line
[abc] any one of the characters a, b or c
[^abc] any character except a, b or c
a{3,} three or more consecutive a characters
(a|b) either a or b
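
If you would rather script the PDF clean-up than repeat the replacements by hand, the expressions above can be collected into an ordered list and applied one after another. The following Python sketch shows one possible arrangement; the names CLEANUP_STEPS and clean_pdf_text, the choice of steps and their order are all assumptions made for illustration, and real PDF text will likely need more rules.

```python
import re

# (pattern, replacement) pairs, applied top to bottom. The order matters:
# split words must be rejoined before the remaining line breaks are unwrapped.
CLEANUP_STEPS = [
    (r"-\n", ""),             # rejoin words hyphenated at a line end
    (r"(?<![.?!])\n", " "),   # unwrap lines, keep breaks after . ? !
    (r"[ \t]{2,}", " "),      # collapse runs of spaces and tabs
]


def clean_pdf_text(text: str) -> str:
    """Run every clean-up step over the pasted text, in order."""
    for pattern, replacement in CLEANUP_STEPS:
        text = re.sub(pattern, replacement, text)
    return text


pasted = "In order to capture the activi-\nties nearby, a  slightly wider\narea was included.\nTo define"
print(clean_pdf_text(pasted))
```

Keeping the rules in a list makes it easy to add, remove or reorder steps as you discover new quirks in the pasted text.
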
Resources

Fallen in love with regular expressions already and want to try more? There are plenty of great tools and tutorials on the web; below are my two favourites.