Text processing #

As a designer, you have probably already discovered GREP’s superpowers. It is nothing more than regular expressions, which computer scientists have used for a long time. Did you know that you can also use them in Google Docs or in your text editor?

The find and replace command alone can make your daily work much more efficient. In this exercise, you will learn several useful tricks to help you clean up your text data. We’ll start with Ctrl + H and simple commands, and work our way up to regular expressions.

◕ Learn the basics of regular expressions #

This short online tutorial gives you a quick introduction to the fascinating (or scary, as some people believe) world of regular expressions.

You can do further exercises using Regex 101, an online tool that makes it easy to create and test regular expressions.

◕ Remove white spaces #

Now let’s try out a few simple patterns that will prove useful in your routine text editing work.

Modern  writers,  painters,  photographers,  filmmakers  and  digital  
artists  have created many fascinating representations of city life.
Paintings of Parisian   boulevards and cafés by Pissarro and Renoir,
photomontages by Berlin Dada  artists,      Spider-Man comics  by  Stan  
Lee  and  Steve  Ditko,  Broadway Boogie-Woogie by Piet Mondrian and
Playtime by Jacques Tati are some of the classic examples of artists
encountering the city.

expression
\s\s double spaces
\s{2} double spaces
\s{2,} multiple spaces
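
The same pattern can be tried outside a text editor. The following is a minimal sketch in Python (an assumption, since the exercise itself only uses find and replace dialogs); the sample string is a shortened, hypothetical variant of the text above.

```python
import re

# Sample line with runs of spaces, modelled on the text above.
messy = "Modern  writers,  painters,      photographers  and  digital artists"

# \s{2,} matches two or more whitespace characters in a row;
# each run is replaced by a single space.
clean = re.sub(r"\s{2,}", " ", messy)
print(clean)

# \s also matches newlines, so a variant limited to spaces and tabs
# keeps the line breaks of a multi-line text intact.
two_lines = "digital  \nartists  have"
kept = re.sub(r"[ \t]{2,}", " ", two_lines)
print(repr(kept))
```

Whether line breaks should survive depends on the text at hand, so pick `\s{2,}` or `[ \t]{2,}` accordingly.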

◕ Replace words to improve consistency #

Mercer's 2012 Quality of Living survey ranked Dusseldorf the sixth
most livable city in the world. DUS Airport is Germany's third-busiest
airport after those of Frankfurt and Munich, serving as the most
important international airport for the inhabitants of the densely
populated Ruhr, Germany's largest urban area. Duesseldorf is an inter-
national business and financial centre, renowned for its fashion
and trade fairs, and is headquarters to one Fortune Global 500
and two DAX companies. (Wikipedia)

find replace
(Duesseldorf|Dusseldorf|DUS) Düsseldorf
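
As a rough illustration, the alternation above could be applied in Python like this. The sample text is shortened from the paragraph above, and using Python at all is an assumption of this sketch.

```python
import re

# Shortened sample with three different spellings of the same city.
text = (
    "Mercer's 2012 Quality of Living survey ranked Dusseldorf the sixth "
    "most livable city in the world. DUS Airport is Germany's third-busiest "
    "airport. Duesseldorf is an international business and financial centre."
)

# The | operator means "or": any of the three variants is matched.
pattern = r"(Duesseldorf|Dusseldorf|DUS)"
consistent = re.sub(pattern, "Düsseldorf", text)
print(consistent)
```

The match is case-sensitive, so `DUS` does not accidentally match the first letters of `Dusseldorf`.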

◕ Deal with line breaks #

The result of their explorations is On Broadway: a visually-rich, image-
centric interface without maps and where numbers play only a secondary role.
Like a spine in a human body, Broadway runs through the middle of
Manhattan Island curving along its way. In order to capture the activi-
ties nearby, a slightly wider area than the street itself was included. To define
this area, points were selected at 30-metre intervals going through the
centre of Broadway, and 100-metre-wide rectangles centred on every point
were defined (see Figure 4.2). The result is a spine-like shape that is 21,390
metres (13.5 miles) long and 100 metres wide.

expression
\n line break
-\n word split at the end of a line
(?<!-)\n line break that does not follow a word split
(?<![\?\.\!])\n line break that does not follow the end of a paragraph (.?!)

The (?<!...) pattern you see above is called a negative lookbehind. Here it ensures that the newline character is not matched when it follows a word split or the end of a paragraph. See the reference section at Regex 101 to learn more about it.
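
The two steps can be chained. Below is a minimal sketch in Python (an assumption; any regex-capable editor works the same way) on a shortened, re-wrapped variant of the sample text.

```python
import re

# Shortened, re-wrapped variant of the sample text above.
text = (
    "Manhattan Island curving along its way. In order to capture the activi-\n"
    "ties nearby, a slightly wider area than the street itself was included.\n"
    "To define this area, points were selected at 30-metre intervals going\n"
    "through the centre of Broadway.\n"
)

# Step 1: rejoin words that were split at the end of a line.
joined = re.sub(r"-\n", "", text)

# Step 2: replace the remaining newlines with a space, unless the
# newline follows ., ? or ! (negative lookbehind).
flowed = re.sub(r"(?<![?.!])\n", " ", joined)
print(flowed)
```

One caveat: step 1 also removes the hyphen of a genuine compound that happens to sit at a line end (such as "image-" followed by "centric" in the full sample), so the result is worth a quick manual check.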

◕ Handle digits and line beginning #

You may want to copy a conversation from Slack or another tool used to communicate with your colleagues. Think about how you’ll get rid of unnecessary information – the names of your partners and the time of the conversation. Can you formulate your expression so that you only catch the numbers and names at the beginning of the line? Could you get it done before 20:00?

  • Remove numbers (e.g. 11:01) from the line beginning.
  • Drop the lines beginning with Emilly.
  • Leave out the John: prefix.
  • Remove empty lines and newline characters (\n).

11:01
John: You may want to copy a conversation from Slack
11:01
John: or another tool used to communicate with your colleagues.
11:01
Emilly: that's interesting indeed.
11:02
John: Think about how you'll get rid of unnecessary information – the names of your partners and the time of the conversation.
11:03
John: Can you formulate your expression so that you only catch the numbers
11:03
John: and names at the beginning of the line? Could you get it done before 20:00?

expression
^\d\d:\d\d hh:mm at line start
^\n blank line
^Emilly.*$ line beginning with ‘Emilly’
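
One possible way to combine these expressions is sketched below in Python (an assumption of this example). The chat is a shortened version of the one above, and the `^John: ` pattern is not part of the table; it is a guess at how the prefix could be caught.

```python
import re

chat = (
    "11:01\n"
    "John: You may want to copy a conversation from Slack\n"
    "11:01\n"
    "Emilly: that's interesting indeed.\n"
    "11:03\n"
    "John: Could you get it done before 20:00?\n"
)

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not only at the start and end of the whole text.
M = re.MULTILINE
cleaned = re.sub(r"^\d\d:\d\d", "", chat, flags=M)    # hh:mm at line start
cleaned = re.sub(r"^Emilly.*$", "", cleaned, flags=M)  # Emilly's lines
cleaned = re.sub(r"^John: ", "", cleaned, flags=M)     # speaker prefix (assumed pattern)
cleaned = re.sub(r"^\n", "", cleaned, flags=M)         # blank lines left behind
print(cleaned)
```

Because the time pattern is anchored with `^`, the `20:00` in the middle of the last sentence is left alone.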

◕ Insert characters at the beginning and/or end of line #

one
two
three

{ one }
{ two }
{ three }

find replace
(^.+$) { $1 }
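
A small sketch of the same replacement in Python (chosen here as an assumption). Note that Python writes the captured group as `\1`, where many editors use `$1`.

```python
import re

lines = "one\ntwo\nthree"

# (^.+$) captures each non-empty line as group 1;
# the replacement puts braces around it.
wrapped = re.sub(r"(^.+$)", r"{ \1 }", lines, flags=re.MULTILINE)
print(wrapped)
```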

◕ Solve common text editing problems #

The presidency of Donald Trump began at noon EST (17:00 UTC)
on 20-01-2017 ,when Donald Trump was inaugurated
as the 45th president of the United States , succeeding
Barack Obama. (...) The nonpartisan Tax Policy Center estimated
that the richest 0,1% and 1% would benefit the most in raw
dollar amounts and percentage terms from the [Trump's] tax plan ,
earning 10.2% and 8,5% more income after taxes respectively.

The presidency of Donald Trump began at noon EST (17:00 UTC)
on 2017.01.20, when Donald Trump was inaugurated
as the 45th president of the United States, succeeding
Barack Obama. (...) The nonpartisan Tax Policy Center estimated
that the richest 0.1% and 1% would benefit the most in raw
dollar amounts and percentage terms from the [Trump's] tax plan,
earning 10.2% and 8.5% more income after taxes respectively.

(Wikipedia)

find replace
\s,\s , (a comma followed by one space)
(\d+),(\d%) $1.$2
([^\s])\s,([^\s]) $1, $2
(\d\d).(\d\d).(\d{4}) $3.$1.$2

Resources #

Have you fallen in love with regular expressions already and want to try more? There are plenty of great tools and tutorials on the web. Below are my two favourites.