Tuesday, June 29, 2010

Perl: Removing Text Within Parentheses

I recently wanted a list of major landmarks, but the text list I had gave the name of each landmark followed by its location in parentheses, for example: Eiffel Tower (Paris, France). I just wanted the names of the landmarks without the text in the parentheses, so I had to figure out a command to remove all the parenthesized text.

Perl was the winner as tool of choice. There was a small trick to doing this seemingly trivial task, so document it we shall.

Original: Eiffel Tower (Paris, France)
Desired: Eiffel Tower

There are two ways to do it, depending on what you want:

perl -p -e 's#\(.*\)##g' textfile

You may have seen 's/oldtext/newtext/g' as the syntax before and wonder why I am using hash marks (pound signs) instead. You don't have to use the forward slash; it is just the common choice. If you want a forward slash in the search text without having to escape it, a different delimiter such as the hash mark is the way to go, and it can also make the expression easier to read. Now, on to the command: the \( says to look for a literal left parenthesis, then comes the critical .* which matches any number of any characters, and finally \) closes it off with a literal right parenthesis. Because .* matches as much as it can, this finds everything from the first left parenthesis to the last right parenthesis on the line.

perl -p -e 's#\([^)]*\)##g' textfile

This solution does the same thing to our Original text string, but it works slightly differently. The [^)] means "any character that is not a right parenthesis." The caret (^) at the start of the brackets negates everything in them, which is useful when you want an exclusion set. You can list several characters, as in [^$)?], and it will match any character except a $, ), or ?. Here the match ends at the first right parenthesis after the opening one.

Since the two commands work the same for the given example, let's show how the commands will vary in different situations:
If textfile contains
1. Paris (France,) Hilton (Hotel)
2. Paris (France (Hilton) Hotel)
perl -p -e 's#\(.*\)##g' textfile
The results would be:
1. Paris
2. Paris
Note the danger here: in line 1, Hilton is not in parentheses, yet it gets removed because .* matches as much as it can, all the way to the last right parenthesis on the line. This may not be the expected or intended operation.

Next, using
perl -p -e 's#\([^)]*\)##g' textfile
The results would be:
1. Paris Hilton
2. Paris Hotel)
The result for line 1 may be what we were expecting, but line 2 doesn't look good: the match ends at the first right parenthesis, so the nested text is cut short and a stray ) is left behind. The moral here is to understand what you are trying to do and choose the command that does the appropriate operation.

1 comment:

  1. The ultimate reason that the second version leaves the final parenthesis is that regular languages cannot match nested parentheses. You'd need a context-free language/grammar to do that. As a workaround, though, you could make the pattern match only innermost pairs and then keep applying it until the text stops changing. Specifically:

    perl -p -e 's#\([^()]*\)##g'

    Of course, this assumes that the parentheses are all on the same line. And that you don't intermix parentheses with square brackets, etc.