The Pumping Lemma (for Regular Languages)

So apparently, this is the bit of formal logic at which all computer science undergrads quake in their boots. Guess I've got that to look forward to at uni. In the meantime, let's have a crack at understanding this:

For every regular language L there exists a positive integer p, such that every string W in the language L may be broken down into three substrings X, Y, and Z, where Y is not the empty string, the length of XY does not exceed p, and for each n in the set of natural numbers, X(Y^n)Z is also in the language L.

Christ. That'll probably take some explaining.

First of all, let's consider this fact: any regular language can be expressed as an equivalent deterministic finite state machine. So any word in the language is just a valid path through the finite state machine.

Really, what this all hinges on is the idea of the Pigeonhole Principle. If you're not familiar with it, here's what it comes down to:

If there's h pigeon-holes and p pigeons, and p > h, then there's bound to be a pigeon-hole with more than one pigeon hiding inside.

In the same way, let's imagine our regular language as the finite state machine. If we count up all the states, we get a positive integer. And if we remember that finite state machines can only output one character per state, it seems like we reach a bit of a problem. What if there's a regular language containing strings with more letters than there are states in the machine?

Well, now it gets interesting. Regular languages, just like their equivalent finite state machines, can repeat bits of themselves to infinity. Here's a regular expression to match any string that starts with an 'A', then has some positive number of 'B's, then an 'A' again:

A(B)+A

(If you're not familiar with regular expressions, the brackets are just used for grouping - just like orders of precedence in maths, the brackets make sure that only the 'B' has the '+' applied to it. Although in this case, it strictly isn't needed - I've just put it in there to be extra-clear.)

If we consider this as a finite state machine, we'll probably have 3 states, and then a final "accepting" state. In the first state the A is printed, and then we have no choice but to move to the second state. The B is printed. Now, we have the choice either to print B again (and stay in the second state), or move to the third state and print A. Lastly, we have no choice but to move on into the accepting state.

The second state here is really the important one, and it demonstrates what we'd call "pumping": repeating a state (or group of states) in order to get their output more than once.

ABBBBBBBBBBBBA would be an accepted word in this regular language: even though it's got more characters than there are states in the machine, it starts and ends with 'A's and has some positive integer number of 'B's sandwiched in the middle.

So, what's the upshot of all this? Well, let's tie the ideas of "pumping" and the Pigeonhole Principle together. For any regular language, there's a finite number of states in the state machine. This means that for any regular language, there's some word-length that can only be achieved if you 'pump' the finite state machine: if you repeat one or more states.

But how is this useful? There's some regular languages which can't be pumped at all - just think of the language {"abc"} as an example; it only accepts the string "abc" and no other strings.

Well, the Pumping Lemma is absolutely useless for proving a language is regular. But where it comes in handy is proving that language are not regular.

Here's an example: let's prove that the set of all palindromes is NOT a regular language.

Palindromes [PDF]

And if you want a challenge, I've gone through the process of showing that the set of all matched brackets - e.g. ((()))()(()) (LISP, anyone?) is NOT a regular language:

Brackets [PDF]