Friday, December 5, 2008

Polya-style write-up: representing a language with binary strings with even 0s and odd 1s

This question came up in class during the lecture where we were introduced to DFSA: Find a regular expression that denotes the language of binary strings with an even number of zeros and an odd number of ones, and prove that it denotes that language. My Polya-style write-up will attempt the latter part, using the regex:

R = (00+11+(01+10)(00+11)*(10+01))*(1+(01+10)(11+00)*0)

This seemed like a good problem to apply the Polya process to, since at the time I was working on it, it was the first such proof I had done. Hence, I was working a priori in a way, and it wasn't just a case of following a familiar, mechanical process, and so I made more interesting mistakes.

Understanding the Problem:


My first insight into this type of proof was that I would have to prove two things to show that R denotes the language in question (I'll call it L' from now on, for convenience). I would have to show that all strings generated by R are in L', and that all strings in L' can be generated by R. The way I like to think of this now, that makes it even more clear, is that to prove L(R) = L', I can show that L(R) is both a subset and a superset of L'.

An easy way to convince myself that neither criterion alone is enough would be to consider regexes like 001 which satisfies the former, or (0+1)* which satisfies the latter. It's abundantly clear that neither regex denotes the language L' though.

Making a Plan 1.0:


My first attempt at showing that L' is a subset of L(R), involved taking an arbitrary string from L' and then breaking it up into two smaller strings based on finding the rightmost 1. I think my thinking at the time was that those two substrins would be easier to deal with, since I would be left with a substring containing only 0's, and a substring where the final symbol is known.

Carrying out the Plan 1.0:

Recorded below is my initial attempt at solving the problem, warts and all. I've put it in small font so as not to clutter things up too much, so feel free to zoom in if reading my mistakes in detail interests you):

To prove that all strings, s, in L' can be generated by R. For an arbitrary s:

Find the rightmost 1. We know at least one must exist, since 1 is the smallest odd number.

Case 1: the rightmost one is followed by an odd number of zeros.
Then s = a0(00)*, where a is a string with odd 1s and odd 0s.
a is a member of L((00+11+(01+10)(00+11)*(10+01))*)
a is an arbitrary string with even zeros and ones.
let a_0 be the first pair of letters in a (if a is empty, then by the kleene star, the regex can form it, otherwise it must have even length)
Subcase 1: a_0 is 00 or 11. (00+11)* is a subset of the regex, so a_0 can be formed.
Subcase 1: a_0 is 01 or 10.
Then subsequent pairs of letters are either in (00+11) or (01+10). If we eventually reach one in the latter set, then it is formed by (01+10)(00+11)*(10+01) which is a subset of the regex.
We know that eventually it does reach one in that set though, since by our assumption a has an even number of zeros and ones. We started with an odd number of both, and every time a pair in (00+11) is encountered, the parity does not change.

So if a is not of length zero, it can either be expressed as (00+11)b or (01+10)(00+11)*(01+10)b where b is some string. Since b is preceded by a string with even zeros and ones, and therefore must have the same property (if j is even, and j + k is even, then k must be even, for natural j and k).

We can apply the same process to substring b, and continue the process until we get the empty string (which we must since the lengths of the suffix strings are a set of decreasing natural numbers).

Case 2: The rightmost 1 is followed by an even number of zeros.

Then s can be expressed as a1(00)*, where a is some string with an even number of zeros and ones.
Since L(1(00)*) is a subset of L(1+(01+10)(11+00)*0)
Subcase 1: 1 is followed by zero zeros. L(1) is a subset of L(1+(01+10)(11+00)*0) so it can be formed.
Subcase2: 1 is followed by a non-zero, even number of zeros. Then it is in L(10(00*)0), which is a subset of L(1+(01+10)(11+00)*0)

By the proof above, a can be expressed by the regex (00+11+(01+10)(00+11)*(10+01))
*.

You'll notice that the proof is certainly incomplete. After some time I gave up, realizing that I wasn't getting anywhere with my approach. Specifically, I didn't really have enough information about the prefix I had formed, so I realized that the place I had chosen to divide the string probably wasn't appropriate.

After this, I went back to look at the regular expression in question, and tried to better understand it, and perhaps understand the author's motivation in writing the regex that way (since I state without proof that there are many other equivalent REs).

Forming a Plan 2.0:

Reading the RE again, I thought about the division of the string into a substring with parity of 0s and 1s, and a substring with parity of only 0s. That division really is the essence of the RE, and in retrospect, it was probably foolish of me to ignore that, and think that my own division was more clever.

My second attempt at proving L' to be a subset of L(R) was much more in accord with the spirit of the regular expression.

Carrying out the Plan 2.0:


Preamble:
Find the smallest substring that is a suffix to x and has an odd number of ones and an even number of zeros. We know that one such substring exists, since s itself has this property by assumption. s is finite and so has a finite number of substrings - a subset of s's substrings is therefore also finite and has a largest member (and a unique one, since only one substring is a suffix to x and has length n for natural n).

call this substring b, and let a be the substring that when concatenated with b forms s.

Left side
Then a has an even number of zeros and ones. (Since s must have an odd number of ones and an even number of zeros. b posses this property, and we will maintain it, since adding an even number preserves parity). a must also have an even length, since its length is the sum of its zeros (even) and the sum of its 1s (even).

If a is not the empty string, let a_0, a_1, a_2... be the smallest non-empty substrings of a, each having an even number of 1s and 0s, such that:
a_0a_1a_2.... = a

then for each a_n, either:

a_n is 00 or 11
or a_n begins with one of 01 or 10. (If it began with 11 or 00 and had length greater than 2, it could be "split" into 11 or 00 followed by a shorter string)
then a_n ends with either 01 or 10 (by the same logic as before - if it ended in one of 11 or 00, the end could be split off)
and between the opening and closing (01+10) there may be some number of 00 or 11 pairs, since these preserve a_n's indivisibility.

the first case corresponds to the regular expression (00+11) and the second to (01+10)(00+11)*(10+01).
So a is in their union, or L((00+11+(01+10)(00+11)*(10+01))*

Right side:
Turning our attention to b:
Case 1: b is 1
Case 2: b is not 1.
then b begins with either 01 or 10. This is true, since if it began with 00 or 11, then those two characters could be "split" from b and we would be left with a smaller suffix with even 0s and odd 1s. Also, we know that b is of length at least 3, since its length is larger than 1 and odd.
then b ends with a zero. If b ended with a 1, then a smaller suffix of s with odd 1s and even 0s would exist (that being 1).
so b is equal to one of 01 or 10, followed by an even number of characters which we divide into pairs (only those pairs whose first member has an even index - that is to say, each character appears in only one pair), followed by a zero.
assume at least one of the above mentioned pairs is 01 or 10.
then if we locate the rightmost one of those pairs, we see that there exists some substring 01(00+11)*0, with even zeros and odd 1s, which is smaller than b
this contradicts our assumption, so the pairs must consist only of 00 and 11.

So we may describe b as either 1, or one of 01 or 10, followed by some (possibly zero) number of pairs of zeros or ones, followed by a zero
The above description is described by the regular expression (1+(01+10)(11+00)*0)

Dénouement
Since s is the concatentation of a and b, it can be described by the concatenation of regular expressions for a and b. So

(00+11+(01+10)(00+11)*(10+01))*(1+(01+10)(11+00)*0)


Proving L' to be a subset of L(R) was the main hurdle I faced in this proof, and showing L' to be a superset I found to be fairly simple, and so I won't devote much time, or even a very formal proof, to it.

My plan in showing that R generates good strings was to divide it into subsets (which I think is a natural thing to do, given all the "+" operators that are used. I state that if L(a) L(b) and L(c) all have some common property, then L(a+b+c) has that property (call this Lemma A). I also state that if the number of some symbols in L(a) has parity, then the same is true for L(a*) (Lemma B). I state these without proof (mostly because I don't consider a proof to be necessary here, and for both statements to be self-evident, not because writing a proof would be difficult). These will be the driving force behind my proof.

First I wish to show that L(00+11+(01+10)(00+11)*(10+01)) always generates strings with even 1s and 0s. This is true of L(00) and L(11) by observation, so I only need to prove it for L((01+10)(00+11)*(01+10)). (00+11)* generates strings with parity of 0s and 1s by Lemma B, and adding 2n 0s and 2k 1s to a string does not change the parity, so consider the language L((01+10)(01+10)). Each of the two strings in the concatenation has exactly 1 one and 1 zero, so their concetantion has exactly 2 ones and 2 zeros.

So, by Lemmas A and B, (00+11+(10+01)(00+11)*(10+01))* generates strings with even 1s and 0s.

Next I wish to show that L(1+(01+10)(11+00)*0) contains strings with odd 1s and even 0s.

This is true of L(1) by observation, so I only need to show it for L((01+10)(11+00)*0) by Lemma A. By the reasoning earlier, (11+00)* can be ignored since it won't change the parity of the string, so we consider L((01+10)0). In call cases, this generates a string with 1 one, 2 zeros. So, the property holds.

So an arbitrary string generated by the regular expression contains a prefix (possibly empty) with even 0s and even 1s, and the remaining string has odd ones and even zeros. So, the total number of ones in any string is odd, and the total zeros is even.

Looking Back:

My biggest mistake in my first approach was not paying close enough attention to (or rather, not attributing enough importance to) the regular expression itself (rather than just the language), and perhaps the authors intentions in writing it the way they did.

Another stumbling block I encountered while writing this proof was formalizing some of my observations. At many turns I found myself having to state things informally. I was able to improve the expressiveness of my proof somewhat by speaking more of the language denoted by a regular expression, rather than the regular expression itself. For example, I think it's more convincing to say something like L(01+10)1 = {011, 101} which contains only elements with an even number of ones and an odd number of zeros, rather than saying simply that (01+10)1 always generates strings with that property because it "chooses between" 01 and 10 which both have odd zeros and odd ones, and then appends a 1 (maybe I was committing the pathetic fallacy with my regular expressions).

Ahem. Sorry, that ended up being an ungodly length. I don't know if Blogger has functionality for any of that "after the jump" trickiness, but if it does I can't find it. I apologize to the people who read this blog (the empty set?). I think "The Empty Set" would be a great name for a band. Just saying.

No comments: