ABC

One for the discrete mathematics wonks:

Consider strings of symbols taken from an alphabet (a set of symbols). A string contains an echo if it contains a substring of the form SS, where S stands for any substring. So the string hello contains an echo (S = l), as does banana (S = an, or S = na). But bandana is contains no echo: although the substring an occurs twice, these occurrences are not contiguous. A string without an echo is echo-free.

So here's the question: Given an alphabet, what is the longest echo-free string of symbols from that alphabet? Is there a echo-free string (or strings) of length n such that all strings with length greater than n do contain an echo?

With an alphabet of only one symbol, a, the longest echo-free string is obviously just a, since aa contains an echo.

For an alphabet of two symbols, a and b, the longest echo-free strings are aba and bab. Extending these strings always introduces an echo (e.g. abaa, abab).

What about an alphabet of three symbols? It turns out that with three symbols, there are echo-free strings which continue without end, and so there is no longest echo-free string. And since such strings exist for alphabets of three symbols, they exist for all larger alphabets.

Here is a definition in Haskell of an echo-free infinite string using three symbols (Haskell being a wonderful language for describing infinite sequences):

-- three magical strings
expand 'a' = "abcbabcacbc"
expand 'b' = "abcbacabacbc"
expand 'c' = "abcbacbcacbabcbacbc"

expands = concatMap expand

-- an infine echo-free string
noecho = (expand 'a') ++ (expands $ drop 1 noecho)

Using ghc to run it:

*Main> take 1000 noecho
"abcbabcacbcabcbacabacbcabcbacbcacbabcbacbcabcbacabacbcabcbabcacbcabcbacabacbcab
cbacbcacbabcbacbcabcbabcacbcabcbacbcacbabcbacbcabcbacabacbcabcbacbcacbabcbacbcab
cbabcacbcabcbacabacbcabcbacbcacbabcbacbcabcbacabacbcabcbabcacbcabcbacbcacbabcbac
bcabcbabcacbcabcbacabacbcabcbabcacbcabcbacbcacbabcbacbcabcbacabacbcabcbacbcacbab
cbacbcabcbabcacbcabcbacabacbcabcbacbcacbabcbacbcabcbacabacbcabcbabcacbcabcbacbca
cbabcbacbcabcbacabacbcabcbacbcacbabcbacbcabcbabcacbcabcbacbcacbabcbacbcabcbacaba
cbcabcbabcacbcabcbacabacbcabcbacbcacbabcbacbcabcbacabacbcabcbabcacbcabcbacbcacba
bcbacbcabcbacabacbcabcbacbcacbabcbacbcabcbabcacbcabcbacabacbcabcbacbcacbabcbacbc
abcbacabacbcabcbabcacbcabcbacbcacbabcbacbcabcbabcacbcabcbacabacbcabcbabcacbcabcb
acbcacbabcbacbcabcbacabacbcabcbacbcacbabcbacbcabcbabcacbcabcbacabacbcabcbacbcaca
bcbacbcabcbacabacbcabcbabcacbcabcbacabacbcabcbacbcacbabcbacbcabcbabcacbcabcbacbc
acbabcbacbcabcbacabacbcabcbacbcacbabcbacbcabcbabcacbcabcbacabacbcabcbacbcacbabcb
acbcabcbacabacbcabcbabcacbcabcbacbcacbab"

It's straightforward to write a program to check that some prefix of the string generated by noecho is in fact echo-free. Although, I have not found an algorithm for this with time complexity better than O(n2), making it time-consuming to check more than a few thousand symbols.

Of course, empirically verifying that some finite substring of noecho is echo-free is not the same as proving that noecho in its entirety is echo-free. The three magical strings from the program are carefully chosen to allow such a proof, which I will outline in a future post.

I expect that this result was described long ago in some corner of the mathematics literature. If anyone reading this knows where, please get in touch.

Update: thanks to a commenter, I know now that the established term for what I dubbed echo-free sequences is squarefree words.

Comment from Anonymous

Hi,

What you are describing is commonly known as "square-free words". Axel Thue studied those in the beginning of the XX-th century, so an infinite square-free word is sometimes called "a Thue word".

A quick search reveals shorter substitutions and a better algorithm to check the property.

Eugene.

Comment from David

Excellent!

I think I can forgive myself guessing that the official term is squarefree.