# Does a deep orthography decrease the entropy of a written language?

#### FlatAssembler

In a paper I published in Valpovački godišnjak and Regionalne studije, I measured the collision entropy of 5 languages: English, German, French, Italian and Croatian. I measured the collision entropy in both a long text and in the Aspell spell-checker word-list for that language. You can see the results in the table:

Of those five languages, English and French have far deeper orthographies than the others. They also have the lowest collision entropy in both the long text and the Aspell word list. If we assume the depth of the orthography has no effect on the collision entropy, the probability of that happening by chance (the p-value of my observation) is 2*(1/(5!/(2!*(5-2)!)))^2=2*(1/10)^2=1/50.
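To make that arithmetic easy to check, here is a minimal sketch that enumerates all orderings of the five languages. It rests on assumptions of mine that the post does not spell out: under the null hypothesis every ranking is equally likely, the two measurements (long text and word list) are independent, and the factor of 2 is a two-sided correction (the deep-orthography pair being either lowest or highest). The names `LANGUAGES`, `DEEP` and `prob_deep_pair_lowest` are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

# Hypothetical labels for this illustration only.
LANGUAGES = ["English", "French", "German", "Italian", "Croatian"]
DEEP = {"English", "French"}


def prob_deep_pair_lowest():
    """Fraction of all orderings in which the two deep-orthography
    languages occupy the two lowest ranks."""
    orderings = list(permutations(LANGUAGES))
    hits = sum(1 for order in orderings if set(order[:2]) == DEEP)
    return Fraction(hits, len(orderings))


single = prob_deep_pair_lowest()

# Brute force agrees with 1 / C(5, 2) = 1/10.
assert single == Fraction(1, comb(5, 2))

# Assumed: two independent measurements, then a two-sided factor of 2.
p_value = 2 * single ** 2
print(p_value)  # 1/50
```

With only five languages the smallest p-value this kind of test can produce is not much below 1/50, so the result is suggestive rather than strong.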

Now, obviously, suggesting that the depth of the orthography decreases the collision entropy (or, for that matter, any entropy, including the Shannon entropy) seems absurd in the light of historical linguistics. Historical linguistics teaches us that the way a word is spelt in a language with a deep orthography corresponds to how it was pronounced at some point in the history of the language. English spelling represents how English was pronounced at the time of the invention of the printing press. One of the basic principles of historical linguistics is the assumption that languages spoken in the past had, on average, the same statistical properties as languages spoken today. Saying that languages spoken in the past had a lower collision entropy obviously contradicts that principle.

So, what do you think?

Can you please explain what "collision entropy" means?

Sure. There are two equivalent definitions of collision entropy:
1. Collision entropy is the negative logarithm of the probability that, if you randomly choose two characters from a string, you have chosen equal ones.
2. Collision entropy is the negative logarithm of the sum of the squares of the relative frequencies of symbols in a string.
In the program I used to measure the collision entropy, I used the first definition.
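The program itself is not shown here, so the following is only a minimal sketch of how the two definitions could be written, not the code used for the paper. It assumes the two characters are drawn independently (with replacement), which is what makes the definitions agree exactly, and it assumes a base-2 logarithm so the result is in bits per symbol. The function names are made up for illustration.

```python
from collections import Counter
from math import isclose, log2


def collision_entropy_pairs(text):
    """Definition 1: -log2 of the probability that two independently
    chosen positions (with replacement) hold equal characters.
    Counts all ordered pairs, so it is O(n^2); fine for short strings."""
    n = len(text)
    equal_pairs = sum(1 for a in text for b in text if a == b)
    return -log2(equal_pairs / (n * n))


def collision_entropy_frequencies(text):
    """Definition 2: -log2 of the sum of squared relative frequencies."""
    n = len(text)
    counts = Counter(text)
    return -log2(sum((c / n) ** 2 for c in counts.values()))


sample = "abracadabra"
# Both definitions give the same value, up to floating-point error.
assert isclose(collision_entropy_pairs(sample),
               collision_entropy_frequencies(sample))
print(collision_entropy_frequencies(sample))
```

For a long text the frequency-based version is the practical one, since it needs only one pass over the characters.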

1. Collision entropy is the negative logarithm of the probability that, if you randomly choose two characters from a string, you have chosen equal ones.
Equal ones what?

You mean "...you have chosen two letters with the same probability", I think. OK, that threw me for a sec.

Let's say we have the string "abb". What is the probability that, if you randomly choose two characters from it (independently, so the same position can be picked twice), you have chosen equal characters? The probability that the first character is 'a' is 1/3, and in that case the probability that the second one is also 'a' is again 1/3. The probability that the first character is 'b' is 2/3, and in that case the probability that the second one is also 'b' is again 2/3. So, the probability that you've chosen two equal characters is 1/3*1/3+2/3*2/3=5/9≈0.556. And therefore, the collision entropy is -log2(5/9)≈0.848 bits/symbol.
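The same "abb" calculation can be followed step by step in code. This is just a sketch with exact fractions, assuming the two picks are independent; the variable names are mine.

```python
from collections import Counter
from fractions import Fraction
from math import log2

text = "abb"
counts = Counter(text)

collision_probability = Fraction(0)
for char, count in counts.items():
    p_first = Fraction(count, len(text))   # chance the first pick is `char`
    p_second = Fraction(count, len(text))  # chance the second pick matches it
    collision_probability += p_first * p_second

entropy_bits = -log2(float(collision_probability))

print(collision_probability)   # 5/9
print(round(entropy_bits, 3))  # 0.848
```

Keeping the exact fraction 5/9 until the end gives about 0.848 bits/symbol; taking the logarithm of the already rounded 0.556 gives 0.847 instead.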