Does a deep orthography decrease the entropy of a written language?

Discussion in 'Linguistics' started by FlatAssembler, May 5, 2023.

  1. FlatAssembler Registered Member

    Messages:
    33
    In a paper I published in Valpovački godišnjak and Regionalne studije, I measured the collision entropy of five languages: English, German, French, Italian and Croatian. For each language, I measured the collision entropy both in a long text and in the Aspell spell-checker word list for that language. You can see the results in the table:

    [Table image: collision entropy of the five languages, measured in a long text and in the Aspell word list for each language.]


    Of those five languages, English and French have a far deeper orthography than the others. And they also have the lowest collision entropy, both in the long text and in the Aspell word list. If we assume the depth of the orthography has no effect on the collision entropy, the probability of that happening by chance (the p-value of my observation) is 1/((5!/(2!*(5-2)!))^2)*2 = 2/10^2 = 1/50.
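    (In case anybody wants to check that arithmetic, here is a minimal Python sketch of the calculation as I read it: the two deep-orthography languages happening to be the two lowest-entropy ones in both lists, with the final *2 taken as a two-sided correction. That reading of the *2 is an assumption on my part.)

        from math import comb

        pairs = comb(5, 2)             # ways to pick 2 languages out of 5 = 10
        p_one_list = 1 / pairs         # the two deep-orthography languages being the two lowest-entropy ones in one list
        p_value = p_one_list ** 2 * 2  # the same in both lists, times 2 (read here as a two-sided correction)
        print(pairs, p_value)          # 10 0.02, i.e. 1/50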

    Now, obviously, suggesting that the depth of the orthography decreases the collision entropy (or, for that matter, any entropy, including the Shannon entropy) seems absurd in the light of historical linguistics. Historical linguistics teaches us that the way a word is spelt in a language with a deep orthography corresponds to how it was pronounced at some point in the history of that language. English spelling represents how English was pronounced at the time of the invention of the printing press. One of the basic principles of historical linguistics is the assumption that languages spoken in the past had, on average, the same statistical properties as languages spoken today. Saying that languages spoken in the past had a lower collision entropy obviously contradicts that principle.

    So, what do you think?
     
  3. James R Just this guy, you know? Staff Member

    Messages:
    39,426
    Can you please explain what "collision entropy" means?
     
  5. FlatAssembler Registered Member

    Messages:
    33
    Sure. There are two equivalent definitions of collision entropy:
    1. Collision entropy is the negative logarithm of the probability that, if you randomly choose two characters from a string (independently, with replacement), you choose two equal characters.
    2. Collision entropy is the negative logarithm of the sum of the squares of the relative frequencies of the symbols in the string.
    In the program I used to measure the collision entropy, I used the first definition.
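    (My program isn't posted here, but a minimal Python sketch of that first definition could look like the following. The function names are just illustrations, and it assumes the two characters are drawn independently, i.e. with replacement, which is what makes the two definitions coincide.)

        from collections import Counter
        from math import log2

        def collision_probability(text):
            # Probability that two characters drawn independently (with
            # replacement) from the string are equal: for each symbol,
            # P(first = symbol) * P(second = symbol), summed over symbols.
            counts = Counter(text)
            total = len(text)
            return sum((n / total) * (n / total) for n in counts.values())

        def collision_entropy(text):
            # Negative base-2 logarithm of the collision probability,
            # in bits per symbol.
            return -log2(collision_probability(text))

        # Example usage on a short string; for the measurements above one
        # would pass in the whole text or word list instead.
        print(collision_entropy("an example string"))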
     
  7. DaveC426913 Valued Senior Member

    Messages:
    18,960
    Equal ones what?

    You mean "...you have chosen two letters with the same probability", I think. OK, that threw me for a sec.
     
  8. FlatAssembler Registered Member

    Messages:
    33
    Let's say we have the string "abb". What is the probability that, if you randomly choose two characters from it (independently, with replacement), you choose two equal characters? Well, the probability that the first character is 'a' is 1/3, and in that case the probability that the second character is also 'a' is 1/3. Likewise, the probability that the first character is 'b' is 2/3, and in that case the probability that the second character is also 'b' is 2/3. So, the probability of choosing two equal characters is 1/3*1/3 + 2/3*2/3 = 5/9, which is about 0.556. And therefore, the collision entropy of "abb" is -log2(5/9), which is about 0.848 bits/symbol.
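    (A quick way to double-check that arithmetic with exact fractions:)

        from fractions import Fraction
        from math import log2

        # P(both 'a') + P(both 'b') for the string "abb", drawn with replacement.
        p = Fraction(1, 3) * Fraction(1, 3) + Fraction(2, 3) * Fraction(2, 3)
        print(p)                # 5/9 (about 0.556)
        print(-log2(float(p)))  # about 0.848 bits/symbol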
     
