# In Croatia, the first two consonants in river names are often 'k' and 'r'. What's the p-value?

Discussion in 'Linguistics' started by FlatAssembler, May 5, 2023.

1. ### FlatAssembler (Registered Member)
In Croatia, the first two consonants in many river names are 'k' and 'r', respectively: Karašica (two rivers with the same name), Krka, Korana, Krbavica, Krapina and Kravarščica. Mainstream linguistics considers that to be a coincidence, that those river names are unrelated. But what is the probability of something like that happening by chance? Does anybody know how to calculate that?

I have published a paper in both Valpovački Godišnjak and Regionalne Studije that tries to do just that. It is basically this text, just edited differently.

To summarize, I think I have found a way to measure the collision entropy of the different parts of the grammar.

The entropy of the syntax can be measured by taking the entropy of a spell-checker word list, such as that of Aspell, and subtracting from it the entropy of a long text in the same language (I measured only the consonants and ignored the vowels, because vowels were not important for what I was trying to calculate). I found, for example, that the entropy of the syntax of the Croatian language is log2(14)-log2(13)=0.107 bits per symbol, that the entropy of the syntax of the English language is log2(13)-log2(11)=0.241 bits per symbol, and that the entropy of the syntax of the German language is log2(15)-log2(12)=0.322 bits per symbol. It was rather surprising to me that the entropy of the syntax of German is larger than that of English, given that German syntax seems simpler (German uses morphology more than English does, which somewhat simplifies the syntax), but you cannot argue with the hard data.

The entropy of the phonotactics of a language can, I guess, be measured by taking the entropy of consonant pairs (with or without a vowel between them) in a spell-checker word list, taking the entropy of single consonants in that same word list, and subtracting the former from twice the latter. I measured the entropy of the phonotactics of the Croatian language to be 2*log2(14)-5.992=1.623 bits per consonant pair. I have taken the entropy of the phonotactics as a lower bound on the entropy of the phonology, which is the only entropy that matters in ancient toponyms (the entropies of the syntax and the morphology do not matter there, because the toponym was created in a foreign language).
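As a rough sketch of how such a collision-entropy difference might be computed (the word list and text below are tiny stand-ins, not real Aspell data, and the consonant set is an assumption, not the exact 26-consonant inventory used in the post):

```python
import math
from collections import Counter

# Rough stand-in for the Croatian consonant inventory (an assumption).
CONSONANTS = set("bcdfghjklmnprstvz") | set("čćđšž")

def collision_entropy(symbols):
    """Renyi order-2 (collision) entropy in bits: -log2(sum of p_i**2)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -math.log2(sum((c / total) ** 2 for c in counts.values()))

def consonants_of(text):
    """Keep only the consonants, ignoring vowels, spaces, and punctuation."""
    return [ch for ch in text.lower() if ch in CONSONANTS]

# Entropy of a spell-checker word list minus entropy of running text,
# as the post proposes for estimating the entropy of the syntax:
word_list = "krka korana krapina krbavica"     # stand-in for an Aspell word list
long_text = "krka krka korana korana krapina"  # stand-in for a long text
syntax_bits = (collision_entropy(consonants_of(word_list))
               - collision_entropy(consonants_of(long_text)))
```

For a uniform distribution the collision entropy equals the Shannon entropy (e.g. four equally frequent symbols give exactly 2 bits); skewed frequencies pull it lower.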
Given that the Croatian language has 26 consonants, the upper bound of the entropy of the morphology, which does not matter when dealing with ancient toponyms, can be estimated as log2(26*26)-1.623-2*0.107-5.992=1.572 bits per pair of consonants.

So, to estimate the p-value of the pattern that many names of rivers in Croatia begin with the consonants 'k' and 'r' (Karašica, Krka, Korana, Krbavica, Krapina and Kravarščica), I did some birthday calculations: once with the simulated entropy of the phonology set to 1.623 bits per consonant pair, and once with it set to 1.623+1.572=3.195 bits per consonant pair. In both calculations, I assumed that there are 100 different river names in Croatia. The former gave a probability of 1/300 that the k-r pattern occurs by chance, and the latter gave 1/17, so the p-value of the k-r pattern is somewhere between 1/300 and 1/17.

I therefore concluded that the simplest explanation is that the river names Karašica, Krka, Korana, Krbavica, Krapina and Kravarščica are related and all come from the Indo-European root *kjers, meaning "horse" (in the Germanic languages) or "to run" (in the Celtic and Italic languages). I think the Illyrian word for "flow" came from that root and was *karr or *kurr, the difference between the vowels 'a' and 'u' perhaps being dialectal variation (compare the attested Illyrian toponyms Mursa and Marsonia, almost certainly from the same root). Furthermore, based on the historical phonology of the Croatian language, I reconstructed the Illyrian name for Karašica as either *Kurrurrissia or *Kurrirrissia, and the Illyrian name for Krapina as either *Karpona or *Kurrippuppona, with a preference for *Karpona. Do those arguments sound compelling to you?
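A birthday-style calculation like the one described can be sketched as a Monte Carlo simulation. Note the heavy simplifications: the initial consonant pair of each name is drawn uniformly from an effective alphabet of 2**entropy_bits pairs, and the entropy value in the example call is purely illustrative; this sketch is not the post's exact procedure and does not reproduce the 1/300 and 1/17 figures.

```python
import random

def p_some_pair_repeats(n_names, entropy_bits, k, trials=5_000, seed=0):
    """Monte Carlo birthday calculation: probability that, among n_names
    names whose initial consonant pair is drawn uniformly at random from
    an effective alphabet of 2**entropy_bits pairs, some pair occurs at
    least k times.  Uniformity over the effective alphabet is a
    simplifying assumption."""
    alphabet = max(2, round(2 ** entropy_bits))
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = [0] * alphabet
        for _ in range(n_names):
            counts[rng.randrange(alphabet)] += 1
        if max(counts) >= k:
            hits += 1
    return hits / trials

# Illustrative call: 100 river names, the same pair appearing 6 or more
# times, with a hypothetical phonology entropy of 6.0 bits per pair.
p = p_some_pair_repeats(n_names=100, entropy_bits=6.0, k=6)
```

As a sanity check, the probability goes to 1 when the effective alphabet is tiny and toward 0 when it is huge, which is the direction the post's two bounds move in.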

I understand that I probably should have asked this question before publishing that paper in two journals, but I guess that now is better than never.

5. ### FlatAssembler (Registered Member)
I am glad this forum has not died in the meantime, but I was hoping for a more meaningful response.

7. ### DaveC426913 (Valued Senior Member)
Hey, at least I tried to read it...

8. ### Tiassa, "Let us not launch the boat ..." (Valued Senior Member)
Mod Hat — Closure

This thread will probably find sufficient discussion at Stack Exchange, Github, Atheist Forums, and anywhere else you might have posted it.

9. ### Tiassa, "Let us not launch the boat ..." (Valued Senior Member)
Mod Hat — Reopening

Per explicit request, this thread is re-opened for the merit of being science-y.

10. ### FlatAssembler (Registered Member)
Well, yes, it definitely looks science-y. Most of the people on forums about linguistics consider it pseudoscientific, and I am not sure why. It seems as if, to people in the soft sciences, terms like "collision entropy" or even "p-value" sound like pseudoscientific buzzwords, which is very unfortunate.

11. ### DaveC426913 (Valued Senior Member)
Ah. You are at the crossroads of linguistics and information theory, and no bus stop in sight.

12. ### FlatAssembler (Registered Member)
What does "no bus stop in sight" mean?

13. ### DaveC426913 (Valued Senior Member)
I mean I think you're alone here.

Not a lot of sci-fi peeps into info theory.

14. ### FlatAssembler (Registered Member)
Why do you think my work is "science fiction"?

15. ### DaveC426913 (Valued Senior Member)
Oops. Sorry. That was a typo.* It was supposed to be "Sci-Fo" - as in: members on this site.
I simply meant I doubt there's anyone here who can provide competent feedback.

*(Man, I'm battin' a thousand for faux pas...)

16. ### FlatAssembler (Registered Member)
Well, I don't know much about information theory. I got a C in it at the university.

17. ### DaveC426913 (Valued Senior Member)
I ... that ...

This is all about information theory. Shannon entropy, collision entropy, information bits per character, etc. are all core to the field.
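For reference, the two entropies mentioned differ only in how they weight the probabilities; a minimal sketch of both definitions:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H1 = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def collision_entropy(probs):
    """Renyi order-2 (collision) entropy H2 = -log2(sum of p**2), in bits."""
    return -math.log2(sum(p * p for p in probs))

# The two coincide for a uniform distribution, and H2 < H1 otherwise,
# so collision entropy penalizes skewed symbol frequencies more.
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
```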

18. ### FlatAssembler (Registered Member)
I am quite sure we didn't mention collision entropy in our information theory classes.

19. ### DaveC426913 (Valued Senior Member)
Well you did say you only got a C, so...