Junk phonemes in Proto-South-Cushitic, and some possible fixes

Posted on Mon 2023-10-30 by sansdomino

Followup on my previous overview of comparative Cushitic: a slightly more involved look at Ehret’s Proto-South-Cushitic from 1980, and some readily observable issues in it. To reiterate slightly, his view of South Cushitic includes four basic units:

West Rift, a generally-accepted cluster comprising three or four languages (Iraqw–Gorowa, Alagwa, Burunge);
East Rift = two recently extinct-or-moribund languages (Kwʼadza and Aasax / Asa);
Ma’a, still treated by Ehret as a Bantuized Cushitic language;
Dahalo.

There’s a couple easily observable typological features that all four share, maybe most prominently

the presence of labialized consonants; in most just dorsal consonants like /kʷ gʷ qʷ ŋʷ/, in Dahalo also a couple labialized coronals like /dʷ ɬʷ/;
the presence of lateral obstruents; fricative /ɬ/ in all, also an ejective affricate /tɬʼ/ in most with the exception of Aasax and Ma’a, in some descriptions of Dahalo further also voiced /dɮ/ and/or palatals /cʎ̥ʼ ʎ̥/.

These are not a priori guaranteed to be innovative though — those reconstructed by Ehret for PSC he indeed goes on to later treat as Proto-Cushitic (and even Proto-Afrasian) archaisms, and his argument for the genealogical unity of South Cushitic is more involved, for one part of which see later. It might also seem premature for him to have focused on the “full” South Cushitic instead of just the clear Rift subgroup, but that’s what we do have and can therefore review.

Newer research has already put quite a bit more work into the comparison of the still more or less thriving West Rift languages (Kießling & Mous 2003, The Lexical Reconstruction of Proto-West Rift), and it seems like combining this with Ehret’s work could make a productive project. As noted in my previous post, there’s been also a fair bit of debate in the literature on what to do with Dahalo. Most of the discussion that I’ve seen, though, is unfortunately sort of typology-oriented and relies on methods like looking where cognates might be found and which of them look the most surface-similar to Dahalo. But even if we suppose the language is e.g. ultimately instead East Cushitic, there is no rule saying that sometimes e.g. a proper Proto-Cushitic etymon could not have survived just in, or mainly in, Dahalo + Rift, instead of being something like a Rift loanword in Dahalo. More detailed work on its historical phonology might be able to sometimes make this distinction, and Ehret’s proposals on this should not be wholly ignored. That Dahalo’s consonant system is one of the most “kitchensinky” on the planet (it has a little bit of almost anything you could ask for: clicks, ejectives, implosives, prenasalized consonants, a dental/alveolar distinction…) should surely also help: clearly it has been absorbing loanword phonemes for a while now, and then perhaps not only from unrelated Bantu or extinct Paleoafrican languages, but also from other branches of Cushitic? Ehret’s work already leaves a few clear openings for this kind of a hypothesis, I think. More on this further along.

The situation of research on Kwʼadza, Aasax and Ma’a looks much less satisfying. As for East Rift, R. Kießling, one of the more active West Rift researchers, seems to have deemed their closest eastern relatives not worth looking into, with claims appearing in overview works such as “The position of Qwadza and Asax is dubious, since there is not enough data (and probably never will be) to prove that they belong to a different subbranch within Southern Cushitic”. ^[1] I have not found substantiation for this claim of “not enough data” (how would one prove such a negative anyway?), and that Ehret already has given in his work reasons to think that they form a distinct East Rift branch looks to be simply swept under the carpet. Any idea that a language could not be even classified if it is extinct and its available documentation is short of state-of-the-art modern linguistic methodology is, of course, absurd. Historical linguistics can demonstrate just fine that languages such as Oscan or Umbrian are not merely Indo-European, they’re indeed Italic and further form a separate subgroup of it in contrast to the well-attested Latin. This is no folly or privilege of just Indo-Europeanists either, the same has been done again just fine e.g. with plenty of extinct poorly attested Semitic languages (Ammonite, Edomite, Samalic, Ugaritic, all of Ṣayhadic / Old South Arabian…), further all sorts of extinct poorly attested Algonquian or Tupian or Uto-Aztecan languages, etc etc.

Even single wordlists can yield fair evidence for a detailed classification, if a language is not too far removed from well-attested relatives, and can be thereby linked with the framework of historical phonology, lexicon and, perhaps, morphology that they allow setting up. For a Uralic example, consider Yurats (and I could cite also plenty of cases discussing the detailed dialectological positions of Finnic, Samic, Mordvinic, Mari, Mansi etc. varieties known again only from single wordlists). This is more or less what Ehret does too, even if the work is kind of buried within his corpus of South Cushitic etymologies and his overarching South Cushitic phonological reconstruction. Clearly missing though is any synthesis of the different sources of Kwʼadza and Aasax; the former known from five or six primary collections, the latter from three, many of both unpublished. Work on this might be desirable also for helping with steering people like Kießling away from outright denying the studyability of East Rift. ^[2] And really not even any more humble general description of either variety seems to have been published at all! Ehret, too, is mostly content to simply assert overall phonological inventories, without commenting much on the primary sources. This might be fair in the case of Kwʼadza, since he has himself conducted fieldwork with its last speakers (and there seems to be an unpublished manuscript from this that’s cited in various later work; even on Wikipedia amusingly enough), but less so with Aasax. Ehret passingly notes e.g. having converted the last field records from 1974 from phonetic into phonological transcription, but one would like to know some details on this too.

It might be less clear if East Rift is really entirely a sister group of West Rift, or just a divergent member, or even, some kind of an areal within it. At least one of the more distinctive common West Rift innovations that Ehret proposes, the shift of the ejectives *kʼ, *kʼʷ to uvulars *q⁽ʼ⁾, *q⁽ʼ⁾ʷ, is alas trivial within Cushitic, appearing in several other groups or languages including Agaw, Somaloid, Konsoid (here as an implosive [ʛ]!); and maybe indicating that some stage of Common Cushitic did not quite have *kʼ, but rather something slightly different such as an also-pharyngealized *kˤʼ. ^[3] The same holds also for one of the more immediately obvious common East Rift innovations, the merger of pharyngeals *ħ *ʕ with glottals *h *ʔ respectively, which is again attested in a large number of other basic groups, e.g. Agaw, Oromoid, Highland East, and per Ehret indeed also in Ma’a. But also many minor conditional developments are posited for both West Rift and East Rift. These would require actual review rather than blanket dismissal.

Several yet further differences between WR and ER involve weakly attested sound correspondences that Ehret simply reconstructs as additional Proto-Rift segments. Some do involve specific attestable segments in ER (e.g. WR *d ~ ER prenasalized *nd < Ehret’s Proto-Rift dental *d̪, oddly enough); some others, just “crossed” correspondences between more basic segments. This is where we start getting into clearly dubious territory. These segments do not really flesh out any highly sensible phonological subsystem. Some of them could be fitted into “empty slots” seen in the West Rift and East Rift consonant systems, but only with further assumptions about their development: e.g. a correspondence WR *b ~ ER *p is reconstructed by Ehret as Proto-Rift *pʼ, ^[4] even though there are no other examples of either ejective voicing in WR or ejective devoicing in ER.

Still, before continuing this thread, a few words also on Ma’a. For this variety we already have clear criticism of Ehret’s position of treating it as a third basic branch of South Cushitic: Mous 1996, “Was there ever a Southern Cushitic Language (Pre-)Ma’a?” The most basic argument, which I take no issue with, is that attested Ma’a should not be itself even treated as a Cushitic language, but merely a largely-Cushitic lexical register of a Bantu language otherwise known as Mbugu. This then opens the option that Ma’a might not originate by language shift from some highly distinct Cushitic variety, but rather, from adoption of Cushitic vocabulary from at least two sources, in his view one of them probably a member of West Rift, the other more closely resembling Oromo. So far, so good; Mous even admits that language shift from Cushitic still remains on the table (though he thinks this pre-Ma’a to have been more probably East Cushitic), which to me at least would still sound like a compelling reason for why modern Ma’a-as-a-register arose at all. However, other problems remain. As in research on Dahalo, Mous too seems to deem various lexemes to be either “West Rift” or “East Cushitic” mainly by their etymological distribution, without detailed attention to comparative phonology. For a simple example, West Rift /ħ ʕ/ correspond with Ma’a /h ʔ/. While this could arise as a sound substitution by Bantu speakers unfamiliar with pharyngeals (but would they not have been also unfamiliar with the glottal stop at least?), it could also indicate borrowing, instead, from East Rift, where as mentioned, loss of pharyngeals appears natively; at least if we admit that things about the East Rift languages are knowable. Even the geographically closest certainly-Cushitic language to Ma’a is indeed Aasax! (None are in direct contact with it.)
For that matter, at least a few broad but specifically Ma’a–Aasax isoglosses seem to be proposed in Ehret’s work too: *x > h and *tsʼ > s in at least some cases, again simple enough to be plausibly just sound substitutions by foreign speakers, but plausibly also real common innovations, especially if we no longer require that these should be reflected everywhere in the Cushitic component of Ma’a.

Furthermore, all this is complicated also by the proposals in literature that Dahalo, too, has ended up with lexicon of both South (≈ Rift) and East Cushitic origin. Cushitic vocabulary in Ma’a may not force the existence of a single substratal pre-Ma’a as an independent South Cushitic branch, but it definitely forces the existence of at least a Cushitic language in contact with older Mbugu; if two different Cushitic lexical strata are accepted, then at least two such contact languages. We must then still ask: what was the internal history of this / these Cushitic varieties? Already the geographic separation of Ma’a and Rift has implications for this. Even if we supposed there was simply an originally Rift variety that wandered further towards the coast along the Pangani river, this does not rule out a possibility that the “East Cushitic” component surfacing today was not borrowed independently into Bantu Ma’a, but rather, already into this Rift variety — just as the proposed two-stratum theory of Dahalo would also require the existence of South / East Cushitic language contacts. (Trivial if Dahalo is really South Cushitic, it’s already right next to southern Somali and Oromo(id) varieties, but less so if it’s supposed to be “just another” East Cushitic branch.) The opposite scenario can be considered as well: a lost East Cushitic language in the area, which had at some earlier point in history absorbed also some Rift influence, before itself contributing a chunk of vocabulary into Ma’a.

Even moreover, the general problem that “East Cushitic” remains without a good definition by shared innovations (and might even contain all of the putatively South Cushitic languages in it) keeps on the table also the option that some of the Ma’a vocabulary is not narrower East Cushitic-isms, as much as archaisms, lost from the attested Rift languages. This holds even if Ma’a is indeed analyzed to contain vocabulary from two different Cushitic sources! After all, it is not very economical to assume two completely separate Cushitic spreads south into Tanzania, only for one of them to then disappear completely except for a few loanwords into Ma’a. Instead, it would make geographical sense for the “East Cushitic” component to be still really para-Rift, as per the family tree followed by Ehret. Something of this sort is also readily suggested by theories of East African prehistory which posit the Rift languages and maybe Dahalo (if it’s been in Kenya longer than its immediate “East Cushitic” neighbors) to not represent any kind of an outpost of Cushitic, as much as a remnant of a continuous Cushitic belt that would’ve once, before the newer expansions of South and East Nilotic and Northeast Bantu, stretched all across the areas of modern Kenya and northern Tanzania.

All this leaves a large number of “moving parts” available for any reanalysis of the historical phonology of Ma’a. As we will see, various reanalyses are probably required regardless; but it also seems to me Mous’ idea of mixture of basically modern West Rift and modern East Cushitic is too simplistic and, above all, geographically implausible in an environment where nothing West Rift nor classically East Cushitic has been attested. Prehistory hides many lost languages, and there is nothing a priori implausible in proposing one where evidence so suggests.

As I’ve mentioned recently on Twxttxr, any review of Ehret’s phonological scheme of South Cushitic should probably begin from the reconstruction of Proto-West Rift. There is fairly good overall agreement between Ehret’s reconstruction and the later work of Kießling & Mous (henceforth K&M), to be expected since also the modern languages retain the makeup of the system almost intact. PWR comes out with a reasonably distinctive system containing at least:

all six basic stops *p *b *t *d *k *g, plus labiovelars *kʷ *gʷ and uvulars *q *qʷ;
two ejective affricates *tsʼ *tɬʼ, interestingly without voiceless or voiced equivalents;
an almost full system of voiceless fricatives, *f *s *ɬ *x *xʷ;
the full original Cushitic (perhaps already original AA?) laryngeal system, *ħ *ʕ *h *ʔ;
all six basic sonorants *m *n *r *l *w *y, plus palatal and velar nasals, *nʲ *ŋ;
a bog standard Cushitic vowel system *a *e *i *o *u.

K&M add to this the labiovelar nasal *ŋʷ (actually mostly corresponding to Ehret’s *ŋ), vowel length, and, appearing mainly in clear loanwords, postalveolar affricates *č *ǰ. Ehret adds ejective *čʼ, supposedly distinguished from *tsʼ only in older Alagwa records (but see below). As it turns out, comparison already with Ehret’s own Proto-Cushitic reconstruction shows that most of these segments can be easily equated with identical predecessors also there: only *nʲ and K&M’s *č *ǰ seem to be entirely novel. A few call for other notes, but give no reason to doubt their PWR or Proto-Rift existence:

PWR *q, *qʷ: as noted above, clearly from older velar ejectives *kʼ, *kʼʷ.
PWR *tsʼ corresponds most prominently with Ehret’s PC *tʼ, suggesting spontaneous affrication as the original explanation of the phonological asymmetry of /t tsʼ/ without **tʼ **ts in West Rift (or East Rift). The same occurs in the neighboring Sandawe, and a partly similar /t tsʼ ts/ without **tʼ in the neighboring Hadza (both of them “Khoisan” candidate isolates, presumably ancient in the region), suggesting that this has been an areal innovation, arising on-site in the Rift Valley, i.e. at least not at any extremely early time during the Cushitic expansion southwards.
PWR *tɬʼ: various distinctive correspondences identified by Ehret, enough of them that there was probably a distinct PC precedent, even if not necessarily a lateral obstruent (I’ve heard of some recent work suggesting secondary lateralization of earlier palatals).
PWR *x, *xʷ supposedly correspond with both *x, *xʷ and *ɣ, *ɣʷ in Agaw, but also stops in various other parts of Cushitic (most consistently in Beja). Probably these are real inherited segments too, but I would wonder about options like reconstructing “old” PC uvulars instead, later then either fricativized or merged with velars.
The labialization contrast usually goes with /o/, /u/ vocalism elsewhere in Cushitic and might be secondary, especially if East Cushitic does not hold up as a subgroup; Ehret posits its most prominent innovation to be *Cʷa > *Co, *Cʷ > *C elsewhere, but perhaps this is archaic rather than innovative. Labiovelars occur also in Beja and Agaw, but they might have been independently innovated, especially since neither is an especially old group by itself. Same might go for labiovelars in Dahalo and Ma’a (if we don’t think they simply get all their “South Cushitisms” thru loanwords from Rift proper). Clearly needs further research though.
PWR *p: compared with /p/ also in East Rift, Ma’a and Dahalo, probably correctly. From the rest of Cushitic, Ehret however mostly finds comparanda with /b/. Already on general typological grounds I suspect these are mainly dubious, and that original Proto-Cushitic *p was instead shifted to *f early and just about everywhere, leaving a **p gap in most of the Cushitic languages. A new *p then would have arisen at some point in the development towards Proto-Rift. If this is per hypothesis mostly newer areal vocabulary, appearance of /p/ also in Ma’a and Dahalo probably won’t suffice as a defining PSC feature though.

After this things start getting worse. Already the bare numbers suggest bloat in Ehret’s deeper phonological reconstructions: his system of 29 consonants in Proto-West Rift expands to 33 in Proto-East Rift, 36 in Proto-Rift, and finally balloons to 49 in Proto-South Cushitic. A priori this might not be a completely terrible amount, when compared with an impressive 60+ in Dahalo, where many of these find unique reflexes… though then just 30-ish in any other South Cushitic variety. More alarming is that even Ehret himself finds no Proto-Cushitic source for most of these additional segments. This could be all still OK, maybe open to various kinds of reanalyses, if these reconstructions were based on good robust data. Alas, they are not. Most are based on few etymologies, often with semantic stretches or other irregularities. Often also with major distributional gaps, such that an asserted overall correspondence pattern really comprises e.g. individual Rift ~ Ma’a and Rift ~ Dahalo correspondences lumped together (or perhaps even weaker correspondences like Kwʼadza ~ Dahalo, West Rift ~ Ma’a, etc.), with no or almost no evidence of the implied Ma’a ~ Dahalo correspondences even existing. This strategy of “farming” or “lumping” rarer proto-segments from disjoint correspondences probably needs a name for it, I keep seeing it in many long-range or otherwise dubious reconstruction proposals; e.g. it’s all over the place in versions of Nostratic. (Cf. also footnote 4.)
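The “lumping” pattern just described can be made concrete with a small check. What follows is only a sketch under loud assumptions: the toy data is hypothetical (not Ehret’s actual etymologies) and `missing_pairs` is a helper name invented for this illustration. It records each cognate set as the set of branches attesting it, and reports which branch pairs never co-occur in any single set, even though a lumped correspondence pattern would imply all of them.

```python
from itertools import combinations

def missing_pairs(cognate_sets, branches):
    """Return branch pairs that never co-occur in any single cognate set."""
    attested = set()
    for witnesses in cognate_sets:
        # Every pair of branches inside one cognate set is directly attested.
        attested.update(combinations(sorted(witnesses), 2))
    implied = set(combinations(sorted(branches), 2))
    return sorted(implied - attested)

# Hypothetical toy data: one proto-segment "supported" by three etymologies.
toy_sets = [
    {"Rift", "Ma'a"},
    {"Rift", "Dahalo"},
    {"Rift", "Ma'a"},
]
branches = {"Rift", "Ma'a", "Dahalo"}

# Lists the branch pairs for which no direct comparison exists.
print(missing_pairs(toy_sets, branches))
```

In the toy case the only pair reported is Dahalo with Ma’a: every etymology runs through Rift, so the three-way correspondence is never observed directly.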

The most distinctively poor set of Ehret’s extra segments are prenasalized stops and affricates in word-initial position. (Word-medial cases do not look distinguishable from plain old nasal + stop clusters, well-attested thruout Cushitic. I’ll also skip over *nɬ, which does not yield anything prenasalized and which Ehret in his later Proto-Cushitic work readjusts to *dɮ.) These are a regular but small part of the phonology of Dahalo, which should be probably assumed to mainly originate as intrusive vocabulary, maybe some also from irregular nasalization of former plain stops. And almost all of the South Cushitic etymologies Ehret finds for them are weaksauce. To roll out the data as he cites it (abbreviations: I = Iraqw, B = Burunge, A = Alagwa, Q = Kwʼadza, S = Aasax, M = Ma’a, D = Dahalo; transcription should be mostly obvious but I maintain Ehret’s ṯ ḏ for dental stops in D):

*mpats- ‘to be strewn’: A pasit- ‘to scatter’, pisari ‘seed’, B pisagariya ‘seed’ ~ D mbàttsì ‘potsherd’. Poor semantics, and also *ts is a suspicious PSC segment.
*mparoxʷ- ‘egret’: Q palaʔeto ‘crested crane’ (-l- < *-r- is regular) ~ D mbórogo ‘young egret’. Poor distribution, uncompelling semantics (mixing different species’ names is always an easy way of farming junk etymologies), apparently irregular *xʷ > ʔ in Q, and even supposedly regular *xʷ > D g sounds phonetically suspicious.
*mpee- ‘little, mean, scanty, slight’: Q paʔali- ‘narrow’ ~ M -bí ‘to shorten’ ~ D mbííṯ- ‘to scorn’ (→ mbííṯe ‘bad’). Very short CV comparison, semantics at least in D too divergent to put any trust on.
*mpuux- ‘sprout, shoot’: M -buká, -buxá ‘greens’ ~ D mbùùku ‘vine, tendril, creeper’. Poor distribution and semantics.
*mpɨnde-: M -púnde ‘penis’ ~ D mbéne ‘vagina’. Poor distribution, uncompelling semantics (possible, but not strong evidence by itself to believe in the comparison).
*ntaakʷ- ‘small carnivore’: I taweramo, B takoraymo, A tokoraymo (K&M: PWR *takʷerimo) ‘wild dog’ ~ D nḏááge ‘aardvark’. Poor semantics, irregular voicing and delabialization of *kʷ in D.
*nteekʼʷ- ‘incisor’: I taqesamo ‘jaw’ ~ D nḏéégi ‘canine tooth’. Poor distribution, uncompelling semantics, irregular voicing of *kʼ⁽ʷ⁾ in D (and why is the labialization reconstructed at all?)
*nʈarag- ‘Orthoptera species’: Q tsʼelemayo ‘cricket’ ~ D nḏàràgì ‘mantis’. Poor distribution and semantics, ad hoc metathesis of a presumed suffixal *-m- in Q.
*nʈaŋa ‘beestings’: Q tsʼangayiko ‘fresh milk’ ~ M dáŋá ‘beestings’. Limited distribution; semantics fine; no direct evidence of prenasalization though.
*nʈif- ‘food stirring stick’: I tsʼifraŋ, B čʼufara, A tsʼufara, S šeferank ‘tongue’ (K&M: PWR *tsʼufiraaŋʷ; *tsʼ- > š- in Aasax is regular) ~ D nḏufuro [‘food stirring stick’?]. Uncompelling semantics, even if this is likely an innovative lexeme in Rift compared to the rest of Cushitic.
*nʈoh- ‘to clear the throat’, *nʈoh-aala ‘phlegm’: B čʼohod- ‘to cough’ ~ Q tsʼalahet- ‘to curse’ ~ D nḏwààlà ‘mucus’. Uncompelling semantics in Q (requires also metathesis); irregular contraction *-ohaa- > -waa- in D; plausibly simply a recent onomatopoetic verb in B.
*nʈuu- ‘hawk’: S šuʔununu ~ D nḏúúma. Poor distribution; very short comparison, requires ad hoc morphology. Semantics apparently fine.
*ntsaaw- ‘reeds’, *ntsoomari ‘straw’: I tsʼawo ‘reeds’ (K&M: also B A, PWR *tsʼaaboo ‘sisal, bushy end’) ~ Q tsʼemaliko ‘straw’ ~ M izumari ‘flute’. Vocalism problems in Q, uncompelling semantics in M, no direct evidence of prenasalization.
*ntsew- ‘small bird sp.’: M -zewe ‘carmine bee-eater’ ~ D ndzòmò ‘barbet sp.’ Poor distribution, uncompelling semantics.
*ntsi- ‘spleen’: I tsʼi-daʕa ‘heartburn’ ~ Q tsʼiyale ‘spleen’ ~ D ndzóne ‘spleen’. Very short comparison with ad hoc morphology, irregular vocalism in D, poor semantics in I.
*ntsom- ‘to shout’: Q tsʼamaʔato ‘happiness, joy’ ~ M -zo ‘to cry’. Poor distribution and semantics.
*ntsoom- ‘kind of bee’: Q tsʼamayituko ‘bee’ ~ D ndzóóme ‘honey of ḿpeele bee’. Poor distribution, Q morphology unexplained.
*ntʲaduʕ- ‘bog’: M -darú ‘swamp’ ~ D ndodóʕo ‘mud’. Poor distribution, uncompelling semantics.
*ntʲodi- ‘grasp, grip’: M -dóri ‘to take; marry (a wife)’, -doríwe ‘to be married’ ~ D ndódi ‘thumb’. Poor distribution and semantics.
*ntʲooʕ- ‘gravelly soil’: B čʼiʕaramo ‘pebble’ ~ Q čʼaʔamuko ‘small streambed’ ~ D ndóóʕo ‘sand’. No immediate major flaws, still not an obvious etymology either though.
*ŋkara-: Q kalaʔeto ‘stork’ ~ D ŋgára ‘crested crane’. Poor distribution, uncompelling semantics.
*ŋkexine- ‘eyebrow’: I gine ~ D ŋgikine. Looks decent except for loss of *-x- (or indeed maybe of *-k-) in I. K&M have instead two different PWR etyma for ‘eyebrow, eyelid’, neither certain to be old inheritance though.
*ŋko- ‘flea’: Q koyimaye ~ D ŋgúnewe ‘spirillum tick’. Poor distribution, uncompelling semantics, short comparison, ad hoc morphology. Maybe the worst etymology here, despite stiff competition!
*ŋkol- ‘steer’: I B A karama ‘bull, steer’, Q kolawatu ‘bull’ (K&M: PWR *karaama) ~ D ŋgólome ‘bull buffalo’. Would look decent except for a l/r mismatch; also with parallels in East Cushitic that again have just plain *k- and also *-r- rather than *-l-, e.g. Borana (Oromoid) korma ‘bull’.
*ŋkum- ‘fog’: M -gónónó ~ D ŋgúmine ‘raincloud’. Poor distribution, ad hoc *u > o and *m > n required in M; semantics fine.
*ŋkʷaa- ‘rainbow’: B ilakʷekʷiya ~ D ŋgòòwi. Extensive ad hoc morphology (including reduplication) required in B; semantics fine.
*ŋkʷaal- ‘to impoverish, leave poor’: I kʷalaʔo, B A kʷaʔalitoʔo, Q kalaʔay ‘widow’ (K&M: PWR *kʷaʔalaʔoo) ~ M -gwa ‘to steal’, -gwaló ‘thief’. Poor semantics, and Ehret’s assumption of original *l plus metathesis in B A might not hold. No direct evidence for prenasalization.

A few of these maybe might still be at least related areal vocabulary, but it should be clear that the low number of comparanda, both overall and especially in the otherwise well-documented West Rift, reliance on semantically off-field comparisons, and a common need for additional morphological or phonological assumptions, does not add up to a corpus supporting the existence of an already Proto-South Cushitic consonant series. E.g. the example of ‘bull’ could instead suggest a borrowing that is ultimately of Cushitic origin but was passed through some other intermediates before getting to Dahalo. Ehret also has one other similar case just in Dahalo and so not formally reconstructible for PSC: D ŋgaasið- ~ Somali kas- ‘to explain’ (besides prenasalization, vowel length also not matching; also s ~ s cannot be native).
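The distributional side of this can be tallied mechanically. The sketch below encodes, for each of the 27 reconstructions listed above, which languages are cited as witnesses (using the same one-letter abbreviations), then collapses them into the four basic units. The encoding is my own reading of the list: cognates that only K&M add are not counted, and the helper name `branches` is made up for the illustration.

```python
from collections import Counter

WEST_RIFT, EAST_RIFT = set("IBA"), set("QS")

# Witness languages per reconstruction, as cited in the list above.
witnesses = {
    "*mpats-": "ABD", "*mparoxʷ-": "QD", "*mpee-": "QMD", "*mpuux-": "MD",
    "*mpɨnde-": "MD", "*ntaakʷ-": "IBAD", "*nteekʼʷ-": "ID", "*nʈarag-": "QD",
    "*nʈaŋa": "QM", "*nʈif-": "IBASD", "*nʈoh-": "BQD", "*nʈuu-": "SD",
    "*ntsaaw-": "IQM", "*ntsew-": "MD", "*ntsi-": "IQD", "*ntsom-": "QM",
    "*ntsoom-": "QD", "*ntʲaduʕ-": "MD", "*ntʲodi-": "MD", "*ntʲooʕ-": "BQD",
    "*ŋkara-": "QD", "*ŋkexine-": "ID", "*ŋko-": "QD", "*ŋkol-": "IBAQD",
    "*ŋkum-": "MD", "*ŋkʷaa-": "BD", "*ŋkʷaal-": "IBAQM",
}

def branches(langs):
    """Collapse language codes into the four basic units (WR, ER, M, D)."""
    units = set()
    for code in langs:
        if code in WEST_RIFT:
            units.add("WR")
        elif code in EAST_RIFT:
            units.add("ER")
        else:
            units.add(code)  # Ma'a and Dahalo are units of their own
    return frozenset(units)

by_branch = {root: branches(langs) for root, langs in witnesses.items()}
size_counts = Counter(len(units) for units in by_branch.values())
with_west_rift = sum("WR" in units for units in by_branch.values())
without_dahalo = [root for root, units in by_branch.items() if "D" not in units]

print(len(witnesses), dict(size_counts), with_west_rift, without_dahalo)
```

On this encoding, 19 of the 27 comparisons link only two of the four units and none link all four; only 12 have any West Rift member; and four have no Dahalo member at all.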

There should be also a general suspicion of anything with prenasalized stops more likely coming from Bantu. Looking over Mauro Tosco’s 1991 A Grammatical Sketch of Dahalo, however (which includes a glossary with some loanword notes), fairly few cases of that have been identified: from Swahili there’s mbona ‘why’, nḏuugo ‘kinsman’, ŋgúúfu ‘strong’; from Northern Swahili nḏani ‘inside’, nḏigad- ‘to bury’, nḏoo ‘come!’ and tentatively nḏupa ‘bottle’, ŋgúúko ‘cock’ (~ NSw thupa, khuku). ^[5] Nothing from other nearby Bantu languages like Pokomo, but maybe that simply has not been (was not?) studied yet. But note also Dahalo’s nasalized dental click nǀ, which might also count as “prenasalized” phonologically, but almost surely can’t originate as is from anything Bantu.

It would be possible to go over similar problems in base etymological data also for some of Ehret’s other more poorly attested segments. As my tally of prenasalized consonants already shows, he in particular adds a few additional place of articulation series for PSC, which aren’t really reflected as such anywhere: retroflexes and palatalized dentals, most of them scantily attested and probably rejectable entirely. He proposes PC origin for two of them though: the voiced retroflex *ɖ and the palatalized ejective *tʲʼ. These show different issues, maybe worth discussing in more detail.

If taken at face value, Ehret’s PSC *tʲʼ probably should be first of all reconstructed instead as a postalveolar affricate *čʼ; since that is both his alleged PC source and its Proto-Rift reflex. Also the asserted Ma’a reflex is č, but most data for that are again poor etymologies, e.g. M -čá ‘to be crafty’ is compared with Q salimuko ‘coward’, D tʼar- ‘to practice witchcraft’; M hečéri ‘yet, not yet’ is first analyzed to have a prefix he- continuing a fossilized demonstrative, then compared with Q sel- ‘to straighten’, to gain a PSC root supposedly having meant ‘to make ready, prepare, put in order’. A few comparisons between Rift and Dahalo look better, e.g. *tʲʼatʼ- ‘soil, earth’ > B čʼečʼeʔiya, A tsʼatsʼaʔi ‘dust’ (K&M: PWR *tsʼatsʼaiʔya) ~ Q saʔamuko ‘earth’ ~ D tʼattʼe ‘mud’. But as the Dahalo reflex is just the alveolar ejective tʼ, it also turns out that most of the data would fit simply as cases of Ehret’s plain *tʼ! He relies most often on Kwʼadza data in making the distinction, where supposedly *tʲʼ > s (as in all three examples above), but these cases generally leave again room for doubts about their validity, e.g. irregular *tʼ > ʔ in ‘earth’ above. A few also have Ehret’s PWR *čʼ on the grounds of čʼ in older Alagwa data, but these are firstly few, and secondly, they mostly occur in a palatal environment (e.g. čʼiraʔa ‘bird’; K&M’s PWR *tsʼiraʔa) where we might suspect this was actually just a dialectalism or a lost allophonic feature. Hard to tell though from the current presentation. Ehret still gives also words like PSC *tʼah- ‘to be pregnant’ >> A tsʼihay ‘pregnancy’, but does not state if this comes from older or newer Alagwa data. Either way, Ehret’s supposed distinction *tʼ | *tʲʼ looks like it has been really “farmed” together from disparate sources, most prominently

older Alagwa tsʼ | čʼ;
Kwʼadza tsʼ | s;
Ma’a s | č;

that actually do not correlate well with each other. I’ve already tentatively suggested that the supposed Ma’a cognates with č are just wrong, and that older Alagwa čʼ might be secondary. For Kwʼadza I’m not sure if either of these approaches works entirely. Not all data with s looks easily dismissable, e.g. PWR (K&M) *tsʼaaʔas- ‘to shine, shed light on’ (probably with the common Cushitic causative *-as- suffix) ~ Q saʔ- ‘to burn’; PWR *tsʼitsʼaʕiya ~ Q sasaʔamo ‘star’. ^[6] A third option though might be internal loaning: in Aasax, the regular word-initial reflex of Proto-Rift *tsʼ is a sibilant š, and this would probably be reflected as s (Q has itself no š) if some words had been borrowed into Kwʼadza from an Aasax-like variety. Ehret’s later proposed distinct Proto-Cushitic sources also do not look very strongly established, but that would be more of a tangent than I want to get into in detail; though again, the same general types of problems recur as in Ehret’s PSC reconstruction.

As the last stretch for this blog post, Ehret’s PSC *ɖ proves to be relevant in several ways. Word-initially, this is proposed to be distinct from plain *d in that Dahalo would have an implosive ɗ for the former, a dental plosive ḏ for the latter; Ma’a would have mostly ɗ- from both, but sometimes also some *ɖ- > z-. A few do look plausible (e.g. I A deʔem-, B Q deʔ-, M -zéʔu ‘to herd’). All of Rift, however, shows just *d- (retained in I B A Q, > ɗ in S). Furthermore, comparison with Proto-Cushitic supposedly shows *ɖ- < *d- versus *d- < *z-. Both correspondences indeed have some decent etymologies for them, e.g. PC *dar- ‘to increase, add to’ > D ɗar- (Ehret only lists other cognates from Agaw and Somali; I’d consider adding also Highland East *darš- ‘to swell’); PC *zab- ‘to grasp’ (in Beja, Somaloid) > D ḏáβa ‘hand’. (Both of these also well represented in Rift languages.)

In his Proto-Cushitic book, Ehret claims that this chain shift of *d and *z would be evidence for the unity of South Cushitic. The fortition *z > *d is distinctive at first sight, but this sound change is again widespread in Cushitic, e.g. Beja, Saho–Afar, Oromoid, Konsoid; it may have started already at an early date, perhaps diffusing to Pre-Proto-Rift already from some neighboring East Cushitic dialect area. Thus, if Rift does not even show the distinction of Ehret’s *ɖ- and *d-, to me this supposed isogloss seems worthless. A real chainshift can be only really set up for Dahalo, and also without any shunts in place of articulation. The language probably has *d- >> ɗ- simply as a part of an areal innovation of implosion of voiced stops, appearing already also in Aasax (only ɓ- ɗ-) and Ma’a (all of ɓ- ɗ- ɠ- ɠʷ-), as well as in the local Bantu languages, including Swahili. Initial P(S)C *b-, too, gives Dahalo ɓ-. ^[7] The most likely reason for why Proto-Cushitic *z was not affected would be that it still remained a fricative at this point, if maybe already [ð] — which still exists in Dahalo as the intervocalic allophone of /d̪/ — and only fortiting to a stop ḏ- later: thus, in particular, independently of the similar change in Rift.
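The relative chronology argued for here can be sketched as ordered rewrite rules on the word-initial consonant. This is a deliberately crude model, with assumptions of my own: each change is an unconditioned (source, target) pair, the intermediate [ð] stage is skipped, and the “counterfactual” ordering is hypothetical, included only to show what would go wrong if fortition to a plain stop had come first.

```python
def reflex(segment, ordered_changes):
    """Apply word-initial sound changes in chronological order."""
    for source, target in ordered_changes:
        if segment == source:
            segment = target
    return segment

# Relative chronologies as (source, target) pairs; a simplification of the text.
dahalo = [("d", "ɗ"), ("z", "ḏ")]          # implosion first, fortition later
rift = [("z", "d")]                        # fortition only: merger with *d
counterfactual = [("z", "d"), ("d", "ɗ")]  # hypothetical: fortition first

for name, changes in [("Dahalo", dahalo), ("Rift", rift),
                      ("counterfactual", counterfactual)]:
    print(name, {proto: reflex(proto, changes) for proto in ("d", "z")})
```

Only the Dahalo ordering keeps the two proto-segments apart; in the Rift model they merge as a plain stop, and in the counterfactual ordering they would merge as an implosive. This is why the fortition has to postdate the implosion in Dahalo.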

Besides implications on classification, another corollary is that if the distinction between Ehret’s *ɖ and *d was in the last common ancestor of Rift and Dahalo rather in manner and not place of articulation, there is little reason to expect the existence of Ehret’s other, more poorly evidenced retroflexes *ʈ, *ʈʼ (or *nʈ, which I believe I’ve already demonstrated above to be spurious).

The Ma’a initial correspondences, then, may have been assigned the wrong way around. It seems to me that we should suspect z- reflexes to be archaisms continuing PC *z- and not *d-. Most data could be in fact swapped around with ease, since Ehret has very little evidence for the correspondence M z- ~ D ɗ-; the only decent-looking case is I daʔ- ‘to penetrate’, A daʕ- ‘to thrust into’, D ɗaʕ- ‘to insert’ ~ M zaʔá ‘inside’ (apparently a native Cushitic root further cognate with Beja da- ‘to enter’). Another example, I daqaw- ‘to go’, D ɗakʷ- ‘to be going’ ~ M -zuxu ‘sandal’, is semantically divergent enough to not be immediately reliable. And if we allow for the existence of a few Rift or para-Rift loanwords in Dahalo, these could be chalked up as examples of that. I can even note from Dahalo the preposition ḏa ‘in’, which could be taken as the real native reflex of PC *za(ʕ)- ‘inside’! There also might be other explicit evidence that Ma’a z- originates from PC *z-: the above-mentioned -zéʔu ‘to herd’ looks comparable with Highland East Cushitic *zoh- or *zoʔ- ‘to roam, wander’ (whence Hadiyya doʔ-, Kambaata zoh-, Sidaamo do-). Ehret’s proposed correspondence *d- > M ɗ- ~ D ḏ- can be also dealt with similarly. Most data is weak, but at least Q daʔas- ‘to scoop into fingers (e.g. porridge)’ ~ M -ɗaʔá ‘to pick, pluck’ ~ D ḏaʕaað- ‘to catch hold of’ looks like a decent comparison; we could however suggest that this is simply a Rift-type word in Ma’a. The absence of data showing the opposite correspondences, M z- ~ D ḏ- and M D ɗ-, will have to remain a weakness; but not a major one, if we will not claim the two to be relatively close relatives within a South Cushitic group.

Word-medially, Ehret’s *-ɖ- behaves differently. It is now supposed to reflect both of PC *-d- and *-z-; and to yield in Ma’a -ɗ- or -r-, in Dahalo -ɗ- or -ṯṯ- (yes, dental!). In just one environment, at the end of noun stems, *-z- and now also *-s- are supposed to instead give *-d-, yielding Ma’a -r-, Dahalo -ḏ- or -r-. Most of Rift has *-r- for both, with the exception of Burunge, where *-ɖ- > -r-, but *-d- > -d-. Looking up the full data behind any of this would be more work than any survey of word-initial correspondences (though would not need to be done from scratch: Ehret’s PSC lexicon helpfully lists also non-initial occurrences of each consonant). Given from earlier examples that Ehret is maybe particularly prone to bad etymologizing with Ma’a, and that variable reflexes there could have a complex background involving Cushitic-internal loaning, I will simply ignore it for now to save my efforts. Exclusive Rift ~ Dahalo vocabulary will also not be the most interesting here. However, if Ehret’s argument for the unity of South Cushitic from alleged common development of *d- and *z- does not hold, will word-medial evidence work any better? Again there is no evidence for a common shift in place of articulation at least. Cases of Dahalo -ḏ- [-ð-] from earlier *-z- would not need to have ever gone thru a stop at all; cases of -ɗ- might be again independent development from plain *-d-. The supposed development of *-s- to *-z- is a bit less trivial — maybe not in a general typological light, but any medial voicing of fricatives seems rare across Cushitic. And there’s again indeed evidence decent enough to think that this is a real correspondence, e.g. Somali gus ‘penis’ ~ PWR (K&M) *gu(d)doo ‘testicles’ ~ D giḏḏa ‘semen’; Somali ħaas ‘wife; family’ ~ PWR (K&M) *hadee ‘wife’ (for which Ehret reports instead reflexes with ħ-).
This is complicated, however, by Ehret finding that in Dahalo also verb-stem-final *-s- is voiced to [-ð-] (and *-f- to [-β-]), which changes do not appear in Rift (cf. e.g. the example of Q daʔas- ~ D ḏaʕaað- above). If the conditions of fricative voicing in fact differ, this innovation thus seems to be at most areal in nature, not inherited from common PSC. Also, the cases in noun stems could be even interpreted differently: as really devoicing of PC *z in original word-final position in a few languages like Somali — i.e. not even an innovation at all! Thus, still no particularly clear evidence here to set up a South Cushitic (Rift–Dahalo) group independent of wider East Cushitic or general Cushitic.

Many other issues remain that I’ve not even touched so far (e.g. Ehret’s PSC central vowels *ɨ, *ə which have no individually distinct reflexes in any of the languages). But I hope to have demonstrated that a better understanding of the South Cushitic hypothesis, and the history of its constituent languages, requires 1. actually engaging with Ehret’s PSC and PC reconstructions, preferably to some extent also with the spottily documented Kwʼadza, Aasax and Ma’a; 2. being regardless prepared to throw out plenty of weak etymologies in the process; 3. not taking older literature’s assumptions about the history, reconstruction or existence of “East Cushitic” for granted either. A big wide playing field… but probably not insurmountable.

Postscript. I’ve recalled I have around also one further work from Ehret on overall Cushitic reconstruction: a 2008 article “The primary branches of Cushitic: Seriating the diagnostic sound change rules”. ^[8] He has in this come to accept a few similar conclusions on SC reconstruction as I do here — he retracts the reconstruction of the distinct prenasalized stops (though mostly hangs on to the etymologies, claiming that they instead arise from the reduction of some semantically unspecified prefix *(h)in-), as well as the weaker retroflex and palatalized stops (*ʈ, *ʈʼ, *tʲ, *dʲ); former *ɖ adjusted now to *ɗ and *tʲʼ indeed to *čʼ. He still insists on what I see here as a just-Dahalo chainshift of *d *z to constitute a defining feature of South Cushitic though, with no word on the Rift merger. Also, if the most obvious junk phoneme issues have already been admitted, it might have been a good idea to move to further issues, such as Ehret’s non-ejective affricates *ts *dz *dɮ. All three attested phonemes in Dahalo, but Rift and Ma’a correspondences look more problematic.

His delineation there of East Cushitic looks dubious as well. The proposed defining features include e.g. the rise of a substantial implosive series: *pʼ > *ɓ, *dɮ > *ɗ, *tɬʼ > *ʄ, *ɣ⁽ʷ⁾ > *ɠ⁽ʷ⁾… clearly nonsense, when several putative EC languages / basic subgroups actually have implosive reflexes for at most one or two of these. But this was not a post for reviewing EC reconstruction; that would be a different topic entirely, a bigger one that has had several more people working on it too.

[1] “Some salient features of Southern Cushitic (Common West Rift)”. Unclear to me from the academia.edu page where, or even if, this is published.
[2] Ultimately I suspect Kießling’s position to have been influenced by the occasionally seen overblown claim that “morphology is the only way to classify languages”, some kind of a broken-telephone exaggeration of the fact that morphology can often provide very strong evidence for classification (but is in no way the sole possible type of evidence).
[3] Or even already a uvular *qʼ, which could have later on reverted to a more neutral / less marked *kʼ in languages in closer contact with Omotic or Ethio-Semitic. But this all surely still requires a good areal survey of the whole of Cushitic and environs. I do suspect this is not entirely Proto-Cushitic anyway, but has ultimately spread from Semitic somehow, though this is complicated by the *ḳ > q shift being in modern Semitic limited to Central Semitic (Arabic etc.), absent from both Ethio-Semitic & Modern South Arabian as far as I know. Would it be a plausible or in any way investigable hypothesis that *kʼ ~ *q variation existed earlier in Ethio-Semitic too, and was just later levelled out in favor of the ejective? If contact with EthSem. is hypothesized to have brought about *q > kʼ in modern Bilin (in Agaw), then it is at least conceivable that the same could have happened also in some of the currently-smaller Ethio-Semitic languages; and perhaps not before them having passed uvularization “on to” various more southern Cushitic languages.
— A second, still more speculative idea that I could consider is that perhaps spontaneous kʼ > qʼ is in fact a natural sound change, and is merely blocked in most of the world’s language stocks with ejectives due to the fact that they happen to already possess also distinctive uvulars? This is after all the case almost universally in the Caucasus, quite widely also in western North America, and common even for several more isolated lineages with ejectives, e.g. Aymara, Itelmen, Mayan, Tuu.
[4] Projected also further to PSC and PC, and finally, in his PAA reconstruction, proposed to correspond with Proto-Omotic *pʼ, itself projected from the North Omotic branch. In fact, no actual cases of a correspondence of North Omotic *pʼ ~ West Rift *b ~ East Rift *p seem to exist: Ehret’s book on PAA lists 18 cognate sets with initial *pʼ-, but only five are attested in both Omotic and Cushitic, of these none in South Cushitic. His *pʼ thus really breaks down into disjoint sets of cases rooted in a South Cushitic etymon vs. cases rooted in a North Omotic etymon (and several reconstructed from still more indirect considerations like an Egyptian p ~ Semitic *b correspondence — six cases, two of them with a South Cushitic *pʼ cognate and one with a North Omotic *pʼ cognate).
[5] I don’t know OTTOMH if these comparisons are supposed to suggest earlier *nt, *ŋk in Bantu or the development of prenasalization within Dahalo itself.
[6] Per Ehret a derivative from the same root as the previous; per K&M instead derived from *tsʼaʕ- ‘to appear’, looking more likely since they document -ʕ- and not -ʔ-. Both seem to agree on reduplication.
[7] Ehret fails to recognize even this change as areal, and instead operates with allophonic implosion of word-initial *b- *d- *ɖ- already in PSC + its later reversion in most Rift languages.
[8] From the collection In Hot Pursuit of Language in Prehistory: Essays in the four fields of anthropology, ed. John D. Bengtson.


Prospects in comparative Cushitic

Posted on Mon 2023-10-23 by sansdomino — 3 Comments

Long time no post! Those who have been following me on Tumblr or Twxttxr will know I’ve been recently digging into the history of the Cushitic languages — actually something I’ve wanted to get into for a while now. Here is a small outline of findings so far for the main blog too.

Cushitic is in some ways very similar territory to what I’m used to in Uralic studies: a family of 30–50 languages depending on how closely you’re counting; split into numerous small subgroups; recorded only recently, without long written traditions; thought to be a fairly old stock regardless; with at least one long-standing theory about its deeper relationships, but a loose enough one to be not much help for its own study. Some further similarities continue also in the research history of Cushitic, perhaps above all the fact that reconstruction has mostly not involved full-family comparison, but rather, comparison within major intermediate areals, some of them themselves identified by relatively weak signals. This probably should be expected to give fairly similar problems to what went down earlier on in Uralic studies due to too much reconstruction focus on assumed units like “Finno-Volgaic” that might not even exist after all.

Traditionally, Cushitic has been split in four main chunks: Northern, Central, Eastern and Southern (former “Western Cushitic” by now split off as Omotic, possibly unrelated entirely, as well as possibly itself multiple families). Northern = Beja and Central = Agaw are clean units, the former a single language, the latter a small distinct family — though there might be room to suspect some similarities within it arising not from common descent but from strong common Ethio-Semitic influence. ^[1] Eastern and Southern, however, have no consensus in literature for their extent! The original division was simply by geography: Eastern existing as a contiguous zone in Ethiopia, Eritrea, Djibouti, Somalia and northern Kenya, vs. Southern more scattered across Tanzania and partly Kenya. This is of course a priori about as reliable as claiming that Armenian must belong in Iranic because it’s an immediate neighbor of Persia, while Ossetic must not because it’s all alone in the Caucasus; and no wonder that later on other, quite different hypotheses have been appearing too. There are at least the following:

a South / East divide exists, but Dahalo, the one Kenyan language classically counted in South Cushitic, is actually Eastern (thus M. Tosco; followed by Glottolog);
Southern is altogether a subgroup of Eastern (thus R. Hetzron);
neither full Southern nor full Eastern are valid (roughly thus L. Bender, also Wikipedia in their primary navigation).

An additional problematic concept is “Lowland East Cushitic”, in all definitions including at least four clear small units (Oromoid, Konsoid, Arboroid, Somaloid, maybe further subgroupable e.g. as O+K and A+S) and variously proposed to include up to four more (Saho-Afar, Dullay, Yaaku, Dahalo). Its original definition seems to have been as a rump group in contrast to the clearly distinct Highland East Cushitic subfamily, and thus inheriting almost all doubts we might have about the validity of East Cushitic. Moreover, LEC notably includes Oromo, the largest and most geographically central Cushitic language, already known to have left loanword strata in many of its smaller neighbors. Any surface-similarity or lexical-similarity definitions of LEC and EC should be probably considered dubious until it can be ruled out that this is just due to contact influence from Oromo (in a few cases maybe also: from the almost as large Somali?), or just due to already Proto-Cushitic archaisms, perhaps lost in smaller groups further out. This, again, has a clear parallel in Uralic studies too — the already 19th-century discovery that, once a large stratum of loanwords from Finnish is excluded, the Samic languages are actually not that closely related to Finnic and certainly not regular members of the group, ^[2] followed soon by the observation that ruling out F → S loanwords is necessary also for achieving reliable results in reconstruction.

Paths forward in Cushitic reconstruction would then surely have to either focus on whatever clearly constitutes valid units (which may help with identifying what in them is actually inherited Cushitic and what might be later arealisms); or on working on Cushitic as a whole. The latter would not be an easy task, since much of Cushitic remains at a middling state of documentation ^[3] and also much of the literature is poorly available to me. The former has however some easier openings. At least three monograph-size smaller-branch reconstructions exist already as entry points: Agaw, Highland East, and Southern. (Sizable descriptive works cover a few other branches too, e.g. Dullay; I do not know how much comparison or reconstruction they involve.)

The Agaw data (Appleyard 2012, A Comparative Dictionary of the Agaw Languages) was my first toe-dip into comparative Cushitic, already since five years or so ago. It’s however quite thorough work given the available data, and also such a distinct branch, that doing anything more with it would require either more basic documentation, or more general experience in Cushitic (and probably also: more familiarity with Ethiopian Semitic). A few months ago though I started delving into Highland East (Hudson 1989, Highland East Cushitic Dictionary), and here openings for improvement exist left and right — I’ve already started working on a paper outlining the properties and impact of the obvious Oromo loanword stratum in most of the languages, and I’m also tempted to compile some phonological, morphophonological and lexicological observations eventually. See e.g. my Tumblr sideblog link above for several additional details.

Most recently I’ve also taken a first look at the Southern data (Ehret 1980, The Historical Reconstruction of Southern Cushitic Phonology and Vocabulary). This raises fairly different questions. Christopher Ehret has acquired some notoriety later on for proposing extremely messy reconstructions of Proto-Afrasian and Proto-Nilo-Saharan. It turns out that similar issues come up already here: etymologies with poor semantics, ad hoc morphology, overimaginative phonology (for examples see my social-media-formerly-known-as-Twitter link above). Still, it is clear that the Southern Cushitic languages in fact are all related, be it just by themselves or within Cushitic as a whole, and a reliable reconstruction outline of the clearly valid Rift subgroup could be probably extracted from Ehret’s work without too much extra effort (a partial version I believe has already been done for West Rift). Only two languages stand outside of it — Dahalo, as already noted above, and also Ma’a, a curious Bantu–Cushitic “mixed language” (probably better: a large Cushitic lexical stratum maintained through recent language shift to Bantu). In principle, there is no reason that their comparison with Rift should not give at least an approximately valid reconstruction, even if it might end up really being for some larger unit.

It’s also again Ehret who has done the most work on overall Cushitic comparison! (1987, “Proto-Cushitic Reconstruction”, actually not a monograph though but one of those ~200-page-long megapapers.) Weak etymologies are now more numerous still, but his Proto-Cushitic takes into account also earlier work and does not err too far off into fantasy, I think. His own Proto-South Cushitic also remains a major phonological and etymological input for the reconstruction, probably leaving issues to be fixed in the reconstruction of several per se valid comparisons. A good illustration of how scholars left to work on major topics all by themselves, unchecked by colleagues, risk drifting into speculation. This work demonstrates well also an issue about reliance on intermediate reconstructions… As soon as a lower-level reconstruction has been set up and is thought to be mostly reliable, this should also trigger checking some “reconstruction upwards” on which details of any intermediate reconstructions might call for adjustment. Ehret identifies himself a few of these already; but it seems not easy for anyone else to continue the task just off the cuff, when he is otherwise content to present for his Proto-Cushitic etyma normally just their Beja, Proto-Agaw, Proto-Eastern and Proto-Southern reflexes, i.e. not their lower reflexes, in e.g. Dahalo or even just in something like Proto-Rift. Still, this could be surely done with some editing effort and enough other sources on hand, and maybe leading to reasonable grounds for working out issues such as where does Dahalo group in Cushitic after all. (A few papers even have a curious suggestion that Dahalo contains both an “Eastern” and a “Southern” lexical stratum. Unclear to me though if this is based on any kind of clear phonological evidence, or just on lexical distribution, and then also, how would we decide which one of these is inherited and which one due to contact? Remains to be seen.)

Initial reading of comparative Cushitic is also helping me to put more trust on its own validity within Afrasian. The family is diffuse and challenging to reconstruct, yes, but nowhere near as diffuse and challenging as e.g. the Egyptian–Semitic relationship (per how I’ve seen it presented anyway, I am not an expert of either), and at least some 200-ish clear cognate sets exist — Ehret has up to 650, but less than half of these look trustworthy to me just off the cuff. Moreover, one earlier reason I’ve had for suspicion is that Ehret’s later reconstruction of Proto-Afrasian comes out almost identical to his Proto-Cushitic. But likely this is not actually a strong sign of Cushitic as a maybe-paraphyletic junk zone — instead, it seems likely to be an artifact of Ehret’s own methodology: starting from his own already overengineered Proto-Cushitic and then finding that some materials elsewhere in the vast Afrasian family can be easily matched to it. It would surely be interesting to see what becomes of that picture, too, if pared back to just the cleanest Cushitic and Afrasian etyma… In the meanwhile though, it might be more profitable to focus on improving the inner-Cushitic reconstruction than to jump straight to its more distant relatives.

[1] So far I’m particularly suspicious of the phonological restructuring of the vowel system, very similar in both Agaw and Ethio-Semitic.
[2] In later times now refined to include also e.g. Karelian loanwords in Eastern Sami, early Samic loanwords in common Finnish–Karelian, old Germanic loanwords into both, some false cognates entirely, etc., which has further allowed mostly abandoning even the idea of Finnic and Samic as two sister branches in a common Finno-Samic group. Very few unique similarities remain that could not be identified as either shared archaisms from Proto-Uralic (or some slightly narrower intermediate grouping), as shared arealisms, or as trivial independent innovations.
[3] But not poor, I would think. There by now does not seem to be much of Cushitic going undocumented entirely, and many modern grammars and dictionaries have appeared over the last 30, 40 years, many even by native speakers. Older Italian-colonial-era records also seem to have been relatively thorough already, with better coverage than in many other parts of Africa.


Notes on Janhunen’s Law

Posted on Wed 2022-11-30 by sansdomino — 14 Comments

(Part ca. 3 of n in my irregularly scheduled series of Introducing Named Soundlaws in Uralic Studies. ^[0])

The issue, as I see it

Most of the vowel correspondences we now think to be regular between Samoyedic and the rest of Uralic are those that were outlined by Janhunen in 1981. The actual sound laws behind them have regardless often gotten re-tooled or re-dated by now, much in the same way how many of them already had earlier precedents in some form (primarily from Lehtisalo or Steinitz). E.g. the chainshift *e > *i, *ä > e has been by now shown by Helimski to be post-Proto-Samoyedic, given Nganasan evidence for *e > †e > i̮. On follow-up, also the reflexes of *ä > “*e” can be relatively open in some languages: Salminen (2012) has pointed this out about modern Forest Enets (e.g. *tät³tə > tät ‘4’), and to me it seems e.g. that the conditional developments *ä-a, *ä-å > *a in pre-Selkup also presume an open value for *ä. Cf. *ān-uj ‘true’ < PS *änå, or *kuəsə ‘iron’ < *wåsV < *wasV < PS *wäsa.

What I call “Janhunen’s Law” is, though, not any sound change in Samoyedic, but a proposal that he had in the same paper for an innovation in some uncertain amount of western branches: PU *oCə > *uCə. Sammallahti (1988) indeed adopted it as an already Proto-Finno-Ugric innovation. Since then though there does not seem to have been too much support for it — but then no critique or any other analysis either.

On any kind of closer look, it does seem clear this cannot be quite as simple as Janhunen suggests. First of all, also a correspondence western *o ~ PS *o exists. Janhunen identifies two examples: *koj(-wV) ~ *koəj ‘birch’, *kopa ~ *kopå ‘bark’. This number can be increased: clear examples also include *koj(ə)ra ~ *korå ‘male animal’; *kokə- ~ *ko- ‘to check, see’ (all of these with *ko-, but this looks simply accidental; *ko- > *kå- can be also attested in e.g. *kåmpå ‘wave’, *kåsə- ‘to dry’, *kåət ‘spruce’). Possibly also *ńoxə- ~ *ńo- ‘to pursue, hunt’, though Janhunen assumes that Finnic *nouta- continues earlier *ńux-ta-, thru a similar lowering as in *sou-ta ‘to row’ ~ PS *tu- < PU *suxə-, and this does not look entirely impossible.

I’ve observed already long ago (first presented at the 2nd International Winter School of FU Studies in Szeged in 2014) that there seems to be evidence for further conditioning. First, all of Janhunen’s positive examples involve front consonants in the medial consonantism: alveolars and labials. Four cases are immediately unambiguous:

*lumə ~ *jom ‘snow’;
*kusə- ~ *kot- ‘to cough’;
*purə- ~ *por- ‘to bite’;
*tulə- ~ *toj- ‘to come’.

I would add first of all two cases that should be reconstructed with *-w- and not, as proposed by Janhunen, *-x-:

*śuwə ~ *śo(-j) ‘mouth, throat’; *-w- is clearly indicated by Southern Sami tjovve.
*tuwə ~ *to ‘lake’; *u reflected at least in Permic *ti̮. Original *-w- seems to be indicated by Northern Khanty *tŭw, Konda tŏw, and maybe the oddly front-vocalic təw in rest of Southern Khanty. ^[1]

Probably even a third is *luwə ~ *lë ‘bone’. *-w- is again indicated by Western Khanty forms — mostly rhyming with ‘lake’, e.g. Konda tŏw, other Southern təw, Nizyam tŭw, Kazym ɬŭw (but in Obdorsk lăw, versus tuw ‘lake’). Samoyedic *ë could indicate a shift *ëw > *ow in other languages already before *o-ə > *u-ə (a tentative Proto-Finno-Ugric innovation — though this seems a bit too trivial and devoid of parallels to be relied on for that).

One additional example that was not known to Janhunen shows a palatalized alveolar medial: *wuďə ‘new’ ~ *oj- > North Selkup oć-əŋ ‘again’, a neglected etymology from Helimski (1976). ^[2] Note further that positing *o > *u here explains the rare initial combination *wu-, not reconstructed anywhere else in Uralic vocabulary and probably phonotactically impossible in Proto-Uralic proper.

Looking beyond Samoyedic, it also seems to be the case that from the evidence of other languages, we cannot really reconstruct word roots of shapes like *CoPə, *CoTə, *CoRə. The best two contenders are *monə ‘many’, *wolə- ‘to be’, but the first is readily under doubt as being a loan from Indo-European (also Permic *-mi̮n, Mansi *-mān, Hungarian -vAn in names of decads does not particularly have to be related to ‘many’ in Finnic and Samic), and the latter looks more likely to have been *walə-. On the contrary, many reconstructions of the shape *CoKə have been already presented: at least *jokə ‘river’, *rokə- ‘to hack, cut’, *soŋə- ‘to enter’, *šokə- ‘to say’, *toxə- ‘to bring’; maybe also e.g. *poŋə ‘bosom’, *oŋə ‘hole’ (if not rather *poŋŋə, *aŋə). I take this also as grounds to suppose that there has indeed been a sound change *-oCə > *-uCə, for C ≠ velar.
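As a rough illustration, the conditioned change *-oCə > *-uCə (C ≠ velar) can be written out as a small function. This is a minimal sketch, not a worked-out model: the function name is hypothetical, stems are simplified to four single-character segments (so geminates or clusters like *poŋŋə are not handled), and the inputs *lomə, *kosə-, *porə-, *tolə- are the pre-raising shapes that the hypothesis assumes, not forms taken from Janhunen.

```python
VELARS = {"k", "ŋ", "x"}  # medials assumed to block raising


def janhunen_raise(stem: str) -> str:
    """Raise *o to *u in a CoCə stem unless the medial consonant is velar."""
    if len(stem) == 4 and stem[1] == "o" and stem[3] == "ə" and stem[2] not in VELARS:
        return stem[0] + "u" + stem[2:]
    return stem


# Hypothesised pre-raising stems, followed by velar-medial stems that keep *o.
pre_forms = ["lomə", "kosə", "porə", "tolə", "jokə", "soŋə", "toxə"]
results = {form: janhunen_raise(form) for form in pre_forms}

for before, after in results.items():
    print(f"*{before} > *{after}")
```

The first four inputs come out with *u (*lumə, *kusə-, *purə-, *tulə-), while *jokə, *soŋə- and *toxə- pass through unchanged, matching the distribution described above.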

I suspect also palatal *-j- might have blocked raising: cf. *kojə ‘male’ (though this is mostly continued in derivatives like *koj-ma, *koj-ra). An interesting case on this front is ‘to swim’, usually reconstructed as *ujə- per Finnic (Finnish uida, Estonian ujuma etc.), but most cognates (clearly at least Samic *vōjë-, Mordvinic *uj-, Permic *uji̮-, SKhanty üj-) better point to *ojə-. As I’ve noted by now in a talk from 2018, even within Finnic, Livonian vȯigõ (? < *oi-kV-) seems to still retain *o. The reflex in Samoyedic, on the other hand, mysteriously enough, is still indeed *u- or *uj-.

An alternative view?

The only counterproposal in any clear detail that I’ve seen comes from Jaakko Häkkinen, first in his Master’s thesis and later, much more briefly, in his 2009 paper on locating Proto-Uralic. He suggests inverting Janhunen’s Law, to apply in Samoyedic and not outside of it: *CuCə > *Co(C). I have seen / heard something similar by other colleagues in a variety of discussions, but I do not recall any defense of this being published. At most, see some discussion in this blog’s comments starting here, with Ante Aikio listing some notes about *o ~ *u variation within Samoyedic and additional irregular-looking examples of *o. Among these I would doubt at least the reconstruction PS *počå- ‘soak, ooze’, though. This probably refers to the words appearing in UEW under *poča- ‘become wet’; but Nganasan and (with irregular b-) Kamassian seem to point rather to *påTå-, with evidence for *o limited to Nenets–Enets. Or, since (old) Nganasan fo- can continue not just *på- but also *pə-, and Enets has o < *ə regularly, another option, maybe better still, would be that this was *pəčå- in PS after all, as would be expected per the Udmurt, Khanty and Mansi cognates; and that the Nenets word is a loan from Enets, while the Kamassian word doesn’t belong here at all. (Donner’s original data actually has not just a voiced b but palatalized bʲ, which is also difficult to explain.) In some other examples I don’t see any particular reason to think that they point to secondary *u > *o rather than secondary *o > *u (thus so maybe in “*num” ‘heaven’) or to *o at all (thus so in Nganasan tui ‘fire’ for expected ˣtüi: this looks like unclear retention of *u, which has other parallels).

Anyway, the major problem that I see in the inverted approach is explaining where Proto-Samoyedic *Cu(C) then comes from. There is solid evidence at least for a rime *-uj:

*tuj ‘fire’ < PU *tulə (a minimal pair with *toj- ‘to come’!);
*uj ‘pole’ < PU *ul(k)ə;
*kuj ‘spoon’ < PU ? *kujə (cf. Finnish kuiri ~ kuiru ‘id.’; I am not committed either way on if proposed Komi and Ob-Ugric cognates meaning ‘trough ~ mortar’ belong);
*puj ‘eye of a needle, etc.’ < *pujə.

The last two probably show PU *-jə > ∅ and PS *j as some derivative suffix, ^[3] but this alone cannot explain *u rather than *o, since also the latter readily occurs in CV stems: *ko-, *ńo-, *to, *śo-j. A few PS roots also show *u: natively at least *tu- ‘to row’ < PU *suxə; of unknown origin, *ku- ‘cord’, *ju ‘warm’ ^[4]. Some other CVC examples can be found too, including *pur ‘smoke’ < PU *purkə; *ut ‘road’ < PU ? *uktə. But at least these two examples we might argue to be irrelevant due to continuing PU *u in an original closed syllable, just with exceptional loss of *-ə after some probably very early cluster simplifications.

As comes to the lack of PS roots of shapes such as **Cup, **Cun, **Cuŋ, this could indicate that something happened to such cases, but it doesn’t follow that the result must have been *o. Other options would readily include reduction to *ə, already suggested by Janhunen in e.g. *təŋ ‘summer’ < PU *suŋə.

Future hypotheses

So far I do side with the hypothesis that Janhunen’s Law is a real phenomenon. Its exact extent and conditions seem to require review, however. I have some reasons to suspect that PU *o was in *CoCə stems retained not just in Samoyedic, but partly also elsewhere. E.g. *purə- / *porə- ‘to bite’ yields in Permic *puri̮-; *tulə- / *tolə- ‘to come’ yields in Mari *tola-; both more in line with development from *o than *u. An interesting recent discovery, premiered a few weeks ago on Twitter, has also been to note Khanty *lāńć ‘snow’ (> e.g. Surgut ɬ´åńť, Nizyam tɔńś, Obdorsk laś). UEW derives this from a distinct *ľomćɜ, listing here also some derivatives of PS *jom and probably incorrect Kola Sami reflexes meaning ‘frost’. But if we did reconstruct *lomə and not *lumə already in PU, the Khanty words, too, can be simply considered derived reflexes, at the PU level seemingly *lom-ća: *o-a > *ā is regular, and there does not seem to be counterevidence to assuming *mć > *ńć. Closer review might identify more cases like these that support the reconstruction of PU *o in the involved words.

As more of a long shot, there are also two unclear cases where evidence for *o might be found in Indo-European. For one, ‘to bite’ seems comparable with PIE *bʰe/orH-, root meaning probably ‘to strike, pierce’. The PU verb also probably meant specifically ‘bite thru’ (in contrast to *soskə- ‘to chew’), coming fairly close to ‘pierce’. Its descendants can also be used not just of biting with teeth, but also of working with tools (cf. e.g. Fi. sahanpuru ‘sawdust’, as if “saw-biting”) — similar later development is attested in derivatives on the IE side too (Latin forō, Germanic *burō- ‘to bore, drill’) ^[5] and LIV goes as far as to give a gloss ‘mit scharfem Werkzeug bearbeiten’. Distribution all the way into Samoyedic makes it difficult to assume loaning, though, while a hypothesis about an old Indo-Uralic cognate would not, at the current state of research, rule out an original *u that was lowered to ablauting *e/o in PIE. — For two, there is Finno-Mordvinic *unə ‘sleep’, which Koivulehto (1991) has already compared with Greek ὄναρ, ὄνειρο- and explained exactly thru Janhunen’s Law: early IE *oner → early Uralic *onə > *unə. Whether the Greek word goes back far enough in IE for this to be feasible looks very dubious to me though, especially when there is a much better-attested PIE word for ‘sleep’, *swépnos.

A yet further possibility I would wish to look into in more detail in the future is, does the raising of *o that we seem to see really have the “same” *o as its starting point as is usually reconstructed in PU? Namely, traditional PU *o is in Samoyedic by default lowered to *å — such that its “survival” in Janhunen’s Law cases really looks to be innovative too. As outlined in yet another presentation a few years ago, I have also developed a hypothesis that the unbalanced inventory of rounded vowels in Proto-Uralic: *ü *u *o but no **ö, probably comes by a chainshift from pre-PU *u *o *ɔ. (I have not discussed this on the blog in detail so far and, alas, cannot do so right now either.) Then, the common tendency of PU *o to be lowered to *a / *å probably indicates that this chainshift had actually not fully taken place by PU: that “*o” was really still open-mid *ɔ. Janhunen’s Law positions, however, look like they might have already had close-mid *o. This would allow us to do away with a raising that happened all across “Finno-Ugric” with seemingly no motivation, while still also not folding the vowel correspondence entirely into PU *u.

There would be also another option on the relationship of this *o with my pre-PU *u *o *ɔ. Rather than early raised cases of (pre-)PU *ɔ, they might be also straggling non-raised cases of pre-PU *o… And then was this *o really just an allophone of *ɔ either? *u is a very common vowel in PU, and perhaps this is partly because even some further cases should be likewise reconstructed as *o. This might be possible if we identified other evidence for it than retention as *o in Samoyedic. For the sake of example, one case might be Mansi *u: PU *u yields in Proto-Mansi either *u, *ŏ, *ă with no very strong conditioning apparent. (Some similarly open issues remain in Khanty and Hungarian.) So just maybe … could it be that PMs *u is a sign of PU *o as distinct from both *u and *ɔ in general? such that not only will we then reconstruct PU *por- ‘to bite’ (> PMs *pur-), but also e.g. *końćə ‘urine’ (> PMs *kuńćə), with *o > *u now also in Samoyedic in this environment (> PS *kunsə)? This would even have a good parallel among the front vowels: PMs *i is generally from PU (close-)mid *e, not from close *i. — But in the interests of putting these notes finally out at least in a somewhat assembled form, I will leave this line of thought open for now.

[0] See previously at least: Lehtinen’s Law; Moosberg’s Law; and one that definitely requires a name but I’m still mulling over what to call it precisely is *Ä-backing in Finnic. Several future installments remain planned too.
[1] On the contrary, an irregular fronting already in Proto-Western Khanty would also account for most of these reflexes: *tŭɣ > *tü̆ɣ > *tĭɣʷ > *təw, preserved in SKh and giving NKh *tŭw (cf. e.g. ‘fall’: PKh *sü̆ɣəs ~ *sü̆ɣs > SKh səwəs ~ süs, NKh *sŭws or *sūs). But it seems preferable to me to restrict this irregularity to Southern Khanty and treat Konda tŏw and NKh *tŭw as regular reflexes. — Maybe there is some possibility that the SKh development here and in ‘bone’ can be explained as *ŭw > *ū > *ǖ > *ü̆w > əw, leveraging the known fronting *ū > *ǖ? It doesn’t look like *ŭw and *ū actually contrast at all, so the first step here might be entirely virtual.
[2] Хелимский, Е. А.: О соответствиях уральских a- и e-основ в тазовском диалекте селькупского языка. – Советскoе финно-угроведение 12: 113–132. No cognates known elsewhere in Samoyedic, but the simplification *wo- > *o- would have to be pre-PS anyway, since by PS a new *wo- does exist and per two examples yields in Selkup *ko- as expected: *woəj > *ko ‘island, hill’; *wotå > *kotə ‘blueberry’.
[3] Though, since PS shows *r > *l / C_ in various suffixes, could it be possible that after *j, the resulting cluster further coalesced to *ľ, and then evolved into just *j as usual? In this case Fi. kuiri and PS *kuj could both go back to PU *kujrə (now with no especial reason to suspect a suffix in there).
[4] For a formal match and semantics within speculation distance, cf. PU *luwə ‘south’ ≈ ‘direction where the weather is warm’?? Seems unlikely but not impossible.
[5] And cf. further PU *pura ‘drill’, also already proposed to be an IE loan. So far it seems morphologically unclear to me how to connect this with either the PU or PIE verbs, though.

Tagged with: historical linguistics, linguistics, proto-uralic, samoyedic, sound laws, vowels
Posted in Reconstruction

State of the Blog: Second Decade

Posted on Sun 2022-11-27 by sansdomino — 5 Comments

Blogging here at Freelance Reconstruction has been slowing down in recent times, as we approach the 10-year anniversary of its WordPress iteration, coming up just at the start of the next year.^[1] In 2013–2019 I have been writing about 1–2 articles per month; in the 2020s so far, less than 10 per year. To be sure, some of life’s external issues and circumstances have also been getting in the way, starting already with the obvious: CoViD-19 and issues downstream of it. But this also coincides with me finally being now at the rank of a graduate student, and being not just welcomed but expected (as of this year, by actual funders even ^[2]) to now put out my ideas as proper peer-reviewed publications. There is a whole bunch of work to do on this. Or indeed re-do: it feels like every article draft I sketch out ends up with at least one footnote to the effect “for earlier discussion of this, see Pystynen 2014 [blogpost]”.

Another turning point approaches too: where this blog will, at last, have more published than unpublished posts, both being at ca. 160. This may give a hint to what extent I have also quite a lot of unpublished research, most again formulated back in the mid-2010s, still stewing in my blog drafts. This is a situation that definitely calls for skipping over a step in the publication pipeline and refactoring this corpus, too, into other forms, now that I am able to do so. And this also does mean much fewer blog posts coming out as intended.

Even a third venue to air my ideas is by now moreover the Finnic Etymological Wiki Database, which I have been setting up over these same few years, under the folds of our project on writing a new etymological dictionary of Finnish (which uh, I don’t think I’ve ever announced here in detail; partly since it’s being written in Finnish). The platform is intended not just as a data backend for the dictionary, but also for discussion among scholars, e.g. for proposing new etymological ideas that do not seem quite ready for publication just yet. I’m by now doing this with some frequency, instead of spending more work on turning them into etymology squibs here (sample: is Mordvinic čakš ~ šakš ‘pot’ not a cognate of Old Finnish haaksi ‘ship’, but maybe a derivative from čava ‘plate’, if from earlier *šaɣa?). — Any colleagues interested in this, and with serious familiarity with Finnic etymology at least, are also welcome to request an account from me or the rest of the moderation team for contributing to the discussion.

By no means do I wish to abandon blogging altogether. But I may aim to shift away from the more effort-demanding blogposts to the effect of a mini-research article, at least as long as blogposts continue to be neglected by the powers-that-be as a recognized type of research output. Perhaps I will focus more here on reviewing issues, or bringing up points already made about them in the literature, than on presenting major syntheses on what to do with them. It remains to be seen how this will work out. But you can probably at least expect to see the next few Uralic reconstruction posts appearing here to be rather in this paradigm. Of course posting of other matters, e.g. on the state, context, philosophy and methodology of historical linguistics, is likely also going to be continuing on to the next decade. And maybe I will yet get around to overhauling the site’s appearance or organization, as already hinted in 2019.

Thanks to all readers and commenters, and see you in the rest of the 2020s!

[1] The decennary of my linguistics blogging altogether has already slipped by about a month ago…
[2] I would also like to take an opportunity here to issue my thanks to Ante Aikio and Martin Kümmel for letters of recommendation to go with my funding pitch.

Tagged with: academia, projects, resources
Posted in News

Long-Distance Comparisons As Butterflies

Posted on Fri 2022-07-29 by sansdomino — 14 Comments

One of the rationality-cluster blogs here on WordPress, Aceso Under Glass, a while ago posted about a concept I find immediately useful: “Butterfly Ideas”. Roughly speaking, hypotheses that need further development, are probably not ripe for serious criticism as they stand, but could benefit from preliminary discussion (read the full post for more).

On this blog and elsewhere, I have repeatedly entertained a variety of “long-distance” linguistic relationships: Nostratic, Uralo-Yukaghir, Uralo-Eskimo, the works, despite not being so far highly committed to any of them. One idiom I’ve previously used to defend this is “big fish are worth angling even if you don’t catch any”; that there are major potential gains for our understanding of history (both intra-linguistic and extra-linguistic) if any of these theories start to prove themselves in more detail. Or as the more succinct modern spin goes, “big if true”. A second motivation is provided by what I have called the “cell theory of language”: spoken natural languages only come from other natural languages, never out of nothing. ^[1] This gives a strong prior that all natural languages are, indeed, related, even if we currently lack the knowledge of the details. Factoring in also anthropology further gives strong reasons to believe also in the existence of a number of “bottleneck proto-languages”, such as Proto-Australian, Proto-Amerind or Proto-Exo-African. So big fish are very likely indeed out there, even if we are not sure if our lures are working. Though then these are weaker boundary conditions that do not establish what currently-known families exactly would be the daughters of such a proto-language. E.g. who knows if some American languages might be not Amerind ≈ Beringian, but something else, like para-Na-Dene, pre-Clovis-coastal, Solutrean…? Continuing the metaphor, this would mean we don’t even know how big the fish are exactly, and so also we might not know (yet?) what are the best ways to catch them.

But there’s also a sense in which I think long-distance relationships would be better seen as butterflies than big fish. We do not find relationships in an instant, as sudden flashy discoveries (by “bites” on a “lure”). All spoken languages are in principle comparable, with known typological differences but also universal family resemblance. ^[2] The universality of basic phonological categories in particular makes it possible to find some resemblances between any two languages that plausibly could be indicative of some etymological or indeed genealogical relationship. Whether they actually are, depends on additional work on fine-tuning details. Are they above the level of pure chance, and independent of known onomatopoetic and nursery word trends? Are they in conflict with other data of equal value? Do they show recurring sound correspondences, at least some of them nontrivial? These are questions for which we cannot expect to have every answer in place immediately. Any relationship must always begin from observing some similarities that are not probative in themselves, and then pursuing this as a hypothesis and seeing if it guides us to more similarities, ones that will not require further costly assumptions to justify.

If all we knew about Finnish and Hungarian were that their verbs for ‘to live’ are, respectively, elää and él, this would not be sufficient evidence to establish them as related languages. But they are, indeed, cognates. Insufficiency or statistical insignificance does not in any way refute cognacy per se. And it is true that checking for more examples of the correspondences e ~ é and l ~ l turns up more evidence such as pelätä ~ fél ‘to fear’. Now with a new correspondence p ~ f, but this does not mean we turn up our nose and declare the hypothesis unworkable: it’s possible to continue and maybe discover, say, pesä ~ fészek ‘nest’. It always takes several steps like this to assemble e.g. a phonological core that will be self-evidently non-accidental. Same for other “evidential cores”, such as partial common morphological paradigms. There is no immediate bite that instantly proves a relationship, but rather, a first weak signal, which will rise in importance once combined with a proper selection of other datapoints.
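As a rough illustration of how such a phonological core accumulates, the three comparisons above can be tallied mechanically. This is only a minimal sketch: the segment-by-segment alignments are my own hand-made assumption covering just the first few segments of each word, the names `COMPARISONS`, `tally` and `recurring` are hypothetical, and a threshold of two attestations is an arbitrary stand-in for "recurring", not a real significance test.

```python
from collections import Counter

# Hand-aligned (Finnish, Hungarian) segment pairs for the beginning of each
# comparison mentioned in the text. The alignments are an assumption made
# for illustration; they are not the output of any alignment algorithm.
COMPARISONS = {
    "elää ~ él 'to live'": [("e", "é"), ("l", "l")],
    "pelätä ~ fél 'to fear'": [("p", "f"), ("e", "é"), ("l", "l")],
    "pesä ~ fészek 'nest'": [("p", "f"), ("e", "é"), ("s", "sz")],
}


def tally(comparisons):
    """Count in how many comparisons each segment correspondence occurs."""
    counts = Counter()
    for pairs in comparisons.values():
        # set() so that a correspondence counts at most once per comparison
        counts.update(set(pairs))
    return counts


def recurring(counts, minimum=2):
    """Keep only the correspondences attested in at least `minimum` comparisons."""
    return {pair: n for pair, n in counts.items() if n >= minimum}


counts = tally(COMPARISONS)
supported = recurring(counts)

# Print the recurring correspondences, best-attested first.
for (fi, hu), n in sorted(supported.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{fi} ~ {hu}: {n} comparisons")
```

With these three comparisons e ~ é recurs three times and l ~ l and p ~ f twice each, while s ~ sz is attested only once and so remains a lead to follow up rather than a part of the core. Every added comparison can either strengthen an existing correspondence or open a new one.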

Any “minimum convincing argument” will not be dozens of steps deep necessarily, but where patience is especially needed is that at any stage there will be plenty of false paths of expansion that will not lead to a workable theory. If at some early point, we had formed a hypothesis of a ~ a, and then run into vapaa ~ szabad ‘free’ (without realizing that both are loanwords from Slavic) — we could still find more evidence also for p ~ b (e.g. by misanalyzing the correspondence Fi. mp ~ Hu. b), but no additional good evidence would be turning up for v ~ sz. At some point we might end up concluding that, yes, this is going nowhere and should be discarded. But then only this comparison! Finnish and Hungarian are still ultimately related, even if their words for ‘free’ are not cognate. Discarding this one comparison does not (should not) mean discarding also any other adjacent comparisons. A burgeoning comparative edifice needs to be open for exploration and individual mistakes, if it is to ever reach any particular rank like “a probable relationship” or “a proven relationship”.

This plea of course has also a corresponding inverse. Anyone who wants a “butterfly” treatment of their ideas has to have enough intellectual humility to recognize that it is, indeed, a tentative first-pass version. All too often I see also people who have a new language relation hypothesis in hand double down on their speculation, and not be open to even constructive criticism. Perhaps in some part there is a misunderstanding where people do not recognize the proposal of better, non-cognate etymologies (borrowing, onomatopoeia, internal derivation) as progress. But certainly also lone-wolf-genius-ism, and its attached incapacity to admit mistakes, is a problem that exists.

On the other hand, I don’t think this side of the problem needs to be focused on too much. In historical linguistics, the exploration of linguistic relationships is already a known research programme, a goal that many people agree to pursue even if we tend to disagree on quite a lot of details. This in mind, if a K. Kookenstein puts out a paper allegedly showing how English is related to Arabic, but then refuses to consider these comparisons in light of what Indo-European or Semitic linguistics has to say on this: we don’t actually need his approval on this! Language data is not locked, copyrighted, or in any other way tied down to one person, and if desired, it will be possible in any case to check such papers for insights relevant also to better situated IE–Semitic comparison. I know I at least keep a few “Hungarian is too a Turkic language” type works around for this purpose. The intended main thesis is not going to pan out; but any data cited to this end could prove to be regardless still valid. Usually anything of this sort mostly relies on word comparisons (appeals to typology are strangely rare), and these might remain valid as etymologies of any imaginable type… not just Turkic loans in Hungarian, but maybe also old Hu. loans in Tk.; Hu. cognates of Khanty or Samoyedic loans in Tk.; common loans from some third source like Iranian or Yeniseian or Mongolic; some could even end up being evidence for a general Turkic–Uralic relationship. None of this is a priori ruled out, and in this way it may well be possible, with patience, to find meaningful building blocks even within theories that don’t hold up in their entirety. Such is a nifty property of historical linguistics, something that definitely doesn’t generalize to every science.

The two animal metaphors from the start of this post, though, no longer work very well at this point. Some butterflies … may grow up to be big fish, even though most probably don’t? Moreover, I have been mostly illustrating this discussion with disputed-but-definitely-published ideas. More nascent ideas that are simply brought up in a discussion are a different beast for sure. Of course there’s a selection bias here too: the actual butterfly ideas I do have, you will probably not be seeing on this blog as such (and you might have to watch closely to catch any even on my side channels). ^[3] Arguably also scientific publishing is “a conversation”… especially any ideas that can be so far found only in some paper draft posted for comments online (in linguistics they’re not even concentrated yet on any arXiv analogue). For these, the original reading of a butterfly idea seems to still work fairly well. This may hopefully help (e.g.) various long-distance proposals to develop better in the end, before they end up with one of two common fates: shelved as not having passed the judgement of Reviewer #2, or self-published with excessive confidence. For this goal, yes, the ball very much is first in the court of people who do have an idea and want to develop it; but it is also in the hands of the rest of us, in being willing to offer first criticism that’s not a complete dismissal. Thirdly, worth noting, all this also depends on a social milieu where people even can find parties interested in discussing some out-there idea.

A further aspect of AUG’s original concept — avoiding unnecessary emotional stress upon people presenting a new idea — I haven’t really even touched here yet. This would be a whole other jar of larvae, but suffice to say I agree that academic discussion, for all its standards of civility, fairly often can have undertones all the way to hostility. This probably scares away many people without a thick skin who might otherwise have had a few interesting things to say; and those of us who do stay engaged, to whatever degree, it may leave with more stress than is necessary.

Some of it, I’m sure, does not even come from a particular need to be prickly, but from limited time… Sufficiently well-known figures in a field tend to get approached by a disproportionate amount of amateurs with A Revolutionary Discovery, unless they specifically keep themselves hard-to-contact, or, perhaps, maintain an aura of not suffering fools gladly. Again a problem that might be softened with other people being open and approachable enough. But this also starts edging towards the general area of science communication and public relations, a bigger fish still to fry that I’m not going to pretend to already have big original ideas for right now (and the butterflies, they will have to wait for other channels).

[1] The famous case of Nicaraguan Sign Language does not seem to have spoken analogues. In principle there is little directly preventing such a case (and something of the sort, maybe in several gradual episodes, will have to be assumed as the ultimate origin of human language too), but the conditions are unlikely to ever come about. A community of children who are capable of speech but do not have access to any pre-existing spoken language? Sorry, language in general is too adaptive to have been ever abandoned after its first introduction. I will go as far as to suggest that all known human cultures depend strongly enough on language for the transmission of cultural knowledge that any sudden failure of language skills across an entire human group (say, a transmissible disease that induces deafness, fast enough that a signed language does not have time to develop) would not lead to an all-new language being developed a few generations later; it would lead to the group’s extinction.
[2] In the philosophical sense, not the genealogical one. E.g. despite some exceptions, most languages still have nasal or labial or velar consonants; all but the most impoverished and unbalanced phonological inventories or even just consonant inventories are going to have substantial overlap between them. And even if we did find languages that somehow have completely disjoint phoneme inventories (lazy example: one has only stop consonants and front vowels, the other only continuant consonants and back vowels?), they will not be unbridgeably far apart: the known typology of sound change allows hypotheses relating basically any two speech sounds. Grammatical categories, too, can be quite different but still only finitely far apart, where the details of known language histories likewise give us ways to relate non-identical categories to each other (or to derive them de novo language-internally, etc.)
[3] A freebie for the sake of example though: cf. some very loose thoughts about the subclassification of Oceanic as floated on Tumblr just a few days ago (also already with some, though not highly severe, critique from a regular correspondent over there).

Tagged with: historical linguistics, linguistics, macro-comparison, philosophy of science, sociology of science
Posted in Methodology

Language Family Tectonics

Posted on Thu 2022-07-14 by sansdomino — 10 Comments

Basic research in historical linguistics is mostly done within individual families: we take a swath of attested (in most cases modern) languages, and work towards the past to figure out their development from a common origin, one group at a time. Any knowledge of languages outside the family only really factors in as correction terms: filtering out loanwords and other contact influence, as data that the family’s overall internal history will not need to account for.

What the big picture of this looks like once we consider also geography is that we end up with a series of dots — “homelands” (though not to be understood as points of creation, but simply the last uncoverable phase of earlier processes) — somewhere in the past; some of which have then expanded, to cover the whole world by today. Just a few millennia ago, much of the world would have been an uncharted area, full of regions from which no knowledge of their languages has survived to us. The ones that do survive would, even, have been largely isolated dots. Most language contacts must eventually end (or rather, begin) at some point in the past. Languages of different families, that are today next to each other, cannot all have had their parents too as neighbors. Perhaps some individual cases were: Proto-Germanic seems to have been about as much of a neighbor of Proto-Finnic as Swedish and Finnish are still today; even further back, something like Proto-Kartvelian as a neighbor of Proto-Northwest Caucasian could be possible too. But once we consider highly expansive families, it is self-evidently absurd to propose that Proto-Indo-European could have been simultaneously a neighbor to all of (pre-)Proto-Kartvelian in the Caucasus, (pre-)Proto-Uralic in the taiga zone, (pre-)Proto-Dravidian in South Asia, pre-Basque in Iberia…

This already implies that most borders of today’s language families are collision zones: where two lineages have come to meet that were not in contact at some point in the past. (Same also for some, though fewer, language borders within them.) I’d like to think that we can probably divide them further in subtypes. This will have to include their history, not just their current but also past dynamics. One reasonable analogy might be plate tectonics. Geologists are not content to simply locate the current boundaries of the world’s tectonic plates, but ever since the rise of continental drift to a mainstream theory, already introductory maps will also aim to identify boundaries as either constructive, destructive or conservative. Often longer-term history or future, too, could be extrapolated from arrows of movement (of, yes, actual movement right now — as per the classic example and the mid-ocean ridge closest to me, the Atlantic Ocean is growing some three micrometers wider every hour, already a perfectly visible amount of maybe 0.3 millimeters since I began to write this blog post).
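For anyone who wants to check the back-of-the-envelope arithmetic in that parenthesis, here is a minimal sketch. Both input figures are simply the round numbers quoted above, not measurements, and the variable names are my own.

```python
# Round figures as quoted in the text (assumptions, not measured values).
RATE_UM_PER_HOUR = 3.0  # Atlantic widening, micrometres per hour
WIDENING_MM = 0.3       # widening attributed to the time spent writing

# Hours of writing implied by the two figures (1 mm = 1000 micrometres).
hours_elapsed = WIDENING_MM * 1000 / RATE_UM_PER_HOUR

# The same rate expressed per year, using a 365.25-day year.
mm_per_year = RATE_UM_PER_HOUR * 24 * 365.25 / 1000

print(f"implied writing time: {hours_elapsed:.0f} hours")
print(f"implied spreading rate: {mm_per_year:.1f} mm per year")
```

So the two figures together imply on the order of a hundred hours since the post was begun, and a spreading rate of a few centimetres per year.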

Of course this is not to be aped too closely. The social “forces” that drive linguistic expansions can be rather fickle, nowhere near as stable and predictable as the physical forces of geology in e.g. continental drift. No responsible linguist is going to be putting a predicted specific time of death on any but, perhaps, an already moribund language (those where all transmission to new generations has already ceased, and the only question is whether the last few speakers have 5 or 50 years left to live); and predictions on what languages will be gaining new ground entirely I have not really seen anywhere at all. If anyone wants to register particular predictions, be my guest, but currently these are really only going to be educated guesses, not derived from a theory with known predictive power.

So maybe let’s not draw any future-pointing arrows on linguistic fault zones just yet. Drawing past-originating ones, though, seems like a much more doable task, first of all in cases where (some) history is already known. And this I think also gives us anyway some analogues of geologists’ “constructive, destructive, conservative”. A look at known history actually suggests that just two types might be enough to get started. Of course we can have conservative boundaries, where languages have stayed each on their own side for a while. This often coincides with also geographic boundaries of some sort (e.g. the northern boundary of Indic has been, broadly, at the Himalaya for millennia, and it’s no wonder that the Korean / Japonic boundary has stabilized between the Korean peninsula and the Japanese archipelago). Then we have collision zones, where two lineages come head to head —

But wait. Head to head? No, actually, the most typical case we see anywhere in the world’s known history is not quite this. Where we find e.g. a Germanic / Celtic boundary in the British Isles, a Finnic / Samic boundary in northern Finland, a Turkic / Iranic boundary north of Iran, a Bantu / Khoe boundary in Botswana: these do not represent cases of two spread events that finally arrived at some common ground simultaneously, running out of no speaker’s land to claim. Almost always such a border represents one newer (Germanic, Finnic, Turkic, Bantu) and one older family (Celtic, Samic, Iranic, Khoe), with the latter’s historical range extending far into the former’s current-day one. The geological analogy happens to continue working here too to some extent: when two plates collide, for all the mountains that result, these still are not zones where both plates indefinitely squish and crumple without crossing. Instead one plate will be pushed underneath another, into the crust (and mainly the topmost one will jut up as mountains). Now the distribution of language families does not really have a Z-axis, but the time axis does similar duty here. We already routinely speak of e.g. English expanding (having expanded) “over” Brittonic; and call the latter a “substrate”, the former a “superstrate”, again employing terms from geology that strictly speaking refer to vertical location. I’m sure also a part of the motivation is one of geology’s core findings that, by default, vertical order reflects historical order!

To fully derive an understanding of this situation, the naive zeroth-order model of language family expansion (they start in some compact area in the past and begin expanding) moreover needs to be amended by the fact that expansions are not infinitely powerful: they can run out of steam even without encountering another expansion in their path. Not only does Finnish supersede various lost Sami varieties, it is also not the case that Samic started somewhere in the north and expanded south until running into Finnic. Rather, Samic also itself originally expanded mainly northwards, probably much along the same geographic routes. There was no southward expansion front of Samic for Finnic to collide with; nor an eastward expansion of Celtic by the time of the Germanic expansions, etc. In this way linguistic expansions might have a better geological analogy still in lava flows in a volcanic field: they will layer on top of one another, not by virtue of which one expands faster or more strongly, but by simple virtue of which one has already stopped, at least in a particular area, and which one is still going.

In those cases where two expansions do happen to be going on simultaneously, this is maybe indeed more likely to end up with something resembling a conservative boundary. And also among these, many though will prove not quite entirely stable if we look closely enough. They can turn out to be series of small advances on either side, just not spilling out to outright conquest of the other family (and likewise, mostly not inherently one-dimensional lines anyway, but a crossfade in the proportion of speakers of X versus Y). Again more like lava flows than continents.

Still, I will continue to keep the term “tectonics” here anyway. Etymologically looking, it is not a term that by itself implies the details of plate tectonics, but simply refers to the largest-scale analyzable units.

What can we do with this then? If we recognize that the world’s major language family boundaries are mostly collision zones — where one family is or has been in the process of expanding at the cost of another, not currently expanding one — this gives us first of all convenient rules of thumb about linguistic substrates. Anywhere near a language family boundary, the substrate of an expanding family X is probably primarily the non-expanding language family Y next to it. At least in the wide definition of “substrate”, that is “the language spoken there before the expansion of the current family”. If it has left any discernible substrate influence, structural or lexical or toponymic, would be another discussion entirely. Conversely, locations where we might be able to fruitfully hypothesize completely extinct substrates will be instead

more towards the geographic or expansion centers of recently expansive families (thus e.g. the Paleoeuropean substrates of Germanic);
underlying not-most-recently expansive families that have few or no leading edges over anything anymore (thus e.g. the Paleolaplandic substrate in Samic).

Or further yet. The facts that language families expand from small origins, readily take over other languages in the process, and are also generally just some thousands of years old, lead us to also a more powerful rule of thumb: There Was Some Other Language There Before. Almost no language is the absolute first language to have been spoken in “its” territory. The main exceptions would be a few cases of recent seafarers, above all in Polynesia; several more scattered cases also in the Atlantic, of which I think only Icelandic and Cape Verde Creole have been established as their own languages. ^[1] At any other ends of the Earth, Inuit is a known newcomer in the American high arctic, Pama-Nyungan is a known newcomer in the Australian interior desert (even if the languages preceding them are not attested)… and in places with long written history, we may find quite extensive known successions, to the effect of Hattic replaced by Hittite replaced by Luwian replaced by Aramaic replaced by Greek replaced by Arabic replaced by Turkish. Maybe some Assyrian or Kurdish phase in there somewhere too, depending on what point we’re considering here exactly. More importantly, over the remaining at least 60,000 years of modern human presence in West Asia without written records, obviously much much more of this still. Not all of this leaves major genetic or archeological fingerprints, either, and some specific cases might be very hard to identify if we didn’t have linguistics itself as a source of evidence.

For two, it will be generally beneficial to work out which of any two language families in contact at a particular border has been the more recently expansive one. ^[2] Known more widely, at least. I’m not sure if there actually are many cases where this would be a mystery entirely. I could think of some hard-to-tell cases once we’re talking about subfamily borders (Mari / Udmurt? Celtic / pre-Latin Italic?), but even here probably some dedicated experts would have an opinion. Maps of individual language families, especially in historical contexts, often enough also have some spread lines or historical distributions marked. But large-scale summary maps still trend towards presentations like this, seemingly entirely static, even though the process of restricting language families to complementary areas necessarily elides some current-day detail in favor of historical idealization (denoting where a language family “is native” or “is traditionally spoken”). I’ve seen sociolinguists criticize this whole genre of language distribution maps repeatedly already, in them not really capturing synchronic reality. The response though might not need to be to abandon them entirely, as much as admit that, yes, they are maps that display some historical information too, and adjust accordingly for more history-informed design. If there is knowledge on this mostly out there, why not?

For three, a concept of family tectonics readily draws attention to the point that there’s work to be done not just on charting language families’ “current” or “traditional” distribution, but also their past distribution. “Beneath” (before) any current language family there “is” (was) some different distribution of other languages. Some of them maybe belonging in its still extant neighboring families, some maybe its own lost relatives, some maybe unknown entirely.

The first possibility I find the most interesting for the sake of further work. The closest example to my work comes from central and eastern Siberia. An important but I think largely open question would be what was spoken in the area before the expansion of the relative newcomers? Russian is of course the newest layer all over the place, but Siberian Turkic (Yakut, Tuvan, etc.) and Northern Tungusic (Evenki, Even, etc.) are both parts of relatively recent families too. What have they ended up displacing? Early Russian explorers report, and rudimentarily attest to, first of all a formerly wider distribution of the Yukaghir family, today known only in two small islets; and a variety of Samoyedic and Yeniseic varieties in the southwest of this area. Still, the main Turkic and Tungusic expansions must have been early enough to predate all historical records in the region, so this cannot be the whole picture either. One hypothesis I keep coming back to is the possibility of a lost “tenth” Uralic branch — perhaps para-Samoyedic, perhaps an independent branch entirely. This might have some benefits to it in explaining a variety of known but not especially substantial similarities between Uralic and all the other families further east. Turkic of course has been in direct contact with (branches of) Uralic anyway, but various parallels continue sporadically into Yukaghir, Tungusic, Chukotkan, Nivkh, Eskaleut. All of them seem more likely to originate from the Uralic side, due to it being the Siberian family with the most known time-depth. Yeniseian is sometimes approximated as rather old as well, but otherwise both “Neosiberian” and “Paleosiberian” are all families without too much time-depth. ^[3]

Most notably, Uralic parallels in eastern Siberia include even basic words for ‘reindeer’, an all-important livelihood animal for many groups these days, especially Chukotkan *qora (whence the ethnonym Koryak), Tungusic ⁽*⁾oron (or probably *xoron, with further diffusion after *x > ∅ in NTg) (whence the ethnonym Oroqen). Kolyma Yukaghir qoroj ‘two-year-old male reindeer’ is usually adduced here too, as well as loanwords further into Siberian Yupik. This has been already identified in earlier research as a Wanderwort originating in Proto-Uralic *kojəra ‘male [domestic?] animal’ > Proto-Samoyedic *korå ‘id.; bull reindeer’, which might have already had an allophonic [q-] in Proto-Samoyedic or even earlier. But we seem to lack especially clear evidence on who is to be credited for the original diffusion of this word. Yakut, as far as I know, has no reflex of it, splitting the Eastern Siberian region off from Samoyedic, and thus probably suggesting a pre-Turkic movement eastward. If so, then maybe even already at the time of the original Uralic expansion (which I think must have been partly eastwards too in any case)? Who knows. Maybe someone will eventually though, if we get e.g. some additional toponym data for guidance and keep inter-family comparative research going.

Elsewhere in the world, I’m wondering also about e.g. how far Africa’s other language families might have reached before the Niger-Congo and particularly Bantu expansion. The case of possible contact between Khoe and Cushitic is already preliminarily discussed in a 2009 paper from Blench, though I’ve been unable to verify his interesting claim that Khoe #goe for ‘cow’ would be comparable with similar “widespread terms” in Cushitic. ^[4] The quite tattered Central Sudanic looks like another good candidate for a family that might have been more widespread earlier (but might have been also encroached upon by Chadic and the various branches of Eastern Sudanic). In the Americas, too, I could wonder especially what preceded the large continuous spreads of Athabaskan and Algonquian in most of Canada and the northern US? (And also which of them is the newer one?) Was there ever anything to the effect of “Inland Tsimshianic” or “Inland Tlingit”, “Plains Iroquoian” or “Forest Caddoan”? Or turning to Oceania: how far west and east did the various “”Papuan”” language families (many of them even today not confined to just New Guinea) extend before the Austronesian / Malayo-Polynesian expansion? For that matter has anyone even tried comparing any of these with the other continental SEA languages in any capacity, or just assumed that they must have been in splendid isolation amongst each other linguistically effectively forever?

These are questions that, again, some experts might already know answers to or at least have hypotheses for. But nowhere is this information available in centralized geographic form, even though it would be surely possible to represent so, giving a kind of a bird’s eye view of what are the major ethnohistorical results achieved or confirmed by historical linguistics, and what questions still remain open.

[1] Faroe Islands seem to be better established than Iceland as having had a pre-Norse population (at least as of the Nature study just last December). A longer list of cases without a distinct local ethnicity includes e.g. the Azores, Bermuda, Falkland Islands, Svalbard, Tristan da Cunha (and also remote islands in the other oceans, e.g. Kerguelen). There are some more within-reach cases like the Andamans, Maldives or Nicobars, for which I’m not sure what’s known of their prehistory (though then already the existence of two Andamanese language families suggests that one of them is very likely older than the other).
[2] Not always the same family on top in all interactions: Turkic has been expansive over Iranic, while Russian has been expansive over Turkic … and yet Russian and Iranian are both Indo-European. It should be no surprise at all either when we find e.g. language shift from Swedish into Finnish in Finland, vs. from Finnish into Swedish in Sweden.
[3] Really if “Neosiberian” is taken to mean “the recent but pre-Russian arrivals”, and “Paleosiberian” as everything else in the area — then we ought to be counting Uralic as the largest representative of the latter, not as some European family that somehow just happens to be also present. By now we do know the westernmost expansions of Finnic, Samic and especially Hungarian to be relatively recent, while Uralic or pre-Uralic presence in western Siberia has no established terminus post quem (short of the hard geological limit of the last ice age). — I suppose the usual exclusion of Uralic from “Paleosiberian” has been instead more informed by its typological similarity with Turkic and Tungusic. But then this seems improper when the term is Paleosiberian, not “Non-vowel-harmonic-siberian” or anything else of that sort.
[4] Checking with a recent monograph from Bender instead shows some very incomparable-looking terms in most of Cushitic, such as Oromo /saʔa/, Konso /lawaa/, Agaw (Central Cushitic) *lɨw-, South Cushitic *ɬee; or does Blench have some supposition about a Northeast Caucasian-esque *ɬ > *g?! — Further north, *gʷow- ‘cow’ in Indo-European does look amusingly similar to Khoe, but Afrasian is a bit too wide and old of a family (definitely older than the domestication of cattle, which “only” dates to ~10,000 years BP) for me to think that there could be a connection entirely without it. Even something like the mysterious Y-DNA haplogroup R-V88, common in central Africa around Lake Chad yet seemingly derived from Eurasia, doesn’t really allow any connection that would reach all the way to southern Africa.

Tagged with: cartography, ethnohistory, historical linguistics, linguistic geography, linguistics
Posted in Methodology

Reviewing UraLex

Posted on Thu 2022-06-23 by sansdomino — 15 Comments

Nerdsnipe of the day: the BEDLAN team, researching diversification of the Uralic languages interdisciplinarily, mentioned earlier today that they will be soon uploading version 3 of their UraLex dataset of basic vocabulary across Uralic. I thought this might be a good time to do a look-over of the data, from a not-that-computational historical linguist’s point of view (i.e. mostly on the contents, not the technical details). Maybe these comments will be helpful either to the team or to other people aiming at similar projects.

Data sources

The selection / definition of languages looks mostly good already to me, with varieties being specified fairly closely, including details like “Sosva Mansi” rather than just “Northern Mansi”. Unmarked “Selkup” is however questionable at least. This is claimed in the documentation to be more specifically Taz Northern Selkup, the currently most vital dialect ^[1] and the basis of current written Selkup. The listed forms, though, often look more like the Proto-Selkup reconstructions from Sölkupisches Wörterbuch, e.g. in retaining PSk *č (> modern NSk /t/) and *uə (> *Cʷë > modern NSk /Cɤ/, /wɤ/). A similar issue is the database’s “Karelian Proper”. This too does not appear to be any real variety of Karelian, but rather the interdialectal lemma forms of Karjalan kielen sanakirja, which are frankly overly Finnishized (not really actual Proto-Karelian), and elide many important contrasts, especially voiced obstruents and, mostly, the s / š contrast. E.g. rasva for ‘fat’ only appears as such in the Oulanka dialect. Most northern Karelian has rašva, much of southern Karelian razva, some intermediate southern dialects ražva.

The KKS and SkWb lemmas are probably tolerable as lexicostatistic indices to Karelian and Selkup, but I hope some future update might fix this in favor of actually-recorded language varieties — and certainly before anyone tries to do phonological analysis with this data!

I would have some desiderata myself on what varieties’ classification would be interesting to gauge by their lexicon. Foremost maybe transitional varieties, such as Karelian Isthmus Finnish; NE Erzya and Shoksha; Pelym, Lozva & Eastern Mansi; Berezovo, Nizyam, Salym & Vartovskoe Khanty; anything really among the Selkup dialects. But it’s possible that this is too fine detail for a Uralic-wide dataset and would call for within-language-group studies instead, similar to Rydving (2013) on Sami. And it appears that the most important additions for within-Uralic study have already been planned: adding Moksha besides the currently represented Erzya; Hill Mari besides Meadow Mari; Obdorsk (Northern) Khanty and Pelym (Western) Mansi varieties besides EKh and NMs; Kamassian and Mator within Samoyedic. These should cover many bases. E.g. the well-known Mansi cognate(s) of Hung. tűz, EKh tö̆ɣət ‘fire’ are not recorded from NMs, but do appear in WMs (Pelym toåwt, Upper Lozva töät, North Vagilsk tüöwt, etc.).

A different point entirely is that attempts to study specifically the interrelationships of the nine basic Uralic branches would, I think, function the best if using their protolanguages as the basic data points. There are too a few gotcha cases where no coverage of modern-day languages is sufficient: occasional native Uralic terms might be reconstructible for Proto-Mansi only from early 19th century wordlists, for Proto-Samoyedic only from Castrén’s mid-19th century records, for Proto-Mordvinic only from Witsen’s 18th century records, for Proto-Hungarian only from early medieval records, etc. Comparative-historical Uralistics is maybe not particularly philology-centered, but has never been able to afford overlooking philology entirely. ^[2]

The selection of semantic concepts to cover is generally reasonable, pulled from major basic vocabulary lists like various Swadesh lists and the Leipzig-Jakarta list. Some of the items on these do break up completely to noise within Uralic, but that’s a good point to have on record as well. I do not think the classic Swadesh list was assembled very rigorously, and at some point it would be good to know not just something about the relative average stability of concepts on it, but also their variance in stability across different language families. An example I have often mentioned in discussions related to this is how in Uralic, ‘fish’ and ‘moon’ are highly stable, while ‘cow’ is unreconstructible and ‘sun’ is highly unstable; while in Indo-European, ‘cow’ and ‘sun’ are highly stable, vs. ‘fish’ unstable and ‘moon’ just about unreconstructible. (This phenomenon e.g. already constitutes a fairly strong critique of glottochronology or any models resembling it, which would rather predict average variance to be a monotonic function of average stability.) — Many of the more unstable and entirely unreconstructible concepts seem to be from the LJ list. This is basically what we should expect I think, since these have been selected only by their stability vs. loaning, not vs. all the other lexical innovation processes out there like derivation, semantic shifts, onomatopoeia, a priori coinages (and also not even vs. the likelihood of synchronic synonymy).

There are regardless still many semantic concepts or etymological groups that I think would have a bunch to say about the diversification of Uralic, but which haven’t made the mark. These are I suspect typically more Uralic-specific, and they could not be easily located by general cross-linguistic considerations. Simple examples include e.g. terms for local fauna (*śixələ ‘hedgehog’, *onča ‘nelma, Stenodus’), flora (*ďëmə ‘bird cherry’, *pečä ‘pine’) and technology (*joŋsə ‘bow’, *ńëlə ‘arrow’). More involved examples tend towards etyma that Helimski (2001) has called core vocabulary as distinct from basic vocabulary: often verb roots, relational terms, or incipiently grammaticalizing body part terms, that may not have strong semantic stability but do have decent etymological stability. In Uralic thus e.g. *kixə- ‘to rut, lek, be excited, lustful, want’, *kulə- ‘to go out, run out, wear, end’; *pučkə ‘hollow, tube, inside, marrow’; *pončə ‘tail, hem, back part’ (glosses not meant as PU but indicating the range of variation in reflexes). Most regular lexicostatistic methods run poorly however if matched against etyma that don’t have stable or well-defined proto-meanings, e.g. we can’t really ask what is “the” replacement of such an item in a language that has lost it. Down the line, some new techniques entirely will be required for making use of this kind of data instead.

Phonetics & Phonology

I do not know what use, if any, is planned for this part of the data, but especially inconsistent IPA transcription seems to remain a major problem, as many other times in Uralic studies.

v is transcribed as a fricative /v/ rather than the approximant /ʋ/ for Estonian, Votic and Ingrian (though correct in Finnish).
A phenomenon I’ve seen in many online sources over the last ~10 years, Finnish h is given superfluous and partly incorrect transcription as /ç/, /x/ in many clusters and /ɦ/ in many medial positions. E.g. karhea ‘rough’ as “/karçe̞a/”, though fricative allophones only appear with any systematicity in the syllable coda. Even then these have enough variability that I would think leaving this as phonological /h/ would be surely the safest choice.
Some Finnish falling diphthongs are transcribed with glides as the 2nd component (aurinko ‘sun’ /ɑwriŋko̞/, koira ‘dog’ /ko̞jrɑ/), others with close vowels (jauhaa ‘crush’ /jɑuɦɑː/, oikea ‘right’ /o̞ike̞ɑ/).
Estonian length marking is a mess. -p- -t- -k- appear seemingly at random as both /p t k/ (thus also -b- -d- -g-) or /pː tː kː/ (thus also -pp- -tt- -kk-); sometimes even in the same word, e.g. lükata ‘to push’ as “/lykɑtːɑ/” (as if ˣlügatta ?)! I don’t have strong opinions on if it’s more proper to use /pˑ tˑ kˑ/ for transcribing grade 2, or maybe /pːː tːː kːː/ for grade 3, but please at least make the distinction. — I’m not even going to start on long/short clusters or overlong vowels, which are maybe less phonologically relevant anyway.
Estonian palatalization has also gone absent, e.g. lill ‘flower’ as /lilː/ and not /lilʲː/. Also, four slip-ups of õ turning up as IPA /ɣ/ rather than /ɤ/: “/hɣːrutɑ/” ‘rub’, “/kɣvɑ/” ‘loud’ (but correct in /kɤvɑ/ ‘hard’!), “/lɣkːs/” ‘trap’, “/mɣmisetɑ/” ‘mumble’.
Votic transcription includes some allophones like [d̥ g̊ vʲ ɑˑ], but leaves unmarked maybe the most prominent allophone in the language, л = [ɫ], “dark L”. I did not catch any ˣ/ɣ/ pro /ɤ/ mistakes.
I’m happy to see that most languages’ palato-alveolar ľ, ń, ś etc. have been transcribed as /ʎ/, /ɲ/, /ɕ/ etc. rather than incorrect /lʲ/, /nʲ/, /sʲ/ seen in many naive attempts to IPA-fy Finno-Ugric transcription; … but this has been overdone to include also Erzya, for which palatalized alveolars are correct. Not a major issue ultimately, but still an inconsistency.
Meadow Mari ə̑ has been transcribed as /ə̱/, which is a bit superfluous; /ə/ would be sufficient. (It is rather Hill Mari ə (= reduced e) that would call for a diacritic in IPA, probably /ĕ/ or /ə̟/.) — The Ob-Ugric data has had the ə / ə̑ distinction phonologized away entirely, though if desired, it could be maintained phonetically at least in Eastern Khanty.
Komi and Udmurt: FUT i̮ / literary ‹ы› is given as /ɯ/, rather than the more correct /ɨ/, and e̮ / ‹ӧ› has been rendered as /ɤ/ though probably /ə/ or /ɘ/ would be likewise more consistent (as in the Oxford handbook of Uralic from this spring). Even a / ‹а› might be for the Permic languages better rendered as IPA /a/ (unlike most of Uralic, where a contrasts with /æ/ and is thus indeed better rendered as IPA /ɑ/).
Hungarian uses tie bars for some of its affricates, /t͡s t͡ʃ/ etc. Not incorrect in any way, but this is used nowhere else in the data and not even entirely consistent within Hungarian. I also notice a straggling flap /ɾ/ appearing in erdő ‘woods’, féreg ‘worm’ that seems like an error.
Uvulars in Khanty aren’t dealt with very consistently at all. [q ʁ] as back-vocalic allophones of /k ɣ/ go unremarked, but /χ/ is indeed transcribed as uvular (ditto in Mansi). Worse, some data with /χ/ has been incorrectly entered for Vakh-Vasyugan Khanty, e.g. jŏχət- ‘come’, koχ ‘long’ (the actual VVj forms are jŏɣət-, koɣ). Only western Khanty ever has χ!
I suspected data mix-up initially, but this clearly must be a processing problem instead, given even e.g. köχ ‘stone’: no such form appears anywhere in Khanty (it’s VVj köɣ, Jugan kä̆w, other Surgut kä̆ɣʷ, all western kew). Are these words derived from some orthographic source that spells VVj /ɣ/ as Cyrillic ‹х›, by any chance? (But still correct forms in many other cases like oɣ ‘head’, soɣ ‘worm’, wajəɣ ‘bird’.)
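Slip-ups like the Estonian õ → /ɣ/ ones above could probably be caught mechanically. A minimal Python sketch: flag an entry when its IPA contains ɣ while the spelling has more õ's than the IPA has ɤ's. The function name and the (spelling, IPA) pairing are my own assumptions and not anything from UraLex's actual tooling, and the spelling kõva for both ‘hard’ and ‘loud’ is supplied by me, not taken from the dataset.

```python
import unicodedata

O_TILDE = "\u00f5"    # õ, Estonian orthographic vowel
GAMMA = "\u0263"      # ɣ, voiced velar fricative (the suspected typo)
RAMS_HORN = "\u0264"  # ɤ, close-mid back unrounded vowel (the intended symbol)


def suspicious_gamma(spelling, ipa):
    """Return True if ɣ in the IPA plausibly stands in for an õ of the spelling.

    Hypothetical helper: it only compares character counts, so it cannot
    say *which* ɣ is wrong, only that the entry deserves a look.
    """
    spelling = unicodedata.normalize("NFC", spelling.lower())
    unmatched = spelling.count(O_TILDE) - ipa.count(RAMS_HORN)
    return GAMMA in ipa and unmatched > 0


# The two transcriptions of kõva quoted above: one wrong, one right.
entries = [
    ("k\u00f5va", "k\u0263v\u0251"),  # kõva, transcribed with ɣ
    ("k\u00f5va", "k\u0264v\u0251"),  # kõva, transcribed with ɤ
]
flagged = [pair for pair in entries if suspicious_gamma(*pair)]
```

Only the first pair ends up in flagged. The same count comparison would not work for Khanty, where /ɣ/ is a real phoneme, so a check like this has to be set up per language.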

Looking over these issues, I could formulate a Rule #1 for IPA-fying FUT: the transcription systems do not correspond 1:1 and several details must be, alas, checked on a language-by-language basis. Especially vital is understanding your source data: whether whatever you are IPA-fying is pre-WW2 “hyperphonetic” FUT; mid-century “major-allophonic” FUT; or post-70s “phonological” FUT. IPA comes with its bracket notation [d͇], /d/, //ð// etc. to warn what level of transcription you might be dealing with… FUT does not, perhaps its biggest flaw. A related Rule #2 might be that it’s similarly important to understand what you are trying to do with IPA: phonological, broad phonetic or narrow phonetic transcription? Most of the time, there is no One Correct IPA Representation either.

In the base FUT data I do not see any further major issues. It would be probably good to make sure to distinguish ´ (the suprasegmental palatalization sign) and ˈ (the overlength / strong-grade cluster sign) in the Samic data though. Currently both seem to be much of the time encoded as a simple apostrophe; e.g. Inari Sami kyevˈđi ‘snake’, Skolt ku´vdd ‘id.’ are given as “kyev’di”, “ku’vdd”. Occasionally even opening or closing single quotes appear (thanks, Microsoft). Apostrophes do actually even triple duty in marking palatalized ľ in other languages, but this seems unlikely to do any real harm.
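The apostrophe problem cannot be fixed automatically, since a plain apostrophe does not tell which sign was meant, but the affected forms can at least be listed for manual checking. A small Python sketch under that assumption; the function name is hypothetical, and the only thing it relies on is the standard unicodedata module:

```python
import unicodedata

# Characters that may stand in for ´ (U+00B4) or ˈ (U+02C8) in Samic data.
AMBIGUOUS = {"\u0027", "\u2018", "\u2019"}  # ' and the two curly single quotes


def ambiguous_marks(form):
    """List (index, Unicode name) for every ambiguous quote-like character."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(form)
        if ch in AMBIGUOUS
    ]


# Forms quoted above: the flattened one, then the two properly marked ones.
report = {
    form: ambiguous_marks(form)
    for form in ["kyev'di", "kyev\u02c8\u0111i", "ku\u00b4vdd"]
}
```

Here only "kyev'di" gets a non-empty list; the forms written with ˈ and ´ pass untouched. Deciding which sign each flagged apostrophe should become still takes someone who knows the language.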

Protoforms

The dataset is of course primarily about attested lexical data, so I maybe should not spend too much time on examining the proto-language reconstructions included (only Proto-Uralic, no intermediate reconstructions). Still, this is protouralic dot wordpress I am blogging at, so some observations on that topic too.

The transcription scheme seems to closely follow Janhunen 1981, Sammallahti 1988. The *i/i̮ reconstruction for noninitial syllables is used almost thruout; an *-e- has slipped in only in *koje-mV ‘husband’. *i̮ rather than *e̮ is used in initial syllables too, however still an **a in at least a few lexemes like *maksa ‘liver’, *maɣi̮ ‘earth’ (= J *mi̮kså; S *mɨkså, *mɨxi); also *ńś rather than *ńć, though a traditional *ć is still retained in some cases. Different transcription schemes are more inconsistently mixed for the “voiced spirants”, including ‹δ› in *śaδa- ‘rain’, but ‹ð› in *wuði̮- ‘new’; ‹x› in *juxi̮- ‘to drink’, but ‹ɣ› in *miɣi- ‘to give’.
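Mixed notation of this kind is easy to detect once the competing symbols are listed. A rough Python sketch, assuming (as I do above) that ‹δ›/‹ð› and ‹x›/‹ɣ› are each notational variants of one proto-phoneme; the variant table and the function name are mine, not part of the dataset:

```python
from collections import Counter

# Assumed pairs of competing symbols for the same reconstructed sound.
VARIANTS = {
    "dental spirant": ("\u03b4", "\u00f0"),  # δ vs ð
    "velar spirant": ("x", "\u0263"),        # x vs ɣ
}


def mixed_notation(protoforms):
    """Report each sound for which more than one symbol is in use."""
    counts = Counter(ch for form in protoforms for ch in form)
    report = {}
    for label, symbols in VARIANTS.items():
        used = {s: counts[s] for s in symbols if counts[s] > 0}
        if len(used) > 1:
            report[label] = used
    return report


# The four reconstructions cited above as examples of the inconsistency.
forms = ["*\u015ba\u03b4a-", "*wu\u00f0i\u032e-", "*juxi\u032e-", "*mi\u0263i-"]
report = mixed_notation(forms)
```

For these four forms both sounds are reported as mixed, one occurrence of each symbol. An empty report would mean the list sticks to a single scheme, whichever one that is.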

A possible consequence of the dataset’s original compilation for a lexicostatistic review of the traditional Uralic classification is also that some meanings are marked as “[Not reconstructible]”, although they would have well-established though western-leaning proto-forms, e.g. *külmä ‘cold’ (maybe debatable; an IMO poor loan etymology from Balt(o-Slav)ic remains marked for the reflexes), *mälə ‘mind’ (clearly PU; this is reflected in derived verbs in Ob-Ugric), *läwlə ‘heavy’ (EKh ‘cold’ probably doesn’t belong). Some items reconstructed in recent literature are missing too, e.g. Aikio’s revamped *këččə ‘bitter’, *widä- ‘to kill’. More worrying for me is how also many long-known proto-forms are left absent, such as *küsə ‘thick’, *näkə- ‘to see’ (admittedly most reflexes derivatives w/o this meaning), *lükkä- and *puskə- both ‘to push’, *śepä ‘neck’, *sańća- ‘to stand’, *wëlkə ‘white’. I don’t think this can be just due to later semantic divergence in some reflexes, when e.g. *jelä ‘day’ has been admitted as a PU form only from Samoyedic direct evidence (parallels also at minimum in Samic); and *śilä ‘fat’ from no direct evidence at all? Yet also some poor comparisons from UEW seem to remain around, e.g. “*čočV-” ‘to wipe’; actually its only reflex meaning ‘wipe’ is Finnish huosi-, which I don’t think can belong here. ^[3] — These types of issues may even combine for more involved cases. E.g. the PU word for ‘full’ is given as *türə, a narrowly distributed Finnic–Permic etymon, and not the better-distributed *täwdə. This is again probably per UEW, which maintains Selkup tīr as reflecting the former and not, as recognized since Aikio 2002, the latter. ^[4] Or, the word for ‘year’ is given as *ärV; but this reconstruction was in effect already refuted by Aikio 2012, who points out that the Samoyedic forms (meaning ‘fall’) go back to PSmy back-vocalic *ër-, which continues rather the already better-distributed PU form *ëdə. ^[5]

A methodological choice also seems to have been that no synonyms are admitted for PU, although there probably are a few concepts in the data for which they existed; e.g. besides *śilä for ‘fat’, we can reconstruct also *wajə, *koja (both already alluded to in the database; the former though specializing to ‘butter’ in most Uralic languages familiar with agriculture).

(All my Uralonet links above show what I think of as their most reliable reconstructions, but defending those would be at times quite a debate that I don’t intend to get into in detail here — I’ll be happy as long as the reconstruction system chosen is at least internally consistent enough.)

Since following newer literature adequately appears to have given some difficulty for the team, I would like to note here (I think for the first time on this blog) that I’ve already a few years ago started a little repository of new results in Uralic etymology, currently keeping track of

newly proposed PU reconstructions;
newly found reflexes of known reconstructions;
newly found loan etymologies for what have previously been thought of as native Uralic etyma.

The list(s) can be found at the Sanat wiki, as a part of / appendix to our etymological database of Proto-Finnic. ^[6]

Currently pending updates include, besides better coverage of several earlier but post-UEW sources, especially several new native and loan etymologies for Mari and Permic from Metsäranta’s PhD thesis from 2020. I have also been thinking of starting an “antietymological” sister repository, tracking PU reconstructions that have been clearly disproven by better etymologies being published for all or all-but-one of their reflexes, of which there are quite a few by now too.

Etymological marking

Maybe the core content of the dataset. Standard literature has been followed quite faithfully here and I see no major flaws (even where etymological relationships have not been seen fit to be promoted to Proto-Uralic status). Mostly I can just point out some recent and overlooked results. Besides cases already mentioned:

The Hungarian word for ‘claw, nail’ has been unfortunately given as the less basic karom ‘claw, talon’ rather than köröm ‘claw, nail’; which was, even, recently argued by Aikio to be indeed a reflex of PU *künčə.
The Samoyedic words for ‘to scratch’ derive from PSmy *kətå ‘nail’, much as also e.g. Khanty *kö̆nč- does double duty as ‘nail; scratch’. The base noun is, probably correctly, not admitted as a cognate of rest-Uralic *künčə. The verb entry however inconsistently does encode them as cognates.
The most notable loan etymology missing entirely is probably the derivation of Erzya veśe and Hungarian össze- ‘all’ from earlier *wiśwV- ← Proto-Indo-Iranian *wićwa- > Sanskrit víśva, etc. (an etymology due to Katz 2003 that was unfortunately overlooked by Holopainen 2019). Both are regular: for Hungarian *wi- > *wü- > *ü- also in IIr. loans (besides native ones like *widä > öl- ‘to kill’), cf. the already long-known özvegy ‘widow’ < *wiðVwädźV ← Scythian / pre-Alanian *widawa-čī.
There is probably room to adjust many of the individual loanword etymologies, e.g. Kildin Sami sūll´ ‘salt’ is not borrowed from Russian соль but, as maybe the palatalization best reveals, from Finnic *soola (thus also UEW, SSA). This would regularly continue a Proto-Samic *sōlē > Peninsular Eastern Sami *suəllʲe, also present in Skolt suõ´ll. Would be way too much work for me to start digging into these on my own though with any consistency.
There are, on the other hand, still several Proto-Indo-European loanword etymologies advanced that do not seem very reliable (were they ever widely accepted?), e.g. *pelə- ‘to fear’ ~ PIE *pelh₁- ‘to shake’ (which only gives ‘fearful’ in derivatives in Gothic and Slavic); *śalə ‘gut’ ~ PIE ? *ḱolH- ‘turn’ (which only gives ‘gut’ in Greek). These are though only marked as “probable”, not “clear” — is this basically a euphemism for “not that likely”?

I suppose this is by now enough comments for one day. I know that assembling and curating datasets this big is quite the task, and I could probably also spend a week more reading this in further detail. Hopefully I’ve already pointed out some productive directions for future improvement though. (And if you were thinking of otherwise releasing 3.0 just tomorrow: sure, don’t mind me, there will be time in the future too to improve things.)

Edit 2022-06-27: See also some brief responses from Outi Vesakoski (and further from me) at Twitter!

[1] Very relatively so: at triple rather than double or single digits of speakers.
[2] So far the biggest gap in philological coverage are probably the old Swedish “Biblical Sami” records, substantial already in the 18th century, but to my knowledge they have never been looked over in detail etymologically.
[3] Has been further etymologized as being maybe from Proto-Finnic *hosja ~ *hoosja ‘horsetail, Equisetum’ (traditionally used to make scrubs), which I don’t think has itself any etymology yet. By its phonological structure it obviously cannot be native Uralic as is. Inverting the semantic derivation though, an irregular (?) contraction from an agent noun *hosija < *hose/i-ja ‘sweeper, scrubber’ might be possible (cf. also Fi. hos-u- ‘to work carelessly, in a rush’). Or if this is, as UEW’s etymology would imply, really assibilated *hocija… a root that looks somewhat comparable to me is Samic–Mordvinic *šodə- ‘to let out, run out’ (maybe first derived to *šodə-j- > *hoci- ‘to throw/sweep things out’). A PU *čočV-, on the other hand, should not give Finnic *h- but *s-, via the affricate dissimilation seen also in e.g. *čečä ‘uncle’ > *ćečä > PF *setä.
[4] Worth noting, besides Aikio’s argument that cognates elsewhere in Samoyedic require a protoform with *ä-ə, is also that *türə would be expected to give Sk. **tir with a short vowel. tīr shows Helimski’s Law = Proto-Selkup vowel lengthening in Proto-Samoyedic *ə-stems, < PU *CVCCə stems and some *CV(C)CA stems (a relatively recent discovery from 2007).
[5] This does still leave Permic *ar ~ (Core) Mansi *ārmə (closed syllable per Pelym årəm with a short vowel), but the latter should clearly be analyzed as a loan from the former; more specifically, from derived *arm as reflected in Udmurt. Permic *a has no well-established native source at all and even some more dubious cases only really point to some possible origin from *ä.
[6] “Us” being myself, Santeri Junttila, Sampsa Holopainen & Juha Kuokkala, plus original data assembly by Kallio.

Tagged with: open data, phonology, resources, transcription
Posted in Commentary, Links

A Finnic Family Tree

Posted on Thu 2022-05-26 by sansdomino — 6 Comments

I was recently asked on Twitter about the history and subclassification of Finnic. ^[1] Whipping up a full-length discussion paper or even a polished nice-looking family tree would be more work than I can produce on short notice or on free time (and probably something that might warrant wider publication still), but since I actually do have several opinions about this, that are probably either scattered in several places or that I haven’t mentioned anywhere yet, here is a summary of my current thinking.

I’ve given datings of proto-languages and extinction dates only where I can pretend to have any sense of accuracy to them. Error ranges are at least ±100 years for the former, at least ±10 for most of the latter.

┐ Proto-Finnic (ca. 500 BCE, ? middle Daugava)
├─┐ Proto-South Estonian (ca. 500 CE, ? upper Gauja)
│ ├── † Leivu (northern Latvia; extinct 1988)
│ └─┐ Mainline South Estonian
│   ╞══ Mulgi South Estonian
│   ╞══ Tarto South Estonian
│   │   [basis of Old Literary South Estonian]
│   └── East South Estonian (Võro–Seto)
├───┐ Proto-Livonian (ca. 1000 CE, lower Daugava)
│   ├── † Salaca Livonian (northwest Latvia; extinct ca. 1870)
│   ├── † Riga Livonian (unattested, extinct in the 13th C?)
│   └─┐ Courland Livonian
│     ├── † Eastern Livonian
│     ├── † Central Livonian
│     └── (†) West Livonian
└─┐ Proto-Core Finnic (location?)
  ├─┐ Proto-Central Finnic (location?)
  │ ├─┐ Estonian proper
  │ │ ╞══ Insular Estonian
  │ │ ╞══ East Estonian dialects
  │ │ ╞══ West Estonian dialects
  │ │ ╞══ Central Estonian dialects
  │ │ ╘══ North Estonian proper
  │ │     [basis of Modern Standard Estonian]
  │ └─┐ Proto-Votic (inland Ingria)
  │   ├── † Eastern Votic (extinct 1976)
  │   ╞══ † Central Votic (extinct > 1950)
  │   ╞══ Lower Luga Votic
  │   └── † Krevinian (Southern Latvia; extinct ca. 1850)
  └─┐ Proto-North Finnic (ca. 0 BCE, ? coastal Estonia)
    ├─┐ Proto-Northwest Finnic (? coastal Estonia?)
    │ ├── Northeast Coastal Estonian
    │ ╞══ Taivassalo / Very Southwestern Finnish
    │ ├─┐ Southwesternish Finnish
    │ │ │ [main basis of Old Literary Finnish]
    │ │ ╞══ North SW dialects
    │ │ ╞══ South SW dialects
    │ │ ╞══ Western Uusimaa dialects
    │ │ ╘══ probably other dialects in the SW transitional zone
    │ └─┐ Mainline Finnish (ca. 200 CE, Kumo River)
    │   │ [main basis of Modern Standard Finnish]
    │   ╞══ Lower Satakunta dialects
    │   ╞═╤ West Upper Satakunta dialects
    │   │ └── Austrobothnian Finnish
    │   ╞══ Ostrobothnian dialect chain
    │   ├── Kemi Finnish
    │   ├─┐ Torne Valley Finnish
    │   │ ╞══ Lower Torne Valley dialects
    │   │ ╘══ Upper Torne Valley dialects
    │   │     [incl. Meänkieli & Kven]
    │   ├─┐ Kalix Valley Finnish
    │   │ ├── † Lower Kalix Valley Finnish (unattested)
    │   │ └── Jällivaara Finnish
    │   └─┐ Core Tavastian (ca. 300 CE)
    │     ╞══ East Upper Satakunta dialects
    │     ╞══ Heartland Tavastian dialects
    │     ╞═╤ South Tavastian dialects
    │     │ └── colloquial Helsinki Finnish
    │     └─┐ East Tavastian
    │       ╞══ Southeast Tavastian dialects
    │       └─┐ Northeast Tavastian
    │         ╞══ Päijät-Häme dialects
    │         └─┐ Karelid Finnic
    │           ╞══ Savo dialects
    │           ╞══ Karelian Isthmus / Southeast Finnish dialects
    │           ╞══ Ingrian
    │           └─┐ Old Karelian (ca. 700 CE, NW Ladoga)
    │             ╞══ Olonets Karelian
    │             ├── † Sortavala Karelian (unattested)
    │             │   [substratal to Sortavala Finnish]
    │             └─┐ Karelian proper
    │               ╞══ Viena / Northern dialects
    │               ╘═╦ Southern dialects
    │                 ╚══ Central Russian dialects
    │                     (Tver, Tikhvin, Valdai)
    └─┐ Ludian–Veps (ca. 600 CE, SE of Ladoga)
      ├── † Olonets Ludian (unattested)
      │     [substratal to Olonets Karelian]
      ╞══ North Ludian dialects
      ╞══ Central Ludian dialects
      ╞══ South Ludian dialects
      ╞═╗ North-Central Veps
      │ ╠══ Northern Veps dialects
      │ ╚══ Central Veps dialects
      ╞══ Southern Veps dialects
      └── † North Chudian (unattested)
          [substratal to some Northern Russian,
           in contact with Proto-Komi]
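
For anyone who wants to check nesting claims against the tree mechanically, here is a minimal sketch in Python. It assumes a hypothetical encoding of my own (nested dicts, each node mapping to its daughters) and only copies a fragment of the tree above; the names `TREE`, `lineage` and `common_ancestor` are made up for illustration, and datings and extinction marks are left out.

```python
from itertools import takewhile

# Hypothetical encoding: every node maps to a dict of its daughters.
# Only a fragment of the tree is copied here; most leaves are omitted.
TREE = {
    "Proto-Finnic": {
        "Proto-South Estonian": {},
        "Proto-Livonian": {},
        "Proto-Core Finnic": {
            "Proto-Central Finnic": {"Estonian proper": {}, "Proto-Votic": {}},
            "Proto-North Finnic": {
                "Proto-Northwest Finnic": {
                    "Northeast Coastal Estonian": {},
                    "Southwesternish Finnish": {},
                    "Mainline Finnish": {
                        "Core Tavastian": {
                            "East Tavastian": {
                                "Northeast Tavastian": {
                                    "Karelid Finnic": {
                                        "Savo dialects": {},
                                        "Ingrian": {},
                                        "Old Karelian": {"Karelian proper": {}},
                                    }
                                }
                            }
                        }
                    },
                },
                "Ludian–Veps": {"Southern Veps dialects": {}},
            },
        },
    }
}


def lineage(tree, target, path=()):
    """Return the chain of nodes from the root down to `target`, or None."""
    for name, daughters in tree.items():
        here = path + (name,)
        if name == target:
            return here
        found = lineage(daughters, target, here)
        if found:
            return found
    return None


def common_ancestor(tree, a, b):
    """Return the lowest node that dominates both `a` and `b`."""
    shared = takewhile(
        lambda pair: pair[0] == pair[1],
        zip(lineage(tree, a), lineage(tree, b)),
    )
    return [name for name, _ in shared][-1]


print(lineage(TREE, "Karelid Finnic"))
print(common_ancestor(TREE, "Karelian proper", "Southern Veps dialects"))
```

In this encoding the lineage of Karelid Finnic passes through Mainline Finnish, and the lowest node shared by Karelian proper and Southern Veps is Proto-North Finnic, i.e. the tree contains no dedicated Karelian–Veps node.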

The South Estonian sub-tree here is the part that has been published the most recently, basically from Kallio (2021, 2018); though I’d like to see more detail on the suggested Tarto–VS group still.

Some other divergences of note from earlier Finnic family trees include:

No Coastal Finnic (Livonian + Core), contra Kallio. I will be arguing for this in detail in a future paper. Among the early branches, Core Finnic and Central Finnic seem to hold up better so far, though I’m open to the possibility that some North Estonian dialects may eventually prove to have some fairly deep archaisms to them too. North Finnic I have several suspicions about, but Ludian–Veps still has nowhere better to go in the tree than with my “Northwest Finnic”.
No East Central Finnic (East Estonian + Votic), contra Viitso. These are united only by some cases of õ, which I however consider to be archaisms already from common Central Finnic. ^[2] This also allows for (re)introducing a non-paraphyletic Estonian sensu stricto.
Paraphyletic Western Votic, directly following Kuznetsova, Muslimov & Markus (2015).
Paraphyletic Western Finnish and Tavastian Finnish, generalizing further from Kallio (2013). Purely by linguistic evidence, the traditional “Western Finnish” grouping would be about as well-supported as my “Mainline Finnish”, but settlement history to me seems to strongly favor the latter: the Karelid group can’t just drop out of nowhere, it needs to be derived from somewhere at the time in the early 1st millennium when there simply wasn’t any Finnic presence yet in eastern Finland (but parts of western Finland had already been Finnic-speaking for some centuries, with presumable incipient diversification). Archeology so far does not favor an independent expansion from the south; the river Kymi would look like a good route candidate for that at first, but it might have been simply too non-navigable with its several major rapids. Hence, Karelid Finnic must be nested not just within “Finnish”, as has been known already for long, but indeed within “Western Finnish”.
Polyphyletic Ostrobothnian Finnish. Some of these lineages may eventually prove to be offshoots of specific Western dialect groups further south, but current research really hasn’t even started that line of investigation (though see next item).
Austrobothnian (= my term for South Ostrobothnian) as a West Upper Satakunta offshoot specifically. This is a well-known fact of settlement history, but has some implications for analyzing what is areal and what is old inheritance across the Western Finnish dialect continuum that I don’t think have been fully appreciated in the past.
No Karelian–Veps group. This seems like a no-brainer to me: there are practically zero common innovations (some lexical evidence has been claimed but without ruling out common archaisms or loanwords) vs. quite abundant Finnish–Karelian = Northwest Finnic innovations, even beyond the Karelid group. Some more narrowly distributed, e.g. Ludian–Karelian, innovations exist, but their absence from Veps or eastern Finnish I think immediately shows them to be areal rather than genealogical.
Paraphyletic Ludian and perhaps Veps. The latter above all due to the fact that most innovations in Veps could be attributed to Russian influence or at least are downstream of changes due to this. Not tying down the assumption that Veps must be monophyletic seems like the safer bet so far.

I take no stance here on the still gradually ongoing debate on if the Kukkuzi dialect is Votic-with-Ingrian-superstratum or Ingrian-with-Votic-substratum or a mixed variety entirely. ^[3]

Last, don’t take the rather fine detail of Finnish dialects as meaning that they’re actually more different from each other than what we find within other groups — they’re just a) more numerous (even 100 years ago Finnish had 2× the speakers of Estonian, 60× the speakers of Karelian, 200× the speakers of Veps…) and b) better known to me. If I had been looking into e.g. Estonian dialectology in as much detail, I would probably have some opinions also on how to re-tool things around there.

[1] Yes, I am on Twitter as of the start of this year. Not explicitly announced on the blog before, though you may have noticed if you’ve checked my About page recently.
[2] One intriguing example is PF *kota : *koda- ‘house’, giving in my view early PCF *këta : *këða- > later PCF *këta : *kë.a-, whence Vt. kõta : kõa-; EEst. kõda : kõja-; NEst. koda : koja-. What is telling here is that Estonian -j- as a hiatus filler only seems to be regular after illabial vowels, thus showing that NEst. koda does not retain PF *o; it has instead undergone the development *kë-a > ko-a that also appears in cases like *këldajnën > *këllainë(n) > kollane ‘yellow’ (which has “primary” *ë < *e, not “secondary” *ë < *o; cf. Finnish keltainen).
[3] For general historical Fennistics purposes it’s in any case sufficient to know that any attestations in Kukkuzi but not in “normal” Votic can be always from Ingrian, be it by loaning or descent, i.e. not requiring reconstructing anything all the way to Core Finnic.

Tagged with: dialectology, finnic, historical linguistics, language classification, linguistics
Posted in Commentary, Reconstruction

*-ətA adjectives in Mordvinic

Posted on Sun 2022-05-01 by sansdomino — 20 Comments

Across Finnic and Samic, one of the more characteristic adjective endings is *-əta ~ *-ətä; yielding e.g. Finnish -ea ~ -eä, Estonian -e, Northern Sami -at. The Permic cognate *-i̮t is also at least relatively common. Because Of Reasons I have gone for a hunt for reflexes in Mordvinic, where no productive reflex survives. More specifically I’ve gone over Paasonen’s Mordwinisches Wörterbuch (a few more could be probably found in other sources). The scoop is as follows.

First, some cases well-known in the comparative literature. (Noticeably often these have exact equivalents in Finnic, or indeed specifically Finnish.)

*kalgədə ‘hard’ (> Er. kalgodo, Mk. kalgəda) < WU *ka/ëlkəta > Fi. kalkea ^[1]
(MWB unwarrantedly lists this as a derivative of *kalgə ‘sheaf, etc.’, which is rather < WU *këlkə ‘haulm’)
*śejəďə ‘thick’ (> Er. śejeďe, Mk. śiďä) < WU *śikətä > Fi. sikeä ‘sound (of sleep)’
(the Moksha form miscited in Uralonet as śäjiďä — a real form, but rather from some Erzya dialect that has *e > ä)
*taŋgədə ‘firm, stiff’ (> Mk. taŋgəda) < WU *taŋkəta > Fi. tankea ‘id.’
*valdə ‘light’ (> Er. valdo, Mk. valda) < WU *wëləta > Fi. vaalea ‘id.’
(in UEW / Uralonet, Mordvinic incorrectly under the longer variant *wëlkəta)
*vijəďə ‘straight’ (> Er. vijeďe, Mk. viďä) < WU *wojkəta > Fi. oikea, NS vuoigat etc. ‘right’

We see here reflexes as *-ədə / *-əďə after a consonant cluster, syncopated *-də after a PU sonorant (but apparently not after single *k). Moksha śiďä, viďä are probably due to secondary post-Proto-Mordvinic syncope (unclear to me if with fusion *jď > ď or, as might be suggested with *ej > i in the former, with vocalization of the glide). Not many other cases follow this exactly, though. I find only one other clear example + one possible example in *-ədə:

? *ľifčədə ‘loose’ > Mk. ľifčəda; from a stem common with e.g. *ľifčańa ‘pliable’, Mk. ľifčəm- ‘to relax’. Attested as both an ə-stem ľifčədə- and an a-stem ľifčəda- though, hard to tell which might be primary.
*vačədə ‘hungry’ > Er. vačodo, Mk. vačəda; from *vačə ‘hunger; hungry’
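
The conditioning suggested above by the five well-known cases can be written out as a small Python sketch. This is only an illustration under my own assumptions: the sonorant and vowel inventories are simplified guesses, the function names are hypothetical, and the rule only predicts whether the suffix vowel *ə is kept or syncopated, not the quality of the final consonant (*d vs. *ď).

```python
import unicodedata

# Assumed, simplified inventories; not a full PU / West Uralic phoneme system.
SONORANTS = set("lrmnŋjw")
VOWELS = set("aäeëioöuüə")


def medial_consonants(form):
    """Return the consonant(s) standing right before the suffix *-əta / *-ətä."""
    stem = unicodedata.normalize("NFC", form).lstrip("*")
    for suffix in ("əta", "ətä"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    else:
        raise ValueError(f"{form!r} does not end in *-əta or *-ətä")
    consonants = ""
    while stem and stem[-1] not in VOWELS:
        consonants = stem[-1] + consonants
        stem = stem[:-1]
    return consonants


def suffix_shape(form):
    """Syncopated *-də after a single sonorant; *-ədə / *-əďə otherwise."""
    medial = medial_consonants(form)
    if len(medial) == 1 and medial in SONORANTS:
        return "syncopated"
    return "retained"


for wu_form in ["*kalkəta", "*śikətä", "*taŋkəta", "*wëləta", "*wojkəta"]:
    print(wu_form, suffix_shape(wu_form))
```

Of the five inputs only *wëləta comes out as syncopated, matching *valdə; single *k in *śikətä counts as retaining the vowel, as in *śejəďə. A form like *śarwəta would need an extra step (early loss of *-w-) before the rule applies, which the sketch does not attempt.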

For *-də after CVR-, I find two more examples, and also two nouns that might derive from former adjectives:

Er. boďo ‘obese’. Perhaps distorted from *vojdo, and thus a derivative from *vaj ‘butter, fat’ (which in Erzya develops as > *voj > oj)? Still would have expected *-ďə, but there’s no possible soundlawful origin for an Erzya word ending in -ďo anyway…
*naŕďə ‘firm, tough’ > Er. naŕďe, Mk. naŕďä (no base root that I can identify)
(update: or maybe from PU *ńërə ‘cartilage’??)
*śardə ‘elk, reindeer, deer’ > Er. śardo, Mk. śarda. Has clear cognates at least in Mari (*šårδə) and Khanty (*sūrtāj; Northern Mansi surti probably a loan from this), with the PU form usually reconstructed as *śarta. However I suspect this was originally rather an adjective *śarwəta ‘horned’ ← *śarwə ‘horn’. Loss of *-w- in clusters may have been early enough in Mordvinic and Mari to allow common syncope from *śarəda to *śarda. ^[2]
Mk. šoľďä ‘crazy person, crybaby’. Could this be from a common root with Finnic *hullu ‘crazy’ (both pointing to earlier *šul-)? The morphology of the Finnic word remains obscure though, and the palatalization in Moksha would be unexpected; maybe suggests something like *šuljəta. Alternately, maybe ‘crybaby’ is more original, and the Mk. word is instead from a common root with Erzya čoľeďe- ‘to chirp, trill’? Either way this would probably have been an original adjective.

There are however several adjectives ending in *-adə, derived mostly from stems already ending in *-a-. This contrasts with the suffix’s behavior in Finnic and Samic, where it always carries a 2nd-syllable *ə even when attaching to *a-stems (e.g. Fi. lauha ~ lauhea ‘mild (of weather)’, notka ~ notkea ‘pliable’). I suppose the widespread Proto-Mordvinic reduction of 2nd-syllable vocalism led to a reanalysis of *-ədə as just *-də, and then later on the rise of new cases attaching to different stems.

*kaladə ‘broken’ > Er. kalado, Mk. kalada; from a stem *kala- common with e.g. *kaladə- ‘to break (intr.)’, *kalaftə- ‘to break (tr.)’
*komadə ‘turned over’ > Er. komado, Mk. komada; from *koma- ‘to turn over’ (< PU *kuma-)
*naksadə ‘rotten’ > Er. naksado, Mk. naksada; from a stem *naksa- common with e.g. *naksaftə- ‘to let rot’, *naksalgadə- ‘to begin to rot’
*ozadə ‘sitting’ > Er. ozado, Mk. ozada; from *oza- ‘to sit’
*panžadə ‘opened’ > Er. panžado, Mk. panžada; from *panžə- (!) ‘to open’ (< PU *panča-)
*śťadə ‘straight, standing’ > Er. śťado, Mk. śťada; from *śťa- ‘to stand’
*štadə ‘naked’ > Er. štado, Mk. štada; from *šta- ‘to be exposed, cold’
*tajadə ‘stupid, grumpy’ > Er. tajado; from a stem *taja- common with e.g. *tajardə- ‘to be timid, dejected’, *tajaskadə- ‘to become grumpy’

Itkonen (1963, CIFU 1) has proposed to consider a chunk of these to be instead primarily adverbs, formed with the homophonic ablative suffix *-də, but I’m not sure if this is a good analysis: Mordvinic infinitives and participles are generally marked, not formed by appending case endings to a bare verb stem. Also, I would analyze *kala, *naksa, *taja to be primarily noun roots ‘brokenness’, ‘rottenness’, ‘unsatisfiedness’.

Still more interestingly, I can also find adjectives where the final vowel looks to have escaped vowel reduction.

Mk. aluda ‘underlying; under’. Another adverb/adjective, seemingly pleonastic from an unattested *aləŋ > *alu ‘underlying, undery’ (maybe ousted by the homophonic lative adverb: Er. alov, aloŋ, Mk. alu ‘(to) under’).
Er. čando, čonda ‘pricey; price’. Probably not a cognate of Fi. hinta ‘price’ as traditionally compared. MWB hesitantly but I think more likely correctly suggests a connection with Er. Mk. čana ‘price’, which is ← Ru. цена.
*pärda > Er. ala-berda, Mk. ala-pärda ‘misshapen’ (“under-pärda”). Probably still an independent word in PMo., given how Erzya and Moksha differ in if they adopt compound-medial stop voicing (“rendaku”, we might call it).
*säŕďa ‘fragile of old age’ > Er. seŕďa, Mk. śäŕďä; evidently from a common root with *säŕəďə- ‘to hurt, be sick’ (? < PU *särä-, though intriguing resemblance also with Finnic *särke- ‘to hurt’).
*šopəda ‘dark; darkness’ > Er. čopoda, čobda, Mk. šobda, šovda; from *šop ‘in a day, for a day’
*topəda ‘dark (of color), maroon’ > Er. topoda, Mk. tobda; from *topə ‘full’, the meaning apparently thru expressions like *topəda_seń ‘full blue’ = ‘dark blue’.
#ťožda ‘light’ > Er. čožda ~ Mk. ťožďä (no base root that I can identify). Reconstruction difficult due to several irregularities. Is Er. č- maybe by contamination with čova ‘thin, fine’?

In three of these, we find a similar environment to where PU 2nd-syllable *a survives: after a 1st-syllable *o < PU *u. Maybe the same would have originally allowed even retention of a 3rd syllable *a? — By contrast the disharmonic *pärda, *säŕďa pretty much have to be Mordvinic-internal formations. Could an adjective suffix *-da have been generalized / extracted just from cases like *topəda?

No further answers today; just a look at what other etymological candidates we might have in Mordvinic for residues of this ending.

[1] Close to a ghost word, though; kalkee ‘poor, low-quality’ is only known from one Finnish dialect. This can only really link to ‘hard’ thru kalki ‘poor, unlucky’ (“having a hard time”) from one early dictionary. The reported “dialect variant” kalkkea ‘loud, talkative, lively’ seems likely to be unrelated and instead from the verb kalkkaa ‘to ring (bell), make loud noise’ (many similar derivatives from this, also e.g. kalkas ‘lively’, kalkatti ‘blabbermouth’). — Estonian kalk, kalge ‘hard, brittle’ is a more reliable cognate in any case at least.
[2] In Mo. loss of *w probably postdates medial voicing though: by a few examples, *-tw- *-sw- seem to yield PMo. *-t- *-s-, not **-d- **-z- (at least *latə ‘shelter, roof’ ~ Finnic *latva ‘canopy’, *kas- ‘to grow’ ~ Finnic *kasva-).

Tagged with: derivation, historical linguistics, linguistics, mordvinic, morphology
Posted in Etymology

Will Someone Please Reconstruct Proto-Kurdish Already

Posted on Wed 2022-02-09 by sansdomino — 17 Comments

Some things about comparative linguistics you might just take for granted in your own little corner of a particular language family, until you start looking at how they do things in others. In Uralic studies, we’ve known for 200+ years, and put into explicit practice since 150+ years ago, that progress requires documenting unwritten language varieties (just comparing literary Hungarian / Finnish / Estonian / Sami runs out of steam fast ^[1]). For 120+ years, even, that it’s additionally good practice to get detailed interdialectal comparison of such languages started sooner rather than later, not just rely on one well-known doculect.

The big dog of our Eurasian linguistic region, Indo-European studies, has of course an enviable access to a good bunch of attested Old Indians, Old Church Slavonics and Old High Germans, which are a lot more directly comparable with each other. But you’d think the field would have somewhere during the 20th century understood at least that, yes, newer-attested languages will have contributions to make to the overall picture too. Remember e.g. how Nuristani, a little bunch of languages up in the mountains of Afghanistan, turned out to have the key evidence for affricate reflexes of *ḱ *ǵʰ *ǵ in Proto-Indo-Iranian, preserved several millennia longer than in Avestan or Sanskrit?

Where Slavistics, Baltistics, Germanistics, Armenistics, Romanistics have all still gotten their general comparative programmes rolling pretty well, Indo-Iranian keeps being a rock that drags behind pretty badly. Considering extra-scientific causes, this is not a giant surprise: it is clearly in some part thanks to these other sub-fields’ status as National Sciences in the various nation-states of Europe. Still, this would not have to be the case; it’s not like Celtistics has been left in the dust. Comparative linguistics also seems like something with sufficiently little direct political valence that it should be doable enough even e.g. under Iran’s current theocratic administration, let alone by the sizable and somewhat intellectual-leaning Iranian diaspora(s). Indian fans of the Out-of-India theory also demonstrate an existing if unorganized interest in linguistic history.

But indeed. Indo-Iranian is not just any random branch of Indo-European; it is today the largest branch (e.g. Glottolog counts 319 varieties, out of 581 Indo-European varieties altogether), and also the only one to preserve all of its known main branches since antiquity. Reflects more branches today than in history, really: already Nuristani is nowhere to be seen until the 19th century. By contrast, in Europe East Germanic, West Baltic, Continental Celtic, Aeolian Greek etc. are long gone. If anywhere in IE, it is in Indo-Iranian that we should expect to be able to reach quite deep time depths by collecting data from modern varieties and applying comparative reconstruction efforts as usual. Yet this generally seems to have not been done, and approximations derived mainly from Sanskrit and Avestan end up making do as Proto-Indic, Proto-Iranian, and the main fodder for Proto-Indo-Iranian.

By now there is clear evidence that this is insufficient. One informative case from recent years is Martin Kümmel’s observation that “secondary” word-initial h- in several Iranian varieties — at least Khotanese and many western Iranian varieties including Middle to New Persian — actually seems to be a retention of PIE laryngeals (especially *h₂)! This may not have been completely out of the blue. Laryngeal hiatus in Vedic (*aHa > *a.a > ā in some cases still parsing as two syllables) has been known since the early decades of laryngeal theory, and Cheung’s Etymological Dictionary of the Iranian Verb from 2007 takes an extremely cautionary approach of projecting all PIE laryngeals into Proto-Iranian, including an implausible-looking contrast between this *H and secondary Iranian *h < *s (and implausible-looking clusters like *Hhauš- ‘to dry out’). ^[2] Regardless we do see that it is incorrect methodology to treat any divergences from attested Old Iranian as innovations, and that this will fail to connect archaisms in marginal new Indo-Iranian varieties back with the wider programme of Indo-European reconstruction. The same has been very patiently explained by Kümmel too, in a 2016 paper “Is ancient old and modern new? Fallacies of attestation and reconstruction”.

I’ve picked Kurdish here as a semi-random example of a modern Iranian language group that probably deserves closer investigation in this fashion, though its western peripheral location might indeed make it a more likely location for archaisms than smaller languages more fully encircled by Persian. It quite clearly shares at least the propensity of retaining *h₂. Even just looking over the lexicon of standard Kurmanji as listed at Wikipedia readily turns up cases like hêk ‘egg’ << PII *Hāwyam < PIE *h₂ōwyom; hirç ‘bear’ << PII *Hr̥ćšas < PIE *h₂r̥tḱos (~ Middle Persian xāyag, xirs). However also cases like hesp ‘horse’, where some kind of “aspiration throwback” could be considered (*asp- > *esʰp > hesp).

The outlines of Kurdish historical phonology are known, of course. Relatively detailed discussion is readily found in sources like Asarian & Livshits (1994), or at Iranica Online. What seems to be missing from these accounts, however, is any real integration of variation among the Kurdish “dialects” (by now widely thought to comprise at least 2–3 languages). They also spend much effort on lamenting difficulties in telling what might be native Kurdish words and what loanwords from Persian or Zazaki or some other neighboring Iranian variety; same as in many other studies on individual western Iranian varieties. But we — at least e.g. us Uralicists — know quite well that attention paid to dialectology is often able to resolve such issues! Maybe some Kurdish variety would turn out to display a form different from the others that would then need to be considered the native one; or to display a different loanword substitution, pointing in favor of relatively recent loaning, whether from Persian or not. Dialect differences could also help with relative chronology, in telling late areal changes (and across Iranian these are many) apart from what really are early Proto-Kurdish innovations. The retained laryngeals, too, are noted by Kümmel to not be entirely systematic. Conceivably it could be the case that e.g. Kurdish only gets them through Persian, at some older or newer date. Or inversely, maybe Kurdish might be in its native vocabulary more systematic about this than Persian is. No way to tell before looking.

Let me be clear here on the proposal. A reconstruction of e.g. Proto-Kurdish should not rely on just some handful of already available descriptions / dictionaries (though I’m sure their comparison, too, would already add up to several results), nor aim just for identifying phonological variation. The goal of such a project should be primarily lexicogeographic: to have a detailed enough dialectological picture to be able to see the directions of vocabulary spread, to tell local innovations apart from local archaisms. In Uralic studies, when putting together an understanding of, or at least the data for understanding, Proto-Samic, Proto-Mari, Proto-Permic, Proto-Mansi, Proto-Khanty, Proto-Selkup, etc., we have routinely based this on low double digit numbers of varieties, each documented at least to low quadruple digits of vocabulary. And these are all smallish language groups, spoken by some tens or hundreds of thousands of people. The Kurdish languages have tens of millions of speakers altogether. Even if extensive fieldwork in Kurdistan were to look too dangerous or politically complex right now, already connecting with the diaspora communities worldwide should be easily able to provide data on some dozens of varieties.

I do not pretend that this would be a small or quick task (it is clearly beyond what I or anyone could accomplish as just one unattached researcher), but it seems like a very doable task, and likely fruitful, not just for the circles of Kurdish studies or Iranian studies, but for Indo-European studies altogether. And by no means is this a gigantic endeavor either. This could be all done in under a decade by one research group, if there was first of all the will for it to happen (be funded and prioritized).

Closing up this plea, let me also suggest one other hypothesis that could be up to something. In existing overviews, Kurdish is reported to “sometimes” show PIr. *x > kʰ, e.g. *xara- > /kʰer/ ‘mule’. The facts that this development (1) fails to be regular and (2) seems to be a regression (alleged PIr. free-standing *x is < PII *kʰ) should already suggest that it is perhaps an archaism rather than an innovation. The same might go for /tʰ/ from “PIr. *θ” < PII *tʰ, reported at least in *θaiwar- > /tʰiː/ ‘brother-in-law’. This interpretation is not airtight off the cuff by any means: both Armenian and Semitic influence could have encouraged secondary introduction of aspirated stops. But, interestingly, on a brief look-around I do not find cases where Persian /x/ ~ Kurdish /kʰ/ would derive from a secondary *x that continues *h₂, only cases with PII *kʰ. From the former, the result seems to be /h/, as above in e.g. ‘egg’. So did Kurdish regularly shift *x > /h/, while never shifting *kʰ? Again, detailed dialect evidence could perhaps swing this either way. One of these decades we will hopefully know better.

[1] Yes, written Sami already existed 200 years ago, indeed since the mid-17th century. The first variety to have been standardized to some practical extent was so-called “Old Swedish Sami”, a clergy-designed form from the mid-18th, based most closely on Ume Sami though aimed as a general western interdialectal standard. Standard Northern Sami took its first steps around the same time as well.
[2] Omniretained laryngeals are furthermore trouble also for e.g. RUKI. If we have *s > *š in e.g. *buHs- > *buHš ‘to endeavor’, as if triggered by a preceding *H and not *ū, why not also in e.g. *yaHs- > *yaHh- ‘to girdle’? Without the assumption of universal laryngeal preservation, though, this could be easily resolved by assuming *eH >> *ā as an independent vocalization from *iH/*uH > *ī/*ū. Note also a further but welcome corollary: if we do go with thinking that RUKI in *buHš- has been triggered by a long-retained *H, then also *Hhauš- will have to be simplified to just *hauš-, indeed already to a pre-RUKI late PIE *sews- < early PIE *h₂sews-.

Tagged with: historical linguistics, historical phonology, indo-iranian, kurdish, language documentation, laryngeals, linguistics
Posted in Commentary, Methodology