The Look and Song of Language

Everyone speaks at least one language. It’s fascinating how we learned that first language, too. Without any formal instruction, we heard it, repeated it, and found ourselves effective communicators within about seven years. And in that time, we learned what our language sounded like, too. 

Paul Horn once said that music is a universal language which unifies the spirits of mankind.
Language is merely words seeking their melodies.

Even if we grew up in a multilingual culture we somehow figured out when one language wasn’t another. It didn’t matter if we grew up speaking more than one language, either! If we grew up speaking English, we wouldn’t mistake it for Dutch. If we grew up speaking Spanish, we wouldn’t take it for Portuguese. 

It’s a very simple thing that happens in our brains that helps us recognize languages, but what is it? And how would we use that information to help a computer learn to recognize a language? In this article, we will explore methods for analyzing written language that can teach us what makes a language uniquely identifiable in written form, and how that can relate back to the spoken form. 

How do we recognize a language?

First, we should establish that for this article we are referring to spoken languages with an emphasis on the written modality; we are not referring to signed languages. So with that comes the assumption that our eyes and our ears are the two senses principally involved in recognizing a given language.

When we are using our eyes and ears to recognize language, what is it that stands out to us and signals, “this is your language,” or, “this is a language you don’t know at all?”

It can’t be syntax;  that would require knowing something about the nature of the language. Nor can it be semantics (meaning) or pragmatics (usage) for the same reason. It must be something basic; some small part of the language that we can discern without introspecting into the language. 

We recognize a language by looking at its parts, but what parts?

If we look only at the parts of a language that don’t involve grammar, structure, meaning, or usage, we are left with the words themselves. That means exploring how those words are shaped: their morphology. But to explore morphology without first learning it, we must parse words into arbitrary units that let us ignore the morphological structure that can seem so obvious.

One approach for arbitrarily parsing words into units is to pick two- or three-letter combinations. The name for a sequence of n such adjacent units is n-gram. We’ll focus on two kinds of n-grams: the bigram and the trigram.

Introducing Bigrams and Trigrams

A bigram can be any two-unit sequence. We could break a sentence into word bigrams like so:

The dog ran after the cat at the park : [‘the dog’, ‘dog ran’, ‘ran after’, ‘after the’, ‘the cat’, ‘cat at’, ‘at the’, ‘the park’]

This is an approach commonly used for analyzing syntax because (at scale) it answers the question, “What are the patterns (rules) for where words are placed?” But that’s not our focus here. We will be focused on letter bigrams, meaning our two-unit  sequence will be the letters within words:

The dog ran after the cat  =>  [‘th’, ‘he’, ‘do’, ‘og’, ‘ra’, ‘an’, ‘af’, ‘ft’, ‘te’, ‘er’, ‘th’, ‘he’, ‘ca’, ‘at’]

Similarly, a trigram would break the sentence into three-letter parts:

The dog ran after the cat =>  [‘the’, ‘dog’, ‘ran’, ‘aft’, ‘fte’, ‘ter’, ‘the’, ‘cat’]

While an astute observer may note that there are quite a few whole words that end up being trigrams, that isn’t yet our focus. 
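To make the splitting concrete, here is a minimal Python sketch of letter n-gram extraction. The function name letter_ngrams is made up for this illustration, and this is not the Methodius implementation; it only assumes that n-grams never cross word boundaries, which is how the examples above are built.

```python
import re

def letter_ngrams(text, n):
    """Return the letter n-grams found inside each word of text, in order."""
    # [^\W\d_]+ matches runs of letters only (no digits, underscores, or punctuation).
    words = re.findall(r"[^\W\d_]+", text.lower())
    grams = []
    for word in words:
        # Slide a window of n letters across the word; words shorter than n yield nothing.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(letter_ngrams("The dog ran after the cat", 2))
print(letter_ngrams("The dog ran after the cat", 3))
```

Run on the example sentence, this reproduces the bigram and trigram lists shown above.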

Why does looking at n-grams matter?

Computer programs may do bigram analysis on text to determine if it’s encrypted. This is because encrypted text (if it’s well-encrypted) shows all bigrams and trigrams at a relatively uniform frequency. Unencrypted text (i.e. natural language) has some two-letter combinations occurring more commonly than others.

While cryptographers aren’t concerned with why some two-letter combinations occur with greater frequency for a given language, this article argues that these combinations are highly relevant for identifying the unique characteristics of a language.

English and many other spoken languages with written forms are alphabetic; the letters (more accurately, graphemes) represent sounds that a speaker would vocalize. In some cases, such as English, multiple graphemes may be used to represent a single sound (e.g. th for /θ/ and /ð/, sh for /ʃ/). Bigrams and even trigrams can be a tool used within orthography (the rules for writing a language) to capture meaningful units of sound (phonemes).

With a large enough body of text (a corpus) we can learn rules for how graphemes are put together within words. Some rules can be discovered through the analysis of the frequency of individual graphemes, others with bigrams and trigrams. Other rules can be discovered by analyzing the placement of those n-grams within words and still other rules can be discovered by assessing the placement of frequently occurring n-grams.

N-gram analysis is capable of revealing morphological (structural) patterns of words for a given body of text. When we discover those patterns, we discover the look of a language. 

When we analyze a sentence, we aren’t only concerned with whether a particular n-gram occurs, but the frequency with which it occurs as well.

A small frequency analysis of English and French

If we analyze the English sentence, “The dog ran after the cat in the park,” we observe the following frequencies of bigrams and trigrams:

Bigram   Frequency
th       3
he       3
do       1
og       1
ra       1
an       1
af       1
ft       1
te       1
er       1
ca       1
at       1
in       1
pa       1
ar       1
rk       1

Trigram  Frequency
the      3
dog      1
ran      1
aft      1
fte      1
ter      1
cat      1
par      1
ark      1

We can additionally note that this 9-word sentence produced 20 bigrams and 11 trigrams.  The three occurrences of th, making it 15% of all bigrams, may seem significant. 
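The counts in these tables can be reproduced with a short sketch. The helper name ngram_counts is hypothetical, and the code assumes the same word-bounded splitting as the examples above; it is an illustration, not the tool that produced the article's data.

```python
from collections import Counter
import re

def ngram_counts(text, n):
    """Count letter n-grams inside each word of text."""
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    return counts

sentence = "The dog ran after the cat in the park"
bigrams = ngram_counts(sentence, 2)
total = sum(bigrams.values())

# Show each of the most frequent bigrams as a raw count and as a share of all bigrams.
for gram, count in bigrams.most_common(2):
    print(f"{gram}: {count} of {total} ({count / total:.0%})")
```

For this sentence the total is 20 bigrams, and th accounts for 3 of them, which is the 15% noted above.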

Let’s contrast this with a comparable French sentence, le chat a couru après le chien dans le parc (“the cat ran after the dog in the park”).

Bigram   Frequency
le       3
ch       2
ha       1
at       1
co       1
ou       1
ur       1
ru       1
ap       1
pr       1
re       1
es       1
hi       1
ie       1
en       1
da       1
an       1
ns       1
pa       1
ar       1
rc       1

Trigram  Frequency
cha      1
hat      1
cou      1
our      1
uru      1
apr      1
pre      1
res      1
chi      1
hie      1
ien      1
dan      1
ans      1
par      1
arc      1

We can observe that this 10-word sentence produced 24 bigrams and 15 trigrams.  And in this one sentence, we again see that le is 12.5% of all bigrams. We can look at the words in the sentence and see why this is; le is a whole word that appears three times, just like “the” in English. 

We might want to conclude that “the” and le are commonly used words that help shape the looks of those languages. But by the same reasoning, we might also conclude that English words tend to contain “he” and French words tend to contain “ch”. One sentence isn’t a large enough sample to learn what a language looks or sounds like.

Let’s look at a slightly larger sample, the Universal Declaration of Human Rights (UDHR), and see if bigrams derived from it can give us a better picture. This content was produced using the Methodius library and samples of the UDHR in various languages.

English Bigram  Frequency
an              46
he              42
th              42
nd              39
re              35
er              30
on              29
of              25
es              23
ti              22

French Bigram   Frequency
es              73
de              58
on              48
re              37
le              36
nt              35
me              34
er              32
la              30
et              30

We observe that th remains significant in English, tied with he and slightly behind an. Le remains significant in French, but less so than es, de, and on.

Analyzing a Peculiarity: The Frequency of es, de, and on in French 

French speakers may be eager to chime in and explain these bigrams. But let’s put off the eagerness to jump to morphology and instead imagine how a baby, born in a French-speaking home, knows when its mother is talking to it. That baby doesn’t know what those particles mean, but it knows that its mother is talking to it. How is that baby discerning French from babble?

To learn how French looks (and sounds) like French, we have to look at more than frequency; we’ll also have to consider placement. 

Let’s first consider that any sort of n-gram has three general placements in a word: 

  • Start (Initial)
  • Middle (anywhere but the first or last of the sequence)
  • End (Final)

If we consider that a bigram has those three general positions, we find that in the 73 instances where es occurs, it’s at the end of a word 82% of the time. We can safely say that French words commonly have es, and that es commonly comes at the end of a word. 
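A placement count like this can be sketched in a few lines of Python. The function name ngram_positions is invented for this example, and one detail is an assumption on my part: when an n-gram is the entire word (de, les), it is counted as a start, which appears consistent with the position tables later in this article.

```python
import re

def ngram_positions(text, gram):
    """Count how often gram sits at the start, middle, or end of a word."""
    n = len(gram)
    positions = {"start": 0, "middle": 0, "end": 0}
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        last = len(word) - n  # index of the final possible window in this word
        for i in range(last + 1):
            if word[i:i + n] != gram:
                continue
            if i == 0:
                # Assumption: an n-gram that is the whole word counts as a start.
                positions["start"] += 1
            elif i == last:
                positions["end"] += 1
            else:
                positions["middle"] += 1
    return positions

print(ngram_positions("les droits des hommes et les libertés", "es"))
```

Dividing each count by the total number of occurrences gives percentages like the 82% quoted above.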

If we look at the next most common bigram, de, we’ll find that of the 58 times it occurs, it was at the beginning of a word 79% of the time (46 occurrences).

We might be tempted to think that French words start with de and end with es, but that still isn’t the whole story. 

Diving deeper into French bigram placement by incorporating trigrams

The top two French bigrams comprise 9% of all bigrams. What about the third one, on? We’ll note 48 occurrences, making it a bit more than three percent of all bigrams. And we’ll also note that we find it in the middle of a word in 75% of those occurrences. Does this mean French words have on sprinkled in them? Do babies hear some de + on + es sandwich all the time? Not quite.

If we look at the most common French trigrams, and where they are positioned in a word, we’ll find that on has a more interesting story. 

Ion, con, and ons rank first, second, and fifth among the most frequent trigrams. We see that ion just barely favors occurring at the end of a word, while con definitely prefers the start. Ons slightly favors the middle over the end, but never occurs at the beginning.

What are we to make of these peculiar frequencies? And what of this les trigram in third place?

Now we’ll analyze the positions of bigram and trigram frequencies within words:

Bigram   Start  Middle  End
es       5      8       60
de       46     9       3
on       2      36      10
re       11     20      6
le       21     12      3
nt       0      10      25
me       7      13      14
er       0      25      7
la       21     9       0
et       27     2       1

Trigram  Start  Middle  End
ion      0      9       10
con      15     3       0
les      11     0       7
tio      0      18      0
ons      0      10      7
ent      1      6       10
res      3      3       11
ati      0      15      0
des      13     0       0
omm      0      12      0

The Morphological Explanation for French n-gram Frequencies

Francophones and linguists would be eager to provide three rules to explain our bigram frequencies:

  1. le is a definite article
  2. s is a common French morpheme indicating a plural
  3. French articles and adjectives must agree with their nouns in both gender and number

These morphological and syntactic facts are the reason behind the frequencies and placements we observe. (The percentages below are shares of the occurrences of the ten most frequent bigrams or trigrams, not of every n-gram in the text.)

  • es makes up 17.6% of the top-ten bigram occurrences and occurs at the ends of words 82% of the time
  • le makes up only 8.7% of the top-ten bigram occurrences, but occurs at either the start or middle 92% of the time
  • Les (a union of le and es) comes in third place at 11% of the top-ten trigram occurrences, while tending to occur at the start of words instead of the end

The peculiarity of on constituting 11.6% of the top-ten bigram occurrences requires a different kind of analysis, however. This peculiarity requires evaluating co-occurrence.

Of the 48 occurrences of on, 46 (96%) are found in the middle or end of a word. French words overwhelmingly do not place on at the start. Why?

Astute observers will have caught that ion, con, tio, and ons are 11.6%, 11%, 11%, and 10% of the top-ten trigram occurrences. This is where a review of co-occurrence helps.

If we evaluate the eight most frequent trigrams not just by frequency, but by how often they occur next to each other, we’ll find that ons, ati, tio, and ion have nearly the same number of co-occurrences.

Trigram  Occurrences with Other Top Trigrams
con      10
ons      17
ati      15
tio      18
ion      18

The algorithm for producing this frequency map is summarized as, “only add n-grams if they’re all very popular and also adjacent.” 
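One way to read that summary is sketched below in Python. This is my own interpretation, not the Methodius code: the function names and the top_k parameter are invented for this example, "adjacent" is taken to mean the trigram that starts one letter earlier or later in the same word, and each occurrence is counted once if at least one of its neighbors is also a top trigram.

```python
from collections import Counter
import re

def trigrams_by_word(text):
    """Yield the ordered list of trigrams for each word."""
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        yield [word[i:i + 3] for i in range(len(word) - 2)]

def adjacent_top_counts(text, top_k=8):
    """Count occurrences of top trigrams that touch another top trigram."""
    words = list(trigrams_by_word(text))
    frequency = Counter(gram for grams in words for gram in grams)
    top = {gram for gram, _ in frequency.most_common(top_k)}
    adjacent = Counter()
    for grams in words:
        for i, gram in enumerate(grams):
            if gram not in top:
                continue
            # Neighbors are the trigrams immediately before and after in the same word.
            neighbours = grams[max(i - 1, 0):i] + grams[i + 1:i + 2]
            if any(other in top for other in neighbours):
                adjacent[gram] += 1
    return adjacent

print(adjacent_top_counts("nation nations action", top_k=2))
```

In the toy input, tio and ion are the two most frequent trigrams and always sit next to each other, so both are counted on every occurrence.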

The result is that we now know that tio and ion co-occur, and that ons frequently occurs next to either con or ion.

If we were to go the next step and form unions of these trigram sets, we’d discover that tio, ion, and ons form one 5-character n-gram: tions (prepending ati extends it to ations). We will find seven instances of tions in the text and 18 instances of tion.

We can now see how the analysis of sequence after frequency reveals a French morpheme: tion.
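The union step can be illustrated with a small Python sketch. The function merge_overlapping is hypothetical; it assumes the n-grams are at least two letters long, are given in order, and that each one overlaps the previous one by all but its last letter.

```python
def merge_overlapping(grams):
    """Chain n-grams where each one overlaps the end of the previous by all but one letter."""
    merged = grams[0]
    for gram in grams[1:]:
        overlap = len(gram) - 1
        if merged[-overlap:] != gram[:-1]:
            raise ValueError(f"{gram!r} does not overlap the end of {merged!r}")
        # Only the final letter of the next n-gram is new.
        merged += gram[-1]
    return merged

print(merge_overlapping(["tio", "ion"]))         # tion
print(merge_overlapping(["tio", "ion", "ons"]))  # tions
```

Counting how often the merged string actually appears in the text is then a separate check on whether the union is a real, recurring unit.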

Frequency and Sequence revealing the look of French

We’ve now learned that:

  • French words are likely to end in -s
  • Les is very popular
  • -ion and its bigger brother –tion are popular word endings
  • Con- wants to be at the beginning of words, and isn’t likely to be followed by an -s

If we were to train a language model on the Universal Declaration of Human Rights, we might expect that after a handful of training epochs it would make very similar assumptions about how to form words.

Patterns discovered in a larger corpus

Using the Methodius CLI, which was developed for this article,  an analysis was done of six texts:

  • Alcools
  • Candide
  • Cyrano de Bergerac
  • Les Misérables
  • Les Plaisirs et Les Jours
  • Universal Declaration of Human Rights

Each text was analyzed for its 20 most frequent bigrams, and then all 6 lists of 20 bigrams were merged to produce 30 unique bigrams. 

es  on  ti
le  ur  it
en  te  io
nt  eu  oi
de  la  ro
re  ie  ns
ou  me  il
an  er  se
ai  ne  is
et  qu  ra
The 30 most frequent bigrams across six texts

The approach of analyzing each text independently and then merging the list reveals unique characteristics of the texts. 

  • Removing Universal Declaration of Human Rights: 25 bigrams
  • Removing any one of Alcools, Cyrano de Bergerac, or Les Misérables: 29 bigrams
  • Removing Candide or Plaisirs: 30 bigrams (i.e. the size of the list is no different)

If removing the Universal Declaration of Human Rights removes 5 unique bigrams from our list, this tells us that its language is meaningfully different from that of the other texts.
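The merge-and-remove comparison can be sketched as follows. This is an illustration in Python rather than the Methodius CLI itself; top_bigrams, leave_one_out_sizes, and the k parameter are names chosen for this example, and texts is assumed to be a dictionary mapping each title to its full text.

```python
from collections import Counter
import re

def top_bigrams(text, k=20):
    """Return the set of the k most frequent letter bigrams in text."""
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return {gram for gram, _ in counts.most_common(k)}

def leave_one_out_sizes(texts, k=20):
    """Merge every text's top-k list, then re-merge with each text left out in turn."""
    tops = {title: top_bigrams(body, k) for title, body in texts.items()}
    merged = set().union(*tops.values())
    sizes = {}
    for title in tops:
        others = [grams for other, grams in tops.items() if other != title]
        sizes[title] = len(set().union(*others))
    return merged, sizes
```

A text whose removal shrinks the merged list the most is the one contributing the most bigrams that no other text has in its top k.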

When we analyze all six texts at once, we find that the frequencies we observed in the initial UDHR sample aren’t that different from 5 other texts spanning 300 years of French history. Let’s compare bigram frequencies to rates of co-occurrence:

Bigram  Frequency
es      2.90%
le      2.59%
re      2.46%
de      2.36%
en      2.33%
ai      2.17%
ou      2.02%
nt      1.98%
an      1.82%
et      1.81%
on      1.74%
it      1.67%
te      1.61%
er      1.56%
qu      1.47%
la      1.37%
me      1.32%
ur      1.31%
is      1.24%
se      1.23%
Top French Bigrams

Bigram  Co-occurrences
es      17463
nt      16707
ai      12684
en      11747
it      9731
re      9560
te      8815
le      7414
er      6943
ur      5938
an      5635
me      5400
is      4964
se      4851
de      4710
ou      4408
et      3851
on      2705
la      2341
qu      582
Related Top French bigrams

French is filled with es, le, re, de, and en, and they are associated strongly with other bigrams of the language. 

At this point, it would behoove us to compare our related top bigrams to their placements. 

French Bigram  Co-occurrences  Start  Middle  End
er             6943            27     8882    4819
re             9560            4206   8187    9243
ou             4408            1395   16187   196
ur             5938            10     5997    5578
de             4710            16727  2086    2019
es             17463           1629   5148    18782
on             2705            1305   8813    5248
nt             16707           0      6061    11385
an             5635            426    14154   1428
it             9731            21     3964    10720
te             8815            1606   5743    6878
le             7414            11140  5713    5975
se             4851            4391   3934    2552
en             11747           4140   14642   1761
et             3851            8872   5881    1196
is             4964            33     5685    5210
ai             12684           637    17625   872
me             5400            2665   5186    3768
la             2341            7962   3195    940
qu             582             9363   3585    0

From this, we immediately identify restrictions for composing words in the French language:

  • Er, ur, it, and is rarely occur at the start of a word
  • Qu has the fewest co-occurrences with other top bigrams, and it never occurs at the end of a word
  • Nt is one of the most popular bigrams, but it never occurs at the beginning of a word

If we were to conduct a further analysis and then apply the orthography of the language, we could infer the phonotactic constraints of the French language. 

Analyzing a Peculiarity: The Frequency of an, he, th in English

Speakers of the English language would likely be delighted to share their own insights in regard to the frequencies of an, he, and th. But they must again be discouraged from sharing their own language intuition so that we can learn how a computer might discover it. 

As we did with French, we’ll consider placement in addition to frequency for English. Recall that placement falls into three general categories: start, middle, and end.

If we consider that a bigram has those three general positions, we find that th occurs in the start position of a word 78.8% of the time. It occurs in the middle and end of a word at a respective frequency of 10.6% and 10.4%. English overwhelmingly expects that th begin a word. 

The next most common bigram, he, occurs at the end of a word 44% of the time and the middle of a word at a frequency of 34%. While th tends to start words, he is more likely to end them.

Diving deeper into English bigram placement by incorporating trigrams

In looking at our frequency of English bigrams, we’ve observed that an is the most frequent, followed by he and th which share second place. The th and he duo make up 6% of all bigrams. One would be right to think that English likes these bigrams occurring together. 

If we were to create a union of these two bigrams based on their preferred positions, that union would be the most popular trigram: the.

We could begin to think babies that hear English for the first time think their language is teeming with thoughts of the. They aren’t wrong. 

The most common bigram in English, an, occurs a little more frequently than the he & th twins. When we look at the most common position for an, we find that it starts a word about 56% of the time. An occurs at the middle or end of a word with identical frequency. 

This fact is supported by the second most popular trigram: and.

If we were wondering about the nd, we needn’t, because nd is the fourth most common bigram, and it favors ending words about 74% of the time. 

So we could guess that babies hearing English would expect that an should probably start words, and that nd would mostly end them. 

English is starting to sound interesting, isn’t it?

The Explanation for English n-gram Frequencies

As it would turn out, the reason for th and he occurring so frequently together has an explanation that’s very similar to French’s justification for its les: It’s a definite article. 

English text will be replete with determiners, and the is its most popular. Etymologists would tell us that the originates from þe, which was a nominative masculine form of a demonstrative pronoun in Old English.

It should also be noted that the th bigram is used in two other English determiners, this and that, as well as their plural forms these and those. We should observe that th is at the beginning of each.

Thirteen of English’s 100 most common words contain the th bigram:

  • The
  • That
  • With
  • This
  • They
  • There
  • Their
  • Them
  • Other
  • Than
  • Then
  • Think
  • These

Within those thirteen words, over half (eight) contain our the trigram:

  • The
  • They
  • There
  • Their
  • Them
  • Other
  • Then
  • These

Native and non-native English speakers should note a pattern with our popular the- words: They tend to act as determiners or pronouns. In fact, then is the only popular the- word that behaves primarily as an adverb. 

It may be reasonable to guess that something in English’s past originated many of our the-related words. 

Patterns discovered in a larger corpus

Using the Methodius CLI, an analysis was done of six texts:

  •  Alice in Wonderland
  • The Great Gatsby
  • The Adventures of Huckleberry Finn
  • Paradise Lost
  • The Wizard of Oz
  •  Universal Declaration of Human Rights

Each text was analyzed for its 20 most frequent bigrams, and then all 6 lists of 20 bigrams were merged to produce 35 unique bigrams. 

he  re  wa  io
th  ng  hi  ri
in  on  ll  of
er  to  or  ar
an  en  ea  ro
ou  ed  st
it  al  es
nd  as  is
at  li  se
ha  sh  ti

The approach of analyzing each text independently and then merging the lists reveals unique characteristics of the texts. By leaving each list out of the merge in turn, we see that The Great Gatsby contributes no unique bigrams, while Paradise Lost and the UDHR each contribute 4 unique bigrams.

  • Removing Paradise Lost or Universal Declaration of Human Rights: 31 bigrams
  • Removing Wizard of Oz: 32 bigrams
  • Removing Huckleberry Finn or Alice in Wonderland: 33 bigrams
  • Removing Great Gatsby: 35 bigrams (i.e. the size of the list doesn’t change)

Across this larger corpus, we observe that 79.7% of all th occurrences are at the start of a word and 10.6% are at the end, leaving 9.7% in the middle. English strongly prefers that th initiate words and is barely inclined to end them with it.

Native Speaker Intuition with English Bigrams

As part of this research, an informal survey was sent out to a variety of Information Technology professionals.  The survey supplied respondents with a list of the most common bigrams and asked them to create English-sounding words.

Respondents were provided the following instructions:

  1. Merge or combine any of the two-letter pairs listed below to make new English words
  2. The word must not exist in the English language to the best of your knowledge
  3. You may use as many pairs as you want to make a word as long as you want
  4. Feel free to make up a definition and provide it
  5. Do not use AI or any software to help you (use your imagination)

Respondents were supplied with the following related bigrams curated from Great Gatsby, Adventures of Huckleberry Finn, Paradise Lost, Wizard of Oz, and the Universal Declaration of Human Rights. 

th  he  en  ha
at  er  in  ng
an  nd  re  ed
on  to  ou  it
wa  as  or  ro
ar

The results of this survey show that respondents reinforced many of the rules we have already captured surrounding th.

Bigram rule  Word set
Initial th   thand, ther, thatore, tharing, thared, thinouas, thedou, thar, thon, thator, thender
Final th     inreth, arth, arth (provided 2x by different respondents), harth, waroth, enth, wath, toth, hareth, tonderouth
Middle th    wathat, hathon, enthaned, hathen, arthon, oratheit, ithas, wathar, wethor
th + he      ther, thedou, thender, hathen, oratheit

We observe that native and non-native English speakers obey the frequency and placement rules that we already identified, even when tasked with creating new words:

  • Initial th-  is the most common
  • Middle -th- and final -th occur equally 
  • When th and he occur together, they frequently form an initial the-

Though it may seem surprising that respondents didn’t create more initial the- words, we must be mindful that the instructions required that the word not exist to the best of the respondent’s knowledge. Since so many of the most frequent English words begin with the, this would have been a difficult task. 

Bigram Frequencies in Related Languages

The similarity of two languages’ n-gram frequencies tends to correspond to the similarity of the languages themselves.

We can observe this in the high frequencies of es, os, and de in Spanish, Portuguese, and Catalan. But there are two languages whose frequencies are so similar they could seem identical: English and Scots.

Frequency overlap in English and Scots

English and Scots are both members of the Anglic family of the Anglo-Frisian family tree. As sibling languages spoken on the same island, one can expect that the two languages may have some similarities. 

If we compare the 10 most frequent bigrams in both languages, we’ll observe quite a bit of similarity:

  • They have 8 bigrams in common
    • in, en remain unique to Scots
    • of, ti remain unique to English
  • The same four bigrams top both lists, with similar frequencies

Bigram  Frequency
an      3.46%
he      3.16%
th      3.16%
nd      2.94%
re      2.64%
er      2.26%
on      2.18%
of      1.88%
es      1.73%
ti      1.66%
English bigrams

Bigram  Frequency
an      3.68%
th      3.53%
he      3.06%
nd      2.98%
in      2.04%
en      1.80%
re      1.72%
on      1.72%
es      1.65%
er      1.57%
Scots bigrams
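The overlap between the two lists can be checked directly. The Python sketch below simply copies the percentages from the two tables above into dictionaries (the variable names are my own) and compares them with set operations.

```python
# Top-10 bigram frequencies (percent), copied from the English and Scots tables above.
english = {"an": 3.46, "he": 3.16, "th": 3.16, "nd": 2.94, "re": 2.64,
           "er": 2.26, "on": 2.18, "of": 1.88, "es": 1.73, "ti": 1.66}
scots = {"an": 3.68, "th": 3.53, "he": 3.06, "nd": 2.98, "in": 2.04,
         "en": 1.80, "re": 1.72, "on": 1.72, "es": 1.65, "er": 1.57}

shared = sorted(english.keys() & scots.keys())
only_english = sorted(english.keys() - scots.keys())
only_scots = sorted(scots.keys() - english.keys())

# Absolute difference in frequency for each shared bigram, in percentage points.
gaps = {gram: round(abs(english[gram] - scots[gram]), 2) for gram in shared}

print(len(shared), only_english, only_scots)
print(gaps)
```

A comparison like this only shows how close the two profiles are; on its own it would not be enough to tell the languages apart, which is why placement matters in what follows.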

The similarities between these two languages are important to note, because Google Translate will misidentify Scots as English. I don’t know what algorithms Google uses for language identification, but they appear to not be focused on n-gram frequency and placement analysis. 

If we were to use frequency and placement analysis together, we’d describe a distinction between the two languages like so:

  • In English, th and he occur with the same frequency
  • In Scots, th is more frequent than he
  • In the English UDHR sample, there are no occurrences of he at the start of a word
  • In Scots, he starts a word 20% of the time

Suffice it to say that in Scots, th and he are less codependent. This fact could help a language detection algorithm distinguish English from Scots. We could then pair this with another observation: Scots has the trigram cht, which doesn’t appear at all in the English UDHR sample. In fact, cht only appears 8 times across all of the English texts analyzed.

Distinguishing English from Scots could be a matter of looking for less codependent bigrams, and more frequent trigrams. The position may not need to be evaluated at all. 

Opportunities for exploration

This is only a cursory analysis of a handful of features of two Indo-European languages, based entirely on written texts. We have only focused on small features of a few languages and a small comparative analysis of two related languages. There is much more that we can discover with larger corpora and more language families (additionally, we have only explored languages with Latin scripts).

In one application, we could explore shared frequencies between related languages. If we consider that os and de share similar frequencies in Portuguese and Spanish, we could infer from frequency analysis of these and other n-grams that these languages are related. If we can determine that known members of a language family are related through this analysis, we could then attempt to develop rules that could be extrapolated and applied to other language sets. 

In another application, we can potentially identify sound-shift laws similar to Grimm’s Law. Consider the case of the high frequency of es in Catalan, and os in Spanish and Portuguese. The Levenshtein edit distance between them is 1 and the Jaro similarity is .67, which suggests that these are related morphologies. Given that these two high-frequency n-grams are quantitatively similar, we could look for a trend throughout Iberian languages where /o/ transitions to /e/ by comparing n-grams with similar forms and frequencies.
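For readers who want to check those two numbers, here is a self-contained Python sketch of both measures. These are textbook formulations written for this example, not code taken from Methodius or any particular library.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def jaro(a, b):
    """Jaro similarity between two strings, from 0.0 (no match) to 1.0 (identical)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and within this many positions of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    matches = 0
    for i, char in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == char:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq_a = [c for c, m in zip(a, matched_a) if m]
    seq_b = [c for c, m in zip(b, matched_b) if m]
    transpositions = sum(x != y for x, y in zip(seq_a, seq_b)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

print(levenshtein("es", "os"))     # 1
print(round(jaro("es", "os"), 2))  # 0.67
```

With strings this short, a single shared letter dominates both scores, so these measures are more informative when applied across many n-gram pairs than to one pair in isolation.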

N-gram analysis has several practical applications. Whether we are identifying the language a chunk of text is written in, determining language families, or identifying sound-shift laws, they all serve the goal of identifying the beauty in a language’s unique look and song.

Works Cited

Etymology Online. “the.” the | Etymology of the by etymonline, Etymology Online, 2024, https://www.etymonline.com/word/the. Accessed 22 July 2024.

“Most common words in English.” Wikipedia, 20 March 2024, https://en.wikipedia.org/wiki/Most_common_words_in_English. Accessed 22 July 2024.

Taylor, Frank. “English Frequencies.” Methodius Demo, Frank M. Taylor.

Taylor, Frank M. Methodius Demo for languages, January 2024, https://experiments.frankmtaylor.com/methodius/#French–tableSet. Accessed 19 February 2024.

Taylor, Frank M. “Bigram and trigram intuition: English.” Bigram and Trigram intuition: English, 28 March 2024, https://docs.google.com/forms/d/1yNufFK3-F-dGU7htjhD6pRTMjdil_c1QFRz7ZNXbfOc/edit#question=1636384217&field=1814842493. Accessed 24 April 2024.

Taylor, Frank M. Six French Samples. Frank M. Taylor, April 2024, https://www.dropbox.com/scl/fo/9e27kvlt1isnuswyd78ik/h?rlkey=y19shfsqb0luadxslf1xp44ew&dl=0.

“-tion – Wiktionary, the free dictionary.” Wiktionary, https://en.wiktionary.org/wiki/-tion#French. Accessed 9 April 2024.

Data and Code

Taylor, Frank. Methodius CLI. April 2024,
https://github.com/paceaux/methodius-cli. Accessed April 2024. 

Taylor, Frank. Analysis, April 2024
https://www.dropbox.com/scl/fo/5u358kuk8j6muye2dccc9/h?rlkey=4vtwioa9aonemhsb6a7lhkyp2&dl=0. Produced 8 April, 2024.

Final Notes

This was part of my application to the University of Illinois’ graduate program for Linguistics. I lacked an academic writing sample that I could provide because I graduated college over 20 years ago; all of my academic work was on floppy disk. Lacking any academic writing samples, I chose to do original work, which included writing all of the code and doing the analysis which you have read here. My hope was that this effort would demonstrate my capability to work in academia.

It did not.

If you have found flaws in my data, analysis, or conclusions, please gently enumerate them in a comment below as I am not an academic, despite my best efforts.


