Everyone speaks at least one language. It’s fascinating how we learned that first language, too. Without any formal instruction, we heard it, repeated it, and found ourselves effective communicators within about seven years. And in that time, we learned what our language sounded like, too.

Even if we grew up in a multilingual culture, we somehow figured out when one language wasn’t another. It didn’t matter if we grew up speaking more than one language, either! If we grew up speaking English, we wouldn’t mistake it for Dutch. If we grew up speaking Spanish, we wouldn’t mistake it for Portuguese.
It’s a very simple thing that happens in our brains that helps us recognize languages, but what is it? And how would we use that information to help a computer learn to recognize a language? In this article, we will explore methods for analyzing written language that can teach us what makes a language uniquely identifiable in written form, and how that can relate back to the spoken form.
How do we recognize a language?
First, we should establish that for this article we are referring to spoken languages with an emphasis on the written modality; we are not referring to signed languages. So with that comes the assumption that our eyes and our ears are the two senses principally involved in recognizing a given language.
When we are using our eyes and ears to recognize language, what is it that stands out to us and signals, “this is your language,” or, “this is a language you don’t know at all?”
It can’t be syntax; that would require knowing something about the nature of the language. Nor can it be semantics (meaning) or pragmatics (usage) for the same reason. It must be something basic; some small part of the language that we can discern without introspecting into the language.
We recognize a language by looking at its parts, but what parts?
If we were to look at parts of a language that didn’t involve grammar, structure, meaning, or usage, that leaves us with the words themselves. It means exploring how those words are shaped: their morphology. But in order to explore morphology without first learning it, we must parse these words into arbitrary units that allow us to ignore the morphological structure that can seem so obvious.
One approach for arbitrarily parsing words into units is to pick two- or three-letter combinations. A sequence of n such units is called an n-gram. We’ll focus on two kinds of n-grams: the bigram and the trigram.
Introducing Bigrams and Trigrams
A bigram can be any two-unit sequence. We could break a sentence into word bigrams like so:
The dog ran after the cat at the park => [‘the dog’, ‘dog ran’, ‘ran after’, ‘after the’, ‘the cat’, ‘cat at’, ‘at the’, ‘the park’]
This is an approach commonly used for analyzing syntax because (at scale) it answers the question, “What are the patterns (rules) for where words are placed?” But that’s not our focus here. We will be focused on letter bigrams, meaning our two-unit sequence will be the letters within words:
The dog ran after the cat => [‘th’, ‘he’, ‘do’, ‘og’, ‘ra’, ‘an’, ‘af’, ‘ft’, ‘te’, ‘er’, ‘th’, ‘he’, ‘ca’, ‘at’]
Similarly, a trigram would break the sentence into three-letter parts:
The dog ran after the cat => [‘the’, ‘dog’, ‘ran’, ‘aft’, ‘fte’, ‘ter’, ‘the’, ‘cat’]
While an astute observer may note that there are quite a few whole words that end up being trigrams, that isn’t yet our focus.
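The letter bigrams and trigrams above can be produced with a short sliding window over each word. This is a minimal sketch rather than the code used for the article; the function name `letter_ngrams` and the decision to lowercase the text and split on non-letters are assumptions made for illustration.

```python
import re

def letter_ngrams(text, n):
    """Return every n-letter window found inside each word of the text."""
    grams = []
    # Assumption: a "word" is any run of letters; case is ignored.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

sentence = "The dog ran after the cat"
bigrams = letter_ngrams(sentence, 2)
trigrams = letter_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

Because the window never crosses a word boundary, a word shorter than n letters contributes nothing, which is why short words disappear from the trigram list.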
Why does looking at n-grams matter?
Computer programs may do bigram analysis on text to determine if it’s encrypted. This is because encrypted text (if it’s well-encrypted) shows all bigrams and trigrams at a relatively uniform frequency. Unencrypted text (i.e. natural language) has some two-letter combinations occurring more commonly than others.
While cryptographers aren’t concerned with why some two-letter combinations occur with greater frequency in a given language, this article argues that these combinations are highly relevant for identifying the unique characteristics of a language.
English and many other spoken languages with written forms are alphabetic; the letters (more accurately, graphemes) represent sounds that a speaker would vocalize. In some cases, such as English, multiple graphemes may be used to represent a single sound (e.g. th for /θ/ and /ð/, sh for /ʃ/). Bigrams and even trigrams can be a tool used within orthography (the rules for writing a language) to capture meaningful units of sound (phonemes).
With a large enough body of text (a corpus) we can learn rules for how graphemes are put together within words. Some rules can be discovered through the analysis of the frequency of individual graphemes, others with bigrams and trigrams. Other rules can be discovered by analyzing the placement of those n-grams within words and still other rules can be discovered by assessing the placement of frequently occurring n-grams.
N-gram analysis is capable of revealing morphological (structural) patterns of words for a given body of text. When we discover those patterns, we discover the look of a language.
When we analyze a sentence, we aren’t only concerned with whether a particular n-gram occurs, but the frequency with which it occurs as well.
A small frequency analysis of English and French
If we analyze the English sentence, “The dog ran after the cat in the park,” we observe the following frequencies of bigrams and trigrams:
| Bigram | Frequency |
|---|---|
| th | 3 |
| he | 3 |
| do | 1 |
| og | 1 |
| ra | 1 |
| an | 1 |
| af | 1 |
| ft | 1 |
| te | 1 |
| er | 1 |
| ca | 1 |
| at | 1 |
| in | 1 |
| pa | 1 |
| ar | 1 |
| rk | 1 |
| Trigram | Frequency |
|---|---|
| the | 3 |
| dog | 1 |
| ran | 1 |
| aft | 1 |
| fte | 1 |
| ter | 1 |
| cat | 1 |
| par | 1 |
| ark | 1 |
We can additionally note that this 9-word sentence produced 20 bigrams and 11 trigrams. The three occurrences of th, making it 15% of all bigrams, may seem significant.
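The counts for this sentence can be reproduced with a frequency counter. The sketch below is illustrative only; `letter_ngrams` is a hypothetical helper, and splitting words on non-letters is an assumption.

```python
import re
from collections import Counter

def letter_ngrams(text, n):
    """Return every n-letter window found inside each word of the text."""
    grams = []
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

sentence = "The dog ran after the cat in the park"
bigram_counts = Counter(letter_ngrams(sentence, 2))
trigram_counts = Counter(letter_ngrams(sentence, 3))

total_bigrams = sum(bigram_counts.values())
total_trigrams = sum(trigram_counts.values())
th_share = bigram_counts["th"] / total_bigrams

print(total_bigrams, total_trigrams)
print(f"th: {bigram_counts['th']} of {total_bigrams} bigrams ({th_share:.0%})")
```

Dividing a single n-gram's count by the total is all that is needed to turn raw counts into the percentages used throughout the rest of the article.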
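Let’s contrast with a comparable French sentence, le chat a couru après le chien dans le parc (“the cat ran after the dog in the park”). Accents are dropped in the tables below, so après contributes re and es.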
| Bigram | Frequency |
|---|---|
| le | 3 |
| ch | 2 |
| ha | 1 |
| at | 1 |
| co | 1 |
| ou | 1 |
| ur | 1 |
| ru | 1 |
| ap | 1 |
| pr | 1 |
| re | 1 |
| es | 1 |
| hi | 1 |
| ie | 1 |
| en | 1 |
| da | 1 |
| an | 1 |
| ns | 1 |
| pa | 1 |
| ar | 1 |
| rc | 1 |
| Trigram | Frequency |
|---|---|
| cha | 1 |
| hat | 1 |
| cou | 1 |
| our | 1 |
| uru | 1 |
| apr | 1 |
| pre | 1 |
| res | 1 |
| chi | 1 |
| hie | 1 |
| ien | 1 |
| dan | 1 |
| ans | 1 |
| par | 1 |
| arc | 1 |
We can observe that this 10-word sentence produced 24 bigrams and 15 trigrams. In this one sentence, le makes up 12.5% of all bigrams. We can look at the words in the sentence and see why this is; le is a whole word that appears three times, just like “the” in English.
We might want to conclude that “the” and le are commonly used words that help shape the looks of those languages. But based on that assumption, we might think words also have “he” in them in English, and “ch” in them in French. This one sentence isn’t a large enough sample to learn what a language sounds like.
Let’s look at a slightly larger sample, The Universal Declaration of Human Rights, and see if bigrams derived from it can give us a better picture. This content was produced using the Methodius library and samples of the UDHR from various languages.
| English Bigram | Frequency |
|---|---|
| an | 46 |
| he | 42 |
| th | 42 |
| nd | 39 |
| re | 35 |
| er | 30 |
| on | 29 |
| of | 25 |
| es | 23 |
| ti | 22 |
| French Bigram | Frequency |
|---|---|
| es | 73 |
| de | 58 |
| on | 48 |
| re | 37 |
| le | 36 |
| nt | 35 |
| me | 34 |
| er | 32 |
| la | 30 |
| et | 30 |
We observe that th remains significant in English, tied with he and slightly behind an. Le remains significant in French, but less so than es, de, and on.
Analyzing a Peculiarity: The Frequency of es, de, and on in French
French speakers may be eager to chime in and explain these bigrams. But let’s put off the eagerness to jump to morphology and instead imagine how a baby, born in a French-speaking home, knows when its mother is talking to it. That baby doesn’t know what those particles mean, but it knows that its mother is talking to it. How is that baby discerning French from babble?
To learn how French looks (and sounds) like French, we have to look at more than frequency; we’ll also have to consider placement.
Let’s first consider that any sort of n-gram has three general placements in a word:
- Start (Initial)
- Middle (anywhere but the first or last of the sequence)
- End (Final)
If we consider that a bigram has those three general positions, we find that in the 73 instances where es occurs, it’s at the end of a word 82% of the time. We can safely say that French words commonly have es, and that es commonly comes at the end of a word.
If we look at the next most common bigram, de, we’ll find that of the 58 times that it occurs, it was at the beginning of a word 79% of the time (46 occurrences).
We might be tempted to think that French words start with de and end with es, but that still isn’t the whole story.
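Classifying each occurrence as start, middle, or end can be sketched as follows. This is a minimal illustration, not the article's implementation: the function name `ngram_positions` is hypothetical, and counting an n-gram that is the whole word (such as de or les) as a start is an assumption.

```python
import re
from collections import Counter

def ngram_positions(text, ngram):
    """Count how often `ngram` sits at the start, middle, or end of a word."""
    n = len(ngram)
    positions = Counter(start=0, middle=0, end=0)
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        last = len(word) - n  # index of the final window in this word
        for i in range(last + 1):
            if word[i:i + n] != ngram:
                continue
            if i == 0:
                # Assumption: a whole-word match counts as a start.
                positions["start"] += 1
            elif i == last:
                positions["end"] += 1
            else:
                positions["middle"] += 1
    return positions

sample = "des personnes et les nations"
es_positions = ngram_positions(sample, "es")
on_positions = ngram_positions(sample, "on")
print(es_positions)
print(on_positions)
```

Dividing one position's count by the sum of all three gives the placement percentage, in the same way that 60 final occurrences out of 73 gives the 82% quoted for es.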
Diving deeper into French bigram placement by incorporating trigrams
The top two French bigrams comprise 9% of all bigrams. What about the third one, on? We’ll note 48 occurrences, making it a bit more than three percent of all bigrams. And we’ll also note that we find it in the middle in 75% of those occurrences. Does this mean French words have on sprinkled in them? Do babies hear some de + on + es sandwich all the time? Not quite.
If we look at the most common French trigrams, and where they are positioned in a word, we’ll find that on has a more interesting story.
Ion, con, and ons rank first, second, and fifth among the most frequent trigrams. We see that ion just barely favors occurring at the end of a word, con definitely prefers the start, and ons slightly favors the middle over the end but never appears at the beginning.
What are we to make of these peculiar frequencies? And what of this les trigram in third place?
The tables below show where the most frequent French bigrams and trigrams fall within words:
| N-gram | Positions |
|---|---|
| es | start: 5 middle: 8 end: 60 |
| de | start: 46 middle: 9 end: 3 |
| on | start: 2 middle: 36 end: 10 |
| re | start: 11 middle: 20 end: 6 |
| le | start: 21 middle: 12 end: 3 |
| nt | start: 0 middle: 10 end: 25 |
| me | start: 7 middle: 13 end: 14 |
| er | start: 0 middle: 25 end: 7 |
| la | start: 21 middle: 9 end: 0 |
| et | start: 27 middle: 2 end: 1 |
| N-gram | Positions |
|---|---|
| ion | start: 0 middle: 9 end: 10 |
| con | start: 15 middle: 3 end: 0 |
| les | start: 11 middle: 0 end: 7 |
| tio | start: 0 middle: 18 end: 0 |
| ons | start: 0 middle: 10 end: 7 |
| ent | start: 1 middle: 6 end: 10 |
| res | start: 3 middle: 3 end: 11 |
| ati | start: 0 middle: 15 end: 0 |
| des | start: 13 middle: 0 end: 0 |
| omm | start: 0 middle: 12 end: 0 |
The Morphological Explanation for French n-gram Frequencies
Francophones and linguists would be eager to provide three rules to explain our bigram frequencies:
- le is a definite article
- –s is a common French morpheme indicating a plural
- French adjectives, including definite articles, must agree with their nouns in both gender and quantity
These morphological and syntactical facts are the reason behind the frequencies and placements we observe:
- es makes up 17.6% of the occurrences of the ten most frequent bigrams and sits at the end of a word 82% of the time
- le makes up only 8.7% of those occurrences, but sits at the start or middle of a word 92% of the time
- Les (a union of le and es) comes in third place, with 11% of the occurrences of the ten most frequent trigrams, and tends to occur at the start of words instead of the end
The peculiarity of on constituting 11.6% of the occurrences of the ten most frequent bigrams requires a different kind of analysis, however. This peculiarity requires evaluating co-occurrence.
Of the 48 occurrences of on, 46 (96%) are found in the middle or end of a word. French words overwhelmingly do not place on at the start. Why?
Astute observers will have caught that ion, con, tio, and ons are 11.6%, 11%, 11%, and 10% of the occurrences of the ten most frequent trigrams. This is where a review of co-occurrence helps.
If we evaluate our trigrams not just by frequency, but by how often frequent trigrams occur next to each other, and restrict the analysis to the eight most frequent trigrams, we find that ons, ati, tio, and ion co-occur at nearly the same rate.
| Trigram | Occurrences with Other Top Trigrams |
|---|---|
| con | 10 |
| ons | 17 |
| ati | 15 |
| tio | 18 |
| ion | 18 |
The algorithm for producing this frequency map is summarized as, “only add n-grams if they’re all very popular and also adjacent.”
The result is that we now know that tio and ion co-occur, and that ons frequently co-occurs next to either con or ion.
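One way to read “very popular and also adjacent” is sketched below. It is an interpretation, not the article's actual algorithm: it assumes two n-grams are adjacent when they start one letter apart inside the same word, and the function name `top_ngram_cooccurrences` and its parameters are hypothetical.

```python
import re
from collections import Counter

def word_ngrams(word, n):
    """All n-letter windows of a single word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def top_ngram_cooccurrences(text, n=3, top=8):
    """For each popular n-gram, count how often it sits next to another popular one."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(g for w in words for g in word_ngrams(w, n))
    popular = {g for g, _ in counts.most_common(top)}

    neighbours = Counter()
    for w in words:
        grams = word_ngrams(w, n)
        # Assumption: "adjacent" means consecutive windows within one word.
        for left, right in zip(grams, grams[1:]):
            if left in popular and right in popular:
                neighbours[left] += 1
                neighbours[right] += 1
    return neighbours

result = top_ngram_cooccurrences("nation nations action", n=3, top=2)
print(result)
```

In this toy input only tio and ion are popular enough to be kept, and they sit side by side in all three words, so each is credited with three co-occurrences.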
If we were to go the next step and form unions with these trigram sets, we’d discover that tio, ion, and ons form one 5-character n-gram, tions, and that adding ati in front extends it to ations. We will find seven instances of tions in the text and 18 instances of tion.
We can now see how the analysis of sequence after frequency reveals a French morpheme: tion
Frequency and Sequence revealing the look of French
We’ve now learned that:
- French words are likely to end in -s
- Les is very popular
- -ion and its bigger brother –tion are popular word endings
- Con- wants to be at the beginning of words, and isn’t likely to be followed by an -s
If we were to train a language model on the Universal Declaration of Human Rights, we might expect it to arrive at very similar assumptions about how to form words after a handful of training epochs.
Patterns discovered in a larger corpus
Using the Methodius CLI, which was developed for this article, an analysis was done of six texts:
- Alcools
- Candide
- Cyrano de Bergerac
- Les Misérables
- Les Plaisirs et Les Jours
- Universal Declaration of Human Rights
Each text was analyzed for its 20 most frequent bigrams, and then all 6 lists of 20 bigrams were merged to produce 30 unique bigrams.
| es | on | ti |
| le | ur | it |
| en | te | io |
| nt | eu | oi |
| de | la | ro |
| re | ie | ns |
| ou | me | il |
| an | er | se |
| ai | ne | is |
| et | qu | ra |
The approach of analyzing each text independently and then merging the list reveals unique characteristics of the texts.
- Removing Universal Declaration of Human Rights: 25 Bigrams
- Removing Alcools, Cyrano de Bergerac, or Les Misérables: 29 Bigrams
- Removing Candide or Plaisirs: 30 Bigrams (i.e. the size of the list is no different)
If removing the Universal Declaration of Human Rights removes 5 unique bigrams from our list, this informs us that it is using a language that’s meaningfully different from the others.
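The merge-and-remove comparison can be sketched with plain sets. The data below is a hypothetical placeholder (three made-up texts with three top bigrams each), not the article's corpus, and the function names are invented for illustration.

```python
def merged_size(top_lists, excluded=None):
    """Size of the union of every text's top-bigram set, optionally leaving one text out."""
    merged = set()
    for name, grams in top_lists.items():
        if name != excluded:
            merged |= grams
    return len(merged)

def unique_contributions(top_lists):
    """For each text, how many bigrams vanish from the merged list when it is removed."""
    full = merged_size(top_lists)
    return {name: full - merged_size(top_lists, excluded=name) for name in top_lists}

# Hypothetical placeholder data, NOT taken from the six French texts.
top_lists = {
    "text_a": {"es", "le", "de"},
    "text_b": {"es", "le", "qu"},
    "text_c": {"es", "io", "ti"},
}
full_size = merged_size(top_lists)
contributions = unique_contributions(top_lists)
print(full_size)
print(contributions)
```

A text whose removal shrinks the merged list the most is the one whose frequent bigrams are least shared with the others.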
When we analyze all six texts at once, we find that the frequencies we observed in the initial UDHR sample aren’t that different from 5 other texts spanning 300 years of French history. Let’s compare bigram frequencies to rates of co-occurrence:
| Bigram | Frequency |
|---|---|
| es | 2.90% |
| le | 2.59% |
| re | 2.46% |
| de | 2.36% |
| en | 2.33% |
| ai | 2.17% |
| ou | 2.02% |
| nt | 1.98% |
| an | 1.82% |
| et | 1.81% |
| on | 1.74% |
| it | 1.67% |
| te | 1.61% |
| er | 1.56% |
| qu | 1.47% |
| la | 1.37% |
| me | 1.32% |
| ur | 1.31% |
| is | 1.24% |
| se | 1.23% |
| Bigram | Co-occurrences |
|---|---|
| es | 17463 |
| nt | 16707 |
| ai | 12684 |
| en | 11747 |
| it | 9731 |
| re | 9560 |
| te | 8815 |
| le | 7414 |
| er | 6943 |
| ur | 5938 |
| an | 5635 |
| me | 5400 |
| is | 4964 |
| se | 4851 |
| de | 4710 |
| ou | 4408 |
| et | 3851 |
| on | 2705 |
| la | 2341 |
| qu | 582 |
French is filled with es, le, re, de, and en, and they are associated strongly with other bigrams of the language.
At this point, it would behoove us to compare our related top bigrams to their placements.
| French Bigram | Co-occurrences |
|---|---|
| er | 6943 |
| re | 9560 |
| ou | 4408 |
| ur | 5938 |
| de | 4710 |
| es | 17463 |
| on | 2705 |
| nt | 16707 |
| an | 5635 |
| it | 9731 |
| te | 8815 |
| le | 7414 |
| se | 4851 |
| en | 11747 |
| et | 3851 |
| is | 4964 |
| ai | 12684 |
| me | 5400 |
| la | 2341 |
| qu | 582 |
| French Bigram | Word Placement |
|---|---|
| er | start: 27 middle: 8882 end: 4819 |
| re | start: 4206 middle: 8187 end: 9243 |
| ou | start: 1395 middle: 16187 end: 196 |
| ur | start: 10 middle: 5997 end: 5578 |
| de | start: 16727 middle: 2086 end: 2019 |
| es | start: 1629 middle: 5148 end: 18782 |
| on | start: 1305 middle: 8813 end: 5248 |
| nt | start: 0 middle: 6061 end: 11385 |
| an | start: 426 middle: 14154 end: 1428 |
| it | start: 21 middle: 3964 end: 10720 |
| te | start: 1606 middle: 5743 end: 6878 |
| le | start: 11140 middle: 5713 end: 5975 |
| se | start: 4391 middle: 3934 end: 2552 |
| en | start: 4140 middle: 14642 end: 1761 |
| et | start: 8872 middle: 5881 end: 1196 |
| is | start: 33 middle: 5685 end: 5210 |
| ai | start: 637 middle: 17625 end: 872 |
| me | start: 2665 middle: 5186 end: 3768 |
| la | start: 7962 middle: 3195 end: 940 |
| qu | start: 9363 middle: 3585 end: 0 |
From this, we immediately identify restrictions for composing words in the French language:
- Er, ur, it, and is rarely occur at the start of a word
- qu has the fewest co-occurrences and never occurs at the end of a word
- Nt is one of the most popular, but it never occurs at the beginning of a word
If we were to conduct a further analysis and then apply the orthography of the language, we could infer the phonotactic constraints of the French language.
Analyzing a Peculiarity: The Frequency of an, he, th in English
Speakers of the English language would likely be delighted to share their own insights in regard to the frequencies of an, he, and th. But they must again be discouraged from sharing their own language intuition so that we can learn how a computer might discover it.
As we did with French, again we shall in English consider placement in addition to frequency. We’ll remind ourselves that placement will fit into the three general categories of start, middle, and end.
If we consider that a bigram has those three general positions, we find that th occurs in the start position of a word 78.8% of the time. It occurs in the middle and end of a word at a respective frequency of 10.6% and 10.4%. English overwhelmingly expects that th begin a word.
The next most common bigram, he, occurs at the end of a word 44% of the time and the middle of a word at a frequency of 34%. While th tends to start words, he is more likely to end them.
Diving deeper into English bigram placement by incorporating trigrams
In looking at our frequency of English bigrams, we’ve observed that an is the most frequent, followed by he and th which share second place. The th and he duo make up 6% of all bigrams. One would be right to think that English likes these bigrams occurring together.
If we were to create a union of these two bigrams based on their preferred positions, that union would be the most popular trigram: the.
We could begin to think babies that hear English for the first time think their language is teeming with thoughts of the. They aren’t wrong.
The most common bigram in English, an, occurs a little more frequently than the he & th twins. When we look at the most common position for an, we find that it starts a word about 56% of the time. An occurs at the middle or end of a word with identical frequency.
This fact is supported by the second most popular trigram, and.
If we were wondering about the nd, we needn’t, because nd is the fourth most common bigram, and it favors ending words about 74% of the time.
So we could guess that babies hearing English would expect that an should probably start words, and that nd would mostly end them.
English is starting to sound interesting, isn’t it?
The Explanation for English n-gram Frequencies
As it would turn out, the reason for th and he occurring so frequently together has an explanation that’s very similar to French’s justification for its les: It’s a definite article.
English text will be replete with determiners and the is its most popular. Etymologists would tell us that the originates from þe, which was a nominative masculine form of a demonstrative pronoun in Old English.
It should also be noted that the th bigram is used in two other English determiners, this and that, as well as their plural forms these and those. We should observe that th sits at the beginning of each.
Thirteen of English’s 100 most common words contain the th bigram:
- The
- That
- With
- This
- They
- There
- Their
- Them
- Other
- Than
- Then
- Think
- These
Within those thirteen words, over half contain our the- trigram:
- The
- They
- There
- Their
- Them
- Other
- Then
- These
Native and non-native English speakers should note a pattern with our popular the- words: They tend to act as determiners or pronouns. In fact, then is the only popular the- word that behaves primarily as an adverb.
It may be reasonable to guess that something in English’s past originated many of our the-related words.
Patterns discovered in a larger corpus
Using the Methodius CLI, an analysis was done of six texts:
- Alice in Wonderland
- The Great Gatsby
- The Adventures of Huckleberry Finn
- Paradise Lost
- The Wizard of Oz
- Universal Declaration of Human Rights
Each text was analyzed for its 20 most frequent bigrams, and then all 6 lists of 20 bigrams were merged to produce 35 unique bigrams.
| he | re | wa | io |
| th | ng | hi | ri |
| in | on | ll | of |
| er | to | or | ar |
| an | en | ea | ro |
| ou | ed | st | |
| it | al | es | |
| nd | as | is | |
| at | li | se | |
| ha | sh | ti | |
The approach of analyzing each text independently and then merging the list reveals unique characteristics of the texts. By selectively not merging each of the lists, we see that Great Gatsby seems to contribute no unique bigrams, while Paradise Lost and the UDHR each contribute 4 unique bigrams.
- Removing Paradise Lost or Universal Declaration of Human Rights: 31 Bigrams
- Removing Wizard of Oz: 32 Bigrams
- Removing Huckleberry Finn or Alice in Wonderland: 33 Bigrams
- Removing Great Gatsby: 35 Bigrams (i.e. the size of the list doesn’t change)
Across this larger corpus, we observe that 79.7% of all th occurrences are at the start of a word and 10.6% are at the end, leaving 9.7% in the middle. English strongly prefers that th initiate words and is barely inclined to end them with it.
Native Speaker Intuition with English Bigrams
As part of this research, an informal survey was sent out to a variety of Information Technology professionals. The survey supplied respondents with a list of the most common bigrams and asked them to create English-sounding words.
Respondents were provided the following instructions:
- Merge or combine any of the two-letter pairs listed below to make new English words
- The word must not exist in the English language to the best of your knowledge
- You may use as many pairs as you want to make a word as long as you want
- Feel free to make up a definition and provide it
- Do not use AI or any software to help you (use your imagination)
Respondents were supplied with the following related bigrams curated from Great Gatsby, Adventures of Huckleberry Finn, Paradise Lost, Wizard of Oz, and the Universal Declaration of Human Rights.
| th | he | en | ha |
| at | er | in | ng |
| an | nd | re | ed |
| on | to | ou | it |
| wa | as | or | ro |
| ar | | | |
The results of this survey show that respondents reinforced many of the rules we have already captured surrounding th.
| Bigram rule | Word set |
|---|---|
| Initial th | thand ther thatore tharing thared thinouas thedou thar thon thator thender |
| Final th | inreth arth arth (provided 2x by different respondents) harth waroth enth wath toth hareth tonderouth |
| Middle th | wathat hathon enthaned hathen arthon oratheit ithas wathar wethor |
| th + he | ther thedou thender hathen oratheit |
We observe that native and non-native English speakers obey the frequency and placement rules that we already identified, even when tasked with creating new words:
- Initial th- is the most common
- Middle -th- and final -th occur about equally often
- When th and he occur together, they frequently form an initial the-
Though it may seem surprising that respondents didn’t create more initial the- words, we must be mindful that the instructions required that the word not exist to the best of the respondent’s knowledge. Since so many of the most frequent English words begin with the, this would have been a difficult task.
Bigram Frequencies in Related Languages
Similarity between two languages’ n-gram frequencies corresponds to similarity between the languages themselves.
We can observe this in the high frequencies of es, os, and de in Spanish, Portuguese, and Catalan. But there are two languages whose frequencies are so similar they could seem identical: English and Scots.
Frequency overlap in English and Scots
English and Scots are both members of the Anglic branch of the Anglo-Frisian languages. As sibling languages spoken on the same island, the two can be expected to have some similarities.
If we compare the 10 most frequent bigrams in both languages, we’ll observe quite a bit of similarity:
- They have 8 bigrams in common
- in, en remain unique to Scots
- of, ti remain unique to English
- The top four bigrams are the same in both languages (an, he, th, and nd) and have similar frequencies
| English Bigram | Frequency |
|---|---|
| an | 3.46% |
| he | 3.16% |
| th | 3.16% |
| nd | 2.94% |
| re | 2.64% |
| er | 2.26% |
| on | 2.18% |
| of | 1.88% |
| es | 1.73% |
| ti | 1.66% |
| Scots Bigram | Frequency |
|---|---|
| an | 3.68% |
| th | 3.53% |
| he | 3.06% |
| nd | 2.98% |
| in | 2.04% |
| en | 1.80% |
| re | 1.72% |
| on | 1.72% |
| es | 1.65% |
| er | 1.57% |
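The overlap between the two top-ten lists can be checked with set operations. The bigram lists below are copied from the two tables above; the Jaccard index at the end is an extra measure added here for illustration and is not part of the original analysis.

```python
english_top10 = ["an", "he", "th", "nd", "re", "er", "on", "of", "es", "ti"]
scots_top10 = ["an", "th", "he", "nd", "in", "en", "re", "on", "es", "er"]

shared = set(english_top10) & set(scots_top10)
only_english = set(english_top10) - set(scots_top10)
only_scots = set(scots_top10) - set(english_top10)

# Assumption: Jaccard index (shared / union) as one simple similarity score.
jaccard = len(shared) / len(set(english_top10) | set(scots_top10))

print(sorted(shared))
print(sorted(only_english), sorted(only_scots))
print(round(jaccard, 2))
```

A set comparison ignores rank and frequency, so it captures only the first three bullet points above; the differences in the th and he frequencies need the numbers themselves.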
The similarities between these two languages are important to note, because Google Translate will misidentify Scots as English. I don’t know what algorithms Google uses for language identification, but they appear to not be focused on n-gram frequency and placement analysis.
If we were to use frequency and placement analysis together, we’d describe a distinction between the two languages like so:
- In English, th and he occur with the same frequency
- In Scots, th is more frequent
- In the English UDHR sample, there are no occurrences of he at the start of a word
- In Scots, he starts a word 20% of the time.
Suffice it to say that in Scots, th and he are less codependent. This fact could help a language detection algorithm distinguish English from Scots. We could then pair this with another observation that Scots has the trigram cht, which doesn’t appear at all in the English UDHR sample. In fact, cht only appears 8 times across all of the English texts analyzed.
Distinguishing English from Scots could be a matter of looking for less codependent bigrams, and more frequent trigrams. The position may not need to be evaluated at all.
Opportunities for exploration
This is only a cursory analysis of a handful of features of two Indo-European languages based entirely on modern texts. We have only focused on small features of a few languages and a small comparative analysis of two related languages. There is much more that we can discover with larger corpora and more language families (additionally, we have only explored languages with Latin scripts).
In one application, we could explore shared frequencies between related languages. If we consider that os and de share similar frequencies in Portuguese and Spanish, we could infer from frequency analysis of these and other n-grams that these languages are related. If we can determine that known members of a language family are related through this analysis, we could then attempt to develop rules that could be extrapolated and applied to other language sets.
In another application, we can potentially identify sound-shift laws similar to Grimm’s Law. Consider the case of the high frequency of es in Catalan, and os in Spanish and Portuguese. The Levenshtein edit distance between them is 1 and the Jaro similarity is 0.67, which helps us determine that these are related morphologies. Given that these two high-frequency n-grams are quantitatively similar, we could identify a trend throughout Iberian languages where /o/ transitions to /e/ by comparing n-grams with similar frequencies and high string similarity.
N-gram analysis has several practical applications. Whether identifying the language a chunk of text is written in, determining language families, or identifying sound-shift laws, they all serve the goal of identifying the beauty in a language’s unique look and song.
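Both similarity figures for es and os can be reproduced with textbook implementations. The sketch below uses the standard definitions of Levenshtein distance and Jaro similarity; the function names are illustrative and it is not taken from the Methodius code.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def jaro(a, b):
    """Jaro similarity between two strings, from 0.0 (no match) to 1.0 (identical)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    out_of_order = 0
    j = 0
    for i in range(len(a)):
        if a_matched[i]:
            while not b_matched[j]:
                j += 1
            if a[i] != b[j]:
                out_of_order += 1
            j += 1
    t = out_of_order // 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3

distance = levenshtein("es", "os")
similarity = jaro("es", "os")
print(distance, round(similarity, 2))
```

For two-letter strings the Jaro match window is zero, so only letters in the same position can match; es and os share just the final s, which gives (1/2 + 1/2 + 1) / 3, or roughly 0.67.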
Works Cited
Etymology Online. “the.” the | Etymology of the by etymonline, Etymology Online, 2024, https://www.etymonline.com/word/the. Accessed 22 July 2024.
“Most common words in English.” Wikipedia, 20 March 2024, https://en.wikipedia.org/wiki/Most_common_words_in_English. Accessed 22 July 2024.
Taylor, Frank. “English Frequencies.” Methodius Demo, Frank M. Taylor.
Taylor, Frank M. Methodius Demo for languages, January 2024, https://experiments.frankmtaylor.com/methodius/#French–tableSet. Accessed 19 February 2024.
Taylor, Frank M. “Bigram and trigram intuition: English.” Bigram and Trigram intuition: English, 28 March 2024, https://docs.google.com/forms/d/1yNufFK3-F-dGU7htjhD6pRTMjdil_c1QFRz7ZNXbfOc/edit#question=1636384217&field=1814842493. Accessed 24 April 2024.
Taylor, Frank M. Six French Samples. Frank M. Taylor, April 2024, https://www.dropbox.com/scl/fo/9e27kvlt1isnuswyd78ik/h?rlkey=y19shfsqb0luadxslf1xp44ew&dl=0.
“-tion – Wiktionary, the free dictionary.” Wiktionary, https://en.wiktionary.org/wiki/-tion#French. Accessed 9 April 2024.
Data and Code
Taylor, Frank. Methodius CLI. April 2024,
https://github.com/paceaux/methodius-cli. Accessed April 2024.
Taylor, Frank. Analysis, April 2024
https://www.dropbox.com/scl/fo/5u358kuk8j6muye2dccc9/h?rlkey=4vtwioa9aonemhsb6a7lhkyp2&dl=0. Produced 8 April, 2024.
Final Notes
This was part of my application to the University of Illinois’ graduate program for Linguistics. I lacked an academic writing sample that I could provide because I graduated college over 20 years ago; all of my academic work was on floppy disk. Lacking any academic writing samples, I chose to do original work, which included writing all of the code and doing the analysis which you have read here. My hope was that this effort would demonstrate my capabilities of working in academia.
It did not.
If you have found flaws in my data, analysis, or conclusions, please gently enumerate them in a comment below as I am not an academic, despite my best efforts.