Introducing Methodius

Folks who get to know me usually (and regrettably) discover that I am a language nerd. I like learning languages and I like learning about languages. There are all sorts of things that are fascinating about languages: where they come from, why they sound a certain way, why grammar is what it is. But lately, I’ve been fascinated by what a language looks like.

When I say, “what a language looks like,” I mean, “when I read a bunch of text on a screen, what is it about that text that makes it Englishy, Frenchy, Irishy, or Basquey.”

Lately I’ve thought about this so much that I created a JavaScript library just for figuring this out.

Image: “Cyril and Methodius were very proud of the script they created. Latin was for dorks.” Cyril and Methodius, the conlangers of the 9th century. Painted by Zahari Zograf. Public domain.

The Name is Methodius

My previous language-based JavaScript library was named after a saint so I decided to keep the tradition rolling. This is named after Saint Methodius, brother of Cyril, who’s kinda known for making that Cyrillic alphabet. Methodius and Cyril created alphabets, translated the Bible, and generally advocated for the Slavic peoples to have liturgy in their native languages.

The fact that they created writing systems was kinda cool. And this is a library to analyze writing (kinda). So Methodius works.

What does Methodius do?

Methodius is focused on analyzing text in a very peculiar way by breaking up text into equally-sized chunks called n-grams.

What’s an n-gram?

An n-gram is an n-sized sequence of adjacent things. Those things could be amino acids, proteins, words, or letters. In the case of Methodius, it’s letters (but it may eventually expand to words).

Given the sentence “The quick brown fox,” the two-letter sequences (known as bigrams) would be [th, he, qu, ui, ic, ck, br, ro, ow, wn, fo, ox] and the three-letter sequences (trigrams) would be [the, qui, uic, ick, bro, row, own, fox]. Notice that the n-grams stay inside each word; they never cross a space.
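
Here’s a minimal sketch of how that chunking could work. It is not Methodius’s actual code: `extractNgrams` is a made-up helper, and it assumes that text is lowercased, that only runs of letters count as words, and that n-grams never cross word boundaries.

```javascript
// Hypothetical helper (not part of Methodius): collect the letter n-grams inside each word.
function extractNgrams(text, size = 2) {
  // Lowercase the text, then treat every run of Unicode letters as a word.
  const words = text.toLowerCase().match(/\p{L}+/gu) ?? [];
  const ngrams = [];
  for (const word of words) {
    // Slide a window of `size` letters across the word.
    // Words shorter than `size` produce no n-grams at all.
    for (let i = 0; i + size <= word.length; i += 1) {
      ngrams.push(word.slice(i, i + size));
    }
  }
  return ngrams;
}

console.log(extractNgrams('The quick brown fox', 2));
// th, he, qu, ui, ic, ck, br, ro, ow, wn, fo, ox
console.log(extractNgrams('The quick brown fox', 3));
// the, qui, uic, ick, bro, row, own, fox
```

Changing the second argument is all it takes to move from bigrams to trigrams or beyond.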

Why n-grams?

Every language produces certain n-grams at different frequencies in written form. For instance, “an” and “th” are way more common in English text than in French. But French is way more likely to have “re” and “le” than you’d ever see in Spanish, where “en” and “de” are most frequent.
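
To make “frequency” concrete, here’s a small sketch that tallies bigrams and picks the most common ones. The names `countNgrams` and `topNgrams` are invented for this example and are not the library’s API; the sample sentence is a toy and far too short to say anything about English.

```javascript
// Hypothetical helper: count how often each letter n-gram appears in a text.
function countNgrams(text, size = 2) {
  const counts = new Map();
  const words = text.toLowerCase().match(/\p{L}+/gu) ?? [];
  for (const word of words) {
    for (let i = 0; i + size <= word.length; i += 1) {
      const gram = word.slice(i, i + size);
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
  }
  return counts;
}

// Hypothetical helper: keep only the `limit` most frequent n-grams.
// Ties keep their first-seen order because Array.prototype.sort is stable.
function topNgrams(counts, limit = 5) {
  const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  return new Map(sorted.slice(0, limit));
}

const sampleCounts = countNgrams('the cat sat on the mat');
console.log(topNgrams(sampleCounts, 2));
// at → 3, th → 2
```

Run the same tally over a few thousand words per language and the differences described above start to show up in the top of the list.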

From this, we find two very interesting applications:

Text encryption programs that encrypt text by finding ways to set bigrams to occur with the same frequency
Translation programs (like Google Translate) that detect which language a bit of text is written in by looking at bigram frequencies

So if you’ve got something that can analyze n-gram frequencies, what you’ve got is a handy way to figure out if text is encrypted, and if not, what language it might be written in!
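
One way to turn frequencies into a language guess is to compare a text’s bigram counts against counts from known samples. The sketch below uses cosine similarity, which is my choice for illustration and not necessarily what Methodius or any translation service does. `bigramProfile` and `cosineSimilarity` are made-up names, and the sample sentences are toys; real profiles would need far more text.

```javascript
// Hypothetical helper: map each bigram in the text to its count.
function bigramProfile(text) {
  const profile = new Map();
  const words = text.toLowerCase().match(/\p{L}+/gu) ?? [];
  for (const word of words) {
    for (let i = 0; i + 2 <= word.length; i += 1) {
      const gram = word.slice(i, i + 2);
      profile.set(gram, (profile.get(gram) ?? 0) + 1);
    }
  }
  return profile;
}

// Cosine similarity between two count maps:
// 0 means no bigrams in common, values near 1 mean nearly identical proportions.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [gram, count] of a) {
    dot += count * (b.get(gram) ?? 0);
  }
  const norm = (profile) =>
    Math.sqrt([...profile.values()].reduce((sum, value) => sum + value * value, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Toy samples only; they are far too short to be reliable.
const english = bigramProfile('the quick brown fox jumps over the lazy dog');
const french = bigramProfile('le renard brun saute par-dessus le chien paresseux');
const mystery = bigramProfile('the dog and the fox');

console.log('vs English:', cosineSimilarity(mystery, english));
console.log('vs French:', cosineSimilarity(mystery, french));
```

With these toy samples the English score comes out higher, since the mystery text shares bigrams like “th” and “he” with the English sample and none with the French one.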

The handy thing about an n-gram is that its size is arbitrary. Whether it’s a bigram, trigram, or bigger, you’re choosing to ignore the rules about how words are formed (morphology) so that you can discover the rules for how to form words.

How to use Methodius

Import it and feed it a chunk of text

import {Methodius} from 'methodius';

const othello = new Methodius('never tell me! I take it much unkindly That thou, Iago, who hast had my purse As if the strings were thine, shouldst know of this.');

Once you’ve done that, you’ve a whole slew of properties and methods to use for learning about the text:

const bigrams = othello.bigrams; // ['ne', 'ev', 've', 'er', 'te', 'el', 'll', 'me', 'ta', 'ak' ...]
const topBigrams = othello.getTopBigrams(5); // Map {th → 5, in → 3, ha → 3, ho → 3, st → 3}
const uniqueBigrams = othello.uniqueBigrams; // ['ne', 'ev', 've', 'er', 'te', 'el', 'll', 'me', 'ta', 'ak' ...]

And not only can you get facts about frequencies, you can get facts about placement, too.

You find out how often a particular bigram or trigram appears at the start, middle, or end of a word:

const bigramPlaces = othello.bigramPositions; // Map {ne → {start: 1, middle: 0, end: 1}, ev → {start: 0, middle: 1, end: 0} ...}
const letterPlaces = othello.letterPositions; // Map {n → {start: 1, middle: 5, end: 0}, e → {start: 0, middle: 4, end: 6} ...}
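
If you’re curious how placement could be tallied, here’s a rough sketch. It is not the library’s implementation: `ngramPositions` is a hypothetical helper, and it assumes an n-gram at the first index of a word is a “start,” one at the last possible index is an “end,” and everything else is “middle.” An n-gram that spans a whole word is counted as a start only, which is my guess and may differ from what Methodius does.

```javascript
// Hypothetical helper: tally where each n-gram sits inside the words of a text.
function ngramPositions(text, size = 2) {
  const positions = new Map();
  const words = text.toLowerCase().match(/\p{L}+/gu) ?? [];
  for (const word of words) {
    // The last index at which an n-gram of this size still fits in the word.
    const lastIndex = word.length - size;
    for (let i = 0; i <= lastIndex; i += 1) {
      const gram = word.slice(i, i + size);
      const entry = positions.get(gram) ?? { start: 0, middle: 0, end: 0 };
      // Assumption: an n-gram covering the whole word counts as a start only.
      if (i === 0) entry.start += 1;
      else if (i === lastIndex) entry.end += 1;
      else entry.middle += 1;
      positions.set(gram, entry);
    }
  }
  return positions;
}

console.log(ngramPositions('never thine').get('ne'));
// { start: 1, middle: 0, end: 1 }
```

Passing 1 as the size gives letter positions instead of bigram positions.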

And even more fun (at least for me), there’s the ability to find out which of the most frequent n-grams occur together. By default it’s set to get the 20 top bigrams, but as you raise or lower the second argument, you could get different results.

othello.getRelatedTopNgrams() // Map { ne → 2, ev → 1, ve → 1, er → 1, te → 1, el → 1, ll → 1, ta → 1, ak → 1, ke → 1, … }
othello.getRelatedTopNgrams(2, 4) // Map { th → 2, ha → 1, ho → 1 }
othello.getRelatedTopNgrams(2, 8) // Map { th → 2, ha → 1, ho → 2, ou → 2, in → 1, ne → 1 }

Methodius probably isn’t too special

N-gram analysis has existed for a while as it’s a component of computational linguistics. But computational linguistics tends to use Python. So Methodius isn’t breaking new ground. I’ll call out a few things that might make it different, though:

Written in JavaScript. Since most NLP is written in Python, it’s fun to have something that can work in a web browser.
Written in TypeScript. Having everything typed means the linguist / developer experience is fairly pleasant.
Feature Rich. Beyond frequencies, you can discover where things are placed, which words have those things, produce a tree of n-grams, get mean and median word sizes, and even compare one Methodius instance to another.

Where’s the Code?

As usual, it’s on GitHub. If you’re ready to play with it, you’ll find it over on npm as well. You just need a lil’ install:

npm install methodius

You may notice that it’s already at 2.0.0! The leap from 1.2 to 2.0 was a conversion to TypeScript paired with the ability to actually find n-grams that were related to each other.

Want to see it in action?

There’s an interactive demo of Methodius with fifteen European languages for you to explore. The demo is also on GitHub if you wanted to add more samples, too.

What’s Next?

I’m probably going to write a longer, more linguistic-centered article on techniques in language identification (i.e. “the look of a language”).

After that, a few topics come to mind like:

More support for abjads, particularly Arabic
More features for word n-grams
Features for calculating similarities between n-grams & words

As always, contributions welcome.

Frank M Taylor
