Anyone who knows me knows that I have a bit of a thing for languages. I like studying them and I like speaking them. But I also like learning how they…work. A few years ago I read a book called The Secret Life of Pronouns, which got me interested in sentiment analysis. The central thesis of James Pennebaker’s book is that, by evaluating a small subset of the words a person uses, you can learn a lot about that person: gender, age, whether the person is a subordinate, etc. The only catch is that, well, you have to know what a pronoun is. Or a definite article. Or an adjective.
And that got me thinking: I need a thing that can do that. I need a part-of-speech tagger.
Which also got me thinking, “I should make a part of speech tagger”.
So I did.
The Name is Isidore
Jeff Atwood is right: naming things is hard. So I decided to look for patron saints of languagey things. Saint Isidore of Seville was a scholar who lived in Seville, Spain. He wrote this book, Etymologiae, which, like the name sounds, is full of etymology. Etymology is the study of the origin of words. And…well, I want to study words.
Saint Isidore is also considered the patron saint of the internet. And I work on the internet.
So Isidore seems like an appropriate name.
What does Isidore do?
Right now, Isidore does part-of-speech tagging. Feed it some text and it’ll tell you whether each word is a pronoun, a verb, an adjective, or an adverb. What you do with that information is up to you.
Eventually I want Isidore to be able to do some analysis: tell you how many verbs, adjectives, adverbs, etc. you have in a block of text. But for right now, it’s a part-of-speech tagger.
Feed it something like this:
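// Load the package installed from npm (assumes a CommonJS require).
const isidore = require('isidore');
const { Sentence } = isidore;
const mySentence = new Sentence('He gives him a car.');
const { wordList } = mySentence; // one tagged object per word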
And you get a result like this:
Sentence {
  text: 'He gives him a car.',
  language: 'En',
  rawWordList: [ 'he', 'gives', 'him', 'a', 'car' ],
  wordList:
    [
      Pronoun {
        partOfSpeech: 'pronoun',
        word: 'he',
        referent: 'animate',
        gender: 'masculine',
        type: 'subject',
        person: 3,
        quantity: 'singular'
      },
      Verb {
        partOfSpeech: 'verb',
        word: 'give',
        type: 'transitive',
        valence: 2
      },
      Pronoun {
        partOfSpeech: 'pronoun',
        word: 'him',
        referent: 'animate',
        gender: 'masculine',
        type: 'object',
        person: 3,
        quantity: 'singular'
      },
      Adjective {
        partOfSpeech: 'adjective',
        word: 'a',
        type: 'article',
        degree: undefined
      },
      Noun {
        partOfSpeech: 'noun',
        word: 'car',
        type: 'entityClass',
        subType: 'common',
        inflection: undefined
      }
    ],
  type: 'declarative'
}
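Until Isidore can do the counting itself, a tally is easy to sketch by hand. The snippet below is only an illustration: countPartsOfSpeech is a hypothetical helper, not part of Isidore, and the plain objects stand in for the tagged words in the result above. The only thing it assumes is that each entry carries a partOfSpeech property, as the output shows.

```javascript
// Hypothetical helper (not part of Isidore): tally words by part of speech.
function countPartsOfSpeech(wordList) {
  const counts = {};
  for (const { partOfSpeech } of wordList) {
    // Start each tag at zero the first time we see it.
    counts[partOfSpeech] = (counts[partOfSpeech] || 0) + 1;
  }
  return counts;
}

// Plain objects standing in for the tagged words of 'He gives him a car.'
const wordList = [
  { partOfSpeech: 'pronoun', word: 'he' },
  { partOfSpeech: 'verb', word: 'give' },
  { partOfSpeech: 'pronoun', word: 'him' },
  { partOfSpeech: 'adjective', word: 'a' },
  { partOfSpeech: 'noun', word: 'car' },
];

const counts = countPartsOfSpeech(wordList);
console.log(counts);
// { pronoun: 2, verb: 1, adjective: 1, noun: 1 }
```

In principle the same function should work on a real wordList, since it only reads the partOfSpeech field.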
Isidore isn’t too special
Natural Language Processing and part-of-speech tagging aren’t new. They’ve been around for a hot minute. So Isidore isn’t particularly special. I will call out two things that might make Isidore a bit different from other PoS utilities:
- It’s written in JavaScript. As most NLP utilities are written in Python, this is a differentiator.
- It’s written with multiple languages in mind. I’ve done my level-best to not be English-centric in my approach.
Why Use Isidore?
Really the idea is powered by what James Pennebaker talked about in his book: discover neat facts about a person by the pronouns (or other words) they use. But be able to do it in more languages with a part-of-speech tagger that’s built to support more languages.
Where’s the code?
Of course, it’s on GitHub. The master branch is currently version 0.0.4, and I’m already working on version 0.0.5 (my develop branch). You’ll see that I’ve set up issues and a project within GitHub, so you can see what I’m working on.
If you want to install it and try it out, it’s as easy as:
npm install isidore
You can check out the package on npm, if you’re really curious.
As usual, I’m interested in feedback and open to contributions.