Lemmatization is quite a long word. Long and infrequent enough to scare non-linguists. However, linguists or not, we are all quite familiar with the process of lemmatization, although we may not know its name. When we need to look up a word in a dictionary, we know that not every form of the word will be listed. If the word is a verb, only the infinitive will appear on behalf of the entire verb family (which includes gerunds, participles, past forms, the -s forms of the 3rd person singular, etc.). That process of converting a word to its headword or lemma (i.e., the form you would look up in a dictionary) is called lemmatization.
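To make the dictionary analogy concrete, here is a minimal sketch in Python. It assumes a tiny hand-written lookup table (LEMMA_TABLE and the lemmatize function are hypothetical names invented for this illustration). A real lemmatizer relies on a full morphological lexicon or on rules, not on a handful of entries.

```python
# Hypothetical, hand-written lookup table: inflected form -> lemma.
# A real lemmatizer would use a full lexicon or morphological rules.
LEMMA_TABLE = {
    "rains": "rain",
    "rained": "rain",
    "raining": "rain",
    "goes": "go",
    "went": "go",
    "gone": "go",
    "going": "go",
}


def lemmatize(word: str) -> str:
    """Return the lemma of a word, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


print([lemmatize(w) for w in ["Went", "raining", "dictionary"]])
# prints ['go', 'rain', 'dictionary']
```

Unknown words are simply returned in lowercase, which is one possible fallback among several.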

Apart from the dictionary experience (or when learning a foreign language), humans do not normally lemmatize in their everyday life. At least not consciously, or not just for the sake of lemmatizing. Machines do. Or should. When a process involves some sort of text or word analysis, the first step is usually to lemmatize all the words in the text, so that the rest of the analysis can work with lemmas instead of individual inflected forms.

Lemmatization also happens to be extremely useful for search, as lemmatizers allow users to get all appearances of a particular word, and not only the exact form the user typed. A practical example of how lemmatization can enhance automatic searches can be found in Refranario, a dictionary of Spanish proverbs. Let’s say we want to check all proverbs in Spanish containing the verb llover, “to rain”. A classic search for the word llover in the search box will return no results: there are no expressions that contain the infinitive llover. However, there might be other forms of the verb llover in the proverbs of Refranario. In Spanish there are around 100 different forms for each infinitive, depending on the tense, person and mood in which the verb is conjugated. One hundred forms of the verb llover is a lot to search for one by one. That’s when a lemmatized search comes in handy. If we do a lemmatized search on Refranario (that is, a search that takes into account the various morphological variations and inflections that a word may have), we will get four proverbs containing forms of the verb llover: two with the present form llueve, one with the participle llovido and one with the past form llovía.
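The difference between a classic search and a lemmatized search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mini-lexicon FORM_TO_LEMMA, the function names and the sample sentences are all invented here. The sentences are not taken from Refranario, and the code does not show how Refranario or the Apicultur API actually work.

```python
import re

# Hypothetical mini-lexicon: a few forms of "llover" mapped to their lemma.
FORM_TO_LEMMA = {
    "llover": "llover",
    "llueve": "llover",
    "llovido": "llover",
    "llovía": "llover",
}

# Invented sample sentences (not taken from Refranario).
SENTENCES = [
    "Hoy llueve mucho",
    "Ha llovido toda la noche",
    "Ayer llovía sin parar",
    "El sol brilla",
]


def tokens(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def exact_search(query, sentences):
    """Classic search: keep sentences containing exactly the typed word."""
    return [s for s in sentences if query.lower() in tokens(s)]


def lemmatized_search(query, sentences):
    """Lemmatized search: compare the lemma of the query with token lemmas."""
    target = FORM_TO_LEMMA.get(query.lower(), query.lower())
    return [
        s for s in sentences
        if any(FORM_TO_LEMMA.get(t, t) == target for t in tokens(s))
    ]


print(exact_search("llover", SENTENCES))       # no sentence matches
print(lemmatized_search("llover", SENTENCES))  # the three "llover" sentences
```

With this toy data, the exact search finds nothing, while the lemmatized search returns the three sentences that contain llueve, llovido and llovía. Typing any of those inflected forms as the query gives the same three results, because the query is reduced to its lemma as well.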

This example shows on a small scale how a lemmatizer can provide better and more relevant search results. Just imagine the possibilities for improving information retrieval when lemmatized searches are applied to large amounts of text.

[Refranario uses the Lemmatization API from Apicultur]

 
