Computational Linguistics

Explaining Computational Linguistics to friends and family

It is hard enough to explain what we are doing to our professors; explaining it in plain English to our friends and family is nearly impossible. So it is always good to see people who can explain what a POS tagger is and why it is important without having to throw around references to Norvig or Jurafsky. Markus Dickinson has managed to do exactly that in his non-linguistic primer to a serious research paper on Detecting Errors in Part-of-Speech Annotation.

Bulk converting doc files into txt (or html)

I have written before about converting Microsoft Word files into text or html using OpenOffice. However, the wizards I described in that article crashed once the number of files reached several hundred. I wrote some macros to do the conversion, but they were scary looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. DocConverter by Danny Brewer and Dan Horwood converts a whole directory of files at a time between any formats that OpenOffice understands.

On uselessness of pretending to be somebody else

While reading the Weka Data Mining book, I came across this impressive example of using machine learning to confirm a person’s authorship (p. 358). In the 19th century, there lived a famous rabbinic scholar, Ben Ish Chai, who among other writings had two collections of letters. Ben Ish Chai claimed that only one collection was his and that the other was somebody else’s, which he had merely found. Modern scholars thought both collections were his, but could not prove it conclusively because the writing styles differed.

Parsing jumping jacks

What could Computational Linguistics and aerobics have in common? Quite a lot, as it turns out. Dance descriptions, while not really in English, do have a regular structure and can be thought of as a sub-language with a full set of syntactic, semantic and pragmatic levels. There are the basic words of the language (move names), correct ways of putting them into a sentence (a routine), and so on all the way up to good flowing text (classes that do not hurt the participants).
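
To make the "words" and "sentences" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the move names, their beat lengths, and the "count move, count move" notation are hypothetical and not taken from any real routine format.

```python
import re

# Hypothetical lexicon: the "words" of the sub-language and the beats each
# move takes. Both the names and the numbers are made up for illustration.
MOVES = {"jumping jack": 2, "grapevine": 4, "step touch": 2, "march": 1}

# One step of a routine is written as "<count> <move name>".
STEP = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")

def parse_routine(text):
    """Parse 'N move, N move, ...' into a list of (count, move) pairs."""
    steps = []
    for chunk in text.split(","):
        match = STEP.match(chunk)
        if match is None:
            # Syntax error: the chunk is not of the form "<count> <move>".
            raise ValueError(f"not a step: {chunk!r}")
        count, move = int(match.group(1)), match.group(2).lower()
        if move not in MOVES:
            # Lexical error: the move is not a word of the language.
            raise ValueError(f"unknown move: {move!r}")
        steps.append((count, move))
    return steps

def total_beats(steps):
    """A crude 'semantic' check: how many beats the whole routine lasts."""
    return sum(count * MOVES[move] for count, move in steps)
```

Calling `parse_routine("4 jumping jack, 2 grapevine")` yields two steps, and `total_beats` can then check whether the routine fills a musical phrase. The pragmatic level (whether the class is safe and flows well) is well beyond a sketch like this.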

Upgrading to GATE 4? Beware of leftover configuration files.

From time to time I experiment with the GATE NLP toolkit. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem: the ANNIE system would not load correctly. Later, when I uninstalled the older GATE version, it stopped loading altogether. The problem is the user configuration file gate.xml, which is stored in a shared location, usually the home directory. On Windows, that is C:\Documents and Settings\[ProfileName].
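
A quick way to check for such a leftover file is sketched below in Python. It assumes that gate.xml sits directly in the user's home directory, as described above. The helper names are hypothetical, and renaming the file aside rather than deleting it is an assumption about a safe remedy, not a documented GATE procedure.

```python
from pathlib import Path

def find_leftover_config(home=None, name="gate.xml"):
    """Return the path to the config file in the home directory, or None.

    `home` defaults to the current user's home directory; the file name
    gate.xml comes from the post, the rest is an illustrative assumption.
    """
    home = Path(home) if home is not None else Path.home()
    candidate = home / name
    return candidate if candidate.is_file() else None

def set_aside(path):
    """Rename the file to '<name>.bak' so it can be restored if needed.

    Note: if a .bak file already exists, rename() fails on Windows and
    overwrites it on Unix-like systems.
    """
    backup = path.with_name(path.name + ".bak")
    path.rename(backup)
    return backup
```

Running `find_leftover_config()` before starting the new version shows whether an old file is present; `set_aside` keeps a copy in case the old settings are still wanted.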