Computational Linguistics

As part of doing a PhD in Computational Linguistics, I need to understand both computers and linguistics. I am fine with computers, but linguistics is not my strong point. Unfortunately, many of the linguistics books and resources are quite dry. So, I was really happy to discover an audio course, Story of Human Language, from The Teaching Company, taught by John McWhorter. It is quite long and covers a lot of material, but - apart from some overly long parts on universal language - it is really interesting and Professor McWhorter is a great presenter.

Arthur C. Clarke once famously wrote “Any sufficiently advanced technology is indistinguishable from magic”. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions. Yet bureaucracy often has its place and reason. Laws protect exploitable minorities; procedures serve to avoid known problems; cross-referencing forms are filled in triplicate to allow for audit and protection against falsification.

When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MB of model files. The model reader uses a basic unbuffered file stream. The result is an excessive number of system calls (and disk accesses) during parser startup. The fix is extremely simple:

1. In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace new FileInputStream(f) with new BufferedInputStream(new FileInputStream(f), 1000000).
2. Recompile the maxent library.
3. Deploy the new version of maxent-2.4.0.jar into OpenNLP’s lib directory.

The comparison is striking (the numbers are file access system calls):
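As a rough illustration of the buffering change described in this post, here is a minimal, self-contained sketch. It is not the actual ObjectGISModelReader code: the class name BufferedModelRead and its helper methods are made up for this example, and only the 1000000-byte buffer size is taken from the fix above. The sketch reads a file one byte at a time, which is the access pattern where an unbuffered stream may issue a system call per read, while the buffered stream serves most reads from memory.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class; not part of OpenNLP or the maxent library.
class BufferedModelRead {

    // Opens the file the way the post describes the original reader: no buffering.
    static InputStream openUnbuffered(File f) throws IOException {
        return new FileInputStream(f);
    }

    // Opens the file with the one-line fix: a BufferedInputStream with a ~1 MB buffer.
    static InputStream openBuffered(File f) throws IOException {
        return new BufferedInputStream(new FileInputStream(f), 1000000);
    }

    // Reads the stream byte by byte, closes it, and returns the number of bytes read.
    static long countBytes(InputStream in) throws IOException {
        long count = 0;
        try (InputStream stream = in) {
            while (stream.read() != -1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of any large file, for example a model file.
        File f = new File(args[0]);

        long start = System.nanoTime();
        long unbuffered = countBytes(openUnbuffered(f));
        long mid = System.nanoTime();
        long buffered = countBytes(openBuffered(f));
        long end = System.nanoTime();

        System.out.println("unbuffered: " + unbuffered + " bytes, " + (mid - start) / 1000000 + " ms");
        System.out.println("buffered:   " + buffered + " bytes, " + (end - mid) / 1000000 + " ms");
    }
}
```

Both variants return the same byte count; only the number of underlying reads differs. Timings will vary with the operating system's file cache, so a system call tracer gives a more reliable comparison than wall-clock time.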

I was not able to get the OpenNLP parser to work. There were no samples to play with, no command-line tools to run. And I don’t even want to talk about documentation. That’s because there was not any. There was an attempt at a lame joke (at least that’s the only sense I can make of the what.html file), but no actual documentation. Finally, I pinged my research colleague who did get the toolkit working (thanks Scott).

Bikel’s statistical parser is designed to be run from the command line. I need to run it from my own code. The following wrapper seems to do the trick on Windows (with your own values for |parserdir|):

Story of Human Language – great introductory audio course on linguistics

Unravelling the black magic of bureaucracy

Reducing disk thrashing of OpenNLP/MaxEnt parser – with one line code change

Getting OpenNLP parser to work

Running Bikel’s parser programmatically