A recent article on LingPipe discussed conjoined named entities such as Johnson and Johnson and Wallace and Gromit. It suggests that one way of treating them may be as frozen expressions. I assume that means relying on statistical measures: if the multi-word expression repeats often enough, it is treated as a unit.
In the United Nations corpus, things can get even more interesting. Let’s look at a relatively easy example: draft resolution A/56/L.28 and Add.1.
Is this one document (one draft resolution) or two? And if two, which two? The first is obviously A/56/L.28. But Add.1 is not a valid document symbol on its own; it is actually an (additive?) coreference to the first one and resolves to A/56/L.28/Add.1.
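To make the resolution step concrete, here is a minimal sketch of how a mention like the one above could be expanded into full symbols. The regular expressions and the function name resolve_symbols are my own assumptions for illustration, not an established grammar of document symbols, and they only cover the simple "X and Add.N" shape.

```python
import re

# Assumed shape of a full symbol: capital letters, then slash-separated parts,
# e.g. "A/56/L.28". This is a simplification, not an official grammar.
FULL = r"[A-Z]+(?:/[A-Za-z0-9.]+)+"
# Assumed relative suffixes that cannot stand alone as symbols.
SUFFIX = r"(?:Add|Corr|Rev)\.\d+"
CONJOINED = re.compile(rf"(?P<base>{FULL})\s+and\s+(?P<suffix>{SUFFIX})")


def resolve_symbols(text):
    """Return the full symbols implied by each 'X and Add.N' mention in text."""
    symbols = []
    for match in CONJOINED.finditer(text):
        base = match.group("base")
        symbols.append(base)
        # The suffix is resolved by attaching it to the preceding full symbol.
        symbols.append(f"{base}/{match.group('suffix')}")
    return symbols


print(resolve_symbols("draft resolution A/56/L.28 and Add.1"))
# → ['A/56/L.28', 'A/56/L.28/Add.1']
```

This deliberately treats the second conjunct as dependent on the first, which is exactly what separates it from a conjunction of two independent names.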
The answer (as good as I can make it so far) could lie in the FRBR distinction between Expression and Manifestation. A resolution is an expression of Member States’ proposals and negotiations. To some degree, it evolves over several meetings. However, between discussions, the latest version or changes need to be published, both to register them formally and to ensure the next round of discussions has the latest documents to work from.
In our case, the first time the draft resolution had to be presented it was published under A/56/L.28 (which incidentally means a limited distribution document 28 of the General Assembly’s 56th regular session). So, the initial Manifestation of the draft resolution became this physical document with a distinct symbol assigned.
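The reading of the symbol given above can be sketched as a small parser. The field names (organ, session, series, number, modifier) and the pattern are my own assumptions and only cover symbols shaped like this example; real symbols come in more varieties.

```python
import re
from typing import NamedTuple, Optional


class Symbol(NamedTuple):
    organ: str               # "A" = General Assembly
    session: int             # 56 = 56th regular session
    series: str              # "L" = limited distribution
    number: int              # 28 = document number within the series
    modifier: Optional[str]  # e.g. "Add.1", or None for the base document


# Assumed shape: ORGAN/SESSION/SERIES.NUMBER with an optional /MODIFIER.N part.
PATTERN = re.compile(
    r"^(?P<organ>[A-Z]+)/(?P<session>\d+)/(?P<series>[A-Z]+)\.(?P<number>\d+)"
    r"(?:/(?P<modifier>[A-Za-z]+\.\d+))?$"
)


def parse_symbol(symbol):
    """Split a symbol such as 'A/56/L.28' into its assumed components."""
    match = PATTERN.match(symbol)
    if match is None:
        raise ValueError(f"unrecognised symbol shape: {symbol!r}")
    return Symbol(
        match["organ"],
        int(match["session"]),
        match["series"],
        int(match["number"]),
        match["modifier"],
    )


print(parse_symbol("A/56/L.28"))
# → Symbol(organ='A', session=56, series='L', number=28, modifier=None)
```

Note that a bare Add.1 raises ValueError here, which matches the observation that it is not a valid symbol by itself.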
But apart from its text, a draft resolution has a list of sponsoring Member States. That list can change as the draft resolution gains sponsors. These additional sponsors were listed in the addendum A/56/L.28/Add.1. But the addendum does not make sense without the original document, so both physical documents actually represent one logical draft resolution, which is reflected in the grammar of the text (draft resolution, not resolutions).
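One way to picture the "two physical documents, one logical draft resolution" idea is to group symbols by their base document. This is only a sketch under my own assumption that a trailing Add.N part always marks an addendum to the symbol before it; the function name and the symbol A/56/L.29 in the example are made up for illustration.

```python
from collections import defaultdict


def group_manifestations(symbols):
    """Group document symbols (manifestations) under their base symbol."""
    groups = defaultdict(list)
    for symbol in symbols:
        parts = symbol.split("/")
        # Assumption: a final "Add.N" component points back to the base document.
        if parts[-1].startswith("Add."):
            base = "/".join(parts[:-1])
        else:
            base = symbol
        groups[base].append(symbol)
    return dict(groups)


print(group_manifestations(["A/56/L.28", "A/56/L.28/Add.1", "A/56/L.29"]))
# → {'A/56/L.28': ['A/56/L.28', 'A/56/L.28/Add.1'], 'A/56/L.29': ['A/56/L.29']}
```

Each key then stands for one logical draft resolution, and its list holds the physical documents that carry it.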
What this means for named entity annotation and for recognition algorithms is hard to say; it is something I am looking at in my PhD research.