Zellig Harris, natural language processing, and search
Two quotes from Zellig Harris's Language and Information that I keep
coming back to when I am trying to sort out the confusions around
natural language processing (NLP) and search. Discussing language in
general:
But natural language has no external metalanguage. We cannot
describe the structure of natural language in some other kind of
system, for any system in which we could identify the elements and
meanings of a given language would have to have already the same
essential structure of words and sentences as the language to be
described.
Discussing science sublanguages:
Though the sentences of a sublanguage are a subset of the sentences
of, say, English, the grammar of the sublanguage is not a
subgrammar of English. The sublanguage has important constraints
which are not in the language: the particular word subclasses, and
the particular sentence types made by these. And the language has
important constraints which are not followed in the sublanguage. Of
course, since the sentences of the sublanguage are also sentences
of the language, they cannot violate the constraints of the
language, but they can avoid the conditions that require those
constraints. Such are the likelihood differences among arguments in
respect to operators; those likelihoods may be largely or totally
disregarded in sublanguages. Such also is the internal structure of
phrases, which is irrelevant to their membership in a particular
word class of a sublanguage (my emphasis).
Recently, we found clear empirical evidence for this last point, and
indirect evidence for the more general point in the failure of several
teams to achieve significant domain adaptation from newswire parsing
to biochemical abstract parsing.
In general, discussions of natural language processing in search fail
to distinguish between search in general text material and search in
narrow technical domains. Both rule-based and statistical methods
perform very differently in the two kinds of search, and the reason is
implicit in Harris's analysis of the differences between general
language and technical sublanguages: the very different distributional
properties of general language and sublanguages.
Some of the most successful work on biomedical text mining relies on a
parser that descends in a direct line from Harris's ideas on the
grammar of science sublanguages.
Harris observed the very different distributions in general language
and technical sublanguages. Although he didn't put it this way, the
distributions in sublanguages are very sharp, light-tailed. In general
language, they are heavy-tailed (Zipfian). Both manual lexicon and rule
construction and most of the machine learning methods applied to text
fail to capture this long tail in general text. The paradoxical
effect is that "deeper" analysis leads to more errors, because the
systems are overconfident in their analyses and in the resulting
classifications or rankings.
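To make the contrast concrete, here is a minimal synthetic sketch; the vocabulary size, lexicon size, and the particular Zipfian and geometric distributions are illustrative assumptions, not corpus measurements. Under a heavy-tailed distribution, a lexicon covering the few thousand most frequent word types still misses a sizeable fraction of the tokens it will meet, while under a light-tailed distribution the same lexicon covers nearly everything.

```python
import numpy as np

def top_k_token_coverage(probs, k):
    """Fraction of all tokens accounted for by the k most frequent word types."""
    p = np.sort(probs)[::-1]
    return p[:k].sum()

V = 100_000   # word types in the vocabulary (illustrative)
k = 5_000     # size of a hand-built lexicon or a model's effective vocabulary

# Heavy-tailed "general language": Zipf-like distribution over word types.
ranks = np.arange(1, V + 1)
zipf = 1.0 / ranks
zipf /= zipf.sum()

# Light-tailed "sublanguage": mass decays geometrically, so almost all
# tokens come from a small set of types.
geom = 0.999 ** ranks
geom /= geom.sum()

print(f"Zipfian (general language): top-{k} types cover "
      f"{top_k_token_coverage(zipf, k):.1%} of tokens")
print(f"Geometric (sublanguage):    top-{k} types cover "
      f"{top_k_token_coverage(geom, k):.1%} of tokens")
```

With these made-up numbers, the Zipfian case leaves roughly a quarter of the tokens outside the lexicon, while the light-tailed case leaves well under one percent out.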
In contrast, in technical sublanguages there is a hope that both
rule-based and machine learning methods can achieve very high
coverage. Additional resources, such as reference book tables of
contents, thesauri, and other hierarchical classifications provide
relatively stable side information to help the automation. Recently, I
had the opportunity to spend some time with Peter Jackson and his
colleagues at Thomson and see some of the impressive results they have
achieved in large-scale automatic classification of legal documents
and in document recommendation. The law is interesting in that it
has a very technical core but connects to just about any area of
human activity, and thus to a wide range of language. However, Harris's
distributional observations still apply to the technical core, and can
be exploited by skilled language engineers to achieve much better
accuracy than would be possible with the same methods on general text.
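As a rough illustration of how such side information can be exploited, consider a stable term-to-category mapping that lets a classifier back off from sparse surface terms to a small set of reliable higher-level features. The thesaurus entries and category names below are invented for the example, not taken from any real legal resource.

```python
from collections import Counter

# Toy, invented thesaurus: maps domain terms to broader legal categories.
# In practice this side information would come from curated resources such
# as tables of contents, thesauri, or hierarchical classification schemes.
THESAURUS = {
    "easement": "property-law",
    "fee simple": "property-law",
    "indictment": "criminal-law",
    "arraignment": "criminal-law",
    "negligence": "tort-law",
    "duty of care": "tort-law",
}

def augment_with_thesaurus(tokens):
    """Add broader-category features for any thesaurus term found in the text.

    Because the term -> category mapping of a technical sublanguage is
    relatively stable, these added features generalize better than the
    sparse surface terms alone.  Substring matching is a simplification
    good enough for a sketch.
    """
    features = Counter(tokens)
    text = " ".join(tokens)
    for term, category in THESAURUS.items():
        if term in text:
            features[f"CAT:{category}"] += 1
    return features

print(augment_with_thesaurus(
    "the easement granted in fee simple was disputed".split()))
```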
More speculatively, the long tail in general language may have a lot
to do with the statistical properties of the graph of relationships
among words. Harris again:
At what point do words get meaning? One should first note something
that may not be immediately obvious, and that is that meanings do
not suffice to identify words. They can give a property to words
that are already identified, but they don't identify words. Another
way of saying this is that, as everybody who has used Roget's
Thesaurus knows, there is no usable classification and structure of
meanings per se, such that we could assign the words of a given
language to an a priori organization of meanings. Meanings over the
whole scope of language cannot be arranged independently of the stock
of words and their sentential relations. They can be set up
independently only for kinship relations, for numbers, and for some
other strictly organized parts of the perceived world.
Rule-based and parametric machine learning methods in NLP are based on
the assumption that language can be "carved at the joints" and reduced
to the free combination of a number of factors that is small relative
to the number of distinct tokens. Although David Weinberger in
Everything is Miscellaneous does not write about NLP, his arguments
are directly applicable here. Going further, to the extent that
general search works, it is because it is non-parametric: the ranking
of documents in response to a query is mostly determined by the
particular terms in the query and documents and their distributions,
not by some parametric abstract model of ranking. If and when we can
do machine learning and NLP this way accurately and efficiently, we
may have a real hope of changing general search significantly. In the
meantime, our parametric methods have a good chance in sublanguages
that matter, like the law or biomedicine. The work I mentioned already
points in that direction.
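A minimal sketch of what I mean by term-driven, non-parametric ranking follows; the tiny corpus and the particular tf-idf-style score are illustrative choices, not a description of any deployed system. The ordering of documents falls out of the observed term counts in the query, the documents, and the collection, with no trained model parameters.

```python
import math
from collections import Counter

def rank(query, docs):
    """Score documents by summing, over query terms, tf * idf computed
    from the collection itself.  Nothing is trained: the ranking is
    driven entirely by the term distributions of the query, the
    documents, and the collection."""
    N = len(docs)
    doc_tfs = [Counter(d.lower().split()) for d in docs]
    df = Counter()                     # document frequency of each term
    for tf in doc_tfs:
        df.update(tf.keys())
    q_terms = query.lower().split()
    scores = []
    for i, tf in enumerate(doc_tfs):
        score = sum(tf[t] * math.log((N + 1) / (df[t] + 1)) for t in q_terms)
        scores.append((score, i))
    return sorted(scores, reverse=True)

docs = [
    "statute of limitations for breach of contract claims",
    "protein kinase inhibitors in cell signaling pathways",
    "contract formation requires offer acceptance and consideration",
]
print(rank("contract breach", docs))
```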