Sunday, 17 February 2008

Zellig Harris, natural language processing, and search

Two quotes from Zellig Harris's Language and Information, which I keep coming back to when I am trying to figure out the confusions of natural language processing (NLP) and search. Discussing language in general:

But natural language has no external metalanguage. We cannot describe the structure of natural language in some other kind of system, for any system in which we could identify the elements and meanings of a given language would have to have already the same essential structure of words and sentences as the language to be described.

Discussing science sublanguages:

Though the sentences of a sublanguage are a subset of the sentences of, say, English, the grammar of the sublanguage is not a subgrammar of English. The sublanguage has important constraints which are not in the language: the particular word subclasses, and the particular sentence types made by these. And the language has important constraints which are not followed in the sublanguage. Of course, since the sentences of the sublanguage are also sentences of the language, they cannot violate the constraints of the language, but they can avoid the conditions that require those constraints. Such are the likelihood differences among arguments in respect to operators; those likelihoods may be largely or totally disregarded in sublanguages. Such also is the internal structure of phrases, which is irrelevant to their membership in a particular word class of a sublanguage (my emphasis).

Recently, we found clear empirical evidence for this last point, and indirect evidence for the more general point, in the failure of several teams to achieve significant domain adaptation from newswire parsing to biochemical abstract parsing.

In general, discussions of natural language processing in search fail to distinguish between search in general text and search in narrow technical domains. Both rule-based and statistical methods perform very differently in the two kinds of search, and the reason is implicit in Harris's analysis: general language and technical sublanguages have very different distributional properties.

Some of the most successful work on biomedical text mining relies on a parser that descends in a direct line from Harris's ideas on the grammar of science sublanguages.

Harris observed the very different distributions of general language and technical sublanguages. Although he didn't put it this way, the distributions in sublanguages are very sharp and light-tailed, while in general language they are heavy-tailed (Zipfian). Both manual lexicon and rule construction and most of the machine learning methods applied to text fail to capture the long tail of general text. The paradoxical effect is that "deeper" analysis leads to more errors, because analysis systems are overconfident in their analyses and in the resulting classifications or rankings.
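
To make the contrast concrete, here is a minimal sketch of my own (with made-up toy corpora, not an experiment from this post): measure how much of a corpus is covered by its most frequent word types. In a heavy-tailed general corpus the top types still leave a long tail uncovered; in a narrow sublanguage corpus, coverage saturates much faster.

    from collections import Counter

    def coverage(tokens, top_k):
        # Fraction of all token occurrences accounted for by the
        # top_k most frequent word types.
        counts = Counter(tokens)
        total = sum(counts.values())
        top = sum(c for _, c in counts.most_common(top_k))
        return top / total

    # Toy, invented corpora; a real comparison would use large samples
    # of newswire text and of, say, biomedical abstracts.
    general = "the cat sat on the mat while the old dog slept".split()
    sublang = "protein binds receptor protein inhibits kinase receptor".split()

    for k in (2, 4):
        print(k, round(coverage(general, k), 2), round(coverage(sublang, k), 2))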

In contrast, in technical sublanguages there is hope that both rule-based and machine learning methods can achieve very high coverage. Additional resources, such as reference-book tables of contents, thesauri, and other hierarchical classifications, provide relatively stable side information to help the automation. Recently, I had the opportunity to spend some time with Peter Jackson and his colleagues at Thomson and to see some of the impressive results they have achieved in large-scale automatic classification of legal documents and in document recommendation. The law is very interesting in that it has a very technical core, but it connects to just about every area of human activity and thus to a wide range of language. However, Harris's distributional observations still apply to the technical core, and they can be exploited by skilled language engineers to achieve much better accuracy than would be possible with the same methods on general text.
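
As a toy illustration of the kind of side information involved (my own sketch, with an invented vocabulary, not anything from Thomson's systems), a small controlled vocabulary can map sublanguage terms to classes in a hierarchy, and a document can be tagged with every class whose terms it mentions:

    # Hypothetical controlled vocabulary, invented for illustration.
    thesaurus = {
        "duty of care": "torts/negligence",
        "negligence": "torts/negligence",
        "consideration": "contracts/formation",
        "offer": "contracts/formation",
    }

    def classify(text, vocabulary):
        # Return the hierarchy classes whose terms appear in the text.
        lowered = text.lower()
        return {cls for term, cls in vocabulary.items() if term in lowered}

    print(classify("The plaintiff alleged a breach of the duty of care.", thesaurus))
    # -> {'torts/negligence'}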

More speculatively, the long tail in general language may have a lot to do with the statistical properties of the graph of relationships among words. Harris again:

At what point do words get meaning? One should first note something that may not be immediately obvious, and that is that meanings do not suffice to identify words. They can give a property to words that are already identified, but they don't identify words. Another way of saying this is that, as everybody who has used Roget's Thesaurus knows, there is no usable classification and structure of meanings per se, such that we could assign the words of a given language to an a priori organization of meanings. Meanings over the whole scope of language cannot be arranged independently of the stock of words and their sentential relations. They can be set up independently only for kinship relations, for numbers, and for some other strictly organized parts of the perceived world.

Rule-based and parametric machine learning methods in NLP are based on the assumption that language can be "carved at the joints" and reduced to the free combination of a number of factors that is small relative to the number of distinct tokens. Although David Weinberger in Everything is Miscellaneous does not write about NLP, his arguments are directly applicable here. Going further, to the extent that general search works, it is because it is non-parametric: the ranking of documents in response to a query is mostly determined by the particular terms in the query and in the documents, and by their distributions, not by some abstract parametric model of ranking. If and when we can do machine learning and NLP this way, accurately and efficiently, we may have a real hope of changing general search significantly. In the meantime, our parametric methods have a good chance in sublanguages that matter, like the law or biomedicine. The work I mentioned already
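
What "non-parametric" means here can be shown with a bare-bones term-weighting ranker (a sketch of my own, not a description of any particular engine): the score of a document depends only on the query terms and the observed term statistics of the collection, with no trained model parameters.

    import math
    from collections import Counter

    def rank(query, documents):
        # Rank tokenized documents against a tokenized query using a
        # simple tf-idf score computed from the collection itself.
        n = len(documents)
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(doc))
        scores = []
        for doc in documents:
            tf = Counter(doc)
            score = sum(tf[t] * math.log(n / doc_freq[t])
                        for t in query if doc_freq[t])
            scores.append(score)
        return sorted(range(n), key=lambda i: scores[i], reverse=True)

    docs = ["the court held the contract void".split(),
            "the protein binds the receptor".split()]
    print(rank("contract void".split(), docs))   # -> [0, 1]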

