Tuesday, November 07, 2006

Gibberish, Really

A bit over a week ago I said that I would report on my impressions of The Syntax of Valuation and the Interpretability of Features, a 2004 Pesetsky and Torrego paper, after I had read it more carefully.

I can't say I was very impressed.

For those with little background in syntax: the basic topic has to do with rethinking Chomsky's model of so-called features in language. The existence of "features" on words is motivated by the idea that words combine (or don't) in systematic ways. So, for example, in languages which have gender agreement, feminine nouns combine with the feminine forms of adjectives, etc. It is ungrammatical to have a feminine noun modified by a masculine adjective or determiner, etc. Chomsky calls this "features" on the item - where features are just atomic subparts of the semantics or syntax of a word. So, for example, the noun "Buch" in German has its gender feature set to neuter - which is why it's grammatical to say "das Buch" and ungrammatical to say "die Buch" or "der Buch."

In Chomsky's system, features come in two flavors: they're either interpretable/valued or uninterpretable/unvalued. Valued/unvalued has to do with whether the feature is specified in the lexicon. So, for example, we can guess that the gender feature for "Buch" is specified as "neuter" in the lexicon - because "Buch" never varies in gender; it's always just neuter. The items that modify nouns, however, vary. We get "dieses Buch" for "this book" and "dieser Hund" for "this dog." The first instance of "this" is neuter, the second is masculine. So Chomsky hypothesizes that "dies" ("this") is unvalued for gender in the lexicon and picks up its gender feature when it combines with a noun that has a valued gender feature.

The interpretable/uninterpretable distinction has to do with whether a feature makes a semantic contribution to the sentence. Does it mean anything? For Chomsky, this is the same as asking whether it is valued in the lexicon. Features that are valued are automatically interpretable, and those that are unvalued are automatically uninterpretable. And again, this seems to make sense (assuming you want to buy that the fact that "Buch" is neuter matters to any interpretation of the sentence; I'm not so convinced).

Chomsky's mechanism for feature valuation is a process he calls Agree. It works like this: sentences are built up one word at a time, starting with the last words in the sentence and moving up to the front (there are reasons to believe this). If a word with an unvalued/uninterpretable feature gets added, it searches through all the words already in the derivation (i.e. words that got added before it) and looks for a valued version of the same feature. If it finds one, it sets its own value to that value, and then the feature deletes. Chomsky believes this deletion is necessary because uninterpretable features have nothing to say to the semantics - and in Chomsky's system, first we deal with the syntax, and then we deal with the semantics. For reasons of "elegance" and "efficiency" (in perverse uses of the terms, in my opinion, as we'll see later), we don't want excess information floating around by the time the sentence gets to the semantic module.
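Since these mechanics matter for everything below, here's a toy sketch of Agree in Python. To be clear, the representation is my own invention for illustration - Chomsky's system is not actually implemented this way (or at all):

```python
# A toy sketch of Chomsky-style Agree as described above. The
# classes here are my own illustration, not Chomsky's formalism.

class Feature:
    def __init__(self, name, value=None):
        self.name = name
        self.value = value      # None = unvalued (hence uninterpretable)
        self.deleted = False

class Word:
    def __init__(self, form, features):
        self.form = form
        self.features = {f.name: f for f in features}

def merge_and_agree(new_word, derivation):
    """Add new_word on top of the derivation. Any unvalued feature it
    carries probes the words merged before it for a valued match,
    copies the value, and then deletes (before semantics runs)."""
    for probe in new_word.features.values():
        if probe.value is None:
            for goal_word in derivation:
                goal = goal_word.features.get(probe.name)
                if goal is not None and goal.value is not None:
                    probe.value = goal.value   # valuation via Agree
                    probe.deleted = True       # gone before the semantic module
                    break
    derivation.append(new_word)

# "dieses Buch", built bottom-up: "Buch" merges first, then "dies".
derivation = []
merge_and_agree(Word("Buch", [Feature("gender", "neuter")]), derivation)
dies = Word("dies", [Feature("gender")])
merge_and_agree(dies, derivation)
print(dies.features["gender"].value, dies.features["gender"].deleted)
# -> neuter True
```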

Now, there are a couple of features (no pun intended) of Chomsky's system that seem a little clumsy, and it's Pesetsky and Torrego's purpose in the paper to go through and suggest cleaner alternatives. One thing they dislike is the assumption that uninterpretable features are also automatically unvalued and interpretable features are automatically valued. It seems to them that if you're going to have two oppositions (valued/unvalued and interpretable/uninterpretable), you ought to have four kinds of features - i.e. there should also be interpretable/unvalued features and uninterpretable/valued features. Some of their arguments include:


  • Why should the lexicon care whether the semantics is sensitive to a given feature? The idea here is that there's no reason the lexicon should necessarily couple uninterpretable features with unvalued features systematically. On a first read it seems plausible: if syntax can't tell what the semantics will operate on, then how can the lexicon know? Syntax just deletes any valued but uninterpretable feature at the end of a "phase" (Pesetsky and Torrego need to stipulate this to avoid having these features delete as soon as they come out of the lexicon - since such features actually exist in their system). But in fact, I don't find this convincing at all. The fact remains that there are two kinds of features and the semantics is only sensitive to one of them. All the ones that it's not sensitive to have to delete (mysteriously) before it starts its work. It seems to me that whether or not the lexicon bothers to pair feature types by whether the semantics can "see" them, this distinction is still made in the lexicon. That is, even if we have four types of features, the lexicon is still encoding information on the basis of "the semantics can see this" and "the semantics can't see that." More to the point, the idea that there are actually two oppositions and four kinds of features is a too-literal reading of Chomsky's system to begin with. For all we know, what he meant was that there were only two kinds of features, each of which happens to be describable in terms of two oppositions. In any case, the whole process of "deletion" seems silly in the first place. Why can't the semantic module just ignore things that don't apply to it?


  • There are plausible examples of some of the features in question. This one is more convincing. They make a good argument that we see examples of all four types in complementizers. So, for example, they notice that complementizers appear to participate in a kind of agreement. You can say "I wonder what Mary bought" but not "The book what Mary bought." Likewise "I wonder why Mary left" but not "John left why Mary left" - though it's OK to say "the reason why Mary left." Etc. There seems to be reason to assume that which complementizer gets chosen is a kind of agreement - and yet it's also intuitive that interpretation happens on the complementizer and not on the actual sentence following. So we might want an interpretable but unvalued feature on the complementizer (it changes its form based on the value it gets). Not a rock-solid case, but it's more intuitive than Chomsky's system. Of course, there are also valued/interpretable features on complementizers. They cite "if" as a plausible example. It gives the interpretation of a "yes/no" question no matter what it combines with (of course, it has to combine with sentences only - so that's the part that's the "uninterpretable" half (i.e. syntax-only) on the sentence). This again seems plausible - though notice it doesn't quite work. After all, what are we to make of things like "She left?" We can just as easily say "I wonder if she left" and "I wonder why she left" - but in the second case Pesetsky and Torrego seem to want to say that "she left" has a valued/uninterpretable feature that combines with a generic interpretable/unvalued complementizer to yield "why." In the case of "I wonder if she left," then, the same feature on "she left" has to be unvalued/uninterpretable. In other words, the two "she left"s are completely different. But that doesn't seem right! However, I suppose it's not totally implausible, and it could certainly be cleaned up by going into more detail about what's going on (they might, for example, say that "if" in such cases is just an alternate pronunciation of "whether," and that the "if" they're talking about only applies to sentences like "If she left, I'm not going." It still doesn't answer why "if" is necessary - i.e. why we can't just say "She left, I'm not going," but whatever - it's good enough.).


  • The system is more compact on some questions of directionality. The "directionality" they're talking about here involves which features are searching for which. Since in Chomsky's system valued features are always also interpretable, these features are always "goals," which is to say they never do any "searching." This runs into trouble in cases where we have reason to believe that the word that needs valuing was added to the sentence before the word that ultimately gives it its value. What Chomsky has to do in these cases, of course, is postulate two kinds of features on the word that's added later - one of which is an EPP feature that causes the lower word to raise. This keeps happening until it finds something that can give it a value, etc. In P and T's system, directionality is not a factor because it's simply too much to specify different kinds of Agree for all 12 kinds of possible feature interactions (actually 16 if you assume like types can interact) - so they have a system of "feature sharing," in which features that "find" their counterparts simply agree to share an index. If either of them is valued, then both of them now share that value. They become the same feature (rather than one valuing the other and that other then deleting). I'll sketch this sharing mechanism in code below.



The last point in the last justification is crucial: Pesetsky and Torrego have officially adopted a feature-sharing version of Agree. I argued in my last post that this was silly because HPSG has the same thing: they could have just written a paper arguing that HPSG is a cleaner system. Now, people will say that maybe they only want feature-sharing for this one problem, and that the system of nodes and movement (that Minimalism and GB and other transformational models offer) is better overall.
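Since feature sharing is the load-bearing idea here, here's the sketch I promised above, in the same toy style. Again, the encoding (Instance objects pointing at a shared value cell) is my own invention for illustration, not their formalism:

```python
# A toy sketch of feature sharing as I understand it. The encoding
# is my own, not Pesetsky and Torrego's.

class Instance:
    """One word's instance of a feature. Interpretability is fixed per
    instance; the value lives in a cell that Agree can merge."""
    def __init__(self, name, interpretable, value=None):
        self.name = name
        self.interpretable = interpretable
        self.cell = {"value": value, "members": [self]}

def share(inst_a, inst_b):
    """Agree by sharing: merge the two value cells so that every
    instance linked to either now points at one and the same cell.
    If either side was valued, all of them now are."""
    if inst_a.cell is inst_b.cell:
        return
    merged = {"value": inst_a.cell["value"] or inst_b.cell["value"],
              "members": inst_a.cell["members"] + inst_b.cell["members"]}
    for inst in merged["members"]:
        inst.cell = merged

# "dieses Buch" again: the determiner's unvalued gender instance ends
# up sharing one cell with the noun's valued one.
buch = Instance("gender", interpretable=True, value="neuter")
dies = Instance("gender", interpretable=False)
share(dies, buch)
print(dies.cell is buch.cell)   # True: one feature now, not two
print(dies.cell["value"])       # neuter
```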

That's where this paper is ironically very useful. Adopting feature-sharing in a system that allows movement actually gets them into trouble - so to answer the commenter on the original post who seemed concerned that I was being rude and jumping the gun, it turns out that I was right to criticize the paper. Told ya so!

Here's what goes wrong. Pesetsky and Torrego believe - for other reasons, given in a 2001 paper - that in fact English doesn't really have complementizers. What look like complementizers in English are really just tense features that have moved to the C in CP (CP is a node that allows for sentence embedding - it's where complementizers show up). One nice thing this buys them is a principled way to explain the difference between sentences like these:

  1. That Sue bought the book I believe.

  2. *Sue bought the book I believe.



The first is grammatical, but not the second (on the relevant reading - this sentence is of course possible: "Sue bought the book, I believe..." - where "I believe" is added as an afterthought). Why?

Well, Pesetsky and Torrego's system offers a kind of cool explanation. Either the subject (Sue) moves to SpecCP (the position just "ahead of" C), or else the tense features move to C itself (not SpecCP, but actually merge with C and "become" the complementizer). Now - the nice thing about tense features is that they just so happen to be interpretable, so they're what the semantics will see. If you put them on C, then in the first of the two examples above, there is something that the semantics can see/interpret sitting at the top of the phrase "Sue bought the book":


[That Sue bought the book] I believe.


"That" is, in their system, not a complementizer, but it now behaves like one by virtue of its position. It gets pronounced "that," and more importantly, it's interpretable, so the phrase "That Sue bought the book" is "visible" to the semantics. Great.

What happens if you do it the other way? Well, then "Sue" is the first element in the sentence:


[Sue bought the book] I believe.


The only problem is that the relevant feature on Sue (which they're saying is tense) has already been deleted. See - we hit the end of a "phase" (they don't really say what this is, but it's well-defined in countless other papers, so just pick a model), and so all the uninterpretable features delete (or, for them, uninterpretable "instances of features," since in their system features that "Agree" become the same feature) - and now there is nothing left on "Sue" for the semantics to see.
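In toy form (reusing the Instance class from the sharing sketch above - and the interpretability assignments here are my own reconstruction of their account, nothing more):

```python
# At the end of a phase, uninterpretable instances delete; what
# survives at the edge of the fronted clause decides whether the
# semantics can see it.

def end_of_phase(edge_instances):
    """Delete every uninterpretable instance; return the survivors."""
    return [i for i in edge_instances if i.interpretable]

# Option A: interpretable tense moved to C (pronounced "that").
edge_a = [Instance("tense", interpretable=True, value="past")]
# Option B: "Sue" moved to SpecCP; its tense instance is uninterpretable.
edge_b = [Instance("tense", interpretable=False, value="past")]

print(len(end_of_phase(edge_a)))   # 1: something for the semantics to see
print(len(end_of_phase(edge_b)))   # 0: nothing left - hence ungrammatical
```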

So that's a really cool solution.

Here's where it goes wrong...

What if you have a sentence like "That Mary likes Chess annoyed Bill"?

Remember that they are saying that the relevant feature is TENSE. The problem here is that the tenses of the two verbs are different. Unfortunately for them, they shouldn't be, because at an earlier stage in the derivation, "That Mary likes Chess" was actually below "annoyed." It has moved to the top of the sentence. This will, of course, seem silly to anyone who is not familiar with syntax - but in fact movement-based systems (called "transformational" systems) are one of many perfectly valid ways to explain grammaticality. The basic intuition is that some constituents (words or phrases) seem to "participate" in the interpretation of a sentence at more than one level. (For example, when you say things like "The man I saw yesterday was tall," it seems like "The man" should actually be after "saw" in order for us to interpret it as the object of that verb: "(x The man) I saw (x) yesterday was tall." We need it to be in "two places." Saying that it was in one place before and then "moved" somewhere else later is one way to approach this. In fact, it is the most popular way.) HPSG, mentioned earlier, is one theory that does NOT use movement, and I find HPSG pretty convincing, so I don't personally "believe in" movement, but many intelligent people do.

Anyway - the point is that the authors are committed to a movement-based transformational syntax in which subject phrases (like "That Mary likes Chess") start out below the main verb and "raise to subject." But of course, if it was ever below the verb, then the verb must have Agreed with it in its tense feature. More specifically - the "that" part of it did - and the "that" part of it also agreed with "likes." Since these are all now the same feature, the tense of every verb in the sentence should be present! I.e. - it should be "That Mary likes Chess annoys Bill." Certainly that's a legal sentence, but it's just as possible to say "That Mary likes Chess annoyed Bill."

Even if you haven't completely followed this argument, you can at least appreciate that we don't want a system which requires all verbs in sentences of this kind to share tense! So something must be wrong here.
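Here's the propagation in toy form, reusing the sharing machinery from above. To dramatize the prediction, I leave the higher verb's tense unvalued so the sharing chain can value it - again, my reconstruction, not their notation:

```python
# One chain of sharing ends up linking every verb's tense.
likes   = Instance("tense", interpretable=True, value="present")
that    = Instance("tense", interpretable=False)    # unvalued "that"
annoyed = Instance("tense", interpretable=True)     # unvalued at first

share(that, likes)     # "that" Agrees with "likes": now present
share(annoyed, that)   # the clause started out below the main verb

print(annoyed.cell is likes.cell)   # True: a single feature
print(annoyed.cell["value"])        # "present" - the system forces "annoys"
```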

Pesetsky and Torrego "get around" this by saying that what's shared isn't the actual tense so much as the presence of tense. If that isn't ad hoc, I really don't know what is. If "tense" can be "interpretable" only in terms of whether it's present or absent, then I think we really need to rethink our whole engine here. I mean, surely what the semantics cares about is the actual tense? Not just it's presence or lack of same!!!

So this paper is deeply goofy. They build up this huge amount of machinery just to then have to turn around and tear it all down again. And they do this kind of abuse to the system just to explain this one little fact about English! Talk about caught in your own web...

What I think has gone wrong here is the hybridity. They're trying to introduce feature-sharing, which is a feature of non-transformational theories, into a transformational theory. Or - in plain English - if you're going to have feature sharing, you shouldn't allow movement, and vice versa. When you have both feature-sharing and movement, the consequence is that information that should be sectioned off is allowed to propagate through the system in an undisciplined way.

Properly speaking, movement and feature-sharing are two different ways to approach the same problem. It's overkill to use both of them. They both address the issue outlined above: that sometimes information needs to be available at different "levels" in the sentence.

The way GB and its successor "Minimalism" (properly thought of as a "program" built on GB and not a theory of its own, according to Chomsky, though the reality on the ground is that it's a theory) deal with these issues is simply to move constituents - which is fine.

The way that HPSG deals with it is to allow feature sharing between objects. "Rules" are themselves syntactic objects which take members. What information is visible where has to do with the rules. So - everything that Pesetsky and Torrego are trying to deal with in their paper is straightforward in HPSG. There is a rule that takes an NP and combines it with a VP to make a sentence. The NP can be anything that has a particular "visible" feature set right. The addition of a complementizer on a sentence adds the relevant feature. The higher rule is then able to see things like "That Mary likes Chess" as a "noun phrase" and combine it with the "verb phrase" "annoyed Bill." It doesn't matter that the tenses are different because the rule only takes two syntactic objects and outputs a new one. The two objects only agree in the relevant features.
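For concreteness, here's a toy sketch of that picture. Real HPSG uses typed feature structures (and tools like the LKB); this flat encoding, the feature names, and the rule are my own simplification - but it shows why the tenses never have to match:

```python
# A minimal sketch of the HPSG-style picture under a toy flat encoding.

def unify(a, b):
    """Unify two flat feature dicts; fail (None) on any value clash."""
    out = dict(a)
    for key, val in b.items():
        if key in out and out[key] != val:
            return None        # clash: the combination is ungrammatical
        out[key] = val
    return out

def head_subject_rule(subj, vp):
    """S -> NP VP: the subject must unify with what the VP selects.
    Features the rule never mentions (like the clause-internal tense)
    simply don't interact."""
    if unify(subj, vp.get("subj", {})) is None:
        return None
    return {"cat": "S", "tense": vp["tense"]}

# "That Mary likes Chess" acts as an NP once "that" adds the feature;
# its internal tense never has to match the higher verb's.
that_clause  = {"cat": "NP", "comp": "that", "tense": "present"}
annoyed_bill = {"cat": "VP", "tense": "past", "subj": {"cat": "NP"}}
print(head_subject_rule(that_clause, annoyed_bill))
# -> {'cat': 'S', 'tense': 'past'}
```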

It's a much cleaner theory in this way - because words always come in the order you see them on the page. There aren't things moving around willy-nilly, which makes the whole thing more deterministic, easier to program and (crucially) easier to control. The problem with GB/Minimalism is, in fact, well illustrated by this paper: it propagates consequences. Messing with the system here has unanticipated effects somewhere else (in this case - though the authors actually caught it - the side effect that sentences with two verbs should be ungrammatical if the verbs don't share the same tense. Less thorough authors could easily have missed this effect.).

Probably the biggest problem in Syntax today is lack of testing. In fact, it is possible to write computer programs that will "run" your various theories on corpora to check to make sure that all sentences contained within are indeed predicted by your theory to be grammatical. Likewise, it's possible to pass a "bag of words" to a program and have it generate sentences to make sure that you don't get any which shouldn't exist (that is, to test whether your theory predicts sentences to be grammatical that, in fact, are not). The trouble with movement-based theories like GB is that they are hugely difficult to program - and that's because they "run backward." That is, to parse a sentence in GB/Minimalism, what you're really asking is not "does this sentence pass my tests" but instead "could my theory have generated this sentence?" And that's a bummer - because deciding whether your theory could have generated it involves a lot of backtracking - with the problem that, since things can move, there are huge numbers of possible "paths" back to the original numeration.
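In sketch form, the two checks look like this - where parse() and generate() are hypothetical stand-ins for an implementation of whatever theory you're testing:

```python
# A sketch of the two directions of testing. parse() and generate()
# are hypothetical stand-ins for an implementation of the theory.

def undergeneration_check(parse, corpus):
    """Every attested sentence should parse. Returns the failures:
    grammatical sentences the theory wrongly rules out."""
    return [s for s in corpus if parse(s) is None]

def overgeneration_check(generate, bag_of_words, attested):
    """Nothing generated from the bag of words should fall outside what
    speakers accept. Returns sentences the theory wrongly rules in."""
    return [s for s in generate(bag_of_words) if s not in attested]
```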

It's not that Minimalism is an incoherent theory; it's coherent, in fact. The problem with it is that there's just too much going on. It's not an elegant theory - ironically, given that it claims to be paring down all the machinery that GB stipulated.

HPSG is cleaner and easier to test. This paper is a nice illustration of that.

8 Comments:

At 1:40 PM, Blogger noahpoah said...

What causes a feature to delete in Chomsky's version of this? Are, e.g., post-Agree adjectives still uninterpretable? They start out unvalued and uninterpretable and then become valued and uninterpretable after Agree, right?

If this is correct, is there some other process that produces unvalued and interpretable features?

 
At 2:12 PM, Blogger Joshua said...

I think it's the valuation itself that causes them to delete. The way I understand it, they delete as soon as they're valued - so there aren't ever any valued/uninterpretable features in Chomsky's system.

I suppose it wouldn't be an abuse of his system to say they wait until the end of the phase to delete. Of course, the mechanism that causes deletion would then be mysterious, but that's OK because this is just a formal theory, not an actual instantiation.

There are simply no unvalued and interpretable features in Chomsky, so no process produces them. There is also no process that produces them in Pesetsky and Torrego. They exist, but they come that way directly from the lexicon - i.e. are not the result of syntactic processes. This is actually an asymmetry in P and T that I hadn't noticed until I read your comment. A(n instance of a) feature will never change its "interpretability" specification, though it can change from valued to unvalued. Interesting (and messy).

 
At 3:57 AM, Anonymous Anonymous said...

You write: "So - everything that Pesetsky and Torrego are trying to deal with in their paper is straightforward in HPSG. There is a rule that takes an NP and combines it with a VP to make a sentence. The NP can be anything that has a particular "visible" feature set right. The addition of a complementizer on a sentence adds the relevant feature. The higher rule is then able to see things like "That Mary likes Chess" as a "noun phrase" and combine it with the "verb phrase" "annoyed Bill." It doesn't matter that the tenses are different because the rule only takes two syntactic objects and outputs a new one. The two objects only agree in the relevant features."

But that sounds almost *exactly* like the part of Pesetsky & Torrego's paper that you considered so ad hoc earlier in your message. Why is it ok for HPSG to stipulate that some features are "relevant" and others not, but it's not ok for a Minimalist paper to do that?

Perhaps the crucial difference lies in the fact that HPSG accepts "ad hoc" as the norm. If everything is ad hoc, then of course nothing will look *more* ad hoc than anything else.

The better Minimalist research, in contrast, tries to distinguish what we think we really understand from what we do not. The inevitable result is that bits of it look "cool" and other bits "ad hoc". Consequence: we know where the achievements are and also where the problems are. We know what to worry about next. That's rarely so obvious in HPSG papers.

 
At 2:05 PM, Blogger Joshua said...

Anonymous writes:

"But that sounds almost *exactly* like the part of Pesetsky & Torrego's paper that you considered so ad hoc earlier in your message. Why is it ok for HPSG to stipulate that some features are "relevant" and others not, but it's not ok for a Minimalist paper to do that?"

That some features are relevant and others are not to a given grammatical phenomenon is not under dispute here. I wouldn't, for example, be so silly as to say that Tense has anything to say about whether it's better to put "die" or "das" in front of "Buch" in German. The decision there involves the Gender feature; everyone agrees on that. My bone to pick with Pesetsky and Torrego has nothing to do with their choosing one feature over others to describe the relevant interaction. It has to do with stipulating ad hoc mechanisms to patch up unwanted side effects of the explanation they offer.

Your comment here betrays an ignorance of HPSG mechanisms, by the way. In HPSG, unification proceeds over all features in the two syntactic objects when they combine. We don't get to choose which apply and which don't: they all apply. Some of them will be relevant to the phenomenon in question, others will just be vacuous unifications - but this isn't something that the Linguist decides directly. HPSG is committed to specifying features on various words in such a way that those words then combine to form all and only the grammatical sentences in a given language. This, in fact, requires a great deal of care, as anyone who has messed with the LKB parser can testify.

HPSG does not accept "ad hoc" as the norm. Quite the contrary - it makes a very clear philosophical commitment to a particular description of grammaticality and sticks to that commitment. The machinery involved in an HPSG derivation is truly minimal: it involves feature unification and nothing else. This may or may not turn out to be an adequate model of grammaticality, but it is nobody's idea of "ad hoc."

You write: "The better Minimalist research, in contrast, tries to distinguish what we think we really understand from what we do not."

Correct. The better Minimalist research does indeed do that. The point I have been making is that Pesetsky and Torrego's 2004 paper is not an example of this kind of Minimalist research.

Now - what you seem to be suggesting is that Minimalism is a better theory because it admits all mechanisms and then tries to pare down to which ones it really needs. That's a plausible defense of Minimalism, and if Minimalism is indeed such a theory then it's a plausible approach to solving the problem of finding all and only the grammatical sentences.

However - it seems to me that this misses the point of what theories of grammar are. A good theory makes a certain commitment to describing the mechanisms that are involved - to saying what grammaticality ultimately is and how language works. It seems like a better idea to allow various theories to pursue the consequences of their assumptions as far as they will go. We can amalgamate them later - when we have a better idea what the strengths, weaknesses, and limitations of each are.

(I point out again that it is hybridity that got Pesetsky and Torrego into trouble in this paper. By combining the "heavy lifters" of two very different approaches to syntax, they ended up unable to restrain information propagation. Their mechanisms were too powerful, and it led (predictably) to absurd predictions about tense mismatch which they then had to stipulate away.)

It is in any case clear that Minimalism is not "such a theory." It is not simply a collection of explorations of all available mechanisms. There are very definitely shared assumptions among Minimalist researchers, and theories like HPSG that do not share those assumptions are not welcome in their circles.

You write:

"Consequence: we know where the achievements are and also where the problems are. We know what to worry about next. That's rarely so obvious in HPSG papers."

I think you must be reading the wrong HPSG papers.

 
At 8:40 PM, Anonymous Anonymous said...

"HPSG does not accept 'ad hoc' as the norm. Quite the contrary - it makes a very clear philosophical commitment to a particular description of grammaticality and sticks to that commitment. The machinery involved in an HPSG derivation is truly minimal: it involves feature unification and nothing else. This may or may not turn out to be an adequate model of grammaticality, but it is nobody's idea of 'ad hoc.'"

You gotta be kidding. The machinery in an HPSG grammar includes enormously complex lexical entries (albeit data-compressed in a "multiple default inheritance hierarchy") which specify for each class of item the syntactic context in which it may occur. Crucially, they do this in as messy and baroque a fashion as you're likely to find in linguistics.

What have we got in our multiple default inheritance hierarchy? Let's see... binary features, features that take other feature matrices as their values, features whose values are lists, features whose values are lists with operations (append, shuffle, etc.) performed on them, features whose values are functions from the values of other features, features whose values are lists some of whose members are functions from the values of other features -- and lots more.

There are indeed insights to be found in HPSG papers (I'm particularly fond of Wechsler's work on Balinese, for example), but a minimal theory it ain't. And I think my previous remarks remain apropos, because separating out the insights from the "let's get it to crank out the right answer" stipulations is seldom an easy task -- precisely because of the chaotic jumble of feature-types that HPSG allows itself.

 
At 2:30 PM, Blogger Joshua said...

Anonymous writes:

"The machinery in an HPSG grammar includes enormously complex lexical entries (albeit data-compressed in a "multiple default inheritance hierarchy") which specifies for each class of item the syntatic context in which it may occur."

I agree that transformational theories are more compact here - since the types are built into the system and don't have to be specified in the lexicon. That's one advantage they have. However, I suspect that reality will turn out to be that information about category is distributed in precisely the way HPSG predicts. I find it difficult to swallow the notion of a small handful of innately-determined and inflexible grammatical categories of the kind that GB/Minimalism seem to want. Real language on the ground is less straightforward. Of course, which is right (if either) is an empirical question; we'll just have to wait and see. Hopefully some time over the break I can put together a more general defense of HPSG over Minimalism-type theories and flesh out these points a bit more. My main purpose in this post was to point out the inadequacies of the Pesetsky and Torrego paper - which annoyed me.

Further:

"Crucially, it does this in as messy and baroque a fashion as you're likely to find in linguistics."

I disagree, but I can see your point. In particular, what I don't like here is the use of the word "baroque," which implies lots of superfluous detail for the sake of detail. I am aware of very few superfluous details in HPSG; most of what's been added (of course, not all - I've yet to see a perfect theory) has been added on the basis of the need to capture relevant data.

Now, before you misquote me here - I'm not accusing Minimalism in general of inventing machinery just for the sake of having it - just Pesetsky and Torrego in this particular paper. There are cases to be made against other prominent Minimalist papers on this account, of course (and I hope to make some of them in the coming months), but no doubt one or two could be made against HPSG as well.

I can see why you would think the variety of types of features in HPSG is undesirable - but I think you will find that with a review of the data there is simply no other way to capture all the phenomena of grammaticality we are trying to capture. The pantheon of features is there because we have reason to believe they exist in reality. There is therefore nothing "ad hoc" about it: everything is consistent with the theory and motivated by the need to describe grammaticality. Minimalism is nothing like streamlined itself. To appropriate a famous Golda Meir quote: "I don't know whether HPSG is any better than Minimalism in this area, but it is certainly no worse."

 
At 5:49 PM, Anonymous Anonymous said...

"There is therefore nothing 'ad hoc' about it: everything is consistent with the theory and motivated by the need to describe grammaticality."

But that's more or less the very definition of ad hoc -- a theory easily and readily modified as needed to describe facts.

Now there are worse things than describing facts correctly, and it's often an essential activity (not to mention hard). But my hope is that there are deeper explanations for many of those facts, and I hope we can be at least tolerant of papers that strive to uncover them - even when they fall short of perfection. Especially if we ourselves do not yet have something better to offer.

 
At 3:13 PM, Blogger Joshua said...

Oh please! There's nothing "intolerant" in pointing out the flaws in an obviously flawed paper, especially when one spells out his case in detail, as I have. Nor is it "intolerant" to prefer HPSG to Minimalism. It is, however, completely laughable for people who post comments like "ad hoc is the norm in HPSG" and call HPSG "messy and baroque" to then turn around and lecture people on tolerance. I can't help but get the impression that there's a personal stake in this for you. Care to tell us who you are?

 
