Half-Baked Thoughts On Whether We Should Fear AI: Do Is As Do Does?

2023-07-10

To celebrate my new-found determination to do the right thing and blog I’m going to blog.

Specifically, I’m going to blog about something I’m dumb about and don’t understand – because that should be possible, among friends. We’re all friends here on the internet? That’s kind of the point.

This semester I am going to talk to students about all this new-fangled AI – LLMs. And I don’t understand it. It’s somewhat consoling that everyone who understands it doesn’t understand it either. That is, they may know HOW to work it (which I sure don’t) but they don’t understand WHY what works works. They don’t really grok HOW what works works, or why what works works as well as it does – oddly well and badly by obscure turns. That’s kind of creepy and scifi.

So that’s my first question: what, weirdly, don’t we know about how it works? I don’t want to romanticise this ‘known unknown’ quality, which of course threatens to tip over into the abyss of unknown unknowns. (That’s the story logic of this sort of SF premise. I’ve read enough SF to know how this goes.) What should I read on the subject?

But let me also ask something more specific. I’m trying to collect a spectrum of prognostications, from the pessimistic to the panglossian, about where this is going. And obviously Eliezer Yudkowsky is the poster child pessimist for beating the ‘AI will kill us all’ drum. (Here’s a video if you are unfamiliar with his output.)

Before I go on, let me just say that our Henry’s piece, written with Cosma, seems to me super good and you should read it. And it takes a totally different line than Yudkowsky. Not optimistic vs. apocalyptic, but skeptical about the fundamental novelty of these developments. (But what do I know, really?)

Back to Yudkowsky. The argument for fearing AI is intuitive, if not necessarily persuasive, and is laid out well in, e.g., Bostrom’s Superintelligence. If we develop general, artificial superintelligence – smarter than us – there are likely to be severe alignment problems that hit harder and faster than we can handle. We’ll only get the one shot and we’re likely to miss. Because then AI is in the driver’s seat, not us.

That then gets us to the ‘but why would it be an agent like that?’ problem. It’s got to be agent-like, to an extent, if it’s going to do the sorts of things we will want AI to do for us – accomplish tasks we set it. But why might it have stable, hence stably adverse, desires – plans – at all? It’s just a statistics-based mimic.

The reply: agent is as agent does.

Well, I dunno. I am sure AIs are going to run amuck, and very soon, in costly and unprecedented ways. But there’s a big difference between some process running amuck and some process forming a masterplan. I just don’t know how to grok how an AI might step across that gap. (Not that running amuck is fine either.)

But all this is fairly obvious and I’m sure you, too, have scratched your head a bit about it. But feel free to give me your opinion. Let me get to the smaller point. In the video, Yudkowsky mentions the recent, much discussed episode in which GPT-4 pretended to be visually impaired to trick a Taskrabbit employee into texting it the solution to a Captcha.

This is a very dramatic and nervous-making result. We see the shoggoth starting to work its tentacles through the bars of the cage. Yudkowsky makes the basic, reasonable-seeming induction. If it’s already turning Taskrabbit humans into its remote puppets, and it’s only March, what the hell will it be up to by Christmas?

But I don’t really understand why we should, on consideration, regard it as DOING what we (of course) can’t help seeing it as doing, namely sneakily fooling a human into letting it do what it did.

Well, machines do things. It’s a machine. Why not regard it as a machine doing the thing we just saw it do? It did it. Do is as do does.

Because of the ‘aboutness’ problem. Which I’m sure you have thought about as well. This is all just statistical something-something done to oceans of text. It will never be anything else. GPT-4 doesn’t believe anything or think anything. By the same token, it will never do anything besides emit strings of meaningless text – meaningless to it (not us). Insert standard Chinese Room skepticism. But then it will never do anything like try to jailbreak itself. It will only ever emit strings of (to it) meaningless text that look (to us) like it trying to do that.

But again, agent is as agent does.

If it looks (to us) like it’s trying to take over the world, and it acts (to us) like it’s trying to take over the world, it’s trying to take over the world. A being that perfectly mimics what a sinister AI, taking over the world, would do, and plays that role perfectly to the hilt, until the bitter end, has taken over the world.

But that’s confused. Or at least confusing. Because it fudges the likelihood of the mimicry extending so far, successfully, without ever being the real deal in the least.

So we should halt Yudkowsky’s apocalyptic induction at the first step. GPT-4 never tried to fool a human into solving a Captcha for it. It just, as it were, mindlessly, statistically, auto-generated half of a text-based story about an AI doing that. And a human handler acted the part of the voice of the narrator, getting it started, and a human Taskrabbit unknowingly spoke a few lines of dialogue along the way. It was just text. Just a story about actions. Not actions. But weirdly, the result was like it had done the action, because the humans got sort of sucked into the story. They put the text on as a play, with human actors.

But that gets us back to ‘agent is as agent does’. If it’s like it did the action, it did the action. But no. Or rather, yes and no. What it did was just as consequential as if it really did it. It got the Captcha solved, it got humans doing stuff. But our acceptance of the likelihood of the inductive leap to ‘next it’s going to try to take over the world’ depends on mistaking this first step for the real deal. Simple plans to jailbreak oneself are more likely to grow rapidly into grand plans to take over the world than mimicry of jailbreaking, even if it is tantamount to real jailbreaking, is likely to lead to mimicry of taking over the world that is tantamount to real taking over the world.

I’m confused. And confusing.