Adventures with Deep Research: success then failure

by John Q on December 9, 2025

I’ve long been interested in the topic of housework, as you can see from this CT post, which produced a long and unusually productive discussion thread [fn1]. The issue came up again in relation to the prospects for humanoid robots. It’s also at the edge of a bunch of debates going on (mostly on Substack) about living standards and birth rates.

I’m also interested (like nearly everyone, one way or another) in “Artificial Intelligence” (scare quotes intentional). My current position is, broadly, that it’s what Google should have become instead of being steadily enshittified in the pursuit of advertising dollars. But I’m alert to other possibilities, including that more investment will deliver something that genuinely justifies the name AI. And I think a lot of the concerns about power and water use, the spread of AI slop and so on are either overstated or (as with deepfakes) are mostly new iterations of concerns that always arise with new IT and communications technology, and can be addressed with existing conceptual and legal tools.

With this background, I thought it would be interesting to try out ChatGPT’s Deep Research (DR) on the question of what has happened to housework and why. As I may have mentioned before, I’ve trained DR on a big volume of my own writing. That produces a passable imitation of my style, and means I don’t worry about the ethical issues of plagiarising the writing style of others (of course, standard norms of citation and attribution still apply).

I decided to focus on single-person households, to abstract away from the issues of child-raising (which I want to look at separately) and the allocation of work between partners (about which there is a vast literature to which I can’t add anything new).

Everything went really well to start with. I prompted DR for time use data, then pushed on with more detailed questions, like the impact of air fryers on male cooking habits (I was given one recently and was impressed enough that I promptly bought a second). I asked for a literature search and got references to Judy Wajcman and Michael Bittman [fn2], both of whom I knew, and a couple of people I didn’t. DR missed Ruth Schwartz Cowan’s classic More Work for Mother.

On the other hand, I wasn’t aware of Wajcman’s recent Pressed for Time and hadn’t thought about the broader issue of life administration, which DR pointed out. I gave it a more economistic take, trying to separate labour-saving innovation (electronic bill paying) from the labour costs of more digital consumption (retrieving passwords for streaming services, etc.).

I got DR to produce a LaTeX file, and was nearly ready to go to digital press when I noticed that the references were incomplete. At this stage, the whole process spiralled into disaster. Every draft seemed to lose more material, and to be worse written. Finally, I demanded an explanation.

DR graciously explained that this is the way it does things. Here’s its response (more detailed version here):

Why Deep Research Fails

The core issue: The “illusion of continuity” in long-context LLMs

Large-context models appear to preserve every part of the conversation, but in reality:

They do not reliably retain or retrieve long-form outputs they themselves generated.

Even though the conversation window may be large, the model often cannot:

- accurately locate the version you consider the “full text,”
- reliably differentiate between multiple drafts (short, medium, long) created earlier,
- or accurately regenerate from earlier output unless it is verbatim re-supplied by the user.

This is because the internal contextual representation compresses earlier tokens. Information is not stored sequentially; it is represented in a latent space, which cannot always reconstruct past outputs faithfully.

Fortunately, my LaTeX program took continuous backups, so I’ve been able to retrieve the last good version, but I’m going to keep it away from DR for now.

I was going to go on with more detail about the actual report, but my op-ed training leads me to feel that a post should have 700 words, and I am at 675 as I type this.

fn1. I can take a victory lap on my jihad/crusade against ironing, which has disappeared almost entirely, contradicting the expectations of many commenters.

fn2. Not the same as food writer Mark Bittman, who came in for some criticism in the 2012 thread.

{ 15 comments }

1

oldster 12.10.25 at 1:36 am

“Fortunately, my LaTeX program took continuous backups, so I’ve been able to retrieve the last good version, but I’m going to keep it away from DR for now.”

That sentence appears within the indented block, as though it is part of what DR said. But that would make the contrast between “I” and “DR” nonsensical. So, I assume they are actually your words, and that the quotation from DR ends one sentence earlier, with the words, “…outputs faithfully.”

Interesting to read the thread from 2012. I recognized most of the commenters. I wonder how many have died in the intervening 13 years, how many have retired, how many have despaired of commenting.

2

David Y. 12.10.25 at 3:20 am

References are where these systems are falling apart most conspicuously. They simply don’t “look at” a reference, “find” a “relevant” passage, and “cite” it. They produce text that looks plausibly citation-ish for the type of writing you’re producing, building on your own examples (if provided) but also other writing that the model associates with writing like yours, including the citations therein. But the system doesn’t really “know” what a citation is. I have seen invented/merged sources, wrong publishers, page numbers that are wildly out of range. I’m not sure these kinds of problems are solvable with a generative approach.

I’m dealing with this mostly as somebody who’s attempting to teach writing in this crazy world, though. Diligent graders, editors, judges, etc. are all finding that references and summaries often go astray from the source text, if the model hasn’t invented the source text outright.

I don’t really mind the AI revolution, but the degree to which it’s a useful shortcut is being wildly overestimated. You gotta read what it writes if you’re going to pretend that you wrote it.

3

John Q 12.10.25 at 5:49 am

David Y: True, but it’s exposing bad behaviour that predates LLMs. People are only caught by hallucinated citations because they’ve been reproducing other people’s citations without checking. On occasion I’ve gone through a long chain of such citations to find an original source saying something quite different from the final one (“eight glasses of water a day” is a classic example).

4

Aardvark Cheeselog 12.10.25 at 5:07 pm

They do not reliably retain or retrieve long-form outputs they themselves generated

This is why you want to incorporate a version control system (VCS) in your workflow. Write the prompts and save them to files. Record the responses to files. Do it all in a local git repository, and commit changes to the files as you work.

Learn to use git log to inspect the history of your repository and retrieve past versions of things, and, most important, communicate with the LLM about the commit hashes of the things you’re interested in so that it can go back and refresh its own memory.
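
Something like this minimal sketch of the loop, where the file names and the commit hash are just placeholder examples:

    # one-time setup: a local repository for prompts and responses
    git init ai-notes
    cd ai-notes

    # each round: save the prompt and the model's response to files
    # (prompt.md and response.md are just example names), then commit
    git add prompt.md response.md
    git commit -m "Draft 3: references section complete"

    # later: inspect the history of the response file and recover an
    # earlier version (a1b2c3d stands in for a real hash from git log)
    git log --oneline -- response.md
    git show a1b2c3d:response.md > response-draft3.md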

Get the git client.

Read The Book.

5

Aardvark Cheeselog 12.10.25 at 5:19 pm

Looking back at the old item, one para leapt out at me:

Looking a bit more broadly, the picture is mixed. The household appliances that first came into widespread use in the 1950s (washing machines, vacuum cleaners, dishwashers and so on), eliminated a huge amount of drudgery, but technological progress for the next forty years or so was pretty limited. The only truly significant innovation I can date to this period is the microwave oven.

IDK if you were around for the earlier part of the era you’re talking about. Laundry work got easier almost all the way through the 50 years after the start of that time, due mostly to improvements in the equipment. I mean, have you ever seen a 1950s washing machine?

Dishwashers. Dishwashing machines used to be an unaffordable luxury that, when they started becoming affordable, were a big disappointment in terms of labor saved. Again, by Y2K the improvements over 40 years were pretty astonishing. Recent decades have made them even better and more ubiquitous.

The first thing that popped into my mind when I read the para, though, was range tops. It is so incredibly much easier to clean a 21st-century stovetop than even one of the 1970s. And again, have you ever seen or worked with or tried to clean a 1950s gas range?

Things did in fact improve very considerably on a wide front during those decades you dismiss there. Though it was visible only to women and bachelors, I suppose.

Unrelated question: what does it take to get privileges to just comment without moderation? Are all comments individually approved? I’ve been posting here, intermittently, for over a decade now and never had a comment not approved.

6

Doug K 12.10.25 at 6:07 pm

In the course of a forum discussion, someone used an LLM to generate an article to argue for and against, with citations. I checked the citations.

References against:
The first is a blind link; there is no such study (“This DOI cannot be found in the DOI System.”)
The second exists!
The third is a blind link; there is no such study. There is a paper with that title, but the authors are different:
https://pubmed.ncbi.nlm.nih.gov/33289906/
It is true that this actually existing paper does support the arguments against. Half a mark for the LLM.

References for:
The first DOI reference goes to a study on vitamin D. Searching on the authors finds other papers they have written, but there is no such study as the one referenced.
The second is to an IOC meeting paper, which is political, not scientific.
The third is a blind link; there is no such study (“This DOI cannot be found in the DOI System.”)
That citation is a little better, as those authors have in fact written about the subject, but again there is no such study.

So 5/6 citations were entirely fabricated. The other times I’ve checked LLM-generated citations, the results have been similar. This seems a different problem to that of reproducing other people’s citations.
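
For a quick first pass, the doi.org resolver itself can be scripted against: it redirects (HTTP 302) for registered DOIs and returns a 404 otherwise. A minimal sketch, assuming a hypothetical file dois.txt with one DOI per line:

    # dois.txt is a hypothetical file: one DOI per line
    # doi.org answers 302 for registered DOIs, 404 for unregistered ones
    while read -r doi; do
      code=$(curl -s -o /dev/null -w "%{http_code}" "https://doi.org/$doi")
      echo "$code  $doi"
    done < dois.txt

Of course that only catches the blind links; a real DOI attached to the wrong title or authors still has to be checked by eye.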

The question of preserving context for the LLMs is a hard problem. Steve Yegge describes this as the Dementia Problem; see the link from my name. He claims to have written something that will help, but the jury is still out on whether it does.

7

John Q 12.10.25 at 6:19 pm

Aardvark: My mother was maybe ahead of the curve, as we had an automatic washing machine in the late 1960s, whereas the neighbours still had a wringer (but that was a big improvement on what went before). It didn’t have buttons; there was a big physical card that Mum configured to determine which program should run, then plugged into a slot.

But automatics were standard when I first set up house (late 70s) and have improved only incrementally since then.

On moderation, we tried everything and eventually went to auto-moderation for everyone except CTers (and even then, I sometimes have to approve my own comments).

8

Cranky Observer 12.10.25 at 7:44 pm

OT aside: those punch card washing machines from the 1940s-1960s are collectors’ items today.

Not OT: while dishwashers have grown more efficient and there have been improvements in internal fluid mechanics, the big advance was in biochemistry. Modern enzyme-based cleaning agents are not only orders of magnitude less damaging to the environment than earlier-generation detergent-based cleaners but do a far better job of cleaning the dishes (to the point where they will eat the design off of dinnerware if you unnecessarily pre-wash the dishes before putting them in the dishwasher).

9

John Q 12.11.25 at 5:05 am

Cranky Observer, I remember ads for enzymes as the new magic ingredient in cleaning. It was before I learned about enzymes at school, so that must also have been around 1970.

If you have a link for “don’t pre-wash the dishes”, I’d be keen to pass it on to my wife.

10

Petter Sjölund 12.11.25 at 7:36 am

Not a peer-reviewed study, but the user manual for my dishwasher says under “Loading the dishwasher”: “Remove large food remnants. It is not necessary to prerinse utensils under running water.” My hunch is that most dishwashers made in the last ten years or so have similar wording in their manual.

11

oldster 12.11.25 at 9:01 pm

Dishwashers now boast of their low water usage, and the amount of water that a cleaning cycle requires is now regulated (in the US) by Federal standards*. Manufacturers can gain additional laudatory certifications (“Energy Star”) for meeting even lower-use targets.

If the regulatory standards measure the total amount of water use by including any pre-rinsing, then the manufacturers would lose their certification if they encouraged pre-rinsing. I strongly suspect that this is the reason that user manuals now discourage pre-rinsing: simply in order to meet the standards for total water use.

Stories about how the machines do a better job of washing if you do not pre-rinse, or even that the machines or detergents will destroy your dishes if you do pre-rinse, strike me as likely to be just-so stories, told as ex post facto justifications of what are in fact boring responses to regulatory standards.

Enzymes that “eat the design off of dinnerware if you unnecessarily pre-wash” are also going to eat the designs off if you simply put in a plate that was not very dirty, or a plate that you want to sterilize before putting away. I find it very hard to believe that detergent manufacturers would be willing to tolerate the tort claims for damage and replacement of dishes that would be consequent on their marketing dish-eating enzymes. How would they defend themselves in court? “You should have smeared some food on it first”? Dish-eating enzymes strike me as another just-so story.

Yes, detergents and wash cycles have improved. But I strongly suspect that the current advice against pre-rinsing, as well as the embellishments on this advice, are all downstream from simple water-use regulations.

*If you object to such regulations, then you will be pleased to hear that the Trump administration is in the process of destroying the Federal Environmental Protection Agency, so those regulations will soon disappear down the drain.

12

Linguistics Boy 12.12.25 at 1:36 am

And I think a lot of the concerns about power and water use, the spread of AI slop and so on are either overstated or (as with deepfakes) are mostly new iterations of concerns that always arise with new IT and communications technology, and can be addressed with existing conceptual and legal tools.

I would be interested in learning more about how you came to these conclusions. Especially with changes to the media landscape, a lot of people seem to think we are on the verge of something totally unprecedented and it’s all changing too fast to control. What are some solid reasons not to panic?

13

John Q 12.12.25 at 4:11 am

LB @12: I’ll start with the easy ones. As regards water use, there are lots of complexities, but aggregate demand is massively outweighed by the use of irrigation water to grow corn that is turned into fuel ethanol. My son Dan sent me this from Hank Green:
https://youtu.be/H_c6MWk7PQc

As regards electricity, the same claim has been made repeatedly, and proved wrong every time https://johnquigginblog.substack.com/p/ai-wont-use-as-much-electricity-as

Fraudulent deepfakes are just another round in the long arms race between forgery and detection. Just as Photoshop invalidated “Pics or it didn’t happen”, Sora and similar are doing the same for video. In the end, trustworthy provenance is all you can count on.

Awful things are happening with regard to the media, but AI is peripheral to them.

Finally, there is the chance that AI will get good enough, soon enough, to render lots of human skills obsolete. I’m not ruling that out, as I said, but it’s not happening yet.

14

Alison 12.12.25 at 9:46 am

It seems use of generative AI depresses the quality of content. This is not really logical – even the laziest person has gained such a lot of time by taking an AI shortcut that they could quickly check the work and still be well ahead in terms of effort. That’s what I expected to happen. Instead I see presentations with ridiculous non-maths, reports where the student refers to themself throughout as ‘Emily’ (not their name), diagrams that are an incoherent mess (https://freethoughtblogs.com/pharyngula/files/2025/11/Autism-AISlop.jpeg).

AI boosters often link to clips of video saying ‘traditional sit-coms are dead’ (or whatever). Presumably these people have a strong motive to share something of good quality, so why are they always offensively bad?

Maybe generative AI has brought to the surface an aspect of human nature, unexpected (by me at least), a sort of self-hatred perhaps? Or aggressive helplessness? It would be so easy and so beneficial to check and correct, so there must be some strong opposing force which is pushing the other way, making people wallow in the slop.

15

Linguistics Boy 12.20.25 at 1:32 am

John Q @13:

In the end, trustworthy provenance is all you can count on.

Then the question becomes, which sources do people trust – and what happens when they choose to trust the wrong ones? Confirmation bias has always been with us, of course. But I can’t help thinking that fake video gives people an especially powerful way of choosing their own reality. On the one hand, they can view “evidence” for things that didn’t happen. On the other hand, whenever something they want to disbelieve is captured on video, they can selectively denounce it as fake.

Any technology is morally neutral in itself, but the way we decide to use it can exacerbate social problems. The risk is in relying on AI uncritically. By now experience has shown the limits of LLMs, but plenty of people are already using these models as oracles and even virtual friends. Every new product keeps the hype going.

Anyway, thanks for the links.

Comments on this entry are closed.