Is Deep Research deep? Is it research?

by John Q on September 12, 2025

I’m working on a first draft of a book arguing against pro-natalism (more precisely, that we shouldn’t be concerned about below-replacement fertility). That entails digging into lots of literature with which I’m not very familiar, and I’ve started using OpenAI’s Deep Research as a tool.

A typical interaction starts with me asking a question like “Did theorists of the demographic transition expect an eventual equilibrium with stable population?” Deep Research produces a fairly lengthy answer (mostly “Yes” in this case) and, based on past interactions, supplies references in a format suitable for my bibliographic software (Bookends for Mac, my longstanding favourite, uses .ris). To guard against hallucinations, I get DOI and ISBN codes and locate the references immediately. Then I check the abstracts (for journal articles) or reviews (for books) to confirm that the summary is reasonably accurate.
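
As a minimal sketch of the kind of DOI check involved here (assuming Python, the requests package and the public Crossref API; the function name and the example values are placeholders, not part of the workflow described above):

    import requests  # assumes the requests package is installed

    def check_doi(doi, claimed_title):
        # Return True if the DOI resolves on Crossref and its registered title
        # overlaps with the title the model claimed for it.
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
        if resp.status_code != 200:
            return False  # DOI unknown to Crossref: treat as a possible hallucination
        titles = resp.json()["message"].get("title", [])
        registered = titles[0].lower() if titles else ""
        claimed = claimed_title.lower()
        return claimed in registered or registered in claimed

    # Hypothetical usage; this DOI and title are placeholders, not real citations.
    # check_doi("10.xxxx/placeholder", "Some claimed article title")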

A few thoughts about this.

First, this is a big time-saver compared to doing a Google Scholar search, which may miss out on strands of the literature not covered by my search terms, as well as things like UN reports. It’s great, but it’s a continuation of decades of such time-saving innovations, going back to the invention of the photocopier (still new-ish and clunky when I started out). I couldn’t now imagine going to the library, searching the stacks for articles and taking hand-written notes, but that was pretty much how it was for me unless I was willing to line up for a low-quality photocopy at 5 cents a page.

Second, awareness of possible hallucinations is a Good Thing, since it enforces the discipline of actually checking the references. As long as you do that, you don’t have any problems. By contrast, I’ve often seen citations that are obviously lifted from a previous paper. Sometimes there’s a chain leading back to an original source that doesn’t support the claim being made (the famous “eight glasses of water a day” meme was like this).

Third, for the purposes of literature survey, I’m happy to read and quote from the abstract, without reading the entire paper. This is much frowned upon, but I can’t see why. If the authors are willing to state that their paper supports conclusion X based on argument Y, I’m happy to quote them on that – if it’s wrong that’s their problem. I’ll read the entire paper if I want to criticise it or use the methods myself, but not otherwise. (I remember a survey in which 40 per cent of academics admitted to doing this, to which my response was “60 per cent of academics are liars”).

Fourth, I’ve been unable to stop the program pretending (even to describe this I need to anthropomorphise) to be a person. If I say “stop using first-person pronouns in conversation”, it plays dumb and quickly reverts to chat mode.

Finally, is this just a massively souped-up search engine, or something that justifies the term AI? It passes the Turing test as I understand it – there are telltale clues, but nothing that would prove there wasn’t a person at the other end. But it’s still just doing summarisation. I don’t have an answer to this question, and don’t really need one.

{ 21 comments }

1

engels 09.12.25 at 9:11 pm

I think there’s a pretty wide consensus among people who have thought about these things that the Turing test is a bit daft. Iirc Turing didn’t envisage the machine taking his test would have instant access to all humanity’s recorded knowledge while doing so (in all the exams I’ve ever taken that would definitely have been considered cheating).

2

Jimmy 09.12.25 at 9:40 pm

I remember when teachers were panicked about Wikipedia and the impact it would have on student research. Turns out all these years later that it’s a pretty decent starting point as a bibliographic resource, even if you wouldn’t want to copy and paste directly from it into a paper. Increasingly, it feels like this is the best use of chatgpt as well. And the most attuned to its limitations.

On those limitations, I don’t think that it justifies using the term artificial intelligence. It’s a predictive large language model. It’s kind of incredible how much a predictive large language model can sound like intelligence, but it isn’t really.

What remains to be seen is whether, like Wikipedia, chatgpt and other large language models evolve towards their most constructive uses. Or whether, like social media, they have all sorts of negative societal consequences. To the extent that monetization has been a direct cause of most of those negative consequences, and Wikipedia has avoided this by remaining a non-profit, the outlook doesn’t look good.

3

Edward Gregson 09.12.25 at 10:29 pm

As a robotics person, I’d say these large models are impressive at text and images, but I don’t think any of them have yet yielded a robot that can operate in the physical world with the robustness and problem-solving ability of a coconut crab, much less a person. So I would probably still go with search engine, although who knows if that could change.

4

KT2 09.12.25 at 11:19 pm

“Second, awareness of possible hallucinations is a Good Thing”

Just read…
“New Study: How Often Do AI Assistants Hallucinate Links? (16 Million URLs Studied)”
By Ryan Law, Xibeijia Guan
September 2, 2025

“To find out, we looked at the http status of 16 million unique URLs cited by ChatGPT, Perplexity, Copilot, Gemini, Claude, and Mistral.

“We found that AI assistants send visitors to 404 pages 2.87x more often than Google Search.

“ChatGPT is the greatest offender, with 1.01% of clicked URLs and 2.38% of all cited URLs returning a 404 status (compared to baseline 404 rates of 0.15% and 0.84% respectively).

“Here’s what we found:

https://ahrefs.com/blog/how-often-do-ai-assistants-hallucinate-links/

And apologies if I’m telling you how to suck eggs JQ…
“deep-research-prompts

“Apply Proven Techniques:
– Use methods such as chain-of-thought prompting (e.g., “think step by step”) for complex tasks.
– Encourage the AI to break down problems into intermediate reasoning steps.
“Set a Role or Perspective:
– Assign a specific role (e.g., “act as a market analyst” or “assume the perspective of a historian”) to tailor the tone and depth of the analysis.
“Avoid Overloading the Prompt:

https://github.com/langgptai/awesome-deep-research-prompts/

I’d be interested in your results JQ, if you were to “assign a specific role”… “young woman, red state, requiring abortion”, “VC buying health services”, “professor of economics”, “peace activist”… etc.
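
For illustration, role assignment of that sort amounts to prepending a system message. A minimal sketch against OpenAI’s general-purpose chat completions API, not Deep Research itself; the model name and role text are placeholder assumptions:

    from openai import OpenAI  # assumes the openai Python package (v1 interface)

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Act as a professor of economics surveying the demography literature."},
            {"role": "user",
             "content": "Did theorists of the demographic transition expect an "
                        "eventual equilibrium with stable population?"},
        ],
    )
    print(response.choices[0].message.content)

The same question could then be re-run with each role swapped into the system message to compare how the framing shifts the answer.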

5

Lee A. Arnold 09.12.25 at 11:48 pm

Off-topic of research methods: The Economist just published an article, “Don’t panic about the global fertility crash: A world with fewer people would not be all bad”

https://www.economist.com/leaders/2025/09/11/dont-panic-about-the-global-fertility-crash

6

Alex SL 09.13.25 at 2:23 am

I have not used Deep Research, but I have recently experimented with a different system meant for research, and although it has yet to hallucinate a reference for me, it clearly has a junior high-schooler’s understanding of what a good reference is and an even worse understanding of what the references even say. In one case, it used nine references, and five of them were about a minuscule fraction of the topic, the equivalent of asking a student to provide an overview of Chinese history from the Han dynasty to today, and them spending more than half of their time researching the Battle of Red Cliffs. In the other case, it repeatedly used references to support statements that they do not and often could not possibly support.

Again, this isn’t Deep Research; it is a system that is supposedly even more custom-made for research than Deep Research. None of this builds confidence in me that complicated autocorrect should play any role in science. Leaving aside whether the IP theft makes all of this fruit of the poisonous tree, there was a relatively innocent time around 2022-2023 when many of us had fun using generative AI to, you know, actually generate things. I made an image of a DNA double helix caught in a mouse trap to illustrate the method of “DNA sequence capture” in a talk, and no artist lost a commission from that. Today some of our colleagues try to brute-force this technology into every process no matter how ill-suited it is, even those where reproducibility would be crucial. No, just because you have a hammer doesn’t mean that it is the right tool to change a light bulb! Stop it!

Regarding the abstract, of course it is silly to expect everybody to read every word of every paper they cite. I would, however, want to skim the methods, the results, and the main figure in addition to the abstract. The precise minimum due diligence required would depend on the field and the research question.

Regarding the Turing Test, I find it difficult to believe that LLMs would pass a test conducted by a critical evaluator (as opposed to an excited tech journalist). There are tells. Just looking at my two trials mentioned above, the writing style is just so… LLM. It is downright off-putting. And purely on the quality, I would recognise the output as from a bot because of the discrepancy between its failure to understand how to use references (which suggests a very inexperienced student) and its perfect spelling and grammar and the terminology it uses (which suggests a well-trained academic). Seems unlikely to get both in the same human.

But there is another way in which these ‘agentic systems’ would fail the test for me: the report is written too fast, no human could write like this in 10-15 min. Not sure that is what Turing meant, admittedly…

7

M Caswell 09.13.25 at 1:18 pm

“To guard against hallucinations, I get DOI and ISBN codes and locate the references immediately. Then I check the abstracts (for journal articles) or reviews (for books) to confirm that the summary is reasonably accurate.”

Suggests to me that it is not researching, rather you are the one doing the research, and what you are researching are abstracts and reviews, not articles and books.

8

David in Tokyo 09.13.25 at 1:47 pm

“Not sure that is what Turing meant, admittedly…”

As I’ve said before: go read Turing’s paper. It’s not the stupid thing most people think.

9

LFC 09.13.25 at 4:03 pm

Haven’t read all the comments, but just want to say that libraries with hard-copy books and hard-copy journals on their shelves are still important for several reasons. I’ve had the experience (when working on diss. some yrs ago now) of finding a significant journal article by browsing through some bound volumes of journals on the shelves (now sadly much reduced, I think, in terms of space occupied in this particular library). Could I have found it via an electronic search? Probably but I’m not entirely sure. Technological time-savers are often great but they’re not the be-all and end-all. P.s. I’ve not had occasion to use Deep Research (or really any other AI for that matter except the stuff one gets automatically at the top of many Google or other search-engine searches these days). But I’m not currently writing a book or engaged in sustained research.

10

Kenny Easwaran 09.13.25 at 8:10 pm

“is this just a massively souped-up search engine, or something that justifies the term AI?”

Some people want this to be a meaningful distinction. I think it’s better to think of all kinds of different information processing that are useful for achieving goals on a kind of continuum, depending on how many different kinds of information they can use, how many different kinds of environments they can work in, how many different kinds of goals they can achieve, and how reliably they can achieve those goals in those environments with that information. Clocks and thermostats would be on this scale, but far lower than your average search engine. The product you’re using is well above your average search engine. A Roomba is neither above nor below this sort of product, because it’s using a very different sort of information in a very different sort of environment and has very different sort of robustness.

Insisting that there is some point on this scale that counts as “justifying the term AI” just doesn’t seem useful to me.

I think it’s valuable to note both the ways that the neural net architecture makes these systems achieve things that in 2010 I thought wouldn’t happen within my lifetime (computers responding to arbitrary grammatical sentences with long grammatical passages while maintaining the topic) and also the ways that these things aren’t like people (no matter how many times you tell it to look carefully at the squares of the grid when entering entries into the crossword puzzle, it sometimes tries to enter letters in the black squares).

11

Alex SL 09.14.25 at 12:42 am

Jimmy,

I think the problem is greater than profit motives, as bad as those are. The principle behind search engines is that they lead you to original sources (even if it is just somebody’s blog of Chinese food recipes). The policy of Wikipedia is that it provides citations you can follow to the original sources. The principle behind ChatGPT and its ilk is that they summarise the information for you, making the use of search engines and the checking of original sources unnecessary. Services for researchers will provide those sources, but most people will not use those and instead simply use a generic chatbot that doesn’t provide sources.

The consequences are already apparent: site visits are collapsing, meaning that websites don’t get enough ad revenue, meaning that many websites may have to shut down or at least stop generating new content. Self-help forums like Stackoverflow see declining discussions, as programmers ask their questions to Claude or ChatGPT or Grok instead of asking colleagues at the web forum, meaning that the LLMs keep their own future training data from being created.

(Side note: I have seen few greater examples of not understanding how anything works than leaders of AI companies who say that future iterations of LLMs will simply be trained on the outputs of earlier iterations of LLMs. Imagine a climate modeller saying that they will simply re-fit their model on its own predictions, and the problem should be immediately obvious. But if you run a tech company, you are taken seriously no matter what nonsense you say. Oh, hey, Mars colonisation again.)

And everywhere around us we see how people are using these models already. Submit a blurry photo, say “enhance”, and then pontificate on social media about the face the bias machine made up. A government official telling an interviewer that they discuss the direction of future technology with the chatbot. Governments, think tanks, and major consultancies writing reports with an LLM and not checking if it made up half the references or if the references actually say what is claimed. X users so predictably and unthinkingly responding to statements they don’t like with “Grok, is this true?” that somebody has already created an auto-response bot for BlueSky called Gork to poke fun at that stupidity. Our future is millions of empty heads clonking against each other over what an LLM told them, which is in turn partly based on what its powerful owners wanted it to say or not say.

David in Tokyo,

I was merely making a joke here. Even without having read the paper, I am reasonably confident the TT is not about failing the machine for outperforming humans on keystrokes per second.

LFC,

The misconception that everything is already digitised is very problematic and has been for quite some time, e.g., with open plan offices having very little space, and hotdesking having no space, for the books researchers regularly need to consult.

In my area the even bigger problem is that the initiative that has digitised most legacy literature, the Biodiversity Heritage Library, has apparently lost its support from the Smithsonian. I understand it has found a new sponsor. But if that site were to disappear, many important works from the 18th and 19th century would again become very difficult to consult.

12

NomadUK 09.14.25 at 10:43 am

Interesting to see a double application of Betteridge’s law in one headline.

13

Jonshine 09.14.25 at 1:29 pm

As a research engineer and honest member of that 60%, I don’t read anything in smaller font than the title and mostly look at the pretty pictures…

14

Michael Cain 09.14.25 at 3:04 pm

Jimmy @ 2
Wikipedia is starting to show a weakness that could/should have been anticipated. Entries for fields where there is rapid change are falling behind. For example, in looking for some starting points on one aspect of integrated circuit fabrication a couple of weeks ago, the Wiki pages contained many statements of the form, “Intel has announced this will be put into production in 2023.” The cited reference is accurate. But maintenance is a pain, so there’s no one staying on top of what Intel actually did in 2023. I wonder what portion of Wikipedia’s entries are effectively now abandonware?

15

John Q 09.15.25 at 2:42 am

NomadUK @12 FTW!

16

SusanC 09.15.25 at 5:25 pm

Re. Pretending to be a person … the chain of thought of a reasoning model is particularly funny in this regard, as there’s lots of pretending to be a person.

You might not be seeing the chain of thought the way you’re using it (sometimes it’s deliberately hidden from the user, sometimes there’s so much of it in research mode that you wouldn’t want to read it, etc.)

The working theory among AI researchers is that the pretending to be a person in the chain of thought is doing something useful, even if it isn’t literally true. (E.g. organizing information about the topic by constructing these false memories.)

E.g. in answering a query about IBM operating systems, DeepSeek claimed it had the keyboard layout in muscle memory from touch-typing. Yes, obviously, this is not a true memory.

17

SusanC 09.15.25 at 5:31 pm

If I feel sufficiently enthused to go to a physical library and check an acceptable-to-Wikipedia source, I might put a note on a certain wiki page that a claim it makes probably isn’t true [cite]. This was encountered during an AI Deep Research query, where the language model is like, well, Wikipedia says this but I can’t actually find any other evidence it’s true, and this is highly suspicious. [yes, I need to go read the actual paper book that is a solidly citable reference to the claim being false]

18

SusanC 09.16.25 at 10:43 am

I have also reported errata to Project Gutenberg as a result of deep research queries…

a) The OCR used to create the PG etext incorrectly transcribed an accent over a letter
b) However many proofreaders checked the etext, they all missed it. (This is not a criticism of Project Gutenberg; their etexts are very high quality, but the number of residual OCR errors, while very low, is not zero.)
c) LLM tries to use the PG etext to answer a query
d) LLM complains to user that the word looks wrong at a critical point in the citation used to respond to user’s query
e) User compares the PG etext against the bitmap images of the same book from The Internet Archive. Yeah, it’s an OCR error, not a printer’s error in the edition as printed.
f) User sends errata report to PG

19

SusanC 09.16.25 at 10:50 am

It depends on how the LLM you’re using was trained, but most LLMs will expect text to be in Unicode Normalisation Form C; if it’s in NFD instead, the LLM will be like: this text looks completely garbled.

The framework you’re using for deep research might handle all this for you without you needing to know it’s happening; I had to implement “apply Unicode canonicalization to all resources before passing them to the LLM.”
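
A minimal sketch of that canonicalisation step, assuming Python’s standard unicodedata module; the function name is a placeholder:

    import unicodedata

    def canonicalise(text):
        # Normalise decomposed (NFD) text to composed (NFC) form before it
        # reaches the LLM, so 'e' plus a combining acute becomes one character.
        return unicodedata.normalize("NFC", text)

    # 'café' written in NFD (base letter plus combining accent) versus NFC:
    decomposed = "cafe\u0301"
    assert canonicalise(decomposed) == "caf\u00e9"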

20

JBL 09.17.25 at 1:30 am

Michael Cain @14: and did you fix it?

21

KT2 09.22.25 at 7:40 am

Deeper research? Up your alley? Or ephemera?
More Betteridge the better.

[Submitted on 5 Jun 2025]

“Inference economics of language models”

“We develop a theoretical model that addresses the economic trade-off between cost per token versus serial token generation speed when deploying LLMs for inference at scale. Our model takes into account arithmetic, memory bandwidth, network bandwidth and latency constraints; and optimizes over different parallelism setups and batch sizes to find the ones that optimize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.”

https://export.arxiv.org/abs/2506.04645
