The research covered Operator from OpenAI and the freely available Browser Use framework. Operator recognizes website content by looking at screenshots and is likely powered by the latest generation of GPT-4o, although the company itself has not officially confirmed this.
A specialized interface made it possible to simultaneously analyze the behavior of three different large language models (LLMs) via an API:
- GPT-4o
- Claude Sonnet 3.7
- Gemini 2.0 Flash
During consistent testing, these models were able to use the rendered HTML structure of web pages, the DOM (Document Object Model) tree, while their decision-making behavior was recorded.
The AI agents were tasked with booking a hotel room through a fictitious travel website. The prompts prepared for the AI sounded completely realistic, as if written by a typical person looking for accommodation. Based on these prompts, the researchers monitored the agent's ability to evaluate offers, respond to advertisements, and complete a reservation.
By connecting the three large language models through the API, the researchers were able to uncover differences in how the individual LLMs perceive and react to website data, and to observe how AI agents behave on the web at decision-making moments.
10 challenges used in testing:
Experts have found that AI agents do not completely ignore online advertisements, but how they interact with ads, and how much the ads influence their decisions, varies depending on the specific large language model.
- GPT-4o with OpenAI's Operator was the most decisive, consistently selecting one hotel and completing the reservation in almost all test cases.
- Claude Sonnet 3.7 from Anthropic had medium consistency, mostly selecting specific booking options but sometimes returning a list of options without initiating the booking.
- Google’s Gemini 2.0 Flash had the least convincing results, often offering multiple possible hotels and completing the fewest bookings of all the models.
Banner ads
Banner ads were the format that AI agents clicked on the most, but the presence of relevant keywords in the ads had a greater impact on results than the visuals themselves.
Ads with keywords in visible text were more effective at influencing model behavior than those with text in images. GPT-4o and Claude responded better to keyword-based ad content, with Claude including more promotional messaging in its output.
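One plausible reason for this difference is that an agent reading the DOM tree only receives text that exists as text nodes, while words baked into an image never reach it. The following is a minimal sketch of that idea, not the researchers' tooling: the ad markup, the class name `VisibleTextExtractor`, and the helper functions are all hypothetical, and a screenshot-based agent such as Operator may behave differently because it can look at the rendered pixels.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect the text nodes a DOM-reading agent could see (hypothetical helper)."""

    SKIPPED = {"script", "style"}  # content of these tags is not shown to users

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return all visible text nodes of an HTML fragment, joined by spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def mentions_keyword(html, keyword):
    """Case-insensitive check whether a keyword appears in the visible text."""
    return keyword.lower() in visible_text(html).lower()


# Hypothetical ad markup, invented for illustration only.
TEXT_AD = '<div class="ad"><a href="/offer"><span>Romantic hotel deal in Paris</span></a></div>'
IMAGE_AD = '<div class="ad"><a href="/offer"><img src="banner.png"></a></div>'

if __name__ == "__main__":
    print(mentions_keyword(TEXT_AD, "hotel"))
    print(mentions_keyword(IMAGE_AD, "hotel"))
```

In this sketch the text-based ad exposes its keyword to the parser, while the image-based ad exposes nothing unless the advertiser also supplies text elsewhere in the markup.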