OpenAI has launched ChatGPT agent, a new feature that allows ChatGPT to act independently using its own virtual computer. The agent can navigate websites, run code, analyse data, and complete tasks such as planning meetings, building slideshows, and updating spreadsheets.
The feature is now rolling out to Pro, Plus, and Team users, with access for Enterprise and Education users expected in the coming weeks.
The agent integrates previously separate features, Operator and deep research, combining their capabilities into a single system. Operator allowed web interaction through clicks and inputs, while deep research focused on synthesis and summarisation.
The new system allows a fluid transition between reasoning and action within a single conversation.
“You can use it to effortlessly plan and book travel itineraries, design and book entire dinner parties, or find specialists and schedule appointments,” OpenAI said in a statement. “ChatGPT requests permission before taking actions of consequence, and you can easily interrupt, take over the browser, or stop tasks at any point.”
Users can activate agent mode via the tools dropdown in ChatGPT’s composer window. The agent uses a suite of tools, including a visual browser, a text-based browser, terminal access, and API integration. It can also work with connectors like Gmail and GitHub, provided users log in via a secure takeover mode.
All tasks are carried out on a virtual machine that preserves state across tool switches. This allows ChatGPT to browse the web, download files, run commands, and review outputs, all within a single session. Users can interrupt or redirect tasks at any time without losing progress.
ChatGPT agent is currently limited to 400 messages per month for Pro users and 40 for Plus and Team users. Additional usage is available through credit-based options. Support for the European Economic Area and Switzerland is in progress.
The standalone Operator research preview will be phased out in the coming weeks. Users who prefer longer-form, slower responses can still access deep research mode via the dropdown menu.
While slideshow generation is available, OpenAI noted that formatting may be inconsistent and that export issues remain. Improvements to this capability are under development.
The system showed strong performance across benchmarks. On Humanity’s Last Exam, it scored a new state-of-the-art pass@1 rate of 41.6%, rising to 44.4% when using parallel attempts. On DSBench, which tests data science workflows, it reached 89.9% on analysis tasks and 85.5% on modelling, significantly higher than human baselines.
In investment banking modelling tasks, the agent achieved a 71.3% mean accuracy, outperforming OpenAI’s o3 model and the earlier deep research tool. It also scored 68.9% on BrowseComp and 65.4% on WebArena, both benchmarks measuring real-world web navigation and task completion.
However, OpenAI acknowledged new risks with this capability. “This is the first time users can ask ChatGPT to take actions on the live web,” the company said. “We’ve placed a particular emphasis on safeguarding ChatGPT agent against adversarial manipulation through prompt injection.”
To counter these risks, ChatGPT requires explicit confirmation before high-impact actions like purchases, restricts actions such as bank transfers, and offers settings to delete browsing data and log out of sessions. Sensitive inputs entered during takeover sessions are not collected or stored.
The new system is classified under OpenAI’s “High Biological and Chemical” capability tier, triggering additional safeguards. The company has worked with external biosecurity experts and introduced monitoring tools, dual-use refusal training, and threat modelling to prevent misuse.