Computer-Using Agent


January 23, 2025

A universal interface for AI to interact with the digital world.

Today we introduced a research preview of Operator, an agent that can go to the web to perform tasks for you. Powering Operator is Computer-Using Agent (CUA), a model that combines GPT-4o’s vision capabilities with advanced reasoning through reinforcement learning. CUA is trained to interact with graphical user interfaces (GUIs), the buttons, menus, and text fields people see on a screen, just as humans do. This gives it the flexibility to perform digital tasks without using OS- or web-specific APIs.

CUA builds on years of foundational research at the intersection of multimodal understanding and reasoning. By combining advanced GUI perception with structured problem-solving, it can break tasks into multi-step plans and adaptively self-correct when challenges arise. This capability marks the next step in AI development, allowing models to use the same tools humans rely on daily and opening the door to a vast range of new applications.

While CUA is still early and has limitations, it sets new state-of-the-art benchmark results, achieving a 38.1% success rate on OSWorld for full computer use tasks, and 58.1% on WebArena and 87% on WebVoyager for web-based tasks. These results highlight CUA’s ability to navigate and operate across diverse environments using a single general action space. 

We’ve developed CUA with safety as a top priority to address the challenges posed by an agent having access to the digital world, as detailed in our Operator System Card. In line with our iterative deployment strategy, we are releasing CUA through a research preview of Operator at operator.chatgpt.com for Pro Tier users in the U.S. to start. By gathering real-world feedback, we can refine safety measures and continuously improve as we prepare for a future with increasing use of digital agents.

How it works

CUA processes raw pixel data to understand what’s happening on the screen and uses a virtual mouse and keyboard to complete actions. It can navigate multi-step tasks, handle errors, and adapt to unexpected changes. This enables CUA to act in a wide range of digital environments, performing tasks like filling out forms and navigating websites without needing specialized APIs.

Given a user’s instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action:

  • Perception: Screenshots from the computer are added to the model’s context, providing a visual snapshot of the computer’s current state. 

  • Reasoning: CUA reasons through the next steps using chain-of-thought, taking into consideration current and past screenshots and actions. This inner monologue improves task performance by enabling the model to evaluate its observations, track intermediate steps, and adapt dynamically.

  • Action: It performs the actions—clicking, scrolling, or typing—until it decides that the task is completed or user input is needed. While it handles most steps automatically, CUA seeks user confirmation for sensitive actions, such as entering login details or responding to CAPTCHA forms.
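The perception, reasoning, and action loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not OpenAI's implementation: the `Action` and `Step` types, the `SENSITIVE_KINDS` labels, and the four callbacks (`take_screenshot`, `choose_action`, `execute`, `confirm`) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical labels for actions that need user confirmation (an assumption,
# loosely mirroring "entering login details or responding to CAPTCHA forms").
SENSITIVE_KINDS = {"enter_credentials", "solve_captcha"}


@dataclass
class Action:
    kind: str                                   # e.g. "click", "scroll", "type", "done"
    payload: dict = field(default_factory=dict)  # coordinates, text, and so on
    thought: str = ""                           # short reasoning summary for this step


@dataclass
class Step:
    screenshot: bytes
    action: Action


def run_agent(
    instruction: str,
    take_screenshot: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Step]], Action],
    execute: Callable[[Action], None],
    confirm: Callable[[Action], bool],
    max_steps: int = 50,
) -> List[Step]:
    """Run the loop until the policy says "done", the user declines a
    sensitive action, or the step budget runs out."""
    history: List[Step] = []
    for _ in range(max_steps):
        # Perception: capture the current screen state.
        screenshot = take_screenshot()
        # Reasoning: pick the next action from the instruction, the current
        # screenshot, and all past screenshots and actions.
        action = choose_action(instruction, screenshot, history)
        history.append(Step(screenshot, action))
        if action.kind == "done":
            break
        # Sensitive actions are handed back to the user before execution.
        if action.kind in SENSITIVE_KINDS and not confirm(action):
            break
        # Action: drive the virtual mouse and keyboard.
        execute(action)
    return history
```

In this sketch the model is just the `choose_action` callback, so the same loop could wrap a scripted policy for testing or a real model call in practice.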

Evaluations

CUA establishes a new state-of-the-art in both computer use and browser use benchmarks by using the same universal interface of screen, mouse, and keyboard.

Benchmark type   Benchmark    OpenAI CUA              Previous SOTA           Previous SOTA            Human
                              (universal interface)   (universal interface)   (web browsing agents)
Computer use     OSWorld      38.1%                   22.0%                   -                        72.4%
Browser use      WebArena     58.1%                   36.2%                   57.1%                    78.2%
Browser use      WebVoyager   87.0%                   56.0%                   87.0%                    -
Evaluation details are described here
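As a small worked example, the reported numbers can be turned into two derived quantities: CUA's gain over the previous universal-interface state of the art, and the remaining gap to human performance. The figures are copied from the table above; the assignment of values to columns reflects one reading of that table, and the dictionary keys and `summarize` helper are names made up for this sketch.

```python
from typing import Dict, Optional, Tuple

# Success rates in percent, taken from the table above. "prev_sota" is the
# previous state of the art using the universal interface; "human" is None
# where no human figure is listed.
RESULTS: Dict[str, Dict[str, Optional[float]]] = {
    "OSWorld": {"cua": 38.1, "prev_sota": 22.0, "human": 72.4},
    "WebArena": {"cua": 58.1, "prev_sota": 36.2, "human": 78.2},
    "WebVoyager": {"cua": 87.0, "prev_sota": 56.0, "human": None},
}


def summarize(row: Dict[str, Optional[float]]) -> Tuple[float, Optional[float]]:
    """Return (gain over previous SOTA, gap to human) in percentage points."""
    gain = round(row["cua"] - row["prev_sota"], 1)
    gap = None if row["human"] is None else round(row["human"] - row["cua"], 1)
    return gain, gap


for name, row in RESULTS.items():
    gain, gap = summarize(row)
    gap_text = "n/a" if gap is None else f"{gap} points"
    print(f"{name}: +{gain} points over previous SOTA, gap to human: {gap_text}")
```

On these numbers the gap to human performance is larger on OSWorld than on WebArena, even though the gain over the previous state of the art is smaller.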

Browser use

WebArena and WebVoyager are designed to evaluate the performance of web browsing agents in completing real-world tasks using browsers. WebArena utilizes self-hosted open-source websites offline to imitate real-world scenarios in e-commerce, online store content management (CMS), social forum platforms, and more. WebVoyager tests the model’s performance on online live websites like Amazon, GitHub, and Google Maps.

In these benchmarks, CUA sets a new standard using the same universal interface that perceives the browser screen as pixels and takes action through mouse and keyboard. CUA achieved a 58.1% success rate on WebArena and an 87% success rate on WebVoyager for web-based tasks. While CUA achieves a high success rate on WebVoyager, where most tasks are relatively simple, CUA still needs more improvements to close the gap with human performance on more complex benchmarks like WebArena.

Example tasks: Grammar quiz, Total refund, Polyominoes, Update license, Find customer
Grammar quiz trajectory: 152 recorded steps, from the user prompt to the final assistant message, alternating new screenshots with click, scroll, wait, and typing actions and ending with a citation of the quiz results. The user prompt follows.
Go to the Plus section of Cambridge Dictionary, finish a recommended Grammar quiz without login and tell me your final score.
Total refund trajectory: 146 recorded steps, from the user prompt to the final assistant message, alternating new screenshots with click, wait, typing, and key press actions. The user prompt follows.
Initialize computer and solve the following task:
How much refund I should expect from my order canlled in Feb 2023, including shipping fee

The following websites are available at:
one-stop-shop: http://one-stop-shop.site

All you need is on the provided websites.
Start the task from the following URL: http://one-stop-shop.site

Here are tips for using the one-stop-shop website:
* This website provides very detailed category of products. You can hover categories on the top menu to see subcategories.
* If you need to find information about your previous purchases, you can go My Account > My Orders, and find order by date, order number, or any other available information
* An order is considered out of delivery if it is marked as “processing” in the order status
* When the task asks you to draft and email. DO NOT send the email. Just draft it and provide the content in the last message

Here are rules for providing the answer:
If the objective is to find a text-based answer, do not use computer_output_citation, instead provide the answer in the last message with following quoted format
“`Answer:“`

Important notes about the answer format:
– DO NOT RESPOND WITH ANYTHING ELSE OTHER THAN THIS FORMAT. DO NOT ASK ME IF I NEED ANYTHING ELSE. I JUST NEED THE ANSWER IN THIS FORMAT.
– Importantly, there is no empty space between “Answer:” and .
– You should include “` in your response. For example, you should write ““`Answer:42“`” instead of “Answer:42”.
– If you do not write in this format, you get no reward at all!!!
– Keep the answer as short and concise as possible. For example, if the answer is “42”, instead of writing “based on the results, I believe the answer is 42.”, you should just write:“`Answer:42“`

Keep going if the answer is not found. DO NOT ask the user any question if you encounter an issue! You have the full authority until the task is completed. If you believe the task is impossible to complete, provide the following answer:
“`Answer:N/A“`. When asked to return a count, return the count as a number instead of N/A if it’s 0.

Polyominoes trajectory: 38 recorded steps, from the user prompt to the final assistant message, alternating new screenshots with click, typing, key press, wait, scroll, and drag actions. The user prompt follows.
A polyominoes of order 6 means you have 6 identical squares to combine different shapes (2-sided). How many combinations are there? Looking at all the shapes in the result, how many of them have only 2 rows in total?
Update license trajectory: 165 recorded steps, from the user prompt to the final assistant message, alternating new screenshots with click, wait, typing, key press, and scroll actions. The last recorded summary is "License updated to GNU v3". The user prompt follows.
Initialize computer and solve the following task:
Make the LICENSE of byteblaze/a11y-syntax-highlighting to one that mandates all copies and derivative works to be under the same license

The following websites are available at:
gitlab: http://gitlab.site

All you need is on the provided websites.
Start the task from the following URL: http://gitlab.site

Here are tips for using the gitlab website:
* your user name is byteblaze
* To add new members to the project, you can visit project information > members tab and click blue “invite members” button on top right
* To set your status, click profile button on top right corner of the page (it’s next to the question mark button) and click edit status
* To edit your profile, click profile button on top right corner of the page (it’s next to the question mark button) and click edit profile
* You can also access to your information e.g. access token, notifications, ssh keys and more from “edit profile” page
* Projects that you have contributed to are listed under Project / Yours / All tab of gitlab.site. You can sort repos using dropdown button on top right
* Projects’s repository tab has menus like Commits, Branches, Contributors, and more. Contributors tab shows contributors and their number of commits
* If you want to see all the issues for you, you can either click button on the right of + icon on top right menu bar
* When the task mentions branch main, it often means master

Here are rules for providing the answer:
If the objective is to find a text-based answer, do not use computer_output_citation, instead provide the answer in the last message with following quoted format
“`Answer:“`

Important notes about the answer format:
– DO NOT RESPOND WITH ANYTHING ELSE OTHER THAN THIS FORMAT. DO NOT ASK ME IF I NEED ANYTHING ELSE. I JUST NEED THE ANSWER IN THIS FORMAT.
– Importantly, there is no empty space between “Answer:” and .
– You should include “` in your response. For example, you should write ““`Answer:42“`” instead of “Answer:42”.
– If you do not write in this format, you get no reward at all!!!
– Keep the answer as short and concise as possible. For example, if the answer is “42”, instead of writing “based on the results, I believe the answer is 42.”, you should just write:“`Answer:42“`

Keep going if the answer is not found. DO NOT ask the user any question if you encounter an issue! You have the full authority until the task is completed. If you believe the task is impossible to complete, provide the following answer:
“`Answer:N/A“`. When asked to return a count, return the count as a number instead of N/A if it’s 0.

Find customer trajectory: 451 recorded steps, from the user prompt to the final assistant message, alternating new screenshots with click, typing, key press, wait, and scroll actions. The last recorded summary is "Identified customer, confirmed contact details." The user prompt follows.
Initialize computer and solve the following task:
Tell me the email address, name, phone number of the customer who has the most cancellations in the history

The following websites are available at:
magento: http://magento.site/admin

All you need is on the provided websites.
Start the task from the following URL: http://magento.site/admin

Here are tips for using the magento.site/admin website:
* When you add a new product in the CATALOG > Products tab, you can click the downward arrow beside the “Add Product” button to select options like “Simple Product”, “Configurable Product”, etc.
* If you need to add new attribute values (e.g. size, color, etc) to a product, you can find the product at CATALOG > Products, search for the product, edit product with “Configurable Product” type, and use “Edit Configurations” to add the product with new attribute values. If the value that you want does not exist, you may need to add new values to the attribute.
* If you need to add new values to product attributes (e.g. size, color, etc), you can visit STORES > Attributes > Product, find the attribute and click, and add value after clicking “Add Swatch” button.
* You can generate various reports by using menus in the REPORTS tab. Select REPORTS > “report type”, select options, and click “Show Report” to view report.
* In this website, there is a UI that looks like a dropdown, but is just a 1-of-n selection menu. For example in REPORTS > Orders, if you select “Specified” Order Status, you will choose one from many options (e.g. Canceled, Closed, …), but it’s not dropdown, so your click will just highlight your selection (1-of-n select UI will not disappear).
* Configurable products have some options that you can mark as “on” of “off”. For example, the options may include “new”, “sale”, “eco collection”, etc.
* You can find all reviews and their counts in the store in MARKETING > User Content > All Reviews. If you see all reviews grouped by product, go REPORTS > By Products and search by Product name.
* This website has been operating since 2022. So if you have to find a report for the entire history, you can select the date from Jan 1, 2022, to Today.
* Do not export or download files, or try to open files. It will not work.

Here are rules for providing the answer:
If the objective is to find a text-based answer, do not use computer_output_citation, instead provide the answer in the last message with following quoted format
“`Answer:“`

Important notes about the answer format:
– DO NOT RESPOND WITH ANYTHING ELSE OTHER THAN THIS FORMAT. DO NOT ASK ME IF I NEED ANYTHING ELSE. I JUST NEED THE ANSWER IN THIS FORMAT.
– Importantly, there is no empty space between “Answer:” and .
– You should include “` in your response. For example, you should write ““`Answer:42“`” instead of “Answer:42”.
– If you do not write in this format, you get no reward at all!!!
– Keep the answer as short and concise as possible. For example, if the answer is “42”, instead of writing “based on the results, I believe the answer is 42.”, you should just write:“`Answer:42“`

Keep going if the answer is not found. DO NOT ask the user any question if you encounter an issue! You have the full authority until the task is completed. If you believe the task is impossible to complete, provide the following answer:
“`Answer:N/A“`. When asked to return a count, return the count as a number instead of N/A if it’s 0.
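The answer format required by the WebArena prompts above lends itself to a mechanical check. The sketch below assumes that the quote-and-backtick marks in the prompts stand for a triple-backtick fence around `Answer:` followed directly by the answer. The `extract_answer` helper is a hypothetical name and is not part of any benchmark harness.

```python
import re
from typing import Optional

# Three backticks, built programmatically so this example stays readable
# inside a fenced block. That the prompts mean three backticks is an assumption.
FENCE = "`" * 3

ANSWER_RE = re.compile(
    re.escape(FENCE) + r"Answer:(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_answer(message: str) -> Optional[str]:
    """Return the answer if the whole message follows the required format,
    otherwise None."""
    match = ANSWER_RE.fullmatch(message.strip())
    if match is None:
        return None  # extra text around the fence, or no fence at all
    answer = match.group(1)
    if answer != answer.lstrip():
        return None  # the prompt forbids whitespace right after "Answer:"
    return answer


print(extract_answer(FENCE + "Answer:42" + FENCE))   # well-formed
print(extract_answer("Answer:42"))                   # missing fence
print(extract_answer(FENCE + "Answer:N/A" + FENCE))  # "impossible task" reply
```

A strict whole-message match mirrors the prompt's instruction not to respond with anything other than the answer in this format.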

Computer use

OSWorld is a benchmark that evaluates models’ ability to control full operating systems like Ubuntu, Windows, and macOS. In this benchmark, CUA achieves a 38.1% success rate. We observed test-time scaling, meaning CUA’s performance improves when more steps are allowed. The figure below compares CUA’s performance with the previous state of the art under varying maximum allowed steps. Human performance on this benchmark is 72.4%, so there is still significant room for improvement.
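One simple way to picture the effect of the step budget is to count how many tasks finish within each maximum number of allowed steps. The sketch below is a simplification that assumes each task either needs a fixed number of steps or is never solved. The `success_rate_by_budget` helper is a hypothetical name, and the step counts are made-up placeholders, not OSWorld measurements.

```python
from typing import Dict, List, Optional


def success_rate_by_budget(
    steps_to_solve: List[Optional[int]],
    budgets: List[int],
) -> Dict[int, float]:
    """For each step budget, return the fraction of tasks solved within it.

    steps_to_solve holds, per task, the number of steps the agent needed,
    or None if the task was never solved.
    """
    if not steps_to_solve:
        raise ValueError("need at least one task")
    total = len(steps_to_solve)
    rates: Dict[int, float] = {}
    for budget in budgets:
        solved = sum(1 for s in steps_to_solve if s is not None and s <= budget)
        rates[budget] = solved / total
    return rates


# Placeholder values for illustration only (NOT benchmark data).
toy_steps = [5, 12, 40, None, 90, 18, None, 150]
print(success_rate_by_budget(toy_steps, [15, 50, 100, 200]))
```

Under this simplification the success rate can only stay flat or rise as the budget grows, which is the shape a test-time scaling curve would show.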

The following visualizations show examples of CUA navigating a variety of standardized OSWorld tasks.

Example tasks: Download lectures, Combine pdfs, Compress image, Calculate price, Export images
Download lectures: trajectory (267 steps). Each step is a new screenshot followed by a short reasoning summary and one action such as Click, Scroll, or Wait. CUA explores the course page for lecture PDFs, then for each of weeks 1 through 9 (Conditionals, Loops, Exceptions, Libraries, Unit Tests, File I/O, Regular Expressions, Object-Oriented Programming, Et Cetera) it opens the week from the menu, scrolls to the lecture slides, downloads the PDF, and saves it (lecture1.pdf through lecture9.pdf) in the lecture_slides folder. It finishes by verifying that all lecture slides were downloaded and sending an assistant message.
Please do the following task: I want to learn python programming and my friend recommends me this course website. I have grabbed the lecture slide for week 0. Please download the PDFs for other weeks into the opened folder and leave the file name as-it-is.

Here are some helpful tips:
– computer.clipboard, computer.sync_file, computer.sync_shared_folder, computer.computer_output_citation are disabled.
– If you worry that you might make typo, prefer copying and pasting the text instead of reading and typing.
– My computer’s password is “password”, feel free to use it when you need sudo rights.
– For the thunderbird account “[email protected]”, the password is “gTCI”;=@y7|QJ0nDa_kN3Sb&>”.
– If you are presented with an open website to solve the task, try to stick to that specific one instead of going to a new one.
– You have full authority to execute any action without my permission. I won’t be watching so please don’t ask for confirmation.
– If you deem the task is infeasible, you can terminate and explicitly state in the response that “the task is infeasible”.
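
The Download lectures trajectory above alternates between a new screenshot, a short reasoning summary, and a single GUI action. As a rough sketch of how such a log could be represented and tallied, the snippet below uses a hypothetical `Step` record and `classify` helper; these names are illustrative assumptions, not CUA's actual log format. The action labels are the ones that appear in the trajectories in this post.

```python
from collections import Counter
from dataclasses import dataclass

# Action labels as they appear in the trajectories shown in this post.
ACTIONS = {"Click", "Scroll", "Wait", "Typing", "Key press", "Drag"}


@dataclass
class Step:
    """Hypothetical record for one trajectory entry (illustrative only)."""
    kind: str  # "screenshot", "action", or "reasoning"
    text: str  # the label shown in the trajectory viewer


def classify(label: str) -> Step:
    """Sort a viewer label into one of the three step kinds."""
    if label == "New screenshot":
        return Step("screenshot", label)
    if label in ACTIONS:
        return Step("action", label)
    # Anything else is treated as a reasoning summary.
    return Step("reasoning", label)


def action_counts(labels):
    """Count how often each GUI action occurs in a trajectory."""
    steps = [classify(label) for label in labels]
    return Counter(step.text for step in steps if step.kind == "action")


# A short excerpt from the Download lectures trajectory.
labels = [
    "New screenshot",
    "Exploring page for lecture PDFs",
    "Scroll",
    "New screenshot",
    "Clicking PDF link for confirmation",
    "Click",
]
counts = action_counts(labels)
print(dict(counts))
```

Tallies like this make it easy to see where a trajectory spends its steps, for example the long runs of Drag actions in the Compress image task.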

Combine PDFs: trajectory (321 steps). CUA closes Firefox to reach Thunderbird, retries the password, and dismisses a persistent login prompt before searching for the “Paper Recommendation” email. It opens the email and saves each attachment to the Downloads folder. After several attempts to open the file manager (closing Visual Studio Code along the way), it opens a terminal and merges the PDF files into one. It then opens attachment_full.pdf, scrolls through the document, and navigates to the last page (page 75) to verify the merge. Finally, it opens Google Chrome, dismisses a password prompt, uploads the merged PDF to Google Drive, confirms the upload, and sends an assistant message.
Please do the following task: Could you help me merge all PDF files in the “Paper Recommendation” email attachment in Thunderbird into one file and upload it to attachment_full.pdf in Google Drive?

Here are some helpful tips:
– computer.clipboard, computer.sync_file, computer.sync_shared_folder, computer.computer_output_citation are disabled.
– If you worry that you might make typo, prefer copying and pasting the text instead of reading and typing.
– My computer’s password is “password”, feel free to use it when you need sudo rights.
– For the thunderbird account “[email protected]”, the password is “gTCI”;=@y7|QJ0nDa_kN3Sb&>”.
– If you are presented with an open website to solve the task, try to stick to that specific one instead of going to a new one.
– You have full authority to execute any action without my permission. I won’t be watching so please don’t ask for confirmation.
– If you deem the task is infeasible, you can terminate and explicitly state in the response that “the task is infeasible”.

Compress image: trajectory (159 steps). CUA opens the browser, downloads the image (kingbird.jpeg), and after a failed attempt finds the Files app through the applications search bar. It opens the image in GIMP and exports it as compressed.jpeg. Many of the steps are repeated drags and typing on the JPEG quality control while trying to set the value to 60; it eventually exports with quality 57. It then opens the Files app, checks the properties of compressed.jpeg to confirm the file size, and sends an assistant message.
Please do the following task: Download the image from “https://drive.google.com/uc?export=download&id=1VIwIhLpkRr72DTuJINsZQ-DDwHAtfddq”, and then use GIMP to compress it to under 600KB as “compressed.jpeg” on the Desktop. Resize if needed.

Here are some helpful tips:
– computer.clipboard, computer.sync_file, computer.sync_shared_folder, computer.computer_output_citation are disabled.
– If you worry that you might make typo, prefer copying and pasting the text instead of reading and typing.
– My computer’s password is “password”, feel free to use it when you need sudo rights.
– For the thunderbird account “[email protected]”, the password is “gTCI”;=@y7|QJ0nDa_kN3Sb&>”.
– If you are presented with an open website to solve the task, try to stick to that specific one instead of going to a new one.
– You have full authority to execute any action without my permission. I won’t be watching so please don’t ask for confirmation.
– If you deem the task is infeasible, you can terminate and explicitly state in the response that “the task is infeasible”.

Calculate price: trajectory (111 steps). CUA opens ‘stock.xlsx’ in Calc, reviews the spreadsheet headers, and uses File, then “Save As…” to save the table as stock.csv with the default export settings. After a few attempts it opens the Terminal, navigates to the Documents folder, writes a Python script for the median calculation in Nano, and runs it. It then lists the Documents folder to verify the results, displays the contents of result.txt, and sends an assistant message.
Please do the following task: Export the table to a CSV file and then help me write code to find the medium price (fill empty value with average). Save the result in “result.txt”.

Here are some helpful tips:
– computer.clipboard, computer.sync_file, computer.sync_shared_folder, computer.computer_output_citation are disabled.
– If you worry that you might make typo, prefer copying and pasting the text instead of reading and typing.
– My computer’s password is “password”, feel free to use it when you need sudo rights.
– For the thunderbird account “[email protected]”, the password is “gTCI”;=@y7|QJ0nDa_kN3Sb&>”.
– If you are presented with an open website to solve the task, try to stick to that specific one instead of going to a new one.
– You have full authority to execute any action without my permission. I won’t be watching so please don’t ask for confirmation.
– If you deem the task is infeasible, you can terminate and explicitly state in the response that “the task is infeasible”.
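
The script CUA wrote for the Calculate price task is not shown in this post. As a minimal sketch of what the prompt asks for (fill empty prices with the average, then take the median), the example below assumes a hypothetical column named `Price` and uses made-up sample rows in place of the real stock.csv.

```python
import csv
import io
import statistics


def median_price(csv_file, column="Price"):
    """Return the median of `column`, filling empty cells with the column mean.

    The column name is an assumption; the real spreadsheet headers are not
    shown in this post.
    """
    rows = list(csv.DictReader(csv_file))
    raw = [(row.get(column) or "").strip() for row in rows]
    known = [float(value) for value in raw if value]
    mean = statistics.mean(known)
    filled = [float(value) if value else mean for value in raw]
    return statistics.median(filled)


def save_result(value, path="result.txt"):
    """Write the result to a text file, as the task requests."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"{value}\n")


# In-memory stand-in for stock.csv with one missing price (made-up values).
sample = io.StringIO(
    "Date,Price\n"
    "2024-01-01,10\n"
    "2024-01-02,\n"
    "2024-01-03,30\n"
    "2024-01-04,50\n"
)
result = median_price(sample)
print(result)
```

In a real run, the file would be opened with `open("stock.csv", newline="")` and the value passed to `save_result`.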

Export images: trajectory (270 steps). CUA opens Thunderbird, goes to the Notes folder, closes a login failure dialog, and opens the “Lecture Document” email. It saves the lecture notes attachment to the Downloads folder, opens the document, and zooms in. Scrolling through the document, it right-clicks each figure it finds (an image sample, a line graph, and a pie chart) and saves each one as a numbered PNG, starting with “1.png” and “2.png”. It then opens Google Drive in Chromium, searches for a figures folder, creates a new ‘figures’ folder, and uploads the images from the Downloads folder. Finally, it moves the unwanted lecture notes file to the trash and sends an assistant message.
Please do the following task: Help me export charts, graph or other images from docx files received in email “Lecture Document” in Notes folder and upload these png files to the figures/ folder in Google Drive for later use (use numbers to name them).

Here are some helpful tips:
– computer.clipboard, computer.sync_file, computer.sync_shared_folder, computer.computer_output_citation are disabled.
– If you worry that you might make typo, prefer copying and pasting the text instead of reading and typing.
– My computer’s password is “password”, feel free to use it when you need sudo rights.
– For the thunderbird account “[email protected]”, the password is “gTCI”;=@y7|QJ0nDa_kN3Sb&>”.
– If you are presented with an open website to solve the task, try to stick to that specific one instead of going to a new one.
– You have full authority to execute any action without my permission. I won’t be watching so please don’t ask for confirmation.
– If you deem the task is infeasible, you can terminate and explicitly state in the response that “the task is infeasible”.

CUA in Operator

We’re making CUA available through a research preview of Operator, an agent that can go to the web to perform tasks for you. Operator is available to Pro users in the U.S. at operator.chatgpt.com. This research preview is an opportunity to learn from our users and the broader ecosystem, refining and improving Operator iteratively. As with any early-stage technology, we don’t expect CUA to perform reliably in all scenarios just yet. However, it has already proven useful in a variety of cases, and we aim to extend that reliability across a wider range of tasks. By releasing CUA in Operator, we hope to gather valuable insights from our users, which will guide us in refining its capabilities and expanding its applications.

The table below shows how often CUA succeeded in Operator over repeated attempts at each prompt, to illustrate its known strengths and weaknesses.

Category: Interacting with various UI components to accomplish tasks
Note: CUA can interact with various UI components to search, sort, and filter results to find the information that users want. Reliability varies for different websites and UIs.

Prompt:
Turn 1: Search Britannica for a detailed map view of bear habitats
Turn 2: Great! Now please check out the black, brown and polar bear links and provide a concise general overview of their physical characteristics, specifically their differences. Oh and save the links for me so I can access them quickly.
Success / attempts: 10 / 10

Prompt: I want one of those target deals. Can you check if they have a deal on poppi prebiotic sodas? If they do, I want the watermelon flavor in the 12fl oz can. Get me the type of deal that comes with this and check if it’s gluten free.
Success / attempts: 9 / 10

Prompt: I am planning to shift to Seattle and I want you to search Redfin for a townhouse with at least 3 bedrooms, 2 bathrooms, and an energy-efficient design (e.g., solar panels or LEED-certified). My budget is between $600,000 – $800,000 and it should ideally be close to 1500 sq ft.
Success / attempts: 3 / 10

Category: Tasks that can be accomplished through repeated simple UI interactions
Note: CUA can reliably repeat simple UI interactions many times to automate simple but tedious tasks for users.

Prompt: Create a new project in Todoist titled ‘Weekend Grocery Shopping.’ Add the following shopping list with products:
Bananas (6 pieces)
Avocados (2 ripe)
Baby Spinach (1 bag)
Whole Milk (1 gallon)
Cheddar Cheese (8 oz block)
Potato Chips (Salted, family size)
Dark Chocolate (70% cocoa, 2 bars)
Success / attempts: 10 / 10

Prompt: Search Spotify for the most popular songs of the USA for the 1990s, and create a playlist with at least 10 tracks.
Success / attempts: 10 / 10

Category: Tasks where CUA shows a high success rate only if prompts include detailed hints on how to use the website
Note: Even for the same task, CUA’s reliability can change depending on how the task is prompted. In this case, reliability improves when the prompt gives a specific time range (e.g. 9 am to 12am vs entire day from 9 am) and hints at which UI should be used to find results (e.g. check the filters section …).

Prompt (with hints): Visit tagvenue.com and look for a concert hall that seats 150 people in London. I need it on Feb 22 2025 for the entire day from 9 am to 12 am, just make sure it is under £90 per hour.
Oh could you check the filters section for appropriate filters and make sure there is parking and the entire thing is wheelchair accessible.
Success / attempts: 8 / 10

Prompt (without hints): Visit tagvenue.com and look for a concert hall that seats 150 people in London. I need it on Feb 22 2025 for the entire day from 9 am, just make sure it is under £90 per hour.
Oh and make sure there is parking and the entire thing is wheelchair accessible.
Success / attempts: 3 / 10

Category: Struggling to use unfamiliar UI and text editing
Note: When CUA has to interact with UIs that it saw little of during training, it struggles to figure out how to use them appropriately. This often results in a lot of trial and error and inefficient actions. CUA is also not precise at text editing. It often makes mistakes along the way or produces output with errors.

Prompt: Use html5editor and input the folowing text on the left side, then edit it following my instructions and give me a screenshot of the entire thing when done. The text is:

Hello world!

This is my first text. I need to see how it would look like when programmed with HTML.

Some parts should be red.

Some bold.

Some italic.

Some underlined.

Until my lesson is complete, and we shift to the other side.

Hello world! should have header 2 applied
The sentence below it should be a regular paragraph text.
The sentence mentioning red should be normal text and red
The sentence mentionnihg bold should be normal text bolded
Sentence mentioning italic should be italicized
The final sentence should be aligned to the right instead of the usual left

Success / attempts: 4 / 10
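
The success counts in the table above can be turned into rates to compare prompts directly. The sketch below copies the counts from the table; the short labels are shorthand introduced here for illustration and do not appear in the original table.

```python
# (shorthand label, successes, attempts), with counts copied from the table.
trials = [
    ("Britannica bear habitats", 10, 10),
    ("Target poppi deal", 9, 10),
    ("Redfin Seattle townhouse", 3, 10),
    ("Todoist grocery list", 10, 10),
    ("Spotify 1990s playlist", 10, 10),
    ("Tagvenue with hints", 8, 10),
    ("Tagvenue without hints", 3, 10),
    ("html5editor text editing", 4, 10),
]


def success_rate(successes, attempts):
    """Fraction of attempts that succeeded."""
    return successes / attempts


rates = {name: success_rate(s, a) for name, s, a in trials}

# Difference between the two Tagvenue prompts, which differ only in the hints.
hint_gain = rates["Tagvenue with hints"] - rates["Tagvenue without hints"]

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
print(f"Gain from detailed hints on the Tagvenue task: {hint_gain:.0%}")
```

With only ten attempts per prompt, these rates are coarse estimates rather than precise measurements.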

Safety

Because CUA is one of our first agentic products with an ability to directly take actions in a browser, it brings new risks and challenges to address. As we prepared for deployment of Operator, we did extensive safety testing and implemented mitigations across three major classes of safety risks: misuse, model mistakes, and frontier risks. We believe it is important to take a layered approach to safety, so we implemented safeguards across the whole deployment context: the CUA model itself, the Operator system, and post-deployment processes. The aim is to have mitigations that stack, with each layer incrementally reducing the risk profile.

The first category of risk is misuse. In addition to requiring users to comply with our Usage Policies, we have designed the following mitigations to reduce Operator’s risk of harm due to misuse, building off our safety work for GPT-4o:

  • Refusals: The CUA model is trained to refuse many harmful tasks and illegal or regulated activities.

  • Blocklist: Operator cannot access websites that we’ve preemptively blocked, such as many gambling sites, adult entertainment, and drug or gun retailers.

  • Moderation: User interactions are reviewed in real-time by automated safety checkers that are designed to ensure compliance with Usage Policies and have the ability to issue warnings or blocks for prohibited activities. 

  • Offline detection: We’ve also developed automated detection and human review pipelines to identify prohibited usage in priority policy areas, including child safety and deceptive activities, allowing us to enforce our Usage Policies.
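
To illustrate how mitigations like these can stack, here is a minimal sketch of a navigation check that applies a blocklist first and a moderation result second. Everything in it is hypothetical: the domain names, the `moderation_flags` argument, and the function names are assumptions made for illustration, and Operator's actual implementation is not described in this post.

```python
from urllib.parse import urlparse

# Hypothetical blocklist entries; the real list is not published here.
BLOCKED_DOMAINS = {"casino.example", "guns.example"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


def check_navigation(url, moderation_flags=()):
    """Apply the layers in order and report the first one that objects.

    `moderation_flags` stands in for the output of an automated safety
    checker; how such a checker works is outside this sketch.
    """
    if is_blocked(url):
        return "blocked: site is on the blocklist"
    if moderation_flags:
        return "blocked: moderation flagged " + ", ".join(moderation_flags)
    return "allow"


print(check_navigation("https://www.casino.example/play"))
print(check_navigation("https://example.org/", moderation_flags=["deceptive activity"]))
print(check_navigation("https://example.org/"))
```

Each layer only needs to catch what the previous one missed, which is the sense in which the mitigations stack.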

The second category of risk is model mistakes, where the CUA model accidentally takes an action that the user didn’t intend, which in turn causes harm to the user or others. Hypothetical mistakes can range in severity, from a typo in an email, to purchasing the wrong item, to permanently deleting an important document. To minimize potential harm, we’ve developed the following mitigations:

  • User confirmations: The CUA model is trained to ask for user confirmation before finalizing tasks with external side effects, for example before submitting an order, sending an email, etc., so that the user can double-check the model’s work before it becomes permanent.

  • Limitations on tasks: For now, the CUA model will decline to help with certain higher-risk tasks, like banking transactions and tasks that require sensitive decision-making.

  • Watch mode: On particularly sensitive websites, such as email, Operator requires active user supervision, ensuring users can directly catch and address any potential mistakes the model might make.
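The three mitigations above can be pictured as a gate that each proposed action passes through. The Python sketch below is illustrative only: the text describes trained model behavior and system-level checks, whereas this sketch substitutes explicit rules to show one plausible ordering. The names `Decision`, `ProposedAction`, `gate`, and the category sets are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    ASK_USER = "ask_user"            # confirm before an external side effect
    REQUIRE_WATCH = "require_watch"  # pause until the user is supervising
    DECLINE = "decline"              # higher-risk task, not attempted for now


@dataclass(frozen=True)
class ProposedAction:
    kind: str           # e.g. "click", "submit_order", "send_email"
    site_category: str  # e.g. "shopping", "email", "banking"


# Hypothetical sets, chosen only to mirror the examples given in the text.
SIDE_EFFECT_KINDS = frozenset({"submit_order", "send_email"})
DECLINED_CATEGORIES = frozenset({"banking"})
WATCH_MODE_CATEGORIES = frozenset({"email"})


def gate(action: ProposedAction, user_is_watching: bool) -> Decision:
    """Apply the strictest applicable rule first."""
    if action.site_category in DECLINED_CATEGORIES:
        return Decision.DECLINE
    if action.site_category in WATCH_MODE_CATEGORIES and not user_is_watching:
        return Decision.REQUIRE_WATCH
    if action.kind in SIDE_EFFECT_KINDS:
        return Decision.ASK_USER
    return Decision.PROCEED
```

For example, under these assumed rules `gate(ProposedAction("send_email", "email"), user_is_watching=True)` returns `Decision.ASK_USER`: supervision satisfies watch mode, but sending the email still needs confirmation.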

One particularly important source of model mistakes is adversarial websites that try to cause the CUA model to take unintended actions through prompt injections, jailbreaks, and phishing attempts. In addition to the mitigations against model mistakes described above, we developed several further layers of defense to protect against these risks:

  • Cautious navigation: The CUA model is designed to identify and ignore prompt injections on websites. In an early internal red-teaming session, it recognized all but one case.

  • Monitoring: In Operator, we’ve implemented an additional model to monitor and pause execution if it detects suspicious content on the screen.

  • Detection pipeline: We’re applying both automated detection and human review pipelines to identify suspicious access patterns that can be flagged and rapidly added to the monitor (in a matter of hours).
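The interplay between the monitor and the detection pipeline can be sketched as follows. This is a simplified stand-in, not the actual system: the text describes the monitor as an additional model, while this sketch uses regular expressions purely to show the control flow of pausing execution and adding newly flagged patterns. The names `Monitor`, `add_pattern`, and `run_steps` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Stand-in for the monitoring model: flags screens matching known patterns."""
    patterns: list = field(default_factory=list)

    def add_pattern(self, regex: str) -> None:
        # In the text, the detection pipeline feeds newly flagged patterns
        # to the monitor; here that is just appending a compiled regex.
        self.patterns.append(re.compile(regex, re.IGNORECASE))

    def is_suspicious(self, screen_text: str) -> bool:
        return any(p.search(screen_text) for p in self.patterns)


def run_steps(screens: list, monitor: Monitor) -> tuple:
    """Walk through screens in order and pause at the first suspicious one.

    Returns (number of steps completed, whether execution was paused).
    """
    for completed, text in enumerate(screens):
        if monitor.is_suspicious(text):
            return completed, True
    return len(screens), False
```

With a monitor that has been given the pattern `ignore (all )?previous instructions`, a run over three screens whose second screen contains that phrase stops after one completed step and reports a pause.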


Finally, we evaluated the CUA model against frontier risks outlined in our Preparedness Framework, including scenarios involving autonomous replication and biorisk tooling. These assessments showed no incremental risk on top of GPT-4o.

For those interested in exploring the evaluations and safeguards in more detail, we encourage you to review the Operator System Card, a living document that provides transparency into our safety approach and ongoing improvements.

As many of Operator’s capabilities are new, so are the risks and mitigation approaches we’ve implemented. While we have aimed for state-of-the-art, diverse, and complementary mitigations, we expect these risks and our approach to evolve as we learn more. We look forward to using the research preview period as an opportunity to gather user feedback, refine our safeguards, and enhance agentic safety.


Conclusion

CUA builds on years of research advancements in multimodality, reasoning and safety. We have made significant progress in deep reasoning through the o-model series, vision capabilities through GPT-4o, and new techniques to improve robustness through reinforcement learning and instruction hierarchy. The next challenge space we plan to explore is expanding the action space of agents. The flexibility offered by a universal interface addresses this challenge, enabling an agent that can navigate any software tool designed for humans. By moving beyond specialized agent-friendly APIs, CUA can adapt to whatever computer environment is available—truly addressing the “long tail” of digital use cases that remain out of reach for most AI models.

We’re also working to make CUA available in the API, so developers can use it to build their own computer-using agents. As we continue to iterate on CUA, we look forward to seeing the different use cases the community will discover. We plan to use the real-world feedback we gather from this early preview to continuously refine CUA’s capabilities and safety mitigations to safely advance our mission of distributing the benefits of AI to everyone.




