Notes on Claude Computer Use

Anthropic has updated its Haiku and Sonnet lineup. Now, we have Haiku 3.5—a smaller model that outperforms Opus 3, the former state-of-the-art—and Sonnet 3.5, with enhanced coding abilities and a groundbreaking new feature called computer use. This is significant for everyone working in the field of AI agents.

As someone whose bread and butter is AI agents, I wanted to know how good it is and what it means for the future of the field.

I tested the model across several real-world use cases you might encounter, and in this article, I’ll walk you through each of them.

TL;DR

If you are short on time, here is a summary of the article.

  • Computer Use is Anthropic’s latest LLM capability. It lets Sonnet 3.5 determine the coordinates of UI components in a screenshot.
  • Equipping the model with tools like the Computer tool allows it to move the cursor and interact with the computer like an actual user.
  • The model could easily handle simple use cases, such as searching the internet, retrieving results, and creating spreadsheets.
  • It still relies on screenshots, so do not expect it to perform real-time tasks like playing Pong.
  • The model excels at many tasks, from research to filling out simple forms. It is expensive and slow, but its future looks bright.

What is Claude Computer Use?

The Computer Use feature takes Sonnet 3.5’s image understanding and logical reasoning to the next level, allowing it to interact with a computer directly. It can understand an image, locate components on the display, and then move the cursor, click, and type text to interact with the computer the way a human does.

The model reasons over the pixels in a screenshot and works out how many pixels, vertically and horizontally, it needs to move the cursor to click in the correct place.
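
Under the hood, this capability is exposed through the Messages API as a predefined "computer" tool. Here is a minimal sketch (not Anthropic’s demo code) of requesting it with the Anthropic Python SDK; it assumes you have an ANTHROPIC_API_KEY in your environment, and the display dimensions are just example values.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",   # Anthropic's predefined computer tool
        "name": "computer",
        "display_width_px": 1024,      # resolution of the screen the model will see
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open Firefox and find the top 5 movies of all time."}],
    betas=["computer-use-2024-10-22"],  # computer use sits behind this beta flag
)

# The reply contains tool_use blocks with concrete actions and pixel coordinates
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. computer {'action': 'screenshot'}

The demo app described below wraps this kind of call in a loop, executing each requested action inside a container and feeding the result back to the model.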

Sonnet is now officially the state-of-the-art model for computer interaction, scoring 14.7% on the OSWorld benchmark, almost double the score of the next-best model.

If you want to learn more, check out Anthropic’s official blog post; though they have not revealed much about their training, it is still a good read.

How to Set up Claude Computer Use?

Anthropic also released a reference implementation, a quickstart demo, to help you set it up quickly and explore how it works.

To use Sonnet, you need access through one of the supported providers: an Anthropic API key, AWS Bedrock, or Google Cloud Vertex AI. The README file explains how to set up each.

To get started, clone the repository.

git clone https://github.com/anthropics/anthropic-quickstarts

Move into the Computer Use demo directory.

cd anthropic-quickstarts/computer-use-demo

Now, pull the image and run the container. I used the Bedrock-hosted model; you can use other providers as well.

export AWS_PROFILE=<your_aws_profile>
docker run \
    -e API_PROVIDER=bedrock \
    -e AWS_PROFILE=$AWS_PROFILE \
    -e AWS_REGION=us-west-2 \
    -v $HOME/.aws/credentials:/home/computeruse/.aws/credentials \
    -v $HOME/.anthropic:/home/computeruse/.anthropic \
    -p 5900:5900 \
    -p 8501:8501 \
    -p 6080:6080 \
    -p 8080:8080 \
    -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

This will take some time; once it finishes, it spawns a Streamlit server inside the container.

You can then open the combined interface, the chat plus a view of the virtual desktop, at http://localhost:8080.

Send a message to see if everything is up and running.

The model takes a screenshot to confirm what is currently on the screen.

By default, it has access to a few apps: Firefox, a spreadsheet editor, a terminal, a PDF viewer, a calculator, etc.

Let’s see how it works

Example 1: Find the top 5 movies and make a CSV

I started with a simple internet search: I asked it to find the top five movies of all time. The model has access to the Computer tool, which lets it move the cursor and click on on-screen components.

The model dissects the prompt and develops step-by-step reasoning to complete the task.

It decided to visit MyAnimeList using Firefox.

From every screenshot, it calculates the coordinates of the required component and moves the cursor accordingly.

Each input corresponds to a specific action that determines what task to perform, and the inputs vary depending on the action type. For instance, if the action type is “type,” there will be a text input, while for a “mouse_move” action, the relevant inputs would be coordinates.
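
For illustration, here are a few inputs of the kind the computer tool receives. The shapes follow the published tool schema, but the specific coordinates and text below are made up.

# Illustrative computer-tool inputs; the values are invented for this example
example_actions = [
    {"action": "screenshot"},                                # capture the current screen
    {"action": "mouse_move", "coordinate": [512, 320]},      # move the cursor to (x, y) in pixels
    {"action": "left_click"},                                # click at the current cursor position
    {"action": "type", "text": "top 5 movies of all time"},  # type into the focused element
    {"action": "key", "text": "Return"},                     # press a key
]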

Based on the original prompt, screenshots, and reasoning, the model decides which actions to take.

The model moves the cursor, opens Firefox, finds the address bar, types the web address, scrolls down the page, and outputs the answers.
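
After each action, the demo’s loop returns the outcome to the model as a tool_result block, usually containing a fresh screenshot, which is what the next round of coordinate reasoning is based on. Here is a hedged sketch of that return message; the tool_use_id and image data are placeholders, not real values.

# Hypothetical tool_result message sending a screenshot back to the model;
# the id and image data are placeholders
screenshot_b64 = "<base64-encoded PNG bytes>"
screenshot_result = {
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": "toolu_0123",  # id of the tool_use block being answered
        "content": [{
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64},
        }],
    }],
}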

You can see that the model successfully fetched the movies. Next, I asked it to create a CSV file of the films (I asked for the top 10 this time).

The model created and updated the file using the shell (bash) tool and then opened it in LibreOffice.
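
The bash tool’s input is just a command string; a hypothetical call to write the CSV might look like this (the command and path are invented for illustration).

# Hypothetical bash-tool input for creating the CSV; command and path are invented
bash_action = {
    "name": "bash",
    "input": {"command": "printf 'rank,title\\n1,Example Movie\\n' > ~/movies.csv"},
}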

So far, the model has been able to execute the commands successfully. Sometimes it doesn’t get things right on the first try, so it retries until it does. Right now, this can get very expensive even for minor tasks.

Let’s see another example.

Example 2: Find the best restaurants based on the city’s weather

This time, I asked it to find me the best restaurants and the weather in Bengaluru. It successfully searched the web for the best restaurants and fetched the weather from AccuWeather.

What Does the Future Hold for Agents?

Anthropic is going all in on AI agents, and this release solidifies its stance on them. We can expect more labs to release models optimised for computer interaction.

At this stage, the model has a lot of room to improve. It is expensive, slow, and can hallucinate during execution, as the Claude team itself has pointed out. However, the future looks bright: smaller models like Haiku that can use a computer could be game-changing.

I would also like to see what OpenAI does next. Either way, the future is exciting.
