Imagine a world where AI agents can act as your personal assistant, completing tasks for you like setting up a return on Amazon or canceling meetings based on your emails. This would require agents to operate your applications interactively in complex workflows, and there really hasn’t been a great way to benchmark such agents. Until now.
AI assistants (e.g., those on our mobile phones) are improving as the underlying AI models improve. A few years back, they had difficulty answering simple factual questions correctly. Today, they are getting to the point where they can operate apps on our behalf to do basic tasks. For example, much of the recent Google I/O and Apple WWDC events centered on this vision of AI assistants as autonomous agents working on our behalf.
In the future, they will be able to autonomously complete more complex tasks on our apps. For example, you could say, “Hey, some of my coworkers have canceled meetings via email; please delete my corresponding phone reminders.” The agent would autonomously check your email inbox, figure out which coworkers canceled, go to the calendar app, determine which meetings are with those coworkers, and cancel them.
One of the ways AI models can tackle such tasks is by interactively writing code and calling APIs. APIs let agents perform elementary actions on apps, code lets them orchestrate those actions with complex logic and control flow, and interaction lets them explore user accounts and adapt based on code execution results.
Consider an example in the figure below, where the agent is tasked to launch a playlist with enough songs covering the user’s workout duration for today. For this, the agent first needs to write code calling SimpleNote APIs (1st code block) to find and “read” (print) the note containing the workout schedule. Only after this interaction to observe how the note is structured — seeing that duration is listed day-wise — can the agent write the necessary code (2nd code block), which involves finding today’s day-of-the-week and extracting the associated duration. To select a playlist, it must write rich code with for-loops and other control flow to iterate over playlists, compute playlist durations, and play one covering the workout duration (3rd code block).
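Here is a rough sketch of what these three code blocks might look like. The `apis` object stands for the namespace AppWorld exposes to agent code; the specific API names and response fields below are illustrative placeholders rather than the exact interface.

```python
# Illustrative sketch only: the apis.simple_note.* and apis.spotify.* calls and
# their response fields are hypothetical placeholders, not AppWorld's exact APIs.
import datetime

# 1st code block: find and print the workout note to see how it is structured.
notes = apis.simple_note.search_notes(query="workout schedule")
print(notes[0]["content"])  # output reveals that durations are listed day-by-day

# 2nd code block (written after observing the note): extract today's duration.
today = datetime.date.today().strftime("%A")  # e.g., "Tuesday"
schedule_line = next(row for row in notes[0]["content"].splitlines()
                     if row.startswith(today))
workout_minutes = int(schedule_line.split(":")[1].split()[0])  # "Tuesday: 45 min" -> 45

# 3rd code block: iterate over playlists and play one covering the workout.
for playlist in apis.spotify.show_playlists():
    total_minutes = sum(song["duration"] for song in playlist["songs"]) / 60
    if total_minutes >= workout_minutes:
        apis.spotify.play_playlist(playlist_id=playlist["id"])
        break
```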
Now that we know how an agent can complete such tasks, the question is:
How can we develop and benchmark such coding agents for everyday digital tasks across various apps?
For this, we need (i) a rich, stable, and reproducible execution environment where agents can interact with many day-to-day apps via code and APIs, (ii) complex tasks requiring API calls and rich and interactive coding, and (iii) a reliable evaluation framework.
Existing benchmarks like Gorilla, ToolBench, API-Bank, ToolTalk, and RestBench do not meet any of these three requirements. Besides lacking the aforementioned type of environment, their tasks only involve a linear sequence of 1–4 API calls, without the need for rich and interactive coding, and they evaluate by comparing the agent’s solution to a reference solution (using an LLM or a human), which does not work well for complex tasks that admit many valid solutions.
To address this gap, we introduce AppWorld, which constitutes (1) a controllable and simulated world environment (engine) where coding agents can operate various apps via APIs on behalf of people, (2) a benchmark of complex tasks defined on top of this environment, and (3) a robust evaluation framework for assessing agent performance.
⚙️ 2.1 Engine: simulated digital world
AppWorld Engine is a high-fidelity API-based simulator (60K lines of code) of an ecosystem of 9 day-to-day apps from various domains (Gmail for email, Amazon for shopping, Spotify for music, etc.). This engine is backed by a fully controllable local backend with 457 APIs and 100+ DB tables, closely mimicking the rich features of the real apps. These APIs have detailed documentation (explore interactively) that agents can read to understand their use.
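For instance, an agent can pull up documentation from inside its code sandbox before calling an app. The sketch below follows the paper’s description of the documentation APIs; treat the exact call names, arguments, and the example `search_songs` API as assumptions and consult the interactive docs for the real interface.

```python
# Sketch of documentation lookup from inside the agent's code sandbox.
# The apis.api_docs.* names and the "search_songs" API are assumptions based on
# the paper's description, not verified signatures.
print(apis.api_docs.show_app_descriptions())                    # list available apps
print(apis.api_docs.show_api_descriptions(app_name="spotify"))  # list Spotify's APIs
print(apis.api_docs.show_api_doc(app_name="spotify",
                                 api_name="search_songs"))      # full doc for one API
```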
We then simulate a digital world of people and their digital activities across these apps on top of this engine. In particular, we populate the app databases (DBs) with 106 fictitious people living in this simulated world. They are related to each other via various relationships, like roommates, friends, managers, etc., to allow for interpersonal tasks, like splitting bills with roommates. Then, we simulate their everyday lives, performing various personal and interpersonal activities on their app accounts, such as ordering t-shirts on Amazon for home delivery, asking a roommate for car keys over the phone, and so on. The final DBs have 300K+ rows spanning 726 columns.
📊 2.2 Benchmark of complex tasks
AppWorld Benchmark builds 750 day-to-day tasks on top of this engine (examples shown above), requiring many APIs (often 15+) spanning multiple apps (1–4) and rich, interactive coding (often 80+ lines with many programming constructs). See the statistics in the figure below and explore tasks interactively on our playground.
Each task instruction comes with a supervisor (a person in AppWorld) on whose behalf the agent is to do the task. The agent has access to all of their app accounts. Each task’s initial database state is carefully designed (programmatically) to ensure the task is well-defined and has realistic distractions and hurdles. The tasks also come with task variations, which holistically check whether an agent can solve the task reliably under different initial conditions and instruction variations.
All task implementations are designed and developed by us (not crowdsourced). Their implementations span over 40K lines of code (yes, a lot goes into task development; see the paper).
✔️ 2.3 Robust evaluation framework
The complex tasks in AppWorld can be completed in many ways (e.g., an order receipt may be retrieved via an Amazon API or from its confirmation email). Further, an agent solving the task can cause collateral damage in many different ways (e.g., initiating a return that was not asked for). So, a process-based approach that compares agent-generated code to reference code or API calls is inadequate for evaluating task completion.
Instead, AppWorld uses a state-based approach. In particular, for each task, we define a programmatic suite of unit tests that take two snapshots of the database state as input: (1) the state before the agent starts and (2) the state after it ends. We then check that all expected database changes, and no unexpected ones, are made. This allows us to robustly check whether an agent completed the task correctly without causing collateral damage.
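As a toy illustration of this idea (the snapshot format and helper below are made up for this post and are not AppWorld’s actual test harness), a state-based check for the earlier reminder-deletion task could look like this:

```python
# Hypothetical sketch of a state-based evaluation check; the snapshot layout and
# function names are illustrative, not AppWorld's internal test code.

def evaluate_reminder_deletion(before: dict, after: dict, canceled_ids: set) -> bool:
    """Pass iff exactly the reminders for canceled meetings were removed."""
    before_reminders = {r["id"]: r for r in before["phone"]["reminders"]}
    after_reminders = {r["id"]: r for r in after["phone"]["reminders"]}

    for rid, reminder in before_reminders.items():
        if rid in canceled_ids:
            if rid in after_reminders:              # expected deletion did not happen
                return False
        elif after_reminders.get(rid) != reminder:  # collateral damage to other reminders
            return False

    return set(after_reminders) <= set(before_reminders)  # no unexpected additions


# Toy usage with miniature database snapshots:
before = {"phone": {"reminders": [{"id": 1, "title": "Mtg w/ Alex"},
                                  {"id": 2, "title": "Mtg w/ Sam"}]}}
after = {"phone": {"reminders": [{"id": 2, "title": "Mtg w/ Sam"}]}}
print(evaluate_reminder_deletion(before, after, canceled_ids={1}))  # True
```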
Finally, to ensure the tasks are solvable, we write validation solutions and programmatically verify that running them passes all evaluation tests.
We have benchmarked many LLMs with several few-shot prompting methods, like ReAct, plan-and-execute, full-code generation with reflection, and function calling. Even the best LLM, GPT-4o, performs quite poorly: for example, it correctly completes only ~30% of the tasks in the challenge test set. GPT-4 Turbo and open LLMs lag much further behind.
In addition, the scores are much lower for our stricter robustness metric, which checks whether agents can reliably complete all task variations under different starting conditions and instruction perturbations.
Furthermore, the scores drop substantially with increasing difficulty, as per our hand-assigned difficulty labels and other indicators of difficulty, like the number of APIs and lines of code in our validation solutions.
AppWorld is a modular and extensible foundation that opens up many exciting possibilities in automating digital tasks. E.g., future works can:
- Extend the AppWorld engine to support browser/mobile UI-based control for the existing tasks to provide a unified benchmark for code, API, and UI-based autonomous agents.
- Extend the AppWorld benchmark to have tasks requiring multi-agent (and human) coordination and collaboration (e.g., set up a calendar meeting with a friend by coordinating with their agent over email).
- Overlay our digital world engine onto a physical world engine, like Simulacra, with role-playing agents to study social dynamics and behavior in a controlled environment.
- Use the engine as a no-consequence sandbox to study potential privacy and safety risks that may arise when digital assistants are given the “agency” to act on our behalf in the real world.
- And, of course, extend AppWorld to a yet larger ecosystem of apps and tasks.
We are excited for ourselves and others to pursue these directions (and more!) on top of AppWorld. Reach out if you need help or want to collaborate!
AppWorld is easy to use and fast. You can pip install its open-source Python package and start building and testing your agent. If you have an agent, the following code is all you need to run and evaluate it on AppWorld.
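The sketch below is adapted from the package’s quick-start; the `my_agent` function is a placeholder for your own code-generating agent, and you should verify the exact interface (function names, splits, return values) against the documentation at appworld.dev.

```python
# Minimal sketch of running and evaluating an agent on AppWorld tasks.
# `my_agent` is a trivial placeholder for your own LLM-backed agent.
from appworld import AppWorld, load_task_ids


def my_agent(instruction: str, outputs: list) -> str:
    """Placeholder: replace with an agent that writes the next code block to execute."""
    return "print('replace me with real API-calling code')"


for task_id in load_task_ids("dev"):                         # a dataset split, e.g., "dev"
    with AppWorld(task_id=task_id, experiment_name="my_agent_run") as world:
        outputs = []
        for _ in range(30):                                  # cap the interaction budget
            code = my_agent(world.task.instruction, outputs)
            outputs.append(world.execute(code))              # run code in the simulated world
            if world.task_completed():                       # agent has signaled completion
                break
        print(world.evaluate())                              # run the task's evaluation tests
```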
For paper, code, leaderboard, data explorer (tasks, APIs, agent trajectories), interactive playground (interact directly with AppWorld tasks), video explainer, and more, visit https://appworld.dev.
NEW: AppWorld won the Best Resource Paper award at ACL’24. 🏆 🎉
Image Source: All images are created by the author.