OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie¹, Danyang Zhang¹, Jixuan Chen¹, Xiaochuan Li¹,
Siheng Zhao¹, Ruisheng Cao¹, Toh Jing Hua¹, Zhoujun Cheng¹, Dongchan Shin¹, Fangyu Lei¹, Yitao Liu¹, Yiheng Xu¹, Shuyan Zhou³, Silvio Savarese², Caiming Xiong², Victor Zhong⁴, Tao Yu¹

¹The University of Hong Kong, ²Salesforce Research, ³Carnegie Mellon University, ⁴University of Waterloo

2025-07-28: Major Upgrade! OSWorld has been enhanced and is now OSWorld-Verified with comprehensive improvements: fixed community-reported examples, AWS support reducing evaluation time to within 1 hour, and updated benchmark results. See the verified benchmark results in the Benchmark section below. Please compare your OSWorld results with the new benchmark results when running the latest version.

Paper Code Doc Data Data Viewer Slides Twitter Discord

**OSWorld** is a first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across operating systems. It can serve as a unified environment for evaluating open-ended computer tasks that involve arbitrary apps (e.g., task examples in the above Fig). We also create a benchmark of 369 real-world computer tasks in **OSWorld** with reliable, reproducible setup and evaluation scripts. *Note: 8 Google Drive tasks may require manual configuration or can be excluded (361 tasks) due to network dependencies.*

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce **OSWorld**, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. **OSWorld** can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon **OSWorld**, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications (note: 8 Google Drive tasks may require manual setup or can be excluded for a 361-task evaluation). Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on **OSWorld** reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using **OSWorld** provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks.

OSWorld Environment Infrastructure

The **OSWorld** environment uses a configuration file for initializing tasks *(highlighted in red)*, agent interaction, post-processing upon agent completion *(highlighted in orange)*, retrieving files and information *(highlighted in yellow)*, and executing the evaluation function *(highlighted in green)*. The corresponding configuration items are highlighted in colors that match their respective components within the environment. Environments can run in parallel on a single host machine for learning or evaluation purposes. Headless operation is supported.

Data Statistics and Comparison

Below we present an overview of the main statistics of **OSWorld**, showcasing the outline and a broad spectrum of tasks. **OSWorld** contains a total of 369 tasks (and an additional 43 tasks on Windows for analysis).

Key statistics of OSWorld.

The "Supp. tasks" refers to the Windows-based tasks, that could only be used after activation due to copyright restrictions.
data-overview

Distribution of task instructions in OSWorld
based on the app domains and operation types to showcase the content intuitively.

We make a comparison of **OSWorld** against some other different benchmarks for digital agents as presented below.
**The columns indicate:** whether they provide a controllable executable environment *(Control. Exec. Env.)*, the ease of adding new tasks involving arbitrary applications in open domains *(Environment Scalability)*, support for multimodal agent evaluation *(Multimodal Support)*, support for and inclusion of cross-app tasks *(Cross-App)*, capability to start tasks from an intermediate initial state *(Intermediate Init. State)*, and the number of execution-based evaluation functions *(# Exec.-based Eval. Func.)*.

	OSWorld
# Instances (# Templates)	369
Control. Exec. Env.	Computer
Environment Scalability?	✔️
Multimodal Support?	✔️
Cross-App?	✔️
Intermediate Init. State?	✔️
# Exec.-based Eval. Func.	134

GAIA	Mind2Web	WebLINX	PixelHelp	MetaGUI	AitW	OmniAct	ScreenAgent	AgentBench	InterCode	MiniWoB++	WebShop	WebArena	VisualWebArena	WorkArena	WikiHow	AssistGUI
466	2350	2337	187	1125	30k	9802	70	1091	1350(3)	125	12k(1)	812(241)	910(314)	23k(29)	150(16)	100
❌	❌	❌	❌	❌	❌	❌	❌	Multi-isolated	Code	Web	Web	Web	Web	Web	Mobile	❌
-	-	-	-	-	-	-	-	❌	❌	❌	❌	❌	❌	❌	❌	❌
❌	✔️	✔️	✔️	✔️	✔️	✔️	✔️	❌	❌	✔️	✔️	✔️	✔️	✔️	✔️	✔️
❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
❌	✔️	✔️	❌	❌	✔️	✔️	✔️	❌	❌	❌	❌	❌	❌	✔️	❌	✔️
0	0	0	0	0	0	0	0	7	3	125	1	5	6	7	16	2

Benchmark

We adopt state-of-the-art LLM and VLM from open-source representatives such as UI-TARS, Agent-S, Qwen, Mixtral and CogAgent, and closed-source ones from Operator, GPT, Gemini, and Claude families on **OSWorld**, as LLM and VLM agent baselines. We also explore methods such as the Set-of-Marks aided approach, which has been demonstrated to improve spatial capabilities for visual reasoning. **We are actively updating the benchmark with new LLMs, VLMs and methods. Pull requests welcomed!**

Important Notice: Google Drive Tasks (2025-07-28)

OSWorld contains 8 Google Drive-related tasks that may encounter setup issues during task initialization due to IP changes or other network-related factors, even when following our configuration guidelines correctly.

Two acceptable approaches for evaluation:

Manual Adjustment: You can manually configure these 8 tasks to complete the full 369 tasks evaluation
Exclude Tasks: You can exclude these 8 tasks and run 361 tasks instead - this is officially permitted and acceptable

Both approaches are valid for benchmark comparison and leaderboard submission.

Results

These are official results evaluated by our team under unified settings and environment. All models are tested with consistent evaluation protocols to ensure fair comparison.

For self-reported results and progress trends across different modalities, click here.

All verified trajectories are hosted on Hugging Face for community analysis.

What are the differences among General model, Specialized model, and Agentic framework?
A General model is a model with broad, general-purpose capabilities. "Computer use" is one capability that can be elicited via prompting; the model itself can still perform other tasks such as dialogue and code generation. A Specialized model is trained specifically to serve as a computer-use agent; other capabilities are out of scope and are not emphasized in the corresponding reports. An Agentic framework organizes one or more General and Specialized models into a structured workflow—commonly, a GPT-family model acts as the planner while a proprietary or task-specific model serves as the grounder.
We will add new paradigms as they emerge.

OSWorld-Verified Results

`; let tableHTML = subTabsHTML + `

Filters:

Model Type:

${isFoundationMode ? '' : `

Tags: A11y tree Additional coding-based action Multiple rollout

Max Steps:

Sort by:

${isFoundationMode ? '' : `

Used additional a11y tree Used Additional coding-based action 🔁Multiple rollout

Rank	Model & Date	Approach & Details	Success Rate (Avg±Std)

`; container.innerHTML = tableHTML; // Store data globally for filtering window.verifiedTableData = aggregatedData; // Attach sub-tab switching const scopeTabs = document.querySelectorAll('#verifiedScopeTabs a'); scopeTabs.forEach(a => { a.addEventListener('click', (e) => { const title = a.getAttribute('title'); window.verifiedSubTab = (title === 'Foundation E2E GUI') ? 'foundation' : 'all'; // Re-render with the selected scope renderVerifiedTable(window.verifiedRawData, 'BoardPanel5'); }); }); // Debug: Check date formats in the data if (validData.length > 0) { console.log('Date debugging - first few rows:'); validData.slice(0, 3).forEach((row, i) => { console.log(`Row ${i}: Date = ${row.Date} (type: ${typeof row.Date})`); }); } // Set default filter selections and render setDefaultModelTypeFilters(); filterVerifiedTable(); } function setDefaultModelTypeFilters() { // Set default selections: Specialized model and General model let defaultSelections = ['Specialized model', 'General model']; if (window.verifiedSubTab === 'all') { defaultSelections = ['Specialized model', 'General model', 'Agentic framework']; } const allCheckbox = document.getElementById('approachFilterAll'); // Enable all item checkboxes and select default ones const itemCheckboxes = document.querySelectorAll('.approachFilterItem'); itemCheckboxes.forEach(cb => { cb.disabled = false; cb.checked = defaultSelections.includes(cb.value); const label = cb.nextElementSibling; if (label) { label.style.color = '#333'; } }); // If all visible items are selected by default, mark "All" as checked (do not trigger disabling) if (allCheckbox) { const allSelected = Array.from(itemCheckboxes).every(cb => cb.checked || cb.disabled); allCheckbox.checked = allSelected; } updateModelTypeDisplay(); } function renderVerifiedTableRows(data) { const tbody = document.getElementById('verifiedTableBody'); if (!tbody) return; // Get current sort preference const sortBy = document.getElementById('sortBy')?.value || 'score'; // Sort data based on selection let sortedData = [...data]; if (sortBy === 'score') { sortedData.sort((a, b) => { const scoreA = parseFloat(a['Success rate avg']) || 0; const scoreB = parseFloat(b['Success rate avg']) || 0; return scoreB - scoreA; }); } else if (sortBy === 'date') { sortedData.sort((a, b) => { const parseExcelDateForSort = (dateVal) => { if (!dateVal || dateVal === 'NaT') return new Date(0); try { let date; const dateStr = String(dateVal); // Method 1: Check if it's already a valid Date object if (dateVal instanceof Date && !isNaN(dateVal.getTime())) { return dateVal; } // Method 2: Excel serial number if (typeof dateVal === 'number' && dateVal > 40000 && dateVal < 50000) { date = new Date((dateVal - 25569) * 86400 * 1000); if (!isNaN(date.getTime()) && date.getFullYear() >= 2020) { return date; } } // Method 3: ISO date string patterns if (dateStr.includes('T') || dateStr.match(/^\d{4}-\d{2}-\d{2}/)) { const isoMatch = dateStr.match(/(\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2})?)/); if (isoMatch) { date = new Date(isoMatch[1]); if (!isNaN(date.getTime()) && date.getFullYear() >= 2020) { return date; } } } // Method 4: Try direct parsing date = new Date(dateVal); if (!isNaN(date.getTime()) && date.getFullYear() >= 2020 && date.getFullYear() <= 2030) { return date; } return new Date(0); } catch (e) { return new Date(0); } }; const dateA = parseExcelDateForSort(a.Date); const dateB = parseExcelDateForSort(b.Date); return dateB - dateA; // Newest first }); } else if (sortBy === 'model') { sortedData.sort((a, b) => { const modelA = (a.Model || '').toLowerCase(); const modelB = (b.Model || '').toLowerCase(); return modelA.localeCompare(modelB); }); } let tableRowsHTML = ''; sortedData.forEach((row, index) => { const rank = index + 1; const isFirstRank = rank === 1; // Create unique identifier for this row const uniqueId = `${row.Model || 'unknown'}-${row['Max steps'] || 'unknown'}`.replace(/[^a-zA-Z0-9-]/g, '-'); const score = parseFloat(row['Success rate avg']).toFixed(1); // Check if Success rate std exists and add it to score display let scoreDisplay = `${score}%`; if (row['Success rate std'] && row['Success rate std'] > 0) { const std = parseFloat(row['Success rate std']).toFixed(1); scoreDisplay = `${score}±${std}%`; } // Generate category breakdown for horizontal expansion let categoryBreakdown = ''; if (row.categoryStats && Object.keys(row.categoryStats).length > 0) { const categoryHeaders = Object.keys(row.categoryStats).map(category => ` ${category} `).join(''); // Show original data for each run if (row.runCount > 1) { // Multiple runs - show each run separately let tableRows = ''; row.runs.forEach((run, runIndex) => { const runValues = Object.keys(row.categoryStats).map(category => { const runValue = run[category] || ''; return ` ${runValue} `; }).join(''); tableRows += ` Run ${runIndex + 1} ${runValues} `; }); categoryBreakdown = ` ${categoryHeaders} ${tableRows}

Run

`; } else { // Single run - show original data directly const categoryValues = Object.keys(row.categoryStats).map(category => { const originalValue = row.categoryStats[category].originalValues[0] || ''; return ` ${originalValue} `; }).join(''); categoryBreakdown = ` ${categoryHeaders} ${categoryValues}

`; } } // Parse date with robust fallback for XLSX.js data const parseExcelDate = (dateVal) => { if (!dateVal || dateVal === 'NaT') return null; // Log for debugging console.log('Parsing date:', dateVal, 'type:', typeof dateVal); try { let date; const dateStr = String(dateVal); // Method 1: Check if it's already a valid Date object if (dateVal instanceof Date && !isNaN(dateVal.getTime())) { return dateVal; } // Method 2: Excel serial number (XLSX.js sometimes returns this) if (typeof dateVal === 'number' && dateVal > 40000 && dateVal < 50000) { // Excel serial date: days since 1900-01-01 date = new Date((dateVal - 25569) * 86400 * 1000); if (!isNaN(date.getTime()) && date.getFullYear() >= 2020) { return date; } } // Method 3: ISO date string patterns (from pandas Timestamp) if (dateStr.includes('T') || dateStr.match(/^\d{4}-\d{2}-\d{2}/)) { // Remove any extra text like "Timestamp('2025-07-28 00:00:00')" const isoMatch = dateStr.match(/(\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2})?)/); if (isoMatch) { date = new Date(isoMatch[1]); if (!isNaN(date.getTime()) && date.getFullYear() >= 2020) { return date; } } } // Method 4: Try direct parsing date = new Date(dateVal); if (!isNaN(date.getTime()) && date.getFullYear() >= 2020 && date.getFullYear() <= 2030) { return date; } // Method 5: Manual regex parsing for various formats const patterns = [ /(\d{4})-(\d{2})-(\d{2})/, // YYYY-MM-DD /(\d{2})\/(\d{2})\/(\d{4})/, // MM/DD/YYYY /(\d{4})\/(\d{2})\/(\d{2})/ // YYYY/MM/DD ]; for (const pattern of patterns) { const match = dateStr.match(pattern); if (match) { let year, month, day; if (pattern.source.startsWith('(\\d{4})')) { [_, year, month, day] = match; } else { [_, month, day, year] = match; } date = new Date(parseInt(year), parseInt(month) - 1, parseInt(day)); if (!isNaN(date.getTime()) && date.getFullYear() >= 2020) { return date; } } } return null; } catch (e) { console.warn('Date parsing error for:', dateVal, e); return null; } }; // Format date let dateDisplay = ''; const parsedDate = parseExcelDate(row.Date); if (parsedDate) { dateDisplay = parsedDate.toLocaleDateString('en-US', { year: 'numeric', month: 'short', day: 'numeric' }); } else if (row.Date) { // Show raw value if parsing failed dateDisplay = String(row.Date); } tableRowsHTML += `

${isFirstRank ? `${rank} *` : rank}

${row.Model || ''} ${row.hasAdditionalA11yTree ? ` ` : ''} ${(!window.verifiedSubTab || window.verifiedSubTab === 'all') && row.hasAdditionalTool ? ` ` : ''} ${(!window.verifiedSubTab || window.verifiedSubTab === 'all') && row.hasMultipleRollout ? ` 🔁 ` : ''}

${row.Institution || ''}

${dateDisplay ? `

📅 ${dateDisplay}

` : ''} ${row.PaperLink ? `${row.PaperAuthors || 'Paper Link'}` : ''}

Type: ${row['Approach type'] || 'N/A'}

Max Steps: ${row['Max steps'] || 'N/A'}

${row.runCount ? `

Runs: ${row.runCount}

` : ''}

${scoreDisplay} ${row.categoryStats && Object.keys(row.categoryStats).length > 0 ? ` ` : ''}

${categoryBreakdown}

`; }); tbody.innerHTML = tableRowsHTML; } function toggleCategories(uniqueId) { const categoryCell = document.getElementById(`category-cell-${uniqueId}`); const toggleButton = document.getElementById(`toggle-${uniqueId}`); const benchmarkContainer = document.querySelector('#benchmark .container'); if (categoryCell && toggleButton) { const isExpanded = categoryCell.style.width !== '0px' && categoryCell.style.width !== '0' && categoryCell.style.width !== ''; if (!isExpanded) { // Expand this row categoryCell.style.width = 'auto'; categoryCell.style.maxWidth = 'none'; categoryCell.style.padding = '8px'; categoryCell.style.opacity = '1'; categoryCell.style.whiteSpace = 'normal'; toggleButton.textContent = '▼'; toggleButton.title = 'Hide domain breakdown'; // Scale up content const contentDiv = document.getElementById(`content-${uniqueId}`); if (contentDiv) { contentDiv.style.transform = 'scale(1)'; } // Expand container when showing categories if (benchmarkContainer) { benchmarkContainer.style.width = '100%'; benchmarkContainer.style.maxWidth = 'none'; benchmarkContainer.style.overflowX = 'auto'; } // Close other expanded rows document.querySelectorAll('.category-cell').forEach(cell => { if (cell.id !== `category-cell-${uniqueId}`) { const otherCellExpanded = cell.style.width !== '0px' && cell.style.width !== '0' && cell.style.width !== ''; if (otherCellExpanded) { const otherId = cell.id.replace('category-cell-', ''); const otherContentDiv = document.getElementById(`content-${otherId}`); if (otherContentDiv) { otherContentDiv.style.transform = 'scale(0.95)'; } cell.style.width = '0'; cell.style.maxWidth = '0'; cell.style.padding = '0'; cell.style.opacity = '0'; cell.style.whiteSpace = 'nowrap'; const otherButton = document.getElementById(`toggle-${otherId}`); if (otherButton) { otherButton.textContent = '▶'; otherButton.title = 'Show domain breakdown'; } } } }); } else { // Collapse this row const contentDiv = document.getElementById(`content-${uniqueId}`); if (contentDiv) { contentDiv.style.transform = 'scale(0.95)'; } categoryCell.style.width = '0'; categoryCell.style.maxWidth = '0'; categoryCell.style.padding = '0'; categoryCell.style.opacity = '0'; categoryCell.style.whiteSpace = 'nowrap'; toggleButton.textContent = '▶'; toggleButton.title = 'Show domain breakdown'; // Check if any other rows are expanded const hasExpanded = Array.from(document.querySelectorAll('.category-cell')).some(cell => { return cell.style.width !== '0px' && cell.style.width !== '0' && cell.style.width !== ''; }); // If no rows are expanded, restore container size if (!hasExpanded && benchmarkContainer) { benchmarkContainer.style.width = '95%'; } } } } function toggleModelTypeDropdown() { const dropdown = document.getElementById('modelTypeDropdownMenu'); const isVisible = dropdown.style.display === 'block'; dropdown.style.display = isVisible ? 'none' : 'block'; } function updateModelTypeDisplay() { const allCheckbox = document.getElementById('approachFilterAll'); const itemCheckboxes = document.querySelectorAll('.approachFilterItem:checked:not(:disabled)'); const displaySpan = document.getElementById('modelTypeSelected'); if (allCheckbox.checked) { displaySpan.textContent = 'All'; } else if (itemCheckboxes.length === 0) { displaySpan.textContent = 'None selected'; } else if (itemCheckboxes.length === 1) { displaySpan.textContent = itemCheckboxes[0].value; } else { displaySpan.textContent = `${itemCheckboxes.length} selected`; } } function handleApproachAllChange() { const allCheckbox = document.getElementById('approachFilterAll'); const itemCheckboxes = document.querySelectorAll('.approachFilterItem'); if (allCheckbox.checked) { // Check all items (do not disable; allow user to uncheck later) itemCheckboxes.forEach(cb => { cb.checked = true; cb.disabled = false; const label = cb.nextElementSibling; if (label) label.style.color = '#333'; }); } else { // Uncheck all items itemCheckboxes.forEach(cb => { cb.disabled = false; cb.checked = false; const label = cb.nextElementSibling; if (label) label.style.color = '#333'; }); } updateModelTypeDisplay(); filterVerifiedTable(); } function handleApproachItemChange() { const allCheckbox = document.getElementById('approachFilterAll'); const itemCheckboxes = document.querySelectorAll('.approachFilterItem'); // If all items are checked, check the "All" checkbox const allItemsChecked = Array.from(itemCheckboxes).every(cb => cb.checked); if (allItemsChecked && !allCheckbox.checked) { allCheckbox.checked = true; handleApproachAllChange(); return; } updateModelTypeDisplay(); filterVerifiedTable(); } function filterVerifiedTable() { if (!window.verifiedTableData) return; // Get selected approach types const allApproachChecked = document.getElementById('approachFilterAll').checked; let selectedApproaches = []; if (!allApproachChecked) { const itemCheckboxes = document.querySelectorAll('.approachFilterItem:checked'); selectedApproaches = Array.from(itemCheckboxes).map(cb => cb.value); } const maxStepsFilter = document.getElementById('maxStepsFilter').value; // Additional filters const filterA11y = document.getElementById('filterA11y')?.checked; const filterTool = document.getElementById('filterTool')?.checked || document.getElementById('filterCoding')?.checked; const filterMulti = document.getElementById('filterMulti')?.checked; let filteredData = window.verifiedTableData; // Filter by approach type (only if not "All" and has selections) if (!allApproachChecked) { if (selectedApproaches.length > 0) { filteredData = filteredData.filter(row => selectedApproaches.includes(row['Approach type'])); } else { // If no approach types selected, show no results filteredData = []; } } if (maxStepsFilter) { filteredData = filteredData.filter(row => String(row['Max steps']) === maxStepsFilter); } // Apply additional filters (only in All tab) const isFoundationModeLocal = (window.verifiedSubTab === 'foundation'); if (!isFoundationModeLocal) { // A11y tree tag: checked => include both Yes and No; unchecked => only No if (filterA11y === false) { filteredData = filteredData.filter(row => !row.hasAdditionalA11yTree); } // Additional coding-based action tag: checked => include both Yes and No; unchecked => only No if (filterTool === false) { filteredData = filteredData.filter(row => !row.hasAdditionalTool); } // Multiple rollout tag: checked => include both Yes and No; unchecked => only No if (filterMulti === false) { filteredData = filteredData.filter(row => !row.hasMultipleRollout); } } renderVerifiedTableRows(filteredData); } function sortVerifiedTable() { filterVerifiedTable(); // Re-apply current filters with new sort } function clearVerifiedFilters() { // Reset approach type to default selections setDefaultModelTypeFilters(); // Close dropdown document.getElementById('modelTypeDropdownMenu').style.display = 'none'; document.getElementById('maxStepsFilter').value = ''; document.getElementById('sortBy').value = 'score'; const a11yCb = document.getElementById('filterA11y'); const toolCb = document.getElementById('filterTool'); const codingCb = document.getElementById('filterCoding'); if (a11yCb) a11yCb.checked = true; if (toolCb) toolCb.checked = true; if (codingCb) codingCb.checked = true; if (window.verifiedTableData) { // Apply default filter filterVerifiedTable(); } } function initTabSwitching() { // Add event listeners for both tab containers document.querySelectorAll("#verifiedOnlyTabs li, #allTabsContainer li").forEach(e => { e.addEventListener("click", function(eve) { const clickedElement = eve.target; const tabTitle = clickedElement.getAttribute('title'); // Find the currently visible tab container const verifiedTabs = document.getElementById('verifiedOnlyTabs'); const allTabs = document.getElementById('allTabsContainer'); const activeContainer = verifiedTabs.style.display !== 'none' ? verifiedTabs : allTabs; // Update active tab only in the active container activeContainer.querySelectorAll('a').forEach(link => { if (link.getAttribute('title') === tabTitle) { link.parentElement.className = "is-active"; } else { link.parentElement.className = ""; } }); // Show/hide corresponding content for (let block of document.getElementsByClassName('lib_examples')) { block.style.display = (block.title === tabTitle) ? 'block' : 'none'; } }); }); } // Handle showing self-reported results async function showSelfReportedResults() { // First load self-reported data if not already loaded await loadSelfReportedData(); // Hide verified-only tabs and show all tabs document.getElementById('verifiedOnlyTabs').style.display = 'none'; document.getElementById('allTabsContainer').style.display = 'block'; // Show progress chart section - try multiple selectors let chartSection = document.querySelector('section .container.is-max-desktop .content.has-text-justified'); if (!chartSection) { chartSection = document.querySelector('.content.has-text-justified'); } if (!chartSection) { // Find the chart container by the canvas element const canvas = document.getElementById('progressChart'); if (canvas) { chartSection = canvas.closest('.content'); } } if (chartSection) { chartSection.style.display = 'block'; console.log('Chart section found and shown'); // Initialize chart with delay to ensure DOM is ready setTimeout(() => { console.log('Initializing chart...'); initProgressChart(); }, 300); } else { console.error('Chart section not found'); } // Update the description text const descContainer = document.querySelector('.infoBody > div:first-child'); if (descContainer) { descContainer.innerHTML = `

OSWorld Leaderboard

Verified Results: Official results evaluated by our team under unified settings. Self-reported Results: Results submitted by the community across different modalities.

`; } } document.addEventListener('DOMContentLoaded', () => { loadAndRenderExcelData(); // Add click handler for showing self-reported results const showLink = document.getElementById('showSelfReportedLink'); if (showLink) { showLink.addEventListener('click', function(e) { e.preventDefault(); showSelfReportedResults(); }); } // Close dropdown when clicking outside document.addEventListener('click', function(e) { const dropdown = document.getElementById('modelTypeDropdownMenu'); const button = document.getElementById('modelTypeDropdown'); if (dropdown && button && !dropdown.contains(e.target) && !button.contains(e.target)) { dropdown.style.display = 'none'; } }); });

Loading verified benchmark data...

Analysis

We conduct a qualitative analysis in the aspect of models, methods, and human to find out factors influencing the performance of VLMs in digital agent tasks and their underlying behavioral logic. We investigate the impact of task attributes (such as difficulty, feasibility, visual requirement, and GUI complexity), input measurements (such as screenshot resolution, the influence of trajectory history, and the effect of UI layout), and explore whether there are patterns in the agent's performance across different operating systems. Here is an overview of our analysis outcome.

Higher screenshot resolution typically leads to improved performance.

Longer text-based trajectory history context improves performance, unlike screenshot-only history, but poses efficiency challenges.

Current VLM agents are not robust to UI layout and noise.

The performance of VLM agents across different OS is in strong correlation. This implies that insights and methodologies developed within the OSWorld framework can be effectively transferred to Windows environments with a high degree of reliability.

A success case of LLM/VLM agent baselines.

Videos

Special thanks to the following YouTubers and enthusiasts for their reports. We are delighted to see the community's interest. If you would like a brief video introduction and their thoughts, feel free to check them out!

@Yannic Kilcher

@Wes Roth

@hu-po

@Dylan Curious

@WorldofAI

@Gourcer

@AI Explained

@Fireship

@1littlecoder

Acknowledgement

We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Haoyuan Wu, Junli Wang, Chengyou Jia, Junlin Yang, Junlei Zhang, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work

Acknowledgement for OSWorld-Verified

Special thanks to the following institutions that provided feedback and participated in the fixes (as well as institutions that provided feedback during the process): [MoonShot AI, a.k.a. Kimi](https://www.moonshot.ai/)，[Human Data](https://www.hud.so/), [OpenAI](https://openai.com/), [ByteDance Seed TARS](https://seed-tars.com/), [Anthropic](https://www.anthropic.com/), [Simular](https://www.simular.ai/), [HKU Data Intelligence Lab](https://sites.google.com/view/chaoh)

Special thanks to the following students who participated in the specific fixes: [Mengqi Yuan](https://yuanmengqi.github.io/), [Danyang Zhang](https://zdy023.github.io/), [Xinzhuang Xiong](https://thisisxxz.com/), [Zhennan Shen](https://scholar.google.com/citations?user=JPwg5MwAAAAJ&hl=en), [Zilong Zhou](https://github.com/adlsdztony), Yanxu Chen, [Jiaqi Deng](https://millank0817.github.io/), [Tianbao Xie](https://tianbaoxie.com/), Junda Chen, [Jixuan Chen](https://chenjix.github.io/), [Haoyuan Wu](https://www.linkedin.com/in/haoyuan-wu-240878291/).

Special thanks to the following students who participated in running the re-evaluation: [Mengqi Yuan](https://yuanmengqi.github.io/), [Zilong Zhou](https://github.com/adlsdztony), [Xinyuan Wang](https://xinyuanwangcs.github.io/), [Bowen Wang](https://bowenbryanwang.github.io/).

Evaluation

Local Evaluation

Please start by reading through the [agent interface](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/README.md) and the [environment interface](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/README.md). Correctly implement the agent interface and import your customized version in the `run.py` (for single-threaded execution) or `scripts/python/run_multienv.py` / `scripts/python/run_multienv_xxx.py` (for parallel execution) file. Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.

Public Evaluation

If you want your results to be verified and displayed on the verified leaderboard, you need to schedule a meeting with us (current maintainer: tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) to run your agent code on our side and have us report the results. You need to upload and allow us to disclose your agent implementation under the OSWorld framework (you may choose not to expose your model API to the public), along with a report that allows the public to understand what's happening behind the scenes. Alternatively, if you are from a trusted institution, you can share your monitoring data and trajectories with us. Please carefully follow the [Setup Guideline - Public Evaluation Platform](https://github.com/xlang-ai/OSWorld/blob/main/SETUP_GUIDELINE.md#3-public-evaluation-platform) to get results.

FAQ

What is the username and password for the virtual machines?

The username and password for the virtual machines are as follows (for provider vmware, virtualbox and docker): we set the account credentials for Ubuntu as user / password.
For cloud service providers like aws, to prevent attacks due to weak passwords, we default to osworld-public-evaluation.
If you make further modifications, remember to set the client_password variable and pass it to DesktopEnv and Agent (if supported) when running experiments.
Some features like setting up proxy require the environment to have the client VM password to obtain sudo privileges, and for some OSWorld tasks, the agent needs the password to obtain sudo privileges to complete them.

How to setup the account and credentials for Google and Google Drive?

See Setup Guideline - Google Account Setup.

What should I do if Google Drive tasks fail to initialize properly?

OSWorld contains 8 Google Drive-related tasks that may encounter setup issues during initialization due to various factors:

Common Issues:

IP address changes causing authentication problems
Network restrictions or firewalls
Google API rate limiting or access restrictions
Regional availability limitations

Solutions:
Option 1 - Manual Configuration: Manually troubleshoot and configure these 8 tasks to complete the full 369-task evaluation.
Option 2 - Task Exclusion: Exclude these 8 tasks and run the remaining 361 tasks - this is officially permitted and acceptable for benchmark evaluation.

Both approaches are valid for research comparison and leaderboard submission. Please specify which approach you used when reporting your results.

How can I configure a proxy for the VM (if I'm behind the GFW, or I don't want some of my tasks to be identified as bot and get lower scores)?

See Setup Guideline - Proxy Configuration.
We also provide a pre-configured solution based on DataImpulse, please refer to proxy setup section.

BibTeX

@misc{OSWorld,
      title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
      author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
      year={2024},
      eprint={2404.07972},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}