Agentic Evaluations At Scale For - Detailed Analysis & Overview

On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other. The harness they run inside shifts ...
Shishir Patil, a Research Scientist at Meta, delivered a presentation on AI agents and their ...
This lecture discusses the critical shift from evaluating static LLMs to complex AI agents that take action. It explores the vital role of ...
As agents evolve from text conversations to autonomous agents capable of multi-step reasoning, tool use, and real-world task ...
Join Mahesh Yadav, top Maven instructor and former AI PM leader at Google, Meta, and Microsoft. In this session, Mahesh breaks ...

Recorded at the Advanced Track of n8n Builders Berlin, this talk features JP van Oosten, who leads the AI team at n8n, explaining ...
Turning AI agents into reliable, production-ready tools that deliver tangible business results requires more than just great models.
This video introduces a new series on testing AI agents, focusing on why traditional ...
Alex Ratner co-founded Snorkel AI out of Chris Ré's Stanford lab and helped establish data-centric AI as a field. Today, Snorkel is ...

Building AI agents is one thing, but ensuring they are reliable, auditable, trustworthy, and properly evaluated is what truly ...
Evaluating AI agents in 2025 goes beyond simply checking outputs. As agents take on multi-step, autonomous workflows, ...
In this episode of Chain of Thought, Brad Kenstler (Head of Agent Capabilities and Environments) sits down with ...
Today's episode takes us inside three very different frontiers of AI: production-grade agent ...

Photo Gallery

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind
Agentic Evals by Shishir Patil
LLM as a Judge: Scaling AI Evaluation Strategies
Agent Evaluation & Benchmarks - Agentic AI MOOC 2025 Lecture 4 Summary
Agentic Evaluations Workshop - Deep Dive on the Future on Evals for Agents.
How to set Evaluation for AI Agents & Scale them
Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation
AI and Agent Observability in Azure AI Foundry and Azure Monitor | BRK168
Evaluations in Agentic Workflows - n8n Builders Berlin (Live Demo)
Generative vs Agentic AI: Shaping the Future of AI Collaboration
How AI Engineers Improve Agentic Products
Ensure AI Agents Work: Evaluation Frameworks for Scaling Success — Aparna Dhinakaran, CEO Arize