AWS re:Invent 2025 Recap
AWS re:Invent 2025 revealed three converging trends: multi-agent AI systems, design-time observability, and cost optimization during development.
I’ve attended AWS re:Invent for a few years now and it’s become my go-to event of the year for a snapshot of the tech landscape. It has a mix of everything, from early-stage startups working on cutting-edge technologies to incumbent enterprises trying to prove they’re still relevant. Obviously, AWS makes sure its products, partners, and messaging take center stage, but that’s not where the interesting parts of the conference are.
Disclaimer: I’m interested in exploring the infrastructure, observability, developer experience, and AI tooling space, so this article covers only those topics. It’s in no way generic coverage of the event. Also, I’m not paid by any company I mention in this post.
My pre-conference strategy is to do a few things:
Shortlist companies that are relevant to my interests
Mark the days I’ll spend at the Expo hall and schedule booth visits
Prepare a list of questions to ask their reps
This includes pricing, tech deep dives, integrations, limits, and anything else that might impact a vendor selection decision later on.
This saves significant time later if we do decide to explore new tools in a given category.
Secure ISV event invites via AWS Account Managers
These events provide an amazing opportunity to learn what’s happening on the ground at other companies. For example, one of the events I attended this time was on AI pricing strategies.
Shortlist vendor-sponsored events/dinners I want to attend. Once you’ve completed registration, many vendors start reaching out with invites to dinners, poker nights, lunches, brunches, breakfasts, golf, and concerts to schmooze you.
These are hit or miss depending on who else attends. The vendor execs will be there to understand your use cases and follow up later (or sooner, based on your title and company size).
TL;DR
Shift-left dominated AWS re:Invent 2025. Not the DevOps shift-left we’ve been doing for a decade though. The pattern showed up everywhere: AI agents validating code during development, observability designed into systems before deployment, incident management done by agents, cost optimization baked into CI/CD pipelines. Three domains converging around the same idea: move decisions earlier in the development lifecycle.
AI Architecture Goes Multi-Agent
The evolution from RAG to multi-agent systems is real. AWS made a big push with Kiro, which promises to catch problems early in the development cycle. AI agents now validate work during development, not just execute tasks. Companies like EdgeDelta introduced a Slack-like chat interface directly in their product, where the “users” are configurable agents that work together on the task the user provides.
This changes architecture decisions. Engineering leaders need to plan for multi-agent systems, not monolithic AI. The question isn’t “should we use AI?” It’s “where can agents validate our work?” Code review, testing, and deployment checks are the natural starting points.
For Platform Engineers: Evaluate agent orchestration frameworks now. The infrastructure for multi-agent coordination is becoming a first-class platform requirement.
For Product Engineers: Identify repeatable validation tasks that agents can handle. Start with low-risk, high-volume work.
AI SRE steals the show
After years of “AI native everything,” many companies have realized they have the data to solve recurring SRE problems.
Resolve.ai handles alerts, performs root cause analysis, and troubleshoots incidents within minutes. EdgeDelta promises 100% data visibility with AI-powered observability automation. But these platforms only work when your data architecture allows it.
Traversal provides a nice chat interface to find, debug, and manage issues and incidents. Same with Incident.io, which helps you triage alerts using AI, reducing the time SREs spend on this.
PagerDuty’s AI Triage Agent and Grafana’s SLO-first approach signal a fundamental shift: observability is moving left to design time. Define SLOs during sprint planning, not after production incidents.
But here’s the hard part: buying incident response agents while keeping siloed logging, metrics, and traces is pointless. Agents can’t do root cause analysis if they can’t correlate data across systems. The organizational change matters more than the tool selection.
Companies must give agents autonomy in low-risk areas. Build incremental confidence. Start with incidents that don’t require human judgment: service restarts, cache clearing, scaling operations. Let agents handle these autonomously while humans monitor. Gradually expand the autonomy zones as confidence builds.
The critical requirement is to make your data agent-consumable. Break down silos in incident management. Create clear access controls so agents can access logs, metrics, traces, and infrastructure state across systems. Without this, agents remain glorified alert routers. These systems invariably ask for access to all the tools an SRE already uses (which seems natural), so be prepared to ask pointed questions about permissions and data compliance before signing the check.
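To make the incremental-autonomy idea concrete, here is a minimal sketch of what an autonomy-zone gate might look like. The action names, the allowlist, and the escalation path are all hypothetical, not any vendor’s API:

```python
# Sketch of an "autonomy zone" gate for an incident-response agent.
# The allowlist, action names, and escalation path are illustrative only.

LOW_RISK_ACTIONS = {"restart_service", "clear_cache", "scale_out"}
audit_log: list[tuple[str, str, str]] = []

def handle_remediation(action: str, target: str) -> str:
    """Execute allowlisted low-risk actions autonomously; escalate the rest."""
    if action in LOW_RISK_ACTIONS:
        # Inside the autonomy zone: act, but leave an audit trail for humans.
        audit_log.append((action, target, "autonomous"))
        return f"executed {action} on {target}"
    # Outside the zone: page a human instead of acting.
    audit_log.append((action, target, "escalated"))
    return f"escalated {action} on {target} to on-call"
```

The point of the audit trail is to build the “incremental confidence” mentioned above: once humans have reviewed enough autonomous runs in a category, the allowlist can grow.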
For Engineering Leaders: Identify low-risk incident types where agents can respond autonomously. Audit your data architecture for agent consumption. Can an agent access and correlate data across your observability stack?
For Platform Engineers: Build agent-accessible APIs for logs, metrics, and traces. This isn’t just about human-readable dashboards anymore. Can you use existing tools like Claude Code or custom agents to solve some of these problems? Experiment before you buy.
For Product Engineers: Define SLOs during design phase. Make observability requirements explicit before writing code. This helps the Platform team think about what’s important from a business perspective.
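One lightweight way to make SLOs explicit before code is written is to keep them as a small spec next to the service and validate the spec in CI. A minimal sketch, assuming a hypothetical spec shape and thresholds (none of this is a standard format):

```python
# Hypothetical SLO spec kept alongside the service code and checked in CI.
SLO_SPEC = {
    "service": "checkout-api",
    "availability": 0.999,          # fraction of successful requests
    "latency_p99_ms": 300,          # 99th percentile latency budget
    "error_budget_window_days": 30,
}

def validate_slo(spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec is usable."""
    problems = []
    if not spec.get("service"):
        problems.append("service name is required")
    if not 0 < spec.get("availability", 0) <= 1:
        problems.append("availability must be a fraction in (0, 1]")
    if spec.get("latency_p99_ms", 0) <= 0:
        problems.append("latency_p99_ms must be positive")
    return problems
```

A CI job that fails when `validate_slo` returns problems forces the SLO conversation to happen at design review, which is exactly the shift-left this section describes.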
Cost Optimization Shifts Into Development
Resource optimization dominated FinOps messaging this year. Last year, most FinOps companies focused on passive financial planning mechanisms (savings plans, RIs) to save costs. This year, they’re building real-time mechanisms.
CloudZero’s real-time cost anomaly detection and nOps’ Kubernetes cost management treat cost as a first-class metric alongside performance. The pattern: developers see cost impact during PR review, not quarterly reports.
ScaleOps does resource optimization for Kubernetes clusters using dynamic bin-packing and resource reallocation.
This requires FinOps to shift left. Integrate cost feedback into development workflows. Make cost visible in CI/CD pipelines. When engineers see “$0.23 per deployment” in their pull request, they make different architectural decisions.
For Platform Engineers: Add cost visibility to CI/CD pipelines. Make cost data accessible during development, not just in monthly reviews. Products like Firefly.ai provide mechanisms for this, but you can build it in-house, at least for the first few versions. Claude Code can estimate the cost impact of PRs with Terraform files based on publicly available pricing data, which is good enough for a baseline.
For Product Engineers: Include cost considerations in architectural decisions from day one. “This endpoint costs $X at Y scale” should be visible during design review.
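As a sketch of what an in-house first version could look like: parse the JSON that `terraform show -json plan.out` emits for a PR, map resource types to a hand-maintained monthly price table, and post the total as a PR comment. The price map here is illustrative, not real AWS pricing; a real version would pull from the AWS Price List API or a maintained price sheet:

```python
import json

# Illustrative monthly prices per resource type (NOT real AWS prices).
MONTHLY_PRICE_USD = {
    "aws_instance": 60.0,
    "aws_db_instance": 120.0,
    "aws_s3_bucket": 2.0,
}

def estimate_plan_cost(plan_json: str) -> float:
    """Sum a rough monthly cost for resources a Terraform plan will create."""
    plan = json.loads(plan_json)
    total = 0.0
    for change in plan.get("resource_changes", []):
        # Only count resources being created; ignore no-ops, updates, deletes.
        if "create" in change["change"]["actions"]:
            total += MONTHLY_PRICE_USD.get(change["type"], 0.0)
    return total

# Example fragment in the shape of Terraform's JSON plan representation.
sample_plan = json.dumps({
    "resource_changes": [
        {"type": "aws_instance", "change": {"actions": ["create"]}},
        {"type": "aws_s3_bucket", "change": {"actions": ["create"]}},
        {"type": "aws_instance", "change": {"actions": ["no-op"]}},
    ]
})
print(f"Estimated monthly delta: ${estimate_plan_cost(sample_plan):.2f}")
# prints: Estimated monthly delta: $62.00
```

Even a crude number like this, surfaced in the PR, changes the conversation: the "$0.23 per deployment" effect mentioned above doesn’t require precise pricing, just a visible baseline.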
Changes needed
It’s not about buying tools. Organizations and teams need to think holistically about both the problems and the processes. Successful teams are actively iterating on their processes to streamline communication paths, create new patterns, and weave AI into their day-to-day workflows.
Traditional organizations keep approval-heavy deployment processes while buying AI agents. This means agents can’t deploy autonomously if deployments require three approval gates. They keep siloed data while buying incident response agents. Agents can’t correlate if data lives in disconnected systems. They keep quarterly cost reviews while buying optimization tools. Developers need feedback loops during development, not retrospectives.
Tools must augment new organizational structures and processes, not just automate old ones. Define low-risk autonomy zones for agents. Build incremental confidence with measurable boundaries. Make organizational data accessible to agents with proper access controls.
The companies that win will change their processes alongside their tooling. The ones that lose will buy the same tools but keep the same processes.
Action Items by Persona
Engineering Leaders:
Map out low-risk areas for agent autonomy (service restarts, scaling operations, cache clearing)
Audit data architecture for agent-consumability across observability, deployment, and cost systems
Start one shift-left pilot in Q1 2026 (pick one domain: AI validation, observability-first design, or cost feedback)
Platform Engineers:
Evaluate agent orchestration frameworks for multi-agent coordination
Build agent-accessible APIs for logs, metrics, traces, and infrastructure state
Add cost visibility to CI/CD pipelines (cost per deployment, cost per feature flag, cost per service)
Product Engineers:
Define SLOs during sprint planning, before writing code
Include cost considerations in architectural decisions (document “$X at Y scale” in design reviews)
Identify validation tasks suitable for AI agents (code review checks, test generation, deployment validation)
Miscellaneous
AWS announcements lacked any surprises. A bunch of new features in Bedrock. A push to use Kiro. Customer testimonials. The usual. Bedrock is becoming more usable as a real AI platform, which is good if you’re limited to the AWS ecosystem.
AWS needs to replace the team that comes up with swag ideas. Just saying.
I was surprised to see a massive presence from Anthropic (didn’t think they needed it). They had sessions on using Claude and Claude Code in various scenarios. Full house.
Meta is really trying to make money. WhatsApp Business had a big booth. Go figure!
Always amazed to see a giant GitHub booth. They always have something innovative going on there. Can’t complain, they do have cool stickers.
Logistics tips
Vegas in December is cold. Pack accordingly. Hotel room thermostats may be capped as well.
Traffic is brutal. F1 (Las Vegas Grand Prix) happens the weekend before re:Invent. Add 30 minutes to every commute estimate, or plan ahead and get a hotel within walking distance of the venue.
Coffee and food: the AWS kiosk in the center of the Expo hall has free coffee (average), tea, and hot water. Most coffee near the venue is objectively bad. Bouchon cafe near the main Venetian entrance serves good coffee and pastries for breakfast, but expect long lines. Most Venetian cafes get booked for private events, so scout your go-to place ahead of time if you plan to eat outside of lunch hours.
Networking tip: Ask your account managers 2+ months ahead for executive session invites on architecture, GTM, and pricing. These sessions provide deeper technical content than general keynotes.
If you’re navigating shift-left adoption or building agent-consumable data platforms, I’d love to hear what’s working (and what’s not). The organizational change is harder than the technology, so let’s compare notes.