Alerting Experience at Chronosphere
From One Page to Two: Designing a Purpose-Built Experience for On-Call Engineers
Designed for engineers wherever they are — Chronosphere's first mobile-friendly experience
Company: Chronosphere, a unicorn-status observability platform acquired by Palo Alto Networks in 2025
My Role: Lead Designer and Researcher
Team: Principal Product Manager · 2 Backend Engineers · 2 Frontend Engineers
My Contribution: User research · Product strategy · Information hierarchy · Interaction design · Data density system · Design system contribution · Responsive design
Status: Early Access — designed and validated with real users, awaiting full rollout
Context — Two Personas, One Broken Page
In observability platforms like Chronosphere, there are two distinct types of users:
Makers — SRE admins who set up and maintain monitors. Their job is to configure what to watch, refine thresholds based on data, and optimize to reduce noise. Their work happens during peacetime — when systems are stable and there's time to think deliberately.
Responders — on-call engineers who get paged when something breaks. Their job is to investigate, mitigate, and remediate — fast. Every minute of confusion costs the business. Their work happens during wartime — under pressure, with no room for noise.
Before this project, Chronosphere had one page serving both personas. That's the problem we set out to solve.

Problem
The monitor page was built for Makers. But when Responders got paged, they landed there too — confronted with configuration tools designed for peacetime, not investigation.
One example: an attribute-based monitor selector that Responders had no use for during an incident — and couldn't figure out how to reset. The wrong tool, at the wrong moment, for the wrong persona.
There were no loud complaints — because there was nothing better to compare it to. But the latent need was clear: Responders deserved their own purpose-built experience.
How might we design a dedicated alert experience that gives Responders exactly what they need — and nothing they don't?

My Role — and How We Got Unstuck
I co-led the research with our shared researcher, defined the information hierarchy, and made key design decisions independently while collaborating closely with engineering to shape the solution.
This project wasn't without friction. A significant conflict emerged between the PM and lead backend engineer on product direction. Rather than letting competing opinions stall progress, I used research findings as neutral common ground — reframing the decision around user needs and data. That shift unblocked the team and kept the project moving.
Research — Understanding the Responder's World
I conducted interviews with on-call engineers and shadowed them during active on-call shifts — observing how alert investigation actually works under real pressure, not how people remember it afterward.
Key insights:
Responders don't read — they scan. Visual hierarchy isn't a nice-to-have, it's survival
Context is everything. Without related signals surfaced automatically, engineers piece together what happened manually — adding precious minutes to every incident
Configuration tools create cognitive noise during investigation. Responders need the page to hide everything that isn't relevant to the current incident
Change events are critical — recent deploys and configuration changes are the first thing engineers look for when an alert fires
The Most Important Design Decision — Information Hierarchy
The foundational decision was defining what is primary data versus metadata — and how each should appear on the page. This forced the team to align on what an engineer actually needs to see first, second, and never.
I chose to organize the page around the event — the start and end of an incident — rather than the system's data structure. A deliberate shift from the system's perspective to the engineer's mental model.
From this decision, I developed a new data density specification and introduced it into Chronosphere's design system — so the principle scaled consistently across the entire alerting experience, not just this one page.

Constraints
Technical: The solution had to work within an existing alert manager architecture inherited from Grafana and remain compatible with open source standards. Certain data structures and alert behaviors couldn't be changed — the design had to work with the system, not around it.
Resource: A small frontend and backend team meant design decisions needed to be precise and well-justified from the start. No room for expensive pivots late in the build.
Design System: Chronosphere's design system had no pattern for data-dense interfaces. Rather than treating this as a blocker, I used it as an opportunity, developing a new data density specification that became a reusable system-level pattern for the entire product.
Key Design Solutions
Event-centered layout: Reorganized the page around the alert event — start time, end time, and all context needed to debug it — rather than raw system data
Contextual information on one page: Eliminated the need to navigate away by surfacing related signals, change events, and alert history in a single view
Redesigned data density: Introduced new visual hierarchy specs into the design system to ensure consistent treatment of primary data vs. metadata across the platform
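To make the event-centered layout concrete, here is a minimal sketch of how such a page's view model might be shaped. Every name in it (AlertEvent, ChangeEvent, Section, changesNear, buildSections) and the lookback window are hypothetical and chosen for illustration; none of it reflects Chronosphere's actual schema or API. It only shows the idea of anchoring context to the alert's start and end, and ordering primary data ahead of metadata.

```typescript
// Hypothetical view model for an event-centered alert page.
// All names are illustrative assumptions, not Chronosphere's real data model.

type Tier = "primary" | "metadata";

interface ChangeEvent {
  description: string;
  timestamp: number; // epoch milliseconds
}

interface AlertEvent {
  startedAt: number;      // epoch ms when the alert started firing
  endedAt: number | null; // null while the alert is still firing
  status: string;
  signal: string;
  monitorName: string;
  changeEvents: ChangeEvent[];
}

interface Section {
  title: string;
  tier: Tier;
  lines: string[];
}

// Keep change events from a lookback window before the alert started
// until it ended (or until now, if still firing), most recent first.
function changesNear(event: AlertEvent, lookbackMs: number): ChangeEvent[] {
  const windowStart = event.startedAt - lookbackMs;
  const windowEnd = event.endedAt ?? Number.POSITIVE_INFINITY;
  return event.changeEvents
    .filter((c) => c.timestamp >= windowStart && c.timestamp <= windowEnd)
    .sort((a, b) => b.timestamp - a.timestamp);
}

// Build page sections and order them so primary data precedes metadata.
// Array.prototype.sort is stable, so order within a tier is preserved.
function buildSections(event: AlertEvent, lookbackMs: number): Section[] {
  const sections: Section[] = [
    { title: "Monitor", tier: "metadata", lines: [event.monitorName] },
    { title: "Status and signal", tier: "primary", lines: [event.status, event.signal] },
    {
      title: "Change events",
      tier: "primary",
      lines: changesNear(event, lookbackMs).map((c) => c.description),
    },
  ];
  const rank: Record<Tier, number> = { primary: 0, metadata: 1 };
  return [...sections].sort((a, b) => rank[a.tier] - rank[b.tier]);
}
```

In this sketch the tier lives on each section rather than in the rendering code, so the same primary-versus-metadata rule could be reused by other pages without re-deciding the order each time.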
What Changed — For the Engineer
The new experience consolidates everything an on-call engineer needs into a single page — event timeline, related context, change events, and alert history — eliminating the multi-click navigation the old experience required. The page now speaks the engineer's language: what happened, when, and what's related. Not the system's language.
The alert page was also Chronosphere's first responsive experience — designed to work on mobile browsers, recognizing that on-call engineers are often paged away from their desks.
Validation
Survey results — 5 EA customers · Usability study — 8 internal users:

100% of users rated the new experience as same or better than the previous experience
40% rated it as meaningfully better
Change events rated as most valuable by 100% of users — directly validating the core design hypothesis of surfacing contextual signals on one page
Status and Signal rated valuable by 80% of users — validating the information hierarchy decisions

"Nice that each alert is separated from the monitor — creates a snapshot perspective." — EA user
"Definitely better overall." — EA user
"I like how it's more focused on the 'event' and tries its best to get all the context to debug the alert." — Internal user
"I really like that it feels like I get the same information all on the same page that would have taken me multiple clicks to navigate before." — Internal user
FullStory engagement data — past 30 days:
23.9K pageviews — heavily used page
100% median scroll depth: the typical session scrolls through the entire page
4m 29s average time on page — sustained, meaningful engagement
What We Learned — Iteration Opportunities
Survey data also surfaced clear priorities for the next iteration:
SLO information was rated least valuable by 80% of users during active incidents: engineers are focused on root cause, not SLO context, when firefighting. This section should be deprioritized or moved to a post-incident view
Runbook link needs to be more prominent — multiple users referenced it as the most actionable element on the page
Query should be visible by default — currently requires an extra step that slows down investigation
Related firing alerts — users want to see other alerts that might share the same root cause
These findings directly informed the next iteration roadmap. They also suggest that the foundational design decisions landed: users are now asking for refinements, not fundamental changes.
Reflection
Designing for high-stakes, time-pressured users taught me that great UX is about ruthless prioritization — deciding what not to show is as important as deciding what to surface. Separating Makers and Responders into distinct experiences wasn't just a product strategy decision. It was the design principle that made everything else possible.
The clearest learning: 80% of users found SLO information least valuable during active incidents. In retrospect, that's obvious — Responders in wartime mode are focused on root cause, not SLO context. I'd have tested that assumption earlier rather than discovering it through post-launch validation. Terminology is another thing I'd treat as a first-class design problem from day one — the Events confusion could have been caught in research.
Lastly, the Maker/Responder framework describes a universal enterprise design challenge: two personas with fundamentally different needs sharing one surface. It shows up in observability. It shows up wherever complex data meets human decisions under pressure.
