Alerting Experience at Chronosphere

From One Page to Two: Designing a Purpose-Built Experience for On-Call Engineers

Designed for engineers wherever they are — Chronosphere's first mobile-friendly experience

Company: Chronosphere, a unicorn-status observability platform acquired by Palo Alto Networks in 2025

My Role: Lead Designer and Researcher

Team: Principal Product Manager · 2 Backend Engineers · 2 Frontend Engineers

My Contribution: User research · Product strategy · Information hierarchy · Interaction design · Data density system · Design system contribution · Responsive design

Status: Early Access — designed and validated with real users, awaiting full rollout


Context — Two Personas, One Broken Page


In observability platforms like Chronosphere, there are two distinct types of users:


Makers — SRE admins who set up and maintain monitors. Their job is to configure what to watch, refine thresholds based on data, and optimize to reduce noise. Their work happens during peacetime — when systems are stable and there's time to think deliberately.

Responders — on-call engineers who get paged when something breaks. Their job is to investigate, mitigate, and remediate — fast. Every minute of confusion costs the business. Their work happens during wartime — under pressure, with no room for noise.


Before this project, Chronosphere had one page serving both personas. That's the problem we set out to solve.



Problem


The monitor page was built for Makers. But when Responders got paged, they landed there too — confronted with configuration tools designed for peacetime, not investigation.


One example: an attribute-based monitor selector that Responders had no use for during an incident — and couldn't figure out how to reset. The wrong tool, at the wrong moment, for the wrong persona.

There were no loud complaints — because there was nothing better to compare it to. But the latent need was clear: Responders deserved their own purpose-built experience.


How might we design a dedicated alert experience that gives Responders exactly what they need — and nothing they don't?


Before: one page, two personas, one compromise — all data shown is fictional


My Role — and How We Got Unstuck


I co-led the research with our shared researcher, defined the information hierarchy, and made key design decisions independently while collaborating closely with engineering to shape the solution.


This project wasn't without friction. A significant conflict emerged between the PM and lead backend engineer on product direction. Rather than letting competing opinions stall progress, I used research findings as neutral common ground — reframing the decision around user needs and data. That shift unblocked the team and kept the project moving.


Research — Understanding the Responder's World


I conducted interviews with on-call engineers and shadowed them during active on-call shifts — observing how alert investigation actually works under real pressure, not how people remember it afterward.


Key insights:

  • Responders don't read — they scan. Visual hierarchy isn't a nice-to-have, it's survival

  • Context is everything. Without related signals surfaced automatically, engineers piece together what happened manually — adding precious minutes to every incident

  • Configuration tools create cognitive noise during investigation: Responders need the page to hide everything that isn't relevant to the current incident

  • Change events are critical — recent deploys and configuration changes are the first thing engineers look for when an alert fires


The Most Important Design Decision — Information Hierarchy


The foundational decision was defining what is primary data versus metadata — and how each should appear on the page. This forced the team to align on what an engineer actually needs to see first, second, and never.


I chose to organize the page around the event — the start and end of an incident — rather than the system's data structure. A deliberate shift from the system's perspective to the engineer's mental model.


From this decision, I developed a new data density specification and introduced it into Chronosphere's design system — so the principle scaled consistently across the entire alerting experience, not just this one page.


Zone mapping used to align the team on what Responders see first, second, and never.

Constraints


Technical: The solution had to work within an existing alert manager architecture inherited from Grafana and remain compatible with open source standards. Certain data structures and alert behaviors couldn't be changed — the design had to work with the system, not around it.


Resource: A small frontend and backend team meant design decisions needed to be precise and well-justified from the start. No room for expensive pivots late in the build.


Design System: No existing pattern for data-dense interfaces existed in Chronosphere's design system. Rather than treating this as a blocker, I used it as an opportunity — developing a new data density specification that became a reusable system-level pattern for the entire product.


Key Design Solutions


  • Event-centered layout: Reorganized the page around the alert event — start time, end time, and all context needed to debug it — rather than raw system data

  • Contextual information on one page: Eliminated the need to navigate away by surfacing related signals, change events, and alert history in a single view

  • Redesigned data density: Introduced new visual hierarchy specs into the design system to ensure consistent treatment of primary data vs. metadata across the platform

What Changed — For the Engineer


The new experience consolidates everything an on-call engineer needs into a single page — event timeline, related context, change events, and alert history — eliminating the multi-click navigation the old experience required. The page now speaks the engineer's language: what happened, when, and what's related. Not the system's language.


The alert page was also Chronosphere's first responsive experience — designed to work on mobile browsers, recognizing that on-call engineers are often paged away from their desks.


Validation


Survey results — 5 EA customers · Usability study — 8 internal users:


Alert details page — design used in usability study with 8 internal users
  • 100% of users rated the new experience as same or better than the previous experience

  • 40% rated it as meaningfully better

  • Change events rated as most valuable by 100% of users — directly validating the core design hypothesis of surfacing contextual signals on one page

  • Status and Signal rated valuable by 80% of users — validating the information hierarchy decisions


"Nice that each alert is separated from the monitor — creates a snapshot perspective." — EA user
"Definitely better overall." — EA user
"I like how it's more focused on the 'event' and tries its best to get all the context to debug the alert." — Internal user
"I really like that it feels like I get the same information all on the same page that would have taken me multiple clicks to navigate before." — Internal user

FullStory engagement data — past 30 days:

  • 23.9K pageviews — heavily used page

  • 100% median scroll depth: the typical visitor scrolls through the entire page

  • 4m 29s average time on page — sustained, meaningful engagement

What We Learned — Iteration Opportunities


Survey data also surfaced clear priorities for the next iteration:

  • SLO information was rated least valuable by 80% of users during active incidents — engineers are focused on root cause, not SLO context, when firefighting. This section needs to be deprioritized or moved post-incident

  • Runbook link needs to be more prominent — multiple users referenced it as the most actionable element on the page

  • Query should be visible by default — currently requires an extra step that slows down investigation

  • Related firing alerts — users want to see other alerts that might share the same root cause


These findings directly informed the roadmap for the next iteration and confirmed that the foundational design decisions landed: users are now asking for refinements, not fundamental changes.

Reflection


Designing for high-stakes, time-pressured users taught me that great UX is about ruthless prioritization — deciding what not to show is as important as deciding what to surface. Separating Makers and Responders into distinct experiences wasn't just a product strategy decision. It was the design principle that made everything else possible.


The clearest learning: 80% of users found SLO information least valuable during active incidents. In retrospect, that's obvious — Responders in wartime mode are focused on root cause, not SLO context. I'd have tested that assumption earlier rather than discovering it through post-launch validation. Terminology is another thing I'd treat as a first-class design problem from day one — the Events confusion could have been caught in research.



Lastly, the Maker/Responder framework, in which two personas with fundamentally different needs share one surface, describes a universal enterprise design challenge. It shows up in observability. It shows up wherever complex data meets human decisions under pressure.

© 2025 by Soraya Nukunkit
