Alert View
Redesigning a core feature to be more actionable and visually appealing
Summary
The alert view in SignalFx is one of the most widely used parts of the app; it’s the entry into the product when someone is on call, and is supposed to provide answers to “What’s going wrong?” “How concerned should I be?” and “How do I fix it?”
However, in its existing state, users found it more confusing than helpful. This project aimed to eliminate extraneous information, better communicate what happened, and give the user clear calls to action for troubleshooting the problem.
Cory Watson, Technical Director in the Office of the CTO at SignalFx, walks the audience at Monitorama through the new alert view and how it will improve users’ day-to-day.
Problem
In its existing state, users typically experienced the following flow: they wake up to a PagerDuty alert at 3am. They groggily click through to the SignalFx app and see a modal which purports to explain the problem. They see a message comprising some dimensions and values, a chart, and a list of what might be signals (though the descriptors are not intuitive).
A few factors make it difficult to understand what’s going on:
The modal title doesn’t necessarily make sense (it depends on user configuration), and it’s not clear what it’s referring to
The view begins with an unformatted plain-text message. The blob of text is long and hard to parse, contains redundant information, and uses names that aren’t self-evident. The important information is buried: critical details are collapsed in the message, the chart sits lower in the hierarchy, and the option to visualize the plot in an interactive chart is below the fold.
It’s not usually clear what to do next or how to handle the problem if:
The detector creator didn’t fully configure the tip (piece of text suggesting what to do) or runbook (a link to a set of instructions)
The user is a junior alert responder, or less familiar with the service / app / being on call
Goals
Solve the 3am case: a user receives an alert at a very inconvenient time. The UI has to be self-explanatory so it can be easily parsed while the user is not fully alert, and it has to offer clear actions they can take to investigate and resolve the problem.
Make it easy to understand what happened, whether the user should care, and what to do next.
Make the alert state very clear - if the alert is still active and the user should care, that should be self-evident. If the alert has cleared or been muted and is no longer relevant, that should also be obvious.
Use this highly touched feature as a way to start introducing the new design system and enhancing the app’s visual appeal.
Personas
Two of our user personas mostly use the app for alert responding. The Junior Oncaller, Alex, is not particularly familiar with the app or troubleshooting in general. We wanted to optimize for the user who would have the least knowledge and expertise around alert responding, and make a potentially stressful experience easier. The Senior Oncaller, James, has significant experience in dealing with infrastructure issues and has developed workarounds for the alert view tool. We wanted to simplify his flow and give him shortcuts to the troubleshooting destinations he prefers.
Redesign process
We started by talking with users to understand what their goals were and how they used the alert view. Based on the information we gathered, we decided which new content should be introduced, what information we should maintain from the existing alert modal, and which unhelpful content could be clarified or eliminated.
We distilled the research to the main questions users brought with them, and iterated through ways to answer them:
“What’s happening?”
We added a templatized header so that, even without user configuration, it’s clear what kind of event the user is looking at. State and severity are conveyed with text, then visually reinforced by the fill and color of the header background.
{event type: trigger/clear} {alert severity} Alert - {alert status:active/inactive}
The red background on the Active Critical Alert header gives it a sense of urgency.
The blue background of this informational alert is eye catching enough to grab the user’s attention without being worrisome.
The empty white background, green underline, and collapsed chart visualization indicate that this clear event is indeed nothing to be worried about.
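The header template and the three styling examples above can be sketched roughly as follows. This is a minimal illustration, not the SignalFx implementation: the `AlertEvent` type, the "Triggered"/"Cleared" wording, the style keys, and the fallback for severities the text does not describe are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    """Hypothetical event record; field names are assumptions."""
    event_type: str  # "trigger" or "clear"
    severity: str    # e.g. "Critical", "Info"
    active: bool     # whether the alert is still active


def header_title(event: AlertEvent) -> str:
    """Fill in the template: {event type} {severity} Alert - {status}."""
    # The display words for each event type are assumed, not from the source.
    event_label = {"trigger": "Triggered", "clear": "Cleared"}[event.event_type]
    status = "Active" if event.active else "Inactive"
    return f"{event_label} {event.severity} Alert - {status}"


def header_style(event: AlertEvent) -> dict:
    """Pick header styling from the three cases described in the text."""
    if event.event_type == "clear":
        # Clear events: empty white background, green underline, collapsed chart.
        return {"background": "white", "underline": "green", "chart_collapsed": True}
    # Only these two severity colors are mentioned in the text.
    backgrounds = {"Critical": "red", "Info": "blue"}
    return {
        # "unspecified" is a placeholder for severities the text does not cover.
        "background": backgrounds.get(event.severity, "unspecified"),
        "underline": None,
        "chart_collapsed": False,
    }


critical = AlertEvent(event_type="trigger", severity="Critical", active=True)
print(header_title(critical), header_style(critical))
```

Deriving both the title and the style from the same event fields is what lets the header stay meaningful even when the detector creator supplied no custom title.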
We moved the chart to the top of the page, so the user can immediately see the signal’s behavior, rather than read text attempting to explain it.
The old view: scattered information, difficult to understand plot details, no way to get context, alarming red code text for no reason.
The new view: information is grouped logically, chart details are labeled and rewritten to be human readable, and the why is obvious.
To provide context, we titled the chart with the plot label and condition, and a colored dot to indicate that the condition has been satisfied (a pattern the user is already familiar with from other parts of the app.)
We moved the event details to be co-located with the chart, and made them more human-readable.
We replaced the static chart with an interactive one so the user can pan through time and better understand the signal’s behavior.
“How real/bad is this?”
We introduced a detail view that is always at native data resolution, so the user can see the raw data and why the condition was triggered.
We highlight the exact value and time on the chart that the alert triggered.
We introduced a time picker so the user can zoom out/pan through time and get context on the moment in time which triggered the alert.
Customized header styling to convey alert severity and state:
“How do I fix it?”
Next Steps section, broken down into 3 use cases:
Next Steps: For the junior alert responder who wants to follow a set of instructions a coworker wrote to guide them through what to do.
Explore Further: For the experienced alert responder, who may want to explore more contextual data in order to troubleshoot the issue. (e.g. look at a linked dashboard to understand the upstream problem, analyze traces from that time range to find the root cause, view content that was linked to the alert dimensions)
Manage Alert: For when the user has determined the alert is not relevant, or when the detector is flappy and they don’t want to keep being pinged while resolving the problem.
New User Flow
The new alert view encourages users through the following flow:
While in the midst of something else, the engineer on call receives a PagerDuty notification that a SignalFx alert has been triggered.
They click on the link in PagerDuty, and are led to the SignalFx alert view.
Once there, they understand what the event is about, how severe the alert is, and whether it is “real” or not, and they gain some insight into the issue (e.g. which host was experiencing memory pressure, which realm the issue occurred in).
They have clear steps forward to handle the issue:
Follow instructions included by the detector creator (Open a runbook or follow a tip)
Look at contextual data in order to pin down the source of the problem (apply analytics to the metric in a new chart, view traces, analyze linked dashboards, view content linked to the alert’s dimensions)
Dismiss the alert (Mute or resolve)
Visual Design
Though this project was much more than a visual change, the new design system for SignalFx instigated the redesign. Some visual component-level improvements you’ll see here:
Buttons satisfy the WCAG 2.0 level AA color contrast standard
Switching from a light grey to pure white background allows the text to contrast more and be more readable
Colored headers to indicate state
Buttons and input fields have rounded corners, rendering the UI more approachable
Customizable message is broken up into sections with headers, using markdown to differentiate units of content
Font style is pared down and made more consistent. Rather than two typefaces and four text colors, we streamlined down to one typeface and two colors: #333 for plain text, and #19855C for links.
We updated the typeface from Open Sans and Menlo to Splunk Data Sans. This typeface gives the page a new look, is more compact without losing legibility, and communicates numbers more clearly (which, for a data-driven, number-rich app, was very important to us).
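As a rough way to check colors like the ones listed above against the WCAG 2.0 level AA contrast standard, the sketch below computes the contrast ratio using the relative-luminance formula from the WCAG 2.0 definition. The function names are hypothetical, and pairing the two text colors with a pure white (#FFFFFF) background is an assumption based on the background change described above; the 4.5:1 threshold is the AA minimum for normal-size text.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.0 formula)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a #RGB or #RRGGBB color."""
    h = hex_color.lstrip("#")
    if len(h) == 3:
        # Expand shorthand such as "333" to "333333".
        h = "".join(ch * 2 for ch in h)
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # WCAG 2.0 AA minimum for normal-size text

# Assumed pairing: the two text colors on a pure white background.
for name, color in (("plain text", "#333"), ("links", "#19855C")):
    ratio = contrast_ratio(color, "#FFFFFF")
    print(f"{name} {color}: {ratio:.2f}:1, passes AA: {ratio >= AA_NORMAL_TEXT}")
```

Running a check like this during design review makes it easier to catch a color that only narrowly clears the threshold before it ships.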
End result
A UI that concisely communicates what went wrong and whether the user should care, and guides them through issue resolution
Defaults which teach the user how to configure their detectors in order to end up with helpful alert views
A clear jumping off place for users to investigate the event
A new, rich chart view that responders can use to troubleshoot, and share with teammates for collaboration
Next Steps / Room to grow
An incident view that brings the trigger and clear events together