23 June, 2025
Rosterfy is an end-to-end volunteer management software solution that helps organisations streamline their volunteer management and engagement.
The company was born out of Event Workforce (now known as Spark Event Group), a staffing company created to help connect university students with work opportunities across Australia. As Event Workforce grew, the need for an online platform to manage volunteer teams became clear, and from this need, Rosterfy was born.
Today, Rosterfy works with over 3 million volunteers in 35 countries. Their market-leading technology helps create an engaging experience throughout the entire volunteer journey. They work with a broad range of clients, big and small, to streamline volunteer management. In 2024, Rosterfy celebrated their inclusion in the AFR BOSS Most Innovative Companies list, ranking in the top 10 companies for technology.
Like many start-up and scale-up organisations, Rosterfy are interested in improving their application observability. Having invested in observability tooling for specific niches within their application, like custom exception alerting developed in-house, exception stack trace reporting using BugSnag, and a comprehensive dashboard that derives metrics from consolidated application logs, Rosterfy sought out third-party guidance to achieve better observability maturity.
Midnyte City agreed to conduct a review of Rosterfy’s current application observability, with the goal of identifying actionable areas of uplift and developing a roadmap to guide Rosterfy’s journey into better application observability.
What is Rosterfy doing well?
Rosterfy’s exception reporting is particularly advanced. It’s important to know when your application is throwing exceptions, as it’s a great indicator that users are having a bad time.
Consolidating their logs into a single environment within AWS is also a mature approach to application observability, and the dashboard built on this consolidated logging in AWS CloudWatch provides excellent insight into the application's performance, and by extension the user experience.
What exists, that could be improved upon?
Structured Logging
Uplifting from plain string-based logging to structured logging provides immediate gains in how easily telemetry data can be understood. Any good observability backend lets users query structured logs across any arbitrary field, so insights can be derived from queries spanning many logs, instead of reading individual log lines one at a time (or awkwardly deriving multi-log insights with fragile string-matching techniques).
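As a minimal sketch of the difference (in Python with only the standard library, not Rosterfy's actual stack), the same event can be logged as a plain string or as a JSON object whose fields an observability backend can query directly:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object instead of a plain string."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Any fields passed via `extra=` become queryable attributes.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)


logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Plain string logging: insights require fragile string matching.
logger.info("volunteer 123 checked in to shift 456")

# Structured logging: the same event, but every field can be queried directly.
logger.info("volunteer checked in", extra={"fields": {"volunteer_id": 123, "shift_id": 456}})
```

The names and fields here are illustrative, but the pattern is the same in any language: once every log line is a structured object, a query like "all check-ins for this volunteer" no longer depends on the exact wording of the message.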
Rosterfy were already using structured logging across most of their application, so an easy win was to uplift the remaining logs to structured logging, permitting consistent querying across all telemetry data.
Logical transaction correlation using hard context
More powerful telemetry insights can be derived from logs that are grouped into logical transactions, a logical transaction being the set of application events connected to a single trigger point, like a user request or a scheduled cron job.
Correlating structured logs with their logical transaction can be achieved by injecting hard context (e.g. a "transaction_id") into one of their fields. This adds another dimension for querying telemetry data: if an exception occurred in module X, what else was happening in the application during that logical transaction? What was so exceptional that it caused the exception?
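One way to inject that hard context, sketched below in Python (the language and helper names are illustrative, not Rosterfy's implementation), is to generate a transaction_id at each trigger point and stamp it onto every log record emitted while that transaction is being handled:

```python
import contextvars
import logging
import uuid

# Hard context for the current logical transaction, visible to every log
# emitted while handling the same request or cron job.
transaction_id = contextvars.ContextVar("transaction_id", default=None)


class TransactionFilter(logging.Filter):
    """Stamp every record with the current transaction_id."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


logger = logging.getLogger("app")
logger.addFilter(TransactionFilter())


def handle_request(payload):
    # Each trigger point (request, cron job, queue consumer) opens a new
    # logical transaction before doing any work.
    transaction_id.set(str(uuid.uuid4()))
    logger.info("request received")
    # ... every subsequent log in this call chain carries the same id,
    # so an exception here can be queried alongside its surrounding events.
    logger.error("something exceptional happened")
```

Combined with a JSON formatter like the one sketched earlier, every log line then carries the same transaction_id, so filtering on that single field reconstructs the whole logical transaction around an exception.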
Correlating telemetry data of all kinds into logical transactions is a key feature of the OpenTelemetry project. Depending on the logging solution in use across an application, and the architecture of that application, manually connecting structured logs with injected hard context may prove more difficult than simply implementing OpenTelemetry and letting it achieve the same result.
Uplift Alerting
The gold standard for alerting is for a given alert to represent an actionable risk to business outcomes, with a prepared set of instructions on exactly how to respond to that alert. If alerts are misconfigured and too sensitive, you risk alert fatigue, where engineers learn to ignore them: when everything is an alert, nothing is.
Rosterfy had several alerts configured on key infrastructure metrics, like CPU Utilisation and SQS message age, that are indicators of business outcomes being at risk. Reviewing the configuration of these alerts from a “gold standard” perspective would improve their utility to the business by minimising alert fatigue and encouraging a culture of observability.
Alert playbooks are a useful tool not just for responding to an alert, but for guiding the creation of alerts in the first place. In a highly mature organisation, any engineer can respond to any alert using only the alert playbook and the available application telemetry data. While this is a high bar to reach, aiming for it will ensure alerts are configured appropriately, uplift telemetry data, and disseminate observability knowledge across an organisation, helping create a high-performing culture of observability.
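As an illustration of what a "gold standard" alert might look like in an AWS environment like Rosterfy's, the sketch below configures a CloudWatch alarm on SQS message age using boto3. The queue name, thresholds, SNS topic, and playbook URL are all assumptions made for the example; the points to note are that the alarm only fires after a sustained breach, and that its description links straight to the playbook for whoever gets paged:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative alarm: pages only when messages have been waiting long enough
# to threaten a business outcome, for long enough that it is not transient noise.
cloudwatch.put_metric_alarm(
    AlarmName="volunteer-notifications-queue-backlog",
    AlarmDescription="Playbook: https://wiki.example.com/playbooks/queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "volunteer-notifications"}],
    Statistic="Maximum",
    Period=60,              # evaluate every minute...
    EvaluationPeriods=10,   # ...but only alarm after 10 consecutive breaches
    Threshold=300,          # oldest message older than 5 minutes
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-southeast-2:123456789012:on-call"],
)
```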
What new solutions could be beneficial?
The key new solution that was recommended to Rosterfy was to implement OpenTelemetry instrumentation into their application.
OpenTelemetry is an open-source framework for holistic application observability. It provides standards for connecting telemetry data in useful ways using hard and soft context; for emitting, transforming, and collecting that telemetry data consistently; and a protocol for transmitting telemetry data to any observability backend of your choice (or multiple backends!).
Telemetry data under the "Observability 1.0 Three Pillars Model" is divided into Logs, Traces, and Metrics. The OpenTelemetry framework collects and emits all three, but emphasises connecting them into a single source of truth. When all of your telemetry data is in one place and has been correlated through the injection of hard and soft context, observability tools (backends) can query across all three "pillars" to provide insights that you simply can't get when your telemetry data is siloed. This single-source-of-truth analysis is the key mindset shift described under "Observability 2.0".
Implementing OpenTelemetry instrumentation in their application would allow Rosterfy to fully leverage the telemetry analysis tools of their observability backend of choice, AWS CloudWatch, including Logs Insights for querying structured telemetry data and X-Ray for viewing logical transactions as trace waterfalls.
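A minimal sketch of what that instrumentation can look like, using the OpenTelemetry Python SDK (the service name, function, and console exporter are illustrative; a real deployment would use the SDK for the application's own language and export to a collector that forwards traces to the chosen backend):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider. In production the ConsoleSpanExporter would be
# replaced with an OTLP exporter pointed at a collector (e.g. the AWS Distro
# for OpenTelemetry collector, which can forward traces to X-Ray).
provider = TracerProvider(resource=Resource.create({"service.name": "volunteer-api"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def register_volunteer(volunteer_id: int) -> None:
    # Each span automatically carries hard context (trace and span ids), and
    # soft context (attributes) describing what the application was doing.
    with tracer.start_as_current_span("register_volunteer") as span:
        span.set_attribute("volunteer.id", volunteer_id)
        with tracer.start_as_current_span("save_to_database"):
            ...  # nested spans appear as steps in the trace waterfall


register_volunteer(42)
```

Each `start_as_current_span` call becomes a step in the trace waterfall, and the attributes set on a span become the soft context that later queries can slice by.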
Application signal-based alerting
Metrics like CPU utilisation and memory utilisation are good indicators of machine happiness, but they are lagging indicators of customer happiness and business outcomes.
Once advanced instrumentation has been implemented in an application, it becomes possible to derive metrics that are better models of customer happiness and business outcomes. One example of such a derived metric is HTTP response times: a delay of 500ms can lead to a 20% drop in traffic, and that was in 2006! Collecting this information and deriving metrics to gauge application performance is a powerful tool for aligning engineering efforts with business outcomes.
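A sketch of how such a metric could be collected with the OpenTelemetry metrics API (Python here for illustration; the exporter, metric name, and route labels are assumptions):

```python
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Console exporter for illustration; in practice metrics would flow to the
# chosen backend (e.g. CloudWatch) via a collector.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter(__name__)

# A histogram of response times lets percentiles (p50/p95/p99) be derived,
# which track customer experience far more closely than CPU utilisation does.
response_time_ms = meter.create_histogram(
    name="http.server.duration",
    unit="ms",
    description="HTTP server response time",
)


def handle(request_route: str) -> None:
    start = time.monotonic()
    # ... handle the request ...
    elapsed_ms = (time.monotonic() - start) * 1000
    response_time_ms.record(elapsed_ms, {"http.route": request_route})
```

From a histogram like this, the backend can derive per-route response time percentiles, which are much closer proxies for customer happiness than a CPU graph.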
Setting SLOs
SLOs allow engineering teams to measure their ongoing development performance against business outcomes and prioritise work accordingly. Using the HTTP response time metric as an example, an SLO could be defined as "within a rolling 7-day window, 99% of application traffic will have a response time of under 500ms". Response times exceeding 500ms would then burn down the allowed error budget, indicating when engineers should stop feature work and prioritise addressing tech debt in the platform, aligning their work with business outcomes.
It also becomes possible to configure alerts against these kinds of derived metrics. If an update is deployed which suddenly causes HTTP response times to spike, and we burn a big chunk of our SLO budget in a short period, it makes sense to alert the engineering team, because immediate action needs to be taken to halt the burn-down.
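The arithmetic behind that burn-down and the fast-burn alert is simple enough to sketch directly (all numbers below are invented for illustration):

```python
# Illustrative error-budget arithmetic for the SLO in the example above:
# "within a rolling 7-day window, 99% of requests respond in under 500ms".

slo_target = 0.99
window_requests = 10_000_000                         # requests seen in the 7-day window
error_budget = (1 - slo_target) * window_requests    # 100,000 slow requests allowed

slow_requests = 64_000                               # requests over 500ms so far this window
budget_remaining = 1 - slow_requests / error_budget  # 36% of the budget left

# Fast-burn alert: if the last hour alone consumed more than, say, 5% of the
# whole window's budget, a recent deploy is probably responsible and the
# engineering team should be paged.
slow_requests_last_hour = 9_000
hourly_burn = slow_requests_last_hour / error_budget
if hourly_burn > 0.05:
    print(f"SLO fast burn: {hourly_burn:.0%} of the 7-day error budget in one hour")
```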
The key artifact produced through the engagement was a roadmap to guide Rosterfy’s next steps towards more mature application observability, along with technical documentation describing why and how existing solutions should be uplifted. It also contained recommendations on how an uplift ties together with new solutions and lays the groundwork for developing an organisation-wide culture of observability.
The foundational principles behind every recommendation are:
Correlate telemetry data as much as possible to maximise its utility.
Leverage tooling to programmatically derive insights from telemetry data.
Tie the development cycle to business outcomes using application signal-derived metrics that closely model customer happiness.
Developing a culture of observability is an ongoing process within any organisation; even the most mature organisations should regularly review their application observability to ensure it is meeting the business' needs. Alerts, SLOs, dashboards, and instrumentation are not one-and-dones: if something is no longer meeting your needs, delete it and iterate to the next most useful thing. Making constant incremental improvements is how any high-performing organisation became a high-performing organisation.
If you think your organisation could benefit from a guiding hand on what’s next for observability, reach out to Midnyte City today for a review.
If you would like to speak to someone about similar challenges in your team or organisation, reach out below to schedule a time.