Our client’s product offering is a multi-lingual, culturally aware survey platform running outbound campaigns on behalf of large health plans. The platform supports bidirectional SMS communication with the health plans’ patient population to gather critical feedback via surveys. There is also an automated reporting function, which produces a daily report and delivers it to their clients to provide aggregate statistics on the patient survey results and campaign status.
Our client enjoyed exponential growth as their pool of plan members increased several orders of magnitude, from tens of thousands to over 1 million members. However, this surge in volume caused system reliability issues, and their systems began to fail in unpredictable and persistent ways.
Several batch data jobs, on which their daily reporting process relied, became progressively sluggish and resource-intensive. As the reporting query execution times increased, so did the load on the client’s transactional databases, resulting in disruption to their internal web applications. Despite efforts to vertically scale their database resources and shard client data across multiple instances, many mission-critical functions continued to suffer from degraded performance, and their engineering team was overwhelmed by the additional system complexity and persistent instability.
Our solution to solving the client’s scaling issue was elegant in its simplicity:
- Optimize inefficient reporting logic by refactoring sequential queries to utilize bulk transactions. Moving from a pattern of continuous sequential updates to a process that involved queueing and batch processing of patient survey response data vastly reduced the strain on the database.
- Modernize the data architecture to partition transactional and analytical workloads, leveraging Postgres’ logical replication to consolidate sharded client data into a single data warehouse, in real-time. In doing so, the load placed on the system for reporting was isolated from the load associated with running campaigns and supporting normal business transactions via their web apps. This approach was also an incremental step towards re-architecting their system to facilitate horizontal scaling.
- Integrate Looker, a dashboarding and reporting tool, on top of their newly created data warehouse solution, which included several benefits:
- Native support for automating the execution and delivery of reports, allowing them to vastly simplify portions of their code base, reduce surface area for bugs, and develop more robust reporting functionality.
- Incremental shift towards a self-service model for internal datasets, eliminating the risk of disruption to the production systems.
After implementing our solution, the client’s reports consistently execute in seconds, rather than hours or days.
The new reports are able to be executed during regular business hours without disruption, even while active campaigns are running. Moreover, the modified system design now scales to support an additional order of magnitude in population size, buying the client’s internal technical team enough time to undergo a more significant re-architecture effort.