Instaclustr, a provider of fully managed solutions for scalable open source technologies, today announced it has successfully created an anomaly detection application capable of processing and vetting real-time events at a uniquely massive scale of 19 billion events per day by leveraging open source Apache Cassandra, Apache Kafka and Kubernetes container orchestration. Instaclustr built the application as an example of the scalability achievable with its Managed Platform and has made both detailed design information and the source code publicly available.
Anomaly detection is the identification of unusual events within an event stream, which often indicate fraudulent activity, security threats or, more generally, a deviation from the expected norm. Because recognizing such anomalies is integral to the integrity and security of critical business and customer data, anomaly detection applications are widely deployed across numerous industries and use cases, including financial fraud detection, IT security intrusion and threat detection, website user analytics and digital ad fraud, IoT systems and beyond. Anomaly detection applications typically compare inspected streaming data with historical event patterns, raising alerts if those patterns match previously recognized anomalies or show significant deviations from normal behavior. These detection systems utilize a stack of solutions that often includes machine learning, statistical analysis, and algorithm optimization, and that leverages data-layer technologies to ingest, process, analyze, disseminate, and store streaming data.
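To make the idea of comparing incoming events with historical patterns concrete, the following is a minimal Python sketch. It assumes a simple rolling z-score rule, chosen here purely for illustration; this announcement does not describe the detection algorithm Instaclustr used, and the class name, window size and threshold are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Illustrative detector: flags a value that deviates strongly
    from a sliding window of recent values (hypothetical design)."""

    def __init__(self, window_size=50, threshold=3.0):
        # Only the most recent `window_size` values are kept as history.
        self.history = deque(maxlen=window_size)
        self.threshold = threshold

    def check(self, value):
        """Return True if `value` is anomalous relative to the history,
        then record it so later checks can compare against it."""
        is_anomaly = False
        if len(self.history) >= 2:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            # With zero spread there is no scale to measure deviation against.
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        self.history.append(value)
        return is_anomaly
```

A production system would typically keep one such history per key (for example, per account or device) and fetch it from a database rather than from process memory.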
However, there are significant challenges in designing an architecture capable of detecting anomalies in high-scale environments where the volume of daily events reaches into the millions or billions. In these scenarios, data-layer technologies must meet substantial computational, performance and scalability requirements in order to cope with the massive scale of events.
To showcase how powerful the open source data-layer technologies delivered through its fully managed platform can be for processing massive real-time event streams, Instaclustr’s engineering team built a streaming data pipeline application able to overcome the hurdles of mass-scale anomaly detection. To do so, the team paired the NoSQL Cassandra database and the Kafka streaming platform with application code hosted in Kubernetes to create an architecture with the scalability, performance and cost-effectiveness required for the solution to be viable in real-world scenarios.
“Our anomaly detection solution showcases how critical applications can scale – colossally – using expertly-optimized Kafka and Cassandra in their fully open source form,” said Ben Slater, Chief Product Officer, Instaclustr. “We welcome enterprises across industries interested in knowing how Kafka and Cassandra can be leveraged to meet the data scale requirements within their own applications to get in touch, whether you’re building a real-time anomaly detection application or any other solution.”
Cassandra and Kafka are not only performant and scalable but also naturally complementary technologies. Kafka supports fast, scalable ingestion of streaming data, and its store-and-forward design provides a buffer that prevents Cassandra from being overwhelmed by large data spikes. Cassandra then serves as a linearly scalable, write-optimized database ideal for storing high-velocity streaming data. In the successful experiment, Instaclustr combined Kafka, Cassandra and the anomaly detection application in a Lambda architecture, with Kafka as the speed layer and Cassandra as the batch and serving layer. Instaclustr’s solution also utilized Kubernetes on AWS EKS to automate the provisioning, deployment, and scaling of the application. Proceeding with an incremental development approach, Instaclustr carefully monitored, debugged, tuned and retuned specific functions within the pipeline to optimize its capabilities. The result: an anomaly detection application able to process 19 billion real-time events per day and detect anomalies in those events.
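For a sense of scale, the short Python sketch below converts the 19 billion events per day figure into an average per-second rate and illustrates, in a deliberately simplified way, how a store-and-forward buffer absorbs a spike. The function names and the burst numbers are illustrative assumptions, not measurements from Instaclustr’s experiment, and the simulation is a toy model rather than a description of how Kafka works internally.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day):
    """Average sustained rate implied by a daily event count."""
    return events_per_day / SECONDS_PER_DAY


def simulate_backlog(arrivals_per_tick, drain_per_tick):
    """Toy store-and-forward model: events arrive in bursts, while the
    downstream store drains at most `drain_per_tick` events per tick.
    Returns the buffered backlog remaining after each tick."""
    backlog = 0
    history = []
    for arrived in arrivals_per_tick:
        backlog += arrived
        backlog -= min(backlog, drain_per_tick)
        history.append(backlog)
    return history


# 19 billion events per day is roughly 220,000 events per second on average.
rate = average_events_per_second(19_000_000_000)
print(f"{rate:,.0f} events/second")

# Hypothetical burst: the buffer grows during the spike, then drains,
# so the downstream store never sees more than 200 events per tick.
print(simulate_backlog([100, 500, 100, 100, 100], drain_per_tick=200))
```

In this toy run the backlog peaks during the spike and returns to zero a few ticks later, which is the buffering behavior the paragraph above attributes to Kafka sitting in front of Cassandra.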
“Apache Cassandra and Apache Kafka each hold a well-earned reputation for their ability to deliver high data performance in mass-scale use cases, as is thoroughly demonstrated by Instaclustr’s new anomaly detection data pipeline,” said James Curtis, Senior Analyst, Data, AI, and Analytics at 451 Research. “Through this successful experiment, Instaclustr again showcases the vast potential of these open source technologies, which organizations can take full advantage of through Instaclustr’s managed platform.”