A Big Data Architecture for Log Data Storage and Analysis
We propose an architecture for analysing database connection logs across different instances of databases within an intranet comprising over 10,000 users and associated devices. Our system uses Flume agents to send notifications to a Hadoop Distributed File System for long-term storage and Elasticsearch and Kibana for short-term visualisation, effectively creating a data lake for the extraction of log data. We adopt machine learning models with an ensemble of approaches to filter and process the indicators within the data and aim to predict anomalies or outliers using feature vectors built from this log data.
Research Summary
The paper presents a comprehensive big‑data architecture designed to collect, store, visualize, and analyze database connection logs generated across CERN’s extensive intranet, which serves more than 10,000 users and thousands of devices. The primary motivation is to provide a scalable, secure, and centralized repository that can ingest heterogeneous audit data (listener, alert, and OS logs) from multiple Oracle database instances and enable automated anomaly detection for security‑related events such as unauthorized connections, credential misuse, and malware‑driven traffic spikes.
Data ingestion is handled by Apache Flume agents deployed on each Oracle instance. Flume’s flexible source plugins (Avro, Thrift, file) capture both successful and failed connection events and forward them to a central Flume collector. From there, the stream is bifurcated: a long‑term copy is written to Hadoop Distributed File System (HDFS) forming a JSON‑based data lake, while a short‑term copy is indexed in Elasticsearch. Kibana (and optionally Grafana) provides real‑time dashboards for operators, allowing immediate visual inspection of connection patterns, traffic volumes, and potential outliers. The use of HDFS ensures durability and cheap bulk storage, whereas Elasticsearch offers low‑latency search and aggregation capabilities essential for interactive monitoring.
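The fan-out described above maps naturally onto a standard Flume 1.x agent definition: one source feeding two channels, each drained by a different sink (Flume's default replicating channel selector copies every event to all listed channels). The sketch below is illustrative only; the log path, HDFS URI, and collector hostname are assumptions, not values from the paper.

```properties
# Hypothetical Flume agent: tail an Oracle listener log and bifurcate
# events to HDFS (long-term lake) and an Avro hop toward Elasticsearch.
a1.sources  = listener
a1.channels = ch-hdfs ch-es
a1.sinks    = hdfs-sink es-sink

# Source: tail the listener log (path is illustrative).
a1.sources.listener.type = exec
a1.sources.listener.command = tail -F /u01/app/oracle/diag/tnslsnr/dbhost/listener/trace/listener.log
a1.sources.listener.channels = ch-hdfs ch-es

# Durable file channel for the archival copy; memory channel for the fast path.
a1.channels.ch-hdfs.type = file
a1.channels.ch-es.type   = memory

# Long-term copy into the HDFS data lake, partitioned by day.
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.channel = ch-hdfs
a1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/datalake/dblogs/%Y-%m-%d
a1.sinks.hdfs-sink.hdfs.fileType = DataStream

# Short-term copy forwarded to the central collector for Elasticsearch indexing.
a1.sinks.es-sink.type = avro
a1.sinks.es-sink.channel = ch-es
a1.sinks.es-sink.hostname = flume-collector.example.org
a1.sinks.es-sink.port = 4141
```

Using a file channel on the HDFS leg trades latency for durability, which suits the archival copy, while the memory channel keeps the monitoring path low-latency.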
Feature engineering transforms raw log fields (timestamp, client_program, client_host, client_ip, client_port, client_protocol, client_user, etc.) into a structured feature vector. Numerical fields are normalized, categorical fields are one-hot encoded, and temporal windows are defined to preserve sequence information. This vector serves as the input for a suite of unsupervised and semi-supervised machine learning models. The authors employ an ensemble of five algorithms: K-Nearest Neighbours (distance-based), K-Means clustering (centroid-based), Isolation Forest (tree-based isolation), Local Outlier Factor (local density deviation), and One-Class Support Vector Machine (boundary-based). Each model independently flags a percentage of records as anomalous; the reported outlier rates range from 2% (K-NN, One-Class SVM) to 5% (LOF). By combining the results (only records flagged by a majority of models are escalated), the system reduces false positives while retaining high recall for truly suspicious activities.
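A minimal sketch of the five-model majority vote, using scikit-learn stand-ins for the detectors named above. This is not the paper's code: the contamination rates, cluster count, neighbour counts, and the `ensemble_outliers` helper are all assumptions chosen for illustration.

```python
# Sketch: five anomaly detectors vote per record; majority-flagged
# records are escalated. Thresholds/hyperparameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def ensemble_outliers(X, rate=0.02, min_votes=3):
    """Return a boolean mask of records flagged by >= min_votes detectors."""
    X = StandardScaler().fit_transform(np.asarray(X, dtype=float))
    votes = np.zeros(X.shape[0], dtype=int)

    # 1. K-NN: distance to the k-th nearest neighbour; the largest
    #    distances (top `rate` fraction) are treated as outliers.
    dist, _ = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X)
    knn_score = dist[:, -1]
    votes += knn_score >= np.quantile(knn_score, 1 - rate)

    # 2. K-Means: distance to the nearest centroid as an outlier score.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    km_score = np.min(km.transform(X), axis=1)
    votes += km_score >= np.quantile(km_score, 1 - rate)

    # 3. Isolation Forest (tree-based isolation).
    votes += IsolationForest(contamination=rate, random_state=0).fit_predict(X) == -1

    # 4. Local Outlier Factor (the summary reports ~5% flagged here).
    votes += LocalOutlierFactor(contamination=0.05).fit_predict(X) == -1

    # 5. One-Class SVM (boundary-based; nu bounds the outlier fraction).
    votes += OneClassSVM(nu=rate).fit_predict(X) == -1

    # Escalate only records flagged by a majority of the five models.
    return votes >= min_votes
```

Because each detector alone flags 2-5% of records, requiring three concurring votes sharply shrinks the escalated set, which is the false-positive reduction the summary describes.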
Performance testing demonstrates that Flume introduces an average transmission latency of less than 150 ms per log event, and Elasticsearch‑Kibana dashboards refresh within five seconds, satisfying near‑real‑time monitoring requirements. Storage engine tuning (HDFS block size, replication factor) is shown to be critical for maintaining throughput and ensuring rapid change propagation across the pipeline. The authors also note that Spark can be layered on top of the HDFS lake for batch analytics or streaming jobs, and Spark SQL provides a familiar query interface for analysts.
From a security perspective, the architecture enables early detection of network intrusion attempts, repeated connection bursts indicative of compromised hosts, and credential abuse. By correlating connection metadata with temporal patterns, the system can highlight abnormal usage trends that would otherwise be hidden in massive log volumes. The paper concludes that the combination of Hadoop’s robust data‑warehousing capabilities, Flume’s reliable ingestion, and Elasticsearch’s fast search creates a solid foundation for large‑scale log analytics. Future work is outlined to include advanced time‑series forecasting models (e.g., LSTM, Prophet), automated response mechanisms (SOAR integration), and multimodal analysis that fuses log data with system metrics and user behavior profiles to further improve detection accuracy and operational response.