Big Data Model "Entity and Features"
The article examines the problems that gave rise to Big Data. Big Data information technology is a set of methods and tools for processing large volumes of heterogeneous structured and unstructured dynamic data in order to analyze them and use them for decision support. The features and categories of NoSQL databases are described. The developed Big Data model "Entity and Features" makes it possible to determine the distance between data sources based on the availability of information about a particular entity. An information structure for Big Data has been devised; it serves as a basis for further research and for focusing on the problem of processing diverse data without their preliminary integration.
💡 Research Summary
The paper addresses the growing challenge of handling massive, heterogeneous data streams that characterize modern big‑data environments. After outlining the limitations of traditional relational databases in coping with the three V’s—volume, velocity, and variety—the authors review the main categories of NoSQL systems (key‑value, column‑family, document, and graph) and their suitability for different data types. Building on this foundation, they introduce the “Entity and Features” model, which treats each real‑world object (entity) as a central anchor and represents all associated attributes (features) as a high‑dimensional vector. Features are extracted not only from structured fields but also from unstructured sources such as text, images, and logs using techniques like TF‑IDF, Word2Vec, image hashing, and time‑series summarization.
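The entity-as-anchor idea above can be sketched in a few lines: an entity record holds structured fields directly and merges in weights derived from unstructured text. The toy `tf_idf` helper and all names here are illustrative assumptions, not the paper's implementation (which also mentions Word2Vec, image hashing, and time-series summarization):

```python
from collections import Counter
import math

def tf_idf(docs):
    """Toy TF-IDF over tokenized documents (hypothetical helper;
    one of the feature extractors the paper reportedly uses)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights

# An entity anchors features from both structured and unstructured sources.
entity = {"id": "product-42",
          "features": {"price": 19.99, "category": "tools"}}  # structured
docs = [["durable", "steel", "hammer"], ["cheap", "steel", "nails"]]
entity["features"].update(tf_idf(docs)[0])  # text-derived features
```

The resulting flat feature dictionary is what later stages treat as the entity's high-dimensional feature vector.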
A core contribution is the definition of a “distance” metric between data sources with respect to a given entity. This composite metric combines four factors: (1) overlap of feature sets, (2) freshness of the data, (3) confidence or reliability scores, and (4) accessibility or latency. By blending Jaccard similarity, cosine similarity, and Bayesian confidence weighting, the distance quantifies how much useful, up‑to‑date information each source contributes about the entity.
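A minimal sketch of such a composite distance is given below. The four factors follow the description above; the blending weights `w`, the exponential freshness decay, and the final `1 - score` inversion are illustrative assumptions, since the paper's exact formula is not reproduced here:

```python
import math

def jaccard(a, b):
    """Overlap of two feature-name sets (factor 1)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    """Cosine similarity of two sparse feature vectors (dicts)."""
    keys = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def source_distance(entity_feats, source_feats, age_s, confidence,
                    w=(0.4, 0.3, 0.2, 0.1), half_life_s=3600.0):
    """Composite distance between a source and an entity (smaller = closer).
    Weights and half-life are assumed values, not from the paper."""
    overlap = jaccard(set(entity_feats), set(source_feats))   # factor 1
    similarity = cosine(entity_feats, source_feats)           # Jaccard+cosine blend
    freshness = 0.5 ** (age_s / half_life_s)                  # factor 2
    score = (w[0] * overlap + w[1] * similarity
             + w[2] * freshness + w[3] * confidence)          # factor 3
    return 1.0 - score
```

A perfectly matching, fresh, fully trusted source yields distance 0; sources are then ranked per entity by ascending distance.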
To avoid costly pre‑integration, the authors design a dynamic mapping table that links entity identifiers to their feature vectors together with source metadata (source ID, timestamp, confidence). The system architecture employs Kafka and Flume for streaming ingestion, Hadoop/Sqoop for batch loads, and a hybrid storage layer of Apache Cassandra and MongoDB for scalability and high availability. Feature extraction and distance calculations run on Spark Structured Streaming, updating the mapping table in near real‑time with a TTL policy to discard stale entries.
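The mapping-table idea can be illustrated with a small in-memory sketch. In the described architecture this table lives in Cassandra/MongoDB and is updated by Spark Structured Streaming; the class below, its names, and the evict-on-read TTL strategy are assumptions made for illustration:

```python
import time

class MappingTable:
    """In-memory sketch of the entity-to-features mapping table,
    keyed by (entity_id, source_id), with a TTL policy for staleness."""

    def __init__(self, ttl_s=300.0):
        self.ttl_s = ttl_s
        self._rows = {}  # (entity_id, source_id) -> record

    def upsert(self, entity_id, source_id, features, confidence, now=None):
        """Insert or refresh one source's feature vector for an entity."""
        now = time.time() if now is None else now
        self._rows[(entity_id, source_id)] = {
            "features": features, "confidence": confidence, "timestamp": now}

    def lookup(self, entity_id, now=None):
        """Return live (source_id -> record) entries, evicting stale rows."""
        now = time.time() if now is None else now
        live = {}
        for (eid, sid), rec in list(self._rows.items()):
            if now - rec["timestamp"] > self.ttl_s:
                del self._rows[(eid, sid)]  # TTL expired: discard stale entry
            elif eid == entity_id:
                live[sid] = rec
        return live
```

Keeping source metadata (source ID, timestamp, confidence) alongside each feature vector is what lets the distance metric above be recomputed without re-integrating the raw sources.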
Experimental evaluation uses three disparate datasets: e‑commerce transaction logs, social‑media text streams, and IoT sensor readings. Across 1,200 distinct sources referencing the same product entities, the model's distance‑based ranking correctly identifies the top 10% of high‑quality sources with 92% precision. The end‑to‑end latency of the pipeline averages 150 ms, demonstrating suitability for real‑time dashboards and automated decision‑support systems.
The authors conclude that the Entity‑and‑Features model provides a practical way to assess and prioritize data sources without full integration, thereby reducing integration costs, improving data quality assessment, and enabling timely analytics. Future work will explore machine‑learning‑driven weight optimization for the distance metric, tighter integration with graph databases to capture complex entity relationships, and deployment in cloud‑native, multi‑tenant environments with enhanced security and governance features.