Concept-oriented model:
Modeling and processing data using functions
Alexandr Savinov
http://conceptoriented.org
17.11.2019
ABSTRACT
We describe a new logical data model, called the concept-oriented model (COM). It uses mathematical functions as first-class constructs for data representation and data processing, as opposed to using exclusively sets as in conventional set-oriented models. Functions and function composition are used as the primary semantic units for describing data connectivity, instead of relations and relation composition (join), respectively. Grouping and aggregation are also performed by means of (accumulate) functions, providing an alternative to group-by and reduce operations. The model was implemented in an open-source data processing toolkit, examples of which are used to illustrate the model and its operations. The main benefit of this model is that typical data processing tasks become simpler and more natural when using functions rather than sets and set operations.
KEYWORDS
Logical data models; Functional data models; Data processing
1 Introduction
1.1 Who Is to Blame?
Most of the currently existing data models, query languages and data processing frameworks, including SQL and MapReduce, use mathematical sets for data representation and set operations for data transformations. They describe a data processing task as a graph of operations with sets. Deriving new data means producing new sets from existing sets, where sets can be implemented as relational tables, collections, key-value maps, data frames or similar structures.
However, many conventional data processing patterns describe a data processing task as deriving new properties rather than new sets, where properties can be implemented as columns, attributes, fields or similar constructs. If properties are represented via mathematical functions, then functions become the main units of data representation and transformation. Below we describe several typical tasks and show that solving them by means of set operations is a problem-solution mismatch, which makes data modeling and data processing less natural, more complex and error-prone.
Figure 1: Example data model. The Items table (columns ProductId, Quantity, Price) and the Products table (columns Id, Price), with derived columns (Amount, TotalQ, TotalA) produced by the calculate, link and aggregate operations.
Calculated attributes. Assume that there is a table with
order Items characterized by Quantity and Price
attributes (Fig. 1, left). The task is to compute a new
attribute Amount as their arithmetic product. A solution in
SQL is almost obvious:
SELECT *, Quantity * Price AS Amount
FROM Items                                   (1)
Although this standard solution seems very natural and almost trivial, it has one subtle flaw: the task was to compute a new attribute, while this query produces a new table. The question then is why not do exactly what was requested and produce a new attribute? Why is it necessary to produce a new table (with a new attribute) if we actually want to attach a new attribute to the existing table? The short answer is that an operation for adding new (derived) attributes simply does not exist. We have no choice and must adopt what is available: a set operation.
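The column-oriented alternative can be sketched in pandas, whose data frames allow a derived column to be attached to an existing table. This is only an illustration with hypothetical data values, not the COM toolkit's actual API:

```python
import pandas as pd

# Hypothetical Items data (Fig. 1, left); the values are illustrative.
items = pd.DataFrame({"Quantity": [2, 3], "Price": [10.0, 5.0]})

# Column-oriented counterpart of query (1): attach the derived
# attribute Amount to the existing table instead of building a new one.
items["Amount"] = items["Quantity"] * items["Price"]

print(items["Amount"].tolist())  # [20.0, 15.0]
```

No new table is produced here; the original Items table simply gains one derived attribute, which is exactly what the task asked for.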
Link attributes. Another generic data processing pattern consists in computing links (or references) between tables: given a record in one table, how can we access attributes of related records in another table? For example, assume that Price is an attribute of a second Products table (Fig. 1, right) and does not exist as an attribute of the Items table. We have two tables, Items and Products, whose records are related via the attributes ProductId and Id, respectively. If we now want to compute the Amount for each item, then the price needs to be retrieved from the second Products table. This task can be easily solved by copying the necessary attributes into a new table using a relational (left) join:
SELECT i.*, p.Price
FROM Items i
JOIN Products p
ON i.ProductId = p.Id                        (2)
This new result table has the necessary attributes Quantity and Price copied from the two source tables, and hence it can be used for computing the amount. Yet, let us again compare this solution with the problem formulation. Do we really need a new table? No. Our goal was to be able to access attributes of the second Products table (while computing a new attribute in the first Items table). Hence, this again can be viewed as a workaround, a forced solution where a new (unnecessary) table is produced just because it is the only way to access related data in a set-oriented model.
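A column-oriented treatment of the same link can again be sketched in pandas (hypothetical data, not the COM toolkit's API): each ProductId is resolved to its Price and the result is attached as a new column of Items, so no intermediate joined table appears:

```python
import pandas as pd

# Hypothetical data for the two tables in Fig. 1 (right).
items = pd.DataFrame({"ProductId": [1, 2, 1], "Quantity": [2, 3, 1]})
products = pd.DataFrame({"Id": [1, 2], "Price": [10.0, 5.0]})

# Column-oriented link: resolve each ProductId to the corresponding
# Price in Products and attach it as a new column of Items.
items["Price"] = items["ProductId"].map(products.set_index("Id")["Price"])
items["Amount"] = items["Quantity"] * items["Price"]

print(items["Amount"].tolist())  # [20.0, 15.0, 10.0]
```

The link behaves like a function from Items records to Products records: it is evaluated per record and materialized as a column, rather than as a new joined set.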
Aggregated attributes. The next typical data processing task is data aggregation. Assume that, for each product in Products, we want to compute the total number of items ordered (Fig. 1). The group-by operation provides a standard solution:
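In pandas, for example, the grouped sum can be computed and then attached as the derived attribute TotalQ of the existing Products table. Again, the data values are hypothetical and this is only a sketch, not the COM toolkit's actual API:

```python
import pandas as pd

# Hypothetical data; TotalQ is the derived attribute from Fig. 1.
items = pd.DataFrame({"ProductId": [1, 2, 1], "Quantity": [2, 3, 1]})
products = pd.DataFrame({"Price": [10.0, 5.0]},
                        index=pd.Index([1, 2], name="Id"))

# Group-by yields a per-product total; index alignment lets us attach
# it as a column of the existing Products table rather than keeping it
# as a separate grouped table.
totals = items.groupby("ProductId")["Quantity"].sum()
products["TotalQ"] = totals

print(products["TotalQ"].tolist())  # [3, 3]
```

Here the aggregate plays the role of an (accumulate) function defined on Products: the grouped result is consumed as a new attribute, not as a new set.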