Technical report: CSVM dictionaries
CSVM (CSV with Metadata) is a simple file format for tabular data. Its application domain is the same as that of typical spreadsheet files, but CSVM is also well suited to long-term storage and to the inter-conversion of raw data. CSVM embeds separate levels for data, metadata, and annotations in human-readable, flat ASCII files. As a proof of concept, Perl and Python toolkits were designed to handle CSVM data and objects in workflows. These parsers process CSVM files independently of data types, so the same data format and parser can be used for many scientific purposes. CSVM-1 is the first version of the CSVM specification. This paper presents an extension of CSVM-1 that implements a translation system between CSVM files. The data needed to perform the translation are themselves coded in another CSVM file. This particular kind of CSVM file is called a CSVM dictionary; it is readable by the current CSVM parser and fully supported by the Python toolkit. This report presents a proposal for CSVM dictionaries, a working example in chemistry, and some elements of the Python toolkit that can be used to handle these files.
💡 Research Summary
The paper introduces CSVM (CSV with Metadata), a lightweight extension of the traditional comma‑separated values format that embeds descriptive metadata directly in the plain‑text file. By prefixing special lines that start with “#” (e.g., #TITLE, #HEADER, #TYPE, #WIDTH, #META), CSVM stores column names, data types, display widths, and optional annotations alongside the data rows. This design preserves the simplicity and universal readability of CSV while giving both humans and programs enough context to interpret the data without external schema files.
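As a rough illustration of the idea, the sketch below splits a CSVM-like text into metadata and data rows. The exact CSVM-1 syntax is not reproduced in this summary, so the tab separator, the function name `parse_csvm`, and the sample content are assumptions made for illustration only, not the toolkit's actual behaviour.

```python
import csv
import io


def parse_csvm(text, delimiter="\t"):
    """Split CSVM-like text into (metadata, rows).

    Assumption: a metadata line starts with '#', followed by a keyword
    (TITLE, HEADER, ...) and delimiter-separated values.
    """
    metadata = {}
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith("#"):
            key = fields[0][1:].strip().upper()
            metadata[key] = fields[1:]
        else:
            rows.append(fields)
    return metadata, rows


# Hypothetical sample file content (tab-separated).
sample = (
    "#TITLE\tWater samples\n"
    "#HEADER\tSampleID\tConcentration(mg/L)\tpH\n"
    "#TYPE\tTEXT\tNUM\tNUM\n"
    "S1\t12.5\t7.1\n"
    "S2\t3.0\t6.8\n"
)

metadata, rows = parse_csvm(sample)
print(metadata["HEADER"])
print(rows[0])
```

Keeping metadata in a dictionary keyed by keyword means a caller can look up column names or types without knowing where the corresponding line sits in the file.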
Building on the first version of the specification, CSVM‑1, the authors propose a new construct called a CSVM dictionary. A dictionary is itself a CSVM file that contains two header sections: one describing the source file’s column identifiers, units, and types, and another describing the target file’s identifiers and units. Between these sections a mapping table defines how each source column should be transformed into a target column, optionally including a conversion expression written as a short Python (or Perl) lambda. In effect, the dictionary acts as a “translation sheet” that tells a generic CSVM parser how to rename columns, change units, and apply simple calculations during import or export.
The paper details the implementation of parsers in both Perl and Python. The Python toolkit, built around a module named csvm, provides three core capabilities: load() to read any CSVM file and separate metadata from data rows, translate(dict_file, src_file, dst_file) to apply a dictionary to a source file and write a transformed target file, and utility methods for accessing column metadata programmatically. The Perl module CSVM::Parser offers analogous capabilities, demonstrating that the dictionary concept is language‑agnostic as long as the parser respects the CSVM syntax.
A concrete chemistry example illustrates the workflow. An experimental results file contains columns “SampleID”, “Concentration(mg/L)”, and “pH”. An international standard requires the columns to be named “ID”, “Conc(µg/mL)”, and “pHvalue”, with concentration expressed in micrograms per milliliter. Because 1 mg/L equals 1 µg/mL, the concentration values stay numerically unchanged and only need to be parsed as numbers. The dictionary file lists the source header, the target header, and a mapping block such as:
Concentration(mg/L) -> Conc(µg/mL) : lambda x: float(x)
pH -> pHvalue : lambda x: float(x)
When the Python translate function processes the source file together with this dictionary, it automatically renames the columns, converts the numeric values, and writes a new CSVM file that complies with the target specification. Because the dictionary itself follows the CSVM format, it can be version‑controlled, shared, and parsed by any CSVM‑aware tool without additional configuration.
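The translation step can be sketched in a few lines. This is not the toolkit's translate function: the helper names `parse_rule` and `translate_rows`, the rule-line layout, and the expression-free rule for SampleID are assumptions for illustration. Since 1 mg/L is numerically equal to 1 µg/mL, the concentration rule here only parses the value as a float.

```python
def parse_rule(line):
    """Parse 'source -> target : expression' into (source, target, func)."""
    names, _, expr = line.partition(" : ")
    source, _, target = names.partition(" -> ")
    # eval() executes arbitrary code, so this is only acceptable for
    # dictionaries from a trusted origin.
    func = eval(expr) if expr.strip() else (lambda x: x)
    return source.strip(), target.strip(), func


def translate_rows(header, rows, rule_lines):
    """Rename columns and convert values according to the rule lines."""
    rules = [parse_rule(line) for line in rule_lines]
    index = {name: i for i, name in enumerate(header)}
    new_header = [target for _, target, _ in rules]
    new_rows = [
        [func(row[index[source]]) for source, _, func in rules]
        for row in rows
    ]
    return new_header, new_rows


# Hypothetical source data matching the chemistry example.
header = ["SampleID", "Concentration(mg/L)", "pH"]
rows = [["S1", "12.5", "7.1"], ["S2", "3.0", "6.8"]]
rule_lines = [
    "SampleID -> ID",  # assumed form: rename only, no conversion
    "Concentration(mg/L) -> Conc(µg/mL) : lambda x: float(x)",
    "pH -> pHvalue : lambda x: float(x)",
]

new_header, new_rows = translate_rows(header, rows, rule_lines)
print(new_header)
print(new_rows[0])
```

Evaluating expressions stored in a data file is convenient but unsafe for untrusted input; a production parser would more plausibly restrict rules to a whitelist of named conversions.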
The authors argue that CSVM dictionaries provide three major benefits. First, standardization: a single dictionary can map raw laboratory data to multiple external standards, enabling seamless data exchange across institutions. Second, long‑term preservation: embedding metadata inside the file eliminates the need for external schema repositories, ensuring that future users can reconstruct the original meaning of the data. Third, workflow automation: pipelines can accept a dictionary as a parameter and perform deterministic transformations without custom scripting for each new target format.
Limitations are also acknowledged. The free‑form nature of metadata lines can lead to inconsistencies unless a style guide is adopted. More complex transformations—such as conditional logic, merging several columns into one, or non‑linear unit conversions—cannot be expressed with the simple lambda syntax and would require supplemental scripts. Finally, the current parsers load the entire file into memory, which is acceptable for modest data sets but would need streaming support for gigabyte‑scale files.
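One way the memory limitation might be addressed is with generators that process one row at a time. The sketch below is an assumption about how streaming could look, not part of the existing toolkits; the names `iter_csvm_rows` and `convert_column` and the tab-separated layout are hypothetical.

```python
import csv
import io


def iter_csvm_rows(handle, delimiter="\t"):
    """Yield data rows one at a time, skipping '#' metadata lines."""
    for fields in csv.reader(handle, delimiter=delimiter):
        if fields and not fields[0].startswith("#"):
            yield fields


def convert_column(rows, column, func):
    """Lazily apply func to one column of each row."""
    for row in rows:
        row = list(row)  # copy so the input row is not mutated
        row[column] = func(row[column])
        yield row


# An in-memory stand-in for a large file opened with open(path, newline="").
source = io.StringIO("#HEADER\tSampleID\tpH\nS1\t7.1\nS2\t6.8\n")
stream = convert_column(iter_csvm_rows(source), 1, float)
first = next(stream)  # only the first data row has been read so far
print(first)
```

Because each stage is a generator, memory use stays roughly constant regardless of file size, at the cost of losing random access to rows.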
In conclusion, the paper presents CSVM dictionaries as a pragmatic, extensible mechanism for embedding translation logic directly within a human‑readable, flat‑file format. By coupling metadata, data, and conversion rules in a single, parser‑friendly artifact, CSVM dictionaries facilitate data interoperability, reproducibility, and archival stability across diverse scientific domains. Future work suggested includes formalizing dictionary syntax, adding richer expression capabilities, and implementing memory‑efficient streaming parsers to broaden applicability to large‑scale datasets.