Enabling Lock-Free Concurrent Fine-Grain Access to Massive Distributed Data: Application to Supernovae Detection

Reading time: 6 minutes
...

📝 Original Info

  • Title: Enabling Lock-Free Concurrent Fine-Grain Access to Massive Distributed Data: Application to Supernovae Detection
  • ArXiv ID: 0810.2226
  • Date: 2008-10-14
  • Authors: Bogdan Nicolae, Gabriel Antoniu, Luc Bougé

📝 Abstract

We consider the problem of efficiently managing massive data in a large-scale distributed environment. We consider data strings of size in the order of Terabytes, shared and accessed by concurrent clients. On each individual access, a segment of a string, of the order of Megabytes, is read or modified. Our goal is to provide the clients with efficient fine-grain access to the data string, as concurrently as possible, without locking the string itself. This issue is crucial in the context of applications in the fields of astronomy, databases, data mining and multimedia. We illustrate these requirements with the case of an application for searching supernovae. Our solution relies on distributed, RAM-based data storage, while leveraging a DHT-based, parallel metadata management scheme. The proposed architecture and algorithms have been validated through a software prototype and evaluated in a cluster environment.


📄 Full Content

arXiv:0810.2226v1 [cs.DC] 13 Oct 2008

Enabling Lock-Free Concurrent Fine-Grain Access to Massive Distributed Data: Application to Supernovae Detection

Bogdan Nicolae (University of Rennes 1/IRISA, Campus de Beaulieu, 35042 Rennes cedex, France, Bogdan.Nicolae@inria.fr); Gabriel Antoniu (INRIA/IRISA, Campus de Beaulieu, 35042 Rennes cedex, France, contact: Gabriel.Antoniu@inria.fr); Luc Bougé (ENS Cachan, Brittany Extension/IRISA, Campus Ker Lann, 35170 Bruz, France, Luc.Bouge@bretagne.ens-cachan.fr)

Abstract—We consider the problem of efficiently managing massive data in a large-scale distributed environment. We consider data strings of size in the order of Terabytes, shared and accessed by concurrent clients. On each individual access, a segment of a string, of the order of Megabytes, is read or modified. Our goal is to provide the clients with efficient fine-grain access to the data string, as concurrently as possible, without locking the string itself. This issue is crucial in the context of applications in the fields of astronomy, databases, data mining and multimedia. We illustrate these requirements with the case of an application for searching supernovae. Our solution relies on distributed, RAM-based data storage, while leveraging a DHT-based, parallel metadata management scheme. The proposed architecture and algorithms have been validated through a software prototype and evaluated in a cluster environment.

I. INTRODUCTION

Large scale data management is becoming increasingly important for a wide range of applications, both scientific and industrial: modeling, astronomy, biology, governmental and industrial statistics, etc. All these applications generate huge amounts of data that need to be stored, processed and eventually archived globally. In order to better illustrate these needs, this paper focuses on a real-life astronomy problem: finding supernovae (stellar explosions).
In a typical scenario, a telescope is used to take pictures of the same part of space at regular intervals, usually every month. The corresponding digital images are then compared in an attempt to find variable objects, which might be candidates for supernovae. To confirm that such objects are supernovae, considerable computational effort is necessary in order to distinguish the supernovae themselves from the other variable objects that may be present in the image: this requires analyzing the light curve and spectrum of each potential candidate. To speed up the process of finding supernovae, multiple parts of space should be analyzed concurrently: as there is no dependency between different regions of space, the analysis itself is an embarrassingly parallel problem. The difficulty lies in the massive amount of data that needs to be managed and made available to the machines providing the computational power.

Huge data size. Hundreds of GB of images from various parts of the sky may correspond to a single point in time. Since the analysis requires multiple consecutive images of the same part of the sky, the order of TB is quickly reached.

Global view. Managing independent images manually is cumbersome. Applications that search for supernovae (among others) are much easier to design if a global view of the sky is available: finding the right image at a given time simply translates into accessing the right part of the sky view for that time. Let us consider a very simple abstraction of this problem, in which the view of the sky is a very long string of bytes (blob), obtained by concatenating the images in binary form. Assuming all images have a fixed size, a specific part of the sky is accessible by providing the corresponding offset in the string. A simple transformation from two-dimensional to unidimensional coordinates is sufficient.

Efficient fine-grain access. While many images make up the global view of the sky, each of them needs to be accessed individually.
As each image is much smaller than the size of the string representing the sky, fine-grain access to substrings is crucial.

Versioning. As new images are taken by the telescope, the view of the sky needs to be updated, while the previous views of the sky still need to be accessible. It is desirable to refer to views of the sky at particular moments in time, therefore versioning is necessary.

Read-read concurrency. Comparison of images for different parts of the sky is a massively parallel problem. That is, concurrent reads of different images in a view, or concurrent reads of the same image in different views, should be efficiently processed in parallel.

Read-write concurrency. The telescope may gather and store new pictures (i.e. new versions of some part of the sky) while the analysis proceeds on the previous versions. Consequently, in our model, it is important to allow new versions of our global string to be generated and written while the earlier versions are read and analyzed: read-write concurrency is highly desirable for efficiency.

Write-write concurrency. As multiple telescopes may
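One common way to satisfy the versioning and read-write concurrency requirements above is snapshot-based versioning: each write publishes a new immutable version of the segment map, so readers of older versions never block writers. The sketch below is an assumed single-machine illustration of that idea, not the paper's distributed implementation (which uses RAM-based storage and DHT-based metadata); all names are hypothetical:

```python
# Minimal snapshot-versioning sketch (assumed design, not the paper's code).
# Published versions are immutable, so the read path needs no lock at all;
# only version publication is serialized.
import threading

SEGMENT = 4  # toy segment size in bytes

class VersionedBlob:
    def __init__(self, data: bytes):
        # versions[v] maps segment index -> bytes; version 0 is the initial view
        segs = {i // SEGMENT: data[i:i + SEGMENT]
                for i in range(0, len(data), SEGMENT)}
        self.versions = [segs]
        self.lock = threading.Lock()  # guards version publication only

    def read(self, version: int, seg: int) -> bytes:
        # Lock-free read: a published segment map is never mutated
        return self.versions[version][seg]

    def write(self, seg: int, data: bytes) -> int:
        # Copy-on-write: build a fresh map, then publish it as a new version
        with self.lock:
            new = dict(self.versions[-1])
            new[seg] = data
            self.versions.append(new)
            return len(self.versions) - 1
```

A reader holding an old version number sees a stable view of the whole string even while new versions are being written, which is exactly the read-write concurrency the application scenario calls for.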

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
