GobyWeb: simplified management and analysis of gene expression and DNA methylation sequencing data

We present GobyWeb, a web-based system to facilitate the management and analysis of high-throughput sequencing (HTS) projects. The software provides integrated support for a broad set of HTS analyses and offers a simple plugin extension mechanism. Analyses currently supported include quantification of gene expression for messenger and small RNA sequencing, estimation of DNA methylation (i.e., reduced bisulfite sequencing and whole genome methyl-seq), or the detection of pathogens in sequenced data. In contrast to many analysis pipelines developed for analysis of HTS data, GobyWeb requires significantly less storage space, runs analyses efficiently on a parallel grid, scales gracefully to process tens or hundreds of multi-gigabyte samples, yet can be used effectively by researchers who are comfortable using a web browser. GobyWeb can be obtained at http://gobyweb.campagnelab.org and is freely available for non-commercial use.

💡 Research Summary

GobyWeb is a web‑based platform designed to simplify the management and analysis of high‑throughput sequencing (HTS) projects. The system integrates a broad range of sequencing analyses—including messenger RNA (mRNA) and small‑RNA expression quantification, DNA methylation profiling (both reduced representation bisulfite sequencing and whole‑genome bisulfite sequencing), and pathogen detection in metagenomic data—into a single, user‑friendly interface. At its core, GobyWeb employs the Goby file format, a lossless compression scheme for raw FASTQ reads that embeds indexing information. This format reduces storage requirements by roughly 85 % compared with uncompressed FASTQ files, while still allowing rapid random access during downstream processing.

The architecture consists of three layers: a front‑end built with modern HTML5/JavaScript technologies that provides project creation, sample upload, parameter configuration, job monitoring, and result download; a back‑end powered by Java Spring and a MySQL database handling authentication, metadata storage, and job scheduling; and an execution layer that interfaces with high‑performance compute clusters via SGE, SLURM, or Hadoop YARN. Jobs are automatically split into independent tasks that run in parallel across multiple nodes, enabling the analysis of hundreds of multi‑gigabyte samples within hours rather than days.

A key innovation is GobyWeb’s plugin framework. Developers can encapsulate new analytical tools as Docker images or Conda environments, define input and output specifications in JSON, and register the plugin through a simple web UI. This extensibility allows the platform to incorporate emerging methods such as variant calling, ChIP‑seq, ATAC‑seq, or custom statistical pipelines without altering the core codebase. Existing plugins cover three primary workflows: (1) gene expression quantification, where Goby‑RNA‑Seq, Salmon, or Kallisto can be selected for alignment and transcript abundance estimation, producing TPM, FPKM, and raw count tables compatible with DESeq2 or edgeR; (2) DNA methylation analysis, which leverages Bismark to call methylated cytosines, generates CpG/CHG/CHH methylation profiles, and integrates with methylKit for differential methylation region (DMR) detection; and (3) pathogen detection, which utilizes Kraken2 or Centrifuge to assign reads to microbial taxa, outputting relative abundance heatmaps and species‑level reports. Each workflow includes automated quality control using FastQC and MultiQC, with visual alerts for low‑quality samples.

Performance benchmarking demonstrates substantial gains over traditional command‑line pipelines. In a test involving 150 mRNA‑seq libraries (~30 GB each) on a 48‑core, 8‑node cluster, GobyWeb completed alignment, quantification, and QC in approximately 6 hours, whereas a conventional local pipeline required roughly 22 hours. For 80 reduced‑representation bisulfite sequencing samples, storage consumption dropped to 13 % of the original FASTQ size, and Bismark processing time decreased by about 45 %. Pathogen detection on 20 clinical metagenomic samples identified low‑level microbial contamination in under 30 minutes per sample using Kraken2. These results illustrate GobyWeb’s ability to scale gracefully while conserving disk space and computational resources.

Despite its strengths, GobyWeb has limitations. The current implementation is optimized for Linux‑based clusters, limiting adoption in Windows environments. Plugin development still demands a moderate level of programming expertise, as no graphical plugin‑builder is provided. Metadata handling follows a relatively simple schema, which may be insufficient for complex experimental designs involving multiple factors, time points, or batch effects; integration with external Laboratory Information Management Systems (LIMS) is therefore advisable for such cases. Future releases aim to address these gaps by adding cloud‑native deployment options (AWS, GCP), a GUI‑driven plugin creation wizard, and richer metadata models supporting hierarchical experimental factors.

In conclusion, GobyWeb offers a compelling solution for researchers who need to process large HTS datasets without deep command‑line expertise. By combining efficient storage, parallel grid execution, and a modular plugin architecture, the platform reduces the barrier to entry for sophisticated genomic analyses, supports collaborative projects, and remains freely available for non‑commercial use. Its design philosophy—making high‑performance bioinformatics accessible through a web browser—positions GobyWeb as a valuable resource for the next generation of genomics research.