ConGUSTo: (HT)Condor Graphical Unified Supervising Tool


HTCondor is a distributed job scheduler developed by the University of Wisconsin-Madison that allows users to run their applications on other users’ machines when those machines are not being used, thus providing a considerable increase in overall computational power and a more efficient use of computing resources. Our institution has been using HTCondor successfully for more than ten years, and it is now the most heavily used supercomputing resource we have. Although HTCondor provides a wide range of tools and options for management and administration, there are currently no tools that show detailed usage information and statistics in a clear, easy-to-interpret, interactive set of graphical displays. For this reason, we have developed ConGUSTo, a web-based tool that makes it easy to collect HTCondor usage and statistics data and presents them in a variety of tables and charts.


💡 Research Summary

The paper presents ConGUSTo, a web‑based graphical monitoring and statistics tool for HTCondor clusters, developed to address the shortcomings of the native HTCondor logging and reporting facilities. The authors describe the context of the Instituto de Astrofísica de Canarias (IAC), which operates more than 200 Linux desktops providing roughly 700 computing slots. In a single six‑month period the cluster delivered about 1.3 million CPU‑hours, equivalent to 150 years of sequential computing. Managing such a resource requires detailed insight into per‑machine usage, both for troubleshooting (users often blame HTCondor for perceived slowdowns) and for energy‑saving policies (the institute wishes to power down idle desktops at night and on weekends).
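The "150 years" figure can be checked with a quick conversion. The snippet below is only an arithmetic sketch; it assumes a 365-day year, which the summary does not state.

```python
# Convert the reported six-month CPU time into years of sequential computing.
CPU_HOURS = 1.3e6            # figure quoted in the summary
HOURS_PER_YEAR = 24 * 365    # assumption: non-leap calendar year

sequential_years = CPU_HOURS / HOURS_PER_YEAR
print(f"{sequential_years:.0f} years")  # about 148, i.e. roughly 150 years
```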

Existing HTCondor tools (condor_userlog, condor_history, condor_status, built‑in HTML statistics, and third‑party solutions such as Ganglia, CycleServer, or Cumin) either provide only aggregate pool‑level data, require complex installation on each node, or lack the specific per‑slot, per‑machine information needed by the IAC administrators. In particular, the native logs are scattered across machines, can be very large, and their plain‑text format is difficult to parse for historical queries.

ConGUSTo’s design deliberately avoids direct log parsing. Instead, a lightweight Bash script (under 20 lines) periodically runs standard HTCondor commands (condor_status, condor_q) on a single collector node, extracts the relevant fields with classic Unix text‑processing utilities (grep, sed, awk, cut, tr), and writes the results into CSV‑like plain‑text files. The files are organized in a hierarchical directory tree by year, month, day, and machine, with one file per machine per day. This structure enables fast retrieval of data for any date range without a database. Real‑time status (e.g., current slot states) is obtained on demand via condor_status with a predefined output format, eliminating the need to store transient information.
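The storage layout described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Bash script: the sample text, the field order, the `.csv` extension, and the function names `parse_status`, `day_file`, `append_snapshot`, and `read_range` are all assumptions made for the example. In a real deployment the input would come from an HTCondor command (for instance via `condor_status` with its `-autoformat` option; the summary does not say which options the authors use).

```python
import csv
import tempfile
from datetime import date, datetime, timedelta
from pathlib import Path

# Hypothetical sample of what a collector might capture: one line per slot,
# whitespace-separated fields (slot name, state, activity).
SAMPLE_STATUS = """\
slot1@pc01.example.org Claimed Busy
slot2@pc01.example.org Unclaimed Idle
slot1@pc02.example.org Owner Idle
"""


def parse_status(text):
    """Split each non-empty line into (machine, slot, state, activity)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, state, activity = line.split()
        slot, machine = name.split("@", 1)
        rows.append((machine, slot, state, activity))
    return rows


def day_file(root, day, machine):
    """Path of the single file holding one machine's samples for one day."""
    return Path(root) / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / f"{machine}.csv"


def append_snapshot(root, when, rows):
    """Append one timestamped sample per slot to the per-machine daily file."""
    for machine, slot, state, activity in rows:
        path = day_file(root, when.date(), machine)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", newline="") as handle:
            csv.writer(handle).writerow([when.isoformat(), slot, state, activity])


def read_range(root, machine, first, last):
    """Yield the stored rows for one machine over an inclusive date range."""
    day = first
    while day <= last:
        path = day_file(root, day, machine)
        if path.exists():
            with path.open(newline="") as handle:
                yield from csv.reader(handle)
        day += timedelta(days=1)


# Usage: store two daily snapshots in a temporary tree, then read them back.
root = tempfile.mkdtemp()
rows = parse_status(SAMPLE_STATUS)
append_snapshot(root, datetime(2024, 3, 1, 12, 0), rows)
append_snapshot(root, datetime(2024, 3, 2, 12, 0), rows)
history = list(read_range(root, "pc01.example.org", date(2024, 3, 1), date(2024, 3, 2)))
print(len(history))  # two slots on pc01, sampled on two days
```

Because the date and machine name are encoded in the path, a range query only opens the files it needs, which is the property that lets this design work without a database.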

The presentation layer is built with PHP for server‑side processing and JavaScript libraries (e.g., Chart.js or D3.js) for interactive charts. ConGUSTo offers several key features:

  1. Machine‑centric job views – both summary and detailed tables for all jobs executed on a given host, with clickable entries that reveal owner, state, start/end times, etc.
  2. Time‑range visualizations – users can select daily, weekly, or monthly windows and define the start day, producing line and bar charts of job counts, CPU usage, idle vs. running ratios, etc.
  3. Panoramic cluster view – a configurable dashboard showing the status of every machine and slot at a glance, enriched with information not available in standard HTCondor stats, such as last job execution time, slot‑level restrictions, and scratch‑disk space. Filters allow administrators to focus on specific subsets (e.g., machines allowed to run only at night).
  4. Ease of deployment – only the web server needs PHP; the collector node runs the Bash script via cron. No additional software is required on the compute nodes, and the tool works with any HTCondor version because it relies solely on the command‑line interface.
  5. Extensibility – adding new metrics involves extending the Bash extraction script and updating the PHP/JS rendering templates; no schema changes or database migrations are needed.
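As an illustration of the kind of aggregation behind the time-range charts in item 2, the sketch below reduces per-slot samples to a daily busy ratio. It is written in Python for brevity, whereas the tool itself uses PHP and JavaScript; the sample records and the function name `busy_ratio_per_day` are hypothetical, and the record layout is an assumption rather than the authors' actual file format.

```python
from collections import defaultdict

# Hypothetical samples: (ISO timestamp, slot, state, activity).
SAMPLES = [
    ("2024-03-01T00:00:00", "slot1", "Claimed", "Busy"),
    ("2024-03-01T00:00:00", "slot2", "Unclaimed", "Idle"),
    ("2024-03-01T12:00:00", "slot1", "Claimed", "Busy"),
    ("2024-03-01T12:00:00", "slot2", "Claimed", "Busy"),
    ("2024-03-02T00:00:00", "slot1", "Unclaimed", "Idle"),
    ("2024-03-02T00:00:00", "slot2", "Unclaimed", "Idle"),
]


def busy_ratio_per_day(samples):
    """Return {day: fraction of slot samples whose activity was Busy}."""
    busy = defaultdict(int)
    total = defaultdict(int)
    for timestamp, _slot, _state, activity in samples:
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of the ISO timestamp
        total[day] += 1
        busy[day] += activity == "Busy"
    return {day: busy[day] / total[day] for day in sorted(total)}


ratios = busy_ratio_per_day(SAMPLES)
for day, ratio in ratios.items():
    print(day, f"{ratio:.2f}")
```

A weekly or monthly window would follow the same pattern with a different grouping key, and the resulting mapping is the sort of series a charting library can plot directly.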

The authors acknowledge limitations: the cron‑based data collection introduces a latency of several minutes, making true real‑time monitoring impossible; the flat‑file approach may become unwieldy for very large clusters with many years of history; and the system assumes continuous availability of the collector node. Future work includes integrating WebSocket‑based streaming for sub‑minute updates, optional storage in relational databases for advanced querying, and tighter integration with external monitoring platforms such as Grafana.

In conclusion, ConGUSTo provides a lightweight, version‑agnostic, and user‑friendly solution for detailed HTCondor usage analytics. By delivering per‑machine, per‑slot insights in an interactive web interface, it helps administrators justify energy‑saving measures, improve user confidence, and ultimately increase the scientific productivity of large‑scale high‑throughput computing environments.

