Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots


📝 Abstract

Strawberry harvesting robots faced persistent challenges such as low integration of visual perception, fruit-gripper misalignment, empty grasping, and strawberry slippage from the gripper due to insufficient gripping force, all of which compromised harvesting stability and efficiency in orchard environments. To overcome these issues, this paper proposed a visual fault diagnosis and self-recovery framework that integrated multi-task perception with corrective control strategies. At the core of this framework was SRR-Net, an end-to-end multi-task perception model that simultaneously performed strawberry detection, segmentation, and ripeness estimation, thereby unifying visual perception with fault diagnosis. Based on this integrated perception, a relative error compensation method based on simultaneous target-gripper detection was designed to address positional misalignment, correcting deviations when the error exceeded the tolerance threshold. To mitigate empty grasping and fruit-slippage faults, an early abort strategy was implemented. A micro-optical camera embedded in the end-effector provided real-time visual feedback, enabling grasp detection during the deflating stage and strawberry slip prediction during snap-off through a MobileNet V3-Small classifier and a time-series LSTM classifier. Experiments demonstrated that SRR-Net maintained high perception accuracy. For detection, it achieved a precision of 0.895 and recall of 0.813 on strawberries, and 0.972/0.958 on hands. In segmentation, it yielded a precision of 0.887 and recall of 0.747 for strawberries, and 0.974/0.947 for hands. For ripeness estimation, SRR-Net attained a mean absolute error of 0.035, while simultaneously supporting multi-task perception and sustaining a competitive inference speed of 163.35 FPS. The compensation method reduced physical relative errors from 11.52 mm and 5.15 mm to 3.12 mm and 4.11 mm, below the tolerance threshold, while the early-abort strategy shortened execution time by 1.89 s and 0.3 s during the deflating and snap-off stages. Overall, the proposed framework enhanced both the robustness and efficiency of strawberry harvesting robots, providing a practical and reliable solution for autonomous fruit harvesting.


📄 Full Content

Meili Sun1, Chunjiang Zhao2,1*, Lichao Yang2, Hao Liu2, Shimin Hu2, Ya Xiong2,1*

1College of Mechanical and Electrical Engineering, Shihezi University, Xinjiang, 832003, China. 2Intelligent Equipment Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing, 100097, China. *Corresponding author(s). E-mail(s): zhaocj@nercita.org.cn; yaxiong@nercita.org.cn

Keywords: visual fault perception, early diagnosis, relative error compensation, early abort strategy, harvesting robot

1 Introduction

Harvesting robots [1] [2] [3] have made significant progress in recent years, showing great potential to reduce labor dependence and improve agricultural productivity. However, several mechanical, electrical, and control faults [4] [5] [6] still arise during robotic operations, compromising operational stability and continuity. For example, mechanical structural failures [7], air leakage in pneumatic end-effectors [8], and fractures in end-joint connectors can impair the normal operation of strawberry harvesting robots.
Moreover, the absence of active learning and self-updating mechanisms renders models ineffective in adapting to changing conditions. Failure to respond to abnormal signals also poses operational and safety risks to the robot. Once any single component fails or produces an erroneous output, the entire harvesting process is interrupted. Due to the lack of robust fault diagnosis and self-recovery mechanisms, harvesting robots remain prone to frequent work interruptions, which shortens their effective operation time in the field.

Generally, fruit harvesting with robots involves seven key steps: image acquisition; fruit detection, segmentation, and ripeness estimation; instance tracking and localization; motion planning; execution; result evaluation with fault diagnosis; and result recording with optimization. Along this pipeline, potential faults may occur in multiple stages [9]. On the perception side, problems such as blurred or occluded cameras, unstable illumination [10], network/data loss, inaccurate segmentation [11] [12], ripeness misclassification [13], ID tracking drift, and depth or calibration errors can lead to detection failures. At the motion control level, incorrect path planning, collision avoidance errors, inverse kinematics failures, or grasp misalignment may disrupt harvesting. Even after successful contact, execution faults such as loose grasps, failed detachments, fruit damage, or gripper jamming remain common. These are further compounded by inaccurate success/failure judgment and inadequate fault recovery, resulting in prolonged downtime. Without robust fault diagnosis and self-recovery mechanisms, current robots are prone to frequent interruptions, severely limiting their effective operating time in orchards.

In the HarvestFlex system (https://xiong-lab.cn/), as shown in Fig. 1 (a) and (b), a RealSense D455F depth camera (Intel, RealSense, USA) captured real-time RGB-D images, which were processed by YOLOv8/11 for strawberry detection and segmentation. A checkerboard-based calibration enabled eye-to-hand coordinate transformation, while polynomial regression compensated spatial errors. For sequence planning, a minimal-height sorting algorithm determined the harvesting order, and a point-to-point speed-pattern-based method ensured efficient movement of the end-effector. The end-effector was a flexible pneumatic gripper that inflated and deflated to gently envelop and detach strawberries. The complete harvesting workflow consisted of six steps: inflating and approaching, swallowing, deflating, snap-off, descending, and placing, as shown in Fig. 2.

Fig. 1: Image collection devices and environment. (a) HarvestFlex robot with a shading cover; (b) HarvestFlex robot without a shading cover; (c) table-top strawberries.

Fig. 2: The picking process of the strawberry harvesting robot: inflating and approaching, swallowing, deflating, snap-off, descending, and placing.

However, several issues emerged with our HarvestFlex robot that severely compromised its ability to harvest stably and continuously. The primary challenges were: (1) Low integration of visual perception tasks: the limited fusion of detection, segmentation, and ripeness estimation led to increased model complexity and reduced inference speed, thereby constraining overall system efficiency. (2) Positional inaccuracy between the end-effector and the target fruit: although the eye-to-hand approach offered a stable field of view for dynamic fruit tracking, it remained highly sensitive to hand-eye calibration errors.
Even minor deviations could lead to misalignment, empty grasps, unsuccessful detachments, or incorrect placements, thereby disrupting continuous operation. (3) Inadequate gripping and strawberry slippage: although the pneumatic soft gripper reduced fruit damage, its high compliance made the integration of reliable tactile or slippage sensors challenging. Consequently, slippage often went undetected, resulting in wasted cycle time and, in some cases, unintended fruit release during the snap-off stage. This problem was further compounded by the immaturity of current tactile sensing technologies, whose solutions remained expensive, unstable, and impractical for large-scale deployment in orchard environments.

To mitigate such problems, fault diagnosis and recovery methods were generally classified into two categories, traditional approaches and deep learning-based techniques, addressing electrical, software and control, process-related, and human-error faults [14]. Traditionally, various methods were proposed to address robotic faults, such as signal processing techniques [15], rule-based systems [16], and model-based approaches [17]. However, these traditional methods faced limitations in handling complex fault scenarios, particularly in the case of rule-based and threshold-based approaches. With advances in computer hardware, deep learning emerged as a new paradigm for fault diagnosis, enabling automatic feature extraction from multi-modal data such as sensor signals, vibrations, acoustics, and images. CNNs were applied to surface defect detection [18], while RNNs and LSTMs were used to analyze time-series data for early fault detection [19]. However, these methods often relied on additional sensors to acquire high-quality multi-modal data, which not only increased system cost and complexity but also complicated deployment in real-world agricultural environments. Furthermore, ensuring temporal and spatial alignment across heterogeneous data sources (e.g., vision, vibration, and tactile signals) remained a nontrivial challenge, as even slight misalignments could degrade fault diagnosis accuracy.

To overcome these limitations, this paper focused on a vision-based fault diagnosis framework for HarvestFlex, thereby avoiding the need for additional multi-modal sensors and the complexities of data alignment. The objective was to address software- and control-related faults that compromised harvesting stability and efficiency. To this end, an end-to-end multi-task perception method, SRR-Net [20], was introduced to integrate visual perception with fault-related object detection while maintaining a lightweight model architecture and low computational load. The method comprised three primary subtasks: object detection, instance segmentation, and ripeness estimation. To align the gripper with the harvesting point, a relative error compensation method based on simultaneous target-gripper detection was implemented in the coordinate frame of the robot arm once the gripper reached the position beneath the target. A corrected harvesting point was then generated, guiding the gripper to move beneath the target a second time to ensure accurate alignment. Furthermore, an early abort strategy was introduced to improve the reliability of the harvesting process. Specifically, a micro-optical camera embedded at the base of the gripper continuously monitored the presence and stability of strawberries. In the deflating stage, a MobileNet V3-Small classifier [21] verified whether the fruit had been successfully grasped, while in the snap-off stage, an LSTM classifier estimated the probability of strawberry slippage from the gripper. If no strawberry was detected, an early abort signal was triggered to terminate the current cycle. This integrated perception-action framework enabled prompt detection of empty grasps or slippage, allowing the robotic arm to respond rapidly and maintain stable operation.
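As a concrete illustration of the multi-task design described above, a shared backbone and neck feeding detection, segmentation, and ripeness branches, the sketch below shows one possible layout in PyTorch. All module names, layer sizes, and head structures are illustrative assumptions rather than the released SRR-Net; only the three-subtask layout and the [0, 1.1] ripeness range come from the text.

```python
# Illustrative shared-backbone multi-task sketch in the spirit of SRR-Net.
# Shapes, channel counts, and head designs are assumptions, not SRR-Net itself.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Stand-in feature extractor shared by all three task heads."""
    def __init__(self, out_ch: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_ch, 3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, x):
        return self.features(x)

class MultiTaskPerception(nn.Module):
    """One forward pass yields detection, segmentation, and ripeness outputs."""
    def __init__(self, num_classes: int = 3):  # strawberry / hand / table
        super().__init__()
        self.backbone = SharedBackbone()
        self.det_head = nn.Conv2d(128, num_classes + 4, 1)  # class logits + box
        self.seg_head = nn.Conv2d(128, num_classes, 1)      # per-class mask logits
        self.ripeness_head = nn.Sequential(                 # scalar ripeness score
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.backbone(x)  # features computed once, reused by every head
        return {
            "det": self.det_head(f),
            "seg": self.seg_head(f),
            "ripeness": 1.1 * self.ripeness_head(f),  # rescaled to [0, 1.1]
        }

if __name__ == "__main__":
    out = MultiTaskPerception()(torch.randn(1, 3, 480, 640))
    print({k: tuple(v.shape) for k, v in out.items()})
```

Computing the backbone features once and reusing them in every head is what removes the cost of running three separate models, which is the integration benefit the framework targets.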
The main contributions of this paper were as follows.
• An end-to-end multi-task perception framework, SRR-Net, was introduced to integrate visual perception with fault diagnosis.
• A relative error compensation method based on simultaneous target-gripper detection was developed to realign the position of the end-effector when approaching the area beneath the picking point.
• To enhance operational reliability, an early abort strategy was implemented. During the deflating stage, MobileNet V3-Small was used to classify whether the strawberry had been successfully grasped. Subsequently, in the snap-off stage, an LSTM classifier predicted the likelihood of fruit slippage from the gripper, enabling timely corrective actions.

2 Dataset Benchmark

In this paper, FaultData, GraspData, and SnapData were constructed to support the multi-task vision perception task, the grasp adjustment task, and the time-series LSTM classification task, respectively. All data were collected using the HarvestFlex strawberry-harvesting robot, which includes two robotic arms [22], two RealSense D455F cameras, and two controllable light sources, as shown in Fig. 1(a-b). The cameras captured RGB-D images at a resolution of 640×480 pixels, with the distance between the lens and the strawberries ranging from 30 to 90 cm. Data were collected in both natural and controlled lighting environments to enhance the robot's visual adaptability to complex and dynamic orchard conditions.

FaultData. For the tasks of detection, segmentation, and ripeness estimation of strawberries and the gripper, images covering diverse strawberry formations (isolated, overlapping, and occluded fruits) as well as the complete operational sequence of the end-effector during picking were collected. To ensure robust and continuous multi-task vision perception, the dataset also incorporated a variety of lighting conditions, weather scenarios, and nighttime environments, laying the foundation for stable, autonomous, all-weather performance. In addition, high-resolution images were captured using a Redmi Note 13 Pro smartphone and an Orbbec Gemini Pro RGB-D camera (Orbbec, Gemini Pro, China), providing enhanced texture and color information. The entire dataset was collected at the Cuihu Factory in Beijing, China, and featured the Fragaria × ananassa 'Kaorino' cultivar, as shown in Fig. 1(c). All images were annotated with the polygonal outlines of strawberries and the end-effector using Labelme [23]. Each object instance was labeled as strawberry, hand, or table, with the ripeness of strawberries assigned based on [20]. The ripeness score was defined within the range [0, 1.1]. The label format for each instance followed a fixed structure in which cls represented the object class. The dataset was split into training and validation subsets, consisting of 2954 and 779 images, respectively.
Note that the ripeness attribute applied only to strawberries. To ensure a consistent label format, the table and hand classes were assigned alignment flags of 2 and 3, respectively.

GraspData. Using a miniature camera, image data were collected during the deflating stage of the end-effector. A total of 594 images were acquired. In the training set, 197 images contained strawberries and 202 did not; in the validation set, the corresponding counts were 93 and 102.

Fig. 3: Harvesting fault and recovery in HarvestFlex.

SnapData. For the time-series prediction task during the strawberry snap-off stage, all images were captured using a micro-optical camera mounted on the end-effector, with artificial strawberry instances as targets. In each frame, key visual features, including the normalized strawberry area, normalized gripper area, normalized background area, and the width and height of the strawberry, were extracted using a fine-tuned SRR-Net. These features were structured per frame as a five-element vector: normalized strawberry area, normalized gripper area, normalized background area, strawberry width, and strawberry height.
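To make the data layout concrete, the sketch below assembles one SnapData-style feature vector per frame and buffers a short window of frames as the LSTM input sequence. The feature order follows the list above; the 10-frame window (suggested by the frame-count remark in Section 4.5) and all function names are assumptions for illustration.

```python
# Sketch of per-frame SnapData features and a sliding window for the LSTM.
# Helper names and the window length are illustrative assumptions.
from collections import deque
from typing import Optional
import numpy as np

FEATURE_DIM = 5  # s_area, g_area, bg_area, width, height (one frame)

def frame_features(masks: dict, box_wh: tuple, img_wh: tuple) -> np.ndarray:
    """Build one feature vector from binary masks and the strawberry box."""
    total = img_wh[0] * img_wh[1]
    w, h = box_wh
    return np.array([
        masks["strawberry"].sum() / total,  # normalized strawberry area
        masks["gripper"].sum() / total,     # normalized gripper area
        masks["background"].sum() / total,  # normalized background area
        w / img_wh[0],                      # strawberry width (normalized)
        h / img_wh[1],                      # strawberry height (normalized)
    ], dtype=np.float32)

class SlidingWindow:
    """Keep the most recent frames as the LSTM input sequence."""
    def __init__(self, length: int = 10):  # assumed window length
        self.buf = deque(maxlen=length)

    def push(self, feat: np.ndarray) -> Optional[np.ndarray]:
        self.buf.append(feat)
        if len(self.buf) < self.buf.maxlen:
            return None                    # not enough history yet
        return np.stack(self.buf)          # shape: (length, FEATURE_DIM)
```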

Table: Detection, segmentation, and ripeness estimation results of YOLOv11, YOLOv11-seg, and SRR-Net on FaultData, and of the fine-tuned SRR-Net on SnapData (P: precision; R: recall; MAE: ripeness mean absolute error; FPS: inference speed). "…" marks values lost in extraction; "-" marks not applicable.

| Dataset | Model | Class | Box P | Box R | Box mAP50 | Box mAP50-95 | Mask P | Mask R | Mask mAP50 | Mask mAP50-95 | MAE | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FaultData | YOLOv11 | strawberry | … | … | … | … | - | - | - | - | - | 201.10 |
| | | hand | 0.966 | 0.957 | 0.977 | 0.793 | - | - | - | - | - | |
| | YOLOv11-seg | strawberry | 0.892 | 0.812 | 0.884 | 0.663 | 0.864 | 0.753 | 0.824 | 0.442 | - | 167.89 |
| | | hand | 0.968 | 0.957 | 0.976 | 0.791 | 0.971 | 0.95 | 0.975 | 0.653 | - | |
| | SRR-Net | strawberry | 0.895 | 0.813 | 0.884 | 0.633 | 0.887 | 0.747 | 0.829 | 0.448 | 0.035 | 163.35 |
| | | hand | 0.972 | 0.958 | 0.977 | 0.788 | 0.974 | 0.947 | 0.964 | 0.655 | - | |
| SnapData | SRR-Net | strawberry | 0.999 | 0.992 | 0.995 | 0.978 | 0.999 | 0.992 | 0.995 | 0.981 | - | - |
| | | background | 0.881 | 0.922 | 0.941 | 0.807 | 0.888 | 0.93 | 0.946 | 0.824 | - | - |

To more intuitively observe the performance of SRR-Net, the visualization results of SRR-Net, YOLOv11, and YOLOv11-seg on the FaultData dataset are presented in Fig. 10. In Fig. 10 (a) and (b), strawberries and end-effectors were represented with bounding boxes, with the target class label and confidence score displayed above each box. The confidence threshold was set to 0.5. For the same strawberry instance, SRR-Net produced higher confidence scores than YOLOv11 and YOLOv11-seg. In terms of ripeness, the values estimated by SRR-Net closely matched the true ripeness of each strawberry. Overall, the visual comparisons illustrated that SRR-Net delivered superior performance compared with YOLOv11 and YOLOv11-seg.

Fig. 10: Visualization results. (a) YOLOv11; (b) YOLOv11-seg; (c) SRR-Net.

4.4 Evaluation of Relative Error Compensation

To validate the effectiveness of relative error compensation, the 3D coordinates of the strawberry picking point (xs, ys, zs), the gripper position beneath the picking point (xe, ye, ze), the relative error before compensation (Δx, Δy), the ground-truth error before compensation (Δxw, Δyw), the compensated picking point (xce, yce, zce), and the relative error after compensation (Ex, Ey) in the robot arm coordinate system were measured, as reported in Table 3. Due to the requirement of snap-off, the rise distance in the swallowing stage exceeded the height of the strawberry. Therefore, to simplify the experiments and reduce computational overhead, the relative error along the z-axis was not considered.

From Table 3, the visual and ground-truth errors before compensation were compared, and the ground-truth errors after compensation were recorded. For example, in the first row, the relative errors on the x- and y-axes were 22 mm and -4 mm, while the corresponding ground-truth errors were 17.3 mm and 4.2 mm. The absolute differences between the visual and ground-truth errors were therefore 4.7 mm and 8.2 mm. After compensation, the physical errors on the x- and y-axes were reduced to 1.5 mm and 6.3 mm, both below the defined threshold.

Before compensation, the mean relative errors were 14.07 mm on the x-axis and 8.64 mm on the y-axis, while the mean physical errors were 11.52 mm and 5.15 mm, respectively, indicating that visual estimation tended to overestimate the actual physical errors. With the proposed compensation method, the mean physical errors between the strawberry picking point and the gripper were further reduced to 3.12 mm (x-axis) and 4.11 mm (y-axis). The slightly larger residual y-axis error was likely due to the difficulty of achieving high-precision motor control over very small movement ranges. Despite this, the relative error compensation method demonstrated a clear advantage in aligning the gripper with the target and in reducing grasping errors.

Table 3: Relative errors between the end-effector and the picking point in the arm coordinate system (unit: mm). The final row shows the computed mean absolute errors; "…" marks cells lost in extraction.

| xs | ys | zs | xe | ye | ze | Δx | Δy | Δxw | Δyw | xce | yce | zce | Ex | Ey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 709 | 221 | 706 | 686 | 225 | 647 | 22 | -4 | 17.3 | 4.2 | 732 | 219 | 706 | 1.5 | 6.3 |
| 464 | 232 | 710 | 450 | 249 | 652 | 14 | -17 | 8.3 | -12.7 | 479 | 223 | 710 | 5.0 | -8.2 |
| 377 | 245 | 693 | 362 | 255 | 626 | 15 | -10 | 7.2 | -2.5 | 392 | 239 | 693 | 2.6 | 3.6 |
| 700 | 222 | 699 | 681 | 224 | 638 | 18 | -2 | 15.7 | 3.4 | 719 | 221 | 699 | 0 | 2.8 |
| 532 | 234 | 710 | 518 | 245 | 649 | 14 | -11 | 10.6 | -3.6 | 546 | 228 | 710 | 2.7 | -2.9 |
| 711 | 244 | 711 | 692 | 249 | 651 | 18 | -5 | 9.2 | -5.1 | 730 | 241 | 711 | 0 | 2.2 |
| 320 | 244 | 692 | 316 | 256 | 628 | 4 | -11 | 3.6 | -2.6 | 324 | 239 | 692 | 0 | 2.1 |
| 468 | 236 | 703 | 460 | 245 | 653 | 8 | -8 | 12.8 | -1.3 | … | … | … | … | … |
| 652 | 235 | 712 | 631 | 239 | 651 | 20 | -3 | 20.4 | 6.2 | 673 | 233 | 712 | -4.4 | -2 |
| 816 | 220 | 699 | 793 | 223 | 642 | 23 | -2 | 17.4 | 6.8 | 839 | 219 | 699 | -6.4 | 5.1 |
| 393 | 238 | 706 | 386 | 245 | 643 | 7 | -7 | 6.9 | 3.0 | … | … | … | … | … |
| 734 | 222 | 699 | 712 | 224 | 638 | 22 | -1 | 11.8 | 5.0 | 757 | 221 | 699 | -5.3 | 3.9 |
| 445 | 230 | 711 | 436 | 249 | 646 | 8 | -19 | 9.9 | -9.5 | 454 | 220 | 711 | -4.3 | 6.2 |
| 629 | 234 | 721 | 611 | 244 | 659 | 17 | -10 | 17.5 | -2.6 | 647 | 228 | 721 | 5.9 | 6.3 |
| 299 | 246 | 693 | 295 | 259 | 631 | 4 | -12 | 10 | -7.9 | 304 | 240 | 693 | -2.4 | 1.2 |
| 453 | 233 | 701 | 443 | 251 | 637 | 10.6 | -18.5 | 10.4 | -7.5 | 464 | 224 | 701 | 7.1 | -1.2 |
| 752 | 213 | 708 | 729 | 218 | 645 | 22 | -4 | 13.4 | 6.9 | 774 | 221 | 708 | 0 | 4.3 |
| 667 | 215 | 708 | 645 | 222 | 645 | 22 | -7 | 14.2 | 0 | 690 | 211 | 708 | -3.8 | -6.0 |
| 307 | 246 | 693 | 305 | 252 | 633 | 1.71 | -6.3 | 5.7 | 0 | … | … | … | … | … |
| 467 | 232 | 700 | 456 | 248 | 634 | 11 | -15 | 8.1 | -12.1 | 479 | 224 | 700 | -1.6 | 5.6 |
| Mean | | | | | | 14.07 | 8.64 | 11.52 | 5.15 | | | | 3.12 | 4.11 |
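A minimal sketch of this compensation step is given below, assuming a 5 mm tolerance (the paper states a threshold but not its value) and a simple additive correction rule; the function name and sign conventions are illustrative, not the exact controller used on HarvestFlex.

```python
# Sketch of relative error compensation: once the gripper is beneath the
# target, the picking point and the gripper are detected simultaneously, the
# x/y error is computed in the robot arm frame, and a corrected picking point
# is issued for a second approach if the error exceeds the tolerance.
TOLERANCE_MM = 5.0  # assumed tolerance threshold (not stated in the text)

def compensate(picking_pt, gripper_pt, tol=TOLERANCE_MM):
    """Return a corrected (x, y, z) picking point, or None if aligned."""
    xs, ys, zs = picking_pt       # strawberry picking point, arm frame (mm)
    xe, ye, _ze = gripper_pt      # detected gripper position, arm frame (mm)
    dx, dy = xs - xe, ys - ye     # relative error; z is ignored as in Sec. 4.4
    if abs(dx) <= tol and abs(dy) <= tol:
        return None               # within tolerance: no second move needed
    # Shift the commanded target by the measured error and approach again.
    return (xs + dx, ys + dy, zs)

# With the coordinates of Table 3, row 1, the computed error of (23, -4) mm
# exceeds the assumed tolerance, so a corrected point is generated.
print(compensate((709, 221, 706), (686, 225, 647)))  # -> (732, 217, 706)
```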

4.5 Operational Efficiency Analysis of the Early Abort Strategy

To evaluate the effectiveness of the early abort strategy, the original picking time, the minimum time at which the early abort signal was triggered, and the time reduction for each action were calculated and analyzed in Table 4. In the original process, a complete picking cycle required approximately 12 s, with the inflating and approaching, swallowing, deflating, snap-off, descending, placing, and homing stages taking 2 s, 2 s, 2 s, 1 s, 2 s, 2 s, and 1 s, respectively. When the gripper approached beneath the strawberry picking point, the relative error was estimated, and if it exceeded the threshold, a compensation action was triggered to align the gripper with the strawberry, adding 1 s to the cycle, as shown in Table 4, #Case 1.

During the deflating stage, MobileNet V3-Small was executed to monitor whether a strawberry was present in the gripper. In accordance with the time-stability rule, the system required three consecutive signals before issuing the abort command. Upon receiving the abort command, the robotic arm immediately aborted the current process and skipped the snap-off and placing stages. The shortest time to receive the abort signal during the deflating stage was 0.11 s, resulting in a time reduction of 1.89 s, as shown in Table 4, #Case 2. Overall, although the compensation action slightly increased the cycle time, the continuous harvesting efficiency was significantly improved.

Table 4: Harvesting time and time reduction via early abort of the robot arm (unit: s). Empty cells denote stages that were skipped or not applicable; the time reductions in the final row refer to the monitored stage.

| Action | Original Time | #Case 1: Relative Error Compensation | #Case 2: Grasping failure | #Case 3: Slipped |
|---|---|---|---|---|
| Inflating and approaching | 2 | 2 | 2 | 2 |
| Compensation | | 1 | | |
| Swallowing | 2 | 2 | 2 | 2 |
| Deflating | 2 | 2 | 0.11 | 2 |
| Snap-off | 1 | 1 | | 0.7 |
| Descending | 2 | 2 | 2 | 2 |
| Placing | 2 | 2 | | |
| Homing | 1 | 1 | 1 | 1 |
| Time Reduction | | -1 | 1.89 | 0.3 |

Table 5: LSTM classifier results on the validation set of SnapData, where label 0 indicated that the strawberry remained in the gripper and label 1 indicated that it had slipped.

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| 0 | 0.9294 | 1.0000 | 0.9634 |
| 1 | 1.0000 | 0.9118 | 0.9538 |

For the snap-off stage, the results of the LSTM classifier on the validation set of SnapData are presented in Table 5, where class 0 and class 1 represented strawberries remaining in the gripper and strawberries that had slipped, respectively. As shown in Table 5, the LSTM classifier achieved high accuracy on SnapData. For class 0, all strawberries in the gripper were correctly detected (recall = 1.0000), with a few false positives (precision = 0.9294). For class 1, all predicted slips were correct (precision = 1.0000), though a small fraction were missed (recall = 0.9118). These results indicated that the LSTM classifier provided reliable slip prediction with minimal false alarms, supporting its use in early abort strategies.

Classes 0 and 1 were classified using a maximum threshold. It was noted that class 0 encompassed two strawberry states (stable and slipping), so an adaptive minimum threshold was introduced and applied to further distinguish between them. During the snap-off stage, operations were triggered based on the strawberry status in the gripper, with the time-stability rule applied to prevent false positives. When the early abort condition was met for three consecutive checks, the placing stage was skipped, and homing was executed immediately. In randomized tests, the minimum response time for receiving an abort signal was 0.7 s, reducing the overall cycle time by 0.3 s, as presented in Table 4, #Case 3.
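Both monitored stages reduce to the same decision pattern: a per-frame signal (grasp absence from the MobileNet V3-Small classifier during deflating, or the LSTM slip probability during snap-off) gated by the time-stability rule of three consecutive agreeing checks. A minimal sketch follows; the classifier interfaces, the 0/1 label convention, and the 0.5 maximum threshold are assumptions for illustration.

```python
# Sketch of the early abort logic with the time-stability rule (three
# consecutive signals before acting). Interfaces and thresholds are assumed.
CONSECUTIVE_REQUIRED = 3  # time-stability rule from the text
MAX_THRESHOLD = 0.5       # assumed slip-probability (maximum) threshold

class TimeStabilityGate:
    """Fire only after N consecutive positive signals, to suppress noise."""
    def __init__(self, n: int = CONSECUTIVE_REQUIRED):
        self.n, self.count = n, 0

    def update(self, signal: bool) -> bool:
        self.count = self.count + 1 if signal else 0
        return self.count >= self.n

def monitor_deflating(grasp_classifier, frames) -> bool:
    """Return True to abort the cycle (no strawberry in the gripper)."""
    gate = TimeStabilityGate()
    for frame in frames:                      # frames from the in-gripper camera
        empty = grasp_classifier(frame) == 0  # assumed label 0: no strawberry
        if gate.update(empty):
            return True                       # early abort signal
    return False

def monitor_snap_off(slip_probs) -> bool:
    """Return True if slippage is predicted during snap-off."""
    gate = TimeStabilityGate()
    for p in slip_probs:                      # per-frame LSTM slip probabilities
        if gate.update(p >= MAX_THRESHOLD):
            return True                       # trigger early abort / correction
    return False

# Example: three consecutive probabilities above the threshold trigger abort.
print(monitor_snap_off([0.20, 0.41, 0.45, 0.51, 0.62, 0.70]))  # -> True
```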
Visualization results of the LSTM classifier on the validation set of SnapData are presented in Fig. 11. The left panel displayed frames 8-16 with the prediction probability and corresponding strawberry statuses, while the right panel illustrated the LSTM prediction probabilities. The horizontal axis represented the frame index starting from 0, and the vertical axis indicated the predicted probability of strawberry slippage within the next three frames. The red dashed line marked the maximum threshold, while the blue curve with asterisks showed the predicted probabilities. Red five-pointed stars denoted instances where the strawberry had slipped from the end-effector, with the first star indicating the frame where the strawberry slipped for the first time. An adaptive minimum threshold, illustrated by a pink dashed line, was employed to determine whether a secondary picking attempt should be executed or the subsequent actions should continue. If the total number of frames did not reach 10, fewer than 10 frames were displayed as labels in the top-left corner of the images. The strawberry began slipping between frames 9 and 11, with corresponding probabilities of 0.406, 0.446, and 0.502; from frame 13 onward, the strawberry was considered slipped.

Fig. 11: LSTM classifier results. (Left) Video frames 8-16; (Right) Predicted probabilities from the LSTM model.

Based on the validation results on SnapData, the LSTM classifier was able to send early-abort signals in a timely manner, enabling efficient and stable operation. Tests of the early abort strategy for empty grasps during deflating and slip prediction during snap-off were conducted on HarvestFlex. However, experiments involving inflating and deflating to re-grasp were not performed or deployed on HarvestFlex due to limitations of the strawberry growing season.

5 Discussion

Vision-based fault diagnosis and self-recovery offered an effective means of enhancing the stability of strawberry-harvesting robots. During the inflating and approaching stage, the positional relationship between the gripper and the harvesting point served as a visual indicator of accumulated errors caused by fruit recognition, localization, hand-eye calibration, and inverse kinematics. Without requiring an additional camera, relative errors based on simultaneous target-gripper detection were computed to align the strawberry with the end-effector. For early abort feedback during the deflating and snap-off stages, a micro-optical camera was embedded in the end-effector to detect and predict the probability of strawberry slippage in the gripper. This approach compensated for the absence of force feedback in the flexible pneumatic gripper, whose deformation during inflation and deflation rendered conventional force sensing impractical. However, a limitation of slip-prediction-based early abort was that the material properties of artificial strawberries differed from those of real ones, particularly in weight and texture. Consequently, the secondary inflating-deflating regrasp experiment was evaluated only on the SnapData test set and was not conducted on HarvestFlex, owing to the material constraints of artificial strawberries. In this study, artificial strawberries were used primarily to evaluate relative error compensation and the early abort strategy; because of the limitations of strawberry growth conditions, these experiments did not fully account for biological characteristics such as fruit damage. Mechanical failures, inverse kinematics errors, and other control-related faults were not addressed in this work.

Additionally, self-learning and adaptive evolutionary perception, planning, and decision-making can be applied to reduce dataset dependency and improve the generalization and robustness of the strawberry-harvesting robot. With the rapid development of embodied artificial intelligence and robotic agents, a new wave of end-to-end, multi-modal, large-model-based perception, self-planning, and self-decision frameworks is emerging in the robotics domain. These advances offer valuable insights for building strawberry harvesting robots and enabling collaborative operations among multiple robots. In the future, developing end-to-end active learning and continuously evolving methods is expected to become a major research trend.

6 Conclusion

This paper proposed a vision-based fault diagnosis and self-recovery system to address gripper offset and strawberry slippage during harvesting. An end-to-end multi-task perception network with a shared-weight backbone and neck was developed to reduce computational overhead. Without the need for an additional external camera, the relative error between the target and the gripper was estimated through simultaneous detection, enabling compensation actions to mitigate cumulative errors. To monitor the strawberry's status in the gripper, a miniature optical camera was embedded at its base. A MobileNet V3-Small classifier was adapted to detect strawberries within the gripper and trigger an early abort signal when necessary during the deflating stage, based on GraspData. For strawberry slippage during the snap-off stage, a novel dataset, SnapData, comprising strawberry and background classes was introduced, and the perception network was extended and fine-tuned to segment strawberries, the gripper, and the background. Width and height features were combined with a time-series LSTM classifier to predict strawberry slippage from the gripper, enabling early-abort picking actions and improving harvesting efficiency. Experimental results demonstrated that the perception network achieved high accuracy, with a mean absolute error of 0.035 for ripeness estimation. The compensation mechanism reduced relative errors to 3.12 mm and 4.11 mm, while the early-abort strategy shortened execution times by 1.89 s and 0.3 s during the deflating and snap-off stages, respectively. Overall, the proposed system effectively enhanced the robustness and efficiency of strawberry harvesting, providing a practical solution for reliable autonomous fruit picking.

Declarations

Funding. This work was supported by the Haidian District Bureau of Agriculture and Rural Affairs, the Beijing Academy of Agriculture and Forestry Sciences (BAAFS) Innovation Ability Project (KJCX20240321), the Outstanding Youth Foundation of BAAFS (YKPY2025007), the BAAFS Talent Recruitment Program, and the National Natural Science Foundation of China (NSFC) Excellent Young Scientists Fund (Overseas).

Conflict of interest. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability. Data will be released.

Code availability. The code will be released.

Author contribution. Meili Sun: Methodology, Software, Validation, Writing - original draft. Chunjiang Zhao: Investigation, Funding acquisition, Writing - review & editing. Lichao Yang: Data curation, Methodology, Software, Visualization. Hao Liu: Data curation, Methodology, Software. Shimin Hu: Data curation, Software, Visualization. Ya Xiong: Conceptualization, Methodology, Investigation, Funding acquisition, Writing - review & editing.

References

[1] Hua, W., Zhang, Z., Zhang, W., Liu, X., Hu, C., He, Y., Mhamed, M., Li, X., Dong, H., Saha, C.K., et al.: Key technologies in apple harvesting robot for standardized orchards: A comprehensive review of innovations, challenges, and future directions. Computers and Electronics in Agriculture 235, 110343 (2025)
[2] Xiao, X., Wang, Y., Jiang, Y.: Review of research advances in fruit and vegetable harvesting robots. Journal of Electrical Engineering & Technology 19(1), 773-789 (2024)
[3] Hou, C., Xu, J., Tang, Y., Jiajun, Z., Tan, Z., Chen, W., Wei, S., Huang, H., Fang, M.: Detection and localization of citrus picking points based on binocular vision. Precision Agriculture 25(5), 2321-2355 (2024)
[4] Milecki, A., Nowak, P.: Review of fault-tolerant control systems used in robotic manipulators. Applied Sciences 13(4), 2675 (2023)
[5] Quamar, M.M., Nasir, A.: Review on fault diagnosis and fault-tolerant control scheme for robotic manipulators: Recent advances in AI, machine learning, and digital twin. arXiv preprint arXiv:2402.02980 (2024)
[6] Rajendran, V., Debnath, B., Mghames, S., Mandil, W., Parsa, S., Parsons, S., Ghalamzan-E, A.: Towards autonomous selective harvesting: A review of robot perception, robot design, motion planning and control. Journal of Field Robotics 41(7), 2247-2279 (2024)
[7] Wang, H., Lao, L., Zhang, H., Tang, Z., Qian, P., He, Q.: Structural fault detection and diagnosis for combine harvesters: A critical review. Sensors 25(13), 3851 (2025)
[8] Rapalo, A.O.V., Cortés, C.A.B., Ordoñez Ávila, J.L.: Design and analysis of a pneumatic end-effector with an air contraction clamp for soft robotics. In: 2024 9th International Conference on Control and Robotics Engineering (ICCRE), pp. 98-103 (2024). IEEE
[9] Khalastchi, E., Kalech, M.: Fault detection and diagnosis in multi-robot systems: A survey. Sensors 19(18), 4019 (2019)
[10] Nejati, M., Penhall, N., Williams, H., Bell, J., Lim, J., Ahn, H.S., MacDonald, B.: Kiwifruit detection in challenging conditions. arXiv preprint arXiv:2006.11729 (2020)
[11] Shi, X., Wang, S., Zhang, B., Ding, X., Qi, P., Qu, H., Li, N., Wu, J., Yang, H.: Advances in object detection and localization techniques for fruit harvesting robots. Agronomy 15(1), 145 (2025)
[12] Tantan, J., Xiongzhe, H., Wang, P., Lyu, Y., Eunha, C., Haetnim, J., Lirong, X.: Performance evaluation of robotic harvester with integrated real-time perception and path planning for dwarf hedge-planted apple orchard. Agriculture 15(15), 1593 (2025)
[13] Kamat, P., Gite, S., Chandekar, H., Dlima, L., Pradhan, B.: Multi-class fruit ripeness detection using YOLO and SSD object detection models. Discover Applied Sciences 7(9), 931 (2025)
[14] Sabry, A.H., Amirulddin, U.A.B.U.: A review on fault detection and diagnosis of industrial robots and multi-axis machines. Results in Engineering 23, 102397 (2024)
[15] Zhang, C., Mousavi, A.A., Masri, S.F., Gholipour, G., Yan, K., Li, X.: Vibration feature extraction using signal processing techniques for structural health monitoring: A review. Mechanical Systems and Signal Processing 177, 109175 (2022)
[16] Vukadinovic, M., Reiterer, B., Rathmair, M., Schuetz, C.G.: Anomaly detection in robot applications: Comparison of rule-based and machine learning methods. In: 2024 9th International Conference on Control, Robotics and Cybernetics (CRC), pp. 1-5 (2024). IEEE
[17] Hasan, A., Tahavori, M., Midtiby, H.S.: Model-based fault diagnosis algorithms for robotic systems. IEEE Access 11, 2250-2258 (2023)
[18] Zhou, H., Ahmed, A., Liu, T., Romeo, M., Beh, T., Pan, Y., Kang, H., Chen, C.: Finger vision enabled real-time defect detection in robotic harvesting. Computers and Electronics in Agriculture 234, 110222 (2025)
[19] Hwang, C.-L.: Cooperation of robot manipulators with motion constraint by real-time RNN-based finite-time fault-tolerant control. Neurocomputing 556, 126694 (2023)
[20] Sun, M., Hu, S., Zhao, C., Xiong, Y.: Light-resilient visual regression of strawberry ripeness for robotic harvesting. Computers and Electronics in Agriculture 241, 111169 (2026). https://doi.org/10.1016/j.compag.2025.111169
[21] Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L.-C., Tan, M., Chu, G., Vasudevan, V., Zhu, Y., Pang, R., Adam, H., Le, Q.: Searching for MobileNetV3. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1314-1324 (2019). https://doi.org/10.1109/ICCV.2019.00140
[22] Chen, Y., Miao, Z., Ge, Y., Lin, S., Chen, L., Xiong, Y.: Design and control of a novel six-degree-of-freedom hybrid robotic arm. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3597-3604 (2024). IEEE
[23] Torralba, A., Russell, B.C., Yuen, J.: LabelMe: Online image annotation and applications. Proceedings of the IEEE 98(8), 1467-1484 (2010)
[24] Ge, Y., Xiong, Y., From, P.J.: Three-dimensional location methods for the vision system of strawberry-harvesting robots: development and comparison. Precision Agriculture 24(2), 764-782 (2023)
[25] Zou, Z., Chen, K., Shi, Z., Guo, Y., Ye, J.: Object detection in 20 years: A survey. Proceedings of the IEEE 111(3), 257-276 (2023). https://doi.org/10.1109/JPROC.2023.3238524
