Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots


📝 Abstract

Strawberry harvesting robots faced persistent challenges such as low integration of visual perception, fruit-gripper misalignment, empty grasping, and strawberry slippage from the gripper due to insufficient gripping force, all of which compromised harvesting stability and efficiency in orchard environments. To overcome these issues, this paper proposed a visual fault diagnosis and self-recovery framework that integrated multi-task perception with corrective control strategies. At the core of this framework was SRR-Net, an end-to-end multi-task perception model that simultaneously performed strawberry detection, segmentation, and ripeness estimation, thereby unifying visual perception with fault diagnosis. Based on this integrated perception, a relative error compensation method based on simultaneous target-gripper detection was designed to address positional misalignment, correcting deviations when the error exceeded the tolerance threshold. To mitigate empty grasping and fruit-slippage faults, an early abort strategy was implemented. A micro-optical camera embedded in the end-effector provided real-time visual feedback, enabling grasp detection during the deflating stage and strawberry slip prediction during snap-off through a MobileNet V3-Small classifier and a time-series LSTM classifier. Experiments demonstrated that SRR-Net maintained high perception accuracy. For detection, it achieved a precision of 0.895 and recall of 0.813 on strawberries, and 0.972/0.958 on hands. In segmentation, it yielded a precision of 0.887 and recall of 0.747 for strawberries, and 0.974/0.947 for hands. For ripeness estimation, SRR-Net attained a mean absolute error of 0.035, while simultaneously supporting multi-task perception and sustaining a competitive inference speed of 163.35 FPS. The compensation method reduced physical relative errors from 11.52 mm and 5.15 mm to 3.12 mm and 4.11 mm, below the tolerance threshold, while the early-abort strategy shortened execution time by 1.89 s and 0.3 s during the deflating and snap-off stages. Overall, the proposed framework enhanced both the robustness and efficiency of strawberry harvesting robots, providing a practical and reliable solution for autonomous fruit harvesting.


📄 Full Content

Meili Sun1, Chunjiang Zhao2,1*, Lichao Yang2, Hao Liu2, Shimin Hu2, Ya Xiong2,1*

1College of Mechanical and Electrical Engineering, Shihezi University, Xinjiang, 832003, China. 2Intelligent Equipment Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing, 100097, China. *Corresponding author(s). E-mail(s): zhaocj@nercita.org.cn; yaxiong@nercita.org.cn

Keywords: visual fault perception, early diagnosis, relative error compensation, early abort strategy, harvesting robot

1 Introduction

Harvesting robots [1] [2] [3] have made significant progress in recent years, showing great potential to reduce labor dependence and improve agricultural productivity. However, several mechanical, electrical, and control faults [4] [5] [6] still arise during robotic operations, compromising operational stability and continuity. For example, mechanical structural failures [7], air leakage in pneumatic end-effectors [8], and fractures in end-joint connectors can impair the normal operation of strawberry harvesting robots.
Moreover, the absence of active learning and self-updating mechanisms renders models ineffective in adapting to changing conditions. Failure to respond to abnormal signals also poses operational and safety risks to the robot. Once any single component fails or produces an erroneous output, the entire harvesting process is interrupted. Due to the lack of robust fault diagnosis and self-recovery mechanisms, harvesting robots remain prone to frequent work interruptions, which shortens their effective operation time in the field.

Generally, fruit harvesting with robots involves seven key steps: image acquisition; fruit detection, segmentation, and ripeness estimation; instance tracking and localization; motion planning; execution; result evaluation with fault diagnosis; and result recording with optimization. Along this pipeline, potential faults may occur in multiple stages [9]. On the perception side, problems such as blurred or occluded cameras, unstable illumination [10], network/data loss, inaccurate segmentation [11] [12], ripeness misclassification [13], ID tracking drift, and depth or calibration errors can lead to detection failures. At the motion control level, incorrect path planning, collision avoidance errors, inverse kinematics failures, or grasp misalignment may disrupt harvesting. Even after successful contact, execution faults such as loose grasps, failed detachments, fruit damage, or gripper jamming remain common. These are further compounded by inaccurate success/failure judgment and inadequate fault recovery, resulting in prolonged downtime. Without robust fault diagnosis and self-recovery mechanisms, current robots are prone to frequent interruptions, severely limiting their effective operating time in orchards.

In the HarvestFlex system (https://xiong-lab.cn/), as shown in Fig. 1 (a) and (b), a RealSense D455F depth camera (Intel, RealSense, USA) captured real-time RGB-D images, which were processed by YOLOv8/11 for strawberry detection and segmentation. A checkerboard-based calibration enabled eye-to-hand coordinate transformation, while polynomial regression compensated spatial errors. For sequence planning, a minimal-height sorting algorithm determined the harvesting order, and a point-to-point speed-pattern-based method ensured efficient movement of the end-effector. The end-effector was a flexible pneumatic gripper that inflated and deflated to gently envelop and detach strawberries. The complete harvesting workflow consisted of six steps: inflating and approaching, swallowing, deflating, snap-off, descending, and placing, as shown in Fig. 2.

Fig. 1: Image collection devices and environment. (a) HarvestFlex robot with a shading cover; (b) HarvestFlex robot without a shading cover; (c) table-top strawberries.

Fig. 2: The picking process of the strawberry harvesting robot: inflating and approaching, swallowing, deflating, snap-off, descending, and placing.

However, several issues emerged with our HarvestFlex robot that severely compromised its ability to harvest stably and continuously. The primary challenges were: (1) Low integration of visual perception tasks: the limited fusion of detection, segmentation, and ripeness estimation led to increased model complexity and reduced inference speed, thereby constraining overall system efficiency. (2) Positional inaccuracy between the end-effector and the target fruit: although the eye-to-hand approach offered a stable field of view for dynamic fruit tracking, it remained highly sensitive to hand-eye calibration errors.
Even minor deviations could lead to misalignment, empty grasps, unsuccessful detachments, or incorrect placements, thereby disrupting continuous operation. (3) Inadequate gripping and strawberry slippage: although the pneumatic soft gripper reduced fruit damage, its high compliance made the integration of reliable tactile or slippage sensors challenging. Consequently, slippage often went undetected, resulting in wasted cycle time and, in some cases, unintended fruit release during the snap-off stage. This problem was further compounded by the immaturity of current tactile sensing technologies, whose solutions remained expensive, unstable, and impractical for large-scale deployment in orchard environments.

To mitigate such problems, fault diagnosis and recovery methods were generally classified into two categories, traditional approaches and deep learning-based techniques, addressing electrical, software and control, process-related, and human-error faults [14]. Traditionally, various methods were proposed to address robotic faults, such as signal processing techniques [15], rule-based systems [16], and model-based approaches [17]. However, these traditional methods faced limitations in handling complex fault scenarios, particularly in the case of rule-based and threshold-based approaches. With advances in computer hardware, deep learning emerged as a new paradigm for fault diagnosis, enabling automatic feature extraction from multi-modal data such as sensor signals, vibrations, acoustics, and images. CNNs were applied to surface defect detection [18], while RNNs and LSTMs were used to analyze time-series data for early fault detection [19]. However, these methods often relied on additional sensors to acquire high-quality multi-modal data, which not only increased system cost and complexity but also complicated deployment in real-world agricultural environments. Furthermore, ensuring temporal and spatial alignment across heterogeneous data sources (e.g., vision, vibration, and tactile signals) remained a nontrivial challenge, as even slight misalignments could degrade fault diagnosis accuracy.

To overcome these limitations, this paper focused on a vision-based fault diagnosis framework for HarvestFlex, thereby avoiding the need for additional multi-modal sensors and the complexities of data alignment. The objective was to address software- and control-related faults that compromised harvesting stability and efficiency. To this end, an end-to-end multi-task perception method, SRR-Net [20], was introduced to integrate visual perception with fault-related object detection while maintaining a lightweight model architecture and low computational load. The method comprised three primary subtasks: object detection, instance segmentation, and ripeness estimation. To align the gripper with the harvesting point, a relative error compensation method based on simultaneous target-gripper detection was implemented in the coordinate frame of the robot arm once the gripper reached the position beneath the target. A corrected harvesting point was then generated, guiding the gripper to move beneath the target a second time to ensure accurate alignment. Furthermore, an early abort strategy was introduced to improve the reliability of the harvesting process. Specifically, a micro-optical camera embedded at the base of the gripper continuously monitored the presence and stability of strawberries. In the deflating stage, a MobileNet V3-Small classifier [21] verified whether the fruit had been successfully grasped, while in the snap-off stage, an LSTM classifier estimated the probability of strawberry slippage from the gripper. If no strawberry was detected, an early abort signal was triggered to terminate the current cycle. This integrated perception-action framework enabled prompt detection of empty grasps or slippage, allowing the robotic arm to respond rapidly and maintain stable operation.
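As a concrete illustration of the multi-task design described above, a shared backbone and neck feeding detection, segmentation, and ripeness branches, the sketch below shows one possible layout in PyTorch. All module names, layer sizes, and head structures are illustrative assumptions rather than the released SRR-Net; only the three-subtask layout and the [0, 1.1] ripeness range come from the text.

```python
# Illustrative shared-backbone multi-task sketch in the spirit of SRR-Net.
# Shapes, channel counts, and head designs are assumptions, not SRR-Net itself.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Stand-in feature extractor shared by all three task heads."""
    def __init__(self, out_ch: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_ch, 3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, x):
        return self.features(x)

class MultiTaskPerception(nn.Module):
    """One forward pass yields detection, segmentation, and ripeness outputs."""
    def __init__(self, num_classes: int = 3):  # strawberry / hand / table
        super().__init__()
        self.backbone = SharedBackbone()
        self.det_head = nn.Conv2d(128, num_classes + 4, 1)  # class logits + box
        self.seg_head = nn.Conv2d(128, num_classes, 1)      # per-class mask logits
        self.ripeness_head = nn.Sequential(                 # scalar ripeness score
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.backbone(x)  # features computed once, reused by every head
        return {
            "det": self.det_head(f),
            "seg": self.seg_head(f),
            "ripeness": 1.1 * self.ripeness_head(f),  # rescaled to [0, 1.1]
        }

if __name__ == "__main__":
    out = MultiTaskPerception()(torch.randn(1, 3, 480, 640))
    print({k: tuple(v.shape) for k, v in out.items()})
```

Computing the backbone features once and reusing them in every head is what removes the cost of running three separate models, which is the integration benefit the framework targets.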
The main contributions of this paper were as follows.
• An end-to-end multi-task perception framework, SRR-Net, was introduced to integrate visual perception with fault diagnosis.
• A relative error compensation method based on simultaneous target-gripper detection was developed to realign the position of the end-effector when approaching the area beneath the picking point.
• To enhance operational reliability, an early abort strategy was implemented. During the deflating stage, MobileNet V3-Small was used to classify whether the strawberry had been successfully grasped. Subsequently, in the snap-off stage, an LSTM classifier predicted the likelihood of fruit slippage from the gripper, enabling timely corrective actions.

2 Dataset Benchmark

In this paper, FaultData, GraspData, and SnapData were constructed to support the multi-task vision perception task, the grasp adjustment task, and the time-series LSTM classification task, respectively. All data were collected using the HarvestFlex strawberry-harvesting robot, which includes two robotic arms [22], two RealSense D455F cameras, and two controllable light sources, as shown in Fig. 1(a-b). The cameras captured RGB-D images at a resolution of 640×480 pixels, with the distance between the lens and the strawberries ranging from 30 to 90 cm. Data were collected in both natural and controlled lighting environments to enhance the robot's visual adaptability to complex and dynamic orchard conditions.

FaultData. For the tasks of detection, segmentation, and ripeness estimation of strawberries and the gripper, images covering diverse strawberry formations (isolated, overlapping, and occluded fruits) as well as the complete operational sequence of the end-effector during picking were collected. To ensure robust and continuous multi-task vision perception, the dataset also incorporated a variety of lighting conditions, weather scenarios, and nighttime environments, laying the foundation for stable, autonomous, all-weather performance. In addition, high-resolution images were captured using a Redmi Note 13 Pro smartphone and an Orbbec Gemini Pro RGB-D camera (Orbbec, Gemini Pro, China), providing enhanced texture and color information. The entire dataset was collected at the Cuihu Factory in Beijing, China, and featured the Fragaria × ananassa 'Kaorino' cultivar, as shown in Fig. 1(c). All images were annotated with the polygonal outlines of strawberries and the end-effector using Labelme [23]. Each object instance was labeled as strawberry, hand, or table, with the ripeness of strawberries assigned based on [20]. The ripeness score was defined within the range [0, 1.1]. The label format for each instance followed a fixed structure in which cls represented the object class. The dataset was split into training and validation subsets, consisting of 2954 and 779 images, respectively.
Note that the ripeness attribute applied only to strawberries. To ensure a consistent label format, the table and hand classes were assigned alignment flags of 2 and 3, respectively.

GraspData. Using a miniature camera, image data were collected during the deflating stage of the end-effector. A total of 594 images were acquired. In the training set, 197 images contained strawberries and 202 did not; in the validation set, the corresponding counts were 93 and 102.

Fig. 3: Harvesting fault and recovery in HarvestFlex.

SnapData. For the time-series prediction task during the strawberry snap-off stage, all images were captured using a micro-optical camera mounted on the end-effector, with artificial strawberry instances as targets. In each frame, key visual features, including the normalized strawberry area, normalized gripper area, normalized background area, and the width and height of the strawberry, were extracted using a fine-tuned SRR-Net. These features were structured per frame as a five-element vector: normalized strawberry area, normalized gripper area, normalized background area, strawberry width, and strawberry height.
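To make the data layout concrete, the sketch below assembles one SnapData-style feature vector per frame and buffers a short window of frames as the LSTM input sequence. The feature order follows the list above; the 10-frame window (suggested by the frame-count remark in Section 4.5) and all function names are assumptions for illustration.

```python
# Sketch of per-frame SnapData features and a sliding window for the LSTM.
# Helper names and the window length are illustrative assumptions.
from collections import deque
from typing import Optional
import numpy as np

FEATURE_DIM = 5  # s_area, g_area, bg_area, width, height (one frame)

def frame_features(masks: dict, box_wh: tuple, img_wh: tuple) -> np.ndarray:
    """Build one feature vector from binary masks and the strawberry box."""
    total = img_wh[0] * img_wh[1]
    w, h = box_wh
    return np.array([
        masks["strawberry"].sum() / total,  # normalized strawberry area
        masks["gripper"].sum() / total,     # normalized gripper area
        masks["background"].sum() / total,  # normalized background area
        w / img_wh[0],                      # strawberry width (normalized)
        h / img_wh[1],                      # strawberry height (normalized)
    ], dtype=np.float32)

class SlidingWindow:
    """Keep the most recent frames as the LSTM input sequence."""
    def __init__(self, length: int = 10):  # assumed window length
        self.buf = deque(maxlen=length)

    def push(self, feat: np.ndarray) -> Optional[np.ndarray]:
        self.buf.append(feat)
        if len(self.buf) < self.buf.maxlen:
            return None                    # not enough history yet
        return np.stack(self.buf)          # shape: (length, FEATURE_DIM)
```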

Table: Detection, segmentation, and ripeness estimation results of YOLOv11, YOLOv11-seg, and SRR-Net on FaultData, and of the fine-tuned SRR-Net on SnapData (P: precision; R: recall; MAE: ripeness mean absolute error; FPS: inference speed). "…" marks values lost in extraction; "-" marks not applicable.

| Dataset | Model | Class | Box P | Box R | Box mAP50 | Box mAP50-95 | Mask P | Mask R | Mask mAP50 | Mask mAP50-95 | MAE | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FaultData | YOLOv11 | strawberry | … | … | … | … | - | - | - | - | - | 201.10 |
| | | hand | 0.966 | 0.957 | 0.977 | 0.793 | - | - | - | - | - | |
| | YOLOv11-seg | strawberry | 0.892 | 0.812 | 0.884 | 0.663 | 0.864 | 0.753 | 0.824 | 0.442 | - | 167.89 |
| | | hand | 0.968 | 0.957 | 0.976 | 0.791 | 0.971 | 0.95 | 0.975 | 0.653 | - | |
| | SRR-Net | strawberry | 0.895 | 0.813 | 0.884 | 0.633 | 0.887 | 0.747 | 0.829 | 0.448 | 0.035 | 163.35 |
| | | hand | 0.972 | 0.958 | 0.977 | 0.788 | 0.974 | 0.947 | 0.964 | 0.655 | - | |
| SnapData | SRR-Net | strawberry | 0.999 | 0.992 | 0.995 | 0.978 | 0.999 | 0.992 | 0.995 | 0.981 | - | - |
| | | background | 0.881 | 0.922 | 0.941 | 0.807 | 0.888 | 0.93 | 0.946 | 0.824 | - | - |

To more intuitively observe the performance of SRR-Net, the visualization results of SRR-Net, YOLOv11, and YOLOv11-seg on the FaultData dataset are presented in Fig. 10. In Fig. 10 (a) and (b), strawberries and end-effectors were represented with bounding boxes, with the target class label and confidence score displayed above each box. The confidence threshold was set to 0.5. For the same strawberry instance, SRR-Net produced higher confidence scores than YOLOv11 and YOLOv11-seg. In terms of ripeness, the values estimated by SRR-Net closely matched the true ripeness of each strawberry. Overall, the visual comparisons illustrated that SRR-Net delivered superior performance compared with YOLOv11 and YOLOv11-seg.

Fig. 10: Visualization results. (a) YOLOv11; (b) YOLOv11-seg; (c) SRR-Net.

4.4 Evaluation of Relative Error Compensation

To validate the effectiveness of relative error compensation, the 3D coordinates of the strawberry picking point (xs, ys, zs), the gripper position beneath the picking point (xe, ye, ze), the relative error before compensation (Δx, Δy), the ground-truth error before compensation (Δxw, Δyw), the compensated picking point (xce, yce, zce), and the relative error after compensation (Ex, Ey) in the robot arm coordinate system were measured, as reported in Table 3. Due to the requirement of snap-off, the rise distance in the swallowing stage exceeded the height of the strawberry. Therefore, to simplify the experiments and reduce computational overhead, the relative error along the z-axis was not considered.

From Table 3, the visual and ground-truth errors before compensation were compared, and the ground-truth errors after compensation were recorded. For example, in the first row, the relative errors on the x- and y-axes were 22 mm and -4 mm, while the corresponding ground-truth errors were 17.3 mm and 4.2 mm. The absolute differences between the visual and ground-truth errors were therefore 4.7 mm and 8.2 mm. After compensation, the physical errors on the x- and y-axes were reduced to 1.5 mm and 6.3 mm, both below the defined threshold.

Before compensation, the mean relative errors were 14.07 mm on the x-axis and 8.64 mm on the y-axis, while the mean physical errors were 11.52 mm and 5.15 mm, respectively, indicating that visual estimation tended to overestimate the actual physical errors. With the proposed compensation method, the mean physical errors between the strawberry picking point and the gripper were further reduced to 3.12 mm (x-axis) and 4.11 mm (y-axis). The slightly larger residual y-axis error was likely due to the difficulty of achieving high-precision motor control over very small movement ranges. Despite this, the relative error compensation method demonstrated a clear advantage in aligning the gripper with the target and in reducing grasping errors.

Table 3: Relative errors between the end-effector and the picking point in the arm coordinate system (unit: mm). The final row shows the computed mean absolute errors; "…" marks cells lost in extraction.

| xs | ys | zs | xe | ye | ze | Δx | Δy | Δxw | Δyw | xce | yce | zce | Ex | Ey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 709 | 221 | 706 | 686 | 225 | 647 | 22 | -4 | 17.3 | 4.2 | 732 | 219 | 706 | 1.5 | 6.3 |
| 464 | 232 | 710 | 450 | 249 | 652 | 14 | -17 | 8.3 | -12.7 | 479 | 223 | 710 | 5.0 | -8.2 |
| 377 | 245 | 693 | 362 | 255 | 626 | 15 | -10 | 7.2 | -2.5 | 392 | 239 | 693 | 2.6 | 3.6 |
| 700 | 222 | 699 | 681 | 224 | 638 | 18 | -2 | 15.7 | 3.4 | 719 | 221 | 699 | 0 | 2.8 |
| 532 | 234 | 710 | 518 | 245 | 649 | 14 | -11 | 10.6 | -3.6 | 546 | 228 | 710 | 2.7 | -2.9 |
| 711 | 244 | 711 | 692 | 249 | 651 | 18 | -5 | 9.2 | -5.1 | 730 | 241 | 711 | 0 | 2.2 |
| 320 | 244 | 692 | 316 | 256 | 628 | 4 | -11 | 3.6 | -2.6 | 324 | 239 | 692 | 0 | 2.1 |
| 468 | 236 | 703 | 460 | 245 | 653 | 8 | -8 | 12.8 | -1.3 | … | … | … | … | … |
| 652 | 235 | 712 | 631 | 239 | 651 | 20 | -3 | 20.4 | 6.2 | 673 | 233 | 712 | -4.4 | -2 |
| 816 | 220 | 699 | 793 | 223 | 642 | 23 | -2 | 17.4 | 6.8 | 839 | 219 | 699 | -6.4 | 5.1 |
| 393 | 238 | 706 | 386 | 245 | 643 | 7 | -7 | 6.9 | 3.0 | … | … | … | … | … |
| 734 | 222 | 699 | 712 | 224 | 638 | 22 | -1 | 11.8 | 5.0 | 757 | 221 | 699 | -5.3 | 3.9 |
| 445 | 230 | 711 | 436 | 249 | 646 | 8 | -19 | 9.9 | -9.5 | 454 | 220 | 711 | -4.3 | 6.2 |
| 629 | 234 | 721 | 611 | 244 | 659 | 17 | -10 | 17.5 | -2.6 | 647 | 228 | 721 | 5.9 | 6.3 |
| 299 | 246 | 693 | 295 | 259 | 631 | 4 | -12 | 10 | -7.9 | 304 | 240 | 693 | -2.4 | 1.2 |
| 453 | 233 | 701 | 443 | 251 | 637 | 10.6 | -18.5 | 10.4 | -7.5 | 464 | 224 | 701 | 7.1 | -1.2 |
| 752 | 213 | 708 | 729 | 218 | 645 | 22 | -4 | 13.4 | 6.9 | 774 | 221 | 708 | 0 | 4.3 |
| 667 | 215 | 708 | 645 | 222 | 645 | 22 | -7 | 14.2 | 0 | 690 | 211 | 708 | -3.8 | -6.0 |
| 307 | 246 | 693 | 305 | 252 | 633 | 1.71 | -6.3 | 5.7 | 0 | … | … | … | … | … |
| 467 | 232 | 700 | 456 | 248 | 634 | 11 | -15 | 8.1 | -12.1 | 479 | 224 | 700 | -1.6 | 5.6 |
| Mean | | | | | | 14.07 | 8.64 | 11.52 | 5.15 | | | | 3.12 | 4.11 |
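A minimal sketch of this compensation step is given below, assuming a 5 mm tolerance (the paper states a threshold but not its value) and a simple additive correction rule; the function name and sign conventions are illustrative, not the exact controller used on HarvestFlex.

```python
# Sketch of relative error compensation: once the gripper is beneath the
# target, the picking point and the gripper are detected simultaneously, the
# x/y error is computed in the robot arm frame, and a corrected picking point
# is issued for a second approach if the error exceeds the tolerance.
TOLERANCE_MM = 5.0  # assumed tolerance threshold (not stated in the text)

def compensate(picking_pt, gripper_pt, tol=TOLERANCE_MM):
    """Return a corrected (x, y, z) picking point, or None if aligned."""
    xs, ys, zs = picking_pt       # strawberry picking point, arm frame (mm)
    xe, ye, _ze = gripper_pt      # detected gripper position, arm frame (mm)
    dx, dy = xs - xe, ys - ye     # relative error; z is ignored as in Sec. 4.4
    if abs(dx) <= tol and abs(dy) <= tol:
        return None               # within tolerance: no second move needed
    # Shift the commanded target by the measured error and approach again.
    return (xs + dx, ys + dy, zs)

# With the coordinates of Table 3, row 1, the computed error of (23, -4) mm
# exceeds the assumed tolerance, so a corrected point is generated.
print(compensate((709, 221, 706), (686, 225, 647)))  # -> (732, 217, 706)
```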

4.5 Operational Efficiency Analysis of the Early Abort Strategy

To evaluate the effectiveness of the early abort strategy, the original picking time, the minimum time at which the early abort signal was triggered, and the time reduction for each action were calculated and analyzed in Table 4. In the original process, a complete picking cycle required approximately 12 s, with the inflating and approaching, swallowing, deflating, snap-off, descending, placing, and homing stages taking 2 s, 2 s, 2 s, 1 s, 2 s, 2 s, and 1 s, respectively. When the gripper approached beneath the strawberry picking point, the relative error was estimated, and if it exceeded the threshold, a compensation action was triggered to align the gripper with the strawberry, adding 1 s to the cycle, as shown in Table 4, #Case 1.

During the deflating stage, MobileNet V3-Small was executed to monitor whether a strawberry was present in the gripper. In accordance with the time-stability rule, the system required three consecutive signals before issuing the abort command. Upon receiving the abort command, the robotic arm immediately aborted the current process and skipped the snap-off and placing stages. The shortest time to receive the abort signal during the deflating stage was 0.11 s, resulting in a time reduction of 1.89 s, as shown in Table 4, #Case 2. Overall, although the compensation action slightly increased the cycle time, the continuous harvesting efficiency was significantly improved.

Table 4: Harvesting time and time reduction via early abort of the robot arm (unit: s). Empty cells denote stages that were skipped or not applicable; the time reductions in the final row refer to the monitored stage.

| Action | Original Time | #Case 1: Relative Error Compensation | #Case 2: Grasping failure | #Case 3: Slipped |
|---|---|---|---|---|
| Inflating and approaching | 2 | 2 | 2 | 2 |
| Compensation | | 1 | | |
| Swallowing | 2 | 2 | 2 | 2 |
| Deflating | 2 | 2 | 0.11 | 2 |
| Snap-off | 1 | 1 | | 0.7 |
| Descending | 2 | 2 | 2 | 2 |
| Placing | 2 | 2 | | |
| Homing | 1 | 1 | 1 | 1 |
| Time Reduction | | -1 | 1.89 | 0.3 |

Table 5: LSTM classifier results on the validation set of SnapData, where label 0 indicated that the strawberry remained in the gripper and label 1 indicated that it had slipped.

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| 0 | 0.9294 | 1.0000 | 0.9634 |
| 1 | 1.0000 | 0.9118 | 0.9538 |

For the snap-off stage, the results of the LSTM classifier on the validation set of SnapData are presented in Table 5, where class 0 and class 1 represented strawberries remaining in the gripper and strawberries that had slipped, respectively. As shown in Table 5, the LSTM classifier achieved high accuracy on SnapData. For class 0, all strawberries in the gripper were correctly detected (recall = 1.0000), with a few false positives (precision = 0.9294). For class 1, all predicted slips were correct (precision = 1.0000), though a small fraction were missed (recall = 0.9118). These results indicated that the LSTM classifier provided reliable slip prediction with minimal false alarms, supporting its use in early abort strategies.

Classes 0 and 1 were classified using a maximum threshold. It was noted that class 0 encompassed two strawberry states (stable and slipping), so an adaptive minimum threshold was introduced and applied to further distinguish between them. During the snap-off stage, operations were triggered based on the strawberry status in the gripper, with the time-stability rule applied to prevent false positives. When the early abort condition was met for three consecutive checks, the placing stage was skipped, and homing was executed immediately. In randomized tests, the minimum response time for receiving an abort signal was 0.7 s, reducing the overall cycle time by 0.3 s, as presented in Table 4, #Case 3.
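Both monitored stages reduce to the same decision pattern: a per-frame signal (grasp absence from the MobileNet V3-Small classifier during deflating, or the LSTM slip probability during snap-off) gated by the time-stability rule of three consecutive agreeing checks. A minimal sketch follows; the classifier interfaces, the 0/1 label convention, and the 0.5 maximum threshold are assumptions for illustration.

```python
# Sketch of the early abort logic with the time-stability rule (three
# consecutive signals before acting). Interfaces and thresholds are assumed.
CONSECUTIVE_REQUIRED = 3  # time-stability rule from the text
MAX_THRESHOLD = 0.5       # assumed slip-probability (maximum) threshold

class TimeStabilityGate:
    """Fire only after N consecutive positive signals, to suppress noise."""
    def __init__(self, n: int = CONSECUTIVE_REQUIRED):
        self.n, self.count = n, 0

    def update(self, signal: bool) -> bool:
        self.count = self.count + 1 if signal else 0
        return self.count >= self.n

def monitor_deflating(grasp_classifier, frames) -> bool:
    """Return True to abort the cycle (no strawberry in the gripper)."""
    gate = TimeStabilityGate()
    for frame in frames:                      # frames from the in-gripper camera
        empty = grasp_classifier(frame) == 0  # assumed label 0: no strawberry
        if gate.update(empty):
            return True                       # early abort signal
    return False

def monitor_snap_off(slip_probs) -> bool:
    """Return True if slippage is predicted during snap-off."""
    gate = TimeStabilityGate()
    for p in slip_probs:                      # per-frame LSTM slip probabilities
        if gate.update(p >= MAX_THRESHOLD):
            return True                       # trigger early abort / correction
    return False

# Example: three consecutive probabilities above the threshold trigger abort.
print(monitor_snap_off([0.20, 0.41, 0.45, 0.51, 0.62, 0.70]))  # -> True
```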
Visualization results of the LSTM classifier on the validation set of SnapData are presented in Fig. 11. The left panel displayed frames 8-16 with the prediction probability and corresponding strawberry statuses, while the right panel illustrated the LSTM prediction probabilities. The horizontal axis represented the frame index starting from 0, and the vertical axis indicated the predicted probability of strawberry slippage within the next three frames. The red dashed line marked the maximum threshold, while the blue curve with asterisks showed the predicted probabilities. Red five-pointed stars denoted instances where the strawberry had slipped from the end-effector, with the first star indicating the frame where the strawberry slipped for the first time. An adaptive minimum threshold, illustrated by a pink dashed line, was employed to determine whether a secondary picking attempt should be executed or the subsequent actions should continue. If the total number of frames did not reach 10, fewer than 10 frames were displayed as labels in the top-left corner of the images. The strawberry began slipping between frames 9 and 11, with corresponding probabilities of 0.406, 0.446, and 0.502; from frame 13 onward, the strawberry was considered slipped.

Fig. 11: LSTM classifier results. (Left) Video frames 8-16; (Right) Predicted probabilities from the LSTM model.

Based on the validation results on SnapData, the LSTM classifier was able to send early-abort signals in a timely manner, enabling efficient and stable operation. Tests of the early abort strategy for empty grasps during deflating and slip prediction during snap-off were conducted on HarvestFlex. However, experiments involving inflating and deflating to re-grasp were not performed or deployed on HarvestFlex due to limitations of the strawberry growing season.

5 Discussion

Vision-based fault diagnosis and self-recovery offered an effective means of enhancing the stability of strawberry-harvesting robots. During the inflating and approaching stage, the positional relationship between the gripper and the harvesting point served as a visual indicator of accumulated errors caused by fruit recognition, localization, hand-eye calibration, and inverse kinematics. Without requiring an additional camera, relative errors based on simultaneous target-gripper detection were computed to align the strawberry with the end-effector. For early abort feedback during the deflating and snap-off stages, a micro-optical camera was embedded in the end-effector to detect and predict the probability of strawberry slippage in the gripper. This approach compensated for the absence of force feedback in the flexible pneumatic gripper, whose deformation during inflation and deflation rendered conventional force sensing impractical. However, a limitation of slip-prediction-based early abort was that the material properties of artificial strawberries differed from those of real ones, particularly in weight and texture. Consequently, the secondary inflating-deflating regrasp experiment was evaluated only on the SnapData test set and was not conducted on HarvestFlex, owing to the material constraints of artificial strawberries. In this study, artificial strawberries were used primarily to evaluate relative error compensation and the early abort strategy; because of the limitations of strawberry growth conditions, these experiments did not fully account for biological characteristics such as fruit damage. Mechanical failures, inverse kinematics errors, and other control-related faults were not addressed in this work.

Additionally, self-learning and adaptive evolutionary perception, planning, and decision-making can be applied to reduce dataset dependency and improve the generalization and robustness of the strawberry-harvesting robot. With the rapid development of embodied artificial intelligence and robotic agents, a new wave of end-to-end, multi-modal, large-model-based perception, self-planning, and self-decision frameworks is emerging in the robotics domain. These advances offer valuable insights for building strawberry harvesting robots and enabling collaborative operations among multiple robots. In the future, developing end-to-end active learning and continuously evolving methods is expected to become a major research trend.

6 Conclusion

This paper proposed a vision-based fault diagnosis and self-recovery system to address gripper offset and strawberry slippage during harvesting. An end-to-end multi-task perception network with a shared-weight backbone and neck was developed to reduce computational overhead. Without the need for an additional external camera, the relative error between the target and the gripper was estimated through simultaneous detection, enabling compensation actions to mitigate cumulative errors. To monitor the strawberry's status in the gripper, a miniature optical camera was embedded at its base. A MobileNet V3-Small classifier was adapted to detect strawberries within the gripper and trigger an early abort signal when necessary during the deflating stage, based on GraspData. For strawberry slippage during the snap-off stage, a novel dataset, SnapData, comprising strawberry and background classes was introduced, and the perception network was extended and fine-tuned to segment strawberries, the gripper, and the background. Width and height features were combined with a time-series LSTM classifier to predict strawberry slippage from the gripper, enabling early-abort picking actions and improving harvesting efficiency. Experimental results demonstrated that the perception network achieved high accuracy, with a mean absolute error of 0.035 for ripeness estimation. The compensation mechanism reduced relative errors to 3.12 mm and 4.11 mm, while the early-abort strategy shortened execution times by 1.89 s and 0.3 s during the deflating and snap-off stages, respectively. Overall, the proposed system effectively enhanced the robustness and efficiency of strawberry harvesting, providing a practical solution for reliable autonomous fruit picking.

Declarations

Funding. This work was supported by the Haidian District Bureau of Agriculture and Rural Affairs, the Beijing Academy of Agriculture and Forestry Sciences (BAAFS) Innovation Ability Project (KJCX20240321), the Outstanding Youth Foundation of BAAFS (YKPY2025007), the BAAFS Talent Recruitment Program, and the National Natural Science Foundation of China (NSFC) Excellent Young Scientists Fund (Overseas).

Conflict of interest. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability. Data will be released.

Code availability. The code will be released.

Author contribution. Meili Sun: Methodology, Software, Validation, Writing - original draft. Chunjiang Zhao: Investigation, Funding acquisition, Writing - review & editing. Lichao Yang: Data curation, Methodology, Software, Visualization. Hao Liu: Data curation, Methodology, Software. Shimin Hu: Data curation, Software, Visualization. Ya Xiong: Conceptualization, Methodology, Investigation, Funding acquisition, Writing - review & editing.

References

[1] Hua, W., Zhang, Z., Zhang, W., Liu, X., Hu, C., He, Y., Mhamed, M., Li, X., Dong, H., Saha, C.K., et al.: Key technologies in apple harvesting robot for standardized orchards: A comprehensive review of innovations, challenges, and future directions. Computers and Electronics in Agriculture 235, 110343 (2025)
[2] Xiao, X., Wang, Y., Jiang, Y.: Review of research advances in fruit and vegetable harvesting robots. Journal of Electrical Engineering & Technology 19(1), 773-789 (2024)
[3] Hou, C., Xu, J., Tang, Y., Jiajun, Z., Tan, Z., Chen, W., Wei, S., Huang, H., Fang, M.: Detection and localization of citrus picking points based on binocular vision. Precision Agriculture 25(5), 2321-2355 (2024)
[4] Milecki, A., Nowak, P.: Review of fault-tolerant control systems used in robotic manipulators. Applied Sciences 13(4), 2675 (2023)
[5] Quamar, M.M., Nasir, A.: Review on fault diagnosis and fault-tolerant control scheme for robotic manipulators: Recent advances in AI, machine learning, and digital twin. arXiv preprint arXiv:2402.02980 (2024)
[6] Rajendran, V., Debnath, B., Mghames, S., Mandil, W., Parsa, S., Parsons, S., Ghalamzan-E, A.: Towards autonomous selective harvesting: A review of robot perception, robot design, motion planning and control. Journal of Field Robotics 41(7), 2247-2279 (2024)
[7] Wang, H., Lao, L., Zhang, H., Tang, Z., Qian, P., He, Q.: Structural fault detection and diagnosis for combine harvesters: A critical review. Sensors 25(13), 3851 (2025)
[8] Rapalo, A.O.V., Cortés, C.A.B., Ordoñez Ávila, J.L.: Design and analysis of a pneumatic end-effector with an air contraction clamp for soft robotics. In: 2024 9th International Conference on Control and Robotics Engineering (ICCRE), pp. 98-103 (2024). IEEE
[9] Khalastchi, E., Kalech, M.: Fault detection and diagnosis in multi-robot systems: A survey. Sensors 19(18), 4019 (2019)
[10] Nejati, M., Penhall, N., Williams, H., Bell, J., Lim, J., Ahn, H.S., MacDonald, B.: Kiwifruit detection in challenging conditions. arXiv preprint arXiv:2006.11729 (2020)
[11] Shi, X., Wang, S., Zhang, B., Ding, X., Qi, P., Qu, H., Li, N., Wu, J., Yang, H.: Advances in object detection and localization techniques for fruit harvesting robots. Agronomy 15(1), 145 (2025)
[12] Tantan, J., Xiongzhe, H., Wang, P., Lyu, Y., Eunha, C., Haetnim, J., Lirong, X.: Performance evaluation of robotic harvester with integrated real-time perception and path planning for dwarf hedge-planted apple orchard. Agriculture 15(15), 1593 (2025)
[13] Kamat, P., Gite, S., Chandekar, H., Dlima, L., Pradhan, B.: Multi-class fruit ripeness detection using YOLO and SSD object detection models. Discover Applied Sciences 7(9), 931 (2025)
[14] Sabry, A.H., Amirulddin, U.A.B.U.: A review on fault detection and diagnosis of industrial robots and multi-axis machines. Results in Engineering 23, 102397 (2024)
[15] Zhang, C., Mousavi, A.A., Masri, S.F., Gholipour, G., Yan, K., Li, X.: Vibration feature extraction using signal processing techniques for structural health monitoring: A review. Mechanical Systems and Signal Processing 177, 109175 (2022)
[16] Vukadinovic, M., Reiterer, B., Rathmair, M., Schuetz, C.G.: Anomaly detection in robot applications: Comparison of rule-based and machine learning methods. In: 2024 9th International Conference on Control, Robotics and Cybernetics (CRC), pp. 1-5 (2024). IEEE
[17] Hasan, A., Tahavori, M., Midtiby, H.S.: Model-based fault diagnosis algorithms for robotic systems. IEEE Access 11, 2250-2258 (2023)
[18] Zhou, H., Ahmed, A., Liu, T., Romeo, M., Beh, T., Pan, Y., Kang, H., Chen, C.: Finger vision enabled real-time defect detection in robotic harvesting. Computers and Electronics in Agriculture 234, 110222 (2025)
[19] Hwang, C.-L.: Cooperation of robot manipulators with motion constraint by real-time RNN-based finite-time fault-tolerant control. Neurocomputing 556, 126694 (2023)
[20] Sun, M., Hu, S., Zhao, C., Xiong, Y.: Light-resilient visual regression of strawberry ripeness for robotic harvesting. Computers and Electronics in Agriculture 241, 111169 (2026). https://doi.org/10.1016/j.compag.2025.111169
[21] Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L.-C., Tan, M., Chu, G., Vasudevan, V., Zhu, Y., Pang, R., Adam, H., Le, Q.: Searching for MobileNetV3. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1314-1324 (2019). https://doi.org/10.1109/ICCV.2019.00140
[22] Chen, Y., Miao, Z., Ge, Y., Lin, S., Chen, L., Xiong, Y.: Design and control of a novel six-degree-of-freedom hybrid robotic arm. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3597-3604 (2024). IEEE
[23] Torralba, A., Russell, B.C., Yuen, J.: LabelMe: Online image annotation and applications. Proceedings of the IEEE 98(8), 1467-1484 (2010)
[24] Ge, Y., Xiong, Y., From, P.J.: Three-dimensional location methods for the vision system of strawberry-harvesting robots: development and comparison. Precision Agriculture 24(2), 764-782 (2023)
[25] Zou, Z., Chen, K., Shi, Z., Guo, Y., Ye, J.: Object detection in 20 years: A survey. Proceedings of the IEEE 111(3), 257-276 (2023). https://doi.org/10.1109/JPROC.2023.3238524
