A Deep Reinforcement Learning Framework for Closed-loop Guidance of Fish Schools via Virtual Agents


Authors: Takato Shibayama, Hiroaki Kawashima*

Graduate School of Information Science, University of Hyogo, Kobe, Japan
* Corresponding author: kawashima@gsis.u-hyogo.ac.jp

Abstract

Guiding collective motion in biological groups is a fundamental challenge in understanding social interaction rules and developing automated systems for animal management. In this study, we propose a deep reinforcement learning (RL) framework for the closed-loop guidance of fish schools using virtual agents. These agents are controlled by policies trained via Proximal Policy Optimization (PPO) in simulation and deployed in physical experiments with rummy-nose tetras (Petitella bleheri), enabling real-time interaction between artificial agents and live individuals. To cope with the stochastic behavior of live individuals, we design a composite reward function to balance directional guidance with social cohesion. Our systematic evaluation of visual parameters shows that a white background and larger stimulus sizes maximize guidance efficacy in physical trials. Furthermore, evaluation across group sizes revealed that while the system demonstrates effective guidance for groups of five individuals, this capability markedly degrades as group size increases to eight. This study highlights the potential of deep RL for automated guidance of biological collectives and identifies challenges in maintaining artificial influence in larger groups.

1 Introduction

Collective behavior in biological systems, such as fish schooling and bird flocking, emerges from local interactions among individuals, leading to complex and coordinated group-level patterns [1, 2, 3, 4, 5]. These behaviors allow groups to respond rapidly to environmental stimuli, such as predator threats, through the instantaneous transmission of information across the collective [6, 7, 8, 9]. Understanding and guiding these dynamics are of significant interest in both fundamental biology and practical applications, such as automated aquaculture management and the development of bio-inspired underwater robotics [10, 11, 12, 13, 14, 15].

To influence or control the motion of a collective, researchers have developed various biomimetic agents, including robotic fish [16, 17, 18, 19, 20] and visual stimuli displayed on screens or by projection [21, 22, 23, 24, 25, 26, 27]. These tools allow for "causal" investigations by decoupling specific social cues. Among these, closed-loop systems, where artificial agents respond in real time to the actions of live animals, have emerged as a powerful tool for investigating social interaction mechanisms [16, 17, 18, 19, 25].

A major challenge in designing effective closed-loop controllers for biological agents lies in modeling stochastic and nonlinear collective behavior. To alleviate the reliance on precise analytical models, model-free reinforcement learning (RL) has been proposed as a promising framework [28]. While prior work demonstrated the feasibility of fish guidance using Q-learning, its scope was limited to small, highly cohesive groups ($N_r = 3$) represented by a single centroid [28]. Such approaches rely on discretized state-space representations and reward structures limited to evaluating final outcomes, which may not scale effectively to larger collectives.
This study expands upon this foundation by implementing an adaptive controller based on Proximal Policy Optimization (PPO), a state-of-the-art deep reinforcement learning approach [29]. As the group size increases, collective dynamics become significantly more complex, often splitting into multiple sub-groups, which makes simple discretized representations insufficient. To provide the necessary capacity for future scalability and more granular feedback control, we transition to a continuous state-space representation through PPO. To the best of our knowledge, this is one of the first studies to apply PPO-based virtual agents to the real-time closed-loop guidance of biological collectives.

By introducing a multi-objective reward function that balances group integrity with directional guidance, we facilitate the stable acquisition of interaction policies in a simulation environment. Our methodological approach establishes a robust bridge between computational learning and real-world biological interaction by training agents in a simulation and subsequently deploying them in physical experiments. We evaluate our framework using rummy-nose tetras (Petitella bleheri).

The investigation is conducted in two phases. First, we utilize small groups ($N_r = 3$) to systematically evaluate and optimize the virtual agents' visual parameters, specifically background color and stimulus size, to maximize their salience to the target species. Second, we evaluate the robustness and scalability of the proposed system across varying group sizes ($N_r = 5$ and $N_r = 8$) and agent configurations. Specifically, we evaluate the guidance performance under several agent configurations, including independently controlled agents, to compare their effectiveness across different group sizes. Our findings indicate that as group size increases, the efficacy of directional guidance faces significant challenges, likely due to interference between artificial visual stimuli and intrinsic social interactions. This work highlights both the potential of deep RL for automated animal guidance and the fundamental challenges of maintaining influence within dense biological environments.

2 Methods

2.1 Experimental setup

We used rummy-nose tetras (Petitella bleheri, recently reclassified from Hemigrammus bleheri) as the experimental subjects. This species was selected due to its strong schooling tendency and its suitability for laboratory maintenance.

The closed-loop guidance system developed in this study integrated real-time visual tracking of live fish with the application of control policies for virtual agents, as illustrated in the system architecture (Fig. 1). A front-facing camera captured the positions of the live fish, which were processed by a PC to determine the movements of the virtual agents based on trained reinforcement learning policies. These agents were then presented to the fish in real time via a liquid crystal display. The camera and display were aligned to be parallel to the tank surface, minimizing geometric and perspective distortions between the image and display coordinate systems.
The experimental arena consisted of an acrylic tank with internal dimensions of 389 × 213 × 89 mm (width × height × depth), where depth refers to the front-to-back dimension. As illustrated in the top-view schematic (Fig. 2), we used a 2 mm-thick acrylic partition to divide the tank into two sections with depths of 47 mm and 40 mm; the fish were placed in the 40 mm-deep section to constrain their swimming movements to a quasi-two-dimensional plane. Virtual agents were presented on a liquid crystal display (221E9/11, Philips; 483 × 270 mm) mounted flush against the rear exterior wall of the tank. This spatial configuration ensured that the swimming region for the live fish was positioned at a sufficient distance from the display to prevent the stimuli from being obscured from the fish's perspective by reflections at the tank-water interface. A front-facing camera was used to monitor the individuals in real time.

Figure 1: Schematic diagram of the closed-loop system architecture. The positions of the live fish are monitored by a front-facing camera and processed by a PC to apply the learned agent policies, which are then rendered as virtual agents on the display.

Figure 2: Top-view schematic of the experimental setup. The hatched area indicates the 40 mm-deep (front-to-back distance) section where the live fish were constrained. The partition ensures two-dimensional movement and visibility of the displayed virtual agents by avoiding reflections at the tank-water interface from the fish's perspective.

2.2 Real-time vision and coordinate mapping

To achieve real-time closed-loop interaction, we implemented an automated tracking system using YOLOv10 [30]. The detection model was fine-tuned specifically to identify rummy-nose tetras within the experimental arena. The system captures the positions of all individuals at 10 fps.

The raw coordinates $(u_j, v_j)$ of each fish $j \in \{1, \ldots, N_r\}$ detected in the camera frame are mapped to a normalized tank coordinate system $\mathbf{x}^{(r)}_j \in [0, 1]^2$. Before the experiments, the coordinates of the top-left $\mathbf{u}_{\mathrm{tank0}} = (u_{\mathrm{tank0}}, v_{\mathrm{tank0}})^\top$ and bottom-right $\mathbf{u}_{\mathrm{tank1}} = (u_{\mathrm{tank1}}, v_{\mathrm{tank1}})^\top$ corners of the swimming region in the camera image were recorded. The normalized position $\mathbf{x}^{(r)}_j$ is then calculated as follows:

$$
\mathbf{x}^{(r)}_j = \begin{pmatrix} (u_j - u_{\mathrm{tank0}}) / (u_{\mathrm{tank1}} - u_{\mathrm{tank0}}) \\ (v_j - v_{\mathrm{tank0}}) / (v_{\mathrm{tank1}} - v_{\mathrm{tank0}}) \end{pmatrix}. \tag{1}
$$

To project the virtual agents managed in the RL environment onto the display at the correct physical locations, we establish a mapping between the camera and display coordinate systems. A set of reference spots $S$ is projected onto the display, and their corresponding homogeneous coordinates in the camera frame, $F = \{(u_1, v_1, 1)^\top, \ldots, (u_n, v_n, 1)^\top\}$, are captured. The $2 \times 3$ transformation matrix $A$ is then determined by $A = S F^\top (F F^\top)^{-1}$.
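The calibration above reduces to a small least-squares fit. The following is a minimal NumPy sketch of Eq. (1) and the estimation of $A$; function and variable names are illustrative and not taken from the actual implementation.

```python
import numpy as np

def fit_camera_to_display(display_spots, camera_spots):
    """Fit the 2x3 affine map A = S F^T (F F^T)^(-1).

    display_spots: (n, 2) pixel coordinates of the reference spots shown on the display (S).
    camera_spots:  (n, 2) coordinates of the same spots detected in the camera frame.
    """
    S = np.asarray(display_spots, dtype=float).T              # shape (2, n)
    F = np.vstack([np.asarray(camera_spots, dtype=float).T,   # shape (3, n), homogeneous
                   np.ones(len(camera_spots))])
    return S @ F.T @ np.linalg.inv(F @ F.T)                   # shape (2, 3)

def normalize_position(uv, u_tank0, u_tank1):
    """Map a raw camera coordinate to the normalized tank frame [0, 1]^2 (Eq. 1)."""
    u0 = np.asarray(u_tank0, dtype=float)
    u1 = np.asarray(u_tank1, dtype=float)
    return (np.asarray(uv, dtype=float) - u0) / (u1 - u0)

# Example usage: map the recorded swimming-region corners into display pixels.
# A = fit_camera_to_display(display_spots, camera_spots)
# d_tank0 = A @ np.array([*u_tank0, 1.0])
# d_tank1 = A @ np.array([*u_tank1, 1.0])
```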
Using this matrix $A$, the display pixel coordinates corresponding to the corners of the swimming region, $\mathbf{d}_{\mathrm{tank}k} = (d_{\mathrm{tank}k}, e_{\mathrm{tank}k})^\top$ (for $k \in \{0, 1\}$), are obtained by $\mathbf{d}_{\mathrm{tank}k} = A \tilde{\mathbf{u}}_{\mathrm{tank}k}$, where $\tilde{\mathbf{u}}_{\mathrm{tank}k} = (u_{\mathrm{tank}k}, v_{\mathrm{tank}k}, 1)^\top$ represents the homogeneous coordinates of the corners. Finally, the normalized coordinates of a virtual agent $i$, denoted as $\mathbf{x}^{(v)}_i$, are converted into display pixel coordinates $(d_i, e_i)$ as follows:

$$
\begin{pmatrix} d_i \\ e_i \end{pmatrix} = \mathbf{x}^{(v)}_i \odot (\mathbf{d}_{\mathrm{tank1}} - \mathbf{d}_{\mathrm{tank0}}) + \mathbf{d}_{\mathrm{tank0}}, \tag{2}
$$

where $\odot$ denotes the element-wise product. This pipeline ensures that the visual stimuli are presented at precisely defined spatial positions relative to the live individuals, enabling accurate guidance based on social interactions.

2.3 Reinforcement learning framework

To develop an autonomous control policy for virtual agents capable of guiding fish schools, we employed Proximal Policy Optimization (PPO), a policy-based deep reinforcement learning algorithm. In this framework, an agent (i.e., a virtual agent presented on the display) is defined as a single policy-controlled unit that interacts with the environment at discrete time steps $t = 0, 1, \ldots$ to maximize rewards. Depending on the experimental configuration (see Section 2.5.2), the visual representation of an agent is rendered either as a single fish image or as a fixed formation of multiple fish images, where a fish image refers to an individual visual stimulus displayed on the screen. Each agent operates based on its own state observation and action output.

2.3.1 State and action space

The state vector observed by each virtual agent $i \in \{1, \ldots, N_v\}$ at time $t$ is defined by the coordinates of the real fish and the agent's own position:

$$
\mathbf{s}_{i,t} = [\mathbf{s}^{(r)\top}_{i,t}, \mathbf{s}^{(v)\top}_{i,t}]^\top, \tag{3}
$$

where $\mathbf{s}^{(v)}_{i,t}$ is the normalized coordinate $\mathbf{x}^{(v)}_i$ of the $i$-th virtual agent. While prior work represented the real fish collective using its global centroid [28], such an approach may lack the granularity required to manage fragmented sub-groups. To ensure a scalable and consistent representation of the target collective, we define the real fish information $\mathbf{s}^{(r)}_{i,t}$ as a 2D coordinate representing a specific guidance reference point. We implemented two modes for defining these reference points, depending on the experimental configuration (a code sketch of both modes is given at the end of this subsection):

• Global mode: The guidance reference point $\mathbf{s}^{(r)}_{i,t}$ is defined as the global centroid of the $N_r$ real individuals. In this study, this mode was applied to the single-agent scenarios ($N_v = 1$), or more generally, to cases where the school is treated as a single cohesive unit and all agents share the same reference point.

• Cluster-assignment mode: The real fish are partitioned into $k$ clusters using the k-means algorithm. This mode was applied to our multi-agent configurations ($N_v > 1$), where each virtual agent $i$ is assigned the centroid of a specific cluster $c_i$ as its guidance reference point $\mathbf{s}^{(r)}_{i,t}$. This mapping ensures that each agent maintains a fixed-length input and focuses on a localized sub-group, providing robustness against group fragmentation.

Each virtual agent $i$ outputs a discrete action $a \in \{0, 1, \ldots, 7\}$, which corresponds to eight movement directions. Based on the selected action, a target coordinate $\mathbf{x}^{(v)}_{i,\mathrm{target}}$ is determined for each virtual agent $i$.
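To make the two observation modes and the action space concrete, the sketch below builds the per-agent guidance reference point and converts a discrete action into a target coordinate. It assumes scikit-learn for the k-means step; the action step length and the cluster-to-agent assignment are illustrative simplifications rather than values specified in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def reference_points(fish_xy, n_agents, mode="global"):
    """One 2D guidance reference point per virtual agent (Section 2.3.1).

    fish_xy: (N_r, 2) normalized positions of the real fish.
    mode:    "global"  -> every agent observes the school centroid;
             "cluster" -> agent i observes the centroid of one k-means cluster.
    """
    fish_xy = np.asarray(fish_xy, dtype=float)
    if mode == "global" or n_agents == 1:
        return np.tile(fish_xy.mean(axis=0), (n_agents, 1))
    km = KMeans(n_clusters=n_agents, n_init=10).fit(fish_xy)
    return km.cluster_centers_   # cluster-to-agent assignment abstracted here

# Eight movement directions; action a in {0, ..., 7} indexes one unit direction.
DIRS = np.array([[np.cos(k * np.pi / 4), np.sin(k * np.pi / 4)] for k in range(8)])

def action_to_target(agent_xy, action, step=0.1):
    """Discrete action -> target coordinate, clipped to the tank [0, 1]^2.
    The step length 0.1 is an assumed illustrative value, not a reported setting."""
    target = np.asarray(agent_xy, dtype=float) + step * DIRS[action]
    return np.clip(target, 0.0, 1.0)
```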
In the physical experiments (Section 2.5), each fish image was horizontally flipped in real time to align with its instantaneous movement direction, ensuring a natural visual appearance. To ensure biologically plausible trajectories, the actual movement of virtual agent $i$ is modeled as a first-order lag system:

$$
\frac{d\mathbf{x}^{(v)}_i}{dt} = \frac{1}{\tau^{(v)}} \left( \mathbf{x}^{(v)}_{i,\mathrm{target}} - \mathbf{x}^{(v)}_i \right), \tag{4}
$$

where $\tau^{(v)} > 0$ is the time constant. This dynamic model allows the virtual agent to reflect the intermittent burst-and-coast movement and frequent directional shifts characteristic of rummy-nose tetras [31].

2.3.2 Multi-objective reward design

To develop an autonomous policy for active guidance, we define a composite reward function $r_\beta$. We first consider a baseline reward $r_{\mathrm{base}} \in [-1, 1]$, based solely on $c^{(r)}_x$, the horizontal position of the collective's global centroid [28]:

$$
r_{\mathrm{base}} = 1 - 2\,|c^{(r)}_x - x_{\mathrm{target\text{-}end}}|, \tag{5}
$$

where $x_{\mathrm{target\text{-}end}} \in \{0, 1\}$ represents the target end of the tank (0 for leftward and 1 for rightward guidance). However, using $r_{\mathrm{base}}$ alone can lead to guidance failure. Since rummy-nose tetras are naturally exploratory, they might move toward the target area independently. In such cases, if the reward depends only on the school's location, the RL agent receives positive reinforcement without actually exerting guidance control, failing to learn appropriate guiding behaviors.

To address this, we defined $r_\beta$ as a weighted sum of social cohesion and directional guidance using a hyperparameter $\beta \in [0, 1]$:

$$
r_\beta = \beta\, r_{\mathrm{school}} + (1 - \beta)\, r_{\mathrm{direction}}, \tag{6}
$$

where both reward terms are normalized to the range $[-1, 1]$ to ensure a balanced contribution to the composite reward.

The social cohesion term $r_{\mathrm{school}}$ explicitly evaluates the social coupling between the live fish and the virtual agents. It rewards the agents for maintaining proximity to the real individuals, thereby facilitating social influence and preventing the agents from receiving rewards without exerting guidance control. Our formulation calculates the distance to the nearest virtual agent for each real individual:

$$
r_{\mathrm{school}} = 1 - \frac{\sqrt{2}}{N_r} \sum_{j=1}^{N_r} \min_{i \in \{1, \ldots, N_v\}} d\!\left(\mathbf{x}^{(r)}_j, \mathbf{x}^{(v)}_i\right), \tag{7}
$$

where $d(\mathbf{x}^{(r)}_j, \mathbf{x}^{(v)}_i)$ represents the Euclidean distance between real fish $j \in \{1, \ldots, N_r\}$ and virtual agent $i \in \{1, \ldots, N_v\}$. This term encourages virtual agents to maintain proximity to the real fish while simultaneously allowing multiple agents to distribute themselves to manage fragmented sub-groups, providing robustness against school splitting.

The directional guidance term $r_{\mathrm{direction}}$ evaluates the progress of the virtual agents themselves toward the target end of the tank:

$$
r_{\mathrm{direction}} = 1 - 2\,|c^{(v)}_x - x_{\mathrm{target\text{-}end}}|, \tag{8}
$$

where $c^{(v)}_x$ is the horizontal coordinate of the virtual agents' centroid. The overall structure of this multi-objective reward is illustrated in Fig. 3.

Figure 3: Conceptual diagram of the multi-objective reward design. The cohesion term $r_{\mathrm{school}}$ encourages the virtual agents to maintain proximity to the real fish, while $r_{\mathrm{direction}}$ rewards the progress of the virtual agents toward the target area.
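For reference, the reward terms in Eqs. (5)–(8) translate directly into a few lines of NumPy. The sketch below assumes normalized positions in $[0, 1]^2$ as defined above; it illustrates the formulas and is not the actual implementation.

```python
import numpy as np

def composite_reward(fish_xy, agent_xy, target_end, beta=0.3):
    """r_beta = beta * r_school + (1 - beta) * r_direction (Eqs. 6-8).

    fish_xy:    (N_r, 2) normalized positions of the real fish.
    agent_xy:   (N_v, 2) normalized positions of the virtual agents.
    target_end: 0.0 for leftward guidance, 1.0 for rightward guidance.
    """
    fish_xy = np.asarray(fish_xy, dtype=float)
    agent_xy = np.asarray(agent_xy, dtype=float)

    # r_school: for each real fish, distance to its nearest virtual agent (Eq. 7).
    dists = np.linalg.norm(fish_xy[:, None, :] - agent_xy[None, :, :], axis=-1)
    r_school = 1.0 - np.sqrt(2.0) * dists.min(axis=1).mean()

    # r_direction: progress of the agents' centroid toward the target end (Eq. 8).
    r_direction = 1.0 - 2.0 * abs(agent_xy[:, 0].mean() - target_end)

    return beta * r_school + (1.0 - beta) * r_direction

def baseline_reward(fish_xy, target_end):
    """Baseline reward using only the school centroid's horizontal position (Eq. 5)."""
    c_x = np.asarray(fish_xy, dtype=float)[:, 0].mean()
    return 1.0 - 2.0 * abs(c_x - target_end)
```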
The above formulation is presented in a general form applicable to multi-agent settings. During policy learning in simulation, however, we consider the single-agent case ($N_v = 1$), in which the reward terms reduce to forms defined with respect to the centroid of the real fish school.

2.4 Agent training in simulation

2.4.1 Motivation for simulation-based training

The primary motivation for employing a simulation environment is to acquire an optimal control policy through pre-training before deploying it in the physical environment. Reinforcement learning typically requires a massive number of interactions to converge, on the order of $10^6$ steps in this study, which is impractical to perform with live animals due to time constraints and the need to ensure animal welfare. Unlike previous work [28], our policy is fully acquired within the simulation and subsequently deployed in the physical environment without further weight updates or online learning. This approach allows us to rigorously evaluate the robustness of the policy and its capacity for zero-shot transfer from a virtual model to real biological systems.

2.4.2 Simulation setup

In the simulation phase, we utilize the Global mode as defined in Section 2.3.1. Individual real fish are not modeled explicitly; instead, the group is represented solely by its centroid. Specifically, the environment consists of one virtual agent and the simulated school centroid, with their states represented by normalized coordinates $\mathbf{x}^{(v)}$ and $\mathbf{c}^{(r)}$, respectively. Under this configuration, the reward $r_{\mathrm{school}}$ simplifies to $1 - \sqrt{2}\, d(\mathbf{c}^{(r)}, \mathbf{x}^{(v)})$, and $r_{\mathrm{direction}}$ is calculated based on the $x$-coordinate of the single virtual agent.

To approximate the continuous dynamics of the agents and the school centroid, the underlying simulation state is updated at a simulation time step of 0.1 s, while the agent selects a discrete action every 1.0 s of simulation time. This multi-rate update scheme allows the virtual agent to interact with the collective at a lower update frequency that better matches the characteristic behavioral timescale of the fish, while maintaining the fine temporal resolution required for smooth motion and stable numerical integration of the first-order lag dynamics.

2.4.3 Behavioral model for simulated real fish

The movement of the simulated fish school centroid $\mathbf{c}^{(r)}$ is governed by a stochastic behavioral model proposed in [28]. This model assumes that the school's motion consists of a sequence of discrete linear trajectories, reflecting the burst-and-coast swimming style of rummy-nose tetras. Under this framework, the school is assumed to behave as a highly cohesive unit, where individual movements are sufficiently synchronized to be effectively represented by their collective centroid.

To account for the non-deterministic nature of social interactions, we introduce an ignoring probability $p$, defined as the probability that the school ignores the virtual agent, following the approach in [28]. Incorporating this probabilistic element encourages the RL agent to develop robust policies that do not rely on guaranteed, deterministic reactions from the fish.

Each phase duration $\Delta t \in (0, \Delta t_{\max}]$ is uniformly sampled. During each phase, the velocity follows a first-order lag system:

$$
\frac{d\mathbf{c}^{(r)}}{dt} = \frac{1}{\tau^{(r)}} \left( \mathbf{c}^{(r)}_{\mathrm{target}} - \mathbf{c}^{(r)} \right), \tag{9}
$$

where $\mathbf{c}^{(r)}_{\mathrm{target}}$ is the target coordinate updated at the beginning of each phase based on the distance $d(\mathbf{c}^{(r)}, \mathbf{x}^{(v)})$ and the ignoring probability $p$, as follows (a simulation sketch combining these rules appears after this list):

• Reaction case: If $d(\mathbf{c}^{(r)}, \mathbf{x}^{(v)}) \le \theta$ (within the interaction range), the school reacts to the agent with probability $1 - p$, setting $\mathbf{c}^{(r)}_{\mathrm{target}} = \mathbf{x}^{(v)}$.

• Spontaneous movement: If $d(\mathbf{c}^{(r)}, \mathbf{x}^{(v)}) > \theta$, or with probability $p$ even within the interaction range, the target is updated by a random displacement:

$$
\mathbf{c}^{(r)}_{\mathrm{target}} = \mathbf{c}^{(r)} + \begin{pmatrix} \delta_x \\ \delta_y \end{pmatrix}, \tag{10}
$$

where $\delta_x$ and $\delta_y$ are sampled uniformly from $[-\delta_{x\max}, \delta_{x\max}]$ and $[-\delta_{y\max}, \delta_{y\max}]$, respectively.
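A compact way to see how the pieces fit together is a self-contained simulation environment combining the agent dynamics (Eq. (4)), the behavioral model (Eqs. (9)–(10)), and the multi-rate update of Section 2.4.2. The parameter values below ($\theta$, the time constants, $\Delta t_{\max}$, $\delta_{\max}$, and the action step length) are placeholders rather than the settings used in the experiments, so this should be read as an illustration only.

```python
import numpy as np

class SimulatedSchoolEnv:
    """Minimal sketch of the pre-training simulation (single agent, Global mode)."""

    def __init__(self, p_ignore=0.6, theta=0.2, tau_v=0.5, tau_r=0.5, dt=0.1,
                 action_period=1.0, dt_phase_max=2.0, delta_max=0.2, seed=0):
        self.rng = np.random.default_rng(seed)
        self.p, self.theta = p_ignore, theta
        self.tau_v, self.tau_r, self.dt = tau_v, tau_r, dt
        self.n_sub = int(round(action_period / dt))      # 1.0 s action, 0.1 s physics
        self.dt_phase_max, self.delta_max = dt_phase_max, delta_max
        self.dirs = np.array([[np.cos(k * np.pi / 4), np.sin(k * np.pi / 4)]
                              for k in range(8)])

    def reset(self):
        self.x_v = self.rng.uniform(0.0, 1.0, 2)   # virtual agent (assumed random init)
        self.c_r = self.rng.uniform(0.0, 1.0, 2)   # simulated school centroid
        self.c_r_target = self.c_r.copy()
        self.phase_left = 0.0
        return np.concatenate([self.c_r, self.x_v])  # state as in Eq. (3), Global mode

    def _new_phase(self):
        """Start a new movement phase: react to the agent or move spontaneously."""
        self.phase_left = self.rng.uniform(0.0, self.dt_phase_max)
        near = np.linalg.norm(self.c_r - self.x_v) <= self.theta
        if near and self.rng.random() > self.p:      # reaction with probability 1 - p
            self.c_r_target = self.x_v.copy()
        else:                                        # spontaneous random displacement
            self.c_r_target = np.clip(
                self.c_r + self.rng.uniform(-self.delta_max, self.delta_max, 2), 0.0, 1.0)

    def step(self, action, target_end=1.0, beta=0.3, step_len=0.1):
        x_v_target = np.clip(self.x_v + step_len * self.dirs[action], 0.0, 1.0)
        for _ in range(self.n_sub):                  # multi-rate inner integration
            if self.phase_left <= 0.0:
                self._new_phase()
            self.phase_left -= self.dt
            self.x_v += self.dt / self.tau_v * (x_v_target - self.x_v)       # Eq. (4)
            self.c_r += self.dt / self.tau_r * (self.c_r_target - self.c_r)  # Eq. (9)
        r_school = 1.0 - np.sqrt(2.0) * np.linalg.norm(self.c_r - self.x_v)
        r_direction = 1.0 - 2.0 * abs(self.x_v[0] - target_end)
        reward = beta * r_school + (1.0 - beta) * r_direction
        return np.concatenate([self.c_r, self.x_v]), reward
```

An interface like this can be wrapped as a Gymnasium environment and trained with an off-the-shelf PPO implementation such as Stable-Baselines3; the text does not specify which PPO implementation was used.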
2.4.4 Evaluation procedure

The performance of the acquired policy is evaluated over a validation period of $T' = 5000$ steps with the learned policy parameters (network weights) held fixed. To ensure a consistent comparison between agents trained with different values of $\beta$, the evaluation metric $R$ is defined as the time average of the baseline reward $r_{\mathrm{base}}$ (Eq. (5)):

$$
R = \frac{1}{T'} \sum_{t=1}^{T'} r_{\mathrm{base}}(t). \tag{11}
$$

As $r_{\mathrm{base}}$ depends only on $c^{(r)}_x$, the horizontal coordinate of the real school centroid, it provides an objective benchmark for identifying the policy that yields the most effective guidance behavior.

2.5 Physical experiment protocols and evaluation

2.5.1 Guidance protocol

Based on the swimming speed of rummy-nose tetras, the duration of each control step for the virtual agents was set to 1.2 s for all physical experiments. Each experimental session consisted of 900 steps (approximately 18 minutes), with the target direction (the left or right end of the tank) switching every 90 steps to evaluate the adaptive response of the school. To ensure the robustness and reproducibility of the results, four independent trials were conducted for each experimental configuration, performed at various times and across multiple dates.

Figure 4: Definition of evaluation areas for guidance tasks, shown for (a) rightward and (b) leftward guidance. The target area is defined as the 30% region from the target end, while the opposite area is the 30% region from the opposite end.

The fish were kept in a separate holding tank and were randomly selected and moved to the experimental arena only for the duration of the trials. To perform the guidance tasks in stable, still-water conditions, the water circulation system was temporarily suspended during all experiments. To ensure stable behavioral states, the fish were introduced to the experimental tank at least 90 minutes before the start of the trials. When switching between different experimental conditions, a 30-minute interval was maintained to minimize the carryover effects of previous stimuli.
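The guidance protocol can be summarized as the following control loop, reusing the reference_points and action_to_target helpers sketched in Section 2.3.1. The policy call, the detector, and the renderer are placeholders, and how the switching guidance direction is supplied to the policy is abstracted here, since it is not detailed in the text.

```python
import numpy as np

def run_session(policy, detect_fish, render_agents, n_steps=900, switch_every=90,
                step_duration=1.2, n_agents=1, mode="global", tau_v=0.5):
    """Sketch of one physical session: 900 control steps of 1.2 s each,
    with the target end switching every 90 steps."""
    agents_xy = np.full((n_agents, 2), 0.5)           # assumed initial placement
    n_sub = int(round(step_duration / 0.1))           # assumed 0.1 s inner time step
    for step in range(n_steps):
        # Alternate the guidance direction every 90 steps (starting side assumed).
        target_end = 1.0 if (step // switch_every) % 2 == 0 else 0.0
        fish_xy = detect_fish()                       # (N_r, 2) normalized positions
        refs = reference_points(fish_xy, n_agents, mode)
        for i in range(n_agents):
            obs = np.concatenate([refs[i], agents_xy[i]])   # state vector, Eq. (3)
            action = policy(obs, target_end)          # direction handling abstracted
            target = action_to_target(agents_xy[i], action)
            for _ in range(n_sub):                    # first-order lag toward target, Eq. (4)
                agents_xy[i] += 0.1 / tau_v * (target - agents_xy[i])
        render_agents(agents_xy)                      # map to display pixels via Eq. (2)
```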
2.5.2 Experimental design

The physical trials were conducted in two phases:

• Experiment A (Phase 1): This phase aimed to identify the experimental conditions that maximize the stimulus salience of the virtual agents. We systematically tested three background colors (white, gray, and black) and three fish-image sizes (small, medium, and large, which were approximately 0.6×, 1.0×, and 1.5× the size of real fish, respectively). These trials were performed with a group of three real individuals ($N_r = 3$) using a single fixed-formation virtual agent rendered as four fish images. The agent operated in the Global mode, targeting the centroid of the entire school as described in Section 2.3.1.

• Experiment B (Phase 2): Using the optimal visual parameters (background color and fish-image size) identified in Phase 1, we evaluated the guidance performance across different group sizes ($N_r = 5, 8$) and agent configurations. We compared the fixed-formation baseline used in Phase 1, treated as a single-unit agent ($N_v = 1$), with independently controlled agents ($N_v = 2, 3$). In contrast to the fixed-formation agent, the independent agents were each rendered as a single fish image and operated in the Cluster-assignment mode, where each agent targeted a localized sub-group as defined in Section 2.3.1.

2.5.3 Evaluation metrics

The efficacy of the guidance was quantified using the following three metrics (code sketches of the first two metrics follow this list):

1. Area occupancy ratio: The tank was divided into three functional zones based on the horizontal coordinate (Fig. 4): the target area (the 30% region nearest to the target end), the opposite area (the 30% region at the opposite end), and the intermediate area (the central 40%). The proportion of time spent by the fish in each area was calculated across the entire session.

2. Directional distribution and Bhattacharyya distance: We generated positional histograms of the school's horizontal centroid for both leftward and rightward guidance periods. To quantify the separability of these two distributions, we calculated the Bhattacharyya distance, where a larger value indicates more distinct guidance success.

3. Sub-interval distribution: To assess the stability of the guidance over time for representative configurations, the distribution of individual positions was visualized for each 90-step sub-interval using box plots.
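For reference, the first two metrics can be computed as follows from the recorded horizontal centroid positions. The histogram bin count is an assumed choice, since the text does not report the histogram resolution.

```python
import numpy as np

def bhattacharyya_distance(x_left, x_right, bins=20):
    """Bhattacharyya distance between the horizontal-centroid distributions recorded
    during leftward and rightward guidance periods (bin count is illustrative)."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(x_left, bins=edges)
    q, _ = np.histogram(x_right, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))        # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))

def area_occupancy(x_centroid, target_end):
    """Fraction of time in the target (nearest 30%), opposite (farthest 30%),
    and intermediate (central 40%) zones along the horizontal axis."""
    d = np.abs(np.asarray(x_centroid, dtype=float) - target_end)
    target = float(np.mean(d <= 0.3))
    opposite = float(np.mean(d >= 0.7))
    return {"target": target, "opposite": opposite,
            "intermediate": 1.0 - target - opposite}
```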
3 Results

3.1 Policy optimization through simulation

Prior to the physical experiments, we evaluated the effectiveness of the reinforcement learning framework in a simulation environment. The primary objectives were to determine the optimal weight $\beta$ for the composite reward function $r_\beta$ (Eq. (6)) and to ensure that the acquired policy remains robust across various levels of stochasticity in fish behavior, represented by the ignoring probability $p$. To account for the stochastic nature of the learning process, we performed 10 independent training runs for each parameter combination, and the mean evaluation value $\bar{R}$ was calculated across these trials.

Figure 5 illustrates the transition of the mean evaluation value $\bar{R}$ as a function of the training steps $T$. Overall, the performance generally improved as $T$ increased across all parameter combinations. When comparing different reward configurations, we observed that policies trained with $\beta = 0.1$, $0.5$, $0.7$, and $0.9$ yielded performance levels comparable to the baseline reward $r_{\mathrm{base}}$. However, the policy trained with $\beta = 0.3$ demonstrated superior performance, outperforming the baseline at $T = 10^6$ steps for all values of $p$ except $p = 0$ (where the simulated fish always react to the agent).

The results also confirmed that higher values of the ignoring probability $p$ generally lead to lower evaluation values $\bar{R}$, reflecting the increased difficulty of the guidance task when the school frequently ignores the virtual agent. Nevertheless, the composite reward function $r_\beta$ with an appropriate hyperparameter successfully facilitated stable policy acquisition even under high-noise conditions ($p \ge 0.6$). Based on these simulation results, we adopted the policy trained with $T = 10^6$, $p = 0.6$, and $\beta = 0.3$ as the autonomous agent controller for all subsequent physical experiments described in Section 2.5.

Figure 5: Learning curves showing the transition of the mean evaluation value $\bar{R}$ across different training steps $T$ and reward weights $\beta$. Each data point represents the average of 10 independent training trials for the corresponding parameter combination. The baseline represents the policy trained using only the horizontal coordinate of the school's centroid ($r_{\mathrm{base}}$). Each plot compares different ignoring probabilities $p$ (0.0, 0.3, 0.6, 0.9) for the simulated fish.

3.2 Experiment A: Optimization of visual parameters

In Experiment A (Phase 1), we systematically evaluated stimulus salience by varying the background color and fish-image size to maximize the responsiveness of the live fish. As described in Section 2.5.2, these trials were performed with a group of three individuals ($N_r = 3$) using a single fixed-formation agent operating in the Global mode.

Table 1 summarizes the guidance performance metrics for each condition. The results for background color indicated that the white background was the most effective in biasing the school's position. The white condition yielded the highest occupancy ratio in the target area (24.23%) and the largest separability between the target and opposite areas. This trend is consistently reflected in the Bhattacharyya distance, where the white background (0.1589) markedly outperformed the black background (0.0089). The positional histograms (Fig. 6a) confirm a clear shift in the school's distribution toward the target direction, particularly during leftward guidance, under the white and gray conditions. However, the distribution remained largely centered in the black condition.

Table 1: Summary of guidance performance in Experiment A. Area occupancy ratios represent the total percentage of time spent by the school in each zone. Larger Bhattacharyya distances indicate higher guidance efficacy.

Parameter        | Condition     | Target area (%) | Opposite area (%) | Bhattacharyya distance
Background color | white         | 24.23           | 9.53              | 0.1589
Background color | gray          | 21.10           | 10.01             | 0.1055
Background color | black         | 8.93            | 7.54              | 0.0089
Fish-image size  | small (0.6×)  | 18.15           | 16.55             | 0.0070
Fish-image size  | medium (1.0×) | 18.63           | 9.27              | 0.0595
Fish-image size  | large (1.5×)  | 22.57           | 7.79              | 0.1616

Regarding fish-image size, the large configuration yielded the most pronounced guidance effect. The large fish-image size (1.5×) achieved a target area occupancy ratio of 22.57%, outperforming the medium (18.63%) and small (18.15%) sizes. Notably, the Bhattacharyya distance for the large size (0.1616) was markedly higher than those for the medium (0.0595) and small (0.0070) sizes. The histograms shown in Fig. 6b further illustrate that the large size induced a more robust and consistent bias in the individuals' horizontal positions.

Figure 6: Positional histograms of the school's centroid for the visual parameters tested in Experiment A. (a) Background color comparison. (b) Fish-image size comparison (small: 0.6×, medium: 1.0×, large: 1.5×). Each plot shows the overlay of leftward and rightward guidance results.

Overall, the results of Experiment A demonstrate that a white background and a large fish-image size maximize behavioral responsiveness in our physical environment. Consequently, these optimal visual parameters were standardized for all subsequent trials in Experiment B.

3.3 Experiment B: Performance of the closed-loop guidance

In Experiment B (Phase 2), we evaluated the guidance performance focusing on the influence of agent configurations and group sizes ($N_r = 5, 8$). Based on the findings from Experiment A, all trials were conducted using a white background and the large fish-image size.

3.3.1 Guidance efficacy and group-size dependence

Table 2 summarizes the performance metrics for Experiment B. For groups of $N_r = 5$ individuals, the fixed-formation baseline (Global mode) achieved the highest target area occupancy (20.71%) and the largest Bhattacharyya distance (0.1197). While the independently controlled agents ($N_v = 2, 3$, Cluster-assignment mode) were intended to improve guidance by following sub-groups, they yielded slightly lower performance (Bhattacharyya distances: 0.0702 for $N_v = 2$ and 0.0998 for $N_v = 3$).

Table 2: Summary of guidance performance in Experiment B. Results for group sizes $N_r = 5$ and $N_r = 8$ are compared across different agent configurations.

N_r | Configuration (Mode)          | Target area (%) | Opposite area (%) | Bhattacharyya distance
5   | Fixed (Global)                | 20.71           | 9.48              | 0.1197
5   | Independent N_v = 2 (Cluster) | 16.57           | 12.39             | 0.0702
5   | Independent N_v = 3 (Cluster) | 19.92           | 10.99             | 0.0998
8   | Fixed (Global)                | 14.15           | 10.97             | 0.0683
8   | Independent N_v = 2 (Cluster) | 13.06           | 11.23             | 0.0406
8   | Independent N_v = 3 (Cluster) | 14.53           | 10.02             | 0.0435

The positional histograms (Fig. 7) reflect these results. At $N_r = 5$ (Fig. 7a), the distributions for all configurations show a visible shift toward the target direction, with only marginal visual differences between the modes. However, as the group size increased to $N_r = 8$ (Fig. 7b), the distributions became markedly more centralized across all agent configurations, and the Bhattacharyya distances dropped significantly. This suggests that the efficacy of external visual stimuli is severely limited as internal social interactions within the school intensify in larger group sizes.

Figure 7: Positional histograms of the school's centroid for Experiment B. The columns compare (a) $N_r = 5$ and (b) $N_r = 8$ conditions. The rows represent the fixed-formation baseline and the independent ($N_v = 2, 3$) configurations. Each plot shows the overlay of leftward and rightward guidance results.

3.3.2 Temporal stability and agent coordination

To clarify the performance gap between the configurations at $N_r = 5$, we analyzed the temporal stability of the school's response across 90-step sub-intervals. Fig. 8 compares the horizontal position distributions for the fixed-formation baseline and the independent configuration ($N_v = 3$). In the fixed-formation baseline (Fig. 8a), the school's distribution was consistently biased toward the target direction across nearly all sub-intervals, indicating highly stable guidance. In contrast, the independent agents ($N_v = 3$) exhibited intermittent failures (Fig. 8b), where the school remained in the central area during specific sub-intervals. This instability suggests that the uncoordinated movements of multiple virtual agents, potentially coupled with frequent switching of their target clusters, may have confused the real fish rather than providing a clear social signal. These results indicate that individual optimization of agent policies is insufficient for the guidance of collective behavior, highlighting the necessity of cooperative multi-agent control.
Figure 8: Spatiotemporal evolution of group centroids ($N_r = 5$). Each row displays four independent trials for (a) the baseline (fixed formation) and (b) the independent agent ($N_v = 3$) configuration. Horizontal and vertical axes denote horizontal coordinates and elapsed sub-intervals, respectively. Target direction is indicated by red triangles and box colors (orange: leftward, blue: rightward).

4 Discussion

4.1 Effectiveness of the multi-objective reward design

A key component of our reinforcement learning framework is the multi-objective formulation of the guidance task, which balances directional guidance and social cohesion. These objectives are integrated into a single scalar reward through a weighted combination, resulting in a composite reward function for policy learning. Simulation results (Fig. 5) show that this formulation enables stable policy acquisition even under stochastic behavioral conditions, and the physical experiments indicate that the learned policy transfers effectively to real-world settings. These findings suggest that appropriately balancing competing objectives is important for achieving robust closed-loop guidance of biological collectives.

4.2 Optimization of visual stimuli for collective guidance

The results of Experiment A showed that a white background and a large fish-image size yielded the most effective conditions for guiding rummy-nose tetras. The effectiveness of the white background likely arises from the high visual contrast it provides, which enhances the saliency of the virtual agents against the experimental environment. In contrast, under the black background condition, the school remained mainly in the intermediate area (the region between the target and opposite areas) (Table 1, Fig. 6a). This suggests that the low ambient brightness of the black environment may have suppressed the fish's general activity or exploratory behavior toward the edges of the tank, thereby diminishing the overall guidance efficacy.

Regarding the influence of fish-image size, the largest fish-image size, corresponding to approximately 1.5 times the size of real individuals, yielded the highest guidance efficacy (Table 1).
While this might suggest that larger-than-real stimuli are more effective, direct size comparisons should be interpreted with caution, as the virtual agents are presented on a screen with a minimum separation of approximately 47 mm from the fish. Nevertheless, the comparison across the three tested sizes (small, medium, and large) revealed a clear performance trend (Fig. 6b), indicating that larger stimuli provide a more salient cue that promotes collective directional changes.

4.3 Group-size dependent limitations of visual guidance

Guidance performance markedly degraded as the group size increased from $N_r = 5$ to $N_r = 8$ (Table 2, Fig. 7). This decline indicates that the influence of external visual signals is not absolute but competes with internal social forces. As group size increases, social interactions such as alignment and attraction to neighbors likely outweigh the visual cues provided by the virtual agents.

Furthermore, sensory competition may contribute to this limitation. While our system provides visual feedback, real fish also rely on the lateral line system to sense hydrodynamic changes [20, 32]. As group size increases and the effective density rises, hydrodynamic cues from nearby individuals become more pronounced, potentially attenuating visual information from the screen. This limitation underscores a fundamental challenge: integrating multi-modal stimuli to maintain influence over larger, more cohesive groups.

4.4 Challenges in multi-agent control and coordination

The independent multi-agent configuration ($N_v = 2, 3$) did not outperform the fixed-formation baseline (Global mode) (Table 2). This difference may be explained by several factors. First, the fixed-formation baseline utilized four fish images controlled as a single unit, whereas the independent configurations used only two or three fish images (Section 2.5.2). This larger number of visual stimuli in the baseline likely increased overall salience, providing a more robust and easily recognizable directional signal.

Beyond stimulus strength, the lack of coordination among agents may have been detrimental. Since each agent's policy was trained in isolation, their uncoordinated movements likely appeared as "social noise" to the school, which may require a coherent signal to maintain collective motion. Additionally, the frequent switching of target clusters in the Cluster-assignment mode likely caused abrupt changes in agent trajectories, potentially confusing the real fish rather than inducing stable following behavior. These results indicate that individual optimization is insufficient; guiding fragmented or heterogeneous groups may require cooperative multi-agent reinforcement learning (MARL) [33], where agents learn a joint policy to provide a unified guidance signal.

4.5 Environmental asymmetry and technical constraints

Although the closed-loop system achieved the desired guidance, we observed a slight positional bias toward the left side of the tank. This asymmetry likely stems from subtle environmental factors within the experimental setup, such as inhomogeneous lighting or feeding habituation.
While these factors do not undermine the overall effectiveness of the system, the cause of this bias should be further investigated and mitigated in future studies to ensure a more balanced and controlled experimental environment. In addition to these environmental factors, technical constraints also remain, such as the 2D nature of the stimuli and the lack of hydrodynamic feedback. Future research will explore the implementation of cooperative MARL and the integration of physical robotic agents to overcome these limitations and extend guidance capabilities to even larger and more complex schools.

5 Conclusion

In this study, we proposed and evaluated a deep reinforcement learning framework for the real-time, closed-loop guidance of fish schools. By employing Proximal Policy Optimization (PPO) and a composite reward function that balances directional guidance with social cohesion, we developed an autonomous controller capable of guiding biological collectives. Our methodology demonstrates a bridge between simulation-based training and real-world application, achieving effective zero-shot transfer to physical experiments with rummy-nose tetras.

Our findings reveal important factors governing the efficacy of artificial social influence. We found that the salience of visual stimuli, specifically background contrast and stimulus size, plays a significant role in maximizing the responsiveness of the fish school. Furthermore, our evaluation across group sizes and agent configurations highlights a fundamental trade-off: while virtual agents can effectively guide smaller groups, their influence is challenged by intensifying intrinsic social interactions and potential sensory competition in larger groups. The superiority of fixed formations over uncoordinated independent agents further emphasizes that coherent collective signals are more effective social stimuli than individually optimized but unaligned behaviors.

This work establishes a scalable foundation for the automated guidance of biological groups. Future research will focus on implementing cooperative multi-agent reinforcement learning (MARL) to facilitate the coordination of agent actions in response to diverse group dynamics, including fragmented sub-groups. Such coordination will be necessary to provide a coherent and robust social signal and, alongside the integration of multi-modal stimuli such as hydrodynamic feedback, to maintain influence in dense and complex biological environments.

Ethics

All animal experiments were conducted with the approval of the Graduate School of Information Science, University of Hyogo (Approval No. UHIS-EC-2024-004).

Conflicts of interest

The authors declare no personal or financial competing interests.

Funding

This study was supported by the JSPS KAKENHI Grant Number JP21H05302.

Acknowledgments

The authors are grateful to Yusuke Nishii for developing the foundational framework [28] upon which this study builds.

References

[1] Craig W. Reynolds. Flocks, herds and schools: A distributed behavioral model. Comput. Graph., 21(4):25–34, 1987. doi: 10.1145/37402.37406.

[2] Ian D. Couzin, Jens Krause, Richard James, Graeme D. Ruxton, and Nigel R. Franks. Collective memory and spatial sorting in animal groups. Journal of Theoretical Biology, 218(1):1–11, 2002. doi: 10.1006/jtbi.2002.3065.
[3] Julia K. Parrish, Steven V. Viscido, and Daniel Grünbaum. Self-organized fish schools: An examination of emergent properties. The Biological Bulletin, 202(3):296–305, 2002. doi: 10.2307/1543482.

[4] M. Ballerini, N. Cabibbo, R. Candelier, A. Cavagna, E. Cisbani, I. Giardina, V. Lecomte, A. Orlandi, G. Parisi, A. Procaccini, M. Viale, and V. Zdravkovic. Interaction ruling animal collective behavior depends on topological rather than metric distance: Evidence from a field study. Proceedings of the National Academy of Sciences, 105(4):1232–1237, 2008. doi: 10.1073/pnas.0711437105.

[5] Andras Czirok and Tamas Vicsek. Collective behavior of interacting self-propelled particles. Physica A: Statistical Mechanics and its Applications, 281(1-4):17–29, 2000. doi: 10.1016/S0378-4371(00)00013-3.

[6] Iain D. Couzin, Jens Krause, Nigel R. Franks, and Simon Levin. Effective leadership and decision-making in animal groups on the move. Nature, 433(7025):513–516, 2005. doi: 10.1038/nature03236.

[7] James E. Herbert-Read, Andrea Perna, Richard P. Mann, Timothy M. Schaerf, David J. T. Sumpter, and Ashley J. W. Ward. Inferring the rules of interaction of shoaling fish. Proceedings of the National Academy of Sciences, 108(46):18726–18731, 2011. doi: 10.1073/pnas.1109355108.

[8] Yael Katz, Kolbjørn Tunstrøm, Christos C. Ioannou, Cristián Huepe, and Iain D. Couzin. Inferring the structure and dynamics of interactions in schooling fish. Proceedings of the National Academy of Sciences, 108(46):18720–18725, 2011. doi: 10.1073/pnas.1107583108.

[9] Kolbjørn Tunstrøm, Yael Katz, Christos C. Ioannou, Cristián Huepe, Matthew J. Lutz, and Iain D. Couzin. Collective states, multistability and transitional behavior in schooling fish. PLoS Computational Biology, 9(2):e1002915, 2013. doi: 10.1371/journal.pcbi.1002915.

[10] Yogo Takada, Yukinobu Nakanishi, Ryosuke Araki, Motohiro Nonogaki, and Tomoyuki Wakisaka. Effect of material and thickness about tail fins on propulsive performance of a small fish robot. Journal of Aero Aqua Bio-mechanisms, 1(1):51–56, 2010. doi: 10.5226/jabmech.1.51.

[11] Jun Shintake, Herbert Shea, and Dario Floreano. Biomimetic underwater robots based on dielectric elastomer actuators. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4957–4962, Daejeon, South Korea, 2016. IEEE. doi: 10.1109/IROS.2016.7759728.

[12] Florian Berlinger, Jeff Dusek, Melvin Gauci, and Radhika Nagpal. Robust maneuverability of a miniature, low-cost underwater robot using multiple fin actuation. IEEE Robotics and Automation Letters, 3(1):140–147, 2018. doi: 10.1109/LRA.2017.2734969.

[13] Takuya Aritani, Naoki Kawasaki, and Yogo Takada. Small robotic fish with two magnetic actuators for autonomous tracking of a goldfish. Journal of Aero Aqua Bio-mechanisms, 8(1):69–74, 2019. doi: 10.5226/jabmech.8.69.

[14] Xingyu Chen, Junzhi Yu, Zhengxing Wu, Yan Meng, and Shihan Kong. Toward a maneuverable miniature robotic fish equipped with a novel magnetic actuator system. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 50(7):2327–2337, 2020. doi: 10.1109/TSMC.2018.2812903.

[15] Florian Berlinger, Melvin Gauci, and Radhika Nagpal. Implicit coordination for 3D underwater collective behaviors in a fish-inspired robot swarm. Science Robotics, 6(50), 2021. doi: 10.1126/scirobotics.abd8668.
[16] Daniel T. Swain, Iain D. Couzin, and Naomi Ehrich Leonard. Real-time feedback-controlled robotic fish for behavioral experiments with fish schools. Proceedings of the IEEE, 100(1):150–163, 2012. doi: 10.1109/JPROC.2011.2165449.

[17] Vladislav Kopman, Jeffrey Laut, Giovanni Polverino, and Maurizio Porfiri. Closed-loop control of zebrafish response using a bioinspired robotic-fish in a preference test. Journal of The Royal Society Interface, 10(78), 2013. doi: 10.1098/rsif.2012.0540.

[18] Frank Bonnet, Alexey Gribovskiy, José Halloy, and Francesco Mondada. Closed-loop interactions between a shoal of zebrafish and a group of robotic fish in a circular corridor. Swarm Intelligence, 12(3):227–244, 2018. doi: 10.1007/s11721-017-0153-6.

[19] Leo Cazenille, Yohann Chemtob, Frank Bonnet, Alexey Gribovskiy, Francesco Mondada, Nicolas Bredeche, and Jose Halloy. How to blend a robot within a group of zebrafish: Achieving social acceptance through real-time calibration of a multi-level behavioural model. Biomimetic and Biohybrid Systems, Lecture Notes in Computer Science, 10928:73–84, 2018. doi: 10.1007/978-3-319-95972-6_9.

[20] Liang Li, Máté Nagy, Jacob M. Graving, Joseph Bak-Coleman, Guangming Xie, and Iain D. Couzin. Vortex phase matching as a strategy for schooling in robots and in fish. Nature Communications, 11(1):5408, 2020. doi: 10.1038/s41467-020-19086-0.

[21] Tomohiro Nakayasu and Eiji Watanabe. Biological motion stimuli are attractive to medaka fish. Animal Cognition, 17(3):559–575, 2014. doi: 10.1007/s10071-013-0687-y.

[22] Hiroaki Kawashima, Yu Kanechika, and Takashi Matsuyama. Camera-display system for the interaction analysis of live fish vs fish-like graphics. The 17th Meeting on Image Recognition and Understanding, 2014.

[23] Bertrand Lemasson, Colby Tanner, Christa Woodley, Tammy Threadgill, Shea Qarqish, and David Smith. Motion cues tune social influence in shoaling fish. Scientific Reports, 8(1):9785, 2018. doi: 10.1038/s41598-018-27807-1.

[24] James Miles, Andrew S. Vowles, and Paul S. Kemp. The role of collective behaviour in fish response to visual cues. Behavioural Processes, 220:105079, 2024. doi: 10.1016/j.beproc.2024.105079.

[25] Liang Li, Mate Nagy, Guy Amichay, Wei Wang, Oliver Deussen, Daniela Rus, and Iain Couzin. Reverse engineering the control law for schooling in zebrafish using virtual reality. Science Robotics, 10(101), 2025. doi: 10.1126/scirobotics.adq6784.

[26] Raj Rajeshwar Malinda, Saeko Takizawa, Akiyuki Koyama, Takayuki Niizato, Hitoshi Habe, and Hiroaki Kawashima. Speed-controlled visual stimuli modulate fish collective dynamics. bioRxiv preprint, 2025. doi: 10.64898/2025.12.05.692523.

[27] Hiroaki Kawashima, Raj Rajeshwar Malinda, and Saeko Takizawa. Modeling and analysis of fish interaction networks under projected visual stimuli. Proceedings of the Joint Symposium of AROB 31st and ISBC 11th (AROB-ISBC 2026), 2026. doi: 10.48550/arXiv.2603.01682.

[28] Yusuke Nishii and Hiroaki Kawashima. Controlling fish schools via reinforcement learning of virtual fish movement. arXiv preprint, 2026. doi: 10.48550/arXiv.2603.16384. (English translation of the bachelor's thesis by Yusuke Nishii originally submitted in 2018.)

[29] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint, 2017. doi: 10.48550/arXiv.1707.06347.
[30] Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. YOLOv10: Real-time end-to-end object detection. arXiv preprint, 2024. doi: 10.48550/arXiv.2405.14458.

[31] Daniel S. Calovi, Alexandra Litchinko, Valentin Lecheval, Ugo Lopez, Alfonso Pérez Escudero, Hugues Chaté, Clément Sire, and Guy Theraulaz. Disentangling and modeling interactions in fish with burst-and-coast swimming reveal distinct alignment and attraction behaviors. PLOS Computational Biology, 14(1):e1005933, 2018. doi: 10.1371/journal.pcbi.1005933.

[32] Hungtang Ko, George Lauder, and Radhika Nagpal. The role of hydrodynamics in collective motions of fish schools and bioinspired underwater robots. Journal of The Royal Society Interface, 20(207):20230357, 2023. doi: 10.1098/rsif.2023.0357.

[33] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative, multi-agent games. arXiv preprint, 2021. doi: 10.48550/arXiv.2103.01955.
