A Deep Reinforcement Learning Framework for Closed-loop Guidance of Fish Schools via Virtual Agents


Authors: Takato Shibayama, Hiroaki Kawashima*

Graduate School of Information Science, University of Hyogo, Kobe, Japan
* Corresponding author: kawashima@gsis.u-hyogo.ac.jp

Abstract

Guiding collective motion in biological groups is a fundamental challenge in understanding social interaction rules and developing automated systems for animal management. In this study, we propose a deep reinforcement learning (RL) framework for the closed-loop guidance of fish schools using virtual agents. These agents are controlled by policies trained via Proximal Policy Optimization (PPO) in simulation and deployed in physical experiments with rummy-nose tetras (Petitella bleheri), enabling real-time interaction between artificial agents and live individuals. To cope with the stochastic behavior of live individuals, we design a composite reward function to balance directional guidance with social cohesion. Our systematic evaluation of visual parameters shows that a white background and larger stimulus sizes maximize guidance efficacy in physical trials. Furthermore, evaluation across group sizes revealed that while the system demonstrates effective guidance for groups of five individuals, this capability markedly degrades as group size increases to eight. This study highlights the potential of deep RL for automated guidance of biological collectives and identifies challenges in maintaining artificial influence in larger groups.

1 Introduction

Collective behavior in biological systems, such as fish schooling and bird flocking, emerges from local interactions among individuals, leading to complex and coordinated group-level patterns [1, 2, 3, 4, 5]. These behaviors allow groups to respond rapidly to environmental stimuli, such as predator threats, through the instantaneous transmission of information across the collective [6, 7, 8, 9]. Understanding and guiding these dynamics are of significant interest in both fundamental biology and practical applications, such as automated aquaculture management and the development of bio-inspired underwater robotics [10, 11, 12, 13, 14, 15].

To influence or control the motion of a collective, researchers have developed various biomimetic agents, including robotic fish [16, 17, 18, 19, 20] and visual stimuli displayed on screens or by projection [21, 22, 23, 24, 25, 26, 27]. These tools allow for "causal" investigations by decoupling specific social cues. Among these, closed-loop systems, where artificial agents respond in real time to the actions of live animals, have emerged as a powerful tool for investigating social interaction mechanisms [16, 17, 18, 19, 25].

A major challenge in designing effective closed-loop controllers for biological agents lies in modeling stochastic and nonlinear collective behavior. To alleviate the reliance on precise analytical models, model-free reinforcement learning (RL) has been proposed as a promising framework [28]. While prior work demonstrated the feasibility of fish guidance using Q-learning, its scope was limited to small, highly cohesive groups ($N_r = 3$) represented by a single centroid [28]. Such approaches rely on discretized state-space representations and reward structures limited to evaluating final outcomes, which may not scale effectively to larger collectives.
This study expands upon this foundation by implementing an adaptive controller based on Proximal Policy Optimization (PPO), a state-of-the-art deep reinforcement learning approach [29]. As the group size increases, collective dynamics become significantly more complex, often splitting into multiple sub-groups, which makes simple discretized representations insufficient. To provide the necessary capacity for future scalability and more granular feedback control, we transition to a continuous state-space representation through PPO. To the best of our knowledge, this is one of the first studies to apply PPO-based virtual agents to the real-time closed-loop guidance of biological collectives.

By introducing a multi-objective reward function that balances group integrity with directional guidance, we facilitate the stable acquisition of interaction policies in a simulation environment. Our methodological approach establishes a robust bridge between computational learning and real-world biological interaction by training agents in a simulation and subsequently deploying them in physical experiments. We evaluate our framework using rummy-nose tetras (Petitella bleheri).

The investigation is conducted in two phases. First, we utilize small groups ($N_r = 3$) to systematically evaluate and optimize the virtual agents' visual parameters, specifically background color and stimulus size, to maximize their salience to the target species. Second, we evaluate the robustness and scalability of the proposed system across varying group sizes ($N_r = 5$ and $N_r = 8$) and agent configurations. Specifically, we evaluate the guidance performance under several agent configurations, including independently controlled agents, to compare their effectiveness across different group sizes. Our findings indicate that as group size increases, the efficacy of directional guidance faces significant challenges, likely due to interference between artificial visual stimuli and intrinsic social interactions. This work highlights both the potential of deep RL for automated animal guidance and the fundamental challenges of maintaining influence within dense biological environments.

2 Methods

2.1 Experimental setup

We used rummy-nose tetras (Petitella bleheri, recently reclassified from Hemigrammus bleheri) as the experimental subjects. This species was selected due to its strong schooling tendency and its suitability for laboratory maintenance.

The closed-loop guidance system developed in this study integrated real-time visual tracking of live fish with the application of control policies for virtual agents, as illustrated in the system architecture (Fig. 1). A front-facing camera captured the positions of the live fish, which were processed by a PC to determine the movements of the virtual agents based on trained reinforcement learning policies. These agents were then presented to the fish in real time via a liquid crystal display. The camera and display were aligned to be parallel to the tank surface, minimizing geometric and perspective distortions between the image and display coordinate systems.
The experimental arena consisted of an acrylic tank with internal dimensions of 389 × 213 × 89 mm (width × height × depth), where depth refers to the front-to-back dimension. As illustrated in the top-view schematic (Fig. 2), we used a 2 mm-thick acrylic partition to divide the tank into two sections with depths of 47 mm and 40 mm; the fish were placed in the 40 mm-deep section to constrain their swimming movements to a quasi-two-dimensional plane. Virtual agents were presented on a liquid crystal display (221E9/11, Philips; 483 × 270 mm) mounted flush against the rear exterior wall of the tank. This spatial configuration ensured that the swimming region for the live fish was positioned at a sufficient distance from the display to prevent the stimuli from being obscured from the fish's perspective by reflections at the tank-water interface. A front-facing camera was used to monitor the individuals in real time.

Figure 1: Schematic diagram of the closed-loop system architecture. The positions of the live fish are monitored by a front-facing camera and processed by a PC to apply the learned agent policies, which are then rendered as virtual agents on the display.

Figure 2: Top-view schematic of the experimental setup. The hatched area indicates the 40 mm-deep (front-to-back distance) section where the live fish were constrained. The partition ensures two-dimensional movement and visibility of the displayed virtual agents by avoiding reflections at the tank-water interface from the fish's perspective.

2.2 Real-time vision and coordinate mapping

To achieve real-time closed-loop interaction, we implemented an automated tracking system using YOLOv10 [30]. The detection model was fine-tuned specifically to identify rummy-nose tetras within the experimental arena. The system captures the positions of all individuals at 10 fps.

The raw coordinates $(u_j, v_j)$ of each fish $j \in \{1, \ldots, N_r\}$ detected in the camera frame are mapped to a normalized tank coordinate system $\mathbf{x}^{(r)}_j \in [0, 1]^2$. Before the experiments, the coordinates of the top-left $\mathbf{u}_{\mathrm{tank0}} = (u_{\mathrm{tank0}}, v_{\mathrm{tank0}})^\top$ and bottom-right $\mathbf{u}_{\mathrm{tank1}} = (u_{\mathrm{tank1}}, v_{\mathrm{tank1}})^\top$ corners of the swimming region in the camera image were recorded. The normalized position $\mathbf{x}^{(r)}_j$ is then calculated as follows:

$$
\mathbf{x}^{(r)}_j = \begin{pmatrix} (u_j - u_{\mathrm{tank0}}) / (u_{\mathrm{tank1}} - u_{\mathrm{tank0}}) \\ (v_j - v_{\mathrm{tank0}}) / (v_{\mathrm{tank1}} - v_{\mathrm{tank0}}) \end{pmatrix}. \tag{1}
$$

To project the virtual agents managed in the RL environment onto the display at the correct physical locations, we establish a mapping between the camera and display coordinate systems. A set of reference spots $S$ is projected onto the display, and their corresponding homogeneous coordinates in the camera frame, $F = \{(u_1, v_1, 1)^\top, \ldots, (u_n, v_n, 1)^\top\}$, are captured. The $2 \times 3$ transformation matrix $A$ is then determined by $A = S F^\top (F F^\top)^{-1}$.
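The calibration above reduces to a small least-squares fit. The following is a minimal NumPy sketch of Eq. (1) and the estimation of $A$; function and variable names are illustrative and not taken from the actual implementation.

```python
import numpy as np

def fit_camera_to_display(display_spots, camera_spots):
    """Fit the 2x3 affine map A = S F^T (F F^T)^(-1).

    display_spots: (n, 2) pixel coordinates of the reference spots shown on the display (S).
    camera_spots:  (n, 2) coordinates of the same spots detected in the camera frame.
    """
    S = np.asarray(display_spots, dtype=float).T              # shape (2, n)
    F = np.vstack([np.asarray(camera_spots, dtype=float).T,   # shape (3, n), homogeneous
                   np.ones(len(camera_spots))])
    return S @ F.T @ np.linalg.inv(F @ F.T)                   # shape (2, 3)

def normalize_position(uv, u_tank0, u_tank1):
    """Map a raw camera coordinate to the normalized tank frame [0, 1]^2 (Eq. 1)."""
    u0 = np.asarray(u_tank0, dtype=float)
    u1 = np.asarray(u_tank1, dtype=float)
    return (np.asarray(uv, dtype=float) - u0) / (u1 - u0)

# Example usage: map the recorded swimming-region corners into display pixels.
# A = fit_camera_to_display(display_spots, camera_spots)
# d_tank0 = A @ np.array([*u_tank0, 1.0])
# d_tank1 = A @ np.array([*u_tank1, 1.0])
```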
Using this matrix $A$, the display pixel coordinates corresponding to the corners of the swimming region, $\mathbf{d}_{\mathrm{tank}k} = (d_{\mathrm{tank}k}, e_{\mathrm{tank}k})^\top$ (for $k \in \{0, 1\}$), are obtained by $\mathbf{d}_{\mathrm{tank}k} = A \tilde{\mathbf{u}}_{\mathrm{tank}k}$, where $\tilde{\mathbf{u}}_{\mathrm{tank}k} = (u_{\mathrm{tank}k}, v_{\mathrm{tank}k}, 1)^\top$ represents the homogeneous coordinates of the corners. Finally, the normalized coordinates of a virtual agent $i$, denoted as $\mathbf{x}^{(v)}_i$, are converted into display pixel coordinates $(d_i, e_i)$ as follows:

$$
\begin{pmatrix} d_i \\ e_i \end{pmatrix} = \mathbf{x}^{(v)}_i \odot (\mathbf{d}_{\mathrm{tank1}} - \mathbf{d}_{\mathrm{tank0}}) + \mathbf{d}_{\mathrm{tank0}}, \tag{2}
$$

where $\odot$ denotes the element-wise product. This pipeline ensures that the visual stimuli are presented at precisely defined spatial positions relative to the live individuals, enabling accurate guidance based on social interactions.

2.3 Reinforcement learning framework

To develop an autonomous control policy for virtual agents capable of guiding fish schools, we employed Proximal Policy Optimization (PPO), a policy-based deep reinforcement learning algorithm. In this framework, an agent (i.e., a virtual agent presented on the display) is defined as a single policy-controlled unit that interacts with the environment at discrete time steps $t = 0, 1, \ldots$ to maximize rewards. Depending on the experimental configuration (see Section 2.5.2), the visual representation of an agent is rendered either as a single fish image or as a fixed formation of multiple fish images, where a fish image refers to an individual visual stimulus displayed on the screen. Each agent operates based on its own state observation and action output.

2.3.1 State and action space

The state vector observed by each virtual agent $i \in \{1, \ldots, N_v\}$ at time $t$ is defined by the coordinates of the real fish and the agent's own position:

$$
\mathbf{s}_{i,t} = [\mathbf{s}^{(r)\top}_{i,t}, \mathbf{s}^{(v)\top}_{i,t}]^\top, \tag{3}
$$

where $\mathbf{s}^{(v)}_{i,t}$ is the normalized coordinate $\mathbf{x}^{(v)}_i$ of the $i$-th virtual agent. While prior work represented the real fish collective using its global centroid [28], such an approach may lack the granularity required to manage fragmented sub-groups. To ensure a scalable and consistent representation of the target collective, we define the real fish information $\mathbf{s}^{(r)}_{i,t}$ as a 2D coordinate representing a specific guidance reference point. We implemented two modes for defining these reference points, depending on the experimental configuration (a code sketch of both modes is given at the end of this subsection):

• Global mode: The guidance reference point $\mathbf{s}^{(r)}_{i,t}$ is defined as the global centroid of the $N_r$ real individuals. In this study, this mode was applied to the single-agent scenarios ($N_v = 1$), or more generally, to cases where the school is treated as a single cohesive unit and all agents share the same reference point.

• Cluster-assignment mode: The real fish are partitioned into $k$ clusters using the k-means algorithm. This mode was applied to our multi-agent configurations ($N_v > 1$), where each virtual agent $i$ is assigned the centroid of a specific cluster $c_i$ as its guidance reference point $\mathbf{s}^{(r)}_{i,t}$. This mapping ensures that each agent maintains a fixed-length input and focuses on a localized sub-group, providing robustness against group fragmentation.

Each virtual agent $i$ outputs a discrete action $a \in \{0, 1, \ldots, 7\}$, which corresponds to eight movement directions. Based on the selected action, a target coordinate $\mathbf{x}^{(v)}_{i,\mathrm{target}}$ is determined for each virtual agent $i$.
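To make the two observation modes and the action space concrete, the sketch below builds the per-agent guidance reference point and converts a discrete action into a target coordinate. It assumes scikit-learn for the k-means step; the action step length and the cluster-to-agent assignment are illustrative simplifications rather than values specified in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def reference_points(fish_xy, n_agents, mode="global"):
    """One 2D guidance reference point per virtual agent (Section 2.3.1).

    fish_xy: (N_r, 2) normalized positions of the real fish.
    mode:    "global"  -> every agent observes the school centroid;
             "cluster" -> agent i observes the centroid of one k-means cluster.
    """
    fish_xy = np.asarray(fish_xy, dtype=float)
    if mode == "global" or n_agents == 1:
        return np.tile(fish_xy.mean(axis=0), (n_agents, 1))
    km = KMeans(n_clusters=n_agents, n_init=10).fit(fish_xy)
    return km.cluster_centers_   # cluster-to-agent assignment abstracted here

# Eight movement directions; action a in {0, ..., 7} indexes one unit direction.
DIRS = np.array([[np.cos(k * np.pi / 4), np.sin(k * np.pi / 4)] for k in range(8)])

def action_to_target(agent_xy, action, step=0.1):
    """Discrete action -> target coordinate, clipped to the tank [0, 1]^2.
    The step length 0.1 is an assumed illustrative value, not a reported setting."""
    target = np.asarray(agent_xy, dtype=float) + step * DIRS[action]
    return np.clip(target, 0.0, 1.0)
```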
In the physical experiments (Section 2.5), each fish image was horizontally flipped in real time to align with its instantaneous movement direction, ensuring a natural visual appearance. To ensure biologically plausible trajectories, the actual movement of virtual agent $i$ is modeled as a first-order lag system:

$$
\frac{d\mathbf{x}^{(v)}_i}{dt} = \frac{1}{\tau^{(v)}} \left( \mathbf{x}^{(v)}_{i,\mathrm{target}} - \mathbf{x}^{(v)}_i \right), \tag{4}
$$

where $\tau^{(v)} > 0$ is the time constant. This dynamic model allows the virtual agent to reflect the intermittent burst-and-coast movement and frequent directional shifts characteristic of rummy-nose tetras [31].

2.3.2 Multi-objective reward design

To develop an autonomous policy for active guidance, we define a composite reward function $r_\beta$. We first consider a baseline reward $r_{\mathrm{base}} \in [-1, 1]$, based solely on $c^{(r)}_x$, the horizontal position of the collective's global centroid [28]:

$$
r_{\mathrm{base}} = 1 - 2\,|c^{(r)}_x - x_{\mathrm{target\text{-}end}}|, \tag{5}
$$

where $x_{\mathrm{target\text{-}end}} \in \{0, 1\}$ represents the target end of the tank (0 for leftward and 1 for rightward guidance). However, using $r_{\mathrm{base}}$ alone can lead to guidance failure. Since rummy-nose tetras are naturally exploratory, they might move toward the target area independently. In such cases, if the reward depends only on the school's location, the RL agent receives positive reinforcement without actually exerting guidance control, failing to learn appropriate guiding behaviors.

To address this, we defined $r_\beta$ as a weighted sum of social cohesion and directional guidance using a hyperparameter $\beta \in [0, 1]$:

$$
r_\beta = \beta\, r_{\mathrm{school}} + (1 - \beta)\, r_{\mathrm{direction}}, \tag{6}
$$

where both reward terms are normalized to the range $[-1, 1]$ to ensure a balanced contribution to the composite reward.

The social cohesion term $r_{\mathrm{school}}$ explicitly evaluates the social coupling between the live fish and the virtual agents. It rewards the agents for maintaining proximity to the real individuals, thereby facilitating social influence and preventing the agents from receiving rewards without exerting guidance control. Our formulation calculates the distance to the nearest virtual agent for each real individual:

$$
r_{\mathrm{school}} = 1 - \frac{\sqrt{2}}{N_r} \sum_{j=1}^{N_r} \min_{i \in \{1, \ldots, N_v\}} d\!\left(\mathbf{x}^{(r)}_j, \mathbf{x}^{(v)}_i\right), \tag{7}
$$

where $d(\mathbf{x}^{(r)}_j, \mathbf{x}^{(v)}_i)$ represents the Euclidean distance between real fish $j \in \{1, \ldots, N_r\}$ and virtual agent $i \in \{1, \ldots, N_v\}$. This term encourages virtual agents to maintain proximity to the real fish while simultaneously allowing multiple agents to distribute themselves to manage fragmented sub-groups, providing robustness against school splitting.

The directional guidance term $r_{\mathrm{direction}}$ evaluates the progress of the virtual agents themselves toward the target end of the tank:

$$
r_{\mathrm{direction}} = 1 - 2\,|c^{(v)}_x - x_{\mathrm{target\text{-}end}}|, \tag{8}
$$

where $c^{(v)}_x$ is the horizontal coordinate of the virtual agents' centroid. The overall structure of this multi-objective reward is illustrated in Fig. 3.

Figure 3: Conceptual diagram of the multi-objective reward design. The cohesion term $r_{\mathrm{school}}$ encourages the virtual agents to maintain proximity to the real fish, while $r_{\mathrm{direction}}$ rewards the progress of the virtual agents toward the target area.
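For reference, the reward terms in Eqs. (5)–(8) translate directly into a few lines of NumPy. The sketch below assumes normalized positions in $[0, 1]^2$ as defined above; it illustrates the formulas and is not the actual implementation.

```python
import numpy as np

def composite_reward(fish_xy, agent_xy, target_end, beta=0.3):
    """r_beta = beta * r_school + (1 - beta) * r_direction (Eqs. 6-8).

    fish_xy:    (N_r, 2) normalized positions of the real fish.
    agent_xy:   (N_v, 2) normalized positions of the virtual agents.
    target_end: 0.0 for leftward guidance, 1.0 for rightward guidance.
    """
    fish_xy = np.asarray(fish_xy, dtype=float)
    agent_xy = np.asarray(agent_xy, dtype=float)

    # r_school: for each real fish, distance to its nearest virtual agent (Eq. 7).
    dists = np.linalg.norm(fish_xy[:, None, :] - agent_xy[None, :, :], axis=-1)
    r_school = 1.0 - np.sqrt(2.0) * dists.min(axis=1).mean()

    # r_direction: progress of the agents' centroid toward the target end (Eq. 8).
    r_direction = 1.0 - 2.0 * abs(agent_xy[:, 0].mean() - target_end)

    return beta * r_school + (1.0 - beta) * r_direction

def baseline_reward(fish_xy, target_end):
    """Baseline reward using only the school centroid's horizontal position (Eq. 5)."""
    c_x = np.asarray(fish_xy, dtype=float)[:, 0].mean()
    return 1.0 - 2.0 * abs(c_x - target_end)
```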
The above formulation is presented in a general form applicable to multi-agent settings. During policy learning in simulation, however, we consider the single-agent case ($N_v = 1$), in which the reward terms reduce to forms defined with respect to the centroid of the real fish school.

2.4 Agent training in simulation

2.4.1 Motivation for simulation-based training

The primary motivation for employing a simulation environment is to acquire an optimal control policy through pre-training before deploying it in the physical environment. Reinforcement learning typically requires a massive number of interactions to converge, on the order of $10^6$ steps in this study, which is impractical to perform with live animals due to time constraints and the need to ensure animal welfare. Unlike previous work [28], our policy is fully acquired within the simulation and subsequently deployed in the physical environment without further weight updates or online learning. This approach allows us to rigorously evaluate the robustness of the policy and its capacity for zero-shot transfer from a virtual model to real biological systems.

2.4.2 Simulation setup

In the simulation phase, we utilize the Global mode as defined in Section 2.3.1. Individual real fish are not modeled explicitly; instead, the group is represented solely by its centroid. Specifically, the environment consists of one virtual agent and the simulated school centroid, with their states represented by normalized coordinates $\mathbf{x}^{(v)}$ and $\mathbf{c}^{(r)}$, respectively. Under this configuration, the reward $r_{\mathrm{school}}$ simplifies to $1 - \sqrt{2}\, d(\mathbf{c}^{(r)}, \mathbf{x}^{(v)})$, and $r_{\mathrm{direction}}$ is calculated based on the $x$-coordinate of the single virtual agent.

To approximate the continuous dynamics of the agents and the school centroid, the underlying simulation state is updated at a simulation time step of 0.1 s, while the agent selects a discrete action every 1.0 s of simulation time. This multi-rate update scheme allows the virtual agent to interact with the collective at a lower update frequency that better matches the characteristic behavioral timescale of the fish, while maintaining the fine temporal resolution required for smooth motion and stable numerical integration of the first-order lag dynamics.

2.4.3 Behavioral model for simulated real fish

The movement of the simulated fish school centroid $\mathbf{c}^{(r)}$ is governed by a stochastic behavioral model proposed in [28]. This model assumes that the school's motion consists of a sequence of discrete linear trajectories, reflecting the burst-and-coast swimming style of rummy-nose tetras. Under this framework, the school is assumed to behave as a highly cohesive unit, where individual movements are sufficiently synchronized to be effectively represented by their collective centroid.

To account for the non-deterministic nature of social interactions, we introduce an ignoring probability $p$, defined as the probability that the school ignores the virtual agent, following the approach in [28]. Incorporating this probabilistic element encourages the RL agent to develop robust policies that do not rely on guaranteed, deterministic reactions from the fish.

Each phase duration $\Delta t \in (0, \Delta t_{\max}]$ is uniformly sampled. During each phase, the velocity follows a first-order lag system:

$$
\frac{d\mathbf{c}^{(r)}}{dt} = \frac{1}{\tau^{(r)}} \left( \mathbf{c}^{(r)}_{\mathrm{target}} - \mathbf{c}^{(r)} \right), \tag{9}
$$

where $\mathbf{c}^{(r)}_{\mathrm{target}}$ is the target coordinate updated at the beginning of each phase based on the distance $d(\mathbf{c}^{(r)}, \mathbf{x}^{(v)})$ and the ignoring probability $p$, as follows (a simulation sketch combining these rules appears after this list):

• Reaction case: If $d(\mathbf{c}^{(r)}, \mathbf{x}^{(v)}) \le \theta$ (within the interaction range), the school reacts to the agent with probability $1 - p$, setting $\mathbf{c}^{(r)}_{\mathrm{target}} = \mathbf{x}^{(v)}$.

• Spontaneous movement: If $d(\mathbf{c}^{(r)}, \mathbf{x}^{(v)}) > \theta$, or with probability $p$ even within the interaction range, the target is updated by a random displacement:

$$
\mathbf{c}^{(r)}_{\mathrm{target}} = \mathbf{c}^{(r)} + \begin{pmatrix} \delta_x \\ \delta_y \end{pmatrix}, \tag{10}
$$

where $\delta_x$ and $\delta_y$ are sampled uniformly from $[-\delta_{x\max}, \delta_{x\max}]$ and $[-\delta_{y\max}, \delta_{y\max}]$, respectively.
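A compact way to see how the pieces fit together is a self-contained simulation environment combining the agent dynamics (Eq. (4)), the behavioral model (Eqs. (9)–(10)), and the multi-rate update of Section 2.4.2. The parameter values below ($\theta$, the time constants, $\Delta t_{\max}$, $\delta_{\max}$, and the action step length) are placeholders rather than the settings used in the experiments, so this should be read as an illustration only.

```python
import numpy as np

class SimulatedSchoolEnv:
    """Minimal sketch of the pre-training simulation (single agent, Global mode)."""

    def __init__(self, p_ignore=0.6, theta=0.2, tau_v=0.5, tau_r=0.5, dt=0.1,
                 action_period=1.0, dt_phase_max=2.0, delta_max=0.2, seed=0):
        self.rng = np.random.default_rng(seed)
        self.p, self.theta = p_ignore, theta
        self.tau_v, self.tau_r, self.dt = tau_v, tau_r, dt
        self.n_sub = int(round(action_period / dt))      # 1.0 s action, 0.1 s physics
        self.dt_phase_max, self.delta_max = dt_phase_max, delta_max
        self.dirs = np.array([[np.cos(k * np.pi / 4), np.sin(k * np.pi / 4)]
                              for k in range(8)])

    def reset(self):
        self.x_v = self.rng.uniform(0.0, 1.0, 2)   # virtual agent (assumed random init)
        self.c_r = self.rng.uniform(0.0, 1.0, 2)   # simulated school centroid
        self.c_r_target = self.c_r.copy()
        self.phase_left = 0.0
        return np.concatenate([self.c_r, self.x_v])  # state as in Eq. (3), Global mode

    def _new_phase(self):
        """Start a new movement phase: react to the agent or move spontaneously."""
        self.phase_left = self.rng.uniform(0.0, self.dt_phase_max)
        near = np.linalg.norm(self.c_r - self.x_v) <= self.theta
        if near and self.rng.random() > self.p:      # reaction with probability 1 - p
            self.c_r_target = self.x_v.copy()
        else:                                        # spontaneous random displacement
            self.c_r_target = np.clip(
                self.c_r + self.rng.uniform(-self.delta_max, self.delta_max, 2), 0.0, 1.0)

    def step(self, action, target_end=1.0, beta=0.3, step_len=0.1):
        x_v_target = np.clip(self.x_v + step_len * self.dirs[action], 0.0, 1.0)
        for _ in range(self.n_sub):                  # multi-rate inner integration
            if self.phase_left <= 0.0:
                self._new_phase()
            self.phase_left -= self.dt
            self.x_v += self.dt / self.tau_v * (x_v_target - self.x_v)       # Eq. (4)
            self.c_r += self.dt / self.tau_r * (self.c_r_target - self.c_r)  # Eq. (9)
        r_school = 1.0 - np.sqrt(2.0) * np.linalg.norm(self.c_r - self.x_v)
        r_direction = 1.0 - 2.0 * abs(self.x_v[0] - target_end)
        reward = beta * r_school + (1.0 - beta) * r_direction
        return np.concatenate([self.c_r, self.x_v]), reward
```

An interface like this can be wrapped as a Gymnasium environment and trained with an off-the-shelf PPO implementation such as Stable-Baselines3; the text does not specify which PPO implementation was used.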
2.4.4 Evaluation procedure

The performance of the acquired policy is evaluated over a validation period of $T' = 5000$ steps with the learned policy parameters (network weights) held fixed. To ensure a consistent comparison between agents trained with different values of $\beta$, the evaluation metric $R$ is defined as the time average of the baseline reward $r_{\mathrm{base}}$ (Eq. (5)):

$$
R = \frac{1}{T'} \sum_{t=1}^{T'} r_{\mathrm{base}}(t). \tag{11}
$$

As $r_{\mathrm{base}}$ depends only on $c^{(r)}_x$, the horizontal coordinate of the real school centroid, it provides an objective benchmark for identifying the policy that yields the most effective guidance behavior.

2.5 Physical experiment protocols and evaluation

2.5.1 Guidance protocol

Based on the swimming speed of rummy-nose tetras, the duration of each control step for the virtual agents was set to 1.2 s for all physical experiments. Each experimental session consisted of 900 steps (approximately 18 minutes), with the target direction (the left or right end of the tank) switching every 90 steps to evaluate the adaptive response of the school. To ensure the robustness and reproducibility of the results, four independent trials were conducted for each experimental configuration, performed at various times and across multiple dates.

Figure 4: Definition of evaluation areas for guidance tasks, shown for (a) rightward and (b) leftward guidance. The target area is defined as the 30% region from the target end, while the opposite area is the 30% region from the opposite end.

The fish were kept in a separate holding tank and were randomly selected and moved to the experimental arena only for the duration of the trials. To perform the guidance tasks in stable, still-water conditions, the water circulation system was temporarily suspended during all experiments. To ensure stable behavioral states, the fish were introduced to the experimental tank at least 90 minutes before the start of the trials. When switching between different experimental conditions, a 30-minute interval was maintained to minimize the carryover effects of previous stimuli.
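The guidance protocol can be summarized as the following control loop, reusing the reference_points and action_to_target helpers sketched in Section 2.3.1. The policy call, the detector, and the renderer are placeholders, and how the switching guidance direction is supplied to the policy is abstracted here, since it is not detailed in the text.

```python
import numpy as np

def run_session(policy, detect_fish, render_agents, n_steps=900, switch_every=90,
                step_duration=1.2, n_agents=1, mode="global", tau_v=0.5):
    """Sketch of one physical session: 900 control steps of 1.2 s each,
    with the target end switching every 90 steps."""
    agents_xy = np.full((n_agents, 2), 0.5)           # assumed initial placement
    n_sub = int(round(step_duration / 0.1))           # assumed 0.1 s inner time step
    for step in range(n_steps):
        # Alternate the guidance direction every 90 steps (starting side assumed).
        target_end = 1.0 if (step // switch_every) % 2 == 0 else 0.0
        fish_xy = detect_fish()                       # (N_r, 2) normalized positions
        refs = reference_points(fish_xy, n_agents, mode)
        for i in range(n_agents):
            obs = np.concatenate([refs[i], agents_xy[i]])   # state vector, Eq. (3)
            action = policy(obs, target_end)          # direction handling abstracted
            target = action_to_target(agents_xy[i], action)
            for _ in range(n_sub):                    # first-order lag toward target, Eq. (4)
                agents_xy[i] += 0.1 / tau_v * (target - agents_xy[i])
        render_agents(agents_xy)                      # map to display pixels via Eq. (2)
```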
2.5.2 Experimental design

The physical trials were conducted in two phases:

• Experiment A (Phase 1): This phase aimed to identify the experimental conditions that maximize the stimulus salience of the virtual agents. We systematically tested three background colors (white, gray, and black) and three fish-image sizes (small, medium, and large, which were approximately 0.6×, 1.0×, and 1.5× the size of real fish, respectively). These trials were performed with a group of three real individuals ($N_r = 3$) using a single fixed-formation virtual agent rendered as four fish images. The agent operated in the Global mode, targeting the centroid of the entire school as described in Section 2.3.1.

• Experiment B (Phase 2): Using the optimal visual parameters (background color and fish-image size) identified in Phase 1, we evaluated the guidance performance across different group sizes ($N_r = 5, 8$) and agent configurations. We compared the fixed-formation baseline used in Phase 1, treated as a single-unit agent ($N_v = 1$), with independently controlled agents ($N_v = 2, 3$). In contrast to the fixed-formation agent, the independent agents were each rendered as a single fish image and operated in the Cluster-assignment mode, where each agent targeted a localized sub-group as defined in Section 2.3.1.

2.5.3 Evaluation metrics

The efficacy of the guidance was quantified using the following three metrics (code sketches of the first two metrics follow this list):

1. Area occupancy ratio: The tank was divided into three functional zones based on the horizontal coordinate (Fig. 4): the target area (the 30% region nearest to the target end), the opposite area (the 30% region at the opposite end), and the intermediate area (the central 40%). The proportion of time spent by the fish in each area was calculated across the entire session.

2. Directional distribution and Bhattacharyya distance: We generated positional histograms of the school's horizontal centroid for both leftward and rightward guidance periods. To quantify the separability of these two distributions, we calculated the Bhattacharyya distance, where a larger value indicates more distinct guidance success.

3. Sub-interval distribution: To assess the stability of the guidance over time for representative configurations, the distribution of individual positions was visualized for each 90-step sub-interval using box plots.
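For reference, the first two metrics can be computed as follows from the recorded horizontal centroid positions. The histogram bin count is an assumed choice, since the text does not report the histogram resolution.

```python
import numpy as np

def bhattacharyya_distance(x_left, x_right, bins=20):
    """Bhattacharyya distance between the horizontal-centroid distributions recorded
    during leftward and rightward guidance periods (bin count is illustrative)."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(x_left, bins=edges)
    q, _ = np.histogram(x_right, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))        # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))

def area_occupancy(x_centroid, target_end):
    """Fraction of time in the target (nearest 30%), opposite (farthest 30%),
    and intermediate (central 40%) zones along the horizontal axis."""
    d = np.abs(np.asarray(x_centroid, dtype=float) - target_end)
    target = float(np.mean(d <= 0.3))
    opposite = float(np.mean(d >= 0.7))
    return {"target": target, "opposite": opposite,
            "intermediate": 1.0 - target - opposite}
```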
3 Results

3.1 Policy optimization through simulation

Prior to the physical experiments, we evaluated the effectiveness of the reinforcement learning framework in a simulation environment. The primary objectives were to determine the optimal weight $\beta$ for the composite reward function $r_\beta$ (Eq. (6)) and to ensure that the acquired policy remains robust across various levels of stochasticity in fish behavior, represented by the ignoring probability $p$. To account for the stochastic nature of the learning process, we performed 10 independent training runs for each parameter combination, and the mean evaluation value $\bar{R}$ was calculated across these trials.

Figure 5 illustrates the transition of the mean evaluation value $\bar{R}$ as a function of the training steps $T$. Overall, the performance generally improved as $T$ increased across all parameter combinations. When comparing different reward configurations, we observed that policies trained with $\beta = 0.1$, $0.5$, $0.7$, and $0.9$ yielded performance levels comparable to the baseline reward $r_{\mathrm{base}}$. However, the policy trained with $\beta = 0.3$ demonstrated superior performance, outperforming the baseline at $T = 10^6$ steps for all values of $p$ except $p = 0$ (where the simulated fish always react to the agent).

The results also confirmed that higher values of the ignoring probability $p$ generally lead to lower evaluation values $\bar{R}$, reflecting the increased difficulty of the guidance task when the school frequently ignores the virtual agent. Nevertheless, the composite reward function $r_\beta$ with an appropriate hyperparameter successfully facilitated stable policy acquisition even under high-noise conditions ($p \ge 0.6$). Based on these simulation results, we adopted the policy trained with $T = 10^6$, $p = 0.6$, and $\beta = 0.3$ as the autonomous agent controller for all subsequent physical experiments described in Section 2.5.

Figure 5: Learning curves showing the transition of the mean evaluation value $\bar{R}$ across different training steps $T$ and reward weights $\beta$. Each data point represents the average of 10 independent training trials for the corresponding parameter combination. The baseline represents the policy trained using only the horizontal coordinate of the school's centroid ($r_{\mathrm{base}}$). Each plot compares different ignoring probabilities $p$ (0.0, 0.3, 0.6, 0.9) for the simulated fish.

3.2 Experiment A: Optimization of visual parameters

In Experiment A (Phase 1), we systematically evaluated stimulus salience by varying the background color and fish-image size to maximize the responsiveness of the live fish. As described in Section 2.5.2, these trials were performed with a group of three individuals ($N_r = 3$) using a single fixed-formation agent operating in the Global mode.

Table 1 summarizes the guidance performance metrics for each condition. The results for background color indicated that the white background was the most effective in biasing the school's position. The white condition yielded the highest occupancy ratio in the target area (24.23%) and the largest separability between the target and opposite areas. This trend is consistently reflected in the Bhattacharyya distance, where the white background (0.1589) markedly outperformed the black background (0.0089). The positional histograms (Fig. 6a) confirm a clear shift in the school's distribution toward the target direction, particularly during leftward guidance, under the white and gray conditions. However, the distribution remained largely centered in the black condition.

Table 1: Summary of guidance performance in Experiment A. Area occupancy ratios represent the total percentage of time spent by the school in each zone. Larger Bhattacharyya distances indicate higher guidance efficacy.

Parameter        | Condition     | Target area (%) | Opposite area (%) | Bhattacharyya distance
Background color | white         | 24.23           | 9.53              | 0.1589
Background color | gray          | 21.10           | 10.01             | 0.1055
Background color | black         | 8.93            | 7.54              | 0.0089
Fish-image size  | small (0.6×)  | 18.15           | 16.55             | 0.0070
Fish-image size  | medium (1.0×) | 18.63           | 9.27              | 0.0595
Fish-image size  | large (1.5×)  | 22.57           | 7.79              | 0.1616

Regarding fish-image size, the large configuration yielded the most pronounced guidance effect. The large fish-image size (1.5×) achieved a target area occupancy ratio of 22.57%, outperforming the medium (18.63%) and small (18.15%) sizes. Notably, the Bhattacharyya distance for the large size (0.1616) was markedly higher than those for the medium (0.0595) and small (0.0070) sizes. The histograms shown in Fig. 6b further illustrate that the large size induced a more robust and consistent bias in the individuals' horizontal positions.

Figure 6: Positional histograms of the school's centroid for the visual parameters tested in Experiment A. (a) Background color comparison. (b) Fish-image size comparison (small: 0.6×, medium: 1.0×, large: 1.5×). Each plot shows the overlay of leftward and rightward guidance results.

Overall, the results of Experiment A demonstrate that a white background and a large fish-image size maximize behavioral responsiveness in our physical environment. Consequently, these optimal visual parameters were standardized for all subsequent trials in Experiment B.

3.3 Experiment B: Performance of the closed-loop guidance

In Experiment B (Phase 2), we evaluated the guidance performance focusing on the influence of agent configurations and group sizes ($N_r = 5, 8$). Based on the findings from Experiment A, all trials were conducted using a white background and the large fish-image size.

3.3.1 Guidance efficacy and group-size dependence

Table 2 summarizes the performance metrics for Experiment B. For groups of $N_r = 5$ individuals, the fixed-formation baseline (Global mode) achieved the highest target area occupancy (20.71%) and the largest Bhattacharyya distance (0.1197). While the independently controlled agents ($N_v = 2, 3$, Cluster-assignment mode) were intended to improve guidance by following sub-groups, they yielded slightly lower performance (Bhattacharyya distances: 0.0702 for $N_v = 2$ and 0.0998 for $N_v = 3$).

Table 2: Summary of guidance performance in Experiment B. Results for group sizes $N_r = 5$ and $N_r = 8$ are compared across different agent configurations.

N_r | Configuration (Mode)          | Target area (%) | Opposite area (%) | Bhattacharyya distance
5   | Fixed (Global)                | 20.71           | 9.48              | 0.1197
5   | Independent N_v = 2 (Cluster) | 16.57           | 12.39             | 0.0702
5   | Independent N_v = 3 (Cluster) | 19.92           | 10.99             | 0.0998
8   | Fixed (Global)                | 14.15           | 10.97             | 0.0683
8   | Independent N_v = 2 (Cluster) | 13.06           | 11.23             | 0.0406
8   | Independent N_v = 3 (Cluster) | 14.53           | 10.02             | 0.0435

The positional histograms (Fig. 7) reflect these results. At $N_r = 5$ (Fig. 7a), the distributions for all configurations show a visible shift toward the target direction, with only marginal visual differences between the modes. However, as the group size increased to $N_r = 8$ (Fig. 7b), the distributions became markedly more centralized across all agent configurations, and the Bhattacharyya distances dropped significantly. This suggests that the efficacy of external visual stimuli is severely limited as internal social interactions within the school intensify in larger group sizes.

Figure 7: Positional histograms of the school's centroid for Experiment B. The columns compare (a) $N_r = 5$ and (b) $N_r = 8$ conditions. The rows represent the fixed-formation baseline and the independent ($N_v = 2, 3$) configurations. Each plot shows the overlay of leftward and rightward guidance results.

3.3.2 Temporal stability and agent coordination

To clarify the performance gap between the configurations at $N_r = 5$, we analyzed the temporal stability of the school's response across 90-step sub-intervals. Fig. 8 compares the horizontal position distributions for the fixed-formation baseline and the independent configuration ($N_v = 3$). In the fixed-formation baseline (Fig. 8a), the school's distribution was consistently biased toward the target direction across nearly all sub-intervals, indicating highly stable guidance. In contrast, the independent agents ($N_v = 3$) exhibited intermittent failures (Fig. 8b), where the school remained in the central area during specific sub-intervals. This instability suggests that the uncoordinated movements of multiple virtual agents, potentially coupled with frequent switching of their target clusters, may have confused the real fish rather than providing a clear social signal. These results indicate that individual optimization of agent policies is insufficient for the guidance of collective behavior, highlighting the necessity of cooperative multi-agent control.
Figure 8: Spatiotemporal evolution of group centroids ($N_r = 5$). Each row displays four independent trials for (a) the baseline (fixed formation) and (b) the independent agent ($N_v = 3$) configuration. Horizontal and vertical axes denote horizontal coordinates and elapsed sub-intervals, respectively. Target direction is indicated by red triangles and box colors (orange: leftward, blue: rightward).

4 Discussion

4.1 Effectiveness of the multi-objective reward design

A key component of our reinforcement learning framework is the multi-objective formulation of the guidance task, which balances directional guidance and social cohesion. These objectives are integrated into a single scalar reward through a weighted combination, resulting in a composite reward function for policy learning. Simulation results (Fig. 5) show that this formulation enables stable policy acquisition even under stochastic behavioral conditions, and the physical experiments indicate that the learned policy transfers effectively to real-world settings. These findings suggest that appropriately balancing competing objectives is important for achieving robust closed-loop guidance of biological collectives.

4.2 Optimization of visual stimuli for collective guidance

The results of Experiment A showed that a white background and a large fish-image size yielded the most effective conditions for guiding rummy-nose tetras. The effectiveness of the white background likely arises from the high visual contrast it provides, which enhances the saliency of the virtual agents against the experimental environment. In contrast, under the black background condition, the school remained mainly in the intermediate area (the region between the target and opposite areas) (Table 1, Fig. 6a). This suggests that the low ambient brightness of the black environment may have suppressed the fish's general activity or exploratory behavior toward the edges of the tank, thereby diminishing the overall guidance efficacy.

Regarding the influence of fish-image size, the largest fish-image size, corresponding to approximately 1.5 times the size of real individuals, yielded the highest guidance efficacy (Table 1).
While this might suggest that larger-than-real stimuli are more effective, direct size comparisons should be interpreted with caution, as the virtual agents are presented on a screen with a minimum separation of approximately 47 mm from the fish. Nevertheless, the comparison across the three tested sizes (small, medium, and large) revealed a clear performance trend (Fig. 6b), indicating that larger stimuli provide a more salient cue that promotes collective directional changes.

4.3 Group-size dependent limitations of visual guidance

Guidance performance markedly degraded as the group size increased from $N_r = 5$ to $N_r = 8$ (Table 2, Fig. 7). This decline indicates that the influence of external visual signals is not absolute but competes with internal social forces. As group size increases, social interactions such as alignment and attraction to neighbors likely outweigh the visual cues provided by the virtual agents.

Furthermore, sensory competition may contribute to this limitation. While our system provides visual feedback, real fish also rely on the lateral line system to sense hydrodynamic changes [20, 32]. As group size increases and the effective density rises, hydrodynamic cues from nearby individuals become more pronounced, potentially attenuating visual information from the screen. This limitation underscores a fundamental challenge: integrating multi-modal stimuli to maintain influence over larger, more cohesive groups.

4.4 Challenges in multi-agent control and coordination

The independent multi-agent configuration ($N_v = 2, 3$) did not outperform the fixed-formation baseline (Global mode) (Table 2). This difference may be explained by several factors. First, the fixed-formation baseline utilized four fish images controlled as a single unit, whereas the independent configurations used only two or three fish images (Section 2.5.2). This larger number of visual stimuli in the baseline likely increased overall salience, providing a more robust and easily recognizable directional signal.

Beyond stimulus strength, the lack of coordination among agents may have been detrimental. Since each agent's policy was trained in isolation, their uncoordinated movements likely appeared as "social noise" to the school, which may require a coherent signal to maintain collective motion. Additionally, the frequent switching of target clusters in the Cluster-assignment mode likely caused abrupt changes in agent trajectories, potentially confusing the real fish rather than inducing stable following behavior. These results indicate that individual optimization is insufficient; guiding fragmented or heterogeneous groups may require cooperative multi-agent reinforcement learning (MARL) [33], where agents learn a joint policy to provide a unified guidance signal.

4.5 Environmental asymmetry and technical constraints

Although the closed-loop system achieved the desired guidance, we observed a slight positional bias toward the left side of the tank. This asymmetry likely stems from subtle environmental factors within the experimental setup, such as inhomogeneous lighting or feeding habituation.
While these factors do not undermine the overall effectiveness of the system, the cause of this bias should be further investigated and mitigated in future studies to ensure a more balanced and controlled experimental environment. In addition to these environmental factors, technical constraints also remain, such as the 2D nature of the stimuli and the lack of hydrodynamic feedback. Future research will explore the implementation of cooperative MARL and the integration of physical robotic agents to overcome these limitations and extend guidance capabilities to even larger and more complex schools.

5 Conclusion

In this study, we proposed and evaluated a deep reinforcement learning framework for the real-time, closed-loop guidance of fish schools. By employing Proximal Policy Optimization (PPO) and a composite reward function that balances directional guidance with social cohesion, we developed an autonomous controller capable of guiding biological collectives. Our methodology demonstrates a bridge between simulation-based training and real-world application, achieving effective zero-shot transfer to physical experiments with rummy-nose tetras.

Our findings reveal important factors governing the efficacy of artificial social influence. We found that the salience of visual stimuli, specifically background contrast and stimulus size, plays a significant role in maximizing the responsiveness of the fish school. Furthermore, our evaluation across group sizes and agent configurations highlights a fundamental trade-off: while virtual agents can effectively guide smaller groups, their influence is challenged by intensifying intrinsic social interactions and potential sensory competition in larger groups. The superiority of fixed formations over uncoordinated independent agents further emphasizes that coherent collective signals are more effective social stimuli than individually optimized but unaligned behaviors.

This work establishes a scalable foundation for the automated guidance of biological groups. Future research will focus on implementing cooperative multi-agent reinforcement learning (MARL) to facilitate the coordination of agent actions in response to diverse group dynamics, including fragmented sub-groups. Such coordination will be necessary to provide a coherent and robust social signal and, alongside the integration of multi-modal stimuli such as hydrodynamic feedback, to maintain influence in dense and complex biological environments.

Ethics

All animal experiments were conducted with the approval of the Graduate School of Information Science, University of Hyogo (Approval No. UHIS-EC-2024-004).

Conflicts of interest

The authors declare no personal or financial competing interests.

Funding

This study was supported by the JSPS KAKENHI Grant Number JP21H05302.

Acknowledgments

The authors are grateful to Yusuke Nishii for developing the foundational framework [28] upon which this study builds.

References

[1] Craig W. Reynolds. Flocks, herds and schools: A distributed behavioral model. Comput. Graph., 21(4):25–34, 1987. doi: 10.1145/37402.37406.

[2] Ian D. Couzin, Jens Krause, Richard James, Graeme D. Ruxton, and Nigel R. Franks. Collective memory and spatial sorting in animal groups. Journal of Theoretical Biology, 218(1):1–11, 2002. doi: 10.1006/jtbi.2002.3065.
[3] Julia K. Parrish, Steven V. Viscido, and Daniel Grünbaum. Self-organized fish schools: An examination of emergent properties. The Biological Bulletin, 202(3):296–305, 2002. doi: 10.2307/1543482.

[4] M. Ballerini, N. Cabibbo, R. Candelier, A. Cavagna, E. Cisbani, I. Giardina, V. Lecomte, A. Orlandi, G. Parisi, A. Procaccini, M. Viale, and V. Zdravkovic. Interaction ruling animal collective behavior depends on topological rather than metric distance: Evidence from a field study. Proceedings of the National Academy of Sciences, 105(4):1232–1237, 2008. doi: 10.1073/pnas.0711437105.

[5] Andras Czirok and Tamas Vicsek. Collective behavior of interacting self-propelled particles. Physica A: Statistical Mechanics and its Applications, 281(1-4):17–29, 2000. doi: 10.1016/S0378-4371(00)00013-3.

[6] Iain D. Couzin, Jens Krause, Nigel R. Franks, and Simon Levin. Effective leadership and decision-making in animal groups on the move. Nature, 433(7025):513–516, 2005. doi: 10.1038/nature03236.

[7] James E. Herbert-Read, Andrea Perna, Richard P. Mann, Timothy M. Schaerf, David J. T. Sumpter, and Ashley J. W. Ward. Inferring the rules of interaction of shoaling fish. Proceedings of the National Academy of Sciences, 108(46):18726–18731, 2011. doi: 10.1073/pnas.1109355108.

[8] Yael Katz, Kolbjørn Tunstrøm, Christos C. Ioannou, Cristián Huepe, and Iain D. Couzin. Inferring the structure and dynamics of interactions in schooling fish. Proceedings of the National Academy of Sciences, 108(46):18720–18725, 2011. doi: 10.1073/pnas.1107583108.

[9] Kolbjørn Tunstrøm, Yael Katz, Christos C. Ioannou, Cristián Huepe, Matthew J. Lutz, and Iain D. Couzin. Collective states, multistability and transitional behavior in schooling fish. PLoS Computational Biology, 9(2):e1002915, 2013. doi: 10.1371/journal.pcbi.1002915.

[10] Yogo Takada, Yukinobu Nakanishi, Ryosuke Araki, Motohiro Nonogaki, and Tomoyuki Wakisaka. Effect of material and thickness about tail fins on propulsive performance of a small fish robot. Journal of Aero Aqua Bio-mechanisms, 1(1):51–56, 2010. doi: 10.5226/jabmech.1.51.

[11] Jun Shintake, Herbert Shea, and Dario Floreano. Biomimetic underwater robots based on dielectric elastomer actuators. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4957–4962, Daejeon, South Korea, 2016. IEEE. doi: 10.1109/IROS.2016.7759728.

[12] Florian Berlinger, Jeff Dusek, Melvin Gauci, and Radhika Nagpal. Robust maneuverability of a miniature, low-cost underwater robot using multiple fin actuation. IEEE Robotics and Automation Letters, 3(1):140–147, 2018. doi: 10.1109/LRA.2017.2734969.

[13] Takuya Aritani, Naoki Kawasaki, and Yogo Takada. Small robotic fish with two magnetic actuators for autonomous tracking of a goldfish. Journal of Aero Aqua Bio-mechanisms, 8(1):69–74, 2019. doi: 10.5226/jabmech.8.69.

[14] Xingyu Chen, Junzhi Yu, Zhengxing Wu, Yan Meng, and Shihan Kong. Toward a maneuverable miniature robotic fish equipped with a novel magnetic actuator system. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 50(7):2327–2337, 2020. doi: 10.1109/TSMC.2018.2812903.

[15] Florian Berlinger, Melvin Gauci, and Radhika Nagpal. Implicit coordination for 3D underwater collective behaviors in a fish-inspired robot swarm. Science Robotics, 6(50), 2021. doi: 10.1126/scirobotics.abd8668.
[16] Daniel T. Swain, Iain D. Couzin, and Naomi Ehrich Leonard. Real-time feedback-controlled robotic fish for behavioral experiments with fish schools. Proceedings of the IEEE, 100(1):150–163, 2012. doi: 10.1109/JPROC.2011.2165449.

[17] Vladislav Kopman, Jeffrey Laut, Giovanni Polverino, and Maurizio Porfiri. Closed-loop control of zebrafish response using a bioinspired robotic-fish in a preference test. Journal of The Royal Society Interface, 10(78), 2013. doi: 10.1098/rsif.2012.0540.

[18] Frank Bonnet, Alexey Gribovskiy, José Halloy, and Francesco Mondada. Closed-loop interactions between a shoal of zebrafish and a group of robotic fish in a circular corridor. Swarm Intelligence, 12(3):227–244, 2018. doi: 10.1007/s11721-017-0153-6.

[19] Leo Cazenille, Yohann Chemtob, Frank Bonnet, Alexey Gribovskiy, Francesco Mondada, Nicolas Bredeche, and Jose Halloy. How to blend a robot within a group of zebrafish: Achieving social acceptance through real-time calibration of a multi-level behavioural model. Biomimetic and Biohybrid Systems, Lecture Notes in Computer Science, 10928:73–84, 2018. doi: 10.1007/978-3-319-95972-6_9.

[20] Liang Li, Máté Nagy, Jacob M. Graving, Joseph Bak-Coleman, Guangming Xie, and Iain D. Couzin. Vortex phase matching as a strategy for schooling in robots and in fish. Nature Communications, 11(1):5408, 2020. doi: 10.1038/s41467-020-19086-0.

[21] Tomohiro Nakayasu and Eiji Watanabe. Biological motion stimuli are attractive to medaka fish. Animal Cognition, 17(3):559–575, 2014. doi: 10.1007/s10071-013-0687-y.

[22] Hiroaki Kawashima, Yu Kanechika, and Takashi Matsuyama. Camera-display system for the interaction analysis of live fish vs fish-like graphics. The 17th Meeting on Image Recognition and Understanding, 2014.

[23] Bertrand Lemasson, Colby Tanner, Christa Woodley, Tammy Threadgill, Shea Qarqish, and David Smith. Motion cues tune social influence in shoaling fish. Scientific Reports, 8(1):9785, 2018. doi: 10.1038/s41598-018-27807-1.

[24] James Miles, Andrew S. Vowles, and Paul S. Kemp. The role of collective behaviour in fish response to visual cues. Behavioural Processes, 220:105079, 2024. doi: 10.1016/j.beproc.2024.105079.

[25] Liang Li, Mate Nagy, Guy Amichay, Wei Wang, Oliver Deussen, Daniela Rus, and Iain Couzin. Reverse engineering the control law for schooling in zebrafish using virtual reality. Science Robotics, 10(101), 2025. doi: 10.1126/scirobotics.adq6784.

[26] Raj Rajeshwar Malinda, Saeko Takizawa, Akiyuki Koyama, Takayuki Niizato, Hitoshi Habe, and Hiroaki Kawashima. Speed-controlled visual stimuli modulate fish collective dynamics. bioRxiv preprint, 2025. doi: 10.64898/2025.12.05.692523.

[27] Hiroaki Kawashima, Raj Rajeshwar Malinda, and Saeko Takizawa. Modeling and analysis of fish interaction networks under projected visual stimuli. Proceedings of the Joint Symposium of AROB 31st and ISBC 11th (AROB-ISBC 2026), 2026. doi: 10.48550/arXiv.2603.01682.

[28] Yusuke Nishii and Hiroaki Kawashima. Controlling fish schools via reinforcement learning of virtual fish movement. arXiv preprint, 2026. doi: 10.48550/arXiv.2603.16384. (English translation of the bachelor's thesis by Yusuke Nishii originally submitted in 2018.)

[29] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint, 2017. doi: 10.48550/arXiv.1707.06347.
[30] Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. YOLOv10: Real-time end-to-end object detection. arXiv preprint, 2024. doi: 10.48550/arXiv.2405.14458.

[31] Daniel S. Calovi, Alexandra Litchinko, Valentin Lecheval, Ugo Lopez, Alfonso Pérez Escudero, Hugues Chaté, Clément Sire, and Guy Theraulaz. Disentangling and modeling interactions in fish with burst-and-coast swimming reveal distinct alignment and attraction behaviors. PLOS Computational Biology, 14(1):e1005933, 2018. doi: 10.1371/journal.pcbi.1005933.

[32] Hungtang Ko, George Lauder, and Radhika Nagpal. The role of hydrodynamics in collective motions of fish schools and bioinspired underwater robots. Journal of The Royal Society Interface, 20(207):20230357, 2023. doi: 10.1098/rsif.2023.0357.

[33] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative, multi-agent games. arXiv preprint, 2021. doi: 10.48550/arXiv.2103.01955.
