TrackNet: A Deep Learning Network for Tracking High-speed and Tiny Objects in Sports Applications
Authors: Yu-Chuan Huang, I-No Liao, Ching-Hsuan Chen, Tsì-Uí İk*, Wen-Chih Peng
Department of Computer Science, College of Computer Science, National Chiao Tung University
1001 University Road, Hsinchu City 30010, Taiwan
*Email: cwyi@nctu.edu.tw

Abstract—Ball trajectory data are among the most fundamental and useful information in the evaluation of players' performance and the analysis of game strategies. Although vision-based object tracking techniques have been developed to analyze sport competition videos, it is still challenging to recognize and position a high-speed and tiny ball accurately. In this paper, we develop a deep learning network, called TrackNet, to track the tennis ball in broadcast videos in which the ball images are small, blurry, and sometimes have afterimage tracks or are even invisible. The proposed heatmap-based deep learning network is trained not only to recognize the ball image in a single frame but also to learn flying patterns from consecutive frames. TrackNet takes images of size 640 × 360 and generates a detection heatmap from either a single frame or several consecutive frames to position the ball, achieving high precision even on public domain videos. The network is evaluated on the video of the men's singles final at the 2017 Summer Universiade, which is available on YouTube. The precision, recall, and F1-measure of TrackNet reach 99.7%, 97.3%, and 98.5%, respectively. To prevent overfitting, 9 additional videos are partially labeled together with a subset of the previous dataset to implement 10-fold cross validation, and the precision, recall, and F1-measure are 95.3%, 75.7%, and 84.3%, respectively. A conventional image processing algorithm is also implemented for comparison with TrackNet. Our experiments indicate that TrackNet outperforms the conventional method by a large margin and achieves exceptional ball tracking performance. The dataset and demo video are available at https://nol.cs.nctu.edu.tw/ndo3je6av9/.

Index Terms—Deep learning, neural networks, tiny object tracking, heatmap, tennis, badminton

I. INTRODUCTION

Video, regarded as the log of visual sensors, contains a large amount of information. Information extraction from videos has become a hot research topic in the areas of image processing and deep learning. In sports analysis and athlete training applications, videos are helpful for post-game review and tactical analysis. In professional sports, high-end cameras have been used to record high-resolution, high-frame-rate videos that are combined with image processing for referee assistance or data collection. However, this solution requires enormous resources and is not affordable for individuals or amateurs. Developing a low-cost solution for data acquisition from broadcast videos would be significant for massive sports data collection.

Ball trajectory data are among the most fundamental and useful information for game analysis. However, for sports such as tennis, badminton, and baseball, the ball is not only small but may also fly as fast as several hundred kilometers per hour, resulting in tiny and blurry ball images. This makes ball tracking more challenging than in other sports.
In this paper, we design a heatmap-based deep learning network, called TrackNet, to precisely position the tennis or badminton ball in broadcast videos or in videos recorded by consumer devices such as smartphones. TrackNet overcomes the issues of blurry and remnant images and can even detect an occluded ball by learning its trajectory patterns. The proposed network can be applied to other ball-based sports and help both amateurs and professional teams collect data on a moderate budget.

Conventional image recognition is usually based on the object's appearance features such as shape, color, and size, or on statistical features such as HOG and SIFT. Due to the relatively long shutter time of consumer or prosumer cameras, images of high-speed objects are prone to afterimage and blur issues, resulting in poor recognition accuracy. The performance of ball tracking can be improved by pairing candidates from frame to frame according to trajectory models to find the most probable one [1]. In addition, a classical technique in image processing for improving image quality is to fuse multiple low-quality images. Based on these observations, instead of using rule-based techniques, we propose to adopt a deep learning network that recognizes the shape of the ball and learns trajectory patterns from multiple consecutive frames to solve the mentioned issues.

Object classification and detection are two of the earliest studies in deep learning. VGG-16 [2] is one of the most popular networks for feature map encoding. To detect and classify multiple objects in an image, the R-CNN family [3] [4] [5] structurally examines the picture in two stages. It first selects many areas that may contain interesting objects, called Regions of Interest (RoIs), and then applies object detection and classification techniques on these regions. However, its performance cannot fulfill the needs of real-time applications. To speed up, the YOLO family [6] develops a one-stage end-to-end approach that detects objects in a limited search space, significantly reducing the computing time. The streamlined version, Tiny YOLO, can even run on a Raspberry Pi. Compared to block-based algorithms, Fully Convolutional Networks (FCN) perform pixel-wise classification. To compensate for the size reduction of the feature map during the encoding process, upsampling and DeconvNet [7] are often used to decode the feature map, generating a data array of the original size.

In this paper, a deep learning network, called TrackNet, is proposed to realize precise trajectory tracking. Firstly, VGG-16 is adopted to generate the feature map. Different from other deep learning networks, TrackNet can take multiple consecutive frames as input. In this way, TrackNet learns not only the features of the ball but also the characteristics of ball trajectories, enhancing its capability of object recognition and positioning. Since images are downsampled and encoded by pooling layers, the network follows the upsampling mechanism of FCN to generate a heatmap for object detection. Finally, the position of the target object is calculated from the heatmap generated by the deep learning network. To meet the characteristics of tennis and badminton games, our calculation and evaluation are based on the assumption that there is at most one ball on the court.
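As a concrete illustration of the consecutive-frame input described above, the following sketch stacks three consecutive frames along the channel axis to form one network input. The frame file names, the resizing step, and the channel-wise concatenation are illustrative assumptions rather than a verbatim description of the original implementation.

```python
import cv2
import numpy as np

def make_input(frame_paths, width=640, height=360):
    """Stack consecutive frames into one multi-channel input array.

    frame_paths lists the frames in time order; the last one is the frame
    whose ball position the network is asked to predict.
    """
    channels = []
    for path in frame_paths:
        img = cv2.imread(path)                  # BGR image, shape (H, W, 3)
        img = cv2.resize(img, (width, height))  # match TrackNet's 640 x 360 input
        channels.append(img.astype(np.float32))
    # Three frames concatenated along the channel axis -> (360, 640, 9)
    return np.concatenate(channels, axis=2)

# Example: the three-frame model sees frames t-2, t-1, and t at once.
x = make_input(["0021.jpg", "0022.jpg", "0023.jpg"])
print(x.shape)  # (360, 640, 9)
```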
To evaluate the proposed network, we have labeled 20,844 frames from the broadcast of the men's singles final at the 2017 Summer Universiade. To assess the performance of the proposed consecutive-input-frames technique, both single-frame and multiple-frame versions of TrackNet are implemented. Along with the conventional image recognition algorithm [1], a comprehensive comparison among the different models is performed. Experiments indicate that the proposed TrackNet outperforms the conventional image recognition algorithm and effectively locates a fast-moving tennis ball in broadcast sport competition videos. Moreover, to prevent the notorious overfitting issue that happens frequently in deep learning solutions, additional data from 9 tennis games on different courts are added to the training dataset, including grass courts, red clay courts, hard courts, etc. Additionally, to explore the model's extensibility, badminton tracking by TrackNet is evaluated. We have labeled 18,242 frames from the video of the 2018 Indonesia Open final, TAI Tzu Ying vs. CHEN YuFei. Although a badminton shuttlecock travels much faster than a tennis ball, our experimental results exhibit decent performance.

The critical contribution of TrackNet comes from its capability of precisely tracking fast-moving and tiny objects by learning the dynamic behavior of the trajectory. In the tennis tracking application, 10-fold cross validation results in an outstanding performance of 95.3% precision, 75.7% recall, and 84.3% F1-measure. Such capability shows great potential in expanding the variety of computer vision applications.

The rest of the paper is organized as follows. Section II introduces the relevant research and the convolutional neural network. Section III introduces the datasets used in this paper. Section IV elaborates the proposed deep learning network and the Gaussian heatmap technique. Section V provides experimental results and performance evaluation. Finally, Section VI concludes this paper.

II. RELATED WORKS

In recent years, the analysis of player performance and game tactics based on the trajectory data of balls and players has received more and more attention [8] [9] [10] [11]. Many tracking algorithms and systems have been developed to compute and collect trajectory data. Current commercial solutions mainly rely on high-resolution and high-frame-rate video, resulting in high hardware investment. For example, the Hawk-Eye system [12] has been extensively used in professional competitions to calculate ball trajectories and assist the referee in clarifying controversial calls through 3D visual depictions. Nonetheless, the system has to deploy high-end cameras with dedicated operators at selected locations and angles. The expense is too high for non-professional teams.

Positioning the ball in sports competition videos has been studied for years. However, since the ball size is relatively small, it is easily confused with objects having similar color or shape, causing false positives. Furthermore, due to the high moving speed of the ball, the resulting image is usually blurry, inducing false negatives. By exploring the trajectory pattern from consecutive frames, ball positioning can be effectively improved. In addition, the flight trajectory itself possesses important information and is the subject of many studies [13].
For instance, combining multiple cameras with 3D technology for tennis detection [14], tracking the tennis ball with a particle filter in low-quality films [15], and adopting a two-layer data association approach to calculate the most likely ball trajectory from the results of failure detection in frame-by-frame image processing [16] are enlightening studies.

The success of deep learning techniques in image classification [2] [17] encourages more researchers to adopt these methods to solve various problems such as object detection and interception [5] [6] [18], computer games, network security, activity recognition [19] [20], text and image semantic analysis, and smart stores. The infrastructure of a deep learning network is a structured and huge convolutional neural network trained with a large amount of labeled data. The most common operations in CNNs include convolution, rectification, pooling/downsampling, and deconvolution/upsampling. A softmax layer is usually used as the output layer. For example, the widely used VGG-16 [2] mainly consists of convolutional, maximum pooling, and ReLU layers. Conceptually, front-end layers learn to identify simple geometric features, and back-end layers are trained to identify object features.

In CNNs, each layer is a $W \times H \times D$ data array, where $W$, $H$, and $D$ denote the width, height, and depth of the data array, respectively. The convolution operation applies a filter with a kernel of size $w \times h \times D$ across the $W \times H$ range, with the stride parameter $s$ set to 1 in many applications. To avoid information loss near the boundary or to maintain the size of the output data array, columns and rows of the data array can be padded with zeros by setting the padding parameter $p$. Figure 1 depicts the relevant parameters of the convolution operation. Let $W_0$ and $H_0$ denote the width and height of the next layer. Then

$$W_0 = \frac{W + 2p - w}{s} + 1 \quad \text{and} \quad H_0 = \frac{H + 2p - h}{s} + 1.$$

Fig. 1. Convolution operation in deep learning networks.

Since the convolution operation is linear and cannot effectively capture nonlinear behaviors, an activation function called a rectifier is introduced. The Rectified Linear Unit (ReLU) is the most commonly used activation function in deep learning models: if the input value is negative, the function returns 0; otherwise, it returns the input value. ReLU can be expressed as $f(x) = \max(0, x)$.
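The output-size formula and the ReLU definition above can be checked with a few lines of code; the helper names below are ours, chosen only for illustration.

```python
def conv_output_size(W, H, w, h, p, s):
    """Width and height of the next layer: W0 = (W + 2p - w) / s + 1."""
    W0 = (W + 2 * p - w) // s + 1
    H0 = (H + 2 * p - h) // s + 1
    return W0, H0

def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return max(0.0, x)

# A 3 x 3 kernel with stride 1 and padding 1 preserves a 640 x 360 map.
print(conv_output_size(640, 360, 3, 3, p=1, s=1))  # (640, 360)
print(relu(-2.5), relu(2.5))                       # 0.0 2.5
```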
Maximum pooling provides the functionality of downsampling and feature fusion: a block of data is represented only by its largest value, so the data size is reduced after pooling. On the other hand, to achieve pixel-by-pixel classification, upsampling is necessary to reconstruct an output of the same size as the original image [21] [22]. In upsampling, samples are duplicated to expand the data size. Batch normalization is a widely used technique to speed up the training process; each $W \times H$ data array is independently standardized into a normal distribution.

Backward propagation is commonly used in training neural networks to learn the filter coefficients. Firstly, forward propagation is performed to obtain a preliminary prediction. Then, the prediction is compared with the ground truth and a loss function is evaluated. Finally, the weights of the model, i.e., the filter coefficients, are updated according to the loss by the gradient descent method. The chain rule is adopted to calculate the gradient of the loss function layer by layer. The process is repeated until a certain number of iterations is reached or the loss falls below an acceptable threshold. The design of the loss function is an important factor that affects the training efficiency and the performance of the network. Commonly used loss functions include Root Mean Square Error (RMSE) and cross-entropy.

In this paper, we propose a deep learning network named TrackNet to detect tennis balls and badminton shuttlecocks in broadcast sport competition videos. By training with consecutive input frames, TrackNet can not only recognize the ball but also learn its trajectory pattern. A heatmap, which is ideally a Gaussian distribution centered on the ball image, is then generated by TrackNet to indicate the position of the ball. The idea of exploiting heatmaps for object detection has been adopted in many studies [23] [24].

To compare with and evaluate the performance of TrackNet, we implement Archana's algorithm [1], which uses conventional image processing techniques to detect the tennis ball. Archana's algorithm first smooths the image of each frame with a median filter to remove noise. After a background model is calculated, background subtraction is performed to obtain the foreground. Then, the difference between frames, combined by a logical AND operation, is examined to identify fast-moving foreground objects. Those objects are compared with the shape, size, and aspect ratio of the tennis ball and selected by applying dilation and erosion to generate candidates. To filter out wrong candidates, in our implementation, a fully-connected neural network is trained to classify candidates into positive and negative categories. The candidate with the highest probability in the positive category is selected, indicating the position of the ball.
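The following is a minimal sketch of the kind of conventional candidate-generation pipeline described above (median filtering, background subtraction, frame differencing, and morphological operations), written with OpenCV primitives. It is not Archana's exact implementation; the background model choice (MOG2), the thresholds, and the size and aspect-ratio limits are assumptions for illustration.

```python
import cv2
import numpy as np

# A background model accumulated over the clip; MOG2 stands in for the
# paper's background model, which may differ.
bg = cv2.createBackgroundSubtractorMOG2(history=120, detectShadows=False)

def ball_candidates(prev_frame, frame, min_area=2, max_area=400):
    """Return bounding boxes of small, fast-moving foreground blobs."""
    blur_prev = cv2.medianBlur(prev_frame, 5)          # noise removal
    blur_curr = cv2.medianBlur(frame, 5)
    fg = bg.apply(blur_curr)                           # background subtraction
    diff = cv2.absdiff(blur_curr, blur_prev)           # frame difference
    diff = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, diff = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    moving = cv2.bitwise_and(fg, diff)                 # fast-moving foreground
    kernel = np.ones((3, 3), np.uint8)
    moving = cv2.dilate(moving, kernel, iterations=1)  # close small gaps
    moving = cv2.erode(moving, kernel, iterations=1)
    contours, _ = cv2.findContours(moving, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        area, aspect = w * h, w / float(h)
        # Keep blobs whose size and aspect ratio resemble a tennis ball.
        if min_area <= area <= max_area and 0.5 <= aspect <= 2.0:
            boxes.append((x, y, w, h))
    return boxes
```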
III. DATASET

Our first dataset is from the broadcast video of the tennis men's singles final at the 2017 Summer Universiade. The resolution, frame rate, and video length are 1280 × 720, 30 fps, and 75 minutes, respectively. By screening out unrelated frames, 81 game-related clips are segmented, and each of them records a complete play, from the serve to the score. There are 20,844 frames in total. Each frame possesses the following attributes: "Frame Name", "Visibility Class", "X", "Y", and "Trajectory Pattern". Table I shows segments of the label files.

TABLE I
SEGMENTS OF LABEL FILES.

...
0008.jpg, 2, 727, 447, 0
0009.jpg, 1, 735, 457, 0
0010.jpg, 1, 722, 433, 1
0011.jpg, 1, 707, 403, 0
...
0029.jpg, 1, 555, 220, 0
0030.jpg, 1, 550, 218, 2
0031.jpg, 1, 547, 206, 0
...

"Frame Name" is the name of the frame file. "Visibility Class", VC for short, indicates the visibility of the ball in each frame. The possible values are 0, 1, 2, and 3. VC = 0 implies the ball is not within the frame. VC = 1 implies the ball can be easily identified. VC = 2 implies the ball is in the frame but cannot be easily identified. For example, as shown in Figure 2, the ball in 0079.jpg is hardly visible since the color of the tennis ball is similar to the text "Taipei" on the court. However, with the help of the neighboring frames, 0078.jpg and 0080.jpg, the unclear ball position in 0079.jpg can be labeled. Figure 2 (d), (e), and (f) illustrate the labeling results. VC = 3 implies the ball is occluded by other objects. For example, as shown in Figure 3, the ball in 0139.jpg is occluded by the player. Similarly, based on the information from the neighboring frames, 0138.jpg and 0140.jpg, the ball position in 0139.jpg can be estimated. Figure 3 (d), (e), and (f) illustrate the labeling results. In the dataset, the numbers of frames with VC = 0, 1, 2, and 3 are 659, 18035, 2143, and 7, respectively.

Fig. 2. The ball image is hardly visible.

Fig. 3. The ball is occluded by the player.

"X" and "Y" indicate the coordinate of the tennis ball in pixel coordinates. Due to the high moving speed, tennis ball images in the broadcast video may be blurry and may even have an afterimage trace. In such cases, "X" and "Y" are defined as the latest position of the ball's trace. For example, as shown in Figure 4, the ball is flying from Player1 to Player2 with a prolonged trace, and the red dot indicates the labeled coordinate.

Fig. 4. An example of the prolonged tennis trace.

"Trajectory Pattern" indicates the ball movement type and is classified into three categories: flying, hit, and bouncing, labeled by 0, 1, and 2, respectively. Figure 5 is an example of striking a ball. The ball is flying at 0021.jpg and 0022.jpg; at 0023.jpg, the ball is labeled as hit. Figure 6 shows a bouncing case. The ball has not reached the ground at 0007.jpg and 0008.jpg; at 0009.jpg, the ball hits the ground and is labeled as bouncing.

Fig. 5. A hit case: (a) and (b) are labeled as flying, and (c) is labeled as hit.

Fig. 6. A bouncing case: (a) and (b) are labeled as flying, and (c) is labeled as bouncing.
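A label file with the columns shown in Table I can be read as comma-separated values. The parser below is a hypothetical helper for illustration; any layout details beyond the five columns listed above are assumptions.

```python
import csv
from collections import namedtuple

Label = namedtuple("Label", ["frame", "visibility", "x", "y", "trajectory"])

def load_labels(path):
    """Parse a label file whose rows follow Table I:
    frame name, visibility class, X, Y, trajectory pattern."""
    labels = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            name, vc, x, y, tp = [s.strip() for s in row]
            labels.append(Label(name, int(vc), int(x), int(y), int(tp)))
    return labels

# Example row from Table I: "0010.jpg, 1, 722, 433, 1" means the ball is
# clearly visible, centred at (722, 433), and labeled as a hit.
```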
To enrich the variety of the training dataset, an additional 16,118 frames are collected. These frames come from 9 videos recorded at different tennis courts, including grass courts, red clay courts, hard courts, etc. By learning diverse scenarios, the deep learning model is expected to recognize the tennis ball on various courts, which increases the robustness of the model. Further details will be presented in Section V.

In addition to tennis, to explore the versatility of the proposed TrackNet in tracking high-speed and tiny objects, a trial run on a badminton match video is performed. Tracking badminton is more challenging than tracking tennis since the shuttlecock travels much faster than the tennis ball. The fastest serve according to the official records of the Association of Tennis Professionals is John Isner's 253 kilometers per hour at the 2016 Davis Cup. On the other hand, the fastest badminton hit in competition is Lee Chong Wei's 417 kilometers per hour smash at the 2017 Japan Open according to Guinness World Records, over 1.6 times faster than tennis. Besides, in professional competitions, the speed of the shuttlecock frequently exceeds 300 kilometers per hour. The faster the object moves, the more difficult it is to track. Hence, it is expected that the performance will degrade for badminton compared with tennis.

Our badminton dataset comes from a video of the 2018 Indonesia Open final, TAI Tzu Ying vs. CHEN YuFei. The resolution is 1280 × 720 and the frame rate is 30 fps. Similarly, unrelated frames such as commercials or highlight replays are screened out. The resulting total number of frames is 18,242. We label each frame with the following attributes: "Frame Name", "Visibility Class", "X", and "Y".

In the badminton dataset, "Visibility Class" is classified into two categories, VC = 0 and VC = 1. VC = 0 means the ball is not in the frame, and VC = 1 means the ball is in the frame. Unlike our tennis dataset, we do not use the VC = 2 and VC = 3 categories, since the shuttlecock moves so fast that blurry images happen very frequently. Therefore, in the badminton dataset, VC = 1 covers every status of the shuttlecock as long as it is within the frame, whether it is clearly visible or hardly visible.

"X" and "Y" indicate the coordinate of the shuttlecock. Similar to tennis, "X" and "Y" are defined by the latest position of the ball's trace, considering its moving direction, if the image is prolonged. In badminton videos, prolonged traces often occur, and sometimes we can hardly identify the position of the ball. An example of how we label the prolonged images is shown in Figure 7.

Fig. 7. An example of the prolonged badminton trace.

IV. TRACKNET

TrackNet is composed of a convolutional neural network (CNN) followed by a deconvolutional neural network (DeconvNet) [7]. It takes consecutive frames to generate a heatmap indicating the position of the object. The number of input frames is a network parameter. One input frame corresponds to a conventional CNN, while TrackNet with more than one input frame can improve moving object detection by learning the trajectory pattern. For the purpose of evaluation, two networks are implemented: one with a single input frame and the other with three consecutive input frames.

TrackNet utilizes a heatmap-based CNN, which has been proved useful in several applications [23] [24]. TrackNet is trained to generate a probability-like detection heatmap having the same resolution as the input frames. The ground truth of the heatmap is an amplified 2D Gaussian distribution located at the center of the tennis ball. The coordinates of the ball are available in the labeled dataset, and the variance of the Gaussian distribution refers to the diameter of tennis ball images. Let $(x_0, y_0)$ be the ball center. The heatmap function is expressed as

$$G(x, y) = \left( \frac{1}{2\pi\sigma^2} e^{-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}} \right) \cdot 2\pi\sigma^2 \cdot 255,$$

where the first part is a Gaussian distribution centered at $(x_0, y_0)$ with variance $\sigma^2$, and the second part scales the value to the range $[0, 255]$. $\sigma^2 = 10$ is used in our implementation since the average ball radius is about 5 pixels, roughly corresponding to the region where $G(x, y) \geq 128$. Figure 8 is a visualized heatmap of a tennis ball.

Fig. 8. An example of the detection heatmap.
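The ground-truth heatmap can be generated directly from the formula above; note that the Gaussian normalization constant cancels with the scaling factor, leaving $255 \cdot e^{-((x - x_0)^2 + (y - y_0)^2)/(2\sigma^2)}$. The sketch below follows that simplification; whether the original implementation floors or rounds the values is an assumption.

```python
import numpy as np

def ground_truth_heatmap(x0, y0, width=640, height=360, variance=10):
    """Amplified 2D Gaussian ground truth:
    G(x, y) = 255 * exp(-((x - x0)^2 + (y - y0)^2) / (2 * variance))."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    g = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * variance))
    return np.floor(g * 255).astype(np.uint8)

# Ball centred at (320, 180); pixels with value >= 128 roughly cover the ball.
heatmap = ground_truth_heatmap(320, 180)
print(heatmap.shape, heatmap.max())  # (360, 640) 255
```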
The implementation details of TrackNet are illustrated in Figure 9 and Table II. The input of the proposed network can be a number of consecutive frames. The first 13 layers follow the design of the first 13 layers of VGG-16 [2] for object classification. Layers 14 to 24 follow DeconvNet [7] for semantic segmentation. To realize pixel-wise prediction, upsampling is applied to recover the information lost in the maximum pooling layers, and symmetric numbers of upsampling layers and maximum pooling layers are implemented.

Fig. 9. The architecture of the proposed TrackNet.

TABLE II
NETWORK PARAMETERS OF TRACKNET.

Layer    Filter Size   Depth   Padding   Stride   Activation
Conv1    3 × 3         64      2         1        ReLU+BN
Conv2    3 × 3         64      2         1        ReLU+BN
Pool1    2 × 2 max pooling, stride = 2
Conv3    3 × 3         128     2         1        ReLU+BN
Conv4    3 × 3         128     2         1        ReLU+BN
Pool2    2 × 2 max pooling, stride = 2
Conv5    3 × 3         256     2         1        ReLU+BN
Conv6    3 × 3         256     2         1        ReLU+BN
Conv7    3 × 3         256     2         1        ReLU+BN
Pool3    2 × 2 max pooling, stride = 2
Conv8    3 × 3         512     2         1        ReLU+BN
Conv9    3 × 3         512     2         1        ReLU+BN
Conv10   3 × 3         512     2         1        ReLU+BN
UpS1     2 × 2 upsampling
Conv11   3 × 3         512     2         1        ReLU+BN
Conv12   3 × 3         512     2         1        ReLU+BN
Conv13   3 × 3         512     2         1        ReLU+BN
UpS2     2 × 2 upsampling
Conv14   3 × 3         128     2         1        ReLU+BN
Conv15   3 × 3         128     2         1        ReLU+BN
UpS3     2 × 2 upsampling
Conv16   3 × 3         64      2         1        ReLU+BN
Conv17   3 × 3         64      2         1        ReLU+BN
Conv18   3 × 3         256     2         1        ReLU+BN
Softmax

The final black-and-white binary detection heatmap is not directly available at the output of the deep learning network. The network outputs a detection heatmap with continuous values within the range $[0, 255]$ for each pixel. Let $L(i, j, k)$ denote the data array with coordinates $(0, 0) \leq (i, j) \leq (639, 359)$ and depth $0 \leq k \leq 255$. The softmax layer calculates the probability distribution over the depth $k$ from the 256 possible grayscale values. Let $P(i, j, k)$ denote the probability of depth $k$ at $(i, j)$. The softmax function is given by

$$P(i, j, k) = \frac{e^{L(i, j, k)}}{\sum_{l = 0}^{255} e^{L(i, j, l)}}.$$

Based on the probability given by the softmax layer at each pixel, the depth $k$ with the highest probability is selected as the heatmap value of the pixel. For each pixel, let

$$h(i, j) = \arg\max_{k} P(i, j, k)$$

denote the softmax layer output at $(i, j)$, indicating the selected grayscale value at $(i, j)$. Once the complete continuous detection heatmap is generated, the coordinate of the ball can be determined in two steps. The first step is to pixel-wisely convert the heatmap into a black-and-white binary heatmap using a threshold $t$: if a pixel has a value larger than or equal to $t$, the pixel is set to 255; otherwise, the pixel is set to 0. Based on the previous discussion regarding the mean radius of a tennis ball, the threshold $t$ is set to 128. The second step is to exploit the Hough Gradient Method [25] to find the circle in the black-and-white binary detection heatmap. If exactly one circle is identified, the centroid of the circle is returned. In all other cases, the heatmap is considered to contain no detected ball.
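Below is a minimal sketch of this two-step post-processing, assuming OpenCV's implementation of the Hough Gradient Method; the HoughCircles parameter values are illustrative assumptions rather than the settings used in the paper.

```python
import cv2

def ball_position(heatmap, threshold=128):
    """Binarize the network's uint8 heatmap and locate the ball with the
    Hough Gradient Method; return (x, y) or None if not exactly one circle."""
    _, binary = cv2.threshold(heatmap, threshold - 1, 255, cv2.THRESH_BINARY)
    circles = cv2.HoughCircles(binary, cv2.HOUGH_GRADIENT, 1, 20,
                               param1=50, param2=2, minRadius=2, maxRadius=7)
    if circles is None or circles.shape[1] != 1:
        return None                     # treated as "no ball detected"
    x, y, _radius = circles[0][0]
    return float(x), float(y)
```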
During the training phase, the cross-entropy function is used to calculate the loss based on $P(i, j, k)$. The corresponding ground truth function, denoted by $Q(i, j, k)$, is given by

$$Q(i, j, k) = \begin{cases} 1, & \text{if } G(i, j) = k; \\ 0, & \text{otherwise.} \end{cases}$$

Let $H_Q(P)$ denote the loss function. Then

$$H_Q(P) = -\sum_{i, j, k} Q(i, j, k) \log P(i, j, k).$$

V. EXPERIMENTS

The experiment setup is as follows. The tennis dataset elaborated in Section III is used to evaluate the performance of Archana's algorithm [1], a conventional image processing technique, and the proposed TrackNet. The dataset contains 20,844 frames and is randomly divided into a training set and a test set: 70% of the frames form the training set and 30% form the test set. To speed up training, all frames are resized from 1280 × 720 to 640 × 360. To optimize the weights of the network, the Adadelta optimizer [26] is applied. Table III summarizes other key parameters.

TABLE III
KEY PARAMETERS USED IN MODEL TRAINING.

Parameters                  Setting
Learning rate               1.0
Batch size                  2
Steps per epoch             200
Epochs                      500
Initial weights             random uniform
Range of initial weights    [−0.05, 0.05]

Among these parameters, the number of epochs is one of the most critical factors in model training. Underfitting happens if it is too small, while overfitting happens if it is too large. For TrackNet, the loss versus the number of epochs is shown in Figure 10. Based on this curve, we select 500 epochs to prevent both underfitting and overfitting.

Fig. 10. The loss curve of TrackNet model training.

To compare the performance of TrackNet with one input frame and with three consecutive input frames, two versions of TrackNet are implemented. For convenience, the TrackNet that takes one input frame is named Model I, and the TrackNet that takes three consecutive input frames is named Model II. For Model II, three consecutive frames are used to detect the ball coordinate in the last frame. During the training phase, three consecutive frames are considered a training sequence if the last frame belongs to the training set. Likewise, three consecutive frames are considered a test sequence if the last frame belongs to the test set. Note that the TrackNet framework is scalable: any number of consecutive input frames is allowed.

To define a proper specification for the prediction error, the size of the tennis ball is investigated. The diameter of tennis ball images in the video ranges from 2 to 12 pixels, and the mean diameter is around 5 pixels. Since a prediction error within one ball size does not mislead trajectory identification, we define the positioning error (PE) specification as 5 pixels to indicate whether a ball is accurately detected; detections with PE larger than 5 pixels are false predictions. PE is defined as the Euclidean distance between the model prediction and the ground truth. Figure 11 shows the PE distribution of the TrackNet models. The x-axis represents PE in pixels and the y-axis is the percentage of occurrence. x = 0 stands for perfect detection, x = 1 means PE lies in 0 < PE ≤ 1, x = 2 means PE lies in 1 < PE ≤ 2, and so on. Note that the occurrence percentages of PE > 5 for Model I and Model II are 4.3% and 0.1%, respectively. That is, 95.7% and 99.9% of the detections of Model I and Model II fulfill the specification.

Fig. 11. The distribution of the positioning error.
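The positioning error and the Figure 11 style binning can be expressed compactly as follows; the helper names and the input format (pairs of predicted and labeled coordinates) are assumptions for illustration.

```python
import math
from collections import Counter

def positioning_error(pred, truth):
    """Euclidean distance between the predicted and labeled ball centres."""
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1])

def pe_histogram(pairs, spec=5.0):
    """Bucket PEs as in Figure 11: bin 0 is a perfect hit, bin n covers
    n-1 < PE <= n, and anything above `spec` counts as a false prediction."""
    bins = Counter()
    for pred, truth in pairs:
        pe = positioning_error(pred, truth)
        bins[0 if pe == 0 else math.ceil(pe)] += 1
    misses = sum(v for k, v in bins.items() if k > spec)
    return bins, misses

# e.g. a prediction 3 pixels away from its label falls in bin 3 and still
# satisfies the 5-pixel PE specification used for tennis.
```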
Archana's algorithm [1], an image processing technique developed by Archana and Geetha, is implemented for comparison. The prediction details of Archana's algorithm, TrackNet Model I, and TrackNet Model II are shown in Table IV, where TP, FP, TN, and FN stand for true positive, false positive, true negative, and false negative, respectively. The numbers are grouped by "Visibility Class" (VC). A false positive for VC1, VC2, and VC3 stands for a prediction with PE larger than 5 pixels. A false negative for VC1, VC2, and VC3 means that no ball, or more than one ball, is detected when there is actually one ball in the frame. Note that since TrackNet Model I and TrackNet Model II utilize different numbers of input frames, their training and test set sizes differ. Archana's algorithm and TrackNet Model II' follow the same training set and test set as TrackNet Model II.

TABLE IV
PERFORMANCE SUMMARY.

Archana's [1]
         VC0    VC1    VC2    VC3
TP       -      4046   418    0
FP       201    334    29     1
TN       9      -      -      -
FN       -      947    214    6
Total    210    5327   661    7

TrackNet Model I
         VC0    VC1    VC2    VC3
TP       -      4933   497    0
FP       1      221    20     0
TN       195    -      -      -
FN       -      241    139    7
Total    196    5395   656    7

TrackNet Model II
         VC0    VC1    VC2    VC3
TP       -      5223   565    2
FP       0      3      3      0
TN       210    -      -      -
FN       -      101    93     5
Total    210    5327   661    7

TrackNet Model II'
         VC0    VC1    VC2    VC3
TP       -      5234   598    1
FP       4      6      7      2
TN       206    -      -      -
FN       -      87     56     4
Total    210    5327   661    7

It is observed that, compared to Archana's algorithm, both TrackNet Model I and TrackNet Model II significantly reduce false positives and false negatives, resulting in an increase in both true positives and true negatives. The comparison demonstrates the exceptional object detection capability of deep learning networks over conventional image processing algorithms. In addition, TrackNet Model II performs even better than TrackNet Model I, showing that training TrackNet with consecutive input frames further improves its dynamic object tracking ability, especially for small objects. Moreover, TrackNet Model II even correctly positions occluded balls occasionally: 2 out of 7 occluded balls are precisely detected. This directly exhibits that consecutive frames provide critical information for the network to learn trajectory patterns of the object of interest. By extracting information from neighboring frames, TrackNet Model II enhances its tracking precision not only on clearly visible objects but also on blurry or occluded ones.

The overall performance in terms of precision, recall, and F1-measure is summarized in Table V. The three metrics are defined by

$$\text{Precision} = \frac{\#\text{ of True Positives}}{\#\text{ of True Positives} + \#\text{ of False Positives}},$$

$$\text{Recall} = \frac{\#\text{ of True Positives}}{\#\text{ of VC1} + \text{VC2} + \text{VC3}},$$

$$\text{F1-measure} = \frac{2\,(\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}.$$

TABLE V
ACCURACY METRICS OF DIFFERENT MODELS.

Model                 Precision   Recall   F1-measure
Archana's [1]         92.5%       74.5%    82.5%
TrackNet Model I      95.7%       89.6%    92.5%
TrackNet Model II     99.8%       96.6%    98.2%
TrackNet Model II'    99.7%       97.3%    98.5%

Archana's algorithm reaches 92.5% precision, 74.5% recall, and 82.5% F1-measure. With the help of a powerful deep learning network, TrackNet Model I outperforms Archana's algorithm and reaches 95.7% precision, 89.6% recall, and 92.5% F1-measure. By learning how to extract trajectory information from neighboring frames, TrackNet Model II further improves the performance and achieves 99.8% precision, 96.6% recall, and 98.2% F1-measure.

To prevent the overfitting issue that frequently happens in deep learning solutions, another 16,118 frames are added to the training set. These frames are collected from an additional 9 videos recorded at different tennis courts, including grass courts, red clay courts, hard courts, etc. The model trained with the enriched training set is named TrackNet Model II'. TrackNet Model II' follows the same training logic as TrackNet Model II, the only difference being the variety of the training set. The prediction details are shown in Table IV. As expected, the performance of TrackNet Model II' on the same test set is similar to TrackNet Model II, as shown in Table V: TrackNet Model II' achieves 99.7% precision, 97.3% recall, and 98.5% F1-measure. Furthermore, 10-fold cross validation is adopted for TrackNet Model II' for a safer and more comprehensive analysis. TrackNet Model II' with 10-fold cross validation reaches 95.3% precision, 75.7% recall, and 84.3% F1-measure.
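Plugging the TrackNet Model II' counts from Table IV into the metric definitions above reproduces, up to rounding, the reported 99.7% precision, 97.3% recall, and 98.5% F1-measure:

```python
def precision_recall_f1(tp, fp, positives):
    """Metrics as defined above: recall divides by the number of frames
    that actually contain a ball (VC1 + VC2 + VC3), not by TP + FN."""
    precision = tp / (tp + fp)
    recall = tp / positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# TrackNet Model II' counts from Table IV:
tp = 5234 + 598 + 1          # true positives over VC1-VC3
fp = 4 + 6 + 7 + 2           # false positives over VC0-VC3
positives = 5327 + 661 + 7   # frames labeled VC1, VC2, or VC3
print(precision_recall_f1(tp, fp, positives))
# ~(0.997, 0.973, 0.985)
```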
In addition to tennis, we also apply the proposed TrackNet to the badminton dataset introduced in Section III. The badminton dataset contains 18,242 frames with a resolution of 1280 × 720. Similarly, all frames are resized from 1280 × 720 to 640 × 360 to speed up the training process. The dataset is randomly divided into a training set and a test set: 70% of the frames form the training set and 30% form the test set. For the badminton dataset, the model training parameters, including learning rate, batch size, number of epochs, etc., are set to the same values used in the training of the tennis dataset, as shown in Table III.

Before evaluating TrackNet on badminton, the specification of a correct detection is defined by analyzing the dimensions of shuttlecock images in the video. Unlike a tennis ball, a shuttlecock is not spherical, resulting in a larger size variation. We define the diameter of a shuttlecock image as the average of its largest length and width. Two extreme cases exist: one when the shuttlecock moves toward the camera at the backcourt, and the other when it moves laterally at the frontcourt. In our dataset, such cases result in a large variation of image diameter, ranging from 3 to 24 pixels. Since the mean diameter is around 7.5 pixels, we define the PE specification as 7.5 pixels to indicate whether a shuttlecock is accurately detected; detections with PE larger than 7.5 pixels are incorrect predictions. Compared with tennis, which has a PE specification of 5 pixels, the badminton PE specification is relaxed. The main reason is that shuttlecock images are larger than tennis ball images in the video: since the badminton court is smaller than the tennis court, the camera uses a smaller focal length to capture the entire court, resulting in larger images of the ball and the players.

To evaluate the badminton tracking ability of TrackNet, we adopt the transfer learning idea and directly apply the TrackNet model trained on the tennis dataset to badminton trajectory recognition. We name this transfer learning model TrackNet-Tennis; it is trained on the tennis dataset using three consecutive input frames. As shown in Table VI, for badminton tracking, TrackNet-Tennis only achieves precision, recall, and F1-measure of 75.8%, 22.9%, and 35.2%, respectively. Although the precision seems acceptable, the recall is too poor to be useful. Such a low recall is due to a large number of false negatives, implying that the shuttlecock cannot be recognized in many circumstances. The main reason for this poor performance lies in the fundamental differences between tennis and badminton, including velocity, trajectories, shape, etc. To verify the feasibility of the TrackNet framework for badminton tracking, we train another model, named TrackNet-Badminton, on the badminton dataset using three consecutive input frames. As shown in Table VI, TrackNet-Badminton reaches precision, recall, and F1-measure of 85.0%, 57.7%, and 68.7%, respectively. As expected, TrackNet-Badminton is able to learn the features of badminton, leading to a significant performance improvement.

TABLE VI
ACCURACY ANALYSIS OF BADMINTON TRACKING.

Model                Precision   Recall   F1-measure
TrackNet-Tennis      75.8%       22.9%    35.2%
TrackNet-Badminton   85.0%       57.7%    68.7%

Furthermore, when we compare tennis and badminton tracking performance using the TrackNet framework, it can be observed that tennis tracking outperforms badminton tracking by a noticeable margin.
This is because the shuttlecock travels much faster than the tennis ball, resulting in many more unclear object images in badminton videos. As elaborated in Section III, the fastest recorded badminton smash moves at 417 kilometers per hour, while the fastest recorded tennis serve moves at 253 kilometers per hour. Such an enormous increase in velocity causes performance degradation, especially in recall, due to high false negatives. The high traveling speed makes the shuttlecock move across a long distance within only a few frames, and dynamic trajectories at such high speed become hard for the model to recognize. In addition to the absolute speed, badminton possesses a much higher variation in traveling speed than tennis. For example, in badminton, a drop stroke and a smash stroke have significantly different velocities. Such extreme scenarios commonly happen during a badminton competition, making it hard for the model to fit both scenarios perfectly. Nonetheless, although the performance in tracking badminton is not as phenomenal as in tennis, achieving a precision of 85.0% is accurate enough to correctly depict all trajectories in the game. Future research on improving TrackNet in identifying trajectories of extremely fast objects and learning the distinct patterns caused by significant speed variation will be conducted.

VI. CONCLUSION

In this paper, we proposed TrackNet, a heatmap-based deep learning network comprising both convolutional and deconvolutional neural networks. TrackNet is able to precisely position the coordinates of high-speed and tiny objects such as tennis balls and badminton shuttlecocks. With TrackNet, accurate predictions can be achieved on broadcast sports videos without high frame rate and high resolution, significantly reducing the cost of recording and processing high-specification videos. To enhance TrackNet's capability of identifying trajectory patterns of fast-moving objects, we designed a scalable input that allows feeding TrackNet with multiple consecutive input frames. By evaluating both a conventional image processing algorithm and the proposed TrackNet on a real tennis video dataset, we demonstrated that TrackNet achieves an explainable and exceptional prediction performance by adopting the consecutive-input-frames concept in the deep neural network. Moreover, for even faster objects such as badminton, TrackNet achieves a decent tracking capability according to our experimental results, exhibiting promising extensibility to related applications.

ACKNOWLEDGEMENT

This work of T.-U. İk was supported in part by the Ministry of Science and Technology, Taiwan under grants MOST 107-2627-H-009-001 and MOST 105-2221-E-009-102-MY3. This work was financially supported by the Center for Open Intelligent Connectivity from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE), Taiwan.

REFERENCES

[1] M. Archana and M. K. Geetha, "Object detection and tracking based on trajectory in broadcast tennis video," Procedia Computer Science, vol. 58, pp. 225-232, 2015.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 23-28 June 2014, pp. 580-587.
[4] R. Girshick, "Fast R-CNN," in International Conference on Computer Vision (ICCV 2015), 11-18 December 2015, pp. 1440-1448.
[5] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[7] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520-1528.
[8] H.-T. Chen, W.-J. Tsai, S.-Y. Lee, and J.-Y. Yu, "Ball tracking and 3D trajectory approximation with applications to tactics analysis from single-camera volleyball sequences," Multimedia Tools and Applications, vol. 60, no. 3, pp. 641-667, October 2012.
[9] X. Wang, V. Ablavsky, H. B. Shitrit, and P. Fua, "Take your eyes off the ball: Improving ball-tracking by focusing on team play," Computer Vision and Image Understanding, vol. 119, pp. 102-115, February 2014.
[10] T.-S. Fu, H.-T. Chen, C.-L. Chou, W.-J. Tsai, and S.-Y. Lee, "Screen-strategy analysis in broadcast basketball video using player tracking," in Proceedings of the 2011 IEEE Visual Communications and Image Processing (VCIP), 6-9 November 2011.
[11] H. Myint, P. Wong, L. Dooley, and A. Hopgood, "Tracking a table tennis ball for umpiring purposes," in Proceedings of the 14th IAPR International Conference on Machine Vision Applications (MVA 2015), 18-22 May 2015, pp. 170-173.
[12] "Hawk-Eye," https://en.wikipedia.org/wiki/Hawk-Eye.
[13] X. Yu, C.-H. Sim, J. R. Wang, and L. F. Cheong, "A trajectory-based ball detection and tracking algorithm in broadcast tennis video," in 2004 International Conference on Image Processing (ICIP 2004), vol. 2. Singapore: IEEE, 24-27 October 2004, pp. 1049-1052.
[14] V. Renò, N. Mosca, M. Nitti, C. Guaragnella, T. D'Orazio, and E. Stella, "Real-time tracking of a tennis ball by combining 3D data and domain knowledge," in Technology and Innovation in Sports, Health and Wellbeing (TISHW), International Conference on. IEEE, 2016, pp. 1-7.
[15] F. Yan, W. Christmas, and J. Kittler, "A tennis ball tracking algorithm for automatic annotation of tennis match," in Proceedings of the British Machine Vision Conference (BMVC 2005), vol. 2. Durham, England: BMVA, 5-8 September 2005, pp. 619-628.
[16] X. Zhou, L. Xie, Q. Huang, S. J. Cox, and Y. Zhang, "Tennis ball tracking using a two-layered data association approach," IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 145-156, 2015.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[18] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241.
[19] W. Jiang and Z. Yin, "Human activity recognition using wearable sensors by deep convolutional neural networks," in Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015, pp. 1307-1310.
[20] Y. Chen and Y. Xue, "A deep learning approach to human activity recognition based on single accelerometer," in 2015 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2015, pp. 1488-1492.
[21] V. Badrinarayanan, A. Handa, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling," arXiv preprint arXiv:1505.07293, 2015.
[22] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.
[23] V. Belagiannis and A. Zisserman, "Recurrent human pose estimation," in 2017 12th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017). IEEE, 2017, pp. 468-475.
[24] T. Pfister, J. Charles, and A. Zisserman, "Flowing ConvNets for human pose estimation in videos," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1913-1921.
[25] "Hough gradient method," https://goo.gl/gZTQRm.
[26] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012. [Online]. Available: http://arxiv.org/abs/1212.5701